Compare commits


5678 Commits

Author SHA1 Message Date
56b43f4fec Perform appropriate CUDA stream synchronization in distributed autograd. (#53929) (#54358)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53929

The local autograd engine performs appropriate stream synchronization
between autograd nodes in the graph to ensure a consumer's stream is
synchronized with the producer's stream before executing the consumer.

However, in the case of distributed autograd, the SendRpcBackward function receives
gradients over the wire, and TensorPipe uses its own pool of streams for this
purpose. As a result, the tensors are received on TensorPipe's stream pool, but
SendRpcBackward runs on a different stream during the backward pass, and there
is no logic to synchronize these streams.

To fix this, I've enhanced DistEngine to synchronize these streams
appropriately when it receives grads over the wire.
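
For illustration, a minimal sketch of the producer/consumer stream synchronization described above (hypothetical tensors and streams, not the actual DistEngine code; requires a CUDA device):

```python
import torch

# Producer stream: e.g. the stream on which gradients arrived.
producer = torch.cuda.Stream()
# Consumer stream: the stream the autograd node runs on.
consumer = torch.cuda.Stream()

with torch.cuda.stream(producer):
    grad = torch.randn(1024, device="cuda")  # tensor "received" on the producer stream

# Record an event on the producer stream and make the consumer wait on it,
# so the consumer never reads `grad` before the producer finishes writing it.
event = torch.cuda.Event()
event.record(producer)
consumer.wait_event(event)

with torch.cuda.stream(consumer):
    out = grad * 2  # safe: consumer is ordered after the producer's work
```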
ghstack-source-id: 124055277

(Note: this ignores all push blocking failures!)

Test Plan:
1) Added unit test which reproduced the issue.
2) waitforbuildbot.

Reviewed By: walterddr, wanchaol

Differential Revision: D27025307

fbshipit-source-id: 2944854e688e001cb3989d2741727b30d9278414

Co-authored-by: Pritam Damania <pritam.damania@fb.com>
2021-03-23 19:28:21 -07:00
6c394614f0 [CI] Install compatible cmath for Win builds (#54556)
* [CI]Install older cmath during Windows build (#54431)

Summary:
Based on peterjc123's analysis, `cmath` after 26bbe2ad50 (diff-3fa97ceb95d524432661f01d4b34509c6d261a2f7f45ddcf26f79f55b3eec88a) causes a lot of CUDA code to fail to compile with:
```
error: calling a __host__ function("__copysignf") from a __host__ __device__ function("c10::guts::detail::apply_impl< ::at::native::AUnaryFunctor< ::>  &,     ::std::tuple<float >  &, (unsigned long long)0ull > ") is not allowed
```
Workaround for https://github.com/pytorch/pytorch/issues/54382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54431

Reviewed By: anjali411

Differential Revision: D27234299

Pulled By: malfet

fbshipit-source-id: b3f1fef941341222cc10cb27346fcf4a1d522a0c

* [CI] Install compatible cmath for Win binary builds (#54527)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54527

Reviewed By: walterddr

Differential Revision: D27269528

Pulled By: malfet

fbshipit-source-id: 4afdc706598f3a6ad296468dfb77a70433ae7d0f
2021-03-23 19:02:01 -07:00
7c3c293ea7 [1.8] Don't build TensorPipe CMA backend on old glibc versions (#54491)
Some users who are building from source on old glibc versions are hitting an issue where TensorPipe uses the process_vm_readv syscall, which is not wrapped by glibc. This PR checks that condition in CMake and disables that backend in such cases.

This should have no effect on PyTorch's official builds; it should just help people who are building from source.
2021-03-23 15:56:26 -07:00
9d43171746 [1.8.1] Replace thrust with cub in randperm (#54537)
Summary:
Benchmark of
```python
%timeit torch.randperm(100000, device='cuda'); torch.cuda.synchronize()
```
thrust:
```
5.76 ms ± 42.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
cub:
```
3.02 ms ± 32.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

The sync in the thrust sort is removed.

Warning:
Thrust supports 64-bit indexing but cub doesn't, so this is a functional regression. However, `torch.randperm(2**31, device='cuda')` fails with OOM on a 40GB A100, and `torch.randperm(2**32, device='cuda')` fails with OOM on an 80GB A100, so I think this functional regression has low impact and is acceptable.
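
For reference, a plain-Python version of the benchmark above (the `%timeit` form is IPython-only); this is a sketch assuming a CUDA device, not part of the PR:

```python
import torch

def time_randperm(n, iters=100):
    # Warm up so allocator and one-time costs don't skew the measurement.
    torch.randperm(n, device="cuda")
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.randperm(n, device="cuda")
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

print(time_randperm(100000))
```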

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53841

Reviewed By: albanD

Differential Revision: D26993453

Pulled By: ngimel

fbshipit-source-id: 39dd128559d53dbb01cab1585e5462cb5f3cceca

Co-authored-by: Xiang Gao <qasdfgtyuiop@gmail.com>
2021-03-23 15:45:20 -07:00
f3c950e04e various doc building cleanups (#54141) 2021-03-23 11:23:02 -07:00
b6f49807db third_party: Update kineto to fix libtorch builds (#54205)
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
2021-03-17 13:26:11 -07:00
d84e05be49 [fix] Dimension out of range in pixel_shuffle / pixel_unshuffle (#54178)
Co-authored-by: Joel Benjamin Schlosser <jbschlosser@fb.com>
2021-03-17 12:40:59 -07:00
c6139b7915 Make ideep honor torch.set_num_thread changes (#53871) (#54025)
Summary:
When compiled with OpenMP support, `ideep`'s computational_cache caches the maximum number of OpenMP workers.
This number can be wrong after a `torch.set_num_threads` call, so clear the cache after the call.
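
A minimal sketch of the behavior being fixed (the Conv2d is just an illustrative stand-in for an ideep-backed op):

```python
import torch

torch.set_num_threads(8)
conv = torch.nn.Conv2d(3, 16, 3)
conv(torch.randn(8, 3, 32, 32))  # may populate ideep's computational_cache

torch.set_num_threads(2)
print(torch.get_num_threads())   # 2
# After this fix, the next ideep-backed call should honor the new
# thread count instead of the stale cached maximum.
conv(torch.randn(8, 3, 32, 32))
```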

Fixes https://github.com/pytorch/pytorch/issues/53565

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53871

Reviewed By: albanD

Differential Revision: D27003265

Pulled By: malfet

fbshipit-source-id: 1d84c23070eafb3d444e09590d64f97f99ae9d36
2021-03-16 11:46:19 -07:00
30baaef738 Use int8_t instead of char in [load|store]_scalar` (#52616) (#54022)
Summary:
`char` is not guaranteed to be signed on all platforms (it is unsigned on ARM).
Fixes https://github.com/pytorch/pytorch/issues/52146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52616

Test Plan: Run ` python3 -c "import torch;a=torch.tensor([-1], dtype=torch.int8);print(a.tolist())"` on arm-linux system

Reviewed By: walterddr

Differential Revision: D26586678

Pulled By: malfet

fbshipit-source-id: 91972189b54f86add516ffb96d579acb0bc13311
2021-03-16 11:45:50 -07:00
264d0ecf83 [nn] nn.Embedding : padding_idx doc update (#53809) (#54026)
Summary:
Follow-up of https://github.com/pytorch/pytorch/pull/53447

Reference: https://github.com/pytorch/pytorch/pull/53447#discussion_r590521051

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53809

Reviewed By: bdhirsh

Differential Revision: D27049643

Pulled By: jbschlosser

fbshipit-source-id: 623a2a254783b86391dc2b0777b688506adb4c0e

Co-authored-by: kshitij12345 <kshitijkalambarkar@gmail.com>
2021-03-16 11:44:37 -07:00
51233ea4b0 Disabling dispatch to OneDNN for group convolutions when groups size = 24 * n (#54015)
* Disabling dispatch to OneDNN for group convolutions when groups size is 24 * n

* Add condition to non-zero grps

Co-authored-by: Vitaly Fedyunin <vitaly.fedyunin@gmail.com>
2021-03-16 07:34:18 -07:00
31a1a00ae8 Update Kineto revision for 1.8.1 (#54044)
Summary:
Updating Kineto to include bugfixes for 1.8.1

Test Plan: CI
2021-03-16 07:31:47 -07:00
bb98a99638 [ONNX] Update embedding export wrt padding_idx (#53931) (#54033)
Summary:
To be in-sync with https://github.com/pytorch/pytorch/issues/53447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53931

Reviewed By: ngimel

Differential Revision: D27026616

Pulled By: malfet

fbshipit-source-id: 4c50b29fa296c90aeeeb1757bdaada92cbba33d4
2021-03-15 21:38:49 -07:00
295c7cf1de [ONNX] Update assign output shape for nested structure and dict output (#52893) (#53311) (#54019)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53311

Fixes dict output & nested tuple.
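
A hedged sketch of the kind of model output this fixes (hypothetical module and file name):

```python
import torch

class Model(torch.nn.Module):
    def forward(self, x):
        # Dict output plus a nested tuple; output shapes for these
        # previously were not assigned correctly in the exported graph.
        return {"out": x + 1}, (x, (x * 2,))

x = torch.randn(2, 3)
torch.onnx.export(Model(), (x,), "model.onnx", opset_version=12)
```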

Test Plan: Imported from OSS

Reviewed By: pbelevich, malfet

Differential Revision: D26922426

Pulled By: SplitInfinity

fbshipit-source-id: c2c6b71c8d978b990181e0b025626dbf6ef2199e
2021-03-15 18:52:11 -07:00
3233861ec4 Fix test to use proper condition. (#52216) (#54028)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52216

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26427506

Pulled By: ailzhang

fbshipit-source-id: ba4f2f66794cb2843926e5566eb4d25582f7fb2b

Co-authored-by: Ailing Zhang <ailzhang@fb.com>
2021-03-15 16:52:29 -07:00
47f4b3f7d4 Cherrypick #53576 into release/1.8 (#53766) 2021-03-15 13:36:09 -07:00
e450f1498f [ONNX] Support torch.isinf, torch.any and torch.all export to ONNX (#53328) (#53529) (#54007)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53529

Supported for ONNX export after opset 10.
This is not exportable to opsets < 10 due to
1. onnx::IsInf is introduced in opset 10
2. onnx::Equal does not accept float tensor prior to opset 11
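
A hedged example of an export this enables (hypothetical module and file name):

```python
import torch

class Model(torch.nn.Module):
    def forward(self, x):
        # torch.isinf maps to onnx::IsInf; any/all reduce to booleans.
        return torch.any(torch.isinf(x)), torch.all(x > 0)

x = torch.tensor([1.0, float("inf")])
torch.onnx.export(Model(), (x,), "isinf.onnx", opset_version=11)
```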

Test Plan: Imported from OSS

Reviewed By: pbelevich, malfet

Differential Revision: D26922418

Pulled By: SplitInfinity

fbshipit-source-id: 69bcba50520fa3d69db4bd4c2b9f88c00146fca7

Co-authored-by: BowenBao <bowbao@microsoft.com>
2021-03-15 13:05:59 -07:00
6fd01f9440 [ONNX] Update inputs/input_names formatting to avoid ValueError with scriptMethods (#53519) (#53548) (#54005)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53548

Fixes the issue faced in #53506.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D26922415

Pulled By: malfet

fbshipit-source-id: b61842827bb14cef8c7a7089b2426fa53e642c90

Co-authored-by: BowenBao <bowbao@microsoft.com>
2021-03-15 12:24:20 -07:00
b33e434d55 [v1.8.1] Pick up upstream fixes from TensorPipe (#53804)
- Support transferring >2GB over CMA
- Avoid loading stub version of CUDA driver
- Don't use unsupported mmap option on older kernels
- Don't join non-existing thread if CMA is not viable

The last two manifested as uncaught exceptions (hence crashes) when initializing RPC. The first one caused same-machine RPC requests to fail.
2021-03-15 12:22:10 -07:00
a3e4bf60bb [fix] nn.Embedding: allow changing the padding vector (#53447) (#53986)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53368

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53447

Reviewed By: albanD

Differential Revision: D26946284

Pulled By: jbschlosser

fbshipit-source-id: 54e5eec7da86fa02b1b6e4a235d66976a80764fc

Co-authored-by: kshitij12345 <kshitijkalambarkar@gmail.com>
2021-03-15 12:21:05 -07:00
e991cdaf58 [CherryPick] Fixes for distribution validation checks (#53763)
* Add sample validation for LKJCholesky.log_prob

* Fix distributions which don't properly honor validate_args=False

A number of derived distributions use base distributions in their
implementation.

We add what we hope is a comprehensive test of whether all distributions
actually honor skipping validation of arguments in log_prob, and then
fix the bugs we found. These bugs are particularly cumbersome in
PyTorch 1.8 and master, where validate_args is turned on by default.
In addition, one might argue that validate_args does not behave as
expected when the default is not to validate but validation is turned
on at instantiation (see the sketch after this list).

Arguably, there is another set of bugs, or at least inconsistencies,
where validation of inputs does not prevent invalid indices in
sample validation (with validation enabled, an IndexError is raised
in the test). We would encourage the implementors to be more
ambitious when validation is turned on and amend sample validation
to throw a ValueError for consistency.

* additional fixes to distributions

* address failing tests
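
A hedged sketch of the validate_args behavior referred to above (Bernoulli chosen only as a simple example):

```python
import torch
from torch.distributions import Bernoulli

# With validation off, log_prob must not raise on out-of-support values;
# the bugs fixed here were derived distributions that validated anyway.
d = Bernoulli(probs=torch.tensor([0.5]), validate_args=False)
print(d.log_prob(torch.tensor([0.5])))  # 0.5 is not a valid sample, but allowed

d_strict = Bernoulli(probs=torch.tensor([0.5]), validate_args=True)
try:
    d_strict.log_prob(torch.tensor([0.5]))
except ValueError as e:
    print("validation caught:", e)
```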

Co-authored-by: neerajprad <neerajprad@devvm903.atn0.facebook.com>
Co-authored-by: Thomas Viehmann <tv.code@beamnet.de>
2021-03-15 10:51:50 -07:00
4596a8ec8a Remove MNIST for XLA (#53274) (#53987)
Summary:
Mitigates https://github.com/pytorch/pytorch/issues/53267

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53274

Reviewed By: zhangguanheng66, ailzhang

Differential Revision: D26819702

Pulled By: cpuhrsch

fbshipit-source-id: 5b9b30db6f8fc414aa9f3c841429bf99bc927763

Co-authored-by: cpuhrsch <cpuhrsch@devvm2783.frc0.facebook.com>
2021-03-15 07:53:39 -07:00
512f289884 Example LSTMCell (#51983) (#54003)
Summary:
Fixes #51801.
Updated the LSTMCell example.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51983

Reviewed By: agolynski

Differential Revision: D26467104

Pulled By: zou3519

fbshipit-source-id: 31c8bf89b21cd2f748b2cc28a74169082d81503c

Co-authored-by: CarlosJose126 <43588143+CarlosJose126@users.noreply.github.com>
2021-03-15 07:50:49 -07:00
c439f85b16 Fix set_device_map docs (#53508) (#53822)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53508

closes #53501

Differential Revision: D26885263

Test Plan: Imported from OSS

Reviewed By: H-Huang

Pulled By: mrshenli

fbshipit-source-id: dd0493e6f179d93b518af8f082399cacb1c7cba6
2021-03-12 17:31:29 -08:00
30712fca7e ci: Remove special versioning privileges for cu102 (#53133) (#53734)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53133

In light of some issues where users were having trouble installing CUDA
specific versions of pytorch we should no longer have special privileges
for CUDA 10.2.

Recently I added scripts/release/promote/prep_binary_for_pypi.sh (https://github.com/pytorch/pytorch/pull/53056) to make
it so that we could theoretically promote any wheel we publish to
download.pytorch.org to pypi

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D26759823

Pulled By: seemethere

fbshipit-source-id: 2d2b29e7fef0f48c23f3c853bdca6144b7c61f22
(cherry picked from commit b8546bde09c7c00581fe4ceb061e5942c7b78b20)
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
2021-03-10 11:53:14 -08:00
debf62d95c [1.8.1] Explicitly export submodules and variables from torch module (#53675)
Summary:
For https://github.com/pytorch/pytorch/issues/47027.

Some progress has been made in https://github.com/pytorch/pytorch/issues/50665, but in my testing, trying to unwrap the circular dependencies is turning into a never-ending quest.

This PR explicitly exports things in the top-level torch module without any semantic effect, in accordance with this py.typed library guidance: https://github.com/microsoft/pyright/blob/master/docs/typed-libraries.md#library-interface

It may be possible to do some of the other fixes just using `__all__` where needed, but `__all__` has a semantic effect I would like to further review. This PR at least fixes simple completions like `torch.nn` in Pylance/pyright.
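
A sketch of the re-export idiom the pyright guidance describes (an illustrative `__init__.py` fragment, not the actual PR diff):

```python
# In a py.typed package's __init__.py, "import X as X" marks X as part of
# the public interface without changing runtime behavior:
from . import nn as nn
from . import optim as optim
```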

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52339

Reviewed By: smessmer

Differential Revision: D26694909

Pulled By: malfet

fbshipit-source-id: 99f2c6d0bf972afd4036df988e3acae857dde3e1

Co-authored-by: Jake Bailey <5341706+jakebailey@users.noreply.github.com>
2021-03-10 10:10:42 -08:00
e30dc8d21b enable autocast for xla (#48570) (#53671)
Summary:
For enabling amp in torch/xla, see [this](https://github.com/pytorch/xla/pull/2654).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48570

Reviewed By: ezyang

Differential Revision: D26120627

Pulled By: ailzhang

fbshipit-source-id: 32627b17c04bfdad128624676ea9bf6f117bc97d

Co-authored-by: Chengji Yao <yaochengji@hotmail.com>
2021-03-10 10:06:02 -08:00
4e590c9ced Docs cherrypicks 1.8.1 (#53674)
* [FX] Cherrypick docs fixes

* Update code links to point to 1.8
2021-03-09 17:23:28 -08:00
6e9f2c8df0 [1.8 release only] Remove fx graph mode quantization doc from release (#53055) 2021-03-02 12:26:26 -08:00
37c1f4a7fe Fix hipify_python (#52756)
Co-authored-by: rraminen <rraminen@amd.com>
Co-authored-by: Nikita Shulga <nshulga@fb.com>
2021-02-26 14:13:54 -08:00
49b74a52a4 Catch Flake8 error codes with multiple letters (#52750) (#52801)
Summary:
The Flake8 job has been passing on `master` despite giving warnings for [over a month](https://github.com/pytorch/pytorch/runs/1716124347). This is because it has been using a regex that doesn't recognize error codes starting with multiple letters, such as those used by [flake8-executable](https://pypi.org/project/flake8-executable/). This PR corrects the regex, and also adds another step at the end of the job which asserts that Flake8 actually gave no error output, in case similar regex issues appear in the future.
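
To illustrate the class of bug (the regexes below are hypothetical stand-ins, not the exact patterns from the job):

```python
import re

# The old pattern only matched single-letter codes like "W605":
old = re.compile(r"^(?P<file>.+):(?P<line>\d+):(?P<col>\d+): (?P<code>[A-Z]\d+) ")
# Multi-letter plugins emit codes like "EXE002", so the letter class must repeat:
new = re.compile(r"^(?P<file>.+):(?P<line>\d+):(?P<col>\d+): (?P<code>[A-Z]+\d+) ")

line = "test/distributed/test_c10d.py:1:1: EXE002 ..."
print(bool(old.match(line)), bool(new.match(line)))  # False True
```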

Tagging the following people to ask what to do to fix these `EXE002` warnings:

- https://github.com/pytorch/pytorch/issues/50629 authored by jaglinux, approved by rohan-varma
  - `test/distributed/test_c10d.py`
- https://github.com/pytorch/pytorch/issues/51262 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/__init__.py`
  - `torch/utils/data/datapipes/iter/loadfilesfromdisk.py`
  - `torch/utils/data/datapipes/iter/listdirfiles.py`
  - `torch/utils/data/datapipes/iter/__init__.py`
  - `torch/utils/data/datapipes/utils/__init__.py`
  - `torch/utils/data/datapipes/utils/common.py`
- https://github.com/pytorch/pytorch/issues/51398 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/readfilesfromtar.py`
- https://github.com/pytorch/pytorch/issues/51599 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/readfilesfromzip.py`
- https://github.com/pytorch/pytorch/issues/51704 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/routeddecoder.py`
  - `torch/utils/data/datapipes/utils/decoder.py`
- https://github.com/pytorch/pytorch/issues/51709 authored by glaringlee, approved by ejguan
  - `torch/utils/data/datapipes/iter/groupbykey.py`

Specifically, the question is: for each of those files, should we remove the execute permissions, or should we add a shebang? And if the latter, which shebang?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52750

Test Plan:
The **Lint / flake8-py3** job in GitHub Actions:

- [this run](https://github.com/pytorch/pytorch/runs/1972039886) failed, showing that the new regex catches these warnings properly
- [this run](https://github.com/pytorch/pytorch/runs/1972393293) succeeded and gave no output in the "Run flake8" step, showing that this PR fixed all Flake8 warnings
- [this run](https://github.com/pytorch/pytorch/pull/52755/checks?check_run_id=1972414849) (in https://github.com/pytorch/pytorch/issues/52755) failed, showing that the new last step of the job successfully catches Flake8 warnings even without the regex fix

Reviewed By: walterddr, janeyx99

Differential Revision: D26637307

Pulled By: samestep

fbshipit-source-id: 572af6a3bbe57f5e9bd47f19f37c39db90f7b804

Co-authored-by: Sam Estep <sestep@fb.com>
2021-02-26 07:49:51 -08:00
11c78e9cb3 Expose documentation for LKJCholesky distribution (#52904)
This is already added to the master branch in https://github.com/pytorch/pytorch/pull/52763.
2021-02-26 07:47:29 -08:00
d6943ea58d apply diff 52351 (#52649) 2021-02-23 07:51:38 -08:00
02b61b49ea [1.8] Update XNNPACK (#52647)
Cherry-pick 55d53a4e70 into release/1.8 branch
2021-02-23 05:31:57 -08:00
d553478c98 [v1.8] Make TensorPipe work around bug in old versions of libibverbs (#52615)
The bug affects PyTorch users who meet two conditions:
- they have an old version of libibverbs installed (the userspace library), namely older than v25, which dates from Jul 29, 2019;
- but they do _not_ have an InfiniBand kernel module loaded.

In those cases they will experience a crash (uncaught exception) happening when initializing RPC, mentioning an "unknown error -38".

There is a workaround, which is for those users to activate a killswitch (which is private and undocumented) to disable the `ibv` backend of TensorPipe.
2021-02-22 16:55:12 -08:00
63333e2a25 [1.8] Update api doc for enabling TcpStore on Windows (#52601)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51847

Reviewed By: albanD

Differential Revision: D26405678

Pulled By: malfet

fbshipit-source-id: 073b675225b48d1732771583f8f2473e0fdcf35c

Co-authored-by: Joe Zhu <jozh@microsoft.com>
2021-02-22 10:14:09 -08:00
8e7eebfc9a [1.8] Fix onnx mixed precision export for layernorm & fuseLogSoftmaxNllLoss (#52510)
Co-authored-by: Shubham Bhokare <32080845+shubhambhokare1@users.noreply.github.com>
2021-02-19 14:40:53 -08:00
f8afb8bdd0 [v1.8.0] Various CUDA 11.1 with BUILD_SPLIT_CUDA_FIXES (#52518)
Co-authored-by: Nikita Shulga <nshulga@fb.com>
Co-authored-by: peterjc123 <peterghost86@gmail.com>
Co-authored-by: Jane Xu <janeyx@fb.com>
2021-02-19 12:41:21 -08:00
0851cc42b0 Update freezing API - changes from 52337 (#52392)
Co-authored-by: eellison <eellison@fb.com>
2021-02-18 15:36:51 -08:00
804f7b6018 Add arm64 binary build (#52443) (#52469)
Summary:
This is getting tested by https://github.com/pytorch/pytorch/issues/52441.

Adds new config for macos arm64 to our binary builds.
Now stores artifacts for mac builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52443

Reviewed By: walterddr

Differential Revision: D26517330

Pulled By: janeyx99

fbshipit-source-id: 02774937a827bdd4c08486dc9f8fe63446917f1e
2021-02-18 15:17:27 -08:00
32758d30b3 onnx export of per channel fake quantize functions (#42835) (#52430)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39502

This PR adds support for exporting **fake_quantize_per_channel_affine** to a pair of QuantizeLinear and DequantizeLinear. Per tensor support was added by PR https://github.com/pytorch/pytorch/pull/39738.

`axis` attribute of QuantizeLinear and DequantizeLinear, which is required for per channel support, is added in opset13 added by https://github.com/onnx/onnx/pull/2772.

[update 1/20/2021]: opset 13 is now supported on master, and the added function is properly tested. The code was also rebased onto the new master.

The function is also tested offline with the following code
```python
import torch
from torch import quantization

from torchvision import models
qat_resnet18 = models.resnet18(pretrained=True).eval().cuda()

qat_resnet18.qconfig = quantization.QConfig(
    activation=quantization.default_fake_quant, weight=quantization.default_per_channel_weight_fake_quant)
quantization.prepare_qat(qat_resnet18, inplace=True)
qat_resnet18.apply(quantization.enable_observer)
qat_resnet18.apply(quantization.enable_fake_quant)

dummy_input = torch.randn(16, 3, 224, 224).cuda()
_ = qat_resnet18(dummy_input)
for module in qat_resnet18.modules():
    if isinstance(module, quantization.FakeQuantize):
        module.calculate_qparams()
qat_resnet18.apply(quantization.disable_observer)

qat_resnet18.cuda()

input_names = [ "actual_input_1" ]
output_names = [ "output1" ]

torch.onnx.export(qat_resnet18, dummy_input, "quant_model.onnx", verbose=True, opset_version=13)
```
It can generate the desired graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42835

Reviewed By: houseroad

Differential Revision: D26293823

Pulled By: SplitInfinity

fbshipit-source-id: 300498a2e24b7731b12fa2fbdea4e73dde80e7ea

Co-authored-by: Hao Wu <skyw@users.noreply.github.com>
2021-02-18 12:50:40 -08:00
bcb64a8084 Fix upsample bicubic2d batching handling on CPU. (#52389) (#52445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52389

Fixes: https://github.com/pytorch/pytorch/issues/49159

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26496319

Pulled By: gchanan

fbshipit-source-id: d385cd683ef09e0596a9875ce84d03e6e77acc93
2021-02-18 12:46:39 -08:00
f07991d396 update symeig backward note about similar eigenvalues (#52311) (#52446)
Summary:
First part of https://github.com/pytorch/pytorch/issues/49886 to at least properly warn users of the current state

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52311

Reviewed By: soulitzer

Differential Revision: D26495644

Pulled By: albanD

fbshipit-source-id: 72abdfe41cdbcc1ac739a536eb85d1aa4ba90897
2021-02-18 12:45:47 -08:00
c458cd4852 [v1.8.0] .circleci: Downgrade CUDA 11.2 -> 11.1 for binaries (#52151) (#52406)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52151

CUDA 11.2 might not be as performant as we thought, so let's downgrade to
something we think is more performant.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D26408314

Pulled By: seemethere

fbshipit-source-id: e2446aa0115e2c2a79718b1fdfd9fccf2072822d
(cherry picked from commit a11650b069729997b002032d70e9793477147851)
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
2021-02-18 10:59:03 -08:00
f7c4afc0f4 [cmake] Add explicit cublas->cudart dependency (#52243) (#52404)
Summary:
Necessary to ensure correct link order, especially if libraries are
linked statically. Otherwise, one might run into:
```
/usr/bin/ld: /usr/local/cuda/lib64/libcublasLt_static.a(libcublasLt_static.a.o): undefined reference to symbol 'cudaStreamWaitEvent@libcudart.so.11.0'
/usr/local/cuda/lib64/libcudart.so: error adding symbols: DSO missing from command line
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52243

Reviewed By: seemethere, ngimel

Differential Revision: D26437159

Pulled By: malfet

fbshipit-source-id: 33b8bb5040bda10537833f3ad737f535488452ea
2021-02-17 16:07:41 -08:00
20554c00b6 [1.8] Remove torch.vmap (#52397)
torch.vmap is a prototype feature and should not be in the stable
binary. This PR:

- Removes the `torch.vmap` API
- Removes the documentation entry for torch.vmap
- Changes the vmap tests to use an internal API instead of torch.vmap.

Test Plan:
- Tested locally (test_torch, test_autograd, test_type_hints, test_vmap), but also wait
for CI.
2021-02-17 16:05:34 -08:00
3464d64f08 [1.8] Fix libnvrtc discoverability in package patched by auditwheel (#52365) 2021-02-17 16:05:05 -08:00
c6972eb3ac Skip OneDNN Convolution in case of groups = 24 #50042 (#52313)
Co-authored-by: Vitaly Fedyunin <vitaly.fedyunin@gmail.com>
2021-02-17 16:04:26 -08:00
25562d3d41 Use side-stream in CPU to GPU copies in DDP (#50180) (#52270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50180

Resolves the regression in
https://github.com/pytorch/pytorch/issues/49819 by adding copy over background
stream similar to scatter. For internal use cases, this is gated with an env var that maintains the previous behavior when it is off.
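
For illustration, a minimal sketch of copying over a side stream (hypothetical tensors; not the DDP internals, and the env-var gating is omitted):

```python
import torch

copy_stream = torch.cuda.Stream()
cpu_tensor = torch.randn(1024).pin_memory()

# Issue the host-to-device copy on a side stream so it can overlap
# with compute on the current stream.
with torch.cuda.stream(copy_stream):
    gpu_tensor = cpu_tensor.to("cuda", non_blocking=True)

# Order the current stream after the copy before consuming the tensor,
# and tell the caching allocator about the cross-stream use.
torch.cuda.current_stream().wait_stream(copy_stream)
gpu_tensor.record_stream(torch.cuda.current_stream())
out = gpu_tensor * 2
```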

Test Plan: CI

Reviewed By: mrshenli, ngimel

Differential Revision: D25818170

fbshipit-source-id: e50c76c035504b2a44e2be084701cee45c90df75
2021-02-17 09:49:30 -08:00
cd63c37bc6 ports fix (#52242)
Co-authored-by: Mike Ruberry <mruberry@devfair044.maas>
2021-02-13 17:59:51 -08:00
c79decdbba [v1.8 patch] [Resubmission] Add a documentation page for DDP communication hooks (#52215)
Co-authored-by: wayi <wayi@devgpu238.prn2.facebook.com>
2021-02-12 16:37:23 -08:00
c307a3f336 [1.8] Do not print warning if CUDA driver not found (#51806) (#52050)
Summary:
This frequently happens when PyTorch compiled with CUDA support is installed on a machine that does not have NVIDIA GPUs.

Fixes https://github.com/pytorch/pytorch/issues/47038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51806

Reviewed By: ezyang

Differential Revision: D26285827

Pulled By: malfet

fbshipit-source-id: 9fd5e690d0135a2b219c1afa803fb69de9729f5e
2021-02-12 12:20:46 -08:00
f071020756 Workaround arm64 gcc error in std::copysign (#51900) (#52049)
Summary:
Move definition of copysign template and specialization for
bfloat16/half types before first use of copysign in that file

Add comment explaining why this is necessary

Fixes https://github.com/pytorch/pytorch/issues/51889

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51900

Reviewed By: walterddr

Differential Revision: D26321741

Pulled By: malfet

fbshipit-source-id: 888858b11d9708fa140fe9c0570cc5a24599205b
2021-02-12 08:00:46 -08:00
4f436f8570 fake_quant cachemask: remove Python bindings (#51878) (#52160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51878

`fake_quantize_per_tensor_affine_cachemask` and
`fake_quantize_per_channel_affine_cachemask` are implementation
details of `fake_quantize_per_tensor_affine` and
`fake_quantize_per_channel_affine`, removing the
Python bindings for them since there is no need to
expose them.

Test Plan:
```
python test/test_quantization.py TestFakeQuantize
```

Imported from OSS

Reviewed By: albanD, bugra

Differential Revision: D26314173

fbshipit-source-id: 733c93a3951453e739b6ed46b72fbad2244f6e97
(cherry picked from commit 33afb5f19f4e427f099653139ae45b661b8bc596)
2021-02-12 07:37:00 -08:00
ae11589710 [FX][1.8] Cherrypick three FX fixes to 1.8 (#52021)
* Fix leaf modules in Transformer

[ghstack-poisoned]

* Fix tuple type annotations

[ghstack-poisoned]

* Generalize dict key check in `create_arg` (#51927)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51927

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26329655

Pulled By: jamesr66a

fbshipit-source-id: a15e7d9564551521af12a8fde1c7524856f0cbc2
2021-02-12 07:35:34 -08:00
9e5bcc1020 1.8 cherrypick: Add metacompile of Ternary if (#51789) (#51913)
Summary:
Fixes issue: https://github.com/pytorch/pytorch/issues/49728
========
Ternary if operation fails in Torchscript when the condition variable is annotated as Final.

Tests:
=======
pytest -k test_ternary_static_if test/test_jit.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51789

Reviewed By: gmagogsfm

Differential Revision: D26278969

Pulled By: nikithamalgifb

fbshipit-source-id: 27d1383290211503188428fb2e8b7749f59ba16e

Co-authored-by: nikithamalgi <nikithamalgi@devvm146.prn0.facebook.com>
2021-02-09 21:34:26 -08:00
fa8578241d .jenkins: Release branch specific updates (#51982) 2021-02-09 21:33:29 -08:00
1368809532 [v1.8.0] [wip] doc_fix (#52006)
Summary:
tries to fix doc_test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51825

Reviewed By: bertmaher

Differential Revision: D26295583

Pulled By: ngimel

fbshipit-source-id: 13f6e7f1675d810adfd4abd2d579e2812fe54c80
(cherry picked from commit 6c0bf28da651eb8ff1d2d0dcfe807ea757fb61e5)
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Co-authored-by: Natalia Gimelshein <ngimel@fb.com>
2021-02-09 21:32:32 -08:00
4073248fc2 [FX] Hide experimental folder (#51987) 2021-02-09 15:44:33 -08:00
75153cb730 Disable unaliged-access test from TestVectorizedMemoryAccess.CopyKernel (#51864) (#51890)
Summary:
Test begins to fail after the driver update.

See https://github.com/pytorch/pytorch/issues/51863

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51864

Reviewed By: bertmaher

Differential Revision: D26304018

Pulled By: malfet

fbshipit-source-id: bb7ade2f28d8cf8f847159d4ce92391f0794c258

Co-authored-by: Nikita Shulga <nshulga@fb.com>
2021-02-09 10:17:18 -08:00
5bb69b080c concantenate LICENSE files when building a wheel (#51634) (#51882)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50695

I checked locally that the concatenated license file appears at `torch-<version>.dist-info/LICENSE` in the wheel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51634

Reviewed By: zhangguanheng66

Differential Revision: D26225550

Pulled By: walterddr

fbshipit-source-id: 830c59fb7aea0eb50b99e295edddad9edab6ba3a

Co-authored-by: mattip <matti.picus@gmail.com>
2021-02-09 10:16:12 -08:00
9112f4eded [FX][docs] Indent forward (#51802)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51802

lol

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D26284311

Pulled By: jamesr66a

fbshipit-source-id: 0d303d8c99131abb8d97e0acd0ac2d810e1e950c
2021-02-05 18:01:27 -08:00
8c48af822e pytorch docs: add fake_quantize functions documentation (#51748)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51748

Adding docs for `fake_quantize_per_tensor_affine` and `fake_quantize_per_channel_affine`
functions.

Note: not documenting `fake_quantize_per_tensor_affine_cachemask` and
`fake_quantize_per_channel_affine_cachemask` since they are implementation details
of `fake_quantize_per_tensor_affine` and `fake_quantize_per_channel_affine`,
and do not need to be exposed to the user at the moment.
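
A hedged usage example of the documented functions (values are arbitrary):

```python
import torch

x = torch.randn(4)
# Simulates quantize->dequantize with the given scale/zero_point,
# keeping the tensor in floating point.
y = torch.fake_quantize_per_tensor_affine(x, scale=0.1, zero_point=0,
                                          quant_min=-128, quant_max=127)
print(y)
```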

Test Plan: Build the docs locally on Mac OS, it looks good

Reviewed By: supriyar

Differential Revision: D26270514

Pulled By: vkuzo

fbshipit-source-id: 8e3c9815a12a3427572cb4d34a779e9f5e4facdd
2021-02-05 17:53:02 -08:00
ececbcfff2 [Conda][Kineto] Define weak acc_get_device_type if kineto is used (#51818)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51818

Reviewed By: ilia-cher

Differential Revision: D26291188

Pulled By: malfet

fbshipit-source-id: 68797e02fe4dd54d8030e67aaf28046a4fae0770
2021-02-05 17:46:30 -08:00
fb07aca7b0 Adding support for CUDA 11.2 in our nightly build matrix (#51611)
Summary:
Replacing 11.0 with 11.2 in our nightlies.

(am slightly uncertain why the manywheel linux tests worked before we added the GPU driver for 11.2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51611

Reviewed By: malfet, seemethere, zhangguanheng66

Differential Revision: D26282829

Pulled By: janeyx99

fbshipit-source-id: b15380e5c44a957e6a85e4f5fb9691ab9c6103a5
2021-02-05 15:40:31 -08:00
5c3a054b12 Add FLOPS support to the new profiler API. (#51734)
Summary:
The new profiler API was added in PR#48280. This PR is to add FLOPS
support to the new profiler API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51734

Test Plan:
```python
python test/test_profiler.py -k test_flops
```

Reviewed By: xuzhao9

Differential Revision: D26261851

Pulled By: ilia-cher

fbshipit-source-id: dbeba4c197e6f51a9a8e640e8bb60ec38df87f73
2021-02-05 15:03:35 -08:00
430329e875 Revert D26009829: Optimize relu on cpu using clamp_min
Test Plan: revert-hammer

Differential Revision:
D26009829 (2054cd56c5)

Original commit changeset: 7bb1583ffb3e

fbshipit-source-id: 3e945b438fb8d83f721e400ae69be8848cab9720
2021-02-05 14:48:06 -08:00
50c9c08203 Enable GPU/RE tags for caffe2/caffe2/python/TARGETS
Summary: Moving caffe2_core_gpu_python contbuild to use GPU/RE

Test Plan: CI

Reviewed By: malfet

Differential Revision: D26261826

fbshipit-source-id: a6f8c7bd8368c1cb69499ea0ea7d5add0956a7ad
2021-02-05 13:52:48 -08:00
2054cd56c5 Optimize relu on cpu using clamp_min (#50924)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50924

`clamp_min` seems slightly faster than `threshold` (on avx2 cpus)
because it compiles down to vmaxps, rather than vcmpps+vblendv.

I see the biggest perf difference (about 20% faster) with float
tensors at 32k-64k elements.  Bigger tensors are more memory bound
although it looks like it might still be a tiny win (2%).
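
A sketch of how one might reproduce the comparison (sizes and harness are illustrative):

```python
import torch
from torch.utils.benchmark import Timer

x = torch.randn(65536)  # ~64k floats, where the gap was largest
for stmt in ("torch.relu(x)", "torch.clamp_min(x, 0)"):
    print(Timer(stmt=stmt, globals={"torch": torch, "x": x}).timeit(1000))
```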

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D26009829

Pulled By: bertmaher

fbshipit-source-id: 7bb1583ffb3ee242e347f59be82e0712c7631f7e
2021-02-05 13:03:40 -08:00
3cfbf6d3ac [quick-checks] Allow gradlew to be executable (#51796)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51796

Reviewed By: IvanKobzarev

Differential Revision: D26280152

Pulled By: malfet

fbshipit-source-id: ab19ddc8589471002fb330d8d97c81f5a6deeb6f
2021-02-05 12:54:53 -08:00
029f857b22 [Metal] Add hardswish and hardsigmoid to metal, fix broadcasting for binary elementwise ops
Summary:
Add hardswish_ and hardsigmoid_ activations to enable MobileNetV3.

Also fix binary elementwise ops to work when the first input is being broadcasted rather than the second.

Test Plan:
Test on device:
```
arc focus2 pp-ios
```
Test on mac
```
buck test pp-macos
```

Reviewed By: xta0

Differential Revision: D26241385

fbshipit-source-id: 6ce7269d60d63cf909b75a7f4e18fb17ac2f5d31
2021-02-05 12:46:37 -08:00
a930162c69 Revert D26276903: [pytorch][PR] Add LazyBatchNormXd
Test Plan: revert-hammer

Differential Revision:
D26276903 (aa1fd6b45a)

Original commit changeset: 0ac706974178

fbshipit-source-id: bfe01b01cd460f1e2845ea5ef1fc1514e6b6ba54
2021-02-05 12:37:29 -08:00
33973d45a9 Add acc_get_device_type weak symbol to kineto_profler (#51787)
Summary:
Move `kinetoAvailable` to profiler_kineto.h and make it a constexpr
Update kineto submodule
Fixes https://github.com/pytorch/pytorch/issues/51026

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51787

Reviewed By: seemethere

Differential Revision: D26278170

Pulled By: malfet

fbshipit-source-id: 0cdd903cd8e3106c830ccce03b903b787ae33190
2021-02-05 11:52:45 -08:00
59cb693c90 [quant] add docs for embedding/embedding_bag (#51770)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51770

Test Plan:
tested locally on mac

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D26279112

fbshipit-source-id: 8675d3ef712ecbe545bad0d3502181b3ccdd7f89
2021-02-05 11:43:15 -08:00
9c2dd5775a Fixed slight bug in FX docs (#51779)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51779

Reviewed By: ngimel

Differential Revision: D26279623

Pulled By: Chillee

fbshipit-source-id: 0cd2a487ce6b80ce0d3f81e2b2334ade20d816bb
2021-02-05 11:27:39 -08:00
aa1fd6b45a Add LazyBatchNormXd (#51548)
Summary:
This PR implements UninitializedBuffer and LazyBatchnormXd based on https://github.com/pytorch/pytorch/issues/44538. (cc. emcastillo and albanD)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51548

Reviewed By: zhangguanheng66

Differential Revision: D26276903

Pulled By: albanD

fbshipit-source-id: 0ac706974178363f8af075e59b41d5989418922f
2021-02-05 10:27:04 -08:00
5a962369e2 [Gradient Compression] Check if the backend is NCCL when a DDP communication hook is registered (#51759)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51759

Currently, DDP communication hooks can only be supported on NCCL, so add a check in the registration methods. However, some unit tests actually register a comm hook on other backends like GLOO (example: `test_ddp_comm_hook_future_passing_cpu`); therefore, only do the check in `register_builtin_comm_hook`.
ghstack-source-id: 121115814

Test Plan: unit tests.

Reviewed By: pritamdamania87

Differential Revision: D26268581

fbshipit-source-id: c739fa4dca6d320202dc6689d790c2761c834c30
2021-02-05 09:59:12 -08:00
105c3d2196 Update CODEOWNERS (#51726)
Summary:
add myself and alban to folders

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51726

Reviewed By: albanD

Differential Revision: D26254528

Pulled By: soulitzer

fbshipit-source-id: 91477dda3ff81014dbadd3a93f5f511ac3da81e0
2021-02-05 09:01:18 -08:00
a7ba051fa6 [QNNPACK, Sparsity] Add dynamic linear sparse kernel for arm64 (#50591)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50591

Adds sparse kernels for arm64. Reg blocking factor of 8x8.

Test Plan:
q8gemm-sparse-test

Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D25925501

fbshipit-source-id: 8d62a8eb638f172ffaadfb1480ade0db35831189
2021-02-05 08:46:01 -08:00
70830b5ac0 [QNNPACK, Sparsity] Sparse kernel with 4x8 blocking (#50590)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50590

Larger blocking across the M dim, such as the 8 used in the previous PR, likely
introduces wasted compute on the shapes being benchmarked.
Here we introduce 4x8 (mr x nr) blocking. This helps by 1) packing
less data for small values of M and 2) letting the compute kernel write
the same number of bytes but more contiguously. It is not certain, but it
likely helps.

Test Plan:
q8gemm-sparse-test
fully-connected-sparse-test

Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D25925499

fbshipit-source-id: 01c661ceea38bd6ee8321bb85cf1d5da5de4e984
2021-02-05 08:42:53 -08:00
e8ee35a666 Add script to compare namespace content for release cleanup (#51685)
Summary:
Usage explanation will be in the release note runbook.

This allows to generate diffs like:
```
Processing torch.nn
Things that were added:
{'quantizable', 'ChannelShuffle', 'LazyConvTranspose2d', 'LazyConv2d', 'LazyConvTranspose3d', 'LazyConv1d', 'GaussianNLLLoss', 'LazyConv3d', 'PixelUnshuffle', 'UninitializedParameter', 'LazyLinear', 'LazyConvTranspose1d'}

Things that were removed:
set()
```

This can then be shared with module owners along with the commits to help them validate that the namespace changes for their submodule is as expected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51685

Reviewed By: zhangguanheng66

Differential Revision: D26260258

Pulled By: albanD

fbshipit-source-id: 40e40f86314e17246899d01ffa4b2631e93b52f7
2021-02-05 07:54:00 -08:00
28c5d90b67 [JIT] Allow implicit boolean conversion of containers (#51683)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51683

**Summary**
This commit enables implicit boolean conversion of lists, strings, and
dictionaries in conditional expressions. Like Python, empty lists,
strings and dictionaries evaluate to `False` and their non-empty
counterparts evaluate to `True`. This allows users to write code like

```
import torch
from typing import List

@torch.jit.script
def fn(l: List[int]):
    if l:
        ...
    else:
        ...
```

This has been requested by some users and would be a good usability
improvement.

**Test Plan**
This commit adds unit tests to `TestList`, `TestDict` and
`test_jit_string.py` to test this new feature.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26264410

Pulled By: SplitInfinity

fbshipit-source-id: b764c18fd766cfc128ea98a02b7c6c3fa49f8632
2021-02-05 00:34:35 -08:00
d3023d86ba Revert D26249330: [Gradient Compression] Add a documentation page for DDP communication hooks
Test Plan: revert-hammer

Differential Revision:
D26249330 (e62aabac43)

Original commit changeset: ab973390ddb7

fbshipit-source-id: d508daed76219e7ca588cf7fb38aeaaffc61acfd
2021-02-04 22:38:06 -08:00
1065c2d5b6 Fix clang-tidy warnings in python_sugared_value.{h,cpp} (#51703)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51703

Reviewed By: gchanan

Differential Revision: D26245798

Pulled By: gmagogsfm

fbshipit-source-id: 01620adca820968324687982cc48390ff9336d20
2021-02-04 21:29:40 -08:00
c941730b96 [JIT/Futures] support set_exception api (#50983)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50983

There is currently no way to handle/propagate errors with the python-based futures API (they are raised correctly if set with an error, but this is only possible from C++).

This diff allows the Future's `unwrap_func` to be set in python optionally, so users can set futures completed with an exception and the error will throw as expected. This is mostly to support the following use case in the next diff:

```
import torch
import torch.distributed.rpc as rpc

def unwrap(python_result):
    # Re-raise if the future was completed with an exception.
    if isinstance(python_result, Exception):
        raise python_result
    return python_result

ret_fut = torch.futures.Future(unwrap_func=unwrap)

rpc_fut = rpc.rpc_async(...)  # RPC future that times out
# Goal is to propagate the RPC error to this future.
def callback(res):
    # Note that ret_fut.set_result(res.wait()) won't propagate the error.
    try:
        ret_fut.set_result(res.wait())
    except Exception as e:
        ret_fut.set_result(e)

rpc_fut.add_done_callback(callback)
```
ghstack-source-id: 121021434

Test Plan:
unittest
```
buck test mode/dev-nosan mode/no-gpu //caffe2/test:futures -- te
st_unwrap --print-passing-details
```

Reviewed By: mrshenli

Differential Revision: D25950304

fbshipit-source-id: 7ee61e98fcd783b3f515706fa141d538e6d2174d
2021-02-04 20:22:19 -08:00
8e78dd6de8 [torch.futures] Fix doc inconsistency about callback args (#50979)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50979

Noticed that the documentation is inconsistent about the arg needed
in the callback. It appears to require the future, so fix this in the docs.
ghstack-source-id: 121021431

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D25944637

fbshipit-source-id: 0bfcd4040c4a1c245314186d29a0031e634b29c3
2021-02-04 20:22:14 -08:00
21afbba79b [torch.futures] Clarify callback behavior when future is completed (#50978)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50978

Noticed that the documentation is not clear that the callbacks are invoked
inline if the future is already completed. We should probably document this
behavior.
ghstack-source-id: 121021432

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D25944636

fbshipit-source-id: f4ac133d076ba9a5690fecfa56bde6d614a40191
2021-02-04 20:22:09 -08:00
c3f2f3294e [RPC] Add option to make rref.get_type not block. (#50977)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50977

Adds a `blocking` flag that can be set to False to make this API return a `Future` to the type. This is to make this function non-blocking, mostly for a future change that will allow `rref.rpc_async()` to be completely non-blocking (it currently calls and waits for this function that issues an RPC in-line).
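
A hedged sketch of the two modes (assumes an initialized RPC framework; the underscore-prefixed method name mirrors the internal API and should be treated as illustrative):

```python
import torch
import torch.distributed.rpc as rpc

# Assumes rpc.init_rpc(...) has been called and "worker1" exists.
rref = rpc.remote("worker1", torch.add, args=(torch.ones(2), 1))

t = rref._get_type()                  # blocking (default): returns the type
fut = rref._get_type(blocking=False)  # non-blocking: returns a Future to the type
print(fut.wait())
```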
ghstack-source-id: 121021433

Test Plan: Modified UT

Reviewed By: mrshenli

Differential Revision: D25944582

fbshipit-source-id: e3b48a52af2d4578551a30ba6838927b489b1c03
2021-02-04 20:18:50 -08:00
716a8c2153 make forward AD API private (#51693)
Summary:
Avoid leaking private functions in `torch.` namespace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51693

Reviewed By: gchanan

Differential Revision: D26245046

Pulled By: albanD

fbshipit-source-id: 5481b57eb56ba96581848598d32ebf5894a7adf0
2021-02-04 19:02:29 -08:00
e62aabac43 [Gradient Compression] Add a documentation page for DDP communication hooks (#51715)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51715

Add a documentation page for DDP communication hooks.

Screenshot:

{F369781049}

Test Plan: View locally

Reviewed By: pritamdamania87

Differential Revision: D26249330

fbshipit-source-id: ab973390ddb785c5191f587a1b2b6de7d229e50e
2021-02-04 18:53:53 -08:00
de7eeb7752 Removes nonzero method warning (#51618)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44284

https://github.com/pytorch/pytorch/pull/45413 incorrectly left this only partially fixed because it did not update the separate list of method signatures that were deprecated. This PR correctly fixes https://github.com/pytorch/pytorch/issues/44284. A test is added for the behavior, but until the WARN_ONCE flag is added it's toothless.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51618

Reviewed By: ngimel

Differential Revision: D26220181

Pulled By: mruberry

fbshipit-source-id: 397b47ac7e962d108d8fde0f3dc6468d6327d1c3
2021-02-04 17:43:43 -08:00
e7ff0854c6 [doc] Fix inconsistencies with torch.linalg.inv and deprecate torch.inverse (#51672)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51672

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D26240535

Pulled By: heitorschueroff

fbshipit-source-id: 16dbd0a8a8c0f851faa12bf092dbedfb7cb0b292
2021-02-04 17:19:45 -08:00
ff4848aaa1 [doc] Fix inconsistencies with linalg.pinv docs and deprecate pinverse (#51671)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51671

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D26240534

Pulled By: heitorschueroff

fbshipit-source-id: 26e2a3cad2105e6e2b7779e785666b38597450c5
2021-02-04 17:19:41 -08:00
e7d7256f2d [doc] Fix inconsistencies with torch.linalg.matrix_rank doc (#51660)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51660

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D26234100

Pulled By: heitorschueroff

fbshipit-source-id: b9c48c0e172461ed2770d52c07a147152d51d4b7
2021-02-04 17:19:37 -08:00
0308261ddc [doc] Fix inconsistencies with torch.linalg.eigvalsh (#51659)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51659

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D26234102

Pulled By: heitorschueroff

fbshipit-source-id: 6a6711c7b129cd29f2c733c635c4192caaf42d22
2021-02-04 17:19:33 -08:00
87504c3265 [doc] Fix inconsistencies with torch.linalg.eigh (#51658)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51658

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D26234101

Pulled By: heitorschueroff

fbshipit-source-id: c1b5cc74ba0b32c49bfd843e97f957971d8be364
2021-02-04 17:19:29 -08:00
4835f203ec [doc] Fix inconsistencies with torch.linalg.det docs (#51651)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51651

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D26234103

Pulled By: heitorschueroff

fbshipit-source-id: 00ec7dae942bda887f57cb76752f8b5ef25d276a
2021-02-04 17:19:25 -08:00
7c12afb5e2 [doc] Fix inconsistencies with torch.linalg.cond doc (#51641)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51641

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D26234104

Pulled By: heitorschueroff

fbshipit-source-id: 5c2c9a206c4051092305d910ed0e808458e5afd9
2021-02-04 17:13:42 -08:00
4d703d040b Linear autodiff revert revert (#51613)
Summary:
Patch PR https://github.com/pytorch/pytorch/issues/50856 and roll back the revert D26105797 (e488e3c443).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51613

Reviewed By: mruberry

Differential Revision: D26253999

Pulled By: ngimel

fbshipit-source-id: a20b1591de06dd277e4cd95542e3291a2f5a252c
2021-02-04 16:32:05 -08:00
6dcbf396aa [QNNPACK, Sparsity] Added prepacking base aarch32 kernels (#50589)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50589

Adds 1) an input prepacking kernel and 2) compute kernels that process the
prepacked activation.
The hunch is that input prepacking will help with 1) cache locality and
2) avoiding a lot of address-compute instructions.
The cache-locality benefit mainly comes from the fact that we are doing mr=8
and nr=4: mr being 8 likely results in cache line evictions, as the cache
associativity is likely 4. Laying out transposed activations blocked
by mr=8 puts all of the transposed activation in one contiguous block.
The downside is that we now transpose all the blocks regardless of whether
they participate in compute. However, it is likely that the entire activation
matrix participates in compute for some output block.
Also adds a benchmark.

Test Plan:
q8gemm-sparse-test
fully-connected-test-sparse

Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D25925502

fbshipit-source-id: b2c36419a2c5d23b4a49f25f9ee41cee8397c3be
2021-02-04 16:20:08 -08:00
47a6703bdb [QNNPACK, Sparsity] ARMV7, aarch32, kernels for dynamic linear (#50588)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50588

This diff introduces aarch32 asm kernel for sparse dense gemm.

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D25925498

fbshipit-source-id: e9e19ce67157a4bc3cba4656f926e828442f09ad
2021-02-04 16:16:35 -08:00
3fec1e5025 fix hardsigmoid_backward for boundary case (#51454)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51438.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51454

Reviewed By: mruberry

Differential Revision: D26243461

Pulled By: ngimel

fbshipit-source-id: 7d954dc47427f02b7cbf0344e9889db223bfb525
2021-02-04 14:37:58 -08:00
8c737f732b replacing ubuntu-latest with ubuntu-18.04 (#51744)
Summary:
following https://github.com/pytorch/pytorch/pull/51725#pullrequestreview-583703598

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51744

Reviewed By: samestep

Differential Revision: D26262089

Pulled By: janeyx99

fbshipit-source-id: fa24e5c15d24750f2a5ccd5b6a5aad9a4a3ad09f
2021-02-04 14:17:06 -08:00
094d597679 raise windows tol to 30% (#51733)
Summary:
Up the Windows tolerance set by https://github.com/pytorch/pytorch/pull/35818, as CI is still showing some flakes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51733

Test Plan: CI

Reviewed By: zou3519

Differential Revision: D26258005

Pulled By: robieta

fbshipit-source-id: 864c848b7b31a05a2d07d1e683342b3202377c10
2021-02-04 14:09:10 -08:00
ab0cf3b6b5 Add 'repeat' argument to profiler.schedule (#51630)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51630

Reviewed By: gdankel

Differential Revision: D26246317

Pulled By: ilia-cher

fbshipit-source-id: 28b572c837184fe1b2a07dd57e99aa72cb93a9cb
2021-02-04 13:51:04 -08:00
62aea33d7f Revert D26237328: Add compare_set operation and test to TCPStore
Test Plan: revert-hammer

Differential Revision:
D26237328 (7d00aec6bc)

Original commit changeset: c6837a4cc34f

fbshipit-source-id: 662f8067ead9bce0da13b35d393fb781635dd2b9
2021-02-04 13:43:05 -08:00
ecfb73aaca Update docs for torch.profiler.tensorboard_trace_handler (#51636)
Summary:
![image](https://user-images.githubusercontent.com/62738430/106856207-17f8c000-66f9-11eb-80c9-844f79de423e.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51636

Reviewed By: orionr

Differential Revision: D26246309

Pulled By: ilia-cher

fbshipit-source-id: 083868e9231727638238c5f5ca31e3566d5e2e7e
2021-02-04 13:32:59 -08:00
d4d5f8569f [FX] Fix mypy error in FX for rewriter (#51740)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51740

Reviewed By: jamesr66a

Differential Revision: D26261009

Pulled By: Chillee

fbshipit-source-id: ce97316aede5509fc8ed90b4eb6b758e2bc1fa7a
2021-02-04 13:15:51 -08:00
b150f150ba Add division overload with rounding_mode selection (#51706)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50280

As mentioned in gh-43874, this adds a `rounding_mode={'true', 'trunc', 'floor'}`
argument so `torch.div` can be used as a replacement for `floor_divide` during
the transitional period.

I've included dedicated kernels for truncated and floor division which
aren't strictly necessary for float, but do perform significantly better (~2x) than
doing true division followed by a separate rounding kernel.

Note: I introduce new overloads for `aten::div` instead of just adding a default
`rounding_mode` because various JIT passes rely on the exact operator schema.
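
Usage of the new overload (a quick sketch; `'trunc'` rounds toward zero, `'floor'` matches `floor_divide`):

```python
import torch

a = torch.tensor([ 7.0, -7.0])
b = torch.tensor([ 2.0,  2.0])
print(torch.div(a, b))                         # tensor([ 3.5000, -3.5000])
print(torch.div(a, b, rounding_mode="trunc"))  # tensor([ 3., -3.])
print(torch.div(a, b, rounding_mode="floor"))  # tensor([ 3., -4.])
```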

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D26123271

Pulled By: mruberry

fbshipit-source-id: 51a83717602114597ec9c4d946e35a392eb01d46
2021-02-04 13:08:36 -08:00
949ab213dd Revert "Revert D26246231: [FX] Edits after comprehensive pass over docs" (#51728)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51728

This reverts commit 6c80fd005f23a55b3e4e655e867e0eed493ee416.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26254130

Pulled By: jamesr66a

fbshipit-source-id: f301688f85c512076fee9b83a986677ef893d2c5
2021-02-04 13:01:09 -08:00
8c0da1f5e9 [ONNX] Modifications in remove inplace ops passes to better handle binary inplace ops (#51318) (#51572)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51572

Modifications in remove_inplace_ops_for_onnx pass and remove_inplace_ops pass to better handle binary inplace ops

* Handles special case of binary inplace ops, where the first input node has a lower type precedence than the second input node.
* When the inplace node is converted to a regular op, this information is lost and the resulting type is based on type precedence, just like regular ops. To avoid this loss of information, we add a cast node before the input node with the higher data type precedence, so that both the input types are the same.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26203117

Pulled By: SplitInfinity

fbshipit-source-id: f018b503701b9067dba053c2764c3b92ef1abc38
2021-02-04 12:44:49 -08:00
c7f1595b19 fix bug (#51222) (#51527)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51527

Fix bug in scatter_add symbolic

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26203119

Pulled By: SplitInfinity

fbshipit-source-id: e61f024e2daa7bc396fb264b8823a72ebf94ccdb
2021-02-04 12:44:44 -08:00
25b18bb5d7 [ONNX] Support list remove for onnx export (#51373) (#51526)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51526

* Support aten::Delete
* Refactor prepare_inplace_ops_for_onnx into one pass.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26203114

Pulled By: SplitInfinity

fbshipit-source-id: ce940bca54a30c39f4b0810f62b0e7b497508f59
2021-02-04 12:44:40 -08:00
6d47e2cff8 [ONNX] Fix opset 11 ConstantChunk with negative dim (#51396) (#51525)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51525

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26203115

Pulled By: SplitInfinity

fbshipit-source-id: d76942f7cc5812c8a1cc16891e4956cc658283d8
2021-02-04 12:44:35 -08:00
ba824eb2d6 [ONNX] Update unsafe_chunk() method to support new version 13 of Split operator. (#51415) (#51524)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51524

* def unsafe_chunk() support and test in ops13.

* Use _unsqueeze_helper instead of Unsqueeze operator

* Cast the splits into long.

* Change the test to a fixed dimension.

* Update test_pytorch_onnx_onnxruntime.py

* Disable test_loop_with_list for opset 13.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26203123

Pulled By: SplitInfinity

fbshipit-source-id: b273aeff8339faa0e8e9f1fcfbf877d1b703209f

Co-authored-by: Negin Raoof <neginmr@utexas.edu>
2021-02-04 12:44:31 -08:00
8ae6b0c5f9 [ONNX] Enable Constant Folding for ONNX Opset 13 (#51096) (#51523)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51523

* Enable Constant Folding for ONNX Opset 13

* fix CI clang-diagnostic

* fix integers type

* fix comments: sort axes and support negative numbers

* update squeeze op constant folding

* fix format warning

* fix clang-format issue

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26203111

Pulled By: SplitInfinity

fbshipit-source-id: c33637ab39db614207bd442c6ab464bd09339b4a

Co-authored-by: hwangdeyu <deyhuang@qq.com>
2021-02-04 12:44:26 -08:00
1c7d966432 Update error message that displays when encountering an op unsupported for ONNX export. (#51387) (#51522)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51522

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26203121

Pulled By: SplitInfinity

fbshipit-source-id: 5920995b735cecb500b12948b8ad91803e576dcb
2021-02-04 12:44:22 -08:00
586c2e8d62 [ONNX] Fix graph sequence output from loop node (#51305) (#51521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51521

* Add loop & if node to the list of nodes that could produce sequence type output.
* Switch from `[]` to `at()` to avoid segfault of out of range access.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26203112

Pulled By: SplitInfinity

fbshipit-source-id: e990eeed933124b195be0be159271e33fb485063
2021-02-04 12:44:17 -08:00
3cc46002a3 [ONNX] Fix graph position to insert clone node for inplace op removal (#50123) (#51520)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51520

The previous insertBefore approach might end up inserting the clone node in inner sub-blocks, even though the node is then used later at other outside call sites.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26203124

Pulled By: SplitInfinity

fbshipit-source-id: 999511e901ad1087f360bb689fcdfc3743c78aa4
2021-02-04 12:44:12 -08:00
0e7e4d4217 [ONNX] Add silu operator support for onnx (#51193) (#51519)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51519

Support for yolov5 compound-scaled object detection models export.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26203120

Pulled By: SplitInfinity

fbshipit-source-id: c70bd730ee5d6f8bdebaf8ff764b94ffe7673808
2021-02-04 12:44:08 -08:00
9191b639ba [ONNX] Enable remaining failed tests in opset13 (#50806) (#51518)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51518

* enable remaining test in opset13

* add comments for error version test info

* fix comments: opset12 unbind problem

* add ignore[no-redef]

* fix format

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26203122

Pulled By: SplitInfinity

fbshipit-source-id: e7d95bd2ce13f79f11965be82f640379cd55ff0f

Co-authored-by: hwangdeyu <deyhuang@qq.com>
2021-02-04 12:44:04 -08:00
3f185ac18e [ONNX] Export get/set attribute nodes (#50768) (#51517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51517

Fix get/set attributes when getting/setting a model parameter.
This PR also fixes inplace ops in If blocks.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26203116

Pulled By: SplitInfinity

fbshipit-source-id: bed6ee6dd92b5b43febc8c584a6872290f8fe33f
2021-02-04 12:43:59 -08:00
1829268e7f [ONNX] Improve error message for parse_arg in symbolic functions (#50512) (#51516)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51516

The previous error message looked like this:
```
RuntimeError: Unexpected node type: onnx::Gather
```
Now it looks like this:
```
RuntimeError: Expected node type 'onnx::Constant' for argument 'groups' of node 'conv1d', got 'onnx::Gather'.
```

Repro example:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.jit.script
def conv(x, w):
    return F.conv1d(x, w, groups=x.shape[0])

class Net(nn.Module):
    def forward(self, x, w):
        return conv(x, w)

model = Net()

x = torch.randn(8, 8, 512)
w = torch.randn(8, 1, 3)
torch.onnx.export(model,
                  (x, w),
                  "file.onnx",
                  opset_version=12)
```

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26203118

Pulled By: SplitInfinity

fbshipit-source-id: 607b22f4cba4baa24154f197914b6817449ab9f8
2021-02-04 12:43:54 -08:00
8dd9fefacb [ONNX] Fix bug in unfold symbolic (#50504) (#51515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51515

Fix bug in unfold symbolic

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26203113

Pulled By: SplitInfinity

fbshipit-source-id: 3a1b0013624d918de762a88ac6de8c9cafa0f732
2021-02-04 12:43:50 -08:00
7255b3f6b7 [ONNX] Update constant-folding of Gather op (#50554) (#51514)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51514

Update constant-folding of the Gather operator so it also includes cases where the rank of the indices input is 0.
Currently it only supports cases where the rank of indices is 1.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26191323

Pulled By: SplitInfinity

fbshipit-source-id: 7edcbd8835b0248fefb908aca394f5cca5eae29e
2021-02-04 12:40:30 -08:00
2d305b97e9 [FX] Added partial concrete values for symbolic tracing (#51609)
Summary:
Currently it's passed in as a dict, but it might be worth considering whether we want to support other methods of passing it in (like a list corresponding to the positional args).
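
A hypothetical usage sketch of the dict-based API (the traced function and its argument names are illustrative, not taken from the diff):

```python
import torch
from torch.fx import symbolic_trace

def f(x, flag):
    # data-dependent control flow normally cannot be traced symbolically
    return x * 2 if flag else x

# fixing 'flag' to a concrete value lets tracing resolve the branch
traced = symbolic_trace(f, concrete_args={'flag': True})
print(traced.graph)
```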

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51609

Reviewed By: zou3519

Differential Revision: D26224464

Pulled By: Chillee

fbshipit-source-id: 305769db1a6e5fdcfb9e7dcacfdf153acd057a5a
2021-02-04 12:06:02 -08:00
2e8e560cdf Fix anomaly mode memory leak (#51610)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51349

The memory leak happens when 1) `create_graph` is True AND 2) detect anomaly mode is on. When a backward node's constructor is called during backward, the current evaluating node is assigned as a "parent" of the created node. The code that assigns the parent encounters the below issue:

`functionToPyObject(parent_node)` returns a new PyObject (with refcount 1) or, if the PyObject already exists, increments its refcount by 1. However, [PyDict_SetItem](1b55b65638/Objects/dictobject.c (L1532)) calls into [insertdict](https://github.com/python/cpython/blob/v3.8.1/Objects/dictobject.c#L1034), which increments the refcount again. This means that when the dict is destroyed, the refcount of the PyObject is still at least one. This keeps `parent_node` (the backward function) alive, which in turn keeps the saved tensors alive.

Similar calls in the codebase to `functionToPyObject` won't require Py_DECREF if it is then passed into a tuple (instead of dict), because the analogous PyTuple_SetItem call does not increment refcount.
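
A minimal sketch of the two conditions under which the leak was reported (per issue #51349):

```python
import torch

with torch.autograd.set_detect_anomaly(True):           # condition 2: anomaly mode on
    x = torch.randn(8, requires_grad=True)
    y = x.exp().sum()
    g, = torch.autograd.grad(y, x, create_graph=True)   # condition 1: create_graph=True
# before this fix, the backward graph (and its saved tensors) stayed alive here
```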

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51610

Reviewed By: albanD

Differential Revision: D26240336

Pulled By: soulitzer

fbshipit-source-id: 2854528f66fab9dbce448f8a7ba732ce386a7310
2021-02-04 11:53:37 -08:00
0222966ecd Fix several minor things in .circleci/README.md (#51724)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51724

Reviewed By: walterddr

Differential Revision: D26252671

Pulled By: samestep

fbshipit-source-id: 53781c391e3b54f3896e88bce07f7ee66a19ac92
2021-02-04 11:43:59 -08:00
14273126d2 Numeric Suite: Swap with shadow modules only for quantized part of model (#51052)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51052

Ensure that shadow modules are inserted only for quantized modules in a model. Removes redundant module insertion.
ghstack-source-id: 121041113

Test Plan: buck test caffe2/test:quantization --  'test_compare_model_stub_partial \(quantization\.test_numeric_suite\.TestEagerModeNumericSuite\)'

Reviewed By: vkuzo

Differential Revision: D26054016

fbshipit-source-id: 73fc2fd2f0239b0363f358c80e34566d06a0c7cb
2021-02-04 11:40:30 -08:00
a0137808a7 Note on Modules for 1.8 docs (#51536)
Summary:
A new note on Modules for 1.8 documentation.

Rendered form can be seen here: https://alband.github.io/doc_view/notes/modules.html
(thanks Alban!)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51536

Reviewed By: albanD

Differential Revision: D26254282

Pulled By: jbschlosser

fbshipit-source-id: 09cbd46aa268a29b6f54fd48ffe1d6b98db0ff31
2021-02-04 11:28:11 -08:00
de9364aef2 fixes clang-tidy-11 install by using ubuntu18.04 instead of 20.04 (#51725)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51725

Reviewed By: walterddr

Differential Revision: D26255539

Pulled By: janeyx99

fbshipit-source-id: 1b4459e0c474938c134c529501c6c04106d5b18e
2021-02-04 11:20:23 -08:00
1e2df9e46d [cuda] masked_scatter : static_cast init_value to circumvent cuda 11.2 issue (#51614)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51544

Tested locally as there is no CI for 11.2 as of now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51614

Reviewed By: malfet

Differential Revision: D26253965

Pulled By: ngimel

fbshipit-source-id: 6d666a54871510ad0d00f915e45bbebcebc93015
2021-02-04 10:44:49 -08:00
7d00aec6bc Add compare_set operation and test to TCPStore (#51593)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51593

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D26237328

Pulled By: H-Huang

fbshipit-source-id: c6837a4cc34f8247df6e1c29c1f40fd9e7953313
2021-02-04 10:36:58 -08:00
003a240e68 [package] use WeakValueDictionary for global imported module registry (#51666)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51666

This ensures the modules will get properly unloaded when all references
to them die
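
A minimal sketch of the mechanism (illustrative only; `FakeModule` stands in for torch.package's imported module objects):

```python
import weakref

class FakeModule:
    pass

registry = weakref.WeakValueDictionary()
m = FakeModule()
registry["my_package.my_module"] = m
print(len(registry))   # 1
del m                  # last strong reference dies...
print(len(registry))   # 0 -- the entry is dropped, so the module can be unloaded
```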

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Differential Revision: D26232574

Pulled By: suo

fbshipit-source-id: a9889965aa35ba2f6cbbfbdd13e02357cc706cab
2021-02-04 09:42:18 -08:00
6c80fd005f Revert D26246231: [FX] Edits after comprehensive pass over docs
Test Plan: revert-hammer

Differential Revision:
D26246231 (c22bc4821d)

Original commit changeset: 8d6278a9fe1d

fbshipit-source-id: fdc83289f8fe7986bc02181eec55e4e72be2d812
2021-02-04 09:26:21 -08:00
4d85e30133 Support at::cpu on non-structured kernels (#51590)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51590

This PR backports a subset of Jiakai's changes from
https://github.com/pytorch/pytorch/pull/51554 that adds support
for at::cpu in non-structured kernels.

The unusual bits:

- Need to add a new forward inference rule for doing conversions
  of const optional<Tensor>& to const Tensor&
- Need to give the wrapper functions a prefix so that the call to
  wrapper is not ambiguous

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D26209871

Pulled By: ezyang

fbshipit-source-id: 8162686039675ab92a2af7a14f6b18941f8944df
2021-02-04 09:19:45 -08:00
668e0f3598 Split anonymous and namespaced definitions in RegisterDispatchKey (#51585)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51585

Some payoff from the stack of refactors.  When I initially landed
at::cpu, Brian asked me why I couldn't just separate the anonymous
and namespaced definitions.  Well, it used to be annoying.  Now it's
not annoying anymore, so go ahead and split them up.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D26209873

Pulled By: ezyang

fbshipit-source-id: 63057d22acfaa0c17229947d9e65ec1193e360ec
2021-02-04 09:19:41 -08:00
a626b78467 Factor out structured generation into its own subclass. (#51583)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51583

There are no substantive changes in this PR.  The cluster of structured
helper methods is now split off into its own class.  To make sure all of
the original closure was available, I subclassed RegisterDispatchKey and
passed it all on; the only new thing closed over is the structured
functions group being processed.  I also renamed all the methods to
remove structured_ from their names as it is now redundant.

Most of the benefit is being able to remove a level of indentation
from gen_one.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D26209872

Pulled By: ezyang

fbshipit-source-id: 76c11410a24968d4f3d8a2bbc9392251a7439e6e
2021-02-04 09:19:37 -08:00
93c4f9f972 Split out RegisterDispatchKey to its own file (#51508)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51508

No substantive changes.  The codegen for this file was getting a
bit long so I moved it off into tools.codegen.dest submodule (I
wanted to do tools.codegen.gen but that conflicts with the existing
module; oy vey!)  To do this I had to move some other functions around
so that they were more generally accessible.  Otherwise
self-explanatory.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D26187856

Pulled By: ezyang

fbshipit-source-id: fd3784571d03d01c4acb7ca589fcde4492526408
2021-02-04 09:19:32 -08:00
6045663f39 Use Literal to model targets. (#51500)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51500

I'm going to add some new Target types shortly, so having tighter
types for the individual unions will make it clearer which ones
are valid.

This is also the first use of typing_extensions in the codegen,
and I want to make sure it works.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D26187854

Pulled By: ezyang

fbshipit-source-id: 6a9842f19b3f243b90b210597934db902b816c21
2021-02-04 09:16:22 -08:00
c22bc4821d [FX] Edits after comprehensive pass over docs (#51705)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51705

Pull Request resolved: #51679

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D26246231

Pulled By: jamesr66a

fbshipit-source-id: 8d6278a9fe1da5e6c34eff4fedc4c7e18533fe0f
2021-02-04 08:11:07 -08:00
9920ae665b Make te a hidden package for now (#51690)
Summary:
As discussed with suo , having it in `torch._C.XX` means that it automatically gets added to `torch.XX` which is unfortunate. Making it `torch._C._XX` means that it won't be added to `torch.`.

Let me know if that approach to hide it is not good and we can update that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51690

Reviewed By: gchanan

Differential Revision: D26243207

Pulled By: albanD

fbshipit-source-id: 3eb91a96635e90a6b98df799e3a732833dd280d5
2021-02-04 07:58:38 -08:00
ecf8166522 Support Union[NoneType, T] as input type (#51605)
Summary:
ghstack-source-id: 32db9661ce0f9441ef7061285bc24967c2808ea6
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51605

Fixes https://github.com/pytorch/pytorch/issues/51582
=========
In Python 3.9+, `Union[T, NoneType]` and `Union[NoneType, T]` are treated as `OptionalType`.
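
A minimal sketch of an annotation this change accepts:

```python
from typing import Union
import torch

@torch.jit.script
def first_or_default(x: Union[None, int]) -> int:  # NoneType-first, treated as Optional[int]
    if x is None:
        return -1
    return x

print(first_or_default(None), first_or_default(7))  # -1 7
```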

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51606

Test Plan:
====
python test/test_jit.py -v TestJit.test_union_to_optional

Reviewed By: pbelevich

Differential Revision: D26242353

Pulled By: nikithamalgifb

fbshipit-source-id: 0ac441fa1bdf2fb1044e3fe131bee47adda90bbb
2021-02-04 06:25:41 -08:00
f1f9b049d8 [profiler] Support top-level memory events (#51421)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51421

Mark memory events that did not happen within an operator context
explicitly in the profiler output.

Test Plan: python test/test_profiler.py -k test_memory_profiler

Reviewed By: ngimel

Differential Revision: D26166518

Pulled By: ilia-cher

fbshipit-source-id: 3c14d3ac25a7137733ea7cc65f0eb48693a98f5e
2021-02-04 04:14:15 -08:00
a9584f29c1 Fix attribution of some CUDA events to CPU events (#51632)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51632

Some fixes:
 - attribute CUDA Runtime events to proper PyTorch CPU events
 - make sure we don't accidentally attribute some CUDA kernels to the
 CUDA Runtime events that have semantically different ids
 - minor fixes in the output

Test Plan:
CI
https://gist.github.com/ilia-cher/0e78d0440fe02b77ff6721571c14f01c
https://gist.github.com/ilia-cher/8f655cf15beb1b11547fd3564a1c3958

Reviewed By: gdankel

Differential Revision: D26222734

Pulled By: ilia-cher

fbshipit-source-id: 13571dbeea0222ee1a531edacd1f4153f1e38da3
2021-02-04 03:54:02 -08:00
d6452a1a0c [profiler] Default activities value (#51561)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51561

Using CPU as a default activities value
https://github.com/pytorch/pytorch/issues/51337

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D26198910

Pulled By: ilia-cher

fbshipit-source-id: 7d7b227059a8eb48dc600a5ec077dd811fd9c8b4
2021-02-04 03:50:29 -08:00
7abba67d8c add dumping callstack to kineto (#51565)
Summary:
In the profiler, pass operators' callstacks to Kineto and dump them into the Chrome tracing file.
The kineto side update is merged [here](66a4cad380)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51565

Reviewed By: malfet

Differential Revision: D26219324

Pulled By: ilia-cher

fbshipit-source-id: 96ac818012336602368647ff7b75048070f63b28
2021-02-04 03:30:32 -08:00
8c3e0ddbc6 [Usability] Tolerate torch.jit.script call to Enum classes (#51624)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51624

Reviewed By: SplitInfinity

Differential Revision: D26244694

Pulled By: gmagogsfm

fbshipit-source-id: c87a068cd11d6f497fa48dc206215300c55d6539
2021-02-04 01:51:49 -08:00
86861095fa Graceful invalidation of Python Node/Value/Block when C++ object is deleted (#50326)
Summary:
Previously this could cause segfaults and the like; now it raises an exception.
Thread safety hasn't been an objective.

I have a followup to expand the Python interface for the API.

Fixes https://github.com/pytorch/pytorch/issues/49969.

wanchaol

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50326

Reviewed By: pbelevich

Differential Revision: D26096234

Pulled By: gmagogsfm

fbshipit-source-id: 5425772002eb4deb3830ed51eaa3964f22505840
2021-02-04 01:34:46 -08:00
c8af338407 Expand benchmark utils docs (#51664)
Summary:
Add some much needed documentation on the Timer callgrind output format, and expand what is shown on the website.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51664

Reviewed By: tugsbayasgalan

Differential Revision: D26246675

Pulled By: robieta

fbshipit-source-id: 7a07ff35cae07bd2da111029242a5dc8de21403c
2021-02-04 00:22:41 -08:00
1518aee639 unbreak bc test (#51702)
Summary:
Caused by the revert of https://github.com/pytorch/pytorch/issues/48223

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51702

Reviewed By: mruberry

Differential Revision: D26245905

Pulled By: ngimel

fbshipit-source-id: 9fd7860ecb5c22b2e568db3347d51e648d6c5d6b
2021-02-03 23:03:26 -08:00
6a945bfb5c Fix memory leak in qnnpack ops (#51612)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51612

Test Plan: `pytest test/quantization/test_quantized_op.py` passes

Reviewed By: kimishpatel, dhruvbird

Differential Revision: D26217925

Pulled By: axitkhurana

fbshipit-source-id: f422a868d34ea5fe122fcdcce8b80c7859bfc415
2021-02-03 22:40:58 -08:00
e60f18c2ad Generate header with version #defines for LibTorch (#50073)
Summary:
Uses cmake's `configure_file()` macro to generate a new `torch/csrc/api/include/torch/version.h` header with `TORCH_VERSION_{MAJOR,MINOR,PATCH}` \#defines from an input file `torch/csrc/api/include/torch/version.h.in`.

For Bazel builds, this is accomplished with `header_template_rule()`.

For Buck builds, this is accomplished with `fb_native.genrule()`.

Fixes https://github.com/pytorch/pytorch/issues/44365

<img width="1229" alt="Screen Shot 2021-01-05 at 3 19 24 PM" src="https://user-images.githubusercontent.com/75754324/103809279-3fd80380-5027-11eb-9039-fd23922cebd5.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50073

Reviewed By: glaringlee

Differential Revision: D25855877

Pulled By: jbschlosser

fbshipit-source-id: 6bb792718c97e2c2dbaa74b7b7b831a4f6938e49
2021-02-03 22:18:53 -08:00
23c50a4a50 [PyTorch Mobile] Support torchbind custom classes in lite interpreter (#51432)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51432

ghstack-source-id: 120976584

torchbind is a convenient way to expose a custom class to both Python and TorchScript. CREATE_OBJECT is used to create an object of a custom class.

CREATE_OBJECT was not supported by the lite interpreter. The major reason was that, for custom classes defined directly in Python, there is no language parser in the lite interpreter. That is still the case. However, for torchbind classes that are defined in C++, a Python/TorchScript parser is not needed.

This diff is to support the case of torchbind custom classes.
1. The class type can be resolved at import level.
2. If the class is not the supported torchbind class, an error message is provided at export stage. Workaround is also suggested.
3. Unit tests. C++: ```LiteInterpreterTest::BuiltinClass``` is added as an end-to-end test on supported class. Python: ```test_unsupported_createobject``` is changed to ```test_unsupported_classtype``` to test unsupported classes.

Test Plan: CI

Reviewed By: raziel

Differential Revision: D26168913

fbshipit-source-id: 74e8b6a12682ad8e9c39afdfd2b605c5f8e65427
2021-02-03 21:57:19 -08:00
1ffd26f8d8 [quant] Add reflection padding to conv (#49011)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49011

Differential Revision: D25394384

Test Plan: Imported from OSS

Reviewed By: ayush29feb

Pulled By: z-a-f

fbshipit-source-id: 256aded53c3c6555772aacfc5b0bbd32ef24c972
2021-02-03 21:44:12 -08:00
c41678fd53 Use deterministic impl of index_put and index backward CPU when torch.are_deterministic_algorithms_enabled() == True (#51388)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51366

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51388

Reviewed By: zou3519

Differential Revision: D26235290

Pulled By: ngimel

fbshipit-source-id: 64cce1a5e75d8a9ce9807c28d641da82ede666e2
2021-02-03 21:37:33 -08:00
f1a63b7c10 [FX] Added how to write transformations section (#51278)
Summary:
![image](https://user-images.githubusercontent.com/6355099/106121588-b8614a00-6125-11eb-923f-fcdf575cd6cd.png)

I still need to add links to vmap/grad/decomposition, but those haven't been added to the examples folder yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51278

Reviewed By: zou3519

Differential Revision: D26223103

Pulled By: Chillee

fbshipit-source-id: 3ad9bf76cd3438743edecdc17c44f8d1e00e5ea1
2021-02-03 21:32:43 -08:00
bd3ae117fc Fixes cat backward formula to return correct gradient values for R -> C case (#51681)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51681

Fixes https://github.com/pytorch/pytorch/issues/51627

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D26238748

Pulled By: anjali411

fbshipit-source-id: 1dc47f8ddddbf3f2c176f21e5dcee917f84f4c93
2021-02-03 21:29:55 -08:00
d8742eeed0 [quant] Support 2 dim input in quantized batchnorm 1d (#51597)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51597

aligning quantized batchnorm behavior with fp batchnorm

Test Plan:
python test/test_quantization.py TestQuantizedOps.test_batch_norm
python test/test_quantization.py TestQuantizedOps.test_batch_norm_relu

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26212489

fbshipit-source-id: 663d5d70cc82ea5cc68e66452590efe1342998f1
2021-02-03 21:05:03 -08:00
5d123ecf2f Fix caffe2 for LLVM trunk
Summary: LLVM trunk is at 13 now. I'm relaxing the places that only support up to 12.

Test Plan: Try Sandcastle staging builds.

Reviewed By: ayermolo

Differential Revision: D26227448

fbshipit-source-id: 0b69a9c135b34db4de94b82ee38d2fb1b328888b
2021-02-03 20:45:11 -08:00
0c60922fb0 mem-efficient learnable fake quantization (#49315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49315

Update the learnable fake quantization to use C++ and CUDA kernels, and resolve some issues with using it under PyTorch DDP.
The updated quantization operator has a different gradient calculation for scale and zero_point when the output is at the endpoints of the clamp operation: it calculates the gradient according to the gradient of the `clamp` function. This behavior is consistent with the gradient calculation for non-learnable fake-quant ops.
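
A sketch of the endpoint-aware scale gradient in the standard learnable fake-quant (LSQ-style) formulation — assuming scale $s$, zero point $z$, and quantization bounds $q_{\min}$, $q_{\max}$; this is a reading of the behavior described above, not code from the diff:

$$
\mathrm{out}(x) = \big(\mathrm{clamp}(\lfloor x/s \rceil + z,\; q_{\min},\; q_{\max}) - z\big) \cdot s,
\qquad
\frac{\partial\,\mathrm{out}}{\partial s} =
\begin{cases}
\lfloor x/s \rceil - x/s & \text{if } q_{\min} \le \lfloor x/s \rceil + z \le q_{\max} \\
q_{\min} - z & \text{if clamped at the lower endpoint} \\
q_{\max} - z & \text{if clamped at the upper endpoint}
\end{cases}
$$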
ghstack-source-id: 120821868

Test Plan:
# learnable_fake_quantization forward/backward op test
## Unit Test:
`buck test mode/dev-nosan -c fbcode.platform=platform009 //caffe2/test:quantization -- -v TestFakeQuantize`

## Benchmark Test:
`buck run mode/opt //caffe2/benchmarks/operator_benchmark/pt:quantization_test -- --operators FakeQuantizePerTensorOpBenchmark`

`buck run mode/opt //caffe2/benchmarks/operator_benchmark/pt:quantization_test -- --operators FakeQuantizePerChannelOpBenchmark`

### In **microseconds** (`1e-6` seconds):
References: P171624031
input size: [1, 3, 256, 256]
|                            | C++ Kernel | Non-backprop C++ Kernel |
|----------------------------|------------|-------------------------|
| Per Tensor CPU Forward     | 1372.123   | 1365.981                |
| Per Tensor Cuda Forward    | 84.586     | 27.205                  |
| Per Channel CPU Forward    | 2306.668   | 2299.991                |
| Per Channel Cuda Forward   | 154.742    | 135.219                 |
| Per Tensor CPU Backward    | 2544.617   | 581.268                 |
| Per Tensor Cuda Backward   | 304.529    | 137.335                 |
| Per Channel CPU Backward   | 3328.188   | 582.088                 |
| Per Channel Cuda Backward  | 504.176    | 134.082                 |

input size: [1, 3, 512, 512]

|                            | C++ Kernel | Non-backprop C++ Kernel |
|----------------------------|------------|-------------------------|
| Per Tensor CPU Forward     | 5426.244   | 5726.440                |
| Per Tensor Cuda Forward    | 85.834     | 26.871                  |
| Per Channel CPU Forward    | 9125.913   | 9118.152                |
| Per Channel Cuda Forward   | 159.599    | 145.117                 |
| Per Tensor CPU Backward    | 14020.830  | 2214.864                |
| Per Tensor Cuda Backward   | 285.525    | 131.302                 |
| Per Channel CPU Backward   | 16977.141  | 2104.345                |
| Per Channel Cuda Backward  | 541.511    | 120.222                 |

# use learnable_fake_quantization in AI-denoising QAT:
f229412681

Reviewed By: raghuramank100

Differential Revision: D24479735

fbshipit-source-id: 5275596f3ce8200525f4d9d07d0c913afdf8b43a
2021-02-03 18:57:47 -08:00
7918f37e8c [FX] Move examples to pytorch/examples (#51686)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51686

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D26241146

Pulled By: jamesr66a

fbshipit-source-id: b9cda75997fb98afd0e59ea78074fd7bd26ecebf
2021-02-03 18:41:11 -08:00
f2c4deabeb Extend subgraph_rewriter logic (#51532)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51532

- Change output of `replace_pattern` to `List[Match]` reflecting the
pattern(s) matched in the original graph
- Ensure that all Callables (not just FunctionType objects) work with
the rewriter
- Fix incorrect matching in degenerate case (`test_subgraph_rewriter_correct_output_replacement`)
- Verify that pattern matching works when pattern and original graph are
the same

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D26193082

Pulled By: ansley

fbshipit-source-id: 7f40c3862012a44adb88f403ade7afc37e50417f
2021-02-03 18:14:37 -08:00
627ec8badf Type-annotate tools/generate_torch_version (#51637)
Summary:
And add it to mypy.ini

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51637

Reviewed By: janeyx99

Differential Revision: D26225123

Pulled By: malfet

fbshipit-source-id: d70d539ae58a14321e82f4592aaa44b3ce6b6358
2021-02-03 18:07:01 -08:00
50d903f19f [optim] make functional api be private (#51316) (#51665)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51665

This reverts commit 896f82aa92eb7557229053a21da786f5927e64e0.

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D26232608

Pulled By: vincentqb

fbshipit-source-id: ca006baf4fb672c11c1bb003c39a29cbadb63dd3
2021-02-03 17:59:05 -08:00
45e5562fcc Beef up {jacobian, hessian} vectorize docs; eliminate a warning (#51638)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51638

This PR makes the following doc changes:
- Makes it clear to users that they should use vectorize "at their own
risk"
- Makes it clear that vectorize uses the "experimental prototype vmap"
so that when users see error messages related to vmap they will know
where it is coming from.

This PR also:
- makes it so that {jacobian, hessian} call a version of vmap that
doesn't warn the user that they are using an "experimental prototype".
The regular torch.vmap API does warn the user about this. This is to
improve UX a little, because the user already knows from discovering
the flag and reading the docs what they are getting themselves into (see the usage sketch below).
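
A usage sketch of the flag:

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    return x.sin()

x = torch.randn(3)
J = jacobian(f, x, vectorize=True)  # batches the grad calls through the prototype vmap
print(J.shape)                      # torch.Size([3, 3]), diagonal = cos(x)
```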

Test Plan:
- Add test that {jacobian, hessian} with vectorize=True don't raise
warnings

Reviewed By: albanD

Differential Revision: D26225402

Pulled By: zou3519

fbshipit-source-id: 1a6db920ecf10597fb2e0c6576f510507d999c34
2021-02-03 17:15:16 -08:00
443a431ac3 Revert D25074763: [WIP] Update foreach APIs to use scalar lists
Test Plan: revert-hammer

Differential Revision:
D25074763 (cce84b5ca5)

Original commit changeset: 155e3d2073a2

fbshipit-source-id: ef0d153e2740b50bd4a95f7a57c370bb5da46355
2021-02-03 17:06:40 -08:00
d1bc1ab8ca Revert D25502940: Refactor ForeachUnaryOps.cu
Test Plan: revert-hammer

Differential Revision:
D25502940 (5cf3278723)

Original commit changeset: fce2f18a4f62

fbshipit-source-id: 2cef82bd3cb34783d9a0c6c16cc4321abab31932
2021-02-03 17:02:11 -08:00
16cfe970e0 Updates linalg documentation per feature review process (#51620)
Summary:
Notes the module is in beta and that the policy for returning optionally computed tensors may change in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51620

Reviewed By: heitorschueroff

Differential Revision: D26220254

Pulled By: mruberry

fbshipit-source-id: edf78fe448d948b43240e138d6d21b780324e41e
2021-02-03 16:11:57 -08:00
1ee0c42d6d move ZipDataset to Zip DataPipe (#51599)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51599

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26212859

Pulled By: glaringlee

fbshipit-source-id: 3fabcf8876d3c9c24339dbf6a12e0bb04b400108
2021-02-03 15:42:59 -08:00
34d4d79966 Autograd doc note fix (#51661)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51661

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D26230912

Pulled By: anjali411

fbshipit-source-id: 94323d7bce631a4c5781020e9650495461119ede
2021-02-03 15:08:35 -08:00
0d9ca21d74 [Static Runtime] Native stack for contiguous inputs (#50863)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50863

- Avoid calling unsqueeze on every input tensor by copying data directly
- Model benchmark shows small improvement: -2.3% (b=1), -1.1% (b=20)
- This diff does not yet modify torch::stack implementation, only the static_runtime path. A followup diff will do this.

Test Plan:
# Test
```
buck test //caffe2/aten:native_test
buck run //caffe2/test:torch
```
# Op benchmark
expected no changes here because this diff only touches static runtime
```
Baseline                  |Native                   |Change
6.38                      |6.336                    |-0.69%
6.553                     |6.588                    |0.53%
14.904                    |14.883                   |-0.14%
5.657                     |5.68                     |0.41%
5.612                     |5.795                    |3.26%
6.051                     |6.058                    |0.12%
4.225                     |4.252                    |0.64%
4.24                      |4.294                    |1.27%
6.28                      |4.249                    |-32.34%
6.267                     |4.257                    |-32.07%
418.932                   |404.356                  |-3.48%
417.694                   |404.752                  |-3.10%
1592.455                  |1583.277                 |-0.58%
2919.261                  |2685.636                 |-8.00%
211.458                   |202.838                  |-4.08%
211.518                   |203.229                  |-3.92%
783.953                   |792.198                  |1.05%
1457.823                  |1348.824                 |-7.48%
2032.816                  |1975.961                 |-2.80%
2090.662                  |2000.612                 |-4.31%
6487.098                  |6635.41                  |2.29%
11874.702                 |10853.302                |-8.60%
2123.83                   |2039.272                 |-3.98%
2195.453                  |2221.82                  |1.20%
6435.978                  |6593.363                 |2.45%
11852.205                 |10858.92                 |-8.38%
2036.526                  |1983.042                 |-2.63%
2055.618                  |2072.03                  |0.80%
6417.192                  |6681.064                 |4.11%
12468.744                 |10888.336                |-12.67%
4959.704                  |4954.734                 |-0.10%
5121.823                  |4996.84                  |-2.44%
5082.105                  |5029.652                 |-1.03%
5395.936                  |5438.628                 |0.79%
5162.756                  |5114.147                 |-0.94%
23798.08                  |21884.065                |-8.04%
4957.921                  |4972.01                  |0.28%
4971.234                  |4968.977                 |-0.05%
5005.909                  |5039.95                  |0.68%
5159.614                  |5180.426                 |0.40%
5013.221                  |5202.684                 |3.78%
20238.741                 |20212.581                |-0.13%
7632.439                  |7610.345                 |-0.29%
7589.376                  |7679.148                 |1.18%
7859.937                  |7850.485                 |-0.12%
8214.213                  |8150.846                 |-0.77%
11606.562                 |11724.139                |1.01%
34612.919                 |34817.677                |0.59%
```

# Adindexer model benchmark

```
caffe2=0 batch={1|20} profile=1 ./scripts/bwasti/static_runtime/run.sh
```

## Baseline
```
Batch 1
0.00291311 ms.    3.97139%. aten::stack (1 nodes)
Batch 20
0.00477447 ms.   0.934081%. aten::stack (1 nodes)

```

## Native stack (this change)
```
Batch 1
0.00115161 ms.    1.67388%. aten::stack (1 nodes)
Batch 20
0.00264831 ms.   0.543767%. aten::stack (1 nodes)
```

Reviewed By: hlu1

Differential Revision: D25988638

fbshipit-source-id: 82ce84c88963cae40dc5819004baf03ce9093ecc
2021-02-03 14:59:52 -08:00
fe67438f32 Replace AT_ASSERTM in ATen/core (#51579)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51579

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D26206404

Pulled By: ezyang

fbshipit-source-id: b7e6a530b8ca3ebfa02c87037c37010f9ee0b0db
2021-02-03 14:42:30 -08:00
c60dacd4cf Replace all AT_ASSERTM in ATen/native (#51147)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51147

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D26206403

Pulled By: ezyang

fbshipit-source-id: 6a5c331337bf03b3dc29ceef8f7eeb4539b22c7f
2021-02-03 14:39:21 -08:00
f38e1d2d60 [quant][graphmode][fx] Enable inception_v3 and googlenet static quant test (#51402)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51402

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26162805

fbshipit-source-id: 28ddc66f0593d28539dd6c6d3f617541e698d3bd
2021-02-03 14:32:00 -08:00
8e53bf010d Use new TensorPipe functions to create channels (#51550)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51550

ghstack-source-id: 120931213

Test Plan: Export to CircleCI

Reviewed By: beauby

Differential Revision: D26147946

fbshipit-source-id: edd44b5edf7041efcc9662cc3bfc550663976fc1
2021-02-03 14:20:49 -08:00
56ef24bc0f Use new TensorPipe functions to create transports (#51549)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51549

ghstack-source-id: 120931215

Test Plan: Export to CircleCI

Reviewed By: beauby

Differential Revision: D26147369

fbshipit-source-id: 43f58f27edec964c24c0bf4ea76f2a47695ee1ea
2021-02-03 14:17:49 -08:00
47557b95ef Removed typographical error from tech docs (#51286)
Summary:
Duplications removed from the tech docs.

![Screenshot](https://user-images.githubusercontent.com/71665475/106158807-6e5b8100-6184-11eb-9036-bccdf2086c31.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51286

Reviewed By: albanD

Differential Revision: D26227627

Pulled By: ailzhang

fbshipit-source-id: efa0cd90face458673b8530388378d5a7eb0f1cf
2021-02-03 14:09:36 -08:00
333a0c8b6f Add support for generating faithful at::cpu signatures (#51499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51499

I'm going to turn on at::cpu signatures on for all operators; before
I do it I want to make sure I'm at feature parity everywhere.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D26187855

Pulled By: ezyang

fbshipit-source-id: 8fdfd9d843fc98435b1f1df8b475d3184d87dc96
2021-02-03 14:03:50 -08:00
81c7c3bae5 Add api.structured; switch structured kernels to use const Tensor& everywhere (#51490)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51490

Mutable Tensor ref is a source of endless confusion for kernel writers;
if we're going to make everyone rewrite their kernels, might as well
also get rid of mutable Tensor& while we're at it.

This is a refactor-then-small-update double whammy.  The refactor
is to separate tools.codegen.api.structured from api.native for
describing the type signatures of structured kernels (previously,
I was naughtily reusing native for this purpose--now I need it to
behave differently as Tensor).  This started off as a copy paste, but
since there are not that many structured kernels so far I could delete
all of the legacy logic from native that didn't make sense (without
having to go out and fix all the use sites all at once).

One more small addition was teaching translate to convert Tensor& to const Tensor&.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D26182413

Pulled By: ezyang

fbshipit-source-id: ed636866add3581179669cf9283f9835fcaddc06
2021-02-03 14:03:46 -08:00
648cdb7d0a Relax type signature for tools.codegen.api.translate (#51477)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51477

Passing in a full binding is still OK, but if you have less
(e.g., an Expr/CType), that will do too. I'll need this for
some codegen patches I'm doing later.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D26179560

Pulled By: ezyang

fbshipit-source-id: 5730dfb2c91bf5325496e57b0c91eb6823c9194d
2021-02-03 14:00:47 -08:00
43df03de13 [Gradient Compression] Replace torch.sqrt(torch.sum(col ** 2)) by torch.norm() (#51629)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51629

Leverage the existing util functions as much as possible for potential performance gain.
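
A minimal sketch of the equivalence being relied on:

```python
import torch

col = torch.randn(1024)
a = torch.sqrt(torch.sum(col ** 2))  # the old formulation: three separate kernels
b = torch.norm(col)                  # the fused util used after this change
print(torch.allclose(a, b))          # True
```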

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120919883

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

No performance regression:
f248664994 uses `torch.norm()`
```
total:
  32 GPUs -- 32 GPUs: p25:  1.050    30/s  (batch size 32)
p50:  1.230    26/s  (batch size 32)
p75:  1.449    22/s  (batch size 32)
p90:  1.611    19/s  (batch size 32)
p95:  1.702    18/s  (batch size 32)

backward:
  32 GPUs -- 32 GPUs: p25:  0.769    41/s  (batch size 32)
p50:  0.920    34/s  (batch size 32)
p75:  1.139    28/s  (batch size 32)
p90:  1.322    24/s  (batch size 32)
p95:  1.440    22/s  (batch size 32)
```

f248678690 does not use `torch.norm()`
```
total:
  32 GPUs -- 32 GPUs: p25:  1.056    30/s  (batch size 32)
p50:  1.249    25/s  (batch size 32)
p75:  1.443    22/s  (batch size 32)
p90:  1.608    19/s  (batch size 32)
p95:  1.711    18/s  (batch size 32)

backward:
  32 GPUs -- 32 GPUs: p25:  0.777    41/s  (batch size 32)
p50:  0.939    34/s  (batch size 32)
p75:  1.127    28/s  (batch size 32)
p90:  1.322    24/s  (batch size 32)
p95:  1.448    22/s  (batch size 32)
```

Reviewed By: pritamdamania87

Differential Revision: D26219835

fbshipit-source-id: 31d8ad3401d4efced4a6069f4f1e169ea3372697
2021-02-03 13:39:11 -08:00
00675292ca replace silufp16 with cubic interpolation (#51645)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51645

added cubic interpolation

Test Plan: increased the input domain, reduced the threshold to 0

Reviewed By: kausv

Differential Revision: D26212239

fbshipit-source-id: e0813d8a4f3f54cfd0bf62e385cd28fa4a1976e8
2021-02-03 12:58:38 -08:00
cae4379826 Enable FLOPS Computation for Experimental Kineto Profiler (#51503)
Summary:
Add the FLOPS metric computation to the experimental Kineto profiler.
This includes saving the necessary extra arguments and computing FLOPS in the C++ code,
and extracting the FLOPS value from the Python frontend.
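
A usage sketch (assuming `with_flops` is the frontend flag wired up by this change):

```python
import torch
from torch.autograd import profiler

x, w = torch.randn(64, 128), torch.randn(128, 128)
with profiler.profile(record_shapes=True, with_flops=True) as prof:
    torch.mm(x, w)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```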

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51503

Test Plan:
Build PyTorch with USE_KINETO option, then run the unit test:

```python
python test/test_profiler.py -k test_flops
```

Reviewed By: ilia-cher

Differential Revision: D26202711

Pulled By: xuzhao9

fbshipit-source-id: 7dab7c513f454355a220b72859edb3ccbddcb3ff
2021-02-03 12:15:23 -08:00
3361d365bd [Gloo] Use TORCH_CHECK for ensuring tag is nonnegative (#51370)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51370

TORCH_CHECK should be used when confirming the correctness of function
arguments like the tag passed to Gloo functions.
ghstack-source-id: 120908449

Test Plan: Sandcastle/CI

Reviewed By: mingzhe09088

Differential Revision: D26152359

fbshipit-source-id: ddffaa6f11393aaedaf0870759dc526d8d4530ee
2021-02-03 11:48:20 -08:00
a3f2fe0d52 Prevent CUDAFuture from using uninitialized device index (#51505)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51505

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D26187380

Pulled By: mrshenli

fbshipit-source-id: 437bb1244a65ee859458d9a87fdaef9f4dd20b59
2021-02-03 11:04:33 -08:00
a651696ab4 fix misspelling in swa_utils.pyi (#51608)
Summary:
Change `avg_fun -> avg_fn` to match the spelling in the `.py` file.
(`swa_utils.pyi` should match `swa_utils.py`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51608

Reviewed By: glaringlee

Differential Revision: D26224779

Pulled By: zou3519

fbshipit-source-id: 01ff7173ba0a996f1b7a653438acb6b6b4659de6
2021-02-03 10:51:22 -08:00
c639513378 [TensorExpr] Resubmit: Introduce ExternalCall nodes to TE IR. (#51594)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51594

ExternalCall nodes represent opaque calls to external functions that fill a
tensor (buffer) with values. They can be used to include nodes that are
otherwise not representable as TE, or whose TE representation is currently too
slow.

To make an external function available in NNC as ExternalCall, one needs to
implement a "bridge" function that would take raw (void*) pointers to the data
along with the arrays containing dimension info. This function would then
internally call the desired external function and make sure the results of the
call are correctly placed in the provided raw data buffers.

The reason the PR was previously reverted was that the LLVM generated
calls to bridge functions were breaking unwind tables. This is now fixed
by requiring bridge functions to never throw and setting the
corresponding attribute in the LLVM generated code.

Differential Revision: D26213882

Test Plan: Imported from OSS

Reviewed By: pbelevich, ngimel

Pulled By: ZolotukhinM

fbshipit-source-id: db954d8338e2d750c2bf0a41e88e38bd494f2945
2021-02-03 10:22:54 -08:00
18a7ec7d7d Update the JIT complex type name to be consistent with Python (#51476)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51476

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D26179237

Pulled By: anjali411

fbshipit-source-id: 6a5c60c8545eb42416583836b8038ceffd3f3244
2021-02-03 09:59:08 -08:00
896f82aa92 [optim] make functional api be private (#51316)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51316

Make the optim functional API private until we release it as beta

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26213469

fbshipit-source-id: b0fd001a8362ec1c152250bcd57c7205ed893107
2021-02-03 09:29:33 -08:00
550c965b2e Re-enable test_standalone_load for Windows 11.1 (#51596)
Summary:
This fixes the previous error by adding stricter conditions in cpp_extension.py.

To test, run a split torch_cuda build on Windows with `export BUILD_SPLIT_CUDA=ON && python setup.py develop`, and then run the following test: `python test/test_utils.py TestStandaloneCPPJIT.test_load_standalone`. It should pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51596

Reviewed By: malfet

Differential Revision: D26213816

Pulled By: janeyx99

fbshipit-source-id: a752ce7f9ab9d73dcf56f952bed2f2e040614443
2021-02-03 08:58:34 -08:00
727f163bea caffe2 test.sh pip might not need sudo if pip is root (#50223)
Summary:
Update the logic in the MAYBE_SUDO check. The assumption that sudo is needed
whenever pip was installed as a user was incorrect: pip could be installed as
root and run as root. The initial assumption was that pip was installed as
root while the user was non-root.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50223

Reviewed By: H-Huang

Differential Revision: D26212127

Pulled By: walterddr

fbshipit-source-id: 20b316606b6c210dc705a972c13088fa3d9bfddd
2021-02-03 08:13:03 -08:00
5cf3278723 Refactor ForeachUnaryOps.cu (#49248)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49248

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D25502940

Pulled By: izdeby

fbshipit-source-id: fce2f18a4f62f7a5fdd6747707d006c3588530d1
2021-02-03 07:05:27 -08:00
52de407b4b [DataLoader] Rename Functional DataSet to DataPipe (#51488)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51488

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D26209888

Pulled By: ejguan

fbshipit-source-id: cb8bc852b1e4d72be81e0297308a43954cd95332
2021-02-03 07:01:09 -08:00
bea0519b0b [WIP][DataLoader] Implement BucketBatchIterableDataset (#51126)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51126

BucketBatch:
Get a chunk of data as a bucket, sort the bucket by the specified key, then batch.
If no sort key is specified, directly use batchIterableDS.

1. Implement BucketBatch for bucket sampler
2. Improve BatchDS tests

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D26209890

Pulled By: ejguan

fbshipit-source-id: 8519e2e49da158b3fe32913c8f3cadfa6f3ff1fc
2021-02-03 07:01:05 -08:00
14ee63f7e6 [WIP][DataLoader] Implement CallableIterableDataset (#50045)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50045

Add **CallableIterableDataset**
Modify **CollateIterableDataset** as another callable

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D26209889

Pulled By: ejguan

fbshipit-source-id: d4773026c1269e43b29a3efb16e36e1865fdd024
2021-02-03 06:54:48 -08:00
c311b8961a Revert D26113953: [pytorch][PR] [ZeroRedundancyOptimizer] Elastic and pytorch compatible checkpoints
Test Plan: revert-hammer

Differential Revision:
D26113953 (bbe18e3527)

Original commit changeset: 030bfeee2c34

fbshipit-source-id: 6c1494ad01c2f96a15601329b4fce3fef4b38a01
2021-02-03 06:12:21 -08:00
75ee575671 [Usability] Handle repeated jit.script calls on function gracefully (#51545)
Summary:
Repeated calls on `class` are not handled since `class`'s compilation process will change soon in https://github.com/pytorch/pytorch/issues/44324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51545

Reviewed By: H-Huang

Differential Revision: D26207010

Pulled By: gmagogsfm

fbshipit-source-id: 5f3f64b0e4b4ab4dbf5c9411d9c143472922a106
2021-02-03 02:09:25 -08:00
7b556db69d [PyTorch Mobile] Skip inferring function schema from the C++ function type (#50457)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50457

The code to infer a function schema from a C++ function relies on templates and code expansion. This uses up valuable binary size. We can avoid inferring the schema from the C++ function type (arguments, name, return value) when the function implementation is being added to the dispatcher via `m.impl`; in this case, it is assumed that a schema is registered already. Adding an implementation via `m.def` still triggers schema inference.

In addition, we don't do schema checks on mobile, so the schema is not needed in the first place.
ghstack-source-id: 120915259

Test Plan:
Auto-unit tests succeed.

### Size test: igios

```
D25853094-V1 (https://www.internalfb.com/intern/diff/D25853094/?dest_number=119632217)

igios: Succeeded
Change in Download Size for arm64 + 3x assets variation: -21.8 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -45.5 KiB

Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:261049318687117@base/bsb:261049318687117@diff/
```

### Size test: fbios

```
D25853094-V1 (https://www.internalfb.com/intern/diff/D25853094/?dest_number=119632217)

fbios: Succeeded
Change in Download Size for arm64 + 3x assets variation: -27.2 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -80.1 KiB

Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:454289062251865@base/bsb:454289062251865@diff/
```

Reviewed By: smessmer

Differential Revision: D25853094

fbshipit-source-id: e138d9dff7561d424bfb732f3a5898466f018f60
2021-02-03 00:37:35 -08:00
62f6e55439 Fix the missing parameter in get_sha function (#51290)
Summary:
The get_sha() function didn't pass in the pytorch_root argument, so subprocess.check_output always raised an exception since pytorch_root was not defined, and thus always returned 'Unknown'.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51290

Reviewed By: soumith

Differential Revision: D26219051

Pulled By: malfet

fbshipit-source-id: fee2c4f5fdfc61983559eec1600b9accb344c527
2021-02-02 23:25:57 -08:00
ab4623da16 Document FX debugging (#51530)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51530

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D26192641

Pulled By: ansley

fbshipit-source-id: c69ab1bb2451d8ee5a729445f52bccc66e6f431b
2021-02-02 23:17:51 -08:00
f7313b3105 Fix Python.h discovery logic on some MacOS platforms (#51586)
Summary:
On all non-Windows platforms we should use the 'posix_prefix' scheme to discover the location of the Python.h header

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51586

Reviewed By: ezyang

Differential Revision: D26208684

Pulled By: malfet

fbshipit-source-id: bafa6d79de42231629960c642d535f1fcf7a427f
2021-02-02 21:38:37 -08:00
7360ce36e4 [QNNPACK:Sparsity] Add A matrix pretransformed based sparse kernels for FC (#50587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50587

This diff introduces two kernels. One pretransforms A to do block-wise
transforms.
The other kernel works directly on top of the pretransformed weights.

Test Plan:
./build/local/q8gemm-sparse-test
./build/local/fully-connected-sparse-test

Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D25925504

fbshipit-source-id: 9b02819405ce587f20e675b154895dc39ecd1bad
2021-02-02 21:33:02 -08:00
eb571b33fe [QNNPACK Sparse] Create fc sparse operator (#50586)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50586

Creates sparse operator for fully connected layer.

Test Plan:
./build/local/fully-connected-sparse-test

Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D25925503

fbshipit-source-id: 49042158ba3bf26a716a6d68258fc7ead85ce9d8
2021-02-02 21:32:58 -08:00
520f96b8c7 [QNNPACK] Block Sparse kernel. First commit. (#50585)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50585

This diff introduces a sparse kernel for SSE, using a 1x4 block-sparse
pattern.

Test Plan:
./build/local/q8gemm-sparse-test

Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D25925500

fbshipit-source-id: e112cafd3226f8c11487c139cd414fa53a58fd0d
2021-02-02 21:30:24 -08:00
444203c52f Fix torch.cdist backward CUDA error due to illegal gridDim setting (#51569)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49928

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51569

Reviewed By: mruberry

Differential Revision: D26215694

Pulled By: ngimel

fbshipit-source-id: 0710417e6a802424e2dcada325f27452c95d042f
2021-02-02 20:41:24 -08:00
b48ee75507 Fix quantization doc issue (#50187)
Summary:
There was a description error in quantization.rst; this fixes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50187

Reviewed By: mrshenli

Differential Revision: D25895294

Pulled By: soumith

fbshipit-source-id: c0b2e7ba3fadfc0977ab2d4d4e9ed4f93694cedd
2021-02-02 20:33:21 -08:00
b18eeaa80a Implement np.diff for single order differences (#50569)
Summary:
Implements `np.diff` for single order differences only:
 - method and function variants for `diff` and function variant for `diff_out`
 - supports out variant, but not in-place since shape changes
 - adds OpInfo entry, and test in `test_torch`
 - automatic autograd because we are using the `Math` dispatch

_Update: we only support Tensors for prepend and append in this PR. See discussion below and comments for more details._
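
A usage sketch of the new op (tensor-only prepend/append, per the update above):

```python
import torch

t = torch.tensor([1, 3, 6, 10])
print(torch.diff(t))                             # tensor([2, 3, 4])
print(t.diff(prepend=torch.tensor([0])))         # method variant: tensor([1, 2, 3, 4])
print(torch.diff(t, append=torch.tensor([15])))  # tensor([2, 3, 4, 5])
```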

Currently there is a quirk in the c++ API based on how this is implemented: it is not possible to specify scalar prepend and appends without also specifying all 4 arguments.

That is because the goal is to match NumPy's diff signature of `diff(int n=1, int dim=-1, Union[Scalar, Tensor] prepend=None, Union[Scalar, Tensor] append)=None` where all arguments are optional, positional and in the correct order.
There are a couple blockers. One is c++ ambiguity. This prevents us from simply doing `diff(int n=1, int dim=-1, Scalar? prepend=None, Tensor? append=None)` etc for all combinations of {Tensor, Scalar} x {Tensor, Scalar}.

Why not have append, prepend not have default args and then write out the whole power set of {Tensor, Scalar, omitted} x {Tensor, Scalar, omitted} you might ask. Aside from having to write 18 overloads, this is actually illegal because arguments with defaults must come after arguments without defaults. This would mean having to write `diff(prepend, append, n, dim)` which is not desired. Finally writing out the entire power set of all arguments n, dim, prepend, append is out of the question because that would actually involve 2 * 2 * 3 * 3 = 36 combinations. And if we include the out variant, that would be 72 overloads!

With this in mind, the current way this is implemented is actually to still do `diff(int n=1, int dim=-1, Scalar? prepend=None, Tensor? append=None)`. But also make use of `cpp_no_default_args`. The idea is to only have one of the 4 {Tensor, Scalar} x {Tensor, Scalar} provide default arguments for the c++ api, and add `cpp_no_default_args` for the remaining 3 overloads. With this, Python api works as expected, but some calls such as `diff(prepend=1)` won't work on c++ api.

We can optionally add 18 more overloads that cover the {dim, n, no-args} x {scalar-tensor, tensor-scalar, scalar-scalar} x {out, non-out} cases for c++ api. _[edit: counting is hard - just realized this number is still wrong. We should try to count the cases we do cover instead and subtract that from the total: (2 * 2 * 3 * 3) - (3 + 2^4) = 17. 3 comes from the 3 of 4 combinations of {tensor, scalar}^2 that we declare to be `cpp_no_default_args`, and the one remaining case that has default arguments has covers 2^4 cases. So actual count is 34 additional overloads to support all possible calls]_

_[edit: thanks to https://github.com/pytorch/pytorch/issues/50767 hacky_wrapper is no longer necessary; it is removed in the latest commit]_
 hacky_wrapper was also necessary here because `Tensor?` will cause dispatch to look for the `const optional<Tensor>&` schema but also generate a `const Tensor&` declaration in Functions.h. hacky_wrapper allows us to define our function as `const Tensor&` but wraps it in optional for us, so this avoids both the errors while linking and loading.

_[edit: rewrote the above to improve clarity and correct the fact that we actually need 18 more overloads (26 total), not 18 in total to complete the c++ api]_
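
As a quick usage sketch of the resulting Python API (input values chosen for illustration):

```python
import torch

x = torch.tensor([1, 3, 6, 10])

torch.diff(x)                             # tensor([2, 3, 4])
torch.diff(x, prepend=torch.tensor([0]))  # tensor([1, 2, 3, 4])
torch.diff(x, append=torch.tensor([15]))  # tensor([2, 3, 4, 5])
```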

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50569

Reviewed By: H-Huang

Differential Revision: D26176105

Pulled By: soulitzer

fbshipit-source-id: cd8e77cc2de1117c876cd71c29b312887daca33f
2021-02-02 20:25:16 -08:00
e54cbb8250 Create PyTorch DDP logging APIs for applications to use (#50637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50637

Add APIs for logging PyTorch DDP logging data in applications.

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D25933411

fbshipit-source-id: 57c248a2f002da06a386fc7406d3e5533ebb9124
2021-02-02 18:24:21 -08:00
26f9ac98e5 Revert D26105797: [pytorch][PR] Exposing linear layer to fuser
Test Plan: revert-hammer

Differential Revision:
D26105797 (e488e3c443)

Original commit changeset: 6f7cedb9f6e3

fbshipit-source-id: f0858cefed76d726e9dba61e51e1eaf2af4c99c5
2021-02-02 17:39:17 -08:00
5a402274d4 [ROCm] add 4.0.1 to nightly builds (#51257)
Summary:
Depends on https://github.com/pytorch/builder/pull/628.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51257

Reviewed By: H-Huang, seemethere

Differential Revision: D26208135

Pulled By: malfet

fbshipit-source-id: 8a4386b5661c6f71df28d98279e2771c4044f06c
2021-02-02 16:52:38 -08:00
b283ac6da4 "whitelist" -> "allowlist" (#51375)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51375

Test Plan: Sandcastle tests

Reviewed By: iseeyuan

Differential Revision: D26150609

fbshipit-source-id: 1ca17bc8943598a42f028005d1f6d3f362fe2659
2021-02-02 16:20:34 -08:00
c791a30484 Fix warnings in "ForeachOpsKernels" with c10::irange (#50783)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50783

Compiling currently shows:
```
Jan 13 16:46:28 In file included from ../aten/src/ATen/native/ForeachOpsKernels.cpp:2:
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachUtils.h:28:21: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28   for (int i = 0; i < tensors1.size(); i++) {
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachUtils.h:44:21: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28   for (int i = 0; i < tensors1.size(); i++) {
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachUtils.h:149:25: warning: comparison of integers of different signs: 'int64_t' (aka 'long long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28   for (int64_t i = 0; i < tensors1.size(); i++) {
Jan 13 16:46:28                       ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachUtils.h:164:25: warning: comparison of integers of different signs: 'int64_t' (aka 'long long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28   for (int64_t i = 0; i < tensors1.size(); i++) {
Jan 13 16:46:28                       ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachUtils.h:183:25: warning: comparison of integers of different signs: 'int64_t' (aka 'long long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28   for (int64_t i = 0; i < tensors1.size(); i++) {
Jan 13 16:46:28                       ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachUtils.h:198:25: warning: comparison of integers of different signs: 'int64_t' (aka 'long long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28   for (int64_t i = 0; i < tensors1.size(); i++) {
Jan 13 16:46:28                       ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:150:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_LIST_ALPHA(add);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:74:21: note: expanded from macro 'FOREACH_BINARY_OP_LIST_ALPHA'
Jan 13 16:46:28   for (int i = 0; i < tensors1.size(); i++) {                                                                           \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:150:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_LIST_ALPHA(add);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:84:21: note: expanded from macro 'FOREACH_BINARY_OP_LIST_ALPHA'
Jan 13 16:46:28   for (int i = 0; i < tensors1.size(); i++) {                                                                           \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:151:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_LIST_ALPHA(sub);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:74:21: note: expanded from macro 'FOREACH_BINARY_OP_LIST_ALPHA'
Jan 13 16:46:28   for (int i = 0; i < tensors1.size(); i++) {                                                                           \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:151:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_LIST_ALPHA(sub);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:84:21: note: expanded from macro 'FOREACH_BINARY_OP_LIST_ALPHA'
Jan 13 16:46:28   for (int i = 0; i < tensors1.size(); i++) {                                                                           \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:158:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_SCALARLIST(add);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:31:21: note: expanded from macro 'FOREACH_BINARY_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < tensors.size(); i++) {                                                                            \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:158:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_SCALARLIST(add);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:40:21: note: expanded from macro 'FOREACH_BINARY_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < tensors.size(); i++) {                                                                            \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:159:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_SCALARLIST(sub);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:31:21: note: expanded from macro 'FOREACH_BINARY_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < tensors.size(); i++) {                                                                            \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:159:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_SCALARLIST(sub);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:40:21: note: expanded from macro 'FOREACH_BINARY_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < tensors.size(); i++) {                                                                            \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:160:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_SCALARLIST(mul);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:31:21: note: expanded from macro 'FOREACH_BINARY_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < tensors.size(); i++) {                                                                            \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:160:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_SCALARLIST(mul);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:40:21: note: expanded from macro 'FOREACH_BINARY_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < tensors.size(); i++) {                                                                            \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:161:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_SCALARLIST(div);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:31:21: note: expanded from macro 'FOREACH_BINARY_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < tensors.size(); i++) {                                                                            \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:161:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_SCALARLIST(div);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:40:21: note: expanded from macro 'FOREACH_BINARY_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < tensors.size(); i++) {                                                                            \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:163:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_LIST(mul);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:53:21: note: expanded from macro 'FOREACH_BINARY_OP_LIST'
Jan 13 16:46:28   for (int i = 0; i < tensors1.size(); i++) {                                                             \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:163:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_LIST(mul);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:63:21: note: expanded from macro 'FOREACH_BINARY_OP_LIST'
Jan 13 16:46:28   for (int i = 0; i < tensors1.size(); i++) {                                                             \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:164:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_LIST(div);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:53:21: note: expanded from macro 'FOREACH_BINARY_OP_LIST'
Jan 13 16:46:28   for (int i = 0; i < tensors1.size(); i++) {                                                             \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:164:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_LIST(div);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:63:21: note: expanded from macro 'FOREACH_BINARY_OP_LIST'
Jan 13 16:46:28   for (int i = 0; i < tensors1.size(); i++) {                                                             \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:195:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_POINTWISE_OP_SCALAR(addcdiv);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:115:21: note: expanded from macro 'FOREACH_POINTWISE_OP_SCALAR'
Jan 13 16:46:28   for (int i = 0; i < input.size(); i++) {                                                                                           \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:195:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_POINTWISE_OP_SCALAR(addcdiv);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:125:21: note: expanded from macro 'FOREACH_POINTWISE_OP_SCALAR'
Jan 13 16:46:28   for (int i = 0; i < input.size(); i++) {                                                                                           \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:196:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_POINTWISE_OP_SCALAR(addcmul);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:115:21: note: expanded from macro 'FOREACH_POINTWISE_OP_SCALAR'
Jan 13 16:46:28   for (int i = 0; i < input.size(); i++) {                                                                                           \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:196:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_POINTWISE_OP_SCALAR(addcmul);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:125:21: note: expanded from macro 'FOREACH_POINTWISE_OP_SCALAR'
Jan 13 16:46:28   for (int i = 0; i < input.size(); i++) {                                                                                           \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:198:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_POINTWISE_OP_SCALARLIST(addcdiv);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:135:21: note: expanded from macro 'FOREACH_POINTWISE_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < input.size(); i++) {                                                                                                              \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:198:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_POINTWISE_OP_SCALARLIST(addcdiv);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:145:21: note: expanded from macro 'FOREACH_POINTWISE_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < input.size(); i++) {                                                                                                              \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:199:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_POINTWISE_OP_SCALARLIST(addcmul);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:135:21: note: expanded from macro 'FOREACH_POINTWISE_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < input.size(); i++) {                                                                                                              \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:199:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_POINTWISE_OP_SCALARLIST(addcmul);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:145:21: note: expanded from macro 'FOREACH_POINTWISE_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < input.size(); i++) {
```
This diff fixes these warnings by switching the loops to `c10::irange`.
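
For reference, a minimal sketch of the idiom (the function and loop body are invented for illustration):

```cpp
#include <vector>
#include <ATen/ATen.h>
#include <c10/util/irange.h>

void scale_all(std::vector<at::Tensor>& tensors) {
  // Before: `for (int i = 0; i < tensors.size(); i++)` compares a signed int
  // against an unsigned size_t, which triggers -Wsign-compare.
  // After: the index takes the type of tensors.size(), so the comparison is
  // sign-consistent and the warning goes away.
  for (const auto i : c10::irange(tensors.size())) {
    tensors[i].mul_(2);
  }
}
```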

Test Plan: Sandcastle tests

Reviewed By: xush6528

Differential Revision: D25935046

fbshipit-source-id: 9a042367410b3c1ffd27d9f957a623f1bae07d20
2021-02-02 16:13:03 -08:00
e488e3c443 Exposing linear layer to fuser (#50856)
Summary:
1. Enable linear in autodiff;
2. Remove control flow in Python for linear;

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50856

Reviewed By: pbelevich

Differential Revision: D26105797

Pulled By: eellison

fbshipit-source-id: 6f7cedb9f6e3e46daa24223d2a6080880498deb4
2021-02-02 15:39:01 -08:00
5499e839f1 [Fuser] Do not attempt to use OpenMP if build without OpenMP support (#51504)
Summary:
Clang from Xcode does not support the `-fopenmp` option, so there is no need to try to compile with it.
Infer whether OpenMP is supported by checking the _OPENMP define.
Also, use the clang compiler if the host app was compiled with clang rather than gcc.
Fix a few range-loop warnings and add static_asserts that range-loop variables are raw pointers.

This change makes fuser tests on OS X a bit faster.

Before:
```
% python3 test_jit.py -v  TestScript.test_batchnorm_fuser_cpu
Fail to import hypothesis in common_utils, tests are not derandomized
CUDA not available, skipping tests
test_batchnorm_fuser_cpu (__main__.TestScript) ... clang: error: unsupported option '-fopenmp'
clang: error: unsupported option '-fopenmp'
warning: pytorch jit fuser failed to compile with openmp, trying without it...
ok

----------------------------------------------------------------------
Ran 1 test in 0.468s

OK
```

After:
```
% python3 test_jit.py -v  TestScript.test_batchnorm_fuser_cpu
Fail to import hypothesis in common_utils, tests are not derandomized
CUDA not available, skipping tests
test_batchnorm_fuser_cpu (__main__.TestScript) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.435s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51504

Reviewed By: smessmer

Differential Revision: D26186875

Pulled By: malfet

fbshipit-source-id: 930b3bcf543fdfad0f493d687072aaaf5f9e2bfc
2021-02-02 15:31:59 -08:00
38eb836387 [complex] Enable complex autograd and jit tests for trace (#51537)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50381

Now that `index_fill_` supports complex, we can enable complex support for `trace`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51537

Reviewed By: H-Huang

Differential Revision: D26198904

Pulled By: anjali411

fbshipit-source-id: d62bb02549919fe35b0bac44f77af964ebd0e92e
2021-02-02 15:24:38 -08:00
209e27eaff [FX] Add note about more use cases of FX (#51576)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51576

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D26203610

Pulled By: jamesr66a

fbshipit-source-id: d33a3e7e0f3a959349ed0e29a1aba0592022606d
2021-02-02 14:57:48 -08:00
37f1412965 [Pytorch Mobile] Preserved all functions generated by bundled inputs (#51496)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51496

A previous change added the possibility of more functions being generated when bundled inputs are attached. We want to preserve those here in optimize_for_mobile.
ghstack-source-id: 120862718

Test Plan:
Created a dummy model, augmented several methods with bundled inputs, called optimize_for_mobile, and verified the functions are still there.

Discovered a weird interaction between freeze_module and bundled inputs. If the user does something like
   inputs =[<inputs>]
   augment_many_model_functions_with_bundled_inputs(
             model,
             inputs={
                 model.forward : inputs,
                 model.foo : inputs,
             }
  )
to attach their bundled inputs, freeze_module within optimize_for_mobile will error out. Instead the user would need to do something like
   inputs =[<inputs>]
   inputs2 =[<inputs>]  # Nominally the same as the inputs above
   augment_many_model_functions_with_bundled_inputs(
             model,
             inputs={
                 model.forward : inputs,
                 model.foo : inputs2,
             }
  )

Reviewed By: dhruvbird

Differential Revision: D26005708

fbshipit-source-id: 3e908c0f7092a57da9039fbc395aee6bf9dd2b20
2021-02-02 14:57:44 -08:00
cce84b5ca5 [WIP] Update foreach APIs to use scalar lists (#48223)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48223

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D25074763

Pulled By: izdeby

fbshipit-source-id: 155e3d2073a20d16bdbe358820170bf53f93c7a5
2021-02-02 14:54:28 -08:00
506fdf9abf [ROCm] disable tests for ROCm 4.0.1 (#51510)
Summary:
These tests are failing for the ROCm 4.0/4.0.1 release. Disable the tests until they are fixed.

- TestCuda.test_cudnn_multiple_threads_same_device
- TestCudaFuser.test_reduction

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51510

Reviewed By: H-Huang

Differential Revision: D26205179

Pulled By: seemethere

fbshipit-source-id: 0c3d29989d711deab8b5046b458c772a1543d8ed
2021-02-02 14:39:08 -08:00
bbe18e3527 [ZeroRedundancyOptimizer] Elastic and pytorch compatible checkpoints (#50956)
Summary:
- Makes it possible to use non-sharded optimizer checkpoints (as long as the model/param groups are the same, of course)
- Makes it possible to save with a given world size, and load with another world size
- Use the Torch Distributed built-in broadcast object list instead of an ad-hoc version

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50956

Reviewed By: malfet

Differential Revision: D26113953

Pulled By: blefaudeux

fbshipit-source-id: 030bfeee2c34c2d987590d45dc8efe05515f2e5c
2021-02-02 14:32:13 -08:00
a990ff7001 [SobolEngine] Fix edge case of dtype of first sample (#51578)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51578

https://github.com/pytorch/pytorch/pull/49710 introduced an edge case in which
drawing a single sample resulted in ignoring the `dtype` arg to `draw`. This
fixes this and adds a unit test to cover this behavior.
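
A minimal sketch of the fixed behavior (dimension and dtype chosen for illustration):

```python
import torch

engine = torch.quasirandom.SobolEngine(dimension=3)
sample = engine.draw(1, dtype=torch.float64)
assert sample.dtype == torch.float64  # previously ignored when drawing one sample
```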

Test Plan: Unit tests

Reviewed By: danielrjiang

Differential Revision: D26204393

fbshipit-source-id: 441a44dc035002e7bbe6b662bf6d1af0e2cd88f4
2021-02-02 14:24:56 -08:00
4746b3d1fb Added missing VSX dispatch for cholesky_inverse (#51562)
Summary:
It was overlooked that a VSX dispatch is also needed for the cholesky_inverse CPU dispatch.
See https://github.com/pytorch/pytorch/pull/50269#issuecomment-771688180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51562

Reviewed By: H-Huang

Differential Revision: D26199581

Pulled By: anjali411

fbshipit-source-id: 5d02c6da52ce1d2e9e26001f5d4648a71dd0e829
2021-02-02 13:45:35 -08:00
2565a33c98 [Vulkan] Remove redundant qualifiers on writeonly images. (#51425)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51425

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Differential Revision: D26179605

Pulled By: AshkanAliabadi

fbshipit-source-id: 26358cd4fd23922fed21120e120774eea0b728df
2021-02-02 13:37:59 -08:00
0402df5427 [Vulkan] Improve error handling in a few places. (#51423)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51423

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Differential Revision: D26179604

Pulled By: AshkanAliabadi

fbshipit-source-id: 2e270423bf7e960e9303b17e0ca1a1530b760ad3
2021-02-02 13:34:43 -08:00
365986cfe0 Add tensorboard_trace_handler for profiler (#50875)
Summary:
Add a tensorboard_trace_handler to output tracing files and formalize the file name for the TensorBoard plugin.
As discussed in https://github.com/pytorch/pytorch/pull/49231
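
A minimal usage sketch (the log directory and schedule values are illustrative):

```python
import torch.profiler

with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./log/worker0"),
) as prof:
    for _ in range(5):
        # ... one training step ...
        prof.step()  # tell the profiler a step boundary was reached
```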

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50875

Reviewed By: H-Huang

Differential Revision: D26098493

Pulled By: ilia-cher

fbshipit-source-id: 906ea118682f8bff412e76ca3f391bebab23b0ff
2021-02-02 13:28:00 -08:00
cde7fa6e3c update kineto submodule (#51566)
Summary:
To make the call-stack dumping feature work, I have to make PyTorch refer to the latest Kineto version.
Callstack feature's Kineto side: [Link](66a4cad380);
Callstack feature's pytorch side: [Link](https://github.com/pytorch/pytorch/pull/51565)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51566

Reviewed By: ilia-cher

Differential Revision: D26205782

Pulled By: gdankel

fbshipit-source-id: 52d835e45a87ab4630fd22ea024cb41b82c96ebc
2021-02-02 13:17:05 -08:00
a38a648cb7 Test if allocator is set only in DEBUG mode. (#51360)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51360

Invariant should be satisfied by call sites of allocator
ensuring that the device type makes sense.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: swolchok

Differential Revision: D26170202

Pulled By: ezyang

fbshipit-source-id: f23681f34187c0d3da794f7a8c869ea8da88365d
2021-02-02 12:51:15 -08:00
0ff855efea Make empty_cpu sanity test CPU only in DEBUG mode (#51358)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51358

BackendSelect is expected to enforce this invariant.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: swolchok

Differential Revision: D26149502

Pulled By: ezyang

fbshipit-source-id: f53ab66e8324b729a4057b376fe3d60b14daf2fb
2021-02-02 12:47:56 -08:00
351ee1ece7 Remove duplicate check for THPLayout in toSugaredValue (#51543)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51543

Reviewed By: Lilyjjo

Differential Revision: D26202297

Pulled By: gmagogsfm

fbshipit-source-id: f0d40c9d73b579a68e34c54b004d329fd3b76ff3
2021-02-02 12:34:29 -08:00
ec378055c3 add OneDNN linear backward (#49453)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49453

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26006889

Pulled By: VitalyFedyunin

fbshipit-source-id: 06e2a02b6e01d847395521a31fe84d844f2ee9ae
2021-02-02 12:18:59 -08:00
4fdebdc0c9 Improve PyTorch profiler flop computation formulas (#51377)
Summary:
Improve the FLOPs computation formula of the aten::conv2d operator to support the stride, pad, dilation, and groups arguments.

This diff also fixes the following issues:
- Apply a factor of 2 to aten::mm because each output element accounts for a multiplication and an addition.
- Fix the incorrect names of scalar operators, renaming them to aten::mul and aten::add.
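
For reference, a sketch of the generalized conv2d FLOP count; this is an assumed reconstruction of the formula, not the literal code in this diff:

```python
def conv2d_flops(n, c_in, h, w, c_out, kh, kw,
                 stride=1, pad=0, dilation=1, groups=1):
    # Standard conv2d output spatial dimensions.
    h_out = (h + 2 * pad - dilation * (kh - 1) - 1) // stride + 1
    w_out = (w + 2 * pad - dilation * (kw - 1) - 1) // stride + 1
    # Factor of 2: each output element needs a multiply and an add.
    return 2 * n * c_out * h_out * w_out * (c_in // groups) * kh * kw
```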

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51377

Test Plan:
```python
python test/test_profiler.py
```

Reviewed By: jspark1105

Differential Revision: D26165223

Pulled By: xuzhao9

fbshipit-source-id: 2c5f0155c47af2e6a19332fd6ed73ace47fa072a
2021-02-02 11:49:04 -08:00
55a4aa79aa [package] patch inspect.getfile to work with PackageImporter (#51568)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51568

The default behavior of inspect.getfile doesn't work on classes imported
from PackageImporter, because it returns the following.

    sys.modules[kls.__module__].__file__

Looking in `sys.modules` is hard-coded behavior. So, patch it to first
check a similar registry of PackageImported modules we maintain.
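
A rough sketch of the patching approach; the registry name and lookup below are assumptions for illustration, not the actual implementation:

```python
import inspect

_orig_getfile = inspect.getfile
_package_registry = {}  # hypothetical: module name -> module from PackageImporter

def _patched_getfile(obj):
    # Consult the PackageImporter registry before falling back to sys.modules.
    if inspect.isclass(obj) and obj.__module__ in _package_registry:
        return _package_registry[obj.__module__].__file__
    return _orig_getfile(obj)

inspect.getfile = _patched_getfile
```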

Test Plan: Imported from OSS

Reviewed By: yf225

Differential Revision: D26201236

Pulled By: suo

fbshipit-source-id: aaf5d7ee8ca0155619c8185e64f70a30152ac567
2021-02-02 11:29:29 -08:00
b6c6fb7252 fix windows 11.1 test2 by disabling test (#51573)
Summary:
`TestStandaloneCPPJIT.test_load_standalone` fails with the split torch_cuda build, but the error seems unrelated (cannot find `nvToolsExt64_1.dll`). Temporarily disabling while I investigate why that dependency is even there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51573

Reviewed By: malfet, H-Huang

Differential Revision: D26203084

Pulled By: janeyx99

fbshipit-source-id: 373aeae8165506384e433bc256b80eea4a7a5048
2021-02-02 11:01:26 -08:00
751c30038f [JIT] Properly convert Python strings implictly to device (#51340)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51340

**Summary**
`toIValue` assumes that any value passed for an argument of type
`torch.device` is a valid device object, even when it is not. This can
lead to device type arguments of functions being assigned incorrect
values (see #51098).

This commit adds an explicit check, using `THPDevice_Check`, that the passed-in
object is indeed a `torch.device`, and only then converts it to an `IValue`.
Since implicit conversion from strings to devices is generally allowed, if
`THPDevice_Check` fails the object is assumed to be a string, and an `IValue`
holding a `c10::Device` constructed from that string is returned.
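
A sketch of the resulting behavior, assuming a scripted function with a device-typed argument:

```python
import torch

@torch.jit.script
def to_device(x: torch.Tensor, device: torch.device) -> torch.Tensor:
    return x.to(device)

to_device(torch.ones(2), torch.device("cpu"))  # a real device object
to_device(torch.ones(2), "cpu")                # implicit str -> device still works
to_device(torch.ones(2), "not-a-device")       # now raises instead of being accepted
```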

**Test Plan**
This commit adds a unit test to `test_jit.py` to test that invalid
strings passed as devices are no longer silently accepted.

**Fixes**
This commit fixes #51098.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26187190

Pulled By: SplitInfinity

fbshipit-source-id: 48c990203431da30f9f09381cbec8218d763325b
2021-02-02 10:57:56 -08:00
74ec9e7ccf compare_model_outputs_fx API implementation (#49266)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49266

compare_model_outputs_fx API implementation
ghstack-source-id: 120828880

Test Plan:
buck test mode/dev caffe2/test:quantization_fx -- 'test_compare_model_outputs_linear_static_fx'
buck test mode/dev caffe2/test:quantization_fx -- 'test_compare_model_outputs_conv_static_fx'
buck test mode/dev caffe2/test:quantization_fx -- 'test_compare_model_stub_linear_static_fx'
buck test mode/dev caffe2/test:quantization_fx -- 'test_compare_model_stub_conv_static_fx'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_conv_static'

Reviewed By: vkuzo

Differential Revision: D25507933

fbshipit-source-id: 1b502b5eadb0fafbe9e8c2e843410bca03c63fd6
2021-02-02 10:43:25 -08:00
0118dec2e3 [Pytorch] Expanded Bundled Inputs To Any Public Function (#51153)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51153

Enabled bundled inputs for all public functions that the user wants in a TorchScript module. An important caveat here is that you can't add bundled inputs to functions that were in the nn.Module but weren't caught in the scripting/tracing process that brought the model to TorchScript.

The old API is exactly the same: it still only works on forward, with the same return types, etc.

-----------New API-------------

Attachment of inputs:

***augment_model_with_bundled_inputs*** : works the same as before, but adds the option to specify an info dictionary.

***augment_many_model_functions_with_bundled_inputs*** : Similar to the above function but allows the user to specify a Dict[Callable, List[<inputs>]] (mapping function references to the bundled inputs for that function) to attach bundled inputs to many functions

Consumption of inputs:

***get_all_bundled_inputs_for_<function_name>()*** : Works exactly like get_all_bundled_inputs does, but can be used for functions other than forward if you know ahead of time what they are called, and if they have bundled inputs.

***get_bundled_inputs_functions_and_info()*** : This is easily the hackiest function. Returns a Dict['str', 'str'] mapping function_names to get_all_bundled_inputs_for_<function_name>. A user can then execute the functions specified in the values with something like
    all_info = model.get_bundled_inputs_functions_and_info()
    for func_name in all_info.keys():
        input_func_name = all_info[func_name]['get_inputs_function_name'][0]
        func_to_run = getattr(loaded, input_func_name)
The reason it's done this way is that TorchScript doesn't support the 'Any' type yet, meaning I can't return the bundled inputs directly because they could be different types for each function. TorchScript also doesn't support callables, so I can't return a function reference directly either.
ghstack-source-id: 120768561

Test Plan:
Got a model into TorchScript using the available methods that I'm aware of (tracing, scripting, old scripting method). Not really sure how tracing brings in functions that aren't in the forward call path, though. Attached bundled inputs and info to them successfully. Changes to TorchTest.py on all but the last version of this diff (where it will be/is removed for land) illustrate what I did to test.

Created and ran unit test

Reviewed By: dreiss

Differential Revision: D25931961

fbshipit-source-id: 36e87c9a585554a83a932e4dcf07d1f91a32f046
2021-02-02 10:33:59 -08:00
6465793011 Fix Dirichlet.arg_constraints event_dim (#51369)
Summary:
This fix ensures
```py
Dirichlet.arg_constraints["concentration"].event_dim == 1
```
which was missed in https://github.com/pytorch/pytorch/issues/50547

## Tested
- [x] added a regression test, covering all distributions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51369

Reviewed By: H-Huang

Differential Revision: D26160644

Pulled By: neerajprad

fbshipit-source-id: 1bb44c79480a1f0052b0ef9d4605e750ab07bea1
2021-02-02 10:26:45 -08:00
a5b65ae40a Fix small typo (#51542)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51541

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51542

Reviewed By: albanD

Differential Revision: D26199174

Pulled By: H-Huang

fbshipit-source-id: 919fc4a70d901916eae123672d010e9eb8e8b977
2021-02-02 10:14:17 -08:00
8f0968f899 Fix: Bad autograd side effects from printing (#51364)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49756

## Background
The fix applied here is to remove the grad-enabled check from `collect_next_edges`, unconditionally returning the actual collected edges. This pushes the responsibility for determining whether the function should be called without grad mode to its call sites. With this update, `collect_next_edges` will no longer incorrectly return an empty list, which caused the problem described in the issue. Three call sites depended on this behavior and have been updated.

Beyond bad printing side effects, this fix addresses the more general issue of accessing `grad_fn` with grad mode disabled after an in-place operation on a view. The included test verifies this without the use of print.
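
A minimal repro sketch in the spirit of the new test (tensor shapes are illustrative):

```python
import torch

base = torch.randn(2, 2, requires_grad=True).clone()
view = base[0]
view.mul_(2)  # in-place op on a view; its grad_fn is rebased lazily

with torch.no_grad():
    fn = view.grad_fn  # previously this could leave autograd state incorrect
```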

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51364

Test Plan:
```
python test/test_autograd.py TestAutogradDeviceTypeCPU.test_inplace_view_then_no_grad_cpu
```

Reviewed By: zou3519

Differential Revision: D26190451

Pulled By: jbschlosser

fbshipit-source-id: 9b004a393463f8bd4ac0690e5e53c07a609f87f0
2021-02-02 09:30:27 -08:00
c39fb9771d [complex] Enable complex autograd tests for diag (#51268)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/33152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51268

Reviewed By: pbelevich

Differential Revision: D26179236

Pulled By: anjali411

fbshipit-source-id: e9756136eaaced5a8692228a158965f77505e7b9
2021-02-02 09:10:28 -08:00
43084d7aab add type annotations to conv_fused/blas_compare/blas_compare_setup (#51235)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51234

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51235

Reviewed By: malfet

Differential Revision: D26147184

Pulled By: walterddr

fbshipit-source-id: 1ca1a1260785c8b7f4c3c24d7763ccbdaa0bfefb
2021-02-02 08:50:49 -08:00
c6f37e50f2 [doc] Add deprecation message to torch.slogdet in favor of torch.linalg.slogdet (#51354)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51354

Re-created from https://github.com/pytorch/pytorch/pull/51301 because of issues with ghstack.

This PR is part of a larger effort to ensure torch.linalg documentation is consistent (see #50287).

Updated torch.slogdet documentation to add a deprecation message in favor of torch.linalg.slogdet.

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D26148679

Pulled By: heitorschueroff

fbshipit-source-id: 4d9f3386d9ba6dc735a4d1e86cfcd88eaba3cbcc
2021-02-02 07:58:01 -08:00
1caed167fb [doc] Fix linalg.slogdet doc consistency issues (#51353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51353

Re-created from https://github.com/pytorch/pytorch/pull/51300 because of issues with ghstack.

This PR is part of a larger effort to ensure torch.linalg documentation is consistent (see https://github.com/pytorch/pytorch/issues/50287).

Updated torch.linalg.slogdet to include notes about cross-device synchronization, backend routines used and fix signature missing out argument.

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D26148678

Pulled By: heitorschueroff

fbshipit-source-id: 40f6340226ecb72e4ec347c5606012f31f5877fb
2021-02-02 07:54:29 -08:00
c0d58bce0d move Tar Dataset to Tar DataPipe (#51398)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51398

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26162319

Pulled By: glaringlee

fbshipit-source-id: a84879fe4ca044e34238d5e1d31a245d4b80ae8e
2021-02-02 07:46:53 -08:00
a07a37e4fb reenable BUILD_SPLIT_CUDA for windows and fixes Linux 11_1 tests (#51538)
Summary:
Reenabling split build for windows and also fixes linux 11_1 tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51538

Reviewed By: lw

Differential Revision: D26198269

Pulled By: janeyx99

fbshipit-source-id: 363b2eed6631d75592120834d1543b438cfd2d8f
2021-02-02 05:38:21 -08:00
4f37150f40 Revert D26179083: [TensorExpr] Introduce ExternalCall nodes to TE IR.
Test Plan: revert-hammer

Differential Revision:
D26179083 (f4fc3e3920)

Original commit changeset: 9e44de098ae9

fbshipit-source-id: d15684e04c65c395b4102d4f98a4488482822d1b
2021-02-02 05:29:41 -08:00
41e4c55379 Correct subgraph rewriter pattern containment rules (#51529)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51529

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D26192470

Pulled By: ansley

fbshipit-source-id: 6e44f7df1e245835365ec868ae9cc539ecc873f2
2021-02-02 05:13:03 -08:00
8bb0dff7e2 Write FX Subgraph Rewriter tutorial (#51531)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51531

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D26192992

Pulled By: ansley

fbshipit-source-id: 769901b418d4580cdf8aed2451dd8ef3d8ddf0d1
2021-02-02 05:02:51 -08:00
5c5db25cd5 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D26195387

fbshipit-source-id: 009860c4237048125e31e8abea44e8222e13715c
2021-02-02 04:54:15 -08:00
79e7544cb4 [Gradient Compression] Check start_PowerSGD_iter > 1 and add guidance on tuning PowerSGD configs. (#51427)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51427

A user reported that `start_PowerSGD_iter` failed when it's set as 1. This is because allocating memory for error tensors somehow overlaps with the bucket-rebuilding process at iteration 1.

Check `start_PowerSGD_iter > 1` instead of `start_PowerSGD_iter >= 1`.

Also add a unit test of `test_invalid_powerSGD_state` and some guidance on tuning PowerSGD configs.
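
For illustration, a valid configuration sketch (`ddp_model` is assumed to be a DistributedDataParallel instance):

```python
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

state = powerSGD.PowerSGDState(
    process_group=None,          # use the default process group
    matrix_approximation_rank=1,
    start_powerSGD_iter=2,       # must be > 1, per the check added here
)
ddp_model.register_comm_hook(state, powerSGD.powerSGD_hook)
```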

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120834126

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_invalid_powerSGD_state

Reviewed By: rohan-varma

Differential Revision: D26166897

fbshipit-source-id: 34d5b64bb3dd43acb61d792626c70e6c8bb44a5d
2021-02-02 04:30:24 -08:00
d555768e8f [FX] Added invert example (#51478)
Summary:
Added an inverse example.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51478

Reviewed By: pbelevich

Differential Revision: D26190544

Pulled By: Chillee

fbshipit-source-id: 4324ea8b917557f4c49f3b9aecd35c4e9ab36bf3
2021-02-02 02:38:22 -08:00
96a22123f4 Automated submodule update: tensorpipe (#51469)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51469

This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: b4098ad5de

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51346

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: mrshenli

Differential Revision: D26177172

Pulled By: lw

fbshipit-source-id: 4ec508ce78cf521b11fed52ffdfc6f788ca6a6d0
2021-02-02 01:13:38 -08:00
f4fc3e3920 [TensorExpr] Introduce ExternalCall nodes to TE IR. (#51475)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51475

ExternalCall nodes represent opaque calls to external functions to fill a
tensor (buffer) with values. They can be used to include nodes that are
otherwise not representable as TE, or whose TE representation is currently too
slow.

To make an external function available in NNC as ExternalCall, one needs to
implement a "bridge" function that would take raw (void*) pointers to the data
along with the arrays containing dimension info. This function would then
internally call the desired external function and make sure the results of the
call are correctly placed in the provided raw data buffers.
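
A hypothetical bridge sketch; the name and parameter list are assumed for illustration, not the exact interface introduced in this diff:

```cpp
#include <cstdint>

// Fills the output buffer (index 0) by running an external op on the inputs,
// given raw data pointers plus per-buffer rank/dimension metadata.
void nnc_external_relu(
    int64_t bufs_num,    // number of buffers, output first
    void** buf_data,     // raw data pointer for each buffer
    int64_t* buf_ranks,  // rank of each buffer
    int64_t* buf_dims,   // dimensions of all buffers, concatenated
    int64_t args_num,    // number of extra scalar arguments
    int64_t* extra_args) {
  // ... wrap the raw pointers as tensors, call the external function,
  // and write the result into buf_data[0] ...
}
```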

Test Plan: Imported from OSS

Reviewed By: pbelevich, Chillee

Differential Revision: D26179083

Pulled By: ZolotukhinM

fbshipit-source-id: 9e44de098ae94d25772cf5e2659d539fa6f3f659
2021-02-02 00:50:46 -08:00
b106250047 Introduced AliasInfo for OpInfo (#50368)
Summary:
Introduced AliasInfo for OpInfo.

Context: Split of https://github.com/pytorch/pytorch/issues/49158

cc mruberry , please let me know if you'd like to see here more code to cover

> [ ] fold test_op_aliases.py into OpInfo-based testing in test_ops.py

from https://github.com/pytorch/pytorch/issues/50006

and/or add `UnaryUfuncInfo('abs')` as discussed https://github.com/pytorch/pytorch/pull/49158/files#r548774221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50368

Reviewed By: ngimel

Differential Revision: D26177261

Pulled By: mruberry

fbshipit-source-id: 2e3884a387e8d5365fe05945375f0a9d1b5f5d82
2021-02-02 00:10:09 -08:00
7328710cbc [PyTorch][codemod] Replace immediately-dereferenced cast calls w/castRaw (#50229)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50229

`fastmod -m 'cast(<((at|c10)::)?\w+Type>\(\)\s*)->' 'castRaw${1}->'` Presuming it builds, this is a safe change: the
result of `cast()` wasn't being saved anywhere, so we didn't need
it, so we can use a raw pointer instead of a new `shared_ptr`.
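
A before/after sketch of the rewritten pattern (the call site here is invented for illustration):

```cpp
#include <torch/csrc/jit/ir/ir.h>

using namespace torch::jit;

c10::optional<at::ScalarType> scalarTypeOf(Value* value) {
  // Before: cast() materializes a shared_ptr that is immediately dropped.
  //   return value->type()->cast<TensorType>()->scalarType();
  // After: castRaw() hands back a raw pointer, skipping the refcount churn.
  return value->type()->castRaw<TensorType>()->scalarType();
}
```
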
ghstack-source-id: 120769170

Test Plan: CI

Reviewed By: SplitInfinity

Differential Revision: D25837494

fbshipit-source-id: 46319100dc0dfc78f6d2b45148207f83481f2ada
2021-02-01 23:12:07 -08:00
f0006315a9 Add support for complex valued keys for dict in TS (#51472)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51472

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D26177963

Pulled By: anjali411

fbshipit-source-id: 5841159c36b07290b1d88d4df27a0bf8c17d9df8
2021-02-01 22:40:01 -08:00
9c474c97b7 Disable BUILD_SPLIT_CUDA for now (#51533)
Summary:
BUILD_SPLIT_CUDA needs to be exported as true in order for cpp_extensions.py to work properly; disabling it for now to keep the tree green.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51533

Reviewed By: pbelevich

Differential Revision: D26194055

Pulled By: janeyx99

fbshipit-source-id: 08d3cc7e6ba57011dddbf27f96ef5acb648b6b9a
2021-02-01 22:25:24 -08:00
c354888e5d compare_model_stub_fx API implementation (#48951)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48951

compare_model_stub_fx API implementation
ghstack-source-id: 120817825

Test Plan:
buck test mode/dev caffe2/test:quantization_fx -- 'test_compare_model_stub_conv_static_fx'
buck test mode/dev caffe2/test:quantization_fx -- 'test_compare_model_stub_linear_static_fx'

Reviewed By: vkuzo

Differential Revision: D25379000

fbshipit-source-id: f1321d37b60b56b202e7d227e370ce13addb10cc
2021-02-01 22:16:14 -08:00
d02ea9a141 [ROCm] add hipMAGMA support (#51238)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48831.

- CI image is updated to build hipMAGMA from source and set env MAGMA_HOME.
- CMake is updated to separate different requirements for CUDA versus ROCm MAGMA.
- Some unit tests that become enabled with MAGMA are currently skipped for ROCm due to failures.  Fixing these failures will be follow-on work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51238

Reviewed By: ngimel

Differential Revision: D26184918

Pulled By: malfet

fbshipit-source-id: ada632f1ae7b413e8cae6543fe931dcd46985821
2021-02-01 22:09:33 -08:00
5e09ec6518 Fixed SVD ignoring "some/full_matrices" flag for empty inputs (#51109)
Summary:
For empty inputs `torch.svd` (and `torch.linalg.svd`) was returning incorrect results for `some=True` (`full_matrices=False`).
Behaviour on master branch:
```python
In [1]: import torch
In [2]: a = torch.randn(0, 7)
In [3]: a.svd()
Out[3]:
torch.return_types.svd(
U=tensor([], size=(0, 0)),
S=tensor([]),
V=tensor([[0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.]]))
In [4]: a.svd(some=False)
Out[4]:
torch.return_types.svd(
U=tensor([], size=(0, 0)),
S=tensor([]),
V=tensor([[0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.]]))
```
The `some` flag is ignored and a 7x7 `V` matrix is returned in both cases. `V` should have shape 7x0 when `some=True`.

This PR fixes that.
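
The fixed behavior, sketched with the same input as above:

```python
import torch

a = torch.randn(0, 7)
U, S, V = a.svd(some=True)
print(V.shape)  # torch.Size([7, 0]) with this fix, instead of a 7x7 zero matrix
```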

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51109

Reviewed By: ngimel

Differential Revision: D26170897

Pulled By: mruberry

fbshipit-source-id: 664c09ca27bb375fabef2a046d0a09ca57b01aac
2021-02-01 21:51:58 -08:00
4b65a27a35 [testing] Add OpInfo for round and logit (#51272)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50006

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51272

Reviewed By: ngimel

Differential Revision: D26177020

Pulled By: mruberry

fbshipit-source-id: 4728b14c7a42980c7ca231ca1946430e0e38ed5b
2021-02-01 21:15:40 -08:00
205c971431 [PyTorch] Remove always-empty string args to inferFunctionSchemaFromFunctor (#51307)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51307

Using -ftime-trace shows that this roughly halves the time spent compiling/optimizing this function in an optimized build of RegisterCPU.cpp (savings of about 1 second).
ghstack-source-id: 120697493

Test Plan: manual build with -ftime-trace as above; sketch of directions at https://fb.workplace.com/groups/894363187646754/permalink/1153321361750934/ , except that I extracted a compiler invocation for RegisterCPU.cpp by injecting a syntax error and running buck build with -v 3 so that I could rebuild and measure just the one file quickly.

Reviewed By: ezyang

Differential Revision: D26135978

fbshipit-source-id: 756499fbcc8d3b169bae5a463f63caecb79f7fcd
2021-02-01 19:21:17 -08:00
1416fb9877 [PyTorch] IWYU in torch/csrc/utils/future.h (#51293)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51293

It looks like this header did not need ivalue.h at all.
ghstack-source-id: 120697488

Test Plan: CI to ensure correctness

Reviewed By: ezyang

Differential Revision: D26128288

fbshipit-source-id: a24a7e49b9d623fb182bdfaf286972739497e770
2021-02-01 19:18:12 -08:00
a1c5eba4bd [FX] Move some heavily used passes out of experimental (#51392)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51392

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D26161172

Pulled By: jamesr66a

fbshipit-source-id: 04bfe606555bdf1988f527231d4de2e0196e6b37
2021-02-01 19:02:26 -08:00
a3353d1ec0 [FX] Support ellipsis as arg (#51502)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51502

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D26186578

Pulled By: jamesr66a

fbshipit-source-id: 91943af38412bafc1766398dfaebdf50b64ccd74
2021-02-01 18:54:14 -08:00
88af2149e1 Add build option to split torch_cuda library into torch_cuda_cu and torch_cuda_cpp (#49050)
Summary:
Because of the size of our `libtorch_cuda.so`, linking with other hefty binaries presents a problem where 32-bit relocation markers are too small and end up overflowing. This PR attempts to break up `torch_cuda` into `torch_cuda_cu` and `torch_cuda_cpp`.

`torch_cuda_cu`: all the files previously in `Caffe2_GPU_SRCS` that are
* pure `.cu` files in `aten`
* all the BLAS files
* all the THC files, except for THCAllocator.cpp, THCCachingHostAllocator.cpp and THCGeneral.cpp
* all files in `detail`
* LegacyDefinitions.cpp and LegacyTHFunctionsCUDA.cpp
* Register*CUDA.cpp
* CUDAHooks.cpp
* CUDASolver.cpp
* TensorShapeCUDA.cpp

`torch_cuda_cpp`: all other files in `Caffe2_GPU_SRCS`

Accordingly, TORCH_CUDA_API and TORCH_CUDA_BUILD_MAIN_LIB usages are getting split as well to TORCH_CUDA_CU_API and TORCH_CUDA_CPP_API.

To test this locally, you can run `export BUILD_SPLIT_CUDA=ON && python setup.py develop`. In your `build/lib` folder, you should find binaries for both `torch_cuda_cpp` and `torch_cuda_cu`. To see that the SPLIT_CUDA option was toggled, you can grep the Summary of running cmake and make sure `Split CUDA` is ON.

This build option is tested on CI for CUDA 11.1 builds (linux for now, but windows soon).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49050

Reviewed By: walterddr

Differential Revision: D26114310

Pulled By: janeyx99

fbshipit-source-id: 0180f2519abb5a9cdde16a6fb7dd3171cff687a6
2021-02-01 18:42:35 -08:00
87ad77eb4e T66557700 Support default argument values of a method (#48863)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48863

Support default arguments when invoking a module via PyTorch Lite (`mobile::Module`).

Test Plan:
buck test mode/dbg //caffe2/test/cpp/jit:jit -- LiteInterpreterTest.MethodInvocation

buck test mode/dbg caffe2/test:mobile -- test_method_calls_with_optional_arg

Reviewed By: iseeyuan

Differential Revision: D25896212

fbshipit-source-id: 6d7e7fd5f3244a88bd44889024d81ad2e678ffa5
2021-02-01 18:35:13 -08:00
ec3aae8cdb [JIT] Enable saving modules with hooks in FBCODE (#51241)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51241

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D26111488

Pulled By: Lilyjjo

fbshipit-source-id: 3315068ac9adef8aa23670a4a5f86c5a54fdd1f7
2021-02-01 17:01:44 -08:00
630ee57bc2 [PyTorch] Provide overload of torchCheckFail taking const char* (#51389)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51389

This should reduce code size when STRIP_ERROR_MESSAGES is defined by allowing callers of TORCH_CHECK to avoid creating `std::string`s.
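
A sketch of the overload pair; the exact signatures and namespace are assumed from the summary, for illustration:

```cpp
#include <cstdint>
#include <string>

namespace c10 { namespace detail {

// Existing overload: every TORCH_CHECK call site must build a std::string.
[[noreturn]] void torchCheckFail(
    const char* func, const char* file, uint32_t line, const std::string& msg);

// New overload: string-literal messages pass through without allocation,
// shrinking call sites when STRIP_ERROR_MESSAGES is defined.
[[noreturn]] void torchCheckFail(
    const char* func, const char* file, uint32_t line, const char* msg);

}} // namespace c10::detail
```
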
ghstack-source-id: 120692772

Test Plan: Measure code size of STRIP_ERROR_MESSAGES builds

Reviewed By: ezyang

Differential Revision: D25891476

fbshipit-source-id: 34eef5af7464da6534989443859e2765887c243c
2021-02-01 16:48:46 -08:00
c77fc2ee06 [nnc] Vectorize bitwise ops (#51492)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51492

We missed these originally.  This helps vectorize log_fast.
ghstack-source-id: 120783427

Test Plan:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
```

This might have made bench_approx faster but it could be noise.

Before:
```
----------------------------------------------------------------------------
Benchmark                     Time           CPU Iterations UserCounters...
----------------------------------------------------------------------------
log_nnc_fast/64             108 ns        108 ns    5576102 log/s=590.91M/s
log_nnc_fast/512            569 ns        569 ns    1230258 log/s=899.961M/s
log_nnc_fast/8192          8047 ns       8046 ns      89715 log/s=1018.08M/s
log_nnc_fast/32768        31066 ns      31065 ns      22368 log/s=1054.81M/s
logit_nnc_fast/64           149 ns        149 ns    4851520 logit/s=428.646M/s
logit_nnc_fast/512          980 ns        979 ns     712033 logit/s=522.742M/s
logit_nnc_fast/8192       13326 ns      13325 ns      51916 logit/s=614.805M/s
logit_nnc_fast/32768      54743 ns      54739 ns      12844 logit/s=598.624M/s
```

After:
```
----------------------------------------------------------------------------
Benchmark                     Time           CPU Iterations UserCounters...
----------------------------------------------------------------------------
log_nnc_fast/64             100 ns        100 ns    7012963 log/s=640.588M/s
log_nnc_fast/512            496 ns        496 ns    1415357 log/s=1032.26M/s
log_nnc_fast/8192          7600 ns       7595 ns      88258 log/s=1078.62M/s
log_nnc_fast/32768        30300 ns      30298 ns      22442 log/s=1081.52M/s
logit_nnc_fast/64           152 ns        152 ns    4505712 logit/s=420.279M/s
logit_nnc_fast/512          816 ns        816 ns     873834 logit/s=627.267M/s
logit_nnc_fast/8192       12090 ns      12088 ns      58234 logit/s=677.675M/s
logit_nnc_fast/32768      51576 ns      51531 ns      14645 logit/s=635.888M/s
```

Reviewed By: bwasti

Differential Revision: D26155792

fbshipit-source-id: 16724b419c944aa7d4389ae85838018455a5605f
2021-02-01 16:38:57 -08:00
a23e82df10 [nnc] Tweak log_nnc_sleef so vectorization kicks in (#51491)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51491

The vectorizer heuristic is pretty dumb and only kicks in if the
unroll factor is exactly 8 or 4.

It's still slower than the direct implementation, which isn't surprising.
ghstack-source-id: 120783426

Test Plan:
`buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench`

Before:
```
---------------------------------------------------------------------------
Benchmark                    Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64           438 ns        438 ns    1795511 log/s=146.259M/s
log_nnc_sleef/512         3196 ns       3195 ns     210032 log/s=160.235M/s
log_nnc_sleef/8192       77467 ns      77466 ns       8859 log/s=105.749M/s
log_nnc_sleef/32768     310206 ns     310202 ns       2170 log/s=105.634M/s
log_nnc_fast/64            100 ns        100 ns    7281074 log/s=637.144M/s
log_nnc_fast/512           546 ns        546 ns    1335816 log/s=938.361M/s
log_nnc_fast/8192         7360 ns       7359 ns      91971 log/s=1.11316G/s
log_nnc_fast/32768       30793 ns      30792 ns      22633 log/s=1064.17M/s
log_aten/64           427 ns        427 ns    1634897 log/s=150.021M/s
log_aten/512          796 ns        796 ns     877318 log/s=643.566M/s
log_aten/8192        6690 ns       6690 ns     102649 log/s=1.22452G/s
log_aten/32768      25357 ns      25350 ns      27808 log/s=1.29263G/s
```

After:
```
---------------------------------------------------------------------------
Benchmark                    Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64           189 ns        188 ns    3872475 log/s=340.585M/s
log_nnc_sleef/512         1307 ns       1307 ns     557770 log/s=391.709M/s
log_nnc_sleef/8192       20259 ns      20257 ns      34240 log/s=404.404M/s
log_nnc_sleef/32768      81556 ns      81470 ns       8767 log/s=402.209M/s
log_nnc_fast/64            110 ns        110 ns    6564558 log/s=581.116M/s
log_nnc_fast/512           554 ns        554 ns    1279304 log/s=923.376M/s
log_nnc_fast/8192         7774 ns       7774 ns      91421 log/s=1053.75M/s
log_nnc_fast/32768       31008 ns      31006 ns      21279 log/s=1056.83M/s
```

Reviewed By: bwasti

Differential Revision: D26139067

fbshipit-source-id: db31897ee9922695ff9dff4ff46e3d3fbd61f4c2
2021-02-01 16:35:37 -08:00
5b0a6482c1 Out variant for embedding_bag_4bit_rowwise_offsets (#51324)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51324

Add out variant for embedding_bag_4bit_rowwise_offsets and add it to static runtime registry

Test Plan:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 1 buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=$INLINE_CVR_DIR/210494966_0.predictor.disagg.remote_request_only_remote_cast.pt --pt_inputs=$INLINE_CVR_DIR/remote_ro_wrapped_input_data.pt --pt_enable_static_runtime=true --pt_cleanup_activations=true --pt_enable_out_variant=true --compare_results=true --iters=5000 --warmup_iters=5000 --num_threads=1 --do_profile=true
```

before:
```
0.789023 ms.    54.8408%. quantized::embedding_bag_4bit_rowwise_offsets (82 nodes)
```

after:
```
0.620817 ms.    49.7136%. quantized::embedding_bag_4bit_rowwise_offsets (82 nodes)
```

Reviewed By: ajyu

Differential Revision: D26138322

fbshipit-source-id: 44d3f15d04636404ebd4c1e9eecf73c7ad972944
2021-02-01 16:15:57 -08:00
b198cf4f1c port index_fill_ from TH to ATen. (#50578)
Summary:
As per title. The port is based on TensorIterator.
Supports complex input.

Resolves https://github.com/pytorch/pytorch/issues/24714.
Resolves https://github.com/pytorch/pytorch/issues/24577.
Resolves https://github.com/pytorch/pytorch/issues/36328.
Possibly resolves https://github.com/pytorch/pytorch/issues/48230
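
As a rough sketch of the newly supported complex path (values chosen arbitrarily for illustration):

```python
import torch

x = torch.zeros(3, 4, dtype=torch.complex64)
idx = torch.tensor([0, 2])
# Fill columns 0 and 2 with a complex scalar; complex dtypes are now supported.
x.index_fill_(1, idx, 1 + 2j)
print(x)
```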

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50578

Reviewed By: ngimel

Differential Revision: D26049539

Pulled By: anjali411

fbshipit-source-id: 2be4e78f7a01700c593a9e893e01f69191e51ab1
2021-02-01 16:08:37 -08:00
09bc58796e Hashing logic for c10::complex (#51441)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51441

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D26170195

Pulled By: anjali411

fbshipit-source-id: 9247c1329229405426cfbd8463cabcdbe5bdb740
2021-02-01 15:56:44 -08:00
8fa328f88e [doc] Deprecate torch.cholesky in favor of torch.linalg.cholesky (#51460)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51460

This PR is part of a larger effort to ensure torch.linalg documentation is consistent (see #50287).

* #51459 [doc] Fix linalg.cholesky doc consistency issues
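
For reference, a minimal sketch of the migration the deprecation suggests:

```python
import torch

a = torch.randn(3, 3, dtype=torch.float64)
a = a @ a.T + 3 * torch.eye(3, dtype=torch.float64)  # symmetric positive-definite

l_old = torch.cholesky(a)          # deprecated spelling
l_new = torch.linalg.cholesky(a)   # preferred replacement
assert torch.allclose(l_old, l_new)
```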

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D26176130

Pulled By: heitorschueroff

fbshipit-source-id: cc89575db69cbfd5f87d970a2e71deb6522a35b1
2021-02-01 15:47:08 -08:00
8583f7cbe2 [doc] Fix linalg.cholesky doc consistency issues (#51459)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51459

This PR is part of a larger effort to ensure torch.linalg documentation is consistent (see #50287).

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D26176131

Pulled By: heitorschueroff

fbshipit-source-id: 2ad88a339e6dff044965e8bf29dd8c852afecb34
2021-02-01 15:43:47 -08:00
c08078031f [Gradient Compression] Allow BatchedPowerSGD to run vanilla allreduce for the first K iterations (#51270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51270

Similar to #50973, allow the batched version to run vanilla allreduce for the first K iterations.

This may be useful if the batched version can be applied to some use cases where the accuracy requirement is not very strict.
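
A minimal sketch of how a user would opt into this, assuming the comm-hook API of the time (`ddp_model` is a hypothetical DistributedDataParallel instance):

```python
import torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook as powerSGD

state = powerSGD.PowerSGDState(
    process_group=None,              # use the default process group
    matrix_approximation_rank=1,
    start_powerSGD_iter=1000,        # vanilla allreduce for the first K=1000 iterations
)
ddp_model.register_comm_hook(state, powerSGD.batched_powerSGD_hook)
```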

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120725858

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

baseline: f248001754
batched PowerSGD: f246960752

The training time was reduced from 54m48s to 30m33s, and the accuracy is approximately the same: 44.21 vs 44.35

Reviewed By: rohan-varma

Differential Revision: D26077709

fbshipit-source-id: 6afeefad7a3fbdd7da2cbffb56dfbad855a96cb5
2021-02-01 15:26:29 -08:00
718e4b110b add git submodule troubleshoot to CONTRIBUTING.md (#51458)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51355.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51458

Reviewed By: janeyx99

Differential Revision: D26176233

Pulled By: walterddr

fbshipit-source-id: 758e4203e11c81489234bbca812d1a3738504148
2021-02-01 14:30:00 -08:00
109bc1047e [NNC] Generate C++ code for Allocate and Free (#51070)
Summary:
This is the initial skeleton for C++ codegen, it includes generations for Allocate and Free.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51070

Test Plan: New unit tests are added to `test_cpp_codegen.cpp`.

Reviewed By: ZolotukhinM

Differential Revision: D26061818

Pulled By: cheng-chang

fbshipit-source-id: b5256b2dcee6b2583ba73b6c9684994dbe7cdc1f
2021-02-01 13:06:51 -08:00
642afcb168 Add sgn to torch.rst so that it appears in the built docs (#51479)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51479

Fixes https://github.com/pytorch/pytorch/issues/50146

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D26179734

Pulled By: anjali411

fbshipit-source-id: 1cda9a3dc9ce600e585900eea70fbecac0635d5c
2021-02-01 12:43:06 -08:00
d1ddc5d65d [PyTorch] Outline OperatorEntry::assertSignatureIsCorrect fail path (#51269)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51269

This saves about 10% of the compile time of Functions.cpp. Found using clang-9's `-ftime-trace` feature + ClangBuildAnalyzer.

Test Plan:
Compared -ftime-trace + ClangBuildAnalyzer output.

Before: P167884397

After: P167888502

Note that time spent generating assertSignatureIsCorrect is way down, though it's still kind of slow.

Reviewed By: ezyang

Differential Revision: D26121814

fbshipit-source-id: 949a85d8939c02e4fb5ac1adc35905ed34414724
2021-02-01 12:40:19 -08:00
9877777fee [PyTorch] check isValidUnboxed() in the dispatcher (#51247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51247

See code comment for explanation.

This measures as neutral compared to the previous diff with `perf stat` when
running a benchmark that calls empty in a loop. I think we should commit it
anyway because:
1) I have previously seen it make a difference when applied earlier in
the stack.
2) This makes sense both in principle and from inspecting the output
assembly: we avoid having to touch the boxed kernel at all (usually)
and instead use the unboxed kernel for both the validity check in
`OperatorEntry::lookup` and the actual `KernelFunction::call`.
ghstack-source-id: 120697497

Test Plan: Aforementioned perf measurement

Reviewed By: ezyang

Differential Revision: D26113650

fbshipit-source-id: 8448c4ed764d477f63eb7c0f6dd87b1fc0228b73
2021-02-01 12:40:14 -08:00
4495b49ffa [PyTorch] Pass TensorOptions by value (#51165)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51165

`TensorOptions` does not have a non-trivial copy, move, or
destroy operation and is small enough to fit in a register, so it
seems like we should pass it by value.
ghstack-source-id: 120697498

Test Plan:
Measured timing for empty framework overhead benchmark before & after this change:

Before:
```
I0126 16:02:50.662864 2137574 bench.cpp:139] Mean 0.268645
I0126 16:02:50.662891 2137574 bench.cpp:140] Median 0.267485
I0126 16:02:50.662896 2137574 bench.cpp:141] Min 0.266485
I0126 16:02:50.662901 2137574 bench.cpp:142] stddev 0.00219359
I0126 16:02:50.662915 2137574 bench.cpp:143] stddev / mean 0.00816537

          2,968.37 msec task-clock                #    0.997 CPUs utilized            ( +-  0.03% )
               250      context-switches          #    0.084 K/sec                    ( +-  2.21% )
                 1      cpu-migrations            #    0.000 K/sec
            11,403      page-faults               #    0.004 M/sec                    ( +-  0.28% )
     5,898,481,882      cycles                    #    1.987 GHz                      ( +-  0.03% )  (50.05%)
    16,169,242,938      instructions              #    2.74  insn per cycle           ( +-  0.03% )  (50.06%)
     3,076,546,626      branches                  # 1036.443 M/sec                    ( +-  0.05% )  (50.05%)
         2,531,859      branch-misses             #    0.08% of all branches          ( +-  0.89% )  (50.03%)
```

After:
```
I0126 16:23:20.010062 2244624 bench.cpp:139] Mean 0.266814
I0126 16:23:20.010092 2244624 bench.cpp:140] Median 0.265759
I0126 16:23:20.010099 2244624 bench.cpp:141] Min 0.260291
I0126 16:23:20.010107 2244624 bench.cpp:142] stddev 0.00548279
I0126 16:23:20.010118 2244624 bench.cpp:143] stddev / mean 0.0205491

          2,983.75 msec task-clock                #    0.995 CPUs utilized            ( +-  0.36% )
               243      context-switches          #    0.082 K/sec                    ( +-  1.26% )
                 1      cpu-migrations            #    0.000 K/sec
            11,422      page-faults               #    0.004 M/sec                    ( +-  0.18% )
     5,928,639,486      cycles                    #    1.987 GHz                      ( +-  0.36% )  (50.02%)
    16,105,928,210      instructions              #    2.72  insn per cycle           ( +-  0.05% )  (50.02%)
     3,150,273,453      branches                  # 1055.809 M/sec                    ( +-  0.03% )  (50.05%)
         3,713,617      branch-misses             #    0.12% of all branches          ( +-  0.83% )  (50.07%)

```

It looked close to neutral, so I used `perf stat` to confirm it's about a 1% instruction count win.

For deciding whether this stack is worth it, I went back and ran `perf stat` on the baseline diff before I started touching the dispatcher:

```
          2,968.37 msec task-clock                #    0.997 CPUs utilized            ( +-  0.03% )
               250      context-switches          #    0.084 K/sec                    ( +-  2.21% )
                 1      cpu-migrations            #    0.000 K/sec
            11,403      page-faults               #    0.004 M/sec                    ( +-  0.28% )
     5,898,481,882      cycles                    #    1.987 GHz                      ( +-  0.03% )  (50.05%)
    16,169,242,938      instructions              #    2.74  insn per cycle           ( +-  0.03% )  (50.06%)
     3,076,546,626      branches                  # 1036.443 M/sec                    ( +-  0.05% )  (50.05%)
         2,531,859      branch-misses             #    0.08% of all branches          ( +-  0.89% )  (50.03%)
```

If I've done the arithmetic correctly, we have a 0.39% instruction count win.

Reviewed By: ezyang

Differential Revision: D25983863

fbshipit-source-id: 87d1451a01ead25738ea6b80db270d344bc583b2
2021-02-01 12:40:08 -08:00
341c76dcc1 [PyTorch] Add C10_ALWAYS_INLINE to critical dispatcher paths (#51245)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51245

Splitting this out from #51164 (D26069629) to allow it to
land separately; I'm sure this is a good idea but I'm less sure about
#51164.
ghstack-source-id: 120697499

Test Plan:
double-check effect on empty benchmark with perf stat;
didn't move

Reviewers: ezyang, messmer

Reviewed By: ezyang

Differential Revision: D26112627

fbshipit-source-id: 50d4418d351527bcedd5ccdc49106bc642699870
2021-02-01 12:39:58 -08:00
673687e764 [PyTorch] Refactor Dispatcher to inline less code in fast path (#51163)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51163

The Dispatcher seems to have been in a precarious local
maximum: I tried to make several different changes to parameter
passing and ended up with regressions due to reduced inlining that
swamped any gains I might have gotten from the parameter passing
changes.

This diff reduces the amount of inline code on the fast path. It
should both reduce code size and provide a platform for making further
improvements to the dispatcher code.

It is a slight performance regression, but it unblocked the following
two diffs (which seem to get us back where we were) from landing.
ghstack-source-id: 120693163

Test Plan:
CI, framework overhead benchmarks to check the size of the
regression

Compared timing for empty framework overhead benchmark before/after.

Build command: `buck build mode/no-gpu //caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads:cpp_benchmark mode/opt-clang --show-output`
Run with `numactl -m  0 -C 3 path/to/cpp_benchmark -op empty -niter 100`

Before:
```
I0126 16:02:04.373075 2135872 bench.cpp:139] Mean 0.266272
I0126 16:02:04.373106 2135872 bench.cpp:140] Median 0.266347
I0126 16:02:04.373111 2135872 bench.cpp:141] Min 0.263585
I0126 16:02:04.373117 2135872 bench.cpp:142] stddev 0.0021264
I0126 16:02:04.373131 2135872 bench.cpp:143] stddev / mean 0.00798581
```

After:
```
I0126 16:02:30.377992 2137048 bench.cpp:139] Mean 0.27579
I0126 16:02:30.378023 2137048 bench.cpp:140] Median 0.275281
I0126 16:02:30.378029 2137048 bench.cpp:141] Min 0.270617
I0126 16:02:30.378034 2137048 bench.cpp:142] stddev 0.00308287
I0126 16:02:30.378044 2137048 bench.cpp:143] stddev / mean 0.0111783
```

Yes, it's a regression, but I compared D26069629 stacked on this diff vs not:

With this diff:

```
I0126 16:02:50.662864 2137574 bench.cpp:139] Mean 0.268645
I0126 16:02:50.662891 2137574 bench.cpp:140] Median 0.267485
I0126 16:02:50.662896 2137574 bench.cpp:141] Min 0.266485
I0126 16:02:50.662901 2137574 bench.cpp:142] stddev 0.00219359
I0126 16:02:50.662915 2137574 bench.cpp:143] stddev / mean 0.00816537
```

Without:
```
I0126 20:40:27.815824 3240699 bench.cpp:139] Mean 0.270755
I0126 20:40:27.815860 3240699 bench.cpp:140] Median 0.268998
I0126 20:40:27.815866 3240699 bench.cpp:141] Min 0.268306
I0126 20:40:27.815873 3240699 bench.cpp:142] stddev 0.00260365
I0126 20:40:27.815886 3240699 bench.cpp:143] stddev / mean 0.00961624
```

So we do seem to have accomplished something w.r.t. not overwhelming the inliner.

Reviewed By: ezyang

Differential Revision: D26091377

fbshipit-source-id: c9b7f4e187059fa15452b7c75fc29816022b92b1
2021-02-01 12:36:48 -08:00
ec611aca88 [Pytorch Mobile] Expose _export_operator_list to python (#51312)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51312

Follow up to D24690094 (4a870f6518) exposing the api in python. Created matching unit test.
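
A sketch of the intended Python usage; the `torch.jit.mobile` module path and the model file name are assumptions, not verified against this exact revision:

```python
from torch.jit.mobile import _load_for_lite_interpreter, _export_operator_list

m = _load_for_lite_interpreter("model.ptl")  # hypothetical lite-interpreter model
print(_export_operator_list(m))              # set of root operator names in the model
```
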
ghstack-source-id: 120611452

Test Plan: Ran unit test

Reviewed By: dhruvbird

Differential Revision: D26112765

fbshipit-source-id: ffe3bb97de0a4f08b31719b4b47dcebd7d2fd42a
2021-02-01 12:09:02 -08:00
609f76f27a [WIP][FX] Add Interpreter and Transformer (#50420)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50420
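
A minimal sketch of what `Interpreter` does, executing a traced `GraphModule` node by node (details of the final API may differ from this WIP state):

```python
import torch
import torch.fx

def f(x):
    return torch.relu(x) + 1.0

gm = torch.fx.symbolic_trace(f)
# Interpreter runs the graph node by node; subclasses can override run_node
# to instrument or transform execution.
out = torch.fx.Interpreter(gm).run(torch.randn(3))
```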

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D25880330

Pulled By: jamesr66a

fbshipit-source-id: 27d34888e36e39924821fed891d79f969237a104
2021-02-01 11:40:12 -08:00
0831984ed5 [Resubmission][Gradient Compression] Refactor default_hooks.py and powerSGD_hook.py by creating a util function that make a vanilla allreduce future (#51400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51400

Resubmission of #51094

Address https://github.com/pytorch/pytorch/pull/50973#discussion_r564229818

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120725690

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D26162333

fbshipit-source-id: ccc2eae5383a23673e00d61cb5570fb8bf749cd0
2021-02-01 11:34:41 -08:00
6c24296795 [PyTorch] Devirtualize TensorImpl::has_storage (#51049)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51049

This diff makes it OK to query has_storage() on all TensorImpls. I added debug assertions that storage_ is indeed never set on them, which is required for this to be correct.
ghstack-source-id: 120714380

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D26008498

fbshipit-source-id: b3f55f0b57b04636d13b09aa55bb720c6529542c
2021-02-01 11:30:23 -08:00
765062c085 [PyTorch] Devirtualize TensorImpl::storage_offset (#51048)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51048

There doesn't seem to be any reason to prohibit accessing the always-zero storage_offset of those TensorImpls that prohibit set_storage_offset.
ghstack-source-id: 120714379

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D26008499

fbshipit-source-id: cd92ac0afdebbd5cf8f04df141843635113b6444
2021-02-01 11:27:13 -08:00
50fa415a4d [testing] Add OpInfo for ceil and floor (#51198)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50006

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51198

Reviewed By: malfet

Differential Revision: D26105099

Pulled By: mruberry

fbshipit-source-id: 6cfa89f42b87cca66dbc5bf474d17a6cad7eb45a
2021-02-01 10:10:36 -08:00
449098c2d2 [SobolEngine] Update direction numbers to 21201 dims (#49710)
Summary:
Performs the update that was suggested in https://github.com/pytorch/pytorch/issues/41489

Adjust the functionality to largely match that of the scipy companion PR https://github.com/scipy/scipy/pull/10844/, including
- a new `draw_base2` method
- include zero as the first point in the (unscrambled) Sobol sequence

The scipy PR is also quite opinionated when the `draw` method isn't called with a base-2 number of points (for which the resulting sequence has nice properties; see the scipy PR for a comprehensive discussion of this).

Note that this update is a **breaking change** in the sense that sequences generated with the same parameters after as before will not be identical! They will have the same (better, arguably) distributional properties, but calling the engine with the same seed will result in different numbers in the sequence.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49710

Test Plan:
```
from torch.quasirandom import SobolEngine

sobol = SobolEngine(3)
sobol.draw(4)

sobol = SobolEngine(4, scramble=True)
sobol.draw(5)

sobol = SobolEngine(4, scramble=True)
sobol.draw_base2(2)
```

Reviewed By: malfet

Differential Revision: D25657233

Pulled By: Balandat

fbshipit-source-id: 9df50a14631092b176cc692b6024aa62a639ef61
2021-02-01 08:44:31 -08:00
b1907f5ebc Fix pickling for Tensor subclasses (redo) (#47732)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47051
Redo of https://github.com/pytorch/pytorch/issues/47115
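
A rough sketch of the behavior being fixed (the subclass type is preserved through pickling):

```python
import io
import pickle
import torch

class MyTensor(torch.Tensor):
    pass

t = torch.randn(2).as_subclass(MyTensor)
buf = io.BytesIO()
pickle.dump(t, buf)
buf.seek(0)
loaded = pickle.load(buf)
assert type(loaded) is MyTensor  # subclass survives the round trip after the fix
```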

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47732

Reviewed By: izdeby

Differential Revision: D25465382

Pulled By: ezyang

fbshipit-source-id: 3a8d57281a2d6f57415d5735d34ad307f3526638
2021-02-01 07:32:52 -08:00
508bab43e7 Support complex number list in JIT (#51145)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51145
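
A sketch of the kind of TorchScript program this enables, assuming complex scalars are already scriptable at this point:

```python
import torch
from typing import List

@torch.jit.script
def total(xs: List[complex]) -> complex:
    out = 0j
    for x in xs:
        out = out + x
    return out

print(total([1 + 2j, 3 - 1j]))  # (4+1j)
```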

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D26154025

Pulled By: anjali411

fbshipit-source-id: 74645f9b6467757ddb9d75846e778222109848f0
2021-01-31 23:54:14 -08:00
40c0fffb4b Fixes docs (#51439)
Summary:
pytorch_python_doc_build is failing with:

```
Jan 31 04:30:45 /var/lib/jenkins/workspace/docs/source/notes/broadcasting.rst:6: WARNING: 'any' reference target not found: numpy.doc.broadcasting
```

this removes the incorrect reference and adds an updated link.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51439

Reviewed By: ngimel

Differential Revision: D26170232

Pulled By: mruberry

fbshipit-source-id: 829999db52e1e860d36d626d0d9f26e31283d14b
2021-01-31 22:00:26 -08:00
d1dcd5f287 [fbgemm_gpu] Use the latest philox_cuda_state API for stochastic rounding (#51004)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51004

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/493

Follow-up on the failure case for FP16 stochastic rounding:
- https://github.com/pytorch/pytorch/pull/50148
- D26006041

From Natalia:
- https://github.com/pytorch/pytorch/pull/50916 is the fix, philox_engine_inputs is deprecated btw so if you could refactor it to use philox_cuda_state that would be great.
- instructions to change the call https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/CUDAGeneratorImpl.h#L48-L83, it will be important to use philox_cuda_state with graph capture.

Benchmark:
- Before this Diff:
```
(base) [jianyuhuang@devgpu017.atn5.facebook.com: ~/fbsource/fbcode/hpc/ops/benchmarks] $  buck run mode/opt //hpc/ops/benchmarks:split_table_batched_embeddings_benchmark device -- --fp16 --stoc 2>&1 | tee before_diff.log
PARSING BUCK FILES: FINISHED IN 0.4s
CREATING ACTION GRAPH: FINISHED IN 0.0s
DOWNLOADED 0 ARTIFACTS, 0.00 BYTES, 0.0% CACHE MISS
BUILDING: FINISHED IN 5.3s (100%) 6474/6474 JOBS, 0 UPDATED
BUILD SUCCEEDED
DEBUG:root:Using fused exact_row_wise_adagrad with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=False, max_gradient=1.0, learning_rate=0.1, eps=0.1, beta1=0.9, beta2=0.999, weight_decay=0.0, eta=0.001, momentum=0.9)
INFO:root:Embedding parameters:  0.41 GParam,  0.82GB
INFO:root:Accessed weights per batch:  83.89MB
INFO:root:Forward, B: 512, E: 100000, T: 32, D: 128, L: 20, W: False, BW:  607.48GB/s, T: 138us
INFO:root:ForwardBackward, B: 512, E: 100000, T: 32, D: 128, L: 20, BW:  220.85GB/s, T: 1139us
```

- After this Diff:
```
(base) [jianyuhuang@devgpu017.atn5.facebook.com: ~/fbsource/fbcode/hpc/ops/benchmarks] $  buck run mode/opt //hpc/ops/benchmarks:split_table_batched_embeddings_benchmark device -- --fp16 --stoc 2>&1 | tee after_diff.log
PARSING BUCK FILES: FINISHED IN 1.1s
CREATING ACTION GRAPH: FINISHED IN 0.0s
DEBUG:root:Using fused exact_row_wise_adagrad with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=False, max_gradient=1.0, learning_rate=0.1, eps=0.1, beta1=0.9, beta2=0.999, weight_decay=0.0, eta=0.001, momentum=0.9)
INFO:root:Embedding parameters:  0.41 GParam,  0.82GB
INFO:root:Accessed weights per batch:  83.89MB
INFO:root:Forward, B: 512, E: 100000, T: 32, D: 128, L: 20, W: False, BW:  608.80GB/s, T: 138us
INFO:root:ForwardBackward, B: 512, E: 100000, T: 32, D: 128, L: 20, BW:  229.17GB/s, T: 1098us
```

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D26038596

fbshipit-source-id: 5360395c1c3b1a062b38e5695239258e892c63c4
2021-01-31 20:42:43 -08:00
0e1c5cb354 fixing index clamping for upsample nearest kernel backward (#51240)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51036

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51240

Reviewed By: ailzhang

Differential Revision: D26139221

Pulled By: ngimel

fbshipit-source-id: 0591ac6d1f988b54c1b1ee50d34fb7c2a3f97c4e
2021-01-31 15:22:58 -08:00
9cf62a4b5d [1.8] Add additional tests for object-based APIs (#51341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51341

Adds tests for objects that contain CPU/GPU tensors to ensure that
they can also be serialized/deserialized appropriately.
ghstack-source-id: 120718120

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D26144100

fbshipit-source-id: f1a8ccb9741bb5372cb7809cb43cbe43bf47d517
2021-01-30 19:50:08 -08:00
c255628134 [Collective APIs] Make python object collective API args consistent (#50625)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50625

Make the API signatures consistent and provide default arguments similar to
the tensor collectives.
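
A sketch of the resulting call pattern, assuming an initialized process group:

```python
import torch.distributed as dist

# dist.init_process_group(...) is assumed to have been called already.
obj = {"rank": dist.get_rank()}
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, obj)      # mirrors all_gather(tensor_list, tensor)
dist.broadcast_object_list([obj], src=0)   # group argument defaults like the tensor APIs
```
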
ghstack-source-id: 120718121

Test Plan: CI

Reviewed By: wanchaol

Differential Revision: D25932012

fbshipit-source-id: d16267e236a65ac9d55e19e2178f9d9267b08a20
2021-01-30 19:47:16 -08:00
721ba97eb6 Create op benchmark for stack (#51263)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51263

- Add benchmark for stack op

Test Plan:
```
buck build mode/opt //caffe2/benchmarks/operator_benchmark/pt:stack_test --show-output
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/stack_test.par --tag_filter=static_runtime | grep Execution

Forward Execution Time (us) : 6.380
Forward Execution Time (us) : 6.553
Forward Execution Time (us) : 14.904
Forward Execution Time (us) : 5.657
Forward Execution Time (us) : 5.612
Forward Execution Time (us) : 6.051
Forward Execution Time (us) : 4.225
Forward Execution Time (us) : 4.240
Forward Execution Time (us) : 6.280
Forward Execution Time (us) : 6.267
Forward Execution Time (us) : 418.932
Forward Execution Time (us) : 417.694
Forward Execution Time (us) : 1592.455
Forward Execution Time (us) : 2919.261
Forward Execution Time (us) : 211.458
Forward Execution Time (us) : 211.518
Forward Execution Time (us) : 783.953
Forward Execution Time (us) : 1457.823
Forward Execution Time (us) : 2032.816
Forward Execution Time (us) : 2090.662
Forward Execution Time (us) : 6487.098
Forward Execution Time (us) : 11874.702
Forward Execution Time (us) : 2123.830
Forward Execution Time (us) : 2195.453
Forward Execution Time (us) : 6435.978
Forward Execution Time (us) : 11852.205
Forward Execution Time (us) : 2036.526
Forward Execution Time (us) : 2055.618
Forward Execution Time (us) : 6417.192
Forward Execution Time (us) : 12468.744
Forward Execution Time (us) : 4959.704
Forward Execution Time (us) : 5121.823
Forward Execution Time (us) : 5082.105
Forward Execution Time (us) : 5395.936
Forward Execution Time (us) : 5162.756
Forward Execution Time (us) : 23798.080
Forward Execution Time (us) : 4957.921
Forward Execution Time (us) : 4971.234
Forward Execution Time (us) : 5005.909
Forward Execution Time (us) : 5159.614
Forward Execution Time (us) : 5013.221
Forward Execution Time (us) : 20238.741
Forward Execution Time (us) : 7632.439
Forward Execution Time (us) : 7589.376
Forward Execution Time (us) : 7859.937
Forward Execution Time (us) : 8214.213
Forward Execution Time (us) : 11606.562
Forward Execution Time (us) : 34612.919
```

Reviewed By: hlu1

Differential Revision: D25859143

fbshipit-source-id: a1b735ce87f57b5eb67e223e549248a2cd7663c1
2021-01-30 10:32:14 -08:00
e26fccc22b update profiler doc strings (#51395)
Summary:
Fixes formatting for the autograd.profiler docstring (it was broken) and slightly expands the profiler.profile documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51395

Reviewed By: ilia-cher

Differential Revision: D26162349

Pulled By: ngimel

fbshipit-source-id: ac7af8e0f3dbae2aa899ad815d2311c2758ee57c
2021-01-29 23:37:06 -08:00
17b5683156 Multi-GPU Kineto profiler test (#51391)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51391

Adding a test to check the kineto profiler on multiple gpus

Test Plan: python test/test_profiler.py

Reviewed By: ngimel

Differential Revision: D26160788

Pulled By: ilia-cher

fbshipit-source-id: f3554f52176cc26e7f331d205f1a514eb03aa758
2021-01-29 23:26:12 -08:00
11cda929fb [StaticRuntime] Fix bug in MemoryPlanner (#51342)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51342

There is a subtle bug in the MemoryPlanner with regard to view ops with out variants.

```
  def forward(self, a: Tensor, shape: List[int]):
      b = a.reshape(shape)
      return b + b
```
In this case, if we replace reshape with its out variant, b is managed by the MemoryPlanner, and its storage is set to nullptr right after inference when opts.cleanup_activations is true. Because b is a view of a, the storage of a is also set to nullptr, which violates the API's promise that a is const.

To fix this bug, I changed the MemoryPlanner so that it puts b in the unmanaged part.

Test Plan:
Add unit test to enforce the constness of inputs

```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```

Reviewed By: ajyu

Differential Revision: D26144203

fbshipit-source-id: 2dbacccf7685d0fe0f0b1195166e0510b2069fe3
2021-01-29 21:16:02 -08:00
09e48dbd33 Handle error during dict expansion (#51374)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51374

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D26155995

Pulled By: ansley

fbshipit-source-id: 04e924cb641565341c570c6cf5e5eec42e4f9c8b
2021-01-29 18:46:10 -08:00
7ab89f58be expose memory_fraction and gpu_process docs (#51372)
Summary:
Per title
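
For context, a small sketch of the APIs whose docs are being exposed (argument values are arbitrary; requires a CUDA-enabled build):

```python
import torch

# Cap this process at 50% of device 0's total GPU memory.
torch.cuda.set_per_process_memory_fraction(0.5, device=0)
# Human-readable summary of processes currently using device 0.
print(torch.cuda.list_gpu_processes(device=0))
```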

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51372

Reviewed By: mruberry

Differential Revision: D26157787

Pulled By: ngimel

fbshipit-source-id: 97eac5f12881a2bf62c251f6f7eaf65fdbe34056
2021-01-29 18:22:34 -08:00
7d30f67659 remove LegacyDefinitions as it is empty now (#51251)
Summary:
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51251

Reviewed By: mruberry

Differential Revision: D26120574

Pulled By: ngimel

fbshipit-source-id: 223b4f358932f47e0af7413752c7db7c35402260
2021-01-29 18:15:11 -08:00
d5541c50a3 add a c++ interface in processGroup to get its backend name (#51066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51066

The backend name of a process group created via the distributed_c10d Python API is tracked, but there is no good way to get the name of a process group created via the ProcessGroup C++ API. In some cases, knowing the backend name of a process group is useful, e.g., to log it, or to write code that depends on a particular backend.
ghstack-source-id: 120628432

Test Plan: unit tests

Reviewed By: pritamdamania87

Differential Revision: D26059769

fbshipit-source-id: 6584c6695c5c3570137dc98c16e06cbe4b7f5503
2021-01-29 17:28:42 -08:00
662b6d2115 [dist_optim] update the doc of DistributedOptimizer (#51314)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51314

updating the doc of DistributedOptimizer to include TorchScript enablement information

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26156032

Pulled By: wanchaol

fbshipit-source-id: 1f3841f55918a5c2ed531cf6aeeb3f6e3a09a6a8
2021-01-29 17:12:52 -08:00
a88e1d3ddf [complex] Complex support for masked_scatter and autograd support for masked_scatter and masked_select (#51281)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/33152

Changes
* Enable complex support for masked_scatter
* Enable half support for masked_scatter CPU
* Enable complex autograd support for masked_scatter CPU and masked_select (both CPU and CUDA).

**Note**:
Complex support for masked_scatter CUDA is disabled as it depends on `masked_fill`, which is yet to be ported to ATen.
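
A small sketch of the newly supported complex autograd path:

```python
import torch

x = torch.randn(4, dtype=torch.complex128, requires_grad=True)
mask = torch.tensor([True, False, True, False])
y = torch.masked_select(x, mask)
# Reduce to a real scalar so backward() can be called directly.
y.abs().sum().backward()
print(x.grad)
```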

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51281

Reviewed By: ailzhang

Differential Revision: D26127561

Pulled By: anjali411

fbshipit-source-id: 6284926b934942213c5dfc24b5bcc8538d0231af
2021-01-29 13:49:31 -08:00
fe645fdfc7 Update _torch_docs.py (#51212)
Summary:
Fix the `torch.linalg.qr` reference so that the fully-qualified name renders in the docs.

Suggested fix for https://github.com/pytorch/pytorch/pull/47764/files#r565368195

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51212

Reviewed By: ezyang

Differential Revision: D26142496

Pulled By: ailzhang

fbshipit-source-id: 052b2085099baa372e3b515b403f25d23cf50785
2021-01-29 13:03:09 -08:00
da920fa141 Enable rocm tests in common nn (#51227)
Summary:
Fixes #{issue number}
Resubmitting a new PR as the older one got reverted due to problems in test_optim.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51227

Reviewed By: ezyang

Differential Revision: D26142505

Pulled By: ailzhang

fbshipit-source-id: a2ab5d85630aac2d2ce17652ba19c11ea668a6a9
2021-01-29 12:54:04 -08:00
52609c8c65 .github: Up frequency of stale checks (#51365)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51365

We have a pretty big backlog of PRs when it comes to checking for staleness, and the action only supports processing 30 PRs at a time.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D26153785

Pulled By: seemethere

fbshipit-source-id: 585b36068683e04cf4e2cc59013482f143ec30a3
2021-01-29 12:50:40 -08:00
dbfaf966b0 [android] turn on USE_VULKAN for android builds by default (#51291)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51291

Turning on USE_VULKAN for Android builds.
Removing the standalone Android Vulkan build.

Testing all ci jobs (for master): https://github.com/pytorch/pytorch/pull/51292

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D26141891

Pulled By: IvanKobzarev

fbshipit-source-id: e8e1a4ab612c0786ce09217ab9370fd75a71eb00
2021-01-29 11:58:21 -08:00
ebd2a82559 Replace all AT_ASSERTM in RNN_miopen.cpp (#51072)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51072

AT_ASSERTM is deprecated and should be replaced by either TORCH_CHECK or
TORCH_INTERNAL_ASSERT, depending on the situation.

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D26074364

Pulled By: ezyang

fbshipit-source-id: 742e28afe49e0a546c252a0fad487f93410d0cb5
2021-01-29 11:40:38 -08:00
dfca1e48d3 Replace all AT_ASSERTM under c10/ (except Exception.h) (#50843)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50843

AT_ASSERTM is deprecated and should be replaced by either TORCH_CHECK or
TORCH_INTERNAL_ASSERT, depending on the situation.

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D26074365

Pulled By: ezyang

fbshipit-source-id: 46e13588fad4e24828f3cc99635e9cb2223a6c2c
2021-01-29 11:37:07 -08:00
c41ca4ae5b [doc]Fix autograd.detect_anomaly docs incorrectly formatted (#51335)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51141

Two bullet points don't render as bullet points.

Before
<img width="657" alt="screenshot before" src="https://user-images.githubusercontent.com/19372617/106240701-125a3080-6248-11eb-9572-f915aa9b72e1.png">

After
<img width="888" alt="screenshot after" src="https://user-images.githubusercontent.com/19372617/106240714-17b77b00-6248-11eb-8e54-51be103639e9.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51335

Reviewed By: izdeby

Differential Revision: D26148582

Pulled By: ezyang

fbshipit-source-id: 5aff6f9bd7affdf13bec965e9bf1a417e5caa88d
2021-01-29 11:18:51 -08:00
5021582fe6 Fix benchmarks/distributed/ddp/benchmark.py (#51095)
Summary:
Fixes the issue reported in https://github.com/pytorch/pytorch/issues/50679 by using built-in object-based collectives. The user has verified that this patch works.

Test with:
RANK=0 python3 pytorch-dist-benchmark.py --world-size 2 --master-addr 127.0.0.1 --master-port 23456
RANK=1 python3 pytorch-dist-benchmark.py --world-size 2 --master-addr 127.0.0.1 --master-port 23456

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51095

Reviewed By: SciPioneer

Differential Revision: D26070275

Pulled By: rohan-varma

fbshipit-source-id: 59abcaac9e395bcdd8a018bf6ba07521d94b2fdf
2021-01-29 11:10:13 -08:00
1b089c1257 Modernize for-loops (#50899)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50899

Test Plan: Sandcastle tests + OSS CI

Reviewed By: ezyang

Differential Revision: D26001931

fbshipit-source-id: d829d520f647aacd178e1c7a9faa6196cc5af54e
2021-01-29 10:52:31 -08:00
edaa23c8ab extend init_group_test timeout to 5s (#51330)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50662

![image](https://user-images.githubusercontent.com/16190118/106225549-58030300-6220-11eb-948d-1998bdafc245.png)

From: https://circleci.com/api/v1.1/project/github/pytorch/pytorch/10203733/output/105/0?file=true&allocation-id=60022ee190b8596d279f4531-0-build%2F195A7D58 (e86f941395)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51330

Reviewed By: izdeby

Differential Revision: D26148618

Pulled By: ezyang

fbshipit-source-id: 708d7522843da2f5c919cf41919e6819f89903e2
2021-01-29 10:44:28 -08:00
30675d0921 Added OpInfo-based testing of triangular_solve (#50948)
Summary:
Added OpInfo-based testing of `torch.triangular_solve`.

These tests helped discover that CPU `triangular_solve` wasn't working for empty matrices, and that for CUDA inputs a warning was printed to the terminal. Both issues are now fixed.
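
For orientation, a minimal usage sketch of the op under test:

```python
import torch

A = torch.triu(torch.randn(3, 3, dtype=torch.float64)) + 3 * torch.eye(3, dtype=torch.float64)
b = torch.randn(3, 2, dtype=torch.float64)
x, _ = torch.triangular_solve(b, A, upper=True)  # returns (solution, copy of A)
assert torch.allclose(A @ x, b)
```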

CUDA gradgrad checks are skipped.
```
11.44s call     test/test_ops.py::TestGradientsCUDA::test_fn_gradgrad_triangular_solve_cuda_complex128
2.97s call     test/test_ops.py::TestGradientsCUDA::test_fn_gradgrad_triangular_solve_cuda_float64
1.60s call     test/test_ops.py::TestGradientsCPU::test_fn_gradgrad_triangular_solve_cpu_complex128
1.36s call     test/test_ops.py::TestOpInfoCUDA::test_supported_dtypes_triangular_solve_cuda_complex128
1.20s call     test/test_ops.py::TestGradientsCUDA::test_fn_grad_triangular_solve_cuda_complex128
0.86s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_triangular_solve_cuda_complex64
0.85s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_triangular_solve_cuda_complex128
0.81s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_triangular_solve_cuda_float64
0.77s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_triangular_solve_cuda_float32
0.46s call     test/test_ops.py::TestCommonCPU::test_variant_consistency_jit_triangular_solve_cpu_complex128
0.44s call     test/test_ops.py::TestCommonCPU::test_variant_consistency_jit_triangular_solve_cpu_complex64
0.44s call     test/test_ops.py::TestGradientsCUDA::test_fn_grad_triangular_solve_cuda_float64
0.42s call     test/test_ops.py::TestGradientsCPU::test_fn_gradgrad_triangular_solve_cpu_float64
0.40s call     test/test_ops.py::TestCommonCPU::test_variant_consistency_jit_triangular_solve_cpu_float32
0.40s call     test/test_ops.py::TestCommonCPU::test_variant_consistency_jit_triangular_solve_cpu_float64
0.17s call     test/test_ops.py::TestGradientsCPU::test_fn_grad_triangular_solve_cpu_complex128
```

Ref. https://github.com/pytorch/pytorch/issues/50006

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50948

Reviewed By: ailzhang

Differential Revision: D26123998

Pulled By: mruberry

fbshipit-source-id: 54136e8fc8a71f107dddb692c5be298c6d5ed168
2021-01-29 10:31:07 -08:00
1b479416b7 Clarify logic in ir_emitter (#51299)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51299

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D26131245

Pulled By: ansley

fbshipit-source-id: ecd69275214775804f5aa92f9b4c0b19be19b596
2021-01-29 10:05:01 -08:00
c0966914bc Internal gradcheck wrapper in testing._internal that sets certain flags to True (#51133)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49409

There are many call sites where gradcheck/gradgradcheck is now implicitly invoked with `check_batched_grad=True` where it was previously False. Cases fall into two basic categories:
1) the call site was previously using `torch.autograd.gradcheck` but is now changed to use the globally imported function instead
2) the call site was already using the globally imported function, but does not explicitly pass the `check_batched_grad` flag

Only in the _assertGradAndGradgradChecks cases, which are infrequent, I assumed that the author is aware that omitting the flag means not applying check_batched_grad=True. (But maybe that is not the case?)

Overall, this PR in its current state assumes that unless the author explicitly specified `check_batched_grad=False`, they were probably not aware of this flag and did not mean for it to be False.

So far exceptions to the above (as discovered by CI) include:
 - Mkldnn (opaque tensors do not have strides) https://app.circleci.com/pipelines/github/pytorch/pytorch/264416/workflows/e4d87886-6247-4305-8526-2696130aa9a4/jobs/10401882/tests
 - all cases in test_sparse (https://app.circleci.com/pipelines/github/pytorch/pytorch/264553/workflows/3c1cbe30-830d-4acd-b240-38d833dccd9b/jobs/10407103)
 - all cases in test_overrides (https://app.circleci.com/pipelines/github/pytorch/pytorch/264553/workflows/3c1cbe30-830d-4acd-b240-38d833dccd9b/jobs/10407236)
 - test_autograd (test_LSTM_grad_and_gradgrad) - (https://app.circleci.com/pipelines/github/pytorch/pytorch/264553/workflows/3c1cbe30-830d-4acd-b240-38d833dccd9b/jobs/10407235)
 - test_data_parallel (test_data_parallel_buffers_requiring_grad) - *SIGSEGV* (https://app.circleci.com/pipelines/github/pytorch/pytorch/264820/workflows/14d89503-040d-4e3d-9f7b-0bc04833589b/jobs/10422697)
 - test_nn (https://app.circleci.com/pipelines/github/pytorch/pytorch/264919/workflows/df79e3ed-8a31-4a8e-b584-858ee99686ff/jobs/10427315)

Possible TODO is to prevent new tests from invoking external gradcheck.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51133

Reviewed By: ezyang

Differential Revision: D26147919

Pulled By: soulitzer

fbshipit-source-id: dff883b50f337510a89f391ea2fd87de2d531432
2021-01-29 09:13:37 -08:00
5a406c023e Revert D26070147: [Gradient Compression] Refactor default_hooks.py and powerSGD_hook.py by creating a util function that make a vanilla allreduce future
Test Plan: revert-hammer

Differential Revision:
D26070147 (e7b3496232)

Original commit changeset: 8c9339f1511e

fbshipit-source-id: fa1e9582baec9759a73b3004be9bb19bdeb6cd34
2021-01-29 09:06:24 -08:00
270111b7b6 split quantization jit op (#51329)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51329

Currently, test_qbatch_norm_relu contains too many examples and causes a timeout. Splitting them for now to fix the timeout issue.

Test Plan: buck test caffe2/test:quantization

Reviewed By: supriyar

Differential Revision: D26141037

fbshipit-source-id: da877efa78924a252a35c2b83407869ebb8c48b7
2021-01-29 07:49:53 -08:00
3397919dcf Rowwise Prune op (Add the test to OSS run_test), Make the op private. (#46131)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46131

Refer to the title.

Test Plan: `buck test caffe2/test:pruning`

Reviewed By: raghuramank100

Differential Revision: D24230472

fbshipit-source-id: 8f0a83446c23fdf30d0313b8c3f5ff1a463b50c7
2021-01-29 06:08:18 -08:00
ebe26b81d2 [PyTorch Mobile] Enable partial loading of GPU models on linux CPU machines (#51236)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51236

The problem we currently have with tracing is that GPU models can't load on devvm CPU machines. Here's why:

1. The Metal GPU ops don't exist so the validation that checks for missing ops kicks in and prevents loading
2. Even if the check for missing ops is commented out, the actual model contents can't be successfully loaded (see t83364623 for details)

Hence, to work around these problems and allow tracing to detect GPU models, and skip actual tracing for these (as discussed in the meeting 2 weeks ago and based on recommendations from raziel, iseeyuan, and xta0), we're adding code to detect these GPU models based on the set of operators that show up in the file `extra/mobile_info.json`.

The code then skips tracing, and picks up the root operators from the model itself.

The diff below this one will be removed before landing since we don't want to check in the model - I've kept it here in case anyone wants to patch this diff in and run the command on their devvm locally.
ghstack-source-id: 120638092

Test Plan:
See {P168657729} for a successful run of tracing on a GPU model (person segmentation tier-0, v1001) provided by xta0

Also ran `buck test //xplat/pytorch_models/build/...` successfully.

Reviewed By: ljk53

Differential Revision: D26109526

fbshipit-source-id: 6119b0b59af8aae8b1feca0b8bc29f47a57a1a67
2021-01-29 01:00:08 -08:00
534aabce14 [nnc] Don't use sleef where it's slower (#51246)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51246

Using sleef is sometimes slower than libm (I haven't debugged why).
The easy solution is to not use sleef in those cases.  With this diff, plus the
prior one to use sleef period, we've sped up every unary op:
ghstack-source-id: 120614087

Test Plan: `buck run mode/opt -c python.package_style=inplace //caffe2/benchmarks/cpp/tensorexpr:bench_ops`

Reviewed By: ZolotukhinM

Differential Revision: D26113672

fbshipit-source-id: 6b731ac935b3652c8b3e3f4a5d2baa39ff31323a
2021-01-28 22:35:11 -08:00
0a9764ecc1 [nnc] Expose vectorized math functions to jit fuser. (#51190)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51190

We want to be able to call fast vectorized functions from sleef inside
the jit fuser, but only when they're supported by the host processor.  Enabling
this feature has two parts:

1. Record the addresses of the symbols, assuming they're defined.  Sleef only
defines vectorized functions if AVX is enabled, so we need to define __AVX__ to
get access to those symbols.  We don't actually need to compile anything with
AVX; the symbols just have to be present.

2. Before emitting a call to sleef, check if the host processor actually has
AVX.  LLVM makes this easy since we can just check the target feature string
for "+avx".
ghstack-source-id: 120614086

Test Plan:
```
buck run mode -c python.package_style=inplace //caffe2/benchmarks/cpp/tensorexpr:bench_ops
```

shows a significant speedup on most math functions (esp sigmoid, which goes
from 13% of ATen speed to parity).

Reviewed By: navahgar

Differential Revision: D26096170

fbshipit-source-id: b7268a50d73f8dc03b4db11cc38b8402387eed2d
2021-01-28 22:35:07 -08:00
d74a226daa [nnc] Use sleef if its symbols are available (#51187)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51187

Instead of relying on #ifdefs, we want to use sleef if its symbols are
available.  This diff adds the mechanism to do that check using LLVM's symbol
lookup.

This diff by itself is a no-op, because sleef isn't properly being exposed to
LLVM yet (the `#ifdef __AVX__` checks are always false, because torch/jit isn't
built with `-mavx`).  The next diff will properly expose the symbols, and
perform run time checks.
ghstack-source-id: 120614091

Test Plan: `buck test //caffe2/test/cpp/tensorexpr:`

Reviewed By: Krovatkin

Differential Revision: D26096206

fbshipit-source-id: 3f2b37500276e8bf50a167ecf8aeeb295d7ec232
2021-01-28 22:35:03 -08:00
0a065ebe86 [nnc][trivial] Refactor llvm_jit so the wrapper class doesn't depend on ifdefs (#51186)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51186

Just a bit of drive-by cleanup; the wrapper class should be the same
for all builds so let's not conditionally compile it for no reason.
ghstack-source-id: 120614088

Test Plan: buck build

Reviewed By: navahgar

Differential Revision: D26096205

fbshipit-source-id: 1e4cb682614fae0e889ba35fb1edb489fb99158e
2021-01-28 22:34:59 -08:00
1114fd6b3a [nnc] Refactor generation of intrinsics to reduce the amount of macro-hell (#51125)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51125

The big pile of X-macros used for emitting (possibly vectorized)
intrinsics makes it **really** difficult to change that code in any systematic
way (which I'm about to do in a later diff).

We can factor most of what the macro does into a fairly simple function.  There
are still macros but they're just a bunch of case/call helper/break
boilerplate.
ghstack-source-id: 120614089

Test Plan: `buck test mode/opt -c python.package_style=inplace //caffe2/benchmarks/cpp/tensorexpr:bench_ops`

Reviewed By: ZolotukhinM

Differential Revision: D26078384

fbshipit-source-id: 843548033f73d88c5d9a031c285b92f73be21390
2021-01-28 22:31:49 -08:00
43f0ccd1ec torch.cuda.memory_allocated to return {} if not initialized (#51179)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49952

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51179

Reviewed By: ngimel

Differential Revision: D26094932

Pulled By: malfet

fbshipit-source-id: 0ec28ef9b0604245753d3f2b0e3536286700668d
2021-01-28 20:38:17 -08:00
916af892b3 [quant][fx] Update name of packed weight attributes (#51259)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51259

Store the FQN of the module that is using the packed weights (the quantized op)

In the case of fusion we update the scope mapping to store the module path of the fused node.

Test Plan:
python test/test_quantization.py test_packed_weight_fused_op

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D26117964

fbshipit-source-id: 9d929997baafb1c91063dd9786a451b0040ae461
2021-01-28 20:31:08 -08:00
05c8cd748d memory efficient per-channel fq: use it everywhere, delete old version (#51265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51265

This PR is the cleanup after #51159. At a high level, we make the new
definition of per-channel fake_quant the definition used by autograd, but keep the old
function around as a thin wrapper to keep the user-facing API the same.

In detail:

1. point fake_quantize_per_channel_affine's implementation to be fake_quantize_per_channel_affine_cachemask
2. delete the fake_quantize_per_channel_affine backward, autograd will automatically use the cachemask backward
3. delete all the fake_quantize_per_channel_affine kernels, since they are no longer used by anything

Test Plan:
```
python test/test_quantization.py TestFakeQuantize
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D26120957

fbshipit-source-id: 264426435fabd925decf6d1f0aa79275977ea29b
2021-01-28 19:42:25 -08:00
267e243064 fake_quant: more memory efficient per-channel backward (#51255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51255

This is the same as #50561, but for per-channel fake_quant.

TODO before land write up better

Memory and performance impact (MobileNetV2): TODO

Performance impact (microbenchmarks): https://gist.github.com/vkuzo/fbe1968d2bbb79b3f6dd776309fbcffc
* forward pass on cpu: 512ms -> 750ms (+46%)
* forward pass on cuda: 99ms -> 128ms (+30%)
* note: the overall performance impact to training jobs should be minimal, because this is used for weights, and relative importance of fq is dominated by fq'ing the activations
* note: we can optimize the perf in a future PR by reading once and writing twice

Test Plan:
```
python test/test_quantization.py TestFakeQuantize.test_forward_per_channel_cachemask_cpu
python test/test_quantization.py TestFakeQuantize.test_forward_per_channel_cachemask_cuda
python test/test_quantization.py TestFakeQuantize.test_backward_per_channel_cachemask_cpu
python test/test_quantization.py TestFakeQuantize.test_backward_per_channel_cachemask_cuda
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D26117721

fbshipit-source-id: 798b59316dff8188a1d0948e69adf9e5509e414c
2021-01-28 19:39:35 -08:00
f2e41257e4 Back out "Revert D26077905: Back out "Revert D25850783: Add torch::deploy, an embedded torch-python interpreter"" (#51267)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51267

Original commit changeset: b70185916502

Test Plan: test locally, oss ci-all, fbcode incl deferred

Reviewed By: suo

Differential Revision: D26121251

fbshipit-source-id: 4315b7fd5476914c8e5d6f547e1cfbcf0c227781
2021-01-28 19:30:45 -08:00
e7b3496232 [Gradient Compression] Refactor default_hooks.py and powerSGD_hook.py by creating a util function that make a vanilla allreduce future (#51094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51094

Address https://github.com/pytorch/pytorch/pull/50973#discussion_r564229818

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120619680

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D26070147

fbshipit-source-id: 8c9339f1511e8f24cc906b9411cfe4850a5a6d81
2021-01-28 19:03:18 -08:00
9d731e87de [Gradient Compression] Explicitly specify the dtype of the error tensor (#50985)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50985

Explicitly specify the dtype of the error tensor when it is initialized with zeros.

Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, although it would later be assigned another FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change makes the dtype of the error tensor clearer.

Additionally, also explicitly specify the dtype if rank-1 tensor buffer is empty.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120377786

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D26034988

fbshipit-source-id: e0d323d0b77c6a2478cdbe8b31a1946ffd1a07da
2021-01-28 19:03:14 -08:00
b619d37bb4 [Gradient Compression] Simplify the implementation of error feedback and warm-start (#50981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50981

Since vanilla allreduce will be applied in the first few iterations, the bucket rebuilding process will not affect the caching of per-variable tensors.

Previously the cached tensors used for error feedback and warm-up need to be rebuilt later, because their corresponding input tensors' shape will be changed after the bucket rebuild process.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120617971

Test Plan: real run

Reviewed By: rohan-varma

Differential Revision: D26034418

fbshipit-source-id: e8744431c7f3142d75b77b60110e6861c2ff5c14
2021-01-28 18:59:40 -08:00
00d4ec840e clone pytorch.github.io with depth 1 (#48115)
Summary:
Speeds up clone of pytorch.github.io in CI/CD - currently takes ~7 minutes each run.

Locally these are the results: 3.73 seconds vs 611.87 seconds.

With depth 1:

```
$ time git clone https://github.com/pytorch/pytorch.github.io -b site --depth 1
...
3.73s user 2.97s system 23% cpu 28.679 total
```

Without:

```
$ time git clone https://github.com/pytorch/pytorch.github.io -b site
...
611.87s user 66.16s system 96% cpu 11:41.99 total
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48115

Reviewed By: mrshenli

Differential Revision: D25107867

Pulled By: ranman

fbshipit-source-id: b6131b51df53b7f71d9b4905181182699c0c6c09
2021-01-28 18:40:10 -08:00
8a8fac6681 Remove debug-only assertion from vulkan::api::Command::Command as the buffer can legitimately be null. (#51160)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51160

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D26131252

Pulled By: AshkanAliabadi

fbshipit-source-id: 69f324ceed711753d77ab7c6b6a20a29cdbdf5f9
2021-01-28 18:33:50 -08:00
592a8ad1c8 Define static constexpr variable in at::native::vulkan::api::Handle. (#51006)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51006

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D26131253

Pulled By: AshkanAliabadi

fbshipit-source-id: 950bf57b348726fe5da4fed6a8b1e108c7a52e11
2021-01-28 18:30:18 -08:00
5ed0ad4b6a DataPipe naming convention update (#51262)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51262

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26120628

Pulled By: glaringlee

fbshipit-source-id: 6855a0dd6d4a93ff93adce1039960ffd7057a827
2021-01-28 17:44:36 -08:00
f9f22c8b5c Add serialization logic for complex numbers (#51287)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51287

This reverts commit dfdb1547b9c1934904bfd137b4007d6a46a6f597.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D26131165

Pulled By: anjali411

fbshipit-source-id: 047167fac594ddb670c5e169446e90e74991679a
2021-01-28 17:25:35 -08:00
6e4746c1ac Port cholesky_inverse to ATen (#50269)
Summary:
Now we can remove `_th_potri`!

Compared to the original TH-based `cholesky_inverse`, complex (https://github.com/pytorch/pytorch/issues/33152) and batched inputs (https://github.com/pytorch/pytorch/issues/7500) are now supported both on CPU and CUDA.
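
A quick illustration of the newly supported batched case (a sketch, not taken from the PR's tests):

```python
import torch

A = torch.randn(4, 3, 3)
A = A @ A.transpose(-2, -1) + 3 * torch.eye(3)  # batch of positive-definite matrices
L = torch.cholesky(A)                           # batched Cholesky factors
A_inv = torch.cholesky_inverse(L)               # batched inputs now work on CPU and CUDA
print(torch.allclose(A_inv, torch.inverse(A), atol=1e-4))  # True
```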

Closes https://github.com/pytorch/pytorch/issues/24685.
Closes https://github.com/pytorch/pytorch/issues/24543.

Ref. https://github.com/pytorch/pytorch/issues/49421, https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50269

Reviewed By: bdhirsh

Differential Revision: D26047548

Pulled By: anjali411

fbshipit-source-id: e4f191e39c684f241b7cb0f4b4c025de082cccef
2021-01-28 16:24:41 -08:00
9f6e0de548 Update third_party/build_bundled.py (#51161)
Summary:
Follow up https://github.com/pytorch/pytorch/issues/50695. my bad for merging with one missing commit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51161

Reviewed By: ailzhang

Differential Revision: D26134761

Pulled By: walterddr

fbshipit-source-id: a606f6cfbb5c48b3c6f3859896522f294e1b077e

Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
2021-01-28 14:41:09 -08:00
7097c0d4f3 [quant][graphmode][fx] Add support for functional conv1d and conv3d (#51155) (#51254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51254

This PR added support for quantizing functional conv1d, conv3d, conv1d_relu, and conv3d_relu

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_functional_conv

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26116172

fbshipit-source-id: 56e7d799c11963fe59ee3a1b6eb23f52007b91dc
2021-01-28 14:32:32 -08:00
35990b5f56 .github: Remove title from stale alert (#51306)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51306

Title wasn't rendering correctly, so let's just remove it altogether; it
shouldn't matter that much in the long run.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D26134907

Pulled By: seemethere

fbshipit-source-id: 54485cb66fb57f549255f9e7bcfb39b51fe69776
2021-01-28 14:23:21 -08:00
1379842f4a Add private mechanism to toggle vmap fallback warnings (#51218)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51218

Fixes #51144.

Context
=======

Users have complained about warning spam from batched gradient
computation. This warning spam happens because warnings in C++ don't
correctly get turned into Python warnings when those warnings arise from
the autograd engine.

To work around that, this PR adds a mechanism to toggle vmap warnings.
By default, the vmap fallback will not warn when it is invoked. However,
by using `torch._C._debug_only_display_vmap_fallback_warnings(enabled)`,
one can toggle the existence of vmap fallback warnings.

This API is meant to be a private, debug-only API. The goal is to be
able to non-intrusively collect feedback from users to improve
performance on their workloads.
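
Usage is a simple toggle (a sketch around the API named above):

```python
import torch

torch._C._debug_only_display_vmap_fallback_warnings(True)   # surface fallback warnings
# ... run code that may hit the vmap fallback ...
torch._C._debug_only_display_vmap_fallback_warnings(False)  # back to the silent default
```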

What this PR does
=================

This PR adds an option to toggle vmap warnings. The mechanism is
toggling a bool in ATen's global context.

There are some other minor changes:
- This PR adds a more detailed explanation of performance cliffs to the
autograd.functional.{jacobian, hessian} documentation
- A lot of the vmap tests in `test_vmap.py` rely on the fallback warning
to test the presence of the fallback. In test_vmap, I added a context
manager to toggle on the fallback warning while testing.

Alternatives
============

I listed a number of alternatives in #51144. My favorite one is having a new
"performance warnings mode" (this is currently a WIP by some folks on
the team). This PR is to mitigate the problem of warning spam before
a "performance warnings mode" gets shipped into PyTorch

Concerns
========

I am concerned that we are advertising a private API
(`torch._C._debug_only_display_vmap_fallback_warnings(enabled)`) in the
PyTorch documentation. However, I hope the naming makes it clear to
users that they should not rely on this API (and I don't think they have
any reason to rely on the API).

Test Plan
=========

Added tests in `test_vmap.py` to check:
- by default, the fallback does not warn
- we can toggle whether the fallback warns or not

Test Plan: Imported from OSS

Reviewed By: pbelevich, anjali411

Differential Revision: D26126419

Pulled By: zou3519

fbshipit-source-id: 95a97f9b40dc7334f6335a112fcdc85dc03dcc73
2021-01-28 13:05:00 -08:00
f68e5f1dbf .github: Update stale messaging add newlines (#51298)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51298

newlines weren't being respected, so just add them through the `<br>` HTML
tag.

Also changes the wording for open source ones to designate that a
maintainer may be needed to unstale a particular PR.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D26131126

Pulled By: seemethere

fbshipit-source-id: 465bfc0ba4dc16a7a90e0c03c33d551184e35f5b
2021-01-28 12:39:29 -08:00
b028653670 Add missing -inf order for linalg.norm OpInfo (#51233)
Summary:
Follow up to https://github.com/pytorch/pytorch/issues/50746

I accidentally missed the `ord=-inf` case in the OpInfo for `torch.linalg.norm` when I wrote it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51233

Reviewed By: malfet

Differential Revision: D26117160

Pulled By: anjali411

fbshipit-source-id: af921c1d8004783612b3a477ae2025a82860ff4e
2021-01-28 12:33:00 -08:00
8b27c2ccca add missing VSX dispatches (#51217)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51217

Reviewed By: malfet

Differential Revision: D26120485

Pulled By: ezyang

fbshipit-source-id: d83384964f9980c9a921d0c7159f07e88025ea92
2021-01-28 12:17:50 -08:00
96cedefd8e [Pipe] Refactor convert_to_balance under non-test package. (#50860)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50860

Since fairscale.nn.Pipe still uses the 'balance' and 'devices' parameters,
other frameworks like fairseq still rely on them. As a result, the
`convert_to_balance` method is a convenient utility for migrating to PyTorch
Pipe without changing a lot of code in those frameworks.

In addition, I've renamed the method to better describe what it
does and added an optional devices parameter.
ghstack-source-id: 120430775

Test Plan:
1) waitforbuildbot
2) Tested with fairseq

Reviewed By: SciPioneer

Differential Revision: D25987273

fbshipit-source-id: dccd42cf1a74b08c876090d3a10a94911cc46dd8
2021-01-28 12:10:21 -08:00
cedfa4ccd8 Make DeviceCachingAllocator's error handling more defensive and a bit easier to read (#51158)
Summary:
^
Currently, `alloc_block`'s error handling has a couple of (IMO) minor flaws.  It might clear the error state even if the error had nothing to do with memory allocation. It might also clear the error state even if it didn't attempt a cudaMalloc, meaning it might clear an error state that came from some completely unrelated earlier CUDA call.

The diffs and comments are the best explanation of my preferred (new) error-checking policy.

The diffs add very little work to the common (successful, allocation satisfied by existing block) hot path.  Most of the additional logic occurs in `alloc_block`, which is a slow path anyway (it tries cudaMalloc).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51158

Reviewed By: malfet, heitorschueroff

Differential Revision: D26101515

Pulled By: ezyang

fbshipit-source-id: 6b447f1770974a04450376afd9726be87af83c48
2021-01-28 10:54:20 -08:00
33d5180684 [fx] improve args mutation error (#51175)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51175

gives a suggestion about how to deal with immutable args/kwargs list

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D26093478

Pulled By: zdevito

fbshipit-source-id: 832631c125561c3b343539e887c047f185060252
2021-01-28 10:19:38 -08:00
4288f08d30 Enable TensorPipe's CUDA GDR channel (#50763)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50763

ghstack-source-id: 120561489

Test Plan: Exported to GitHub

Reviewed By: mrshenli

Differential Revision: D25959672

fbshipit-source-id: b70f4b130806bf430869170bf4412697a6910275
2021-01-28 10:12:28 -08:00
cc211bb43e .github: Add workflow to stale pull requests (#51237)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51237

Marks pull requests stale at 150 days and then closes them at 180 days

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: stonks

Reviewed By: yns88

Differential Revision: D26112086

Pulled By: seemethere

fbshipit-source-id: c6b3865aa5cde3415b6dd6622c308895a16e805f
2021-01-28 09:37:55 -08:00
c9cebaf9b8 Enable TensorPipe's InfiniBand transport (#50761)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50761

ghstack-source-id: 120561368

Test Plan: Ran CI on GitHub

Reviewed By: mrshenli

Differential Revision: D25959502

fbshipit-source-id: 3d0a49546a6ac175608b677986d4344fbb1cf845
2021-01-28 08:43:32 -08:00
288b94a8ee [quant][fx] Make scale, zero_point buffers in the model, use FQN (for quantize_per_tensor ops) (#51171)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51171

Following up on the previous PR, this PR registers scale and zero_point for quantize_per_tensor
as buffers in the module.
Currently the dtype is still stored as an attribute (not registered as a buffer), since only tensor types can be registered.

Test Plan:
python test/test_quantization.py test_qparams_buffers

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D26092964

fbshipit-source-id: a54d914db7863402f2b5a3ba2c8ce8b27c18b47b
2021-01-28 08:35:46 -08:00
4c3f59b70e [quant][fx] Make scale, zero_point buffers in the model and use FQN (for quantized ops) (#51166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51166

Currently scale and zero_point values are stored as constants in the graph.
This prevents these values from being updated in the graph and also prevents saving
them to state_dict.

After this PR we store scale/zero_point values for quantized ops as buffers in the root module
and create get_attr nodes for them in the graph.

We also use the FQN of the module where the quantized ops are present to name these attributes so
that they can be uniquely identified and mapped to quantized ops.
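
Conceptually, the rewrite looks roughly like this (a hand-written sketch, not the actual pass; the module, buffer names, and FQN are illustrative):

```python
import torch
import torch.fx

class Sub(torch.nn.Module):
    def forward(self, x):
        return x + 1

m = torch.fx.symbolic_trace(Sub())

# Register qparams as buffers on the root module, named by the owning module's
# FQN, so they show up in state_dict() and can be updated later.
m.register_buffer("sub_scale_0", torch.tensor([0.05]))
m.register_buffer("sub_zero_point_0", torch.tensor([128]))

# Reference them via get_attr nodes instead of baking constants into the graph.
output_node = next(n for n in m.graph.nodes if n.op == "output")
with m.graph.inserting_before(output_node):
    scale = m.graph.create_node("get_attr", "sub_scale_0")
    zero_point = m.graph.create_node("get_attr", "sub_zero_point_0")
m.recompile()
```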

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_qparams_buffers

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D26092965

fbshipit-source-id: b549b2d3dccb45c5d38415ce95a09c26f5bd590b
2021-01-28 08:35:42 -08:00
096adf4b8b [quant][fx] Scope support for call_function in QuantizationTracer (#51086)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51086

Previously we only supported getting the scope for call_module and a custom qconfig dict for call_module.
This PR extends the Scope class to record the scope for all node types.
For call_function qconfig, if module_name is specified, it takes precedence over the function qconfig.

Test Plan:
python test/test_quantization.py test_qconfig_for_call_func

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D26077602

fbshipit-source-id: 99cdcdedde2280e51812db300e17d4e6d8f477d2
2021-01-28 08:32:24 -08:00
b955da3310 Adding correct error message for for..else (#51258)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51040

========
Add an error message for the for..else statement in TorchScript

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51258

Test Plan:
=====
pytest -k test_for_else test/test_jit.py

Reviewed By: pbelevich

Differential Revision: D26125148

Pulled By: nikithamalgifb

fbshipit-source-id: 82b67ab1c68e29312162ff5d73b82c8c0c9553df
2021-01-28 08:17:31 -08:00
7a8c64da4d [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D26122735

fbshipit-source-id: 0ff54a67192835c2daa331c1f13c252a96f494cb
2021-01-28 04:35:22 -08:00
0e8e739a9f Move AcceleratedGraphModule out of graph_manipulation. (#51220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51220

testing with OSS this time...

Reviewed By: jfix71

Differential Revision: D26105140

fbshipit-source-id: b4b7a8f0f4cc8f96f9f8b270277a71061d5e5e84
2021-01-28 02:39:12 -08:00
df07e1cea8 Automated submodule update: tensorpipe (#51203)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: 228f060ccc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51203

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D26099684

Pulled By: yns88

fbshipit-source-id: 0ff985e9d4914d0d00120f96d0f5ba77371f005c
2021-01-28 01:22:01 -08:00
392abde8e6 patch nvrtc API for cuda TK >= 11.1 (#50319)
Summary:
CUDA TK >= 11.1 provides a ptxjitcompiler that emits SASS instead of PTX.
1. This gives better backward compatibility, allowing a future TK to work with an older driver, which might not be able to load the generated PTX through JIT compilation and would error out at runtime;
https://docs.nvidia.com/deploy/cuda-compatibility/#using-ptx
2. Meanwhile, SASS doesn't provide good forward compatibility, so for unsupported architectures we fall back to PTX to support future devices.
https://docs.nvidia.com/deploy/cuda-compatibility/index.html#cubin-compatibility

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50319

Reviewed By: malfet

Differential Revision: D26114475

Pulled By: ngimel

fbshipit-source-id: 046e9e7b3312d910f499572608a0bc1fe53feef5
2021-01-27 23:58:20 -08:00
9fe7c0633f Add centered FFT example to fftshift docs (#51223)
Summary:
Closes https://github.com/pytorch/pytorch/issues/51022

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51223

Reviewed By: malfet

Differential Revision: D26110201

Pulled By: mruberry

fbshipit-source-id: c659c5dca30eda4b67ed6d931a93de9a33e72895
2021-01-27 23:50:48 -08:00
d035d56bfb [StaticRuntime] Add out variant for reshape and flatten (#51249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51249

- Add out variants for reshape and flatten. reshape and flatten only create tensor views when they can; in cases where they can't, they do a copy. The out variant reuses the TensorImpl in both cases. The difference is that the TensorImpl is a view in the first case, but a normal TensorImpl in the second case.
- Create a separate registry for the view ops with out variants. Because Tensor views can't participate in memory reuse (memonger), we need to track these ops separately.
- The MemoryPlanner does not track the StorageImpl of tensor views because they don't own the storage, however, in cases where reshape does not create a view, the MemoryPlanner does manage the output tensor.

Reviewed By: ajyu

Differential Revision: D25992202

fbshipit-source-id: dadd63b78088c129e491d78abaf8b33d8303ca0d
2021-01-27 22:44:11 -08:00
16132a4b1d Make sure ConstantPadNd op preserves memory format (#50898)
Summary:
* ConstantPadNd op didn't preserve memory format for non-quantized cases (minimal repro below)
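
A minimal repro of the now-preserved behavior (illustrative; `torch.nn.functional.pad` lowers to constant_pad_nd):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 4, 4).to(memory_format=torch.channels_last)
y = F.pad(x, (1, 1, 1, 1), mode="constant", value=0.0)
print(y.is_contiguous(memory_format=torch.channels_last))  # True after this fix
```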

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50898

Test Plan: pytest test/test_nn.py::TestConstPadNd

Reviewed By: kimishpatel

Differential Revision: D26003407

Pulled By: axitkhurana

fbshipit-source-id: a8b56d32734772acae6f5c2af4dfe0bd3434cab1
2021-01-27 22:36:44 -08:00
52ab858f07 STFT: Improve error message when window is on wrong device (#51128)
Summary:
Closes https://github.com/pytorch/pytorch/issues/51042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51128

Reviewed By: mruberry

Differential Revision: D26108998

Pulled By: ngimel

fbshipit-source-id: 1166c19c2ef6846e29b16c1aa06cb5c1ce3ccb0d
2021-01-27 22:31:57 -08:00
83287a6f2b [pytorch] change codegen dispatch key from string to enum (#51115)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51115

Add enum type for dispatch key. Prepare to implement the DispatchTable
computation logic in python for static dispatch.

Verified byte-for-byte compatibility of the codegen output.

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D26077430

Pulled By: ljk53

fbshipit-source-id: 86e74f3eb32266f31622a2ff6350b91668c8ff42
2021-01-27 22:28:52 -08:00
773c71cb3a [aten] Fix type check bug in bmm_out_or_baddbmm_ (#51248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51248

Fix bug raised in https://github.com/pytorch/pytorch/issues/50980 by adding a dtype check back to bmm_out_or_baddbmm_.

Test Plan:
```
buck test //caffe2/test:linalg
buck test //caffe2/aten:math_kernel_test
```

Reviewed By: ngimel

Differential Revision: D26113575

fbshipit-source-id: 0d6e03eae70822f8ceeffefd915aee01030304ce
2021-01-27 21:51:04 -08:00
88baf470d1 [JIT] Provide more info when attribute fails to convert (#50870)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50870

**Summary**
Module attributes whose types cannot be determined based on annotations
or inference based on their values at script time are added to the
concrete type of the corresponding module as "failed attributes". Any
attempt to access them in scripted code produces an error with a message
explaining that the attribute could not be converted to a
corresponding attribute on the TorchScript module. However, this error
is not more specific than that.

This commit modifies `infer_type` in `_recursive.py` so that it returns
`c10::InferredType` instead, which allows more information about typing
failures to be communicated to the caller through the `reason()` method
on this class. This information is appended to the hint added to the
module concrete type for failed attributes.
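
For context, a toy module that trips this path (a sketch; the exact error text varies):

```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # A plain Python list of Modules cannot be typed by script,
        # so it becomes a "failed attribute" on the concrete type.
        self.layers = [torch.nn.Linear(2, 2)]

    def forward(self, x):
        # Accessing the failed attribute now also reports *why* type
        # inference failed, via the reason() on the inferred type.
        return self.layers[0](x)

# torch.jit.script(M())  # raises, with the inference failure reason in the hint
```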

**Testing**
This commit adds a unit test to `test_module_containers.py` that checks
that extra information is provided about the reason for the failure
when a module attribute consisting of a list of `torch.nn.Module` fails to convert.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26091472

Pulled By: SplitInfinity

fbshipit-source-id: fcad6588b937520f250587f3d9e005662eb9af0d
2021-01-27 20:37:10 -08:00
12a434abbc Revert D26077905: Back out "Revert D25850783: Add torch::deploy, an embedded torch-python interpreter"
Test Plan: revert-hammer

Differential Revision:
D26077905 (dc2a44c4fc)

Original commit changeset: fae83bf9822d

fbshipit-source-id: b70185916502ba9ebe16d781cf0659b9f7865c9a
2021-01-27 19:53:29 -08:00
dfdb1547b9 Revert D26094906: Add serialization logic for complex numbers
Test Plan: revert-hammer

Differential Revision:
D26094906 (2de4ecd4eb)

Original commit changeset: 7b2614f3ee4a

fbshipit-source-id: 6f32a9fc6bb2a904ca1a282bbc6b2df0aee50068
2021-01-27 19:44:26 -08:00
0335222a4a memory efficient fq: use it everywhere, delete the old version (#51159)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51159

This PR is the cleanup after #50561. At a high level, we make the new
definition of fake_quant the definition used by autograd, but keep the old
function around as a thin wrapper to keep the user-facing API the same.

In detail:
1. point `fake_quantize_per_tensor_affine`'s implementation to be `fake_quantize_per_tensor_affine_cachemask`
2. delete the `fake_quantize_per_tensor_affine` backward, autograd will automatically use the cachemask backward
3. delete all the `fake_quantize_per_tensor_affine` kernels, since they are no longer used by anything

Test Plan:
```
python test/test_quantization.py TestFakeQuantize
```

performance testing was done in the previous PR.

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D26090869

fbshipit-source-id: fda042881f77a993a9d15dafabea7cfaf9dc7c9c
2021-01-27 19:39:05 -08:00
983b8e6b62 fake_quant: add a more memory efficient version (#50561)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50561

Not for review yet, a bunch of TODOs need finalizing.

tl;dr: add an alternative implementation of `fake_quantize` which saves
a mask during the forward pass and uses it to calculate the backward.

There are two benefits:

1. the backward function no longer needs the input Tensor, and it can be
gc'ed earlier by autograd.  On MobileNetV2, this reduces QAT overhead
by ~15% (TODO: link, and absolute numbers).  We add an additional mask Tensor
to pass around, but its size is 4x smaller than the input tensor. A
future optimization would be to pack the mask bitwise and unpack in the
backward.

2. the computation of `qval` can be done only once in the forward and
reused in the backward. No perf change observed; TODO: verify with better
metrics.
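
The idea in a minimal standalone sketch (a Python autograd.Function with a cached boolean mask, not the actual C++/CUDA kernels):

```python
import torch

class FakeQuantCachemask(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale, zero_point, qmin, qmax):
        q = torch.round(x / scale) + zero_point
        mask = (q >= qmin) & (q <= qmax)   # bool mask: 4x smaller than float x
        ctx.save_for_backward(mask)        # x itself is not needed for backward
        return (torch.clamp(q, qmin, qmax) - zero_point) * scale

    @staticmethod
    def backward(ctx, grad_out):
        (mask,) = ctx.saved_tensors
        # straight-through estimator: gradient flows only where x was not clamped
        return grad_out * mask, None, None, None, None

x = torch.randn(8, requires_grad=True)
y = FakeQuantCachemask.apply(x, 0.1, 0, -128, 127)
y.sum().backward()
```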

TODO: describe in more detail

Test Plan:
OSS / torchvision / MobileNetV2
```
python references/classification/train_quantization.py
  --print-freq 1
  --data-path /data/local/packages/ai-group.imagenet-256-smallest-side/prod/
  --output-dir ~/nfs/pytorch_vision_tests/
  --backend qnnpack
  --epochs 5
TODO paste results here
```

TODO more

Imported from OSS

Reviewed By: ngimel

Differential Revision: D25918519

fbshipit-source-id: ec544ca063f984de0f765bf833f205c99d6c18b6
2021-01-27 19:36:04 -08:00
d14d8c7f7f Add convenience import (#51195)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51195

Add kineto_available to torch.profiler

Test Plan:
>>> import torch.profiler
>>> torch.profiler.kineto_available()
True

Reviewed By: ngimel

Differential Revision: D26113906

Pulled By: ilia-cher

fbshipit-source-id: fe4502d29d10d8bd9459b0504aa0ee856af43acc
2021-01-27 19:23:50 -08:00
ea0d304e2e Rewrite "ProfilerStep#<num>" in profiler output (#51194)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51194

Aggregate all "ProfilerStep#<num>" together

Test Plan:
python test/test_profiler.py -k
test_kineto_profiler_api

Reviewed By: ngimel

Differential Revision: D26113907

Pulled By: ilia-cher

fbshipit-source-id: 2bc803befc85153f07e770ea3c37b57e2870a1ba
2021-01-27 19:23:46 -08:00
4fb33f1d3a Trim profiler file paths (#51192)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51192

Trim profiler file paths when using stack traces

Test Plan:
python test/test_profiler.py -k test_source
```
			       SumBackward0         0.02%       6.000us         0.51%     154.000us     154.000us             1  test/test_profiler.py(91): test_source
																 ...conda3/envs/pytorch/lib/python3.8/unittest/case.py(633): _callTestMethod
																 ...r/local/miniconda3/envs/pytorch/lib/python3.8/unittest/case.py(676): run
																 ...al/miniconda3/envs/pytorch/lib/python3.8/unittest/case.py(736): __call__
																 .../local/miniconda3/envs/pytorch/lib/python3.8/unittest/suite.py(122): run
```

Reviewed By: ngimel

Differential Revision: D26113905

Pulled By: ilia-cher

fbshipit-source-id: 2b71c31b6c4437855d33013d42d977745e6f489f
2021-01-27 19:12:27 -08:00
e2eb97dd76 [ONNX] Fix param names (#50764) (#50955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50955

Preserve name of parameters for ONNX.

Looks like the output->copyMetadata(input) API gives the same debugName to the output, so the name of the original input is changed. This update avoids the name change by copying only the type.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26050880

Pulled By: SplitInfinity

fbshipit-source-id: 8b04e41e6df7f33c5c9c873fb323c21462fc125b
2021-01-27 17:49:11 -08:00
84e9bff85d [ONNX] Replace optional parameters of Resize with placeholder for ops13. (#50574) (#50954)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50954

* Replace optional parameters of Resize with placeholder for ops13.

* Use common methods to handle different versions.

* Correct flake8 issue.

* Update per comments.

* Add something to trigger CI again.

* Trigger another round of CI.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26050882

Pulled By: SplitInfinity

fbshipit-source-id: aea6205a1ba4a0621fe1ac9e0c7d94b92b6d8f21
2021-01-27 17:49:07 -08:00
68034197e8 [ONNX] Support gelu for fp16 export (#50487) (#50911)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50911

Need to replace dtype of export created scalars from float to double. (In torch implicit conversion logic, python numbers are double)

Test case skipped in CI due to that current CI job env does not have CUDA support.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26050889

Pulled By: SplitInfinity

fbshipit-source-id: 1fdde23a68d4793e6b9a82840acc213e5c3aa760
2021-01-27 17:49:02 -08:00
70dcfe2991 [ONNX] Enable _jit_pass_onnx_fold_if only when dynamic_axes is None (#50582) (#50910)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50910

Fixing pytorch/vision#3251 (PR #49410 triggers a torchvision test build failure on three tests: test_faster_rcnn, test_mask_rcnn, test_keypoint_rcnn.)

The offending PR is fine on PyTorch UT because there is a gap between the torchvision and PyTorch tests when we merge them - we are using different test APIs on the two sides, which causes some discrepancy.

This PR bridges the gap for the above three tests, and disables the _jit_pass_onnx_fold_if pass until it gets fixed.
Allow _jit_pass_onnx_fold_if only when dynamic_axes is None.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26050886

Pulled By: SplitInfinity

fbshipit-source-id: b765ffe30914261866dcc761f0d0999fd16169e3
2021-01-27 17:48:58 -08:00
e90a480d40 [ONNX] Add logical_and, logical_or, logical_xor torch op support in pytorch exporter (#50570) (#50909)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50909

Fixes #{}
Add logical_and, logical_or, logical_xor torch op support in pytorch exporter.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26050884

Pulled By: SplitInfinity

fbshipit-source-id: 2db564e9726c18a3477f9268a0ff862cd2c40e4d
2021-01-27 17:48:53 -08:00
b308fb78d1 [ONNX] Add binary_cross_entropy_with_logits op to ONNX opset version 12 (#49675) (#50908)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50908

Fixes #47997
Exporting the operator binary_cross_entropy_with_logits to ONNX opset version 12.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26050885

Pulled By: SplitInfinity

fbshipit-source-id: e4167895eed804739aa50481679500a4d564b360
2021-01-27 17:48:49 -08:00
1723ab53c4 [ONNX] Update Reducesum operator for opset 13 (#50532) (#50907)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50907

* update symbolic for squeeze/unsqueeze

* update c++ unsqueeze/squeeze creation

* clang format

* enable tests

* clang format

* remove prints

* remove magic number

* add helper function

* fix build issue

* update opset9 symbolic with helper function

* fix utility test

* fix prim_fallthrough opset skip

* enable reducesum opset 13

* enable embedding_bag which contain reducesum op

* add ReduceSum helper

* remove block_listed_operators

* remove local test code

* remove embedding_bag() in opset13 file

* remove unused import

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26050888

Pulled By: SplitInfinity

fbshipit-source-id: 88307af6a7880abf94eac126ec1638e962de8c1f

Co-authored-by: BowenBao <bowbao@microsoft.com>
Co-authored-by: hwangdeyu <deyhuang@qq.com>
2021-01-27 17:48:45 -08:00
7e4c956955 [ONNX] Support opset13 Squeeze and Unsqueeze (#50150) (#50906)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50906

In opset 13, squeeze/unsqueeze are updated to take axes as an input instead of as an attribute.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26050883

Pulled By: SplitInfinity

fbshipit-source-id: 7b5faf0e016d476bc75cbf2bfee6918d77e8aecd
2021-01-27 17:48:40 -08:00
1c9347c666 [ONNX] Use parameter values in onnx shape inference (#49706) (#50905)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50905

Adds an additional run of ONNX shape inference after constant folding, since initializers may have changed, affecting shape inference.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26050881

Pulled By: SplitInfinity

fbshipit-source-id: 9e5d69c52b647133cd3a0781988e2ad1d1a9c09d
2021-01-27 17:45:32 -08:00
dc2a44c4fc Back out "Revert D25850783: Add torch::deploy, an embedded torch-python interpreter" (#51124)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51124

Original commit changeset: 1c7133627da2

Test Plan: Test locally with interpreter_test and on CI

Reviewed By: suo

Differential Revision: D26077905

fbshipit-source-id: fae83bf9822d79e9a9b5641bc5191a7f3fdea78d
2021-01-27 16:49:42 -08:00
e975169426 [TensorExpr] Redesign Tensor class. (#50995)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50995

This change makes 'Tensor' a thin wrapper over 'Buf' and 'Stmt', and
merges it with the recently introduced 'CompoundTensor'. A statement for the
tensor is either passed directly to the Tensor constructor (akin to
'CompoundTensor'), or is built immediately in the constructor.

LoopNest is no longer responsible for constructing statements from
tensors - it simply stitches already constructed statements contained in
Tensors. This has a side effect that now we cannot construct several
loopnests from the same tensors - we need to explicitly clone statements
if we want to do that. A special copy constructor was added to LoopNest
to make it more convenient (note: this only affects tests, we don't
usually create multiple loopnests in other places).

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D26038223

Pulled By: ZolotukhinM

fbshipit-source-id: 27a2e5900437cfb0c151e8f89815edec53608e17
2021-01-27 16:14:22 -08:00
b804084428 [TensorExpr] Move 'lowerToStmt' method from 'LoopNest' to 'Tensor'. (#50994)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50994

Eventually, 'Tensor' will be fully responsible for its 'Stmt' and moving
this method to it is one step in that direction.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D26038222

Pulled By: ZolotukhinM

fbshipit-source-id: 0549f0ae6b46a93ff7608a22e79faa5115eef661
2021-01-27 16:14:18 -08:00
42aeb68128 [TensorExpr] Move 'initializer' field from 'Tensor' to 'Buf'. (#50993)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50993

This is the first step toward making 'Tensor' a thin wrapper over 'Buf' and
'Stmt', which will be finished in subsequent PRs. This change also
allows removing 'buf_initializers_' from 'LoopNest', making it "less
stateful".

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D26038224

Pulled By: ZolotukhinM

fbshipit-source-id: f418816e54c62f291fa45812901487394e9b95b5
2021-01-27 16:10:53 -08:00
3f23ad5bce [Bug] fix for module_has_exports (#50680)
Summary:
The attributes in `dir(mod)` may not be valid; calling `getattr` on them can throw an error.
Use `hasattr` to test whether an attribute is valid.

Here is an example:
```python
class A:
    def __init__(self, x):
        if x:
            self._attr = 1

    @property
    def val(self):
        return getattr(self, '_attr')

a = A(False)
print('val' in dir(a))
print(hasattr(a, 'val'))

b = A(True)
print('val' in dir(b))
print(hasattr(b, 'val'))
```

And the outputs:
```
True
False
True
True
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50680

Reviewed By: malfet

Differential Revision: D26103975

Pulled By: eellison

fbshipit-source-id: 67a799afe7d726153c91654d483937c5e198ba94
2021-01-27 16:03:24 -08:00
1321f2bfe6 [PyTorch] Port Caffe2 opti for BatchMatMul batch size 1 to baddbmm (#51057)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51057

Caffe2 has an
[optimization](f8eefbdf7a/caffe2/operators/batch_matmul_op.h (L192))
for the case where the batch size is 1 that uses the underlying `gemm`
instead of `gemm_batched` BLAS function. This diff tries to port that
optimization to `baddbmm_mkl`.

Note that I have very little linear algebra background and am just
going off existing code and cblas API documentation, so please
review without assuming I know what I'm doing with the math itself.
ghstack-source-id: 120342923

Reviewed By: hlu1

Differential Revision: D26056613

fbshipit-source-id: feef80344b96601fc2bd0a2e8c8f6b57510d7856
2021-01-27 15:59:57 -08:00
98d9a6317d Rename profile.next_step() to profile.step() to consistent with optimizer.step() (#51032)
Summary:
Similar to Optimizer.step(), profile.next_step() occurs every iteration and is called at the end of each iteration, so it's better to give them the same naming style.
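
After the rename, a training loop reads naturally alongside the optimizer (a sketch; the model and loop are placeholders):

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

with torch.profiler.profile() as prof:
    for _ in range(3):
        loss = model(torch.randn(4, 10)).sum()
        loss.backward()
        opt.step()
        opt.zero_grad()
        prof.step()  # renamed from prof.next_step(), mirroring opt.step()
```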

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51032

Reviewed By: heitorschueroff

Differential Revision: D26097847

Pulled By: ilia-cher

fbshipit-source-id: ea2e5c8e865d99f90b004ec7797271217efeeb68
2021-01-27 15:52:58 -08:00
621198978a Move USE_NUMPY to more appropriate targets (#51143)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51143

Test Plan: CI

Reviewed By: wconstab

Differential Revision: D26084123

fbshipit-source-id: af4abe4ef87c1ebe5434938320526a925f5c34c8
2021-01-27 15:44:12 -08:00
2de4ecd4eb Add serialization logic for complex numbers (#50885)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50885

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D26094906

Pulled By: anjali411

fbshipit-source-id: 7b2614f3ee4a30c4b4cf04aaa3432988b38a0721
2021-01-27 15:19:36 -08:00
3b6f30824c OpInfo JIT op.output_func handling support (#50775)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50775

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D25964541

Pulled By: Lilyjjo

fbshipit-source-id: 8cf1ee9191d526cc46ae283f38c2d64bd60afdb2
2021-01-27 15:04:23 -08:00
eaf5ca09dc Migrate masked_scatter_ CUDA to ATen (#50039)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49542

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50039

Reviewed By: heitorschueroff

Differential Revision: D26096247

Pulled By: ngimel

fbshipit-source-id: ec1810d3412e0d7ab6b950265a3123519ad886c1
2021-01-27 14:17:02 -08:00
1c8d11c9e2 [PyTorch] Save a refcount bump in make_variable (#51180)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51180

This fast path still did a refcount bump because it copied the inner intrusive_ptr to the stack. Now it's moved.
ghstack-source-id: 120460258

Test Plan:
1) profile empty benchmark & inspect assembly to verify move
2) run framework overhead benchmarks

Reviewed By: bhosmer

Differential Revision: D26094951

fbshipit-source-id: b2e09f9ad885cb633402885ca1e61a370723f6b8
2021-01-27 14:09:30 -08:00
f7e90cf311 Revert D26089965: [quant][graphmode][fx] Add support for functional conv1d and conv3d
Test Plan: revert-hammer

Differential Revision:
D26089965 (dd1a97b3ae)

Original commit changeset: 4aea507d05b7

fbshipit-source-id: f54184cafb9dd07858683489d8bd147474e7e4b3
2021-01-27 13:27:10 -08:00
40eea6d9d1 Support device map for distributed autograd while using TensorPipe. (#44859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44859

TensorPipe's `set_device_map` option was applied during the forward
pass. However, if we ran the backward pass for the graph, we would not
automatically pick up the reverse device mapping.

As a result, users had to specify both forward and backward device mappings,
which is very tedious to do.

In this PR, I've added this functionality such that TensorPipe automatically
picks up the reverse device mapping during the backward pass. This is done by
storing the appropriate device mapping in the "recv" autograd function for
distributed autograd.
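
For reference, only the forward mapping needs to be configured now; the reverse mapping for the backward pass is derived automatically (a sketch, not the unit test):

```python
from torch.distributed import rpc

opts = rpc.TensorPipeRpcBackendOptions()
opts.set_device_map("worker1", {0: 1})  # forward: our cuda:0 -> worker1's cuda:1
# The reverse map (worker1's cuda:1 -> our cuda:0) is now picked up
# automatically during the distributed backward pass.
rpc.init_rpc("worker0", rank=0, world_size=2, rpc_backend_options=opts)
```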

Closes: https://github.com/pytorch/pytorch/issues/44170
ghstack-source-id: 119950842

Test Plan:
1) waitforbuildbot
2) Unit test added.

Reviewed By: mrshenli

Differential Revision: D23751975

fbshipit-source-id: 2717d0ef5bde3db029a6172d98aad95734d52140
2021-01-27 13:01:44 -08:00
6d098095eb [numpy] torch.lgamma: promote integer inputs to float (#50140)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515
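
In other words (illustrative):

```python
import torch

t = torch.tensor([1, 2, 3])  # int64 input
print(torch.lgamma(t))       # now promotes to float instead of erroring
# tensor([0.0000, 0.0000, 0.6931])
```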

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50140

Reviewed By: mrshenli

Differential Revision: D25951094

Pulled By: mruberry

fbshipit-source-id: e53f1dbddff889710f05d43dbc9587382d3decb0
2021-01-27 12:08:46 -08:00
dd1a97b3ae [quant][graphmode][fx] Add support for functional conv1d and conv3d (#51155)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51155

This PR added support for quantizing functional conv1d, conv3d, conv1d_relu, and conv3d_relu

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_functional_conv

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26089965

fbshipit-source-id: 4aea507d05b744807e993f6d3711ab308fb7591b
2021-01-27 12:00:35 -08:00
1b7a4f9cde .github: Add GitHub Actions workflow to build wheels (#50633)
Summary:
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50633

Reviewed By: samestep

Differential Revision: D26083492

Pulled By: seemethere

fbshipit-source-id: c133671b9cf5074539133ee79fca5c680793a85d
2021-01-27 11:52:28 -08:00
b77f72b5a0 Enable TensorPipe's SHM transport (#50760)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50760

The SHM transport uses shared-memory-backed ringbuffers to transfer small payloads between processes on the same machine.

It was disabled in v1.6 due to a CMake mishap but we've since realized that it also doesn't work that well in docker and other setups. Enabling it here to see whether CircleCI fails.
ghstack-source-id: 120470890

Test Plan: Exported three times to CircleCI with tests consistently passing

Reviewed By: mrshenli

Differential Revision: D23814828

fbshipit-source-id: f355cb6515776debad536924de4f4d3fbb05a874
2021-01-27 11:45:09 -08:00
d3ec204ef2 [quant][graphmode][fx] Add functional conv2d + relu (#51079)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51079

Added support for functional conv2d + relu; conv1d and conv3d will be added in a future PR

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_functional_conv

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26089964

fbshipit-source-id: 8703de17de1469f7076651c386c83fb5922a56eb
2021-01-27 11:20:55 -08:00
00adc7b07f Fix more JIT tests under Python-3.9 (#51182)
Summary:
Mostly replaces `global Foo` with `make_global(Foo)`.
The only real fix is generating the Subscript annotation, which is a follow-up from https://github.com/pytorch/pytorch/pull/48676

Fixes https://github.com/pytorch/pytorch/issues/49617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51182

Reviewed By: gmagogsfm

Differential Revision: D26095244

Pulled By: malfet

fbshipit-source-id: 0e043d9a2cf43fff71dfbb341f708cd7af87c39a
2021-01-27 10:57:03 -08:00
9b6d463704 Move std and var tests to OpInfos (#50901)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50901

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D26083289

Pulled By: mruberry

fbshipit-source-id: 7e14ff37bba46dd456e0bc0aa9c4e0a632d0734c
2021-01-27 10:50:51 -08:00
e9ffad088f numeric suite: add types to eager (#51168)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51168

Adds types to function I/O for numeric suite.  This is for readability
and static type checking with mypy.

Test Plan:
```
mypy torch/quantization/
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D26092454

fbshipit-source-id: d37cf61e4d9604f4bc550b392f55fb59165f7624
2021-01-27 10:40:49 -08:00
16dd5ca8ab Followup of kron PR (#51045)
Summary:
Followup of https://github.com/pytorch/pytorch/pull/50927

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51045

Reviewed By: mruberry

Differential Revision: D26089204

Pulled By: ngimel

fbshipit-source-id: 77291dd83fba32d6f80a8540910b112a1d85a892
2021-01-27 10:33:05 -08:00
4a2aa0f5f1 index_put_ for complex tensors on CUDA (#51148)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51148

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26102025

Pulled By: anjali411

fbshipit-source-id: b1b6fd12fda03c4520a3c3200226edf352496188
2021-01-27 09:11:37 -08:00
0b5303e833 Propagate CreationMeta when chaining views (#51061)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49824

## Background

When creating a view of a view, there was a possibility that the new view would be less restrictive than the previous view, incorrectly sidestepping the error that should be thrown when using in-place operations on the new view.

The fix addresses this by propagating `CreationMeta` from the previous view to the new view. Currently, the old view's `creation_meta` is only propagated when the new view's `creation_meta == CreationMeta::DEFAULT`. This ensures that the new view is not less restrictive than the previous view wrt. allowing in-place operations.
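
A concrete instance of the scenario (illustrative):

```python
import torch

x = torch.randn(2, 3, requires_grad=True)
v = x.unbind(0)[0]  # view produced by a multi-output op: in-place is forbidden
w = v[0:1]          # a view of that view must inherit the same restriction
w.add_(1)           # now raises a RuntimeError instead of silently succeeding
```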

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51061

Test Plan:
```
python test/test_autograd.py TestAutogradDeviceTypeCPU.test_inplace_view_of_multiple_output_view_cpu
python test/test_autograd.py TestAutogradDeviceTypeCUDA.test_inplace_view_of_multiple_output_view_cuda
python test/test_autograd.py TestAutogradDeviceTypeCPU.test_inplace_multiple_output_view_of_view_cpu
python test/test_autograd.py TestAutogradDeviceTypeCUDA.test_inplace_multiple_output_view_of_view_cuda
```

Reviewed By: heitorschueroff

Differential Revision: D26076434

Pulled By: jbschlosser

fbshipit-source-id: c47f0ddcef9b8449427b671aff9ad08edca70fcd
2021-01-27 09:00:51 -08:00
5ec2e26310 DOC, BLD: make the python docs build failures print a nicer message (#50356)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50330

- Encapsulate the `make html` call and capture the stdout/stderr with a `tee` command
- If the build fails, print out the `WARNING:` lines of the build log and finish up with a message

I tried it out on my branch, but did not write a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50356

Reviewed By: ezyang

Differential Revision: D26101762

Pulled By: brianjo

fbshipit-source-id: ba2b704d3244ef5139ca9026c5250537bf45734f
2021-01-27 07:41:00 -08:00
22ac4f3c59 Add vectorize flag to torch.autograd.functional.{jacobian, hessian} (#50915)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50915

Fixes #50584
Add a vectorize flag to torch.autograd.functional.jacobian and
torch.autograd.functional.hessian (default: False). Under the hood, the
vectorize flag uses vmap as the backend to compute the jacobian and
hessian, respectively, providing speedups to users.
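
Usage (a sketch):

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    return (x ** 2).sum(dim=1)

x = torch.randn(5, 3)
J_loop = jacobian(f, x)                  # one backward pass per output element
J_vmap = jacobian(f, x, vectorize=True)  # a single vmap-batched backward pass
print(torch.allclose(J_loop, J_vmap))    # True
```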

Test Plan:
- I updated all of the jacobian and hessian tests to also use
vectorized=True
- I added some simple sanity check tests that check e.g. jacobian with
vectorized=False vs
jacobian with vectorized=True.
- The mechanism for vectorized=True goes through batched gradient
computation. We have separate tests for those (see other PRs in this
stack).

Reviewed By: heitorschueroff

Differential Revision: D26057674

Pulled By: zou3519

fbshipit-source-id: a8ae7ca0d2028ffb478abd1b377f5b49ee39e4a1
2021-01-27 07:32:30 -08:00
fd9a85d21b Doc update for complex numbers (#51129)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51129

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26094947

Pulled By: anjali411

fbshipit-source-id: 4e1cdf8915a8c6a86ac3462685cdce881e1bcffa
2021-01-27 07:32:26 -08:00
ada916675f update HistogramObserver to be scriptable (#51081)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51081

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51001

fix tests in TestQuantizeJitOps

Test Plan:
Imported from OSS
python test/test_quantization.py

Reviewed By: raghuramank100

Differential Revision: D26038759

Pulled By: lyoka

fbshipit-source-id: 0977ba7b8b26a9f654f20f5c698a7a20ec078c35
2021-01-27 07:27:03 -08:00
0a4bc72890 [ROCm] work around compiler issue for IGammaKernel.cu (#50970)
Summary:
Add const to static variable inside `__host__ __device__` function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50970

Reviewed By: izdeby

Differential Revision: D26081478

Pulled By: heitorschueroff

fbshipit-source-id: 77cf145f7e0570359aa00aec4c8b82c950815f81
2021-01-27 07:22:53 -08:00
b60494000b DOC: update left navbar links for vision and text (#51103)
Summary:
A tiny PR to update the links in the left-hand navbar under Libraries. The canonical links for vision and text are `https://pytorch.org/vision/stable` and `https://pytorch.org/text/stable`, respectively. The links without `/stable` work via a redirect; this is cleaner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51103

Reviewed By: izdeby

Differential Revision: D26079760

Pulled By: heitorschueroff

fbshipit-source-id: df1fa64d7895831f4e6242445bae02c1faa5e4dc
2021-01-27 07:19:00 -08:00
7b85adf20f Add back pycuda.autoinit to test_pt_onnx_trt (#51106)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51105 by adding back the `import pycuda.autoinit`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51106

Reviewed By: mingzhe09088

Differential Revision: D26086808

Pulled By: heitorschueroff

fbshipit-source-id: 88d98796c87a44cedaa1f6666e9f71a424293641
2021-01-27 07:10:11 -08:00
1935880860 [PyTorch] Remove unnecessary dispatcher.h include in torch/library.h (#51162)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51162

It's unused.
ghstack-source-id: 120427120

Test Plan: CI

Reviewed By: bhosmer

Differential Revision: D25859010

fbshipit-source-id: 7bb21312843debaedaa6a969727c171b2bb0e6b2
2021-01-26 22:19:32 -08:00
42929e573a add missing return statement to inlined vec_signed (#51116)
Summary:
Fixes #{issue number}
This is not really a new issue, just a proposed minor fix to a recent previous issue (now closed) https://github.com/pytorch/pytorch/issues/50640 which was a fix for https://github.com/pytorch/pytorch/issues/50439.

That fix added inlining for vec_signed (and others) but in one case the return was accidentally omitted.  This results in a build error:
```
                 from ../aten/src/ATen/cpu/vec256/vec256.h:19,
                 from aten/src/ATen/native/cpu/FillKernel.cpp.VSX.cpp:3:
../aten/src/ATen/cpu/vec256/vsx/vsx_helpers.h: In function ‘vint32 vec_signed(const vfloat32&)’:
../aten/src/ATen/cpu/vec256/vsx/vsx_helpers.h:33:1: error: no return statement in function returning non-void [-Werror=return-type]

I've confirmed that the error disappears after this one-line fix.  (Note: There is another issue encountered later in the build unrelated to this particular fix, as I noted in a separate comment in the original issue.  I'm trying to make some sense of that one, but in any event it would be a subject for another issue/PR).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51116

Reviewed By: heitorschueroff

Differential Revision: D26078213

Pulled By: malfet

fbshipit-source-id: 59b2ee19138fa1b8d8ec1d35ca4a5ef0a67bc123
2021-01-26 20:16:18 -08:00
ba316a7612 Fix TF32 failures in test_linalg.py (#50453)
Summary:
On Ampere GPUs, matmuls are computed with TF32 by default when the dtype is `torch.float`: https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices, which reduces the precision of the results. However, linear algebra usually needs higher precision, so many tests in `test_linalg.py` fail on Ampere GPUs because of precision issues.

To fix this issue:
- Most linear algebra methods, except for matmuls, should add `NoTF32Guard`
- Expected results in unit tests should compute matmuls using NumPy instead of PyTorch CUDA (see the sketch below).
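
The user-facing knob behind the guard, for reference (a sketch):

```python
import torch

# Tests needing full float32 precision (most of linalg) should run with
# TF32 matmuls disabled on Ampere:
torch.backends.cuda.matmul.allow_tf32 = False
```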

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50453

Reviewed By: glaringlee

Differential Revision: D26023005

Pulled By: ngimel

fbshipit-source-id: f0ea533494fee322b07925565b57e3b0db2570c5
2021-01-26 19:51:20 -08:00
b6eaca9f1f Add type annotation logic for complex numbers (#50884)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50884

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D26086963

fbshipit-source-id: f103f7f529d63d701c4f17862e30eafbab7d0c68
2021-01-26 19:39:35 -08:00
e2041ce354 Fix docstring to clarify logits usage for multiclass case (#51053)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50378.

Additionally, this has some minor fixes:
 - [x] Fix mean for half-cauchy to return `inf` instead of `nan`.
 - [x] Fix constraints/support for the relaxed categorical distribution.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51053

Reviewed By: heitorschueroff

Differential Revision: D26077966

Pulled By: neerajprad

fbshipit-source-id: ca0213baa9bbdbc661aebbb901ab5e7fded38a5f
2021-01-26 17:01:39 -08:00
221d7d99e1 [torch vitals] move into namespace and fix windows tests
Summary:
as in title

resolves D25791248 (069602e028)

Test Plan: buck test //caffe2/aten:vitals

Reviewed By: EscapeZero, malfet

Differential Revision: D26090442

fbshipit-source-id: 07937f246ec0a6eb338d21208ada61758237ae42
2021-01-26 16:50:45 -08:00
3cc14a0dff [p2c2] Add support for Int8FCPackWeight in model transformation
Summary:
In order to enable FC int8 quantization in P2C2, we are trying to run the caffe2 op Int8FCPackWeight in the model transformation pipeline.

The net is being generated from the python side, and passed back into C++ and run here: https://fburl.com/diffusion/3zt1mp03,  with these dependencies included: https://fburl.com/diffusion/rdjtdtcf

However, when the net is executed, it errors out with:
```
Cannot create operator of type 'Int8FCPackWeight' on the device 'CPU'
```

This diff attempts to fix this issue.

Test Plan:
To reproduce, just run this test without this diff:
```
buck test //aiplatform/modelstore/transformation/tests:pyper_to_caffe2_dispatcher_test
```

Reviewed By: jspark1105

Differential Revision: D25965167

fbshipit-source-id: a7414669abb8731177c14e8792de58f400970732
2021-01-26 16:35:23 -08:00
345844d9d8 test, fix deepcopy of tensor with grad (#50663)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/3307

Previously, `self.grad` was not ~cloned~ deepcopied to the returned tensor in `deepcopy`. Added a test and an implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50663

Reviewed By: heitorschueroff

Differential Revision: D26074811

Pulled By: albanD

fbshipit-source-id: 536dad36415f1d03714b4ce57453f406ad802b8c
2021-01-26 16:19:53 -08:00
97ea95ddd7 Delete tabs from bench_approx.cpp (#51157)
Summary:
Introduced by D25981260 (f08464f31d)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51157

Reviewed By: bwasti

Differential Revision: D26090008

Pulled By: malfet

fbshipit-source-id: b63f1bb1683c7261902de7eaab24a05a5159ce7e
2021-01-26 15:53:47 -08:00
57484103be Revert D25675618: Move AcceleratedGraphModule out of graph_manipulation.
Test Plan: revert-hammer

Differential Revision:
D25675618 (c8a24ebe54)

Original commit changeset: 55636bb2d3d6

fbshipit-source-id: 7b196f7c32830061eca9c89bbcb346cdd66a211e
2021-01-26 15:31:18 -08:00
24eab1d80d BLD: create a LICENSE_BUNDLED.txt file from third_party licenses (#50745)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50695.

Rather than maintain a LICENSE_BUNDLED.txt by hand, this builds it out of the subrepos.

I ~copied and adapted the sdist handling from Numpy~ added a separate file, so the LICENSE.txt file of the repo remains in pristine condition and the GitHub website still recognizes it. If we modify the file, the website will no longer recognize the license.

This is not enough, since the license in the ~wheel~ wheel and sdist is not modified. Numpy has a [separate step](https://github.com/MacPython/numpy-wheels/blob/master/patch_code.sh) when preparing the wheel to concatenate the licenses. I am not sure where/if the [conda-forge numpy-feedstock](https://github.com/conda-forge/numpy-feedstock/) also fixes up the license.

~Should~ I ~commit~ committed the artifact to the repo and ~add~ added a test that checks the file can be reproduced consistently.

Edit: now the file is part of the repo.

Edit: rework the mention of sdist. After this is merged another PR is needed to make the sdist and wheel ship the proper merged license.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50745

Reviewed By: seemethere, heitorschueroff

Differential Revision: D26074974

Pulled By: walterddr

fbshipit-source-id: bacd5d6870e9dbb419a31a3e3d2fdde286ff2c94
2021-01-26 14:55:07 -08:00
c4029444d1 [nnc] Per-operator benchmarks (#51093)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51093

Operator level benchmarks comparing eager-mode PyTorch to
NNC-generated fused kernels.  We wouldn't normally see these in isolation, but
it points out where NNC is falling short (or doing well).

I threw in a composed hardswish for fun, because it's my favorite activation
function.

Notably, it exposes a bug in our build process that's preventing vectorization
from using `sleef`, so we're using scalar calls to libm with predictably lousy
performance.  Fix incoming.

This benchmark is similar to the pure NNC approach in `microbenchmarks.py`, but
will include the overhead of dispatching the fused kernel through TorchScript.
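
As a rough sketch of this benchmarking style (not the benchmark code itself; whether the fuser actually kicks in depends on build and executor settings):
```
import torch

# Hardswish composed from primitives, scripted so the TorchScript
# fuser can (depending on executor settings) emit one fused kernel.
def hardswish(x):
    return x * (x + 3.0).clamp(0.0, 6.0) / 6.0

scripted = torch.jit.script(hardswish)
x = torch.randn(1 << 20)
for _ in range(10):   # warm-up runs let the profiling executor fuse
    scripted(x)
```
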
ghstack-source-id: 120403675

Test Plan:
```
op                        eager        nnc    speedup
hardswish                 0.187      0.051       3.70
hardswish                 0.052      0.052       1.00
sigmoid                   0.148      1.177       0.13
reciprocal                0.049      0.050       0.98
neg                       0.038      0.037       1.02
relu                      0.037      0.036       1.03
isnan                     0.119      0.020       5.86
log                       0.082      1.330       0.06
log10                     0.148      1.848       0.08
log1p                     0.204      1.413       0.14
log2                      0.285      1.167       0.24
exp                       0.063      1.123       0.06
expm1                     0.402      1.417       0.28
erf                       0.167      0.852       0.20
erfc                      0.181      1.098       0.16
cos                       0.124      0.793       0.16
sin                       0.126      0.838       0.15
tan                       0.285      1.777       0.16
acos                      0.144      1.358       0.11
asin                      0.126      1.193       0.11
cosh                      0.384      1.761       0.22
sinh                      0.390      2.279       0.17
atan                      0.240      1.564       0.15
tanh                      0.320      2.259       0.14
sqrt                      0.043      0.069       0.63
rsqrt                     0.118      0.117       1.01
abs                       0.038      0.037       1.03
ceil                      0.038      0.038       1.01
floor                     0.039      0.039       1.00
round                     0.039      0.292       0.13
trunc                     0.040      0.036       1.12
lgamma                    2.045      2.721       0.75
```

Reviewed By: zheng-xq

Differential Revision: D26069791

fbshipit-source-id: 236e7287ba1b3f67fdcb938949a92bbbdfa13dba
2021-01-26 14:10:08 -08:00
f08464f31d [nnc] Add benchmarks
Summary: Adding a set of benchmarks for key operators

Test Plan:
buck build mode/opt -c 'fbcode.caffe2_gpu_type=none' caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 numactl -C 3 ./buck-out/gen/caffe2/benchmarks/cpp/tensorexpr/tensorexpr_bench

Reviewed By: ZolotukhinM

Differential Revision: D25981260

fbshipit-source-id: 17681fc1527f43ccf9bcc80704415653a627b396
2021-01-26 13:51:33 -08:00
6f3aa58d80 Fix autograd thread crash with python-3.9 (#50998)
Summary:
Update the pybind repo to include the `gil_scoped_acquire::disarm()` method.
In python_engine, allocate the scoped_acquire as a unique_ptr and leak it if the engine is finalizing, for Python 3.9+.
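
A minimal C++ sketch of the pattern (not the actual engine code; `run_with_gil` and its flag are illustrative):
```
#include <memory>
#include <pybind11/pybind11.h>

namespace py = pybind11;

// Hold the GIL via a heap-allocated guard; if the interpreter is
// finalizing (an issue on Python 3.9+), disarm and deliberately leak
// the guard so its destructor never touches the torn-down GIL.
void run_with_gil(bool engine_finalizing) {
  auto gil = std::make_unique<py::gil_scoped_acquire>();
  // ... work that needs the GIL ...
  if (engine_finalizing) {
    gil->disarm();        // destructor becomes a no-op
    (void)gil.release();  // intentionally leak the guard
  }
}
```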

Fixes https://github.com/pytorch/pytorch/issues/50014 and https://github.com/pytorch/pytorch/issues/50893

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50998

Reviewed By: ezyang

Differential Revision: D26038314

Pulled By: malfet

fbshipit-source-id: 035411e22825e8fdcf1348fed36da0bc33e16f60
2021-01-26 13:29:47 -08:00
069602e028 [torch vitals] Initial implementation (#51047)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51047

If the environment variable `TORCH_VITAL` is set to a non-zero-length string, the vitals are dumped at program end.
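
For example (the script name is illustrative):
```
TORCH_VITAL=1 python train.py   # any non-empty value enables the dump
```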

The API is very similar to Google's logging.

Test Plan: buck test //caffe2/aten:vitals

Reviewed By: bitfort

Differential Revision: D25791248

fbshipit-source-id: 0b40da7d22c31d2c4b2094f0dcb1229a35338ac2
2021-01-26 13:09:58 -08:00
83bfab2fb6 toTensor cleanup on sparsenn & static runtime ops (#51113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51113

toTensor() on an lvalue IValue returns a reference; no need to copy.
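
The underlying C++ pattern, sketched generically (with `std::string` standing in for `at::Tensor`; this is not the actual IValue code):
```
#include <string>
#include <utility>

// Ref-qualified overloads: calling get() on an lvalue hands back a
// reference (no copy); calling it on an rvalue moves the value out.
struct Holder {
  std::string value_;
  const std::string& get() const& { return value_; }
  std::string&& get() && { return std::move(value_); }
};

// Usage: Holder h; const std::string& ref = h.get();  // no copy made
```
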
ghstack-source-id: 120317233

Test Plan:
fitsships

Compared `perf stat` results before/after (was on top of a diff stack
so don't take baseline as where master is)

Before:
```
         74,178.77 msec task-clock                #    0.999 CPUs utilized            ( +-  0.31% )
            17,125      context-switches          #    0.231 K/sec                    ( +-  3.41% )
                 3      cpu-migrations            #    0.000 K/sec
           109,535      page-faults               #    0.001 M/sec                    ( +-  1.04% )
   146,803,364,372      cycles                    #    1.979 GHz                      ( +-  0.30% )  (50.03%)
   277,726,600,254      instructions              #    1.89  insn per cycle           ( +-  0.02% )  (50.03%)
    43,299,659,815      branches                  #  583.720 M/sec                    ( +-  0.03% )  (50.03%)
       130,504,094      branch-misses             #    0.30% of all branches          ( +-  1.14% )  (50.03%)
```

After:
```
         72,695.01 msec task-clock                #    0.999 CPUs utilized            ( +-  1.18% )
            15,994      context-switches          #    0.220 K/sec                    ( +-  5.21% )
                 3      cpu-migrations            #    0.000 K/sec
           107,743      page-faults               #    0.001 M/sec                    ( +-  1.55% )
   145,647,684,269      cycles                    #    2.004 GHz                      ( +-  0.30% )  (50.05%)
   277,341,084,993      instructions              #    1.90  insn per cycle           ( +-  0.02% )  (50.04%)
    43,200,717,263      branches                  #  594.273 M/sec                    ( +-  0.02% )  (50.05%)
       143,873,086      branch-misses             #    0.33% of all branches          ( +-  0.59% )  (50.05%)
```

Looks like a 0.7% cycles win (barely outside the noise) and a 0.1%
instructions win.

Reviewed By: hlu1

Differential Revision: D26051766

fbshipit-source-id: 05f8d71d8120d79f7cd80aca747dfc537bf7d382
2021-01-26 13:06:46 -08:00
a949d7b1c8 Workaround Python3.9 limitations in test_jit_py3 (#51088)
Summary:
In Python-3.9 and above `inspect.getsource` of a local class does not work if it was marked as default, see https://bugs.python.org/issue42666 https://github.com/pytorch/pytorch/issues/49617
Work around this by defining a `make_global` function that programmatically accomplishes the same thing.
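
A minimal sketch of what such a helper can look like (the actual signature may differ):
```
import sys

def make_global(*objs):
    # Promote locally defined classes/functions into their defining
    # module's global namespace so inspect.getsource can find them
    # under Python 3.9+.
    for obj in objs:
        setattr(sys.modules[obj.__module__], obj.__name__, obj)
```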

Partially addresses issue raised in https://github.com/pytorch/pytorch/issues/49617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51088

Reviewed By: gmagogsfm

Differential Revision: D26069189

Pulled By: malfet

fbshipit-source-id: 7cf14b88ae5d2b95d2b0fd852717a9202b86356e
2021-01-26 12:49:35 -08:00
c8a24ebe54 Move AcceleratedGraphModule out of graph_manipulation.
Test Plan:
buck test //caffe2/test:test_fx_experimental
buck test //glow/fb/fx_nnpi_importer:test_importer

Reviewed By: jfix71

Differential Revision: D25675618

fbshipit-source-id: 55636bb2d3d6102b400f2044118a450906954083
2021-01-26 12:39:49 -08:00
81ae8edf16 Revert D26018916: [pytorch][PR] Automated submodule update: tensorpipe
Test Plan: revert-hammer

Differential Revision:
D26018916 (5f297cc665)

Original commit changeset: dc8aaa98d4e0

fbshipit-source-id: cd81a7950c7141e0711faabf03292098a8cf14d3
2021-01-26 11:45:48 -08:00
afa79a4df5 [quant][graphmode][fx] cleanup linear module test case (#50976)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50976

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D26032531

fbshipit-source-id: 9725bab8f70ac79652e7bf9f94376917438d60e0
2021-01-26 11:14:22 -08:00
b822aba8ec Enable BFloat support for gemms on arch other than ampere (#50442)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50442

Reviewed By: bdhirsh

Differential Revision: D26044981

Pulled By: mruberry

fbshipit-source-id: 65c42f2c1de8d24e4852a1b5bd8f4b1735b2230e
2021-01-26 11:07:07 -08:00
3562ca2da2 [dist_optim] add warning to distributed optimizer (#50630)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50630

Add a warning log to the distributed optimizer, to warn the user when the optimizer
is created without TorchScript support.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D25932777

Pulled By: wanchaol

fbshipit-source-id: 8db3b98bdd27fc04c5a3b8d910b028c0c37f138d
2021-01-26 10:30:55 -08:00
6dda0363bb [reland] Refactor mypy configs list into editor-friendly wrapper (#50826)
Summary:
Closes https://github.com/pytorch/pytorch/issues/50513 by resolving all four checkboxes. If this PR is merged, I will also modify one or both of the following wiki pages to add instructions on how to use this `mypy` wrapper for VS Code editor integration:

- [Guide for adding type annotations to PyTorch](https://github.com/pytorch/pytorch/wiki/Guide-for-adding-type-annotations-to-PyTorch)
- [Lint as you type](https://github.com/pytorch/pytorch/wiki/Lint-as-you-type)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50826

Test Plan:
Unit tests for globbing function:
```
python test/test_testing.py TestMypyWrapper -v
```

Manual checks:

- Uninstall `mypy` and run `python test/test_type_hints.py` to verify that it still works when `mypy` is absent.
- Reinstall `mypy` and run `python test/test_type_hints.py` to verify that this didn't break the `TestTypeHints` suite.
- Run `python test/test_type_hints.py` again (should finish quickly) to verify that this didn't break `mypy` caching.
- Run `torch/testing/_internal/mypy_wrapper.py` on a few Python files in this repo to verify that it doesn't give any additional warnings when the `TestTypeHints` suite passes. Some examples (compare with the behavior of just running `mypy` on these files):
  ```sh
  torch/testing/_internal/mypy_wrapper.py $PWD/README.md
  torch/testing/_internal/mypy_wrapper.py $PWD/tools/fast_nvcc/fast_nvcc.py
  torch/testing/_internal/mypy_wrapper.py $PWD/test/test_type_hints.py
  torch/testing/_internal/mypy_wrapper.py $PWD/torch/random.py
  torch/testing/_internal/mypy_wrapper.py $PWD/torch/testing/_internal/mypy_wrapper.py
  ```
- Remove type hints from `torch.testing._internal.mypy_wrapper` and verify that running `mypy_wrapper.py` on that file gives type errors.
- Remove the path to `mypy_wrapper.py` from the `files` setting in `mypy-strict.ini` and verify that running it again on itself no longer gives type errors.
- Add `test/test_type_hints.py` to the `files` setting in `mypy-strict.ini` and verify that running the `mypy` wrapper on it again now gives type errors.
- Change a return type in `torch/random.py` and verify that running the `mypy` wrapper on it again now gives type errors.
- Add the suggested JSON from the docstring of `torch.testing._internal.mypy_wrapper.main` to your `.vscode/settings.json` and verify that VS Code gives the same results (inline, while editing any Python file in the repo) as running the `mypy` wrapper on the command line, in all the above cases.

Reviewed By: walterddr

Differential Revision: D26049052

Pulled By: samestep

fbshipit-source-id: 0b35162fc78976452b5ea20d4ab63937b3c7695d
2021-01-26 09:04:14 -08:00
31194750f2 [jit] Fix ResolutionCallback definition (#51089)
Summary:
`ResolutionCallback` returns `py::object` (i.e. `Any`) rather than `py::function` (i.e. `Callable`)

Discovered while debugging test failures after updating pybind11

This also makes resolution code slightly faster, as it eliminates casts from object to function and back for every `py::object obj = rcb_(name);` statement.
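
The corrected alias, sketched (the exact parameter type here is an assumption):
```
#include <functional>
#include <string>
#include <pybind11/pybind11.h>

// The callback resolves a name to an arbitrary Python object, not
// necessarily a callable, so py::object is the right return type.
using ResolutionCallback =
    std::function<pybind11::object(const std::string&)>;
```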

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51089

Reviewed By: jamesr66a

Differential Revision: D26069295

Pulled By: malfet

fbshipit-source-id: 6876caf9b4653c8dc8e568aefb6778895decea05
2021-01-26 08:47:38 -08:00
5834b3b204 Fix test_jit_cuda_archflags on machine with more than one arch (#50405)
Summary:
This fixes the following flaky test on machines with GPUs of different architectures:
```
_________________________________________________________________________________________________________________ TestCppExtensionJIT.test_jit_cuda_archflags __________________________________________________________________________________________________________________

self = <test_cpp_extensions_jit.TestCppExtensionJIT testMethod=test_jit_cuda_archflags>

    @unittest.skipIf(not TEST_CUDA, "CUDA not found")
    @unittest.skipIf(TEST_ROCM, "disabled on rocm")
    def test_jit_cuda_archflags(self):
        # Test a number of combinations:
        #   - the default for the machine we're testing on
        #   - Separators, can be ';' (most common) or ' '
        #   - Architecture names
        #   - With/without '+PTX'

        capability = torch.cuda.get_device_capability()
        # expected values is length-2 tuple: (list of ELF, list of PTX)
        # note: there should not be more than one PTX value
        archflags = {
            '': (['{}{}'.format(capability[0], capability[1])], None),
            "Maxwell+Tegra;6.1": (['53', '61'], None),
            "Pascal 3.5": (['35', '60', '61'], None),
            "Volta": (['70'], ['70']),
        }
        if int(torch.version.cuda.split('.')[0]) >= 10:
            # CUDA 9 only supports compute capability <= 7.2
            archflags["7.5+PTX"] = (['75'], ['75'])
            archflags["5.0;6.0+PTX;7.0;7.5"] = (['50', '60', '70', '75'], ['60'])

        for flags, expected in archflags.items():
>           self._run_jit_cuda_archflags(flags, expected)

test_cpp_extensions_jit.py:198:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test_cpp_extensions_jit.py:158: in _run_jit_cuda_archflags
    _check_cuobjdump_output(expected[0])
test_cpp_extensions_jit.py:134: in _check_cuobjdump_output
    self.assertEqual(actual_arches, expected_arches,
../../.local/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py:1211: in assertEqual
    super().assertEqual(len(x), len(y), msg=self._get_assert_msg(msg, debug_msg=debug_msg))
E   AssertionError: 2 != 1 : Attempted to compare the lengths of [iterable] types: Expected: 2; Actual: 1.
E   Flags: ,  Actual: ['sm_75', 'sm_86'],  Expected: ['sm_86']
E   Stderr:
E   Output: ELF file    1: cudaext_archflags.1.sm_75.cubin
E   ELF file    2: cudaext_archflags.2.sm_86.cubin

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50405

Reviewed By: albanD

Differential Revision: D25920200

Pulled By: mrshenli

fbshipit-source-id: 1042a984142108f954a283407334d39e3ec328ce
2021-01-26 08:38:54 -08:00
5f297cc665 Automated submodule update: tensorpipe (#50946)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: f463e0ebfc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50946

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D26018916

fbshipit-source-id: dc8aaa98d4e002e972d5c6783f2351c29f7db239
2021-01-26 08:21:30 -08:00
95ae9a20e4 Enable ROCM Skipped tests in test_ops.py (#50500)
Summary:
Removed skipCUDAIfRocm to re-enable tests for the ROCm platform.

Initially, only 4799 cases were being run, and 882 of those were skipped. After removing skipCUDAIfRocm from two places in test_ops.py, more than 8000 cases are now executed, of which only 282 are skipped; those are FFT-related tests.

Signed-off-by: Arindam Roy <rarindam@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50500

Reviewed By: albanD

Differential Revision: D25920303

Pulled By: mrshenli

fbshipit-source-id: b2d17b7e2d1de4f9fdd6f1660fb4cad5841edaa0
2021-01-26 08:09:18 -08:00
233e4ebdb6 Implement autograd functions for c10d communication operations (#40762)
Summary:
Closes https://github.com/pytorch/pytorch/issues/40702, Fixes https://github.com/pytorch/pytorch/issues/40690

Currently WIP, but I would appreciate some feedback. Functions should be double-differentiable.

Contrary to b35cdc5200/torch/nn/parallel/_functions.py, this PR generates a list of tensors instead of aggregating the received data in a single tensor. Is this behavior correct?

Thanks!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40762

Reviewed By: glaringlee

Differential Revision: D24758889

Pulled By: mrshenli

fbshipit-source-id: 79285fb4b791cae3d248f34e2aadb11c9ab10cce
2021-01-26 07:52:51 -08:00
83315965ab Turn on batched grad testing for CriterionTest (#50744)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50744

This PR adds a `check_batched_grad=True` option to CriterionTest and
turns it on by default for all CriterionTest-generated tests

Test Plan: - run tests

Reviewed By: ejguan

Differential Revision: D25997676

Pulled By: zou3519

fbshipit-source-id: cc730731e6fae2bddc01bc93800fd0e3de28b32d
2021-01-26 07:37:15 -08:00
e843974a6e Revert D25850783: Add torch::deploy, an embedded torch-python interpreter
Test Plan: revert-hammer

Differential Revision:
D25850783 (3192f9e4fe)

Original commit changeset: a4656377caff

fbshipit-source-id: 1c7133627da28fb12848da7a9a46de6d3b2b67c6
2021-01-26 02:07:44 -08:00
a51b9a823c Improve docs around Math/DefaultBackend & add PythonDispatcher class. (#50854)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50854

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D26008542

Pulled By: ailzhang

fbshipit-source-id: e9c0aa97ac2537ff612f5faf348fcb613da09479
2021-01-25 23:10:36 -08:00
9f19843d19 [Gradient Compression] Typo fixes in PowerSGD (#50974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50974

Typo fixes.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120257221

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D26031679

fbshipit-source-id: 9d049b50419a3e40e53f7f1275a441e31b87717b
2021-01-25 22:55:54 -08:00
ffaae32d60 [Gradient Compression] Allow PowerSGD to run vanilla allreduce for the first K iterations (#50973)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50973

This extends the original PowerSGD method to a hybrid approach: vanilla allreduce + PowerSGD. It can help further improve accuracy, at the cost of a lower speedup.

Also add more comments on the fields in `PowerSGDState`.
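
A usage sketch of the hybrid behavior (the `start_powerSGD_iter` field name and argument layout are assumptions based on this change):
```
import torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook as powerSGD

# Run vanilla allreduce for the first 1000 iterations, then switch to
# PowerSGD compression.  ddp_model is an existing
# DistributedDataParallel instance.
state = powerSGD.PowerSGDState(
    process_group=None,            # default process group
    matrix_approximation_rank=1,
    start_powerSGD_iter=1000,
)
ddp_model.register_comm_hook(state, powerSGD.powerSGD_hook)
```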

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120257202

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D26031478

fbshipit-source-id: d72e70bb28ba018f53223c2a4345306980b3084e
2021-01-25 22:38:39 -08:00
880f007480 Add torch.eig complex forward (CPU, CUDA) (#49168)
Summary:
Related to issue https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49168

Reviewed By: mrshenli

Differential Revision: D25954027

Pulled By: mruberry

fbshipit-source-id: e429f9587efff5e638bfd0e4de864c06f41c63b1
2021-01-25 21:27:08 -08:00
502ca0105d Added cuda bindings for NNC (#51046)
Summary:
See above.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51046

Reviewed By: ZolotukhinM

Differential Revision: D26053419

Pulled By: Chillee

fbshipit-source-id: 9cc2dc434239a1ad77d30a1e5c0a9592be4944dc
2021-01-25 20:41:40 -08:00
6ef66213ee [PT QNNPACK] Temporarily disable input pointer caching (#51051)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51051

Disable input pointer caching on iOS. We are seeing some issues with this on some iOS devices.

Test Plan:
FB:
Tested this in IG with the BT effect.

Reviewed By: IvanKobzarev, AshkanAliabadi

Differential Revision: D25984429

fbshipit-source-id: f6ceef606994b22de9cdd9752115b3481cd7bd96
2021-01-25 20:34:06 -08:00
5adbace8e6 Abort node in fast_nvcc if ancestor fails (#51043)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51043

This PR makes `fast_nvcc` stop at failing commands, rather than continuing on to run commands that would otherwise run after those commands. It is still possible for `fast_nvcc` to run more commands than `nvcc` would run if there's no dependency between them, but this should still help to reduce noise from failing `fast_nvcc` runs.
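
The idea, as a minimal sketch (not the actual fast_nvcc code):
```
# Commands are assumed to be in topological order; deps[i] holds the
# indices of commands that command i depends on.  A failure poisons
# every transitive dependent, which is then skipped instead of run.
def run_graph(commands, deps, run):
    failed = set()
    for i, cmd in enumerate(commands):
        if any(d in failed for d in deps[i]):
            failed.add(i)          # an ancestor failed; abort this node
            continue
        if run(cmd) != 0:
            failed.add(i)
    return failed
```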

Test Plan: Unfortunately the test suite for this script is FB-internal. It would probably be a good idea to move it into the PyTorch GitHub repo, but I'm not entirely sure how to do so, since I don't believe we currently have a good place to put tests for things in `tools`.

Reviewed By: malfet

Differential Revision: D26007788

fbshipit-source-id: 8fe1e7d020a29d32d08fe55fb59229af5cdfbcaa
2021-01-25 18:12:51 -08:00
a347c747df Fix TransformedDistribution shaping logic (#50581)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50496
Fixes https://github.com/pytorch/pytorch/issues/34859
Fixes https://github.com/pytorch/pytorch/issues/21596

This fixes many bugs involving `TransformedDistribution` and `ComposeTransform` when the component transforms changed their event shapes. Part of the fix is to introduce an `IndependentTransform` analogous to `distributions.Independent` and `constraints.independent`, and to introduce methods `Transform.forward_shape()` and `.inverse_shape()`. I have followed fehiepsi's suggestion and replaced `.input_event_dim` -> `.domain.event_dim` and `.output_event_dim` -> `.codomain.event_dim`. This allows us to deprecate `.event_dim` as an attribute.
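
A small sketch using the names introduced here:
```
import torch
from torch.distributions.transforms import ReshapeTransform

# A transform that changes the event shape, plus the new shape methods
# that compute result shapes without sampling.
t = ReshapeTransform(in_shape=(6,), out_shape=(2, 3))
print(t.forward_shape(torch.Size([5, 6])))     # torch.Size([5, 2, 3])
print(t.inverse_shape(torch.Size([5, 2, 3])))  # torch.Size([5, 6])
```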

## Summary of changes

- Fixes `TransformDistribution` and `ComposeTransform` shape errors.
- Fixes a behavior bug in `LogisticNormal`.
- Fixes `kl_divergence(TransformedDistribution, TransformedDistribution)`
- Adds methods `Transform.forward_shape()`, `.inverse_shape()` which are required for correct shape computations in `TransformedDistribution` and `ComposeTransform`.
- Adds an `IndependentTransform`.
- Adds a `ReshapeTransform` which is invaluable in testing shape logic in `ComposeTransform` and `TransformedDistribution` and which will be used by stefanwebb flowtorch.
- Fixes incorrect default values in `constraints.dependent.event_dim`.
- Documents the `.event_dim` and `.is_discrete` attributes.

## Changes planned for follow-up PRs

- Memoize `constraints.dependent_property` as we do with `lazy_property`, since we now consult those properties much more often.

## Tested
- [x] added a test for `Dist.support` vs `Dist(**params).support` to ensure static and dynamic attributes agree.
- [x] refactoring is covered by existing tests
- [x] add test cases for `ReshapedTransform`
- [x] add a test for `TransformedDistribution` on a wide grid of input shapes
- [x] added a regression test for https://github.com/pytorch/pytorch/issues/34859

cc fehiepsi feynmanliang stefanwebb

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50581

Reviewed By: ezyang, glaringlee, jpchen

Differential Revision: D26024247

Pulled By: neerajprad

fbshipit-source-id: f0b9a296f780ff49659b132409e11a29985dde9b
2021-01-25 16:34:12 -08:00
250c71121b Create a DDPLoggingData and expose it to python interface (#50622)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50622

1. Define a DDPLoggingData struct that is the placeholder for all the ddp related logging fields
2. Put the DDPLoggingData struct in the C10 directory so that it can be easily imported by c10 and torch files
3. Expose get_ddp_logging_data() method in python so that users can get the logging data and dump in their applications
4. Unit test tested the logging data can be set and got as expected
5. Follow-up will add more logging fields such as perf stats, internal states, env variables, etc.
ghstack-source-id: 120275870

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D25930527

fbshipit-source-id: 290c200161019c58e28eed9a5a2a7a8153113f99
2021-01-25 15:23:07 -08:00
3192f9e4fe Add torch::deploy, an embedded torch-python interpreter (#50458)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50458

libinterpreter.so contains a frozen python distribution including
torch-python bindings.

Freezing refers to serializing bytecode of python standard library modules as
well as the torch python library and embedding them in the library code.  This
library can then be dlopened multiple times in one process context, each
interpreter having its own python state and GIL.  In addition, each python
environment is sealed off from the filesystem and can only import the frozen
modules included in the distribution.

This change relies on newly added frozenpython, a cpython 3.8.6 fork built for this purpose.  Frozenpython provides libpython3.8-frozen.a which
contains frozen bytecode and object code for the python standard library.

Building on top of frozen python, the frozen torch-python bindings are added in
this diff, providing each embedded interpreter with a copy of the torch
bindings.  Each interpreter is intended to share one instance of libtorch and
the underlying tensor libraries.

Known issues

- Autograd is not expected to work with the embedded interpreter currently, as it manages
its own python interactions and needs to coordinate with the duplicated python
states in each of the interpreters.
- Distributed and cuda stuff is disabled in libinterpreter.so build, needs to be revisited
- __file__ is not supported in the context of embedded python since there are no
files for the underlying library modules.
- __version__ is not properly supported in the embedded torch-python; there is just a
workaround for now

Test Plan: tested locally and on CI with cmake and buck builds running torch::deploy interpreter_test

Reviewed By: ailzhang

Differential Revision: D25850783

fbshipit-source-id: a4656377caff25b73913daae7ae2f88bcab8fd88
2021-01-25 15:14:28 -08:00
ddf26816d3 Make torch.svd return V, not V.conj() for complex inputs (#51012)
Summary:
**BC-breaking note:**

torch.svd() added support for complex inputs in PyTorch 1.7, but was not documented as doing so. The complex "V" tensor returned was actually the complex conjugate of what's expected. This PR fixes the discrepancy.

This will silently break all users of torch.svd() with complex inputs.
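
A sketch of the new convention for complex inputs (the reconstruction must now conjugate V explicitly):
```
import torch

a = torch.randn(3, 3, dtype=torch.complex64)
u, s, v = torch.svd(a)
# v is now V itself (not V.conj()), so A = U diag(S) V^H becomes:
recon = (u * s) @ v.t().conj()
assert torch.allclose(recon, a, atol=1e-5)
```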

**Original PR Summary:**

This PR resolves https://github.com/pytorch/pytorch/issues/45821.

The problem was that when introducing the support of complex inputs for `torch.svd` it was overlooked that LAPACK/MAGMA returns the conjugate transpose of V matrix, not just the transpose of V. So `torch.svd` was silently returning U, S, V.conj() instead of U, S, V.

Behavior of `torch.linalg.pinv`, `torch.pinverse` and `torch.linalg.svd` (they depend on `torch.svd`) is not changed in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51012

Reviewed By: bdhirsh

Differential Revision: D26047593

Pulled By: albanD

fbshipit-source-id: d1e08dbc3aab9ce1150a95806ef3b5da98b5d3ca
2021-01-25 14:06:41 -08:00
f8eefbdf7a fake_quant: fix device affinity and buffer resizing for state_dict (#50868)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50868

Ensures that `FakeQuantize` respects device affinity when loading from
state_dict, and knows how to resize scale and zero_point values
(which is necessary for FQ classes wrapping per channel observers).

This is same as https://github.com/pytorch/pytorch/pull/44537, but for
`FakeQuantize`.

Test Plan:
```
python test/test_quantization.py TestObserver.test_state_dict_respects_device_affinity
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25991570

fbshipit-source-id: 1193a6cd350bddabd625aafa0682e2e101223bb1
2021-01-25 13:50:28 -08:00
68c218547c Add documentation page for pipeline parallelism. (#50791)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50791

Add a dedicated pipeline parallelism doc page explaining the APIs and
the overall value of the module.
ghstack-source-id: 120257168

Test Plan:
1) View locally
2) waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D25967981

fbshipit-source-id: b607b788703173a5fa4e3526471140506171632b
2021-01-25 13:47:13 -08:00
a7cf04ec40 Workaround for MAGMA accessing illegal memory in batched cholesky (#50957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50957

MAGMA has an off-by-one error in its batched cholesky implementation which causes illegal memory access for certain inputs. The workaround implemented in this PR is to pad the input to MAGMA with 1 extra element.

**Benchmark**
Ran the script below for both before and after my PR and got similar results.

*Script*
```
import torch
from torch.utils import benchmark

DTYPE = torch.float32
BATCHSIZE = 512 * 512
MATRIXSIZE = 16

a = torch.eye(MATRIXSIZE, device='cuda', dtype=DTYPE)

t0 = benchmark.Timer(
    stmt='torch.cholesky(a)',
    globals={'a': a},
    label='Single'
)

t1 = benchmark.Timer(
    stmt='torch.cholesky(a)',
    globals={'a': a.expand(BATCHSIZE, -1, -1)},
    label='Batched'
)

print(t0.timeit(100))
print(t1.timeit(100))
```

*Results before*
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Single
  2.08 ms
  1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Batched
  7.68 ms
  1 measurement, 100 runs , 1 thread
```

*Results after*
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Single
  2.10 ms
  1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Batched
  7.56 ms
  1 measurement, 100 runs , 1 thread
```

Fixes https://github.com/pytorch/pytorch/issues/41394, https://github.com/pytorch/pytorch/issues/26996, https://github.com/pytorch/pytorch/issues/48996

See also https://github.com/pytorch/pytorch/issues/42666, https://github.com/pytorch/pytorch/pull/26789

TODO
 ---
- [x] Benchmark to check for perf regressions

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D26050978

Pulled By: heitorschueroff

fbshipit-source-id: 7a5ba7e34c9d74b58568b2a0c631cc6d7ba63f86
2021-01-25 13:39:24 -08:00
9dfbfe9fca Add type annotations to torch.overrides (#50824)
Summary:
This is a follow up PR of https://github.com/pytorch/pytorch/issues/48493.

Fixes https://github.com/pytorch/pytorch/issues/48492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50824

Reviewed By: bdhirsh

Differential Revision: D26050736

Pulled By: ezyang

fbshipit-source-id: 049605fd271cff28c8b6e300c163e9df3b3ea23b
2021-01-25 13:20:09 -08:00
75cba9d0d1 More about cudnn refactor (#50827)
Summary:
- Resolves ngimel's review comments in https://github.com/pytorch/pytorch/pull/49109
- Move `ConvolutionArgs` from `ConvShared.h` to `Conv_v7.cpp`, because cuDNN v8 uses different descriptors therefore will not share the same `ConvolutionArgs`.
- Refactor the `ConvolutionParams` (the hash key for benchmark):
  - Remove `input_stride`
  - Add `input_dim`
  - Add `memory_format`
- Make `repro_from_args` to take `ConvolutionParams` instead of `ConvolutionArgs` as arguments so that it can be shared for v7 and v8
- Rename some `layout` to `memory_format`. `layout` should be sparse/strided and `memory_format` should be contiguous/channels_last. They are different things.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50827

Reviewed By: bdhirsh

Differential Revision: D26048274

Pulled By: ezyang

fbshipit-source-id: f71aa02d90ffa581c17ab05b171759904b311517
2021-01-25 12:58:25 -08:00
28869d5a80 [quant][graphmode][fx] Add support for quantizing functional linear + {functional relu/module relu} (#50975)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50975

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D26032532

fbshipit-source-id: a084fb4fd711ad52b2da1c6378cbcc2b352976c6
2021-01-25 12:49:58 -08:00
95a0a1a18f Update docstring on return type of jvp and vjp (#51035)
Summary:
Updates the docstrings to state that `jvp` and `vjp` both return the primal `func_output` first as part of the return tuple,
in line with the docstrings of [hvp](c620572a34/torch/autograd/functional.py (L671)) and [vhp](c620572a34/torch/autograd/functional.py (L583)).
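
For reference, the documented return order:
```
import torch
from torch.autograd.functional import jvp

def f(x):
    return x.sin().sum()

x = torch.randn(3)
v = torch.ones(3)
out, jvp_val = jvp(f, (x,), (v,))  # primal output first, then the JVP
```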

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51035

Reviewed By: bdhirsh

Differential Revision: D26047693

Pulled By: albanD

fbshipit-source-id: 5f2957a858826b4c1884590b6be7a8bed0791efd
2021-01-25 12:40:30 -08:00
09b896261c Skip test_lc_1d for ROCM (#50964)
Summary:
The test is flaky on ROCM when deadline is set to 1 second. This is affecting builds as it is failing randomly.
Disabling for now.

Signed-off-by: Arindam Roy <rarindam@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50964

Reviewed By: houseroad

Differential Revision: D26049370

Pulled By: BIT-silence

fbshipit-source-id: 22337590a8896ad75f1281e56fbbeae897f5c3b2
2021-01-25 11:43:37 -08:00
ac0a3cc5fd Merge CompilationUnit from torch._C and torch.jit (#50614)
Summary:
This simplifies our handling and allows passing CompilationUnits from Python to C++ defined functions via PyBind easily.

Discussed on Slack with SplitInfinity

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50614

Reviewed By: anjali411

Differential Revision: D25938005

Pulled By: SplitInfinity

fbshipit-source-id: 94aadf0c063ddfef7ca9ea17bfa998d8e7b367ad
2021-01-25 11:06:40 -08:00
5e79b8e06d Back out "Revert D25903846: [pytorch][PR] Structured kernel definition for upsample_nearest2d" (#50794)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50794

Original commit changeset: b4a7948088c0

There are some subtle extra tweaks on top of the original. I can unbundle them, but I've opted to keep it with the port because it's the easiest way to make sure the changes are exercised.

* There's a bugfix in the codegen to test whether a dispatch key is structured *before* short-circuiting on the dispatch key being missing from the table. This accounts for mixed structured/non-structured situations where the dispatch table is present, but the relevant structured key isn't (because the dispatch table only exists to register, e.g., QuantizedCPU)
* Dispatch tables for functions which delegate to structured kernels don't have Math entries generated for them.
* It's now illegal to specify a structured dispatch key in a delegated structured kernel (it would be ignored!); add is now fixed to follow this
* There are some extra sanity checks for NativeFunctions validation
* Finally, unlike the original PR, I switched the .vec variant of upsample_nearest2d to also be DefaultBackend, bringing it inline with upsample_nearest1d.
ghstack-source-id: 120038038

Test Plan:
```
buck test mode/dev //coreai/tiefenrausch:python_tests -- --exact 'coreai/tiefenrausch:python_tests - test_can_run_local_async_inference_cpu (coreai.tiefenrausch.tests.python_test.TiefenrauschPY)' --run-disabled
```

Reviewed By: ngimel

Differential Revision: D25962873

fbshipit-source-id: d29a9c97f15151db3066ae5efe7a0701e6dc05a3
2021-01-25 10:43:53 -08:00
f7b339d11c Clarify wording around overrides subclasses. (#51031)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47117

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51031

Reviewed By: bdhirsh

Differential Revision: D26047498

Pulled By: albanD

fbshipit-source-id: dd0a7d9f97c0f6469b3050d2e3b4473f1bee3820
2021-01-25 08:19:13 -08:00
a6257b2fe2 Fix #48903 (#50817)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50817

Replace some longs with int64_t.  Thanks Tom Heaven for contributing
this patch.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D25975915

Pulled By: ezyang

fbshipit-source-id: c1061a85f80ad17fa4fb313da797bc6d5ba203c2
2021-01-25 07:44:41 -08:00
806010b75e [BE] move more unittest.main() to run_tests() (#50923)
Summary:
Relate to https://github.com/pytorch/pytorch/issues/50483.

Everything except ONNX, detectron and release notes tests are moved to use common_utils.run_tests() to ensure CI reports XML correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50923

Reviewed By: samestep

Differential Revision: D26027621

Pulled By: walterddr

fbshipit-source-id: b04c03f10d1fe96181b720c4c3868e86e4c6281a
2021-01-25 07:23:09 -08:00
8690819618 OpInfo: Add DecorateInfo class similar to SkipInfo for decorators (#50501)
Summary:
Follow up to https://github.com/pytorch/pytorch/issues/50435

I have confirmed this works by running
```
pytest test_ops.py -k test_fn_gradgrad_fft
```
normally and with `PYTORCH_TEST_WITH_SLOW=1 PYTORCH_TEST_SKIP_FAST=1`. In the first case all tests are skipped; in the second they all run as they should.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50501

Reviewed By: ezyang

Differential Revision: D25956416

Pulled By: mruberry

fbshipit-source-id: c896a8cec5f19b8ffb9b168835f3743b6986dad7
2021-01-25 04:51:04 -08:00
5a5bca8ef0 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D26043955

fbshipit-source-id: 0a5740a82bdd3ac7bd1665a325ff7fe79488ccea
2021-01-25 04:20:03 -08:00
627a331257 Port CPU torch.orgqr to ATen (#50502)
Summary:
Now we can remove `_th_orgqr`!

Compared to the original TH-based `orgqr`, complex (https://github.com/pytorch/pytorch/issues/33152) and batched inputs are now supported.
CUDA support will be added in a follow-up PR.

Closes https://github.com/pytorch/pytorch/issues/24747

Ref. https://github.com/pytorch/pytorch/issues/49421, https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50502

Reviewed By: mrshenli

Differential Revision: D25953300

Pulled By: mruberry

fbshipit-source-id: f52a74e1c8f51b5e24f7b461430ca8fc96e4d149
2021-01-25 02:57:05 -08:00
48b6b9221a [BE] Make Vec256 header only library (#50708)
Summary:
Do it by removing extraneous header dependencies.
None of the at::vec256 primitives depend on the notion of Tensor, therefore none of the headers that vec256 depends on should include <ATen/Tensor.h>.

Implicitly test it by removing the c10 and tensor dependencies when building `vec256_test_all_types`.
Split affine_quantizer into affine_quantizer_base (which contains methods operating on raw types) and affine_quantizer (which contains Tensor-specific methods).

Fixes https://github.com/pytorch/pytorch/issues/50567

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50708

Reviewed By: walterddr

Differential Revision: D25949168

Pulled By: malfet

fbshipit-source-id: c3323be7252865a52c7d94026a5a39b494e44efb
2021-01-24 21:46:36 -08:00
186c3da037 Add cusolver gesvdj and gesvdjBatched to the backend of torch.svd (#48436)
Summary:
This PR adds cusolver `gesvdj` and `gesvdjBatched` to the backend of `torch.svd`.

I've tested the performance using CUDA 11.1 on a 2070, V100, and A100. The cusolver gesvdj and gesvdjBatched performances are better than magma in all square matrix cases, so the cusolver backend will replace the magma backend when available.

When both matrix dimensions are no greater than 32, `gesvdjBatched` is used. Otherwise, `gesvdj` is used.

Detailed benchmark is available at https://github.com/xwang233/code-snippet/tree/master/svd.

Some relevant code and discussions
- https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/linalg/svd_op_gpu.cu.cc
- https://github.com/google/jax/blob/master/jaxlib/cusolver.cc
- https://github.com/cupy/cupy/issues/3174
- https://github.com/tensorflow/tensorflow/issues/13603
- https://www.nvidia.com/en-us/on-demand/session/gtcsiliconvalley2019-s9226/

See also https://github.com/pytorch/pytorch/issues/42666 https://github.com/pytorch/pytorch/issues/47953

Close https://github.com/pytorch/pytorch/pull/50516

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48436

Reviewed By: ejguan

Differential Revision: D25977046

Pulled By: heitorschueroff

fbshipit-source-id: c27e705cd29b6fd7c8ac674c1f9f490fa26ee1bf
2021-01-24 15:47:05 -08:00
1f40f2a172 Add improved support for parallelization and related graph opts (#5257)
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/5257

- Add RescaleQuantized parallelization support to graph opts' parallelization code
- On NNPI, mirror Rescale parallelization for FC/Relus that come before it
- Sink Reshapes below Quantize and ConvertTo
- Remove unnecessary ConvertTo when following a Dequantize (i.e. just change the elem kind of the Dequantize instead)

Test Plan: Added unit tests

Reviewed By: hyuen, mjanderson09

Differential Revision: D25947824

fbshipit-source-id: 771abd36a1bc7270bf1f901d1ec6cb6d78e9fd1f
2021-01-23 17:20:30 -08:00
c9cae1446f fix unflatten_dense_tensor when there is empty tensor inside (#50321)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50321

Quantization team reported that when two empty tensors are replicated among ranks, the two empty tensors start to share storage after resizing.

The root cause is that unflatten_dense_tensor unflattened the empty tensor as a view of the flat tensor, which thus shared storage with other tensors.

This PR avoids unflattening the empty tensor as a view of the flat tensor, so that an empty tensor will not share storage with other tensors.
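
A minimal sketch of the fix idea (not the actual torch._utils code):
```
import torch

# Give zero-element tensors fresh storage instead of a view into flat.
def unflatten_dense_tensors(flat, tensors):
    outputs, offset = [], 0
    for t in tensors:
        n = t.numel()
        if n == 0:
            outputs.append(flat.new_empty(t.shape))  # no shared storage
        else:
            outputs.append(flat.narrow(0, offset, n).view_as(t))
            offset += n
    return tuple(outputs)
```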

Test Plan: unit test

Reviewed By: pritamdamania87

Differential Revision: D25859503

fbshipit-source-id: 5b760b31af6ed2b66bb22954cba8d1514f389cca
2021-01-23 12:14:34 -08:00
e544d74c55 [CPU] Add torch.trace for complex tensors (#50380)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50380

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25949361

Pulled By: anjali411

fbshipit-source-id: 9910bc5b532c9bf3add530221d643b2c41c62d01
2021-01-23 09:04:31 -08:00
2c3c2a4b7a [dist_optim] add distributed functional AdamW optimizer (#50620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50620

Add TorchScript compatible AdamW functional optimizer to distributed optimizer

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D25932774

Pulled By: wanchaol

fbshipit-source-id: 64eb4aeaa3cab208d0ebbec7c4d91a9d43951947
2021-01-23 01:04:45 -08:00
3f982e56b1 [dist_optim] add distributed functional RMSprop optimizer (#50619)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50619

Add TorchScript compatible RMSprop functional optimizer to distributed optimizer

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D25932775

Pulled By: wanchaol

fbshipit-source-id: bd4854f9f95a740e02a1bebe24f780488460ba4d
2021-01-23 01:04:41 -08:00
6c81b4d917 [dist_optim] add distributed functional Adadelta optimizer (#50623)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50623

Add TorchScript compatible Adadelta functional optimizer to distributed optimizer

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D25932772

Pulled By: wanchaol

fbshipit-source-id: d59b04e5f0b6bab7e0d1c5f68e66249a65958e0b
2021-01-23 01:04:36 -08:00
cd2067539e [dist_optim] add distributed functional sgd optimizer (#50618)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50618

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D25932778

Pulled By: wanchaol

fbshipit-source-id: 8df3567b477bc5ba3556b8c5294cd3da5db963ad
2021-01-23 01:04:32 -08:00
5cbe1e4933 [dist_optim] add distributed functional Adam optimizer (#50624)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50624

Add TorchScript compatible Adam functional optimizer to distributed optimizer

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D25932770

Pulled By: wanchaol

fbshipit-source-id: cab3f1164c76186969c284a2c52481b79bbb7190
2021-01-23 01:01:37 -08:00
5a661e0171 [WIP][Grad Compression] Unittest to verify allreduce_hook parity (#50851)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50851

Improves upon the previous unittest to ensure allreduce_hook results in the same gradients as vanilla allreduce in DDP.

ghstack-source-id: 120229103

Test Plan:
buck build mode/dev-nosan //caffe2/test/distributed:distributed_nccl_fork --keep-going
BACKEND=nccl WORLD_SIZE=2 ~/fbcode/buck-out/dev/gen/caffe2/test/distributed/distributed_nccl_fork#binary.par -r test_ddp_hook_parity

Reviewed By: SciPioneer

Differential Revision: D25963654

fbshipit-source-id: d55eee0aee9cf1da52aa0c4ba1066718aa8fd9a4
2021-01-23 00:47:08 -08:00
6aec1eba15 [aten] Make aten::flatten call native::reshape (#50859)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50859

Test Plan:
Unit test:
```
buck test //caffe2/test:torch
```
Benchmark:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 13 \
./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \
--scripted_model=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge_v2/traced_precomputation.pt \
--pt_inputs=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge_v2/container_precomputation_bs20.pt \
--iters=10000 --warmup_iters=10000 --num_threads=1 --pt_enable_static_runtime=true \
--pt_cleanup_activations=true --pt_enable_out_variant=true --do_profile=true
```

Reduces the total time spent on flatten from 1.22% to 0.97% (net 0.25% reduction).
```
Before:

Static runtime ms per iter: 0.0725054. Iters per second: 13792.1
    0.000857179 ms.    1.21862%. aten::flatten (1 nodes)

After:

Static runtime ms per iter: 0.0720371. Iters per second: 13881.7
    0.000686155 ms.    0.97151%. aten::flatten (1 nodes)
```

Reviewed By: ajyu

Differential Revision: D25986759

fbshipit-source-id: dc0f542c56a688d331d349845b78084577970476
2021-01-22 23:12:01 -08:00
069e68a2a4 Fix ScriptModule docstring (#48608)
Summary:
Fixes a typo in `ScriptModule`'s docstring and converts it to the raw format (`r"""...`).

Fixes https://github.com/pytorch/pytorch/issues/48634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48608

Reviewed By: anjali411

Differential Revision: D25242022

Pulled By: gmagogsfm

fbshipit-source-id: 5199868af999c6c360c7dd5e2813659f1028acab
2021-01-22 22:32:18 -08:00
ce0f335515 [PyTorch Mobile] Add an overload for deserialize() that doesn't accept the extra_files map. (#50932)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50932

After the change to split `_load_for_mobile()` into multiple methods, one which takes in the `extra_files` map, and one which doesn't, we can change the implementation of the `deserialize()` method with different overloads as well. Suggested by raziel on D25968216 (bb909d27d5).

ghstack-source-id: 120185089

Test Plan: Build/Sandcastle.

Reviewed By: JacobSzwejbka

Differential Revision: D26014084

fbshipit-source-id: 914142137346a6246def1acf38a3204dd4c4f52f
2021-01-22 21:54:24 -08:00
ab331da7ac Rewrite kron with broadcasting at::mul (#50927)
Summary:
Because it is shorter, faster, and does not have the TF32 issue.

Benchmark: https://github.com/zasdfgbnm/things/blob/master/2021Q1/kron.ipynb
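
The broadcasting formulation, sketched for 2-D inputs (the shipped implementation also handles batching and type promotion):
```
import torch

# kron(a, b)[i*p + k, j*q + l] == a[i, j] * b[k, l]
def kron2d(a, b):
    m, n = a.shape
    p, q = b.shape
    prod = a.unsqueeze(1).unsqueeze(3) * b.unsqueeze(0).unsqueeze(2)
    return prod.reshape(m * p, n * q)
```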

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50927

Reviewed By: glaringlee

Differential Revision: D26022385

Pulled By: ngimel

fbshipit-source-id: 513c9e9138c35c70d3a475a8407728af21321dae
2021-01-22 20:58:17 -08:00
789f6f1250 [FX] Minor docs changes (#50966)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50966

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D26029101

Pulled By: jamesr66a

fbshipit-source-id: 4374771be74d0a4d05fdd29107be5357130c2a76
2021-01-22 16:23:19 -08:00
5c1c858ca8 Revert D25977352: [pytorch][PR] Refactor mypy configs list into editor-friendly wrapper
Test Plan: revert-hammer

Differential Revision:
D25977352 (73dffc8452)

Original commit changeset: 4b3a5e8a9071

fbshipit-source-id: a0383ea4158f54be6f128b9ddb2cd12fc3a3ea53
2021-01-22 15:53:44 -08:00
ffc8a26991 philox_engine_inputs should also round increment to a multiple of 4 (#50916)
Summary:
`philox_engine_inputs()` is deprecated. Callers should refactor to use `philox_cuda_state()`, and as far as I know all call sites in aten have already been refactored. In the meantime, on behalf of other consumers (i.e., extensions, and possibly some lingering call sites in jit), `philox_engine_inputs` should handle the increment the same way `philox_cuda_state` does.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50916

Reviewed By: mrshenli

Differential Revision: D26022618

Pulled By: ngimel

fbshipit-source-id: 17178ad099ddc17d3596b9508ae4dce729b44f57
2021-01-22 15:51:15 -08:00
63838b9330 Turn on batched_grad testing for NewModuleTest (#50740)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50740

This PR adds a `check_batched_grad=True` option to
NewModuleTest-generated NN tests.

Test Plan: - run tests (`pytest test/test_nn.py -v -rf`)

Reviewed By: ejguan

Differential Revision: D25997679

Pulled By: zou3519

fbshipit-source-id: b75e73d7e86fd3af9bad6efed7127b36551587b3
2021-01-22 15:33:09 -08:00
de8cd6b201 [BE] Replace M_PI with c10::pi constexpr variable (#50819)
Summary:
Also, get rid of the MSVC-specific `_USE_MATH_DEFINES`.

Test at compile time that c10::pi<double> == M_PI.
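
The idea, sketched (the names and exact c10 definition are assumptions):
```
// A templated constexpr pi usable on every compiler without
// _USE_MATH_DEFINES, checked against the usual literal at compile time.
template <typename T>
constexpr T pi = static_cast<T>(3.14159265358979323846L);

static_assert(pi<double> == 3.14159265358979323846,
              "compile-time check mirroring the M_PI comparison above");
```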

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50819

Reviewed By: albanD

Differential Revision: D25976330

Pulled By: malfet

fbshipit-source-id: 8f3ddfd58a5aa4bd382da64ad6ecc679706d1284
2021-01-22 15:15:31 -08:00
a66851a2ad [FX] torch.fx.symbolic_trace patching improvements and math.* support (#50793)
Summary:
This contains some improvements and refactoring to how patching is done in `torch.fx.symbolic_trace`.

1) Functions from `math.*` are now supported without needing to call `torch.fx.wrap()`.  `wrap()` actually errors on some of these functions because they are written in C and don't have `__code__`, requiring use of the string version.  `math` usage is relatively common; for example, [BERT uses math.sqrt here](6f79061bd1/torchbenchmark/models/BERT_pytorch/bert_pytorch/model/attention/single.py (L16)).  Both `math.sqrt()` and `from math import sqrt` (copying to module namespace) are supported.  When modules are called, FX now searches the module's global scope to find methods to patch.  (See the sketch after this list.)

2) [Guarded behind `env FX_PATCH_GETITEM=1`] Fixes a failed trace of [PositionalEmbedding from BERT](6f79061bd1/torchbenchmark/models/BERT_pytorch/bert_pytorch/model/embedding/position.py (L24)), which failed to trace with the error `TypeError: slice indices must be integers or None or have an __index__ method` (a Proxy() is getting passed into `Tensor.__getitem__`).  See https://github.com/pytorch/pytorch/issues/50710 for why this is disabled by default.

3) Support for automatically wrapping methods that may have been copied to a different module scope via an import like `from foo import wrapped_function`.  This also isn't exposed in `torch.fx.wrap`, but is used to implement `math.*` support.
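
A sketch of point (1) in action (module and argument names are illustrative):
```
import math
import torch
import torch.fx

class ScaledDotScore(torch.nn.Module):
    def forward(self, scores, d_k):
        # math.sqrt on a traced value now records a call_function node
        # instead of raising, with no torch.fx.wrap() needed.
        return scores / math.sqrt(d_k)

gm = torch.fx.symbolic_trace(ScaledDotScore())
print(gm.graph)
```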

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50793

Test Plan: Added unittests to check each feature

Reviewed By: jamesr66a

Differential Revision: D25999788

Pulled By: jansel

fbshipit-source-id: f1ce11a69b7d97f26c9e2741c6acf9c513a84467
2021-01-22 15:05:24 -08:00
dd1c2a06b7 refactor profiling optional (#47667)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47667

Test Plan: Imported from OSS

Reviewed By: anjali411, ngimel

Differential Revision: D25255572

Pulled By: Krovatkin

fbshipit-source-id: d0152c9ef5b1994e27be9888bcb123dca3ecd88f
2021-01-22 14:45:28 -08:00
f0e72e54cc Fix CUDA RPC Stream Synchronization (#50949)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50949

When converting RPC Message into Python objects, we were not using
a CUDAFuture for the chained Future. As a result, the streams are
not synchronized when calling `rpc_async(...).wait()`. This commit
uses the `Future::then` API to create the chained Future, which
creates a CUDAFuture if the existing Future is a CUDA one.

fixes #50881
fixes #50839

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D26020458

Pulled By: mrshenli

fbshipit-source-id: 25195fbc10b99f4c401ec3ed7a382128464b5f08
2021-01-22 14:05:43 -08:00
78f30386c5 Implement Swish(SiLU) operator in FP16
Summary:
Used the Caffe2 Swish implementation to implement the operator. Will need
to resolve the error introduced.
```
test_quantized_swish_2D (tests.operators.testQuantizedSilu.TestSiLU) ... input:
 (tensor([[-6.0000, -5.9961, -5.9922,  ..., -5.7734, -5.7695, -5.7656],
        [-5.7617, -5.7539, -5.7500,  ..., -5.5352, -5.5312, -5.5234],
        [-5.5195, -5.5156, -5.5117,  ..., -5.2930, -5.2891, -5.2852],
        ...,
        [ 5.2852,  5.2891,  5.2930,  ...,  5.5117,  5.5156,  5.5195],
        [ 5.5234,  5.5312,  5.5352,  ...,  5.7500,  5.7539,  5.7617],
        [ 5.7656,  5.7695,  5.7734,  ...,  5.9922,  5.9961,  6.0000]]),)
base_res:
 tensor([[-0.0148, -0.0149, -0.0149,  ..., -0.0179, -0.0180, -0.0180],
        [-0.0181, -0.0182, -0.0182,  ..., -0.0218, -0.0218, -0.0220],
        [-0.0220, -0.0221, -0.0222,  ..., -0.0265, -0.0266, -0.0266],
        ...,
        [ 5.2585,  5.2625,  5.2665,  ...,  5.4895,  5.4935,  5.4975],
        [ 5.5015,  5.5094,  5.5134,  ...,  5.7318,  5.7357,  5.7437],
        [ 5.7476,  5.7516,  5.7555,  ...,  5.9773,  5.9812,  5.9852]])
tnco_res:
 tensor([[-0.0148, -0.0149, -0.0149,  ..., -0.0179, -0.0180, -0.0180],
        [-0.0181, -0.0182, -0.0182,  ..., -0.0218, -0.0218, -0.0220],
        [-0.0220, -0.0221, -0.0222,  ..., -0.0265, -0.0265, -0.0266],
        ...,
        [ 5.2578,  5.2617,  5.2656,  ...,  5.4922,  5.4922,  5.4961],
        [ 5.5000,  5.5078,  5.5156,  ...,  5.7305,  5.7383,  5.7422],
        [ 5.7461,  5.7500,  5.7539,  ...,  5.9766,  5.9805,  5.9844]])
nnpi_res:
 tensor([[-0.0148, -0.0149, -0.0149,  ..., -0.0179, -0.0180, -0.0180],
        [-0.0181, -0.0182, -0.0182,  ..., -0.0218, -0.0218, -0.0220],
        [-0.0220, -0.0221, -0.0222,  ..., -0.0265, -0.0266, -0.0266],
        ...,
        [ 5.2585,  5.2625,  5.2665,  ...,  5.4895,  5.4935,  5.4975],
        [ 5.5015,  5.5094,  5.5134,  ...,  5.7318,  5.7357,  5.7437],
        [ 5.7476,  5.7516,  5.7555,  ...,  5.9773,  5.9812,  5.9852]])
diff:
 tensor([[4.1956e-06, 9.8441e-07, 6.0154e-06,  ..., 4.2785e-06, 7.6480e-06,
         1.0842e-05],
        [1.3988e-06, 4.1034e-06, 6.5863e-06,  ..., 5.3961e-06, 2.9635e-06,
         1.0209e-05],
        [1.2219e-06, 7.9758e-06, 1.7386e-05,  ..., 3.0547e-07, 2.2141e-05,
         1.4316e-05],
        ...,
        [7.0286e-04, 7.8678e-04, 8.7023e-04,  ..., 2.6422e-03, 1.3347e-03,
         1.4052e-03],
        [1.4753e-03, 1.6141e-03, 2.2225e-03,  ..., 1.2884e-03, 2.5592e-03,
         1.4634e-03],
        [1.5216e-03, 1.5793e-03, 1.6365e-03,  ..., 6.9284e-04, 7.4100e-04,
         7.8964e-04]])
nnpi traced graph:
 graph(%self : __torch__.tests.operators.testQuantizedSilu.SiLUModel,
      %x : Float(*, *, requires_grad=0, device=cpu)):
  %3 : None = prim::Constant()
  %4 : bool = prim::Constant[value=0]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %5 : Device = prim::Constant[value="cpu"]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %6 : int = prim::Constant[value=0]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %7 : int = prim::Constant[value=6]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %8 : Float(*, *, requires_grad=0, device=cpu) = aten::zeros_like(%x, %7, %6, %5, %4, %3) # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %input : Float(*, *, requires_grad=0, device=cpu) = glow::FusionGroup_0(%x, %8)
  %10 : Tensor = aten::silu(%input) # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/torch/nn/functional.py:1804:0
  return (%10)
with glow::FusionGroup_0 = graph(%0 : Float(*, *, requires_grad=0, device=cpu),
      %1 : Float(*, *, requires_grad=0, device=cpu)):
  %2 : int = prim::Constant[value=1]()
  %input : Float(*, *, requires_grad=0, device=cpu) = aten::add(%0, %1, %2) # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %4 : int = prim::Constant[value=1]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  return (%input)

tnco traced graph:
 graph(%self : __torch__.tests.operators.testQuantizedSilu.___torch_mangle_0.SiLUModel,
      %x : Float(*, *, requires_grad=0, device=cpu)):
  %2 : int = prim::Constant[value=1]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %3 : None = prim::Constant()
  %4 : bool = prim::Constant[value=0]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %5 : Device = prim::Constant[value="cpu"]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %6 : int = prim::Constant[value=0]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %7 : int = prim::Constant[value=6]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %8 : Float(*, *, requires_grad=0, device=cpu) = aten::zeros_like(%x, %7, %6, %5, %4, %3) # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %12 : Tensor = fakeNNPI::addFP16(%x, %8, %2)
  %11 : Tensor = fakeNNPI::siluFP16(%12)
  return (%11)

FAIL

======================================================================
FAIL: test_quantized_swish_2D (tests.operators.testQuantizedSilu.TestSiLU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py", line 26, in test_quantized_swish_2D
    validate_nnpi_model(model, (x,), expected_ops, [])
  File "/data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/utils.py", line 73, in validate_nnpi_model
    assert is_equal
AssertionError
```

Test Plan:
Run the test with `buck test mode/dev //glow/fb/torch_glow/custom_nnpi_ops:testQuantizedSilu`

Reviewed By: hyuen

Differential Revision: D25981369

fbshipit-source-id: dd0f3686b3cbf6fc575c959c7661125ecbf0b0db
2021-01-22 13:57:54 -08:00
ca3ce77746 Dump torch::jit::AliasDb objects as Graphviz files (#50452)
Summary:
This PR adds a simple debugging helper which exports the AliasDb state as a [GraphViz](http://www.graphviz.org/) graph definition. The generated files can be viewed with any Graphviz viewer (including online based, for example http://viz-js.com)

Usage:

1. Call `AliasDb::dumpToGraphvizFile()` from a debugger. Using gdb for example:
`call aliasDb_->dumpToGraphvizFile("alias.dot")`

2. Add explicit calls to `AliasDb::dumpToGraphvizFile()`, which returns `true` if it succeeds.

An example output file is attached: [example.zip](https://github.com/pytorch/pytorch/files/5805840/example.zip)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50452

Reviewed By: ngimel

Differential Revision: D25980222

Pulled By: eellison

fbshipit-source-id: 47805a0a81ce73c6ba859340d37b9a806f9000d5
2021-01-22 13:38:47 -08:00
73dffc8452 Refactor mypy configs list into editor-friendly wrapper (#50826)
Summary:
Closes https://github.com/pytorch/pytorch/issues/50513 by resolving the first three checkboxes. If this PR is merged, I will also modify one or both of the following wiki pages to add instructions on how to use this `mypy` wrapper for VS Code editor integration:

- [Guide for adding type annotations to PyTorch](https://github.com/pytorch/pytorch/wiki/Guide-for-adding-type-annotations-to-PyTorch)
- [Lint as you type](https://github.com/pytorch/pytorch/wiki/Lint-as-you-type)

The test plan below is fairly manual, so let me know if I should add more automated tests to this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50826

Test Plan:
Unit tests for globbing function:
```
python test/test_testing.py TestMypyWrapper -v
```

Manual checks:

- Uninstall `mypy` and run `python test/test_type_hints.py` to verify that it still works when `mypy` is absent.
- Reinstall `mypy` and run `python test/test_type_hints.py` to verify that this didn't break the `TestTypeHints` suite.
- Run `python test/test_type_hints.py` again (should finish quickly) to verify that this didn't break `mypy` caching.
- Run `torch/testing/_internal/mypy_wrapper.py` on a few Python files in this repo to verify that it doesn't give any additional warnings when the `TestTypeHints` suite passes. Some examples (compare with the behavior of just running `mypy` on these files):
  ```sh
  torch/testing/_internal/mypy_wrapper.py README.md
  torch/testing/_internal/mypy_wrapper.py tools/fast_nvcc/fast_nvcc.py
  torch/testing/_internal/mypy_wrapper.py test/test_type_hints.py
  torch/testing/_internal/mypy_wrapper.py torch/random.py
  torch/testing/_internal/mypy_wrapper.py torch/testing/_internal/mypy_wrapper.py
  ```
- Remove type hints from `torch.testing._internal.mypy_wrapper` and verify that running `mypy_wrapper.py` on that file gives type errors.
- Remove the path to `mypy_wrapper.py` from the `files` setting in `mypy-strict.ini` and verify that running it again on itself no longer gives type errors.
- Add `test/test_type_hints.py` to the `files` setting in `mypy-strict.ini` and verify that running the `mypy` wrapper on it again now gives type errors.
- Remove type hints from `torch/random.py` and verify that running the `mypy` wrapper on it again now gives type errors.
- Add the suggested JSON from the docstring of `torch.testing._internal.mypy_wrapper.main` to your `.vscode/settings.json` and verify that VS Code gives the same results (inline, while editing any Python file in the repo) as running the `mypy` wrapper on the command line, in all the above cases.

Reviewed By: glaringlee, walterddr

Differential Revision: D25977352

Pulled By: samestep

fbshipit-source-id: 4b3a5e8a9071fcad65a19f193bf3dc7dc3ba1b96
2021-01-22 13:35:44 -08:00
7e10fbfb71 Add note about TCP init in RPC tests to contributing doc. (#50861)
Summary:
We added this option in https://github.com/pytorch/pytorch/pull/48248, but it would be good to document it somewhere as well, hence adding it to this contributing doc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50861

Reviewed By: mrshenli

Differential Revision: D26014505

Pulled By: rohan-varma

fbshipit-source-id: c1321679f01dd52038131ff571362ad36884510a
2021-01-22 13:28:03 -08:00
2ab497012f Add at::cpu namespace of functions for structured kernels (#49505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49505

I have a problem which is that static runtime needs a way to bypass
dispatch and call into kernels directly.  Previously, it used
native:: bindings to do this; but these bindings no longer exist
for structured kernels!  Enter at::cpu: a namespace of exactly
at:: compatible functions that assume all of their arguments are
CPU and non-autograd!  The header looks like this:

```
namespace at {
namespace cpu {

CAFFE2_API Tensor & add_out(Tensor & out, const Tensor & self, const Tensor & other, Scalar alpha=1);
CAFFE2_API Tensor add(const Tensor & self, const Tensor & other, Scalar alpha=1);
CAFFE2_API Tensor & add_(Tensor & self, const Tensor & other, Scalar alpha=1);
CAFFE2_API Tensor & upsample_nearest1d_out(Tensor & out, const Tensor & self, IntArrayRef output_size, c10::optional<double> scales=c10::nullopt);
CAFFE2_API Tensor upsample_nearest1d(const Tensor & self, IntArrayRef output_size, c10::optional<double> scales=c10::nullopt);
CAFFE2_API Tensor & upsample_nearest1d_backward_out(Tensor & grad_input, const Tensor & grad_output, IntArrayRef output_size, IntArrayRef input_size, c10::optional<double> scales=c10::nullopt);
CAFFE2_API Tensor upsample_nearest1d_backward(const Tensor & grad_output, IntArrayRef output_size, IntArrayRef input_size, c10::optional<double> scales=c10::nullopt);

}}
```

This slows down static runtime because these are not the "allow
resize of nonzero tensor" variant binding (unlike the ones I had manually
written).  We can restore this: it's a matter of adding codegen smarts to
do this, but I haven't done it just yet since it's marginally more
complicated.

In principle, non-structured kernels could get this treatment too.
But, like an evil mastermind, I'm withholding it from this patch, as an extra
carrot to get people to migrate to structured muahahahaha.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D25616105

Pulled By: ezyang

fbshipit-source-id: 84955ae09d0b373ca1ed05e0e4e0074a18d1a0b5
2021-01-22 13:11:59 -08:00
7b12893155 [BE] .gitignore adding test-reports/ folder (#50952)
Summary:
Can't think of a reason not to .gitignore the test-reports folder. This can be helpful when:
1. running `python test/test*.py` from the GitHub root directory, since it creates the folder at the root.
2. the CI test report path generated by `torch/testing/_internal/common_utils.py` creates the folder in the same path where the test Python file is located.

Creating a PR to make sure CI is happy. This is also needed by https://github.com/pytorch/pytorch/issues/50923

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50952

Reviewed By: samestep

Differential Revision: D26022436

Pulled By: walterddr

fbshipit-source-id: 83e6296de802bd1754b802b8c70502c317f078c9
2021-01-22 12:12:45 -08:00
a291b254ee Migrate masked_scatter_ CPU to ATen (#49732)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49541

Reference: https://github.com/pytorch/pytorch/issues/24507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49732

Reviewed By: ejguan

Differential Revision: D25991438

Pulled By: ngimel

fbshipit-source-id: a43bd0bfe043d8e32a6cadbbf736a0eaa697e7ec
2021-01-22 12:05:56 -08:00
db079a9877 Padding: support complex dtypes (#50594)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50594

Fixes #50234

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D25987316

Pulled By: anjali411

fbshipit-source-id: c298b771fe52b267a86938e886ea402badecfe3e
2021-01-22 11:57:42 -08:00
c908ebd4a1 [android] fix yuv conversion - remove define (#50951)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50951

Test Plan: Imported from OSS

Reviewed By: fmassa

Differential Revision: D26021488

Pulled By: IvanKobzarev

fbshipit-source-id: 6d295762bb1160a3ed8bafac08e03e1eeb07d688
2021-01-22 11:30:57 -08:00
8ab1a1495d Rename set_deterministic to use_deterministic_algorithms (#49904)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49100
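
A one-line sketch of the renamed API (hedged: the behavior note in the comment reflects current releases and may have differed slightly at the time of this commit):

```python
import torch

# New spelling introduced by this PR (previously torch.set_deterministic)
torch.use_deterministic_algorithms(True)
# Ops without a deterministic implementation will now raise an error
```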

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49904

Reviewed By: ezyang, mrshenli

Differential Revision: D25956761

Pulled By: mruberry

fbshipit-source-id: 86a59289d50825a0ebbd7c358b483c8d8039ffa6
2021-01-22 11:27:07 -08:00
7cb4712b38 count_nonzero with requires grad (#50866)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50792

Fixes `count_nonzero` for tensors with requires_grad, and also includes a test.
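
A minimal sketch of the fixed behavior (assuming, per the linked issue, that the call previously errored for tensors requiring grad):

```python
import torch

x = torch.randn(4, requires_grad=True)
# count_nonzero returns an integer count, so it is not differentiable,
# but calling it on a requires_grad tensor should not error.
n = torch.count_nonzero(x)
print(n)  # e.g. tensor(4)
```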

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50866

Reviewed By: ejguan

Differential Revision: D25996202

Pulled By: albanD

fbshipit-source-id: 61f2d7d62dd04e574a65ad03ef3a358b141fbae7
2021-01-22 11:19:59 -08:00
d5dc65a45c Document example of Proxy use (#50583)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50583

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26010501

Pulled By: ansley

fbshipit-source-id: 947121af7e57c16c96f849fbbb3fa83e97d003b2
2021-01-22 11:05:51 -08:00
89cafde8a4 Modernize for-loops (#50912)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50912

Test Plan: Sandcastle tests

Reviewed By: ansley

Differential Revision: D26001948

fbshipit-source-id: 3bfe6a8283a2b1882ed472f836ae1b6e720e519f
2021-01-22 10:53:24 -08:00
156da22566 [PyTorch] Eliminate static default_extra_files_mobile from header import.h (#50832)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50832

Please see the previous diff in this stack for the motivation to do so. This makes the same change but for the non-mobile codebase.
ghstack-source-id: 120184012

Test Plan: Sandcastle + Build

Reviewed By: raziel, iseeyuan

Differential Revision: D25979986

fbshipit-source-id: 7708f4f6a50cb16d7a23651e5655144d277d0a4f
2021-01-22 09:59:56 -08:00
d60d108280 [nnc] Expose fast tanh/sigmoid (#50736)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50736

Exposes tanh and sigmoid to other backends

Test Plan: buck test caffe2/test/cpp/tensorexpr:tensorexpr -- "ATen.fast"

Reviewed By: bertmaher

Differential Revision: D25884911

fbshipit-source-id: f9a5286450331f60935cfd40bb23f4a4f4c1d087
2021-01-22 09:56:02 -08:00
47f0bda3ef Improve complex support in common_nn test machinery (#50593)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50593

There are no equivalents of torch.FloatTensor / torch.cuda.FloatTensor for complex
types, so `get_gpu_type` and `get_cpu_type` are broken for complex tensors.

Also found a few places that explicitly cast inputs to floating point types,
which would drop the imaginary component before running the test.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25954050

Pulled By: mruberry

fbshipit-source-id: 1fa8e5af233aa095c839d5e2f860564baaf92aef
2021-01-22 09:44:45 -08:00
9ac30d96aa Add complex IValues (#50883)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50883

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26003682

Pulled By: anjali411

fbshipit-source-id: f02967d2d236d740cd8647891f732f1d63098d3e
2021-01-22 09:44:40 -08:00
002d978428 Sparse benchmarking utils (#48397)
Summary:
This is benchmarking tooling for working with sparse tensors. To implement it, we extended the benchmarking util from PR [https://github.com/pytorch/pytorch/issues/38338](https://github.com/pytorch/pytorch/pull/38338) to sparse tensors. To extend the proposed utility library, the **FuzzedTensor** class was subclassed into a new **FuzzedSparseTensor** class. In addition, two new operator classes were added: `UnaryOpSparseFuzzer` and `BinaryOpSparseFuzzer`.

The class `FuzzedSparseTensor` adds new input parameters to the constructor:
1. `sparse_dim`: The number of sparse dimensions in a sparse tensor.
2. `nnz`:   Number of non-zero elements in the sparse tensor.
3. `density`: The density of the sparse tensor.
4. `coalesced`: Whether the tensor is coalesced (the sparse format permits both coalesced and uncoalesced tensors).

and removes `probability_contiguous`, `max_allocation_bytes`, `roll_parameter`, `tensor_constructor`, as they are dense-tensor-related parameters.
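
These parameters map onto standard sparse COO tensor properties; a small illustration with a plain sparse tensor (not the fuzzer API itself):

```python
import torch

# A 2x3 tensor stored in COO format with sparse_dim=2 and nnz=3
i = torch.tensor([[0, 1, 1],
                  [2, 0, 2]])
v = torch.tensor([3.0, 4.0, 5.0])
t = torch.sparse_coo_tensor(i, v, size=(2, 3))

print(t.sparse_dim())        # 2
print(t._nnz())              # 3
print(t._nnz() / t.numel())  # density: 0.5
print(t.is_coalesced())      # False until coalesce() is called
print(t.coalesce().is_coalesced())  # True
```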

In addition, I've extended the `torch.utils.benchmark.examples` to work with the new classes `FuzzedSparseTensor`, `UnaryOpSparseFuzzer` and `BinaryOpSparseFuzzer`.

Hopefully, this tooling and these examples will help to make other benchmarks in other PRs. Looking forward to your thoughts and feedback. cc robieta, mruberry,  ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48397

Reviewed By: ejguan

Differential Revision: D26008137

Pulled By: mruberry

fbshipit-source-id: 2f37811c7c3eaa3494a0f2500e519267f2186dfb
2021-01-22 09:40:59 -08:00
0436ea125b OpInfo: Remove promotes_integers_to_float and infer it instead (#50279)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50279

This allows different sample inputs to have different behavior for the same
operator. For example, `div(..., rounding_mode='true')` promotes, but the other
rounding modes don't. The current boolean flag is too restrictive to allow this.
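
A sketch of the dtype difference (note: in later releases the 'true' mode is spelled `rounding_mode=None`):

```python
import torch

a = torch.tensor([7])
b = torch.tensor([2])
# True division promotes integer inputs to a floating point dtype...
print(torch.div(a, b).dtype)                         # torch.float32
# ...but the explicit rounding modes keep the integer dtype.
print(torch.div(a, b, rounding_mode="floor").dtype)  # torch.int64
print(torch.div(a, b, rounding_mode="trunc").dtype)  # torch.int64
```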

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25950011

Pulled By: mruberry

fbshipit-source-id: 7e82b82bedc626b2b6970d92d5b25676183ec384
2021-01-22 09:32:37 -08:00
4bbff92014 Refactor build targets for torch::deploy (#50288)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50288

torch::deploy will bundle the objects contained in libtorch-python together with frozenpython into a shared library.  Therefore, the libtorch-python objs can't bring with them a dependency on system python.

Buck TARGETS are added throughout the caffe2 tree to make available objects or headers that will be needed by torch::deploy but would have brought unsuitable dependencies if accessed using existing targets.

CMakeLists are modified to separate a torch-python-objs object library which lets torch::deploy compile these objs with the same compile flags as libtorch_python used, but without some of the link-time dependencies such as python.

CudaIPCTypes is moved from libtorch_python to libtorch_cuda because it is really not a python binding, and it statically registers a cuda_ipc_callback which would be duplicated if included in each copy of torch::deploy.

Test Plan: no new functionality, just ensure existing tests continue to pass

Reviewed By: malfet

Differential Revision: D25850785

fbshipit-source-id: b0b81c050cbee04e9de96888f8a09d29238a9db8
2021-01-22 09:16:32 -08:00
5f07b53ec2 [TensorExpr] Add LoopNest::simplify. (#50850)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50850

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D25985085

Pulled By: ZolotukhinM

fbshipit-source-id: e51709423c2c12b37b449a9d7bb22be04cda7ef1
2021-01-22 08:43:34 -08:00
2ba2ab9e46 [packaging] add support for BytesIO (#50838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50838

Similar to `torch.save` and `torch.jit.save`, accept an IO-like object instead of just a file.
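
A round-trip sketch with `io.BytesIO` (hedged: written against the current `torch.package` names, which may have differed slightly when this landed; the extern call marks `torch` itself as an external dependency):

```python
import io

import torch
from torch.package import PackageExporter, PackageImporter

buf = io.BytesIO()
with PackageExporter(buf) as exporter:
    exporter.extern(["torch", "torch.**"])  # don't package torch itself
    exporter.save_pickle("my_pkg", "obj.pkl", {"w": torch.randn(2)})

buf.seek(0)
obj = PackageImporter(buf).load_pickle("my_pkg", "obj.pkl")
print(obj["w"].shape)  # torch.Size([2])
```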

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D25982719

Pulled By: suo

fbshipit-source-id: 42f3665932bbaa6897215002d116df6338edae50
2021-01-22 08:33:39 -08:00
c7d348fea6 Turn on batched grad testing for non-autogenerated tests in test_nn.py (#50739)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50739

This does not turn on batched grad testing for autogenerated NewModuleTest
tests and CriterionTest tests. Those are coming later.

Test Plan: - run tests

Reviewed By: ejguan

Differential Revision: D25997677

Pulled By: zou3519

fbshipit-source-id: b4b2d68e0f99c3d573faf237e1e531d0b3fced40
2021-01-22 07:40:20 -08:00
b2e5617553 [ROCm] rename HIP_HCC_FLAGS to HIP_CLANG_FLAGS (#50917)
Summary:
ROCm 3.5 replaced hcc with hip-clang and deprecated HIP_HCC_FLAGS.
HIP_CLANG_FLAGS should be used moving forward. HIP_HCC_FLAGS will
be removed soon.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50917

Reviewed By: ejguan

Differential Revision: D26008094

Pulled By: walterddr

fbshipit-source-id: cfec4f96fbd9bd338834a841c37267f6a4703cab
2021-01-22 07:24:05 -08:00
8eb90d4865 Add Gaussian NLL Loss (#50886)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48520.

cc albanD (This is a clean retry PR https://github.com/pytorch/pytorch/issues/49807)
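
A usage sketch (`var` is the predicted, possibly heteroscedastic, variance):

```python
import torch
import torch.nn as nn

loss = nn.GaussianNLLLoss()
input = torch.randn(5, 2, requires_grad=True)  # predicted means
target = torch.randn(5, 2)                     # observed values
var = torch.ones(5, 2, requires_grad=True)     # predicted variance
out = loss(input, target, var)
out.backward()
```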

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50886

Reviewed By: ejguan

Differential Revision: D26007435

Pulled By: albanD

fbshipit-source-id: 88fe91b40dea6f72e093e6301f0f04fcc842d2f0
2021-01-22 06:56:49 -08:00
e34992ebee Set USE_KINETO=1 (#49897)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49897

Resend of https://github.com/pytorch/pytorch/pull/49201

Test Plan: see 49201

Reviewed By: malfet

Differential Revision: D25717102

Pulled By: ilia-cher

fbshipit-source-id: 5e794a7f5fe160ca64ac9d190c4fd3e8f1e443e6
2021-01-22 00:09:21 -08:00
7494f0233a snake_case FX IR names (#50876)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50876

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D26002640

Pulled By: ansley

fbshipit-source-id: 4de8a63ef227ae3d46fab231f739c8472289ca4d
2021-01-21 22:25:57 -08:00
7f22af13b9 Add alternative prettyprinting method to Graph (#50878)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50878

Test Plan: Imported from OSS

Reviewed By: SplitInfinity, eellison

Differential Revision: D26009183

Pulled By: ansley

fbshipit-source-id: 300913ea634d9a0e5b00deb831154ef126ad4180
2021-01-21 22:15:56 -08:00
d33cc4c01b Use quiet_NaN() in calc_digamma, not NAN (#50412)
Summary:
This not only specifies the data types of these NaNs, but also indicate
that the function isn't signaling anything unusual.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50412

Reviewed By: mrshenli

Differential Revision: D25899828

Pulled By: ezyang

fbshipit-source-id: a8ded10954ad08cba3098aa473c6b77f2e03dc93
2021-01-21 22:02:00 -08:00
bb909d27d5 [PyTorch Mobile] Eliminate static default_extra_files_mobile from header import.h (#50795)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50795

There's [a post](https://fb.workplace.com/groups/2148543255442743/permalink/2583012411995823/) about a customer having to pass in `-Wno-global-constructors` to disable warnings related to calling constructors for global objects. This is related to the initialization of `default_extra_files_mobile` in `import.h`.

It requires end users to pass in the compiler flag, since the definition is now in code (.cpp files) that they will be compiling.

In addition, it makes the API for `_load_for_mobile` non-re-entrant (i.e. can not be safely used concurrently from multiple threads without the caller taking a mutex/lock) if the `extra_files_mobile` argument is not explicitly passed in.

Instead, a better option would be to create different overloads; one which requires all 3 parameters, and one that can work with 1-2. This solves the problem without creating a static variable.

ghstack-source-id: 120127083

Test Plan: Build Lite Interpreter and sandcastle.

Reviewed By: raziel

Differential Revision: D25968216

fbshipit-source-id: fbd80dfcafb8ef7231aca301445c4a2ca9a08995
2021-01-21 21:22:48 -08:00
d46210958e Remove use_c10_dispatcher: full lines added in the last couple days (#50769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50769

A couple of new instances of these lines were added in the last couple of days, but they're not necessary anymore.
This PR removes them and also adds an assertion to make sure we don't add any more.
ghstack-source-id: 120133715

Test Plan: waitforsandcastle

Reviewed By: bhosmer

Differential Revision: D25961316

fbshipit-source-id: e2befc5b6215b42decb2acedcacfb50734857e2f
2021-01-21 20:35:26 -08:00
57fb2c0fcc [PPC] Add missing vec_[signed|neg|sldw] definitions (#50640)
Summary:
Based on quickwritereader's comment: https://github.com/pytorch/pytorch/issues/50439#issuecomment-760025933
Those builtins were only added in gcc-8 and newer.

Fixes https://github.com/pytorch/pytorch/issues/50439

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50640

Reviewed By: walterddr

Differential Revision: D25934384

Pulled By: malfet

fbshipit-source-id: b5dcfcf644ab92a78279c4dca5dbffbb8d8aae0c
2021-01-21 19:57:53 -08:00
533cb9530e Introducing TORCH_CUDA_CPP_API and TORCH_CUDA_CU_API to the code (#50627)
Summary:
Sub-step of my attempt to split up the torch_cuda library, as it is huge. Please look at https://github.com/pytorch/pytorch/issues/49050 for details on the split and which files are in which target.

This PR introduces two new macros for Windows DLL purposes, TORCH_CUDA_CPP_API and TORCH_CUDA_CU_API. Both are defined as TORCH_CUDA_API for the time being.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50627

Reviewed By: mruberry

Differential Revision: D25955441

Pulled By: janeyx99

fbshipit-source-id: ff226026833b8fb2fb7c77df6f2d6c824f006869
2021-01-21 19:09:11 -08:00
3aed177484 [PyTorch] inline Dispatcher::singleton (#50644)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50644

The dispatcher is a very hot code path; not inlining
`Dispatcher::singleton()` was hurting perf.

Test Plan:
Profiled our internal empty() benchmark. `perf stat` shows
about a 1.7% reduction in cycles spent; the benchmark's timing itself
shows a small reduction.

Reviewed By: dzhulgakov, bhosmer

Differential Revision: D25935275

fbshipit-source-id: a328f8ac8ea479bbe5c6ddb80f98838ae6058bbd
2021-01-21 19:01:16 -08:00
21c2542b6a Independent constraint (#50547)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/50496

This fixes a number of inconsistencies in torch.distributions.constraints as used for parameters and supports of probability distributions.
- Adds a `constraints.independent` and replaces `real_vector` with `independent(real, 1)`. (this pattern has long been used in Pyro)
- Adds an `.event_dim` attribute to all constraints.
- Tests that `constraint.check(data)` has the correct shape. (Previously the shapes were incorrect).
- Adds machinery to set static `.is_discrete` and `.event_dim` for `constraints.dependent`.
- Fixes constraints for a number of distributions.
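
A small illustrative sketch of the new constraint and its `event_dim`:

```python
import torch
from torch.distributions import constraints

# real_vector is now independent(real, 1): the last dim is an event dim
c = constraints.independent(constraints.real, 1)
print(c.event_dim)       # 1

x = torch.randn(3, 4)
print(c.check(x).shape)  # torch.Size([3]) -- one result per batch element
```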

## Tested
- added a new check to the constraints tests
- added a new check for `.event_dim`

cc fehiepsi feynmanliang stefanwebb

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50547

Reviewed By: VitalyFedyunin

Differential Revision: D25918330

Pulled By: neerajprad

fbshipit-source-id: a648c3de3e8704f70f445c0f1c39f2593c8c74db
2021-01-21 18:42:45 -08:00
5016637955 [FX] Update overview docstring (#50896)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50896

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D26002067

Pulled By: jamesr66a

fbshipit-source-id: 3b4d4b96017d16739a31f25a306f55b6f96324dc
2021-01-21 17:31:54 -08:00
eb0fe70680 [distributed_test]Enable disabled ROCm tests. (#50421)
Summary:
Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50421

Reviewed By: ejguan

Differential Revision: D26006844

Pulled By: zhaojuanmao

fbshipit-source-id: aa6ac5ee2d37f354d52328c72eb2cd23f5665f53
2021-01-21 17:22:40 -08:00
aa3c28a29e [static runtime] Shortcut resize_({0})
Summary:
We do a lot of resize_({0}) to force `out` operators to properly
resize their results, and `resize_` does a fair bit of extraneous work
(e.g. trip through dispatch, checks for memory_format and named tensors, etc.).
If we strip it down to the bare minimum it's just setting the sizes to 0, so
let's do that directly.

Test Plan:
Perf results suggest maybe a 1% win:
```
batch 20: P163138256 (large win, 1.7%, mostly in fb_fc_out)
batch 1: P163139591 (smaller win, 0.88%, mostly in resize_)
```

Reviewed By: swolchok

Differential Revision: D25932595

fbshipit-source-id: d306a0a15c0e1be12fde4a7f149e3ed35665e3c0
2021-01-21 17:08:47 -08:00
8e9ed27a53 install magma for cuda 11.2 in conda (#50559)
Summary:
This PR allows us to start adding CUDA 11.2 Linux tests onto CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50559

Reviewed By: ejguan

Differential Revision: D26007595

Pulled By: janeyx99

fbshipit-source-id: e179dbe54e9390899d556dd201a1a179b2399d20
2021-01-21 15:44:39 -08:00
137f2a385a [ONNX] Handle sequence output for models (#50599)
Summary:
Duplicate of https://github.com/pytorch/pytorch/issues/46542

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50599

Reviewed By: SplitInfinity

Differential Revision: D25928897

Pulled By: bzinodev

fbshipit-source-id: a898cef7b2d15a287aedd9798ce1423cebf378d4
2021-01-21 15:36:41 -08:00
c082e2184d Add autograd tests for complex matrix norm nuclear and +/-2 (#50746)
Summary:
Also upgrades `linalg.norm`'s autograd and jit tests to `OpInfo`

Fixes https://github.com/pytorch/pytorch/issues/48842
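
A sketch of the kind of case the new tests exercise (nuclear norm of a complex matrix, with a backward pass):

```python
import torch

A = torch.randn(3, 3, dtype=torch.complex128, requires_grad=True)
n = torch.linalg.norm(A, ord="nuc")  # nuclear norm; the result is real
n.backward()                         # exercises the complex autograd path
print(A.grad.shape)                  # torch.Size([3, 3])
```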

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50746

Reviewed By: mruberry

Differential Revision: D25968246

Pulled By: anjali411

fbshipit-source-id: d457069ddb4caf2a5caed1aa64c791ef0790952c
2021-01-21 15:33:08 -08:00
201f0c1fdf Automated submodule update: tensorpipe (#50895)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: ee15f7a7c5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50895

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: ejguan

Differential Revision: D26001623

fbshipit-source-id: 680d182ba5a6ce1d9cb2467136e8b27fe8266d0f
2021-01-21 15:28:22 -08:00
3cd8ed972a add and adjust kernel launch checks under fbcode/caffe2/caffe2/utils (#50862)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50862

Add all missing kernel launch checks for all .cu and .cuh files under caffe2/caffe2/utils.

Test Plan:
Building with `buck build //caffe2/caffe2:` gives no errors, and all tests pass with `buck test //caffe2/caffe2:`.

Ran the kernel-launch check to ensure that nothing shows up under `fbcode/caffe2/caffe2/utils`.

The PR on GitHub shows all tests passing: https://github.com/pytorch/pytorch/actions/runs/500036434

Reviewed By: r-barnes

Differential Revision: D25987367

fbshipit-source-id: 52add63a14f2da855c784ab24468f64056c93836
2021-01-21 15:20:55 -08:00
16691516a5 Add batched grad testing to OpInfo (#50818)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50818

This PR does two things:
1. Add batched grad testing to OpInfo
2. Improve the error message from `gradcheck` if batched gradient
computation fails to include suggestions for workarounds.

To add batched grad testing to OpInfo, this PR:
- adds new `check_batched_grad=True` and `check_batched_gradgrad=True`
attributes to OpInfo. These are True by default because we expect most
operators to support batched gradient computation.
- If `check_batched_grad=True`, then `test_fn_grad` invokes gradcheck
with `check_batched_grad=True`.
- If `check_batched_gradgrad=True`, then `test_fn_gradgradgrad` invokes
gradgradcheck with `check_batched_grad=True`.
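
Roughly, the underlying call then looks like this (a sketch using a simple elementwise op):

```python
import torch
from torch.autograd import gradcheck

x = torch.randn(3, dtype=torch.double, requires_grad=True)
# Also checks batched (vmap-style) gradients, not just the numeric Jacobian
assert gradcheck(torch.sin, (x,), check_batched_grad=True)
```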

The improved gradcheck error message looks like the following when an
exception is thrown while computing batched gradients:
https://gist.github.com/zou3519/5a0f46f908ba036259ca5e3752fd642f

Future
- Sometime in the not-near future, we will separate out "batched grad
testing" from "gradcheck" for the purposes of OpInfo to make the
testing more granular and also so that we can test that the vmap
fallback doesn't get invoked (currently batched gradient testing only
tests that the output values are correct).

Test Plan: - run tests `pytest test/test_ops.py -v -k "Gradients"`

Reviewed By: ejguan

Differential Revision: D25997703

Pulled By: zou3519

fbshipit-source-id: 6d2d444d6348ae6cdc24c32c6c0622bd67b9eb7b
2021-01-21 15:13:06 -08:00
1cce4c5eee Update Kineto revision (#50855)
Summary:
Update Kineto revision

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50855

Test Plan:
build with USE_KINETO=1
test/test_profiler.py

Reviewed By: gdankel

Differential Revision: D25987298

Pulled By: ilia-cher

fbshipit-source-id: d3f22832df74b2d14c338715e601f6f4bae85d6a
2021-01-21 14:34:57 -08:00
884fb48794 Miscellaneous batched grad testing (#50738)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50738

This PR adds batched grad testing for:
- test_linalg.py
- test_unary_ufuncs.py

Future:
- add batched grad testing for test_nn
- enable option for batched grad testing in OpInfo

Test Plan: - run tests

Reviewed By: ejguan

Differential Revision: D25997678

Pulled By: zou3519

fbshipit-source-id: 9a9f6694c041580061bd52b5e45661c872b0b761
2021-01-21 14:26:46 -08:00
8ede828df7 [te] Speed up relu on cpu
Summary:
We were implementing it using ifThenElse, which creates conditional
branches that complicate llvm's vectorization.  Using CompareSelect directly
yields clean vectorized code with nothing but vmovups and vmaxps.

Test Plan: Trivial benchmark shows 33% speedup on large tensors (256k elements).

Reviewed By: eellison

Differential Revision: D25986637

fbshipit-source-id: 72dd7776924f73c036d46dca30dff22404d86b82
2021-01-21 14:16:23 -08:00
98e2914614 [android] Fix YUV camera image to tensor (#50871)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50871

Issue: https://discuss.pytorch.org/t/trouble-with-yuv420-to-float-tensor-conversion/106721/3
Decoding was wrong and the result image had artifacts.

Testing:
Patch test_app with:
[input_tensor_to_bitmap.txt](https://github.com/pytorch/pytorch/files/5847553/input_tensor_to_bitmap.txt)

gradle -p android test_app:installMnetLocalCameraDebug -PABI_FILTERS=arm64-v8a

Before fix:
![before_yuv_fix](https://user-images.githubusercontent.com/6638825/105317604-63a35980-5b90-11eb-9609-2ed5818130bd.png)

After fix:
![after_yuv_fix](https://user-images.githubusercontent.com/6638825/105317643-70c04880-5b90-11eb-88b7-92dd90db8ed2.png)

Test Plan: Imported from OSS

Reviewed By: fmassa

Differential Revision: D25992519

Pulled By: IvanKobzarev

fbshipit-source-id: 4a46ed39c1cd70f8987fcc1023520e9659ae5d59
2021-01-21 13:53:57 -08:00
b5242d66b6 [quant][doc] Adding a table comparing eager and fx graph mode (#50413)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50413

Test Plan:
.

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D25886960

fbshipit-source-id: b99178d3900eedec920dbff28ab956f97be2661a
2021-01-21 13:43:42 -08:00
4d169258ef Revert D25976245: [pytorch][PR] Enable Skipped ROCM Tests in common_nn.py
Test Plan: revert-hammer

Differential Revision:
D25976245 (24a0272132)

Original commit changeset: 801032534f91

fbshipit-source-id: 561e6d761cb694451d5f87557b4f96f37d19dd90
2021-01-21 13:28:37 -08:00
4cca08368b Adds per-op microbenchmarks for NNC (#50845)
Summary:
Runs through the vast majority of the primitive ops that exist in NNC and benchmarks them against PyTorch ops on CPU. Dumps out a plot like this.

![nnc](https://user-images.githubusercontent.com/6355099/105247994-a854d380-5b43-11eb-9ac9-1ee779e5ab54.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50845

Reviewed By: ngimel

Differential Revision: D25989080

Pulled By: Chillee

fbshipit-source-id: 6d6a39eb06b3de9a999993224d5e718537c0c8c4
2021-01-21 13:21:01 -08:00
4ac489091a Improve call provenance during GraphModule scripting (#50538)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50538

Test Plan: Imported from OSS

Reviewed By: pbelevich, SplitInfinity

Differential Revision: D25935403

Pulled By: ansley

fbshipit-source-id: 2baf5e0ba0fa3918e645fc713a9e80d10bbc84e5
2021-01-21 12:03:19 -08:00
df96344968 [optimizer] refactor AdamW to use functional API (#50411)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50411

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D25932776

Pulled By: wanchaol

fbshipit-source-id: e8e1696b3390ba7909b36fd0107c58b892520432
2021-01-21 11:00:45 -08:00
ce1781d8db [optimizer] refactor RMSProp to use functional API (#50410)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50410

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D25932779

Pulled By: wanchaol

fbshipit-source-id: b0d6007ea83d77e2d70d04681163ea7e4632c5cd
2021-01-21 11:00:41 -08:00
d6fb27ce72 [optimizer] refactor Adadelta to use functional API (#50409)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50409

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D25932780

Pulled By: wanchaol

fbshipit-source-id: 2fc025f66a0e0863f21689892e19d8a5681f2f2f
2021-01-21 11:00:36 -08:00
a0cf5566d8 [optimizer] refactor SGD to use functional API (#45597)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45597

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D25932773

Pulled By: wanchaol

fbshipit-source-id: bc5f830d6812f847475b9bdcc67865d9968e3282
2021-01-21 10:57:08 -08:00
b96a6516a6 Add CPP Full Reduction Benchmarks. (#50193)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50193

* Supports aten, native reference, and NNC TE implementations.
* Supports functionality checks against aten, in addition to performance checks.

Test plans:

* After enabling "BUILD_TENSOREXPR_BENCHMARK" in CMakeLists.txt,
* bin/tensorexpr_bench --benchmark_filter=Reduce1D

Measurements:

On a Broadwell E5-2686 CPU,

Reduce1D/Torch/16777216            5638547 ns    5638444 ns        119 BYTES=11.902G/s
Reduce1D/Naive/16777216           19308235 ns   19308184 ns         36 BYTES=3.47567G/s
Reduce1D/NativeRfactor/16777216    8433348 ns    8433038 ns         85 BYTES=7.95785G/s
Reduce1D/NativeVector/16777216     5608836 ns    5608727 ns        124 BYTES=11.9651G/s
Reduce1D/NativeTiled/16777216      5550233 ns    5550221 ns        126 BYTES=12.0912G/s
Reduce1D/TeNaive/16777216         21451047 ns   21450752 ns         33 BYTES=3.12851G/s
Reduce1D/TeSplitTail/16777216     23701732 ns   23701229 ns         30 BYTES=2.83145G/s
Reduce1D/TeSplitMask/16777216     23683589 ns   23682978 ns         30 BYTES=2.83363G/s
Reduce1D/TeRfactorV2/16777216      5378019 ns    5377909 ns        131 BYTES=12.4786G/s

Result summary:

* The single-threaded performance with NNC TeRfactorV2 matches and exceeds the Aten and naive AVX2 counterparts.

Follow-up items:

* rfactor does not work well with split
* We don't have a multi-threaded implementation yet.
  * Missing "parallel" scheduling primitive, which is no different from what we need for pointwise ops.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D25821880

Pulled By: zheng-xq

fbshipit-source-id: 8df3f40d1eed8749c8edcaacae5f0544dbf6bed3
2021-01-21 10:00:50 -08:00
88b36230f5 Add full reduction benchmark. (#50057)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50057

As part of the effort to calibrate TE reduction performance, adding a full reduction benchmark.
Also add a "skip_input_transformation" option.
Fixed the other reduction benchmarks to accept the specific benchmarks that were listed.

Test plans:
* python -m benchmarks.tensorexpr --device=cpu --mode=fwd reduce_full
* python -m benchmarks.tensorexpr --device=cpu --mode=fwd reduce_full_fwd_cpu_16777216_s1
* python -m benchmarks.tensorexpr --device=cpu --mode=fwd reduce_full_fwd_cpu_16777216_s0
* python -m benchmarks.tensorexpr --device=cpu --mode=fwd reduce2d_inner
* python -m benchmarks.tensorexpr --device=cpu --mode=fwd reduce2d_inner_fwd_cpu_640_524288
* python -m benchmarks.tensorexpr --device=cpu --mode=fwd reduce2d_outer
* python -m benchmarks.tensorexpr --device=cpu --mode=fwd reduce2d_outer_fwd_cpu_640_524288

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D25774138

Pulled By: zheng-xq

fbshipit-source-id: fd4598e5c29991be476e42235a059e8021d4f083
2021-01-21 09:56:46 -08:00
24a0272132 Enable Skipped ROCM Tests in common_nn.py (#50753)
Summary:
Removed test_cuda=(not TEST_WITH_ROCM)
in common_nn.py to enable the skipped tests
for ROCM.

Signed-off-by: Arindam Roy <rarindam@gmail.com>

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50753

Reviewed By: mrshenli

Differential Revision: D25976245

Pulled By: ngimel

fbshipit-source-id: 801032534f911d24d231bc9f0d3235a4506412c0
2021-01-21 09:48:47 -08:00
480bb7d356 Automated submodule update: tensorpipe (#50807)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: 9f84778d47

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50807

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D25973473

fbshipit-source-id: 62a9808a6ce5e6c4b51fdf272b687118a8c116b8
2021-01-21 01:23:05 -08:00
439afda090 [Gradient Compression] Fix warm-start for PowerSGD layerwise compression (#50283)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50283

Realized that for the layerwise compression, the previous warm-start implementation only skips memory allocations, but does not skip filling random values for Qs.

Also fixed the unit test in distributed_test.py. Previously the process group was not created correctly, and no communication occurred in test_DistributedDataParallel_powerSGD_ddp_comm_hook.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120101220

Test Plan:
Verified the fix by adding some logging locally.

Also verified no NE diff on Ads 1x.

Reviewed By: rohan-varma

Differential Revision: D25846222

fbshipit-source-id: 1ebeeb55ceba64d4d904ea6ac1bb42b1b2241520
2021-01-20 22:31:44 -08:00
d0e942f9a7 [FX][docs] Add limitations of symbolic tracing (#50638)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50638

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D25933780

Pulled By: jamesr66a

fbshipit-source-id: 0aa97ea05203fbcb707b0e947a465e206104b7df
2021-01-20 21:42:16 -08:00
c88eed97c7 Make split_module results deterministic (#50470)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50470

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D25899130

Pulled By: ansley

fbshipit-source-id: 45d63992cbe17eb01f709d02800c2eef1bd2ad08
2021-01-20 21:35:04 -08:00
4954417163 CONTRIBUTING.md: add instructions on how to remote desktop into Windows CI (#50841)
Summary:
Adds a link to existing instructions in CONTRIBUTING.md, so those instructions are more visible to contributors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50841

Reviewed By: samestep

Differential Revision: D25983089

Pulled By: janeyx99

fbshipit-source-id: 0b777ec760765153c607515ab09441dd0cfddf3c
2021-01-20 18:46:56 -08:00
c945a5bb5e fix typo in quantized README.md (#50681)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50681

Reviewed By: ngimel

Differential Revision: D25978905

Pulled By: jerryzh168

fbshipit-source-id: e8bff59a7a6b2b6f79273c010c32480db0997e7d
2021-01-20 17:43:25 -08:00
7fdc6a27b8 Skip test_variant_consistency_eager_addr_cpu_bfloat16 (#50836)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50836

Fixes the broken master

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D25981125

Pulled By: anjali411

fbshipit-source-id: 4043b6a7287700c7c9f0ce703eef53bb666ff655
2021-01-20 16:03:00 -08:00
c147aa306c Use doctest directly to get docstring examples (#50596)
Summary:
This PR addresses [a two-year-old TODO in `test/test_type_hints.py`](12942ea52b/test/test_type_hints.py (L21-L22)) by replacing most of the body of our custom `get_examples_from_docstring` function with [a function from Python's built-in `doctest.DocTestParser` class](https://docs.python.org/3/library/doctest.html#doctest.DocTestParser.get_examples). This mostly made the parser more strict, catching a few errors in existing doctests:

- missing `...` in multiline statements
- missing space after `>>>`
- unmatched closing parenthesis
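
For reference, a sketch of the standard-library parser this now delegates to:

```python
import doctest

docstring = '''
Example::

    >>> x = [1, 2]
    >>> x + [3]
    [1, 2, 3]
'''
# Each Example carries .source, .want, .lineno, .indent, ...
for ex in doctest.DocTestParser().get_examples(docstring):
    print(repr(ex.source), "->", repr(ex.want))
```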

Also, as shown by [the resulting diff of the untracked `test/generated_type_hints_smoketest.py` file](https://pastebin.com/vC5Wz6M0) (also linked from the test plan below), this introduces a few incidental changes as well:

- standalone comments are no longer preserved
- indentation is now visually correct
- [`example_torch_promote_types`](4da9ceb743/torch/_torch_docs.py (L6753-L6772)) is now present
- an example called `example_torch_tensor___array_priority__` is added, although I can't tell where it comes from
- the last nine lines of code from [`example_torch_tensor_align_as`](5d45140d68/torch/_tensor_docs.py (L386-L431)) are now present
- the previously-misformatted third line from [`example_torch_tensor_stride`](5d45140d68/torch/_tensor_docs.py (L3508-L3532)) is now present

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50596

Test Plan:
Checkout the base commit, typecheck the doctests, and save the generated file:
```
$ python test/test_type_hints.py TestTypeHints.test_doc_examples
$ cp test/generated_type_hints_smoketest.py /tmp
```
Then checkout this PR, do the same thing, and compare:
```
$ python test/test_type_hints.py TestTypeHints.test_doc_examples
$ git diff --no-index {/tmp,test}/generated_type_hints_smoketest.py
```
The test should succeed, and the diff should match [this paste](https://pastebin.com/vC5Wz6M0).

Reviewed By: walterddr

Differential Revision: D25926245

Pulled By: samestep

fbshipit-source-id: 23bc379ff438420e556263c19582dba06d8e42ec
2021-01-20 15:55:36 -08:00
1bde5a216f [TensorExpr] Use wider type for scalars (#50774)
Summary:
Scalars have to be double / 64-bit integers to match eager semantics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50774

Test Plan: python test/test_jit_fuser_te.py -k TestTEFuser.test_clamp

Reviewed By: ngimel

Differential Revision: D25978214

Pulled By: asuhan

fbshipit-source-id: ba765b7d215239f2bf0f3d467e4dce876f7ccb91
2021-01-20 15:12:27 -08:00
24fd84313f [pytorch] fix ConstRefCType usage in codegen/api/native.py (#50742)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50742

Fixed the other usage of `BaseCType('const ...&)` on #49138.

Checked byte-for-byte compatibility of the codegen output.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25955565

Pulled By: ljk53

fbshipit-source-id: 83ebd6b039892b805444867ed97a6e2fa6e72225
2021-01-20 15:01:37 -08:00
44922f26f5 Add support for NCCL alltoall (#44374)
Summary:
In https://github.com/pytorch/pytorch/issues/42514, NCCL `alltoall_single` was already added. This PR adds NCCL `alltoall`.

The difference between `alltoall_single` and `alltoall` is: `alltoall_single` works on a single tensor and sends/receives slices of that tensor, while `alltoall` works on a list of tensors and sends/receives the tensors in that list.
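
A per-rank sketch contrasting the two collectives (assumes `dist.init_process_group("nccl", ...)` has already run on every rank):

```python
import torch
import torch.distributed as dist

rank, world = dist.get_rank(), dist.get_world_size()
device = torch.device("cuda", rank)

# all_to_all: exchanges a *list* of tensors, one per peer
inputs = [torch.full((2,), float(rank), device=device) for _ in range(world)]
outputs = [torch.empty(2, device=device) for _ in range(world)]
dist.all_to_all(outputs, inputs)

# all_to_all_single: exchanges equal slices of a *single* tensor
inp = torch.full((2 * world,), float(rank), device=device)
out = torch.empty_like(inp)
dist.all_to_all_single(out, inp)
```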

cc: ptrblck ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44374

Reviewed By: zhangguanheng66, mrshenli

Differential Revision: D24455427

Pulled By: srinivas212

fbshipit-source-id: 42fdebdd14f8340098e2c34ef645bd40603552b1
2021-01-20 14:57:12 -08:00
87fb3707d9 ZeroRedundancyOptimizer: an implementation of a standalone sharded optimizer wrapper (#46750)
Summary:
Implement the first stage of ZeRO, sharding of the optimizer state, as described in [this blog post](https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/) and [this paper](https://arxiv.org/abs/1910.02054). This implementation is completely independent from the [DeepSpeed](https://github.com/microsoft/DeepSpeed) framework, and aims at providing ZeRO-compliant building blocks within the PyTorch scheme of things.

This works by:
- acting as a wrapper to a pytorch optimizer. ZeROptimizer does not optimize anything by itself, it only shards optimizers for distributed jobs
- each rank distributes parameters according to a given partitioning scheme (could be updated), and owns the update of a given shard only
- the .step() is called on each rank as expected, the fact that the optimizer actually works on a shard of the model is not visible from the outside
- when the update is completed, each rank broadcasts the updated model shard to all the other ranks
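
As a rough per-rank usage sketch (hedged: assumes a process group is already initialized, and uses the `optimizer_class` kwarg name from the eventual public API, which may have differed when this first landed):

```python
import torch
from torch.distributed.optim import ZeroRedundancyOptimizer

# assumes dist.init_process_group(...) has already run on every rank
model = torch.nn.Linear(8, 4)
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.SGD,  # the wrapped, per-shard optimizer
    lr=0.01,
)
loss = model(torch.randn(2, 8)).sum()
loss.backward()
optimizer.step()  # updates this rank's shard, then broadcasts the new params
```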

This can be used with DDP, although some communications are wasted in that case (gradients are all-reduced to all ranks). This implementation was initially developed in [Fairscale](https://github.com/facebookresearch/fairscale), and can also be used with an optimized DDP which only reduces to the relevant ranks. More context on ZeRO and PyTorch can be found in [this RFC](https://github.com/pytorch/pytorch/issues/42849)

The API with respect to loading and saving the state is a known pain point and should probably be discussed and updated. Other possible follow ups include integrating more closely with a [modularized DDP](https://github.com/pytorch/pytorch/issues/37002), [making the checkpoints partition-agnostic](https://github.com/facebookresearch/fairscale/issues/164), [exposing a gradient clipping option](https://github.com/facebookresearch/fairscale/issues/98) and making sure that mixed precision states are properly handled.

original authors include msbaines, min-xu-ai and myself

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46750

Reviewed By: mruberry

Differential Revision: D25958918

Pulled By: blefaudeux

fbshipit-source-id: 14280f2fd90cf251eee8ef9ac0f1fa6025ae9c50
2021-01-20 14:36:16 -08:00
c3e3e60657 Add cloud-tpu-client to xla CI. (#50823)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50823

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D25976931

Pulled By: ailzhang

fbshipit-source-id: f29c24c232944a103b59d9fea9b1c19969a7821b
2021-01-20 13:44:49 -08:00
be7e9845a1 Remove gtest_prod.h from TP agent. (#50766)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50766

This header breaks certain builds since it causes PyTorch to depend
on gtest.
ghstack-source-id: 119991167

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D25960810

fbshipit-source-id: ceaaad499f6f363ef35c6623475ae8f191d86171
2021-01-20 13:15:48 -08:00
ac8e90fa6d quantization: Linear + BatchNorm1d fusion (#50748)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50748

Adds support for Linear + BatchNorm1d fusion to quantization.
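
A user-side sketch of the new fusion (eval mode, since folding BatchNorm statistics requires them to be fixed; the module names "0"/"1" come from the `nn.Sequential` container):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4)).eval()
# Folds the BatchNorm1d statistics into the Linear weight/bias
fused = torch.quantization.fuse_modules(model, [["0", "1"]])

x = torch.randn(2, 4)
assert torch.allclose(model(x), fused(x), atol=1e-5)
```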

This is a redo of dreiss's https://github.com/pytorch/pytorch/pull/37467, faster
to copy-paste it than rebase and deal with conflicts.

Test Plan:
```
python test/test_quantization.py TestFusion.test_fusion_linear_bn_eval
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D25957432

fbshipit-source-id: 24e5b760f70186aa953ef65ab0182770e89495e4
2021-01-20 12:59:02 -08:00
db86dd8ad7 Fix replication_pad for cuda launch configuration (#50565)
Summary:
Fix https://github.com/pytorch/pytorch/issues/49601

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50565

Reviewed By: mruberry

Differential Revision: D25968843

Pulled By: ngimel

fbshipit-source-id: 6d2d543132b501765e69b52caaa283fb816db276
2021-01-20 11:52:12 -08:00
cf1882adeb Fix indexing for overrides. (#49324)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49324

Reviewed By: mruberry

Differential Revision: D25959334

Pulled By: ezyang

fbshipit-source-id: bac48b8ffee89d10aa04c004de2b53b4e54a96c2
2021-01-20 11:34:02 -08:00
16faabe7f0 [ROCm] re-enable tests (#50691)
Summary:
Signed-off-by: Kyle Chen <kylechen@amd.com>

cc: jeffdaily

re-enable test_torch.py and test_unary_ufuncs.py tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50691

Reviewed By: mruberry

Differential Revision: D25967842

Pulled By: ngimel

fbshipit-source-id: dc0f6cb68fe4d151c2719bdf67ead96e1396acf2
2021-01-20 11:23:39 -08:00
fbf7eec86d Update JIT_OPT macro for easier use (#50602)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50602

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D25931371

fbshipit-source-id: cf6bc58c419a1dc0018639596b304a3a05e38360
2021-01-20 11:15:20 -08:00
112a583467 Enable TensorPipe's CMA channel (#50759)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50759

ghstack-source-id: 120032288

Test Plan: Exported to CircleCI and tested

Reviewed By: mrshenli

Differential Revision: D25959326

fbshipit-source-id: be6df209ff3a79a8961acbda64ee7805a5c434a9
2021-01-20 10:53:47 -08:00
c18403a693 [metal] Use MPSCNN kernels for binary elementwise ops
Summary: Previously, binary elementwise kernels such as add, sub, and mul were implemented with custom shaders. However, MPSCNN has kernels for these operations for iOS >=11.3. Update these ops to use MPSCNN kernels instead of shader implementations.

Test Plan:
Test on device:
`arc focus2 pp-ios`

Test on mac
`buck test pp-macos`

Reviewed By: xta0

Differential Revision: D25953986

fbshipit-source-id: 3acac3fa7dbe70f92572c21c0f0cfcdedfcdcf23
2021-01-20 10:41:35 -08:00
1e0809dbf9 [PyTorch] Remove CAFFE2_FB_LIMITED_MOBILE_CAPABILITY (#50385)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50385

We no longer use this flag internally, and it's not referenced externally either, so let's clean up.
ghstack-source-id: 119676743

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D25852220

fbshipit-source-id: a4427edff6cbb241340f9f6ae6db4e74832949c2
2021-01-20 10:26:54 -08:00
4f3cdd971c Fix test_dispatch.py when running with TORCH_SHOW_CPP_STACKTRACES=1 (#50509)
Summary:
`test_dispatch.py` has many asserts about the error message. When running with `TORCH_SHOW_CPP_STACKTRACES=1`, the error message is different from when `TORCH_SHOW_CPP_STACKTRACES=0`, which makes many tests in `test_dispatch.py` fail. This PR fixes these failures when running with `TORCH_SHOW_CPP_STACKTRACES=1`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50509

Reviewed By: ngimel

Differential Revision: D25956853

Pulled By: ezyang

fbshipit-source-id: 3b3696742a7dfb8f52f23a364838ec96945c5662
2021-01-20 10:15:01 -08:00
f1c578594b JIT Testing: Improve assertAutodiffNode error message (#50626)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50626

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D25932184

Pulled By: Lilyjjo

fbshipit-source-id: 6fa5a652eb1a0c10bb9d9040b9a708fdf93aaf46
2021-01-20 10:05:52 -08:00
1cc8f8a750 Add complex autograd support and OpInfo based test for torch.addr (#50667)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50667

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D25957584

Pulled By: anjali411

fbshipit-source-id: a6b2880971027389721f4e051009b7d9694f979b
2021-01-20 09:43:13 -08:00
66adfcd258 tools: Move sha check to else statement (#50773)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50773

Moves the sha check for version generation to the else clause
since it was causing issues for users building pytorch when the .git
directory was not present and PYTORCH_BUILD_VERSION was already set

Test Plan:
CI

Closes https://github.com/pytorch/pytorch/issues/50730
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Reviewed By: janeyx99

Differential Revision: D25963486

Pulled By: seemethere

fbshipit-source-id: ce1b315f878d074f2ffb6b658d59cbd13150f27f
2021-01-20 09:34:43 -08:00
e1bb476980 Issue #48724. Only set the CMake IMPORTED_LOCATION property in static… (#49173)
Summary:
… library builds, as it is already set in shared library builds from the target that was imported from Caffe2.

This was identified on Windows builds when PyTorch was built in shared Release mode, and a testapp was built with RelWithDebInfo in CMake.
The problem appeared to be that because IMPORTED_LOCATION (in TorchConfig.cmake) and IMPORTED_LOCATION_RELEASE (in Caffe2Targets.cmake) were both set, the build became confused about which one was correct. The symptom is the error:

ninja: error: 'torch-NOTFOUND', needed by 'test_pytorch.exe', missing and no known rule to make it

in a noddy consuming test application.

Fixes https://github.com/pytorch/pytorch/issues/48724

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49173

Reviewed By: malfet

Differential Revision: D25974151

Pulled By: ezyang

fbshipit-source-id: 3454c0d29cbbe7a37608beedaae3efbb624b0479
2021-01-20 09:23:27 -08:00
22902b9242 [WIP] JIT Static Hooks: cpp tests (#49547)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49547

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D25771118

Pulled By: Lilyjjo

fbshipit-source-id: cd8a58ff008a1c5d65ccbfbcbcb0214781ece16f
2021-01-20 09:12:57 -08:00
3b88e1b0e7 [WIP] JIT Static Hooks: python tests (#49546)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49546

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D25771119

Pulled By: Lilyjjo

fbshipit-source-id: bf8a8e20f790691d3ff58fa9c8d0d9ab3e8322c4
2021-01-20 09:12:53 -08:00
0eb41e67fe [WIP] JIT Static Hooks: serialization logic (#49545)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49545

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D25771121

Pulled By: Lilyjjo

fbshipit-source-id: fe08936d601618010b9c64e2bb769e0b67cb7187
2021-01-20 09:12:49 -08:00
9c49457233 [WIP] JIT Static Hooks: schema checking logic (#49975)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49975

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D25771120

Pulled By: Lilyjjo

fbshipit-source-id: 262892cec45b6894bd8c0c20b9cfee43065abc7c
2021-01-20 09:12:45 -08:00
a722d28ef0 [WIP] JIT Static Hooks: adding hooks to class type and adding logic for hook running/compilation (#49544)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49544

Implementation of design laid out in: https://fb.quip.com/MY9gAqlroo0Z

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D25771122

Pulled By: Lilyjjo

fbshipit-source-id: dc4a8461f71c58ae75144ca1477cd1c0e9f0f325
2021-01-20 09:09:30 -08:00
1f5c3b3aae Revert D25958987: [pytorch][PR] Add type annotations to torch.overrides
Test Plan: revert-hammer

Differential Revision:
D25958987 (2ace4fc01e)

Original commit changeset: aadc065c489b

fbshipit-source-id: efd8b7c3cbe03d5ab0afa0d7c695182623285a3a
2021-01-20 08:59:44 -08:00
4a8ef4525e Add new backend type for Intel heterogeneous computation platform. (#49786)
Summary:
Add a new device type 'XPU' ('xpu' for lower case) to PyTorch. Changes are needed for code related to device model and kernel dispatch, e.g. DeviceType, Backend and DispatchKey etc.

https://github.com/pytorch/pytorch/issues/48246
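After this change the device type parses like any other; a minimal illustration (allocating tensors on it still requires a build with the XPU backend):
```python
import torch

d = torch.device("xpu")  # 'xpu' is now a recognized device type
print(d.type)            # xpu
# x = torch.ones(2, 2, device=d)  # needs an XPU-enabled build
```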

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49786

Reviewed By: mrshenli

Differential Revision: D25893962

Pulled By: ezyang

fbshipit-source-id: 7ff0a316ee34cf0ed6fc7ead08ecdeb7df4b0052
2021-01-20 08:15:18 -08:00
a3b8cbcdfc Let TensorPipe detect peer access (#50676)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50676

Test Plan: Imported from OSS

Reviewed By: beauby

Differential Revision: D25941962

Pulled By: mrshenli

fbshipit-source-id: 7d4fd3b4fbd5ae5a0c50ad65605ced9db10ede4a
2021-01-20 08:04:51 -08:00
4803eaf502 Implement NumPy-like function torch.fmax() & torch.fmin() (#49312)
Summary:
- Implements the NumPy-like functions `torch.fmax()` and `torch.fmin()` recommended in https://github.com/pytorch/pytorch/issues/48440
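A quick illustration of the NaN handling these ops share with their NumPy counterparts:
```python
import torch

a = torch.tensor([1.0, float("nan"), 3.0])
b = torch.tensor([2.0, 0.0, float("nan")])
torch.fmax(a, b)  # tensor([2., 0., 3.]) -- NaN is ignored when the other operand is a number
torch.fmin(a, b)  # tensor([1., 0., 3.])
```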

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49312

Reviewed By: izdeby

Differential Revision: D25887246

Pulled By: heitorschueroff

fbshipit-source-id: d762eeff8b328bfcbe7d48b7ee9d2da72c249691
2021-01-20 06:45:25 -08:00
2ace4fc01e Add type annotations to torch.overrides (#48493)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48493

Reviewed By: mruberry

Differential Revision: D25958987

Pulled By: ezyang

fbshipit-source-id: aadc065c489bf1a8c6258de14c930e396df763bc
2021-01-20 06:32:22 -08:00
4aea007351 [JIT] Fix archive file extension in examples and docs (#50649)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50649

**Summary**
Tutorials, documentation, and comments are inconsistent in the file
extension they use for JIT archives. This commit replaces certain
instances of `*.pth` in `torch.jit.save` calls with `*.pt`.

**Test Plan**
Continuous integration.

**Fixes**
This commit fixes #49660.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D25961628

Pulled By: SplitInfinity

fbshipit-source-id: a40c97954adc7c255569fcec1f389aa78f026d47
2021-01-20 02:04:46 -08:00
e00966501b [quant] Add non-fbgemm fallback implementation for embedding lookup ops (#50706)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50706

Add a default CPU implementation for quantized embedding lookup operators.
This should enable the ops to execute on mobile as well, where we don't have fbgemm.

Test Plan:
python test/test_quantization.py
and CI tests

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D25956842

fbshipit-source-id: 07694888e5e1423b496af1a51494a49558e82152
2021-01-19 23:56:26 -08:00
5205cc1c62 [FX] Fix NoneType annotation in generated code (#50777)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50777

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D25966026

Pulled By: jamesr66a

fbshipit-source-id: 8e36521eee03eade7e1b602e801229c085b03488
2021-01-19 23:16:58 -08:00
8f5ad00e13 [JIT] Print out CU address in ClassType::repr_str() (#50194)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50194

**Summary**
`ClassType::repr_str()` prints out only the name of a `ClassType`, which
is not always enough to disambiguate it. In some situations, two
`ClassTypes` are compared and do not match despite having identical
names because they are in separate compilation units. In such cases, the
error message can seem nonsensical (e.g. `expected type T but found type
T`). This commit modifies `ClassType::repr_str()` so that it prints out
the address of the type's compilation unit to make these messages less
puzzling (e.g. `expected type T (0x239023) but found type T (0x230223)`).

**Test Plan**
This commit adds a unit test, `ClassTypeTest.IdenticalTypesDifferentCus`
that reproduces this situation.

**Fixes**
This commit fixes #46212.

Test Plan: Imported from OSS

Reviewed By: tugsbayasgalan

Differential Revision: D25933082

Pulled By: SplitInfinity

fbshipit-source-id: ec71b6728be816edd6a9c2b2d5075ead98d8bc88
2021-01-19 23:04:30 -08:00
dea9af5c06 Cat benchmark: use mobile feed tensor shapes and torch.cat out-variant (#50778)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50778

- use tensor shapes from the ctr_mobilefeed merge net
- use the PyTorch cat out-variant for a fairer comparison; otherwise the benchmark includes the time to construct the result tensor
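A minimal sketch of the out-variant being benchmarked (shapes borrowed from the first PyTorch benchmark case below):
```python
import torch

xs = [torch.randn(1, 160), torch.randn(1, 14)]
out = torch.empty(1, 174)
# Writing into `out` keeps result-tensor allocation out of the timed region.
torch.cat(xs, dim=1, out=out)
```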

Test Plan:
turbo off, devbig machine
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/c2/concat_test.par --tag_filter=static_runtime
```
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : static_runtime

# Benchmarking Caffe2: concat
# Name: concat_sizes(1,40)_N5_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: (1, 40), N: 5, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 0.619

# Benchmarking Caffe2: concat
# Name: concat_sizes[(1,160),(1,14)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(1, 160), (1, 14)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 0.369

# Benchmarking Caffe2: concat
# Name: concat_sizes[(1,20,40),(1,4,40),(1,5,40)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(1, 20, 40), (1, 4, 40), (1, 5, 40)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 0.590

# Benchmarking Caffe2: concat
# Name: concat_sizes[(1,580),(1,174)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(1, 580), (1, 174)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 0.412

# Benchmarking Caffe2: concat
# Name: concat_sizes(20,40)_N5_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: (20, 40), N: 5, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 2.464

# Benchmarking Caffe2: concat
# Name: concat_sizes[(20,160),(20,14)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(20, 160), (20, 14)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 1.652

# Benchmarking Caffe2: concat
# Name: concat_sizes[(20,20,40),(20,4,40),(20,5,40)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(20, 20, 40), (20, 4, 40), (20, 5, 40)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 9.312

# Benchmarking Caffe2: concat
# Name: concat_sizes[(20,580),(20,174)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(20, 580), (20, 174)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 6.532
```
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/cat_test.par --tag_filter=static_runtime
```
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : static_runtime

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,160),(1,14)]_N-1_dim1_cpu
# Input: sizes: [(1, 160), (1, 14)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 3.313

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,20,40),(1,4,40),(1,5,40)]_N-1_dim1_cpu
# Input: sizes: [(1, 20, 40), (1, 4, 40), (1, 5, 40)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 3.680

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,580),(1,174)]_N-1_dim1_cpu
# Input: sizes: [(1, 580), (1, 174)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 3.452

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,160),(20,14)]_N-1_dim1_cpu
# Input: sizes: [(20, 160), (20, 14)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 4.653

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,20,40),(20,4,40),(20,5,40)]_N-1_dim1_cpu
# Input: sizes: [(20, 20, 40), (20, 4, 40), (20, 5, 40)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 7.364

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,580),(20,174)]_N-1_dim1_cpu
# Input: sizes: [(20, 580), (20, 174)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 7.055
```

Reviewed By: hlu1

Differential Revision: D25839036

fbshipit-source-id: 7a6a234f41dfcc56246a80141fe0c84f769a5a85
2021-01-19 22:50:28 -08:00
06c734d8c7 Generalize sum_intlist and prod_intlist, clean up dimensionality functions (#50495)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50495

Test Plan:
```
buck test mode/opt //caffe2/c10:c10_test_0
```

Reviewed By: ngimel

Differential Revision: D25902853

fbshipit-source-id: a7d30251ca443df57dd8005ed77dba7b2f1002d4
2021-01-19 22:35:55 -08:00
47c57b8836 Fix Native signature for optional Tensor arguments (#50767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50767

The native signature for optional tensor arguments wrongly produced "Tensor" instead of "optional<Tensor>". We didn't notice this because all internal ops currently use hacky_wrapper, and for hacky_wrapper, "Tensor" is correct.

This PR fixes that and ports one op (batch_norm) to not use hacky_wrapper anymore as a proof of fix.
ghstack-source-id: 120017543

Test Plan: waitforsandcastle

Reviewed By: bhosmer

Differential Revision: D25960941

fbshipit-source-id: ca6fe133109b5d85cff52390792cf552f12d9590
2021-01-19 21:55:46 -08:00
cebab83d3f Fix USE_MKLDN defaults (#50782)
Summary:
Fixes a regression introduced by https://github.com/pytorch/pytorch/pull/50400.
The semantics of `cmake_dependent_option` are as follows (see https://cmake.org/cmake/help/v3.19/module/CMakeDependentOption.html):
`cmake_dependent_option(<option> "<help_text>" <value> <depends> <force>)`
I.e. `<depends>` should be true for CPU_INTEL or CPU_AARCH64, but the default `<value>` should be ON only if CPU_INTEL is true.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50782

Reviewed By: xuzhao9

Differential Revision: D25966509

Pulled By: malfet

fbshipit-source-id: c891cd9234311875762403f7125d0c3803bb0e65
2021-01-19 21:41:53 -08:00
4ff1823fac Add Sparse support for torch.sqrt (#50088)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50088

Reviewed By: mrshenli

Differential Revision: D25894003

Pulled By: ezyang

fbshipit-source-id: 93688c33b2f9a355c331d6edb3e402935223f75b
2021-01-19 20:19:07 -08:00
38c45bdd2d [FX] Fix tracing a free function with embedded constant (#50639)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50639

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D25934142

Pulled By: jamesr66a

fbshipit-source-id: de9053d4f92a7a2f4f573378837ff5ae78e539b1
2021-01-19 19:20:34 -08:00
7526e38cd3 Revert "Stable sort for CPU (#50052)" (#50752)
Summary:
This reverts commit c99f35605105f7366bcf4709df534da3ceab9a15.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50752

Reviewed By: zou3519

Differential Revision: D25958146

Pulled By: glaringlee

fbshipit-source-id: f4068d038f9bd337bac8b673eaeb46a4646f6c77
2021-01-19 18:21:25 -08:00
08c90d9e55 Automated submodule update: tensorpipe (#50765)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: 6c8ed2e6f7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50765

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: mrshenli

Differential Revision: D25960813

fbshipit-source-id: 80b4e48ef04f22f750a2eb049f5f7114715c0a1e
2021-01-19 17:29:00 -08:00
327539ca79 Fix bug in hipify if include_dirs is not specified in setup.py (#50703)
Summary:
Bugs:
1) it would introduce `-I*` in compile commands
2) it wouldn't hipify source code directly in build_dir, only one level down or more

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50703

Reviewed By: mrshenli

Differential Revision: D25949070

Pulled By: ngimel

fbshipit-source-id: 018c2a056b68019a922e20e5db2eb8435ad147fe
2021-01-19 16:30:17 -08:00
526659db20 whitelist ops we can build shapes for (#49125)
Summary:
Whitelist ops we can build shapes for.
Otherwise, `buildShapeExpressions` assumes that `aten::unsqueeze` is just a regular op.

```
[DUMP tensorexpr_fuser.cpp:329] buildShapeExpressions for
[DUMP tensorexpr_fuser.cpp:329] graph(%1 : float,
[DUMP tensorexpr_fuser.cpp:329]       %3 : Float(50, 28, strides=[28, 1], requires_grad=0, device=cuda:0),
[DUMP tensorexpr_fuser.cpp:329]       %8 : float,
[DUMP tensorexpr_fuser.cpp:329]       %10 : Float(50, strides=[1], requires_grad=0, device=cuda:0)):
[DUMP tensorexpr_fuser.cpp:329]   %11 : int = prim::Constant[value=1]()
[DUMP tensorexpr_fuser.cpp:329]   %12 : Float(50, 1, strides=[1, 1], requires_grad=0, device=cuda:0) = aten::unsqueeze(%10, %11)
[DUMP tensorexpr_fuser.cpp:329]   %9 : Float(50, 1, strides=[1, 1], requires_grad=0, device=cuda:0) = aten::mul(%12, %8)
[DUMP tensorexpr_fuser.cpp:329]   %6 : Float(50, 28, strides=[28, 1], requires_grad=0, device=cuda:0) = aten::add(%3, %9, %11)
[DUMP tensorexpr_fuser.cpp:329]   %2 : Float(50, 28, strides=[28, 1], requires_grad=0, device=cuda:0) = aten::div(%6, %1)
[DUMP tensorexpr_fuser.cpp:329]   return (%2, %6, %9)
[DEBUG tensorexpr_fuser.cpp:347] Adding a mapping for %3 %162 : int[] = aten::size(%27)
[DEBUG tensorexpr_fuser.cpp:347] Adding a mapping for %10 %163 : int[] = aten::size(%23)
[DEBUG tensorexpr_fuser.cpp:402] Building sizes for %12 : Float(50, 1, strides=[1, 1], requires_grad=0, device=cuda:0) = aten::unsqueeze(%10, %11)
[DEBUG tensorexpr_fuser.cpp:405] Getting aten::size for %10
[DEBUG tensorexpr_fuser.cpp:402] Building sizes for %9 : Float(50, 1, strides=[1, 1], requires_grad=0, device=cuda:0) = aten::mul(%12, %8)
[DEBUG tensorexpr_fuser.cpp:405] Getting aten::size for %12
[DEBUG tensorexpr_fuser.cpp:402] Building sizes for %6 : Float(50, 28, strides=[28, 1], requires_grad=0, device=cuda:0) = aten::add(%3, %9, %11)
[DEBUG tensorexpr_fuser.cpp:405] Getting aten::size for %3
[DEBUG tensorexpr_fuser.cpp:405] Getting aten::size for %9
[DEBUG tensorexpr_fuser.cpp:402] Building sizes for %2 : Float(50, 28, strides=[28, 1], requires_grad=0, device=cuda:0) = aten::div(%6, %1)
[DEBUG tensorexpr_fuser.cpp:405] Getting aten::size for %6
[DEBUG tensorexpr_fuser.cpp:907] Inserting a typecheck guard for a node%156 : Float(50, 28, strides=[28, 1], requires_grad=0, device=cuda:0) = prim::TensorExprGroup[Subgraph=<Graph>](%3, %27, %16, %23)
[DUMP tensorexpr_fuser.cpp:463] After guarding fusion groups:
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49125

Reviewed By: albanD

Differential Revision: D25926997

Pulled By: Krovatkin

fbshipit-source-id: f8041bbfc12be16c329754c6d16911d12aa352ef
2021-01-19 16:17:21 -08:00
4816bf62d6 Fix nvcc function signature causing assert in TypeIndex.h (#49778)
Summary:
Adding NVCC function signature to fully_qualified_type_name_impl()

Fixes https://github.com/pytorch/pytorch/issues/48568

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49778

Reviewed By: albanD

Differential Revision: D25848006

Pulled By: ezyang

fbshipit-source-id: 5afa73ecbb1a3f3b7b68a69b2dcdc27ad38dc44d
2021-01-19 15:25:32 -08:00
a9e46f1413 add type annotations to torch.nn.modules.container (#48969)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48968

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48969

Reviewed By: mrshenli

Differential Revision: D25728987

Pulled By: walterddr

fbshipit-source-id: 02c3aa2078f4ed6cc6edd90ffe1177d789c328a9
2021-01-19 15:12:17 -08:00
a1b1d0cdc0 Better split of the windows test jobs (#50660)
Summary:
See discussion in https://github.com/pytorch/pytorch/pull/50320#discussion_r554447365.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50660

Reviewed By: xuzhao9, samestep

Differential Revision: D25959021

Pulled By: seemethere

fbshipit-source-id: 7623bddc09e7d55208b8a1af4b5a23fba2cdeb14
2021-01-19 15:07:33 -08:00
ebd142e94b initial commit to enable fast_nvcc (#49773)
Summary:
Draft PR to enable fast_nvcc.
* cleaned up some non-standard usages
* added a fallback to wrap_nvcc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49773

Test Plan:
Configuration to enable fast nvcc:
  - install and enable `ccache` but delete `.ccache/` folder before each build.
  - `TORCH_CUDA_ARCH_LIST=6.0;6.1;6.2;7.0;7.5`
  - Toggle the `USE_FAST_NVCC=ON/OFF` cmake config and run `cmake --build` to measure the build time.

Initial statistics for a full compilation:
* `cmake --build . -- -j $(nproc)`:
  - fast NVCC
```
        real    48m55.706s
        user    1559m14.218s
        sys     318m41.138s
```
  - normal NVCC:
```
        real    43m38.723s
        user    1470m28.131s
        sys     90m46.879s
```
* `cmake --build . -- -j $(nproc/4)`:
  - fast NVCC:
```
        real    53m44.173s
        user    1130m18.323s
        sys     71m32.385s
```
  - normal  NVCC:
```
        real    81m53.768s
        user    858m45.402s
        sys     61m15.539s
```
* Conclusion: fast NVCC doesn't provide much gain when the compiler is set to use full CPU utilization; in fact it is **even worse** because of thread switching.

Initial statistics for a partial recompile (editing .cu files):

* `cmake --build . -- -j $(nproc)`
  - fast NVCC:
```
[2021-01-13 18:10:24] [ 86%] Building NVCC (Device) object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/torch_cuda_generated_BinaryMiscOpsKernels.cu.o
[2021-01-13 18:11:08] [ 86%] Linking CXX shared library ../lib/libtorch_cuda.so
```
  - normal NVCC:
```
[2021-01-13 17:35:40] [ 86%] Building NVCC (Device) object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/torch_cuda_generated_BinaryMiscOpsKernels.cu.o
[2021-01-13 17:38:08] [ 86%] Linking CXX shared library ../lib/libtorch_cuda.so
```
* Conclusion: effective compilation time for a single CU file modification is reduced from 2min30sec to only 40sec when compiling multiple architectures. This shows a **4X** speedup using fast NVCC -- approaching the theoretical limit of 5X when compiling 5 gencode architectures at the same time.

Follow-up PRs:
- should have a better fallback mechanism to detect whether a build is supported by fast_nvcc, instead of dry-running and then failing into the fallback.
- performance measurement instrumentation to measure the total compile time vs. the critical-path time of the parallel tasks.
- figure out why `-j $(nproc)` gives significant sys overhead (`sys 318m41.138s` vs `sys 90m46.879s`) over normal nvcc; the guess is context switching, but this is not certain.

Reviewed By: malfet

Differential Revision: D25692758

Pulled By: walterddr

fbshipit-source-id: c244d07b9b71f146e972b6b3682ca792b38c4457
2021-01-19 14:50:54 -08:00
f7b2b22b64 Remove instance of blacklist (#50478)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50478

See task for context

Test Plan: Sandcastle tests

Reviewed By: xush6528

Differential Revision: D25893912

fbshipit-source-id: 761120e4999fddd256bbf855ce49bfd93472b062
2021-01-19 14:42:01 -08:00
0c9fb4aff0 Disable tracer warning for slicing indices. (#50414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50414

If the index supplied from Python is an integral type, it converts everything to int64_t, which is traced correctly.

Test Plan:
new test case

Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D25930773

fbshipit-source-id: a3dfeb49df1394c5c8bea0de46038d2c549a0dc6
2021-01-19 14:15:50 -08:00
3344f06130 [FX] Fix using fx.wrap as a decorator (#50677)
Summary:
`torch.fx.wrap()` could not be used as a decorator, as its docstring claimed, because it returned `None`.
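A minimal sketch of the decorator usage this fix enables (module and function names are made up):
```python
import torch
import torch.fx

@torch.fx.wrap  # previously returned None, which broke this pattern
def add_one(x):
    return x + 1

class M(torch.nn.Module):
    def forward(self, x):
        return add_one(x) * 2

# add_one is recorded as a call_function node instead of being traced through
gm = torch.fx.symbolic_trace(M())
```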

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50677

Test Plan: Added `test_wrapped_via_decorator` which used to fail with `'NoneType' object is not callable` and now passes

Reviewed By: jamesr66a

Differential Revision: D25949313

Pulled By: jansel

fbshipit-source-id: 02d0f9adeed812f58ec94c94dd4adc43578f21ce
2021-01-19 13:42:15 -08:00
05036564cf Remove workaround for TensorPipe failing to get device of CUDA ptr (#50580)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50580

Due to what looked like a bug in CUDA, TensorPipe was sometimes failing to auto-detect the device of a CUDA pointer. A workaround, on the PyTorch side, was to always initialize a CUDA context on device 0. Now that TensorPipe has fixed that, we can undo the workaround.

Reviewed By: mrshenli

Differential Revision: D25952929

fbshipit-source-id: 57a5f73241f7371661855c767e44a64ca3b84a74
2021-01-19 12:18:00 -08:00
5f33f22324 Fix caffe2 import tools.codegen (#50353)
Summary:
Using `insert` instead of `append` to add the torch root directory to `sys.path`, to fix `ModuleNotFoundError: No module named 'tools.codegen'`, as mentioned in https://github.com/pytorch/pytorch/issues/47553.
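A small illustration of why `insert` matters here (the checkout path is hypothetical):
```python
import sys

REPO_ROOT = "/path/to/pytorch"  # hypothetical checkout location
# insert(0, ...) puts the checkout ahead of any installed `tools` package;
# append() would leave an installed copy shadowing it.
sys.path.insert(0, REPO_ROOT)
import tools.codegen  # resolves from the checkout once REPO_ROOT is real
```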

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50353

Reviewed By: mrshenli

Differential Revision: D25893827

Pulled By: ezyang

fbshipit-source-id: 841e28898fee5502495f3890801b49d9b442f9d6
2021-01-19 12:13:15 -08:00
a9deaf3659 Shouldn't need user local install for ROCm build (#50299)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50299

Reviewed By: zhangguanheng66

Differential Revision: D25865423

Pulled By: ezyang

fbshipit-source-id: e2af5f00f99de3c0d38b6b6fedfd9b0027ed9b0b
2021-01-19 12:08:01 -08:00
cad4753115 Update TensorPipe submodule (#50733)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50733

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: beauby

Differential Revision: D25954026

fbshipit-source-id: 44d21768379b301144518aafc9c68147db49d931
2021-01-19 12:05:25 -08:00
4511f2cc9d Clean up complex autograd test list (#50615)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50615

The method tests for some of the ops have been ported to the new OpInfo based tests. This PR removes those op names from `complex_list` in `test_autograd.py`

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D25931268

Pulled By: anjali411

fbshipit-source-id: 4d08626431c61c34cdca18044933e4f5b9b25232
2021-01-19 11:00:13 -08:00
937eff5853 Consolidate mypy tests and args (#50631)
Summary:
This PR helps with https://github.com/pytorch/pytorch/issues/50513 by reducing the complexity of our `mypy` test suite and making it easier to reproduce on the command line. Previously, to reproduce how `mypy` was actually run on tracked source files (ignoring the doctest typechecking) in CI, you technically needed to run 9 different commands with various arguments:
```
$ mypy --cache-dir=.mypy_cache/normal --check-untyped-defs --follow-imports silent
$ mypy --cache-dir=.mypy_cache/examples --follow-imports silent --check-untyped-defs test/type_hint_tests/module_list.py
$ mypy --cache-dir=.mypy_cache/examples --follow-imports silent --check-untyped-defs test/type_hint_tests/namedtuple.py
$ mypy --cache-dir=.mypy_cache/examples --follow-imports silent --check-untyped-defs test/type_hint_tests/opt_size.py
$ mypy --cache-dir=.mypy_cache/examples --follow-imports silent --check-untyped-defs test/type_hint_tests/size.py
$ mypy --cache-dir=.mypy_cache/examples --follow-imports silent --check-untyped-defs test/type_hint_tests/tensor_copy.py
$ mypy --cache-dir=.mypy_cache/examples --follow-imports silent --check-untyped-defs test/type_hint_tests/torch_cuda_random.py
$ mypy --cache-dir=.mypy_cache/examples --follow-imports silent --check-untyped-defs test/type_hint_tests/torch_optim.py
$ mypy --cache-dir=.mypy_cache/strict --config mypy-strict.ini
```
Now you only have to run 2 much simpler commands:
```
$ mypy
$ mypy --config mypy-strict.ini
```
One reason this is useful is because it will make it easier to integrate PyTorch's `mypy` setup into editors (remaining work on this to be done in a followup PR).

Also, as shown in the test plan, this also reduces the time it takes to run `test/test_type_hints.py` incrementally, by reducing the number of times `mypy` is invoked while still checking the same set of files with the same configs.

(Because this PR merges `test_type_hint_examples` (added in https://github.com/pytorch/pytorch/issues/34595) into `test_run_mypy` (added in https://github.com/pytorch/pytorch/issues/36584), I've added some people involved in those PRs as reviewers, in case there's a specific reason they weren't combined in the first place.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50631

Test Plan:
Run this twice (the first time is to warm the cache):
```
$ python test/test_type_hints.py -v
```

- *Before:*
  ```
  test_doc_examples (__main__.TestTypeHints)
  Run documentation examples through mypy. ... ok
  test_run_mypy (__main__.TestTypeHints)
  Runs mypy over all files specified in mypy.ini ... ok
  test_run_mypy_strict (__main__.TestTypeHints)
  Runs mypy over all files specified in mypy-strict.ini ... ok
  test_type_hint_examples (__main__.TestTypeHints)
  Runs mypy over all the test examples present in ... ok

  ----------------------------------------------------------------------
  Ran 4 tests in 5.090s

  OK
  ```
  You can also just run `mypy` to see how many files it checks:
  ```
  $ mypy --cache-dir=.mypy_cache/normal --check-untyped-defs --follow-imports silent
  Success: no issues found in 1192 source files
  ```
- *After:*
  ```
  test_doc_examples (__main__.TestTypeHints)
  Run documentation examples through mypy. ... ok
  test_run_mypy (__main__.TestTypeHints)
  Runs mypy over all files specified in mypy.ini ... ok
  test_run_mypy_strict (__main__.TestTypeHints)
  Runs mypy over all files specified in mypy-strict.ini ... ok

  ----------------------------------------------------------------------
  Ran 3 tests in 2.404s

  OK
  ```
  Now `mypy` checks 7 more files, which is the number in `test/type_hint_tests`:
  ```
  $ mypy
  Success: no issues found in 1199 source files
  ```

Reviewed By: zou3519

Differential Revision: D25932660

Pulled By: samestep

fbshipit-source-id: 26c6f00f338e7b44954e5ed89522ce24e2fdc5f0
2021-01-19 10:05:39 -08:00
1a38fa9930 Striding for lists Part 1 (#48719)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48719

Attempt to break this PR (https://github.com/pytorch/pytorch/pull/33019) into two parts. As per our discussion with eellison, the first part is to make sure our aten::slice operator takes optional parameters for begin/step/end. This will help with refactoring ir_emitter.cpp for generic handling of list and slice striding. Once this PR is merged, we will submit a second PR with the compiler change.

Test Plan:
None for this PR, but new tests will be added for the second part.

Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D25929902

fbshipit-source-id: 5385df04e6d61ded0699b09bbfec6691396b56c3
2021-01-19 09:30:01 -08:00
1154a8594e Add instructional error message for cudnn RNN double backward workaround (#33884)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33884

Mitigates https://github.com/pytorch/pytorch/issues/5261.

It's not possible for us to support cudnn RNN double backwards due to
limitations in the cudnn API. This PR makes it so that we raise an error
message if users try to get the double backward on a cudnn RNN; in the
error message we suggest using the non-cudnn RNN.
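A minimal sketch of the suggested workaround (assumes a CUDA build; disabling cudnn falls back to the native RNN, which does support double backward):
```python
import torch

rnn = torch.nn.LSTM(4, 8).cuda()
x = torch.randn(5, 2, 4, device="cuda", requires_grad=True)

with torch.backends.cudnn.flags(enabled=False):
    out, _ = rnn(x)
    # create_graph=True requests double backward, which cudnn RNNs reject
    (grad,) = torch.autograd.grad(out.sum(), x, create_graph=True)
```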

Test Plan: - added some tests to check the error message

Reviewed By: albanD

Differential Revision: D20143544

Pulled By: zou3519

fbshipit-source-id: c2e49b3d8bdb9b34b561f006150e4c7551a78fac
2021-01-19 09:05:36 -08:00
5d64658ce8 Add complex support for torch.{acosh, asinh, atanh} (#50387)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50387

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D25947496

Pulled By: anjali411

fbshipit-source-id: c70886a73378501421ff94cdc0dc737f1738bf6f
2021-01-19 08:18:22 -08:00
1000403f66 Adding missing decorator for test_device_map_gpu_mixed_self_4 (#50732)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50732

Test Plan: Imported from OSS

Reviewed By: beauby

Differential Revision: D25954041

Pulled By: mrshenli

fbshipit-source-id: b2eeb1a77753cb8696613bfdc7bbc5001ae4c972
2021-01-19 07:53:11 -08:00
f9a5ba7398 Added linalg.slogdet (#49194)
Summary:
This PR adds `torch.linalg.slogdet`.

Changes compared to the original torch.slogdet:

- Complex input now works as in NumPy
- Added out= variant (allocates temporary and makes a copy for now)
- Updated `slogdet_backward` to work with complex input

Ref. https://github.com/pytorch/pytorch/issues/42666
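A short usage sketch, including the complex support added here:
```python
import torch

a = torch.randn(3, 3, dtype=torch.complex128)
sign, logabsdet = torch.linalg.slogdet(a)  # sign is complex for complex input
det = sign * torch.exp(logabsdet)          # reconstructs the determinant
```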

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49194

Reviewed By: VitalyFedyunin

Differential Revision: D25916959

Pulled By: mruberry

fbshipit-source-id: cf9be8c5c044870200dcce38be48cd0d10e61a48
2021-01-19 07:28:12 -08:00
f7a8bfd0a1 Add batched grad testing to gradcheck, turn it on in test_autograd (#50592)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50592

This adds a `check_batched_grad=False` option to gradcheck and gradgradcheck.
It defaults to False because gradcheck is a public API and I don't want
to break any existing non-pytorch users of gradcheck.
This:
- runs grad twice with two grad outputs, a & b
- runs a vmapped grad with torch.stack([a, b])
- compares the results of the above against each other.

Furthermore:
- `check_batched_grad=True` is set to be the default for
gradcheck/gradgradcheck inside of test_autograd.py. This is done by
reassigning to the gradcheck object inside test_autograd
- I manually added `check_batched_grad=False` to gradcheck instances
that don't support batched grad.
- I added a denylist for operations that don't support batched grad.
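A minimal usage sketch of the new flag (the op is chosen arbitrarily):
```python
import torch
from torch.autograd import gradcheck

x = torch.randn(3, dtype=torch.double, requires_grad=True)
# check_batched_grad defaults to False so existing gradcheck users are
# unaffected; opting in additionally compares a vmapped grad against
# two stacked regular grads.
assert gradcheck(torch.sin, (x,), check_batched_grad=True)
```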

Question:
- Should we have a testing only gradcheck (e.g.,
torch.testing.gradcheck) that has different defaults from our public
API, torch.autograd.gradcheck?

Future:
- The future plan for this is to repeat the above for test_nn.py (the
autogenerated test will require a denylist)
- Finally, we can repeat the above for all pytorch test files that use
gradcheck.

Test Plan: - run tests

Reviewed By: albanD

Differential Revision: D25925942

Pulled By: zou3519

fbshipit-source-id: 4803c389953469d0bacb285774c895009059522f
2021-01-19 06:48:28 -08:00
316f0b89c3 [testing] Port torch.{repeat, tile} tests to use OpInfo machinery (#50199)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50013

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50199

Reviewed By: ngimel

Differential Revision: D25949791

Pulled By: mruberry

fbshipit-source-id: 10eaf2d749fac8c08847f50461e72ad1c75c61e3
2021-01-19 06:02:27 -08:00
5f13cc861c Automated submodule update: tensorpipe (#50684)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: eabfe52867

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50684

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D25944553

fbshipit-source-id: e2bbcc48472cd79df89d87a0e61dcffa783c659d
2021-01-19 04:53:45 -08:00
c458558334 kill multinomial_alias_setup/draw (#50489)
Summary:
As per title. Partially fixes https://github.com/pytorch/pytorch/issues/49421.
These functions appear to be dead code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50489

Reviewed By: mruberry

Differential Revision: D25948912

Pulled By: ngimel

fbshipit-source-id: 108723bd4c76cbc3535eba902d6f74597bfdfa58
2021-01-19 00:23:58 -08:00
5252e9857a [pytorch] clean up unused util srcs under tools/autograd (#50611)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50611

Removed the unused old-style code to prevent it from being used.
Added all autograd/gen_pyi sources to mypy-strict.ini config.

Confirmed byte-for-byte compatible with the old codegen:
```
Run it before and after this PR:
  .jenkins/pytorch/codegen-test.sh <baseline_output_dir>
  .jenkins/pytorch/codegen-test.sh <test_output_dir>

Then run diff to compare the generated files:
  diff -Naur <baseline_output_dir> <test_output_dir>
```

Confirmed clean mypy-strict run:
```
mypy --config mypy-strict.ini
```

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25929730

Pulled By: ljk53

fbshipit-source-id: 1fc94436fd4a6b9b368ee0736e99bfb3c01d38ef
2021-01-18 23:54:02 -08:00
b75cdceb44 [package] Properly demangle all accesses of __name__ in importer.py (#50711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50711

As title, missed a few of these.

Test Plan: Imported from OSS

Reviewed By: yf225

Differential Revision: D25949363

Pulled By: suo

fbshipit-source-id: 197743fe7097d2ac894421a99c072696c3b8cd70
2021-01-18 23:43:46 -08:00
d5e5c5455a [ROCm] re-enable test_sparse.py tests (#50557)
Summary:
Signed-off-by: Kyle Chen <kylechen@amd.com>

cc: jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50557

Reviewed By: mruberry

Differential Revision: D25941432

Pulled By: ngimel

fbshipit-source-id: 534fc8a91a48fa8b3b397e63423cd8347b41bbe2
2021-01-18 23:36:39 -08:00
e9b369c25f Add SELU Activation to calculate_gain (#50664)
Summary:
Fixes [#24991](https://github.com/pytorch/pytorch/issues/24991)

I used a value of 0.75 as suggested in the forums by Thomas: https://discuss.pytorch.org/t/calculate-gain-tanh/20854/6

I verified that the value keeps the gradient stable for a 100-layer network.

Code to reproduce (from [jpeg729](https://discuss.pytorch.org/t/calculate-gain-tanh/20854/4)):
```python
import torch
import torch.nn.functional as F
import sys

a = torch.randn(1000,1000, requires_grad=True)
b = a
print (f"in: {a.std().item():.4f}")
for i in range(100):
    l = torch.nn.Linear(1000,1000, bias=False)
    torch.nn.init.xavier_normal_(l.weight, torch.nn.init.calculate_gain("selu"))
    b = getattr(F, 'selu')(l(b))
    if i % 10 == 0:
        print (f"out: {b.std().item():.4f}", end=" ")
        a.grad = None
        b.sum().backward(retain_graph=True)
        print (f"grad: {a.grad.abs().mean().item():.4f}")
```
Output:
```
in: 1.0008
out: 0.7968 grad: 0.6509
out: 0.3127 grad: 0.2760
out: 0.2404 grad: 0.2337
out: 0.2062 grad: 0.2039
out: 0.2056 grad: 0.1795
out: 0.2044 grad: 0.1977
out: 0.2005 grad: 0.2045
out: 0.2042 grad: 0.2273
out: 0.1944 grad: 0.2034
out: 0.2085 grad: 0.2464
```
I included the necessary documentation change, and it passes the _test_calculate_gain_nonlinear_ unittest.
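A one-line usage sketch of the new gain value:
```python
import torch

w = torch.empty(100, 100)
gain = torch.nn.init.calculate_gain("selu")  # 0.75 after this change
torch.nn.init.xavier_normal_(w, gain=gain)
```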

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50664

Reviewed By: mruberry

Differential Revision: D25942217

Pulled By: ngimel

fbshipit-source-id: 29ff1be25713484fa7c516df71b12fdaecfb9af8
2021-01-18 23:01:18 -08:00
ce30dba36f Enable TensorPipe CUDA fallback channel (#50675)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50675

Test Plan: Imported from OSS

Reviewed By: beauby

Differential Revision: D25941963

Pulled By: mrshenli

fbshipit-source-id: 205786d7366f36d659a3a3374081a458cfcb4dd1
2021-01-18 19:38:40 -08:00
94d9a7e8ac Enable TensorPipe CUDA sending to self (#50674)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50674

Test Plan: Imported from OSS

Reviewed By: beauby

Differential Revision: D25941964

Pulled By: mrshenli

fbshipit-source-id: b53454efdce01f7c06f67dfb890d3c3bdc2c648f
2021-01-18 19:35:40 -08:00
8b501dfd98 Fix memory leak in TensorPipeAgent. (#50564)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50564

When an RPC was sent, the associated future was stored in two maps:
pendingResponseMessage_ and timeoutMap_. Once the response was received, the
entry was only removed from pendingResponseMessage_ and not timeoutMap_. The
pollTimedoutRpcs method then eventually removed the entry from timeoutMap_
after the timeout duration had passed.

However, in scenarios with a large timeout and a large number of RPCs in use,
it is very easy for timeoutMap_ to grow without bound. This was discovered in
https://github.com/pytorch/pytorch/issues/50522.

To fix this issue, I've added some code to clean up timeoutMap_ as well once we
receive a response.
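A language-agnostic sketch of the bookkeeping change, written in Python for brevity (names mirror the summary; the real structures are C++ maps inside TensorPipeAgent, and the keying is simplified here):
```python
def on_response(message_id, pending_response_message, timeout_map):
    # Previously only pendingResponseMessage_ was cleaned up here; the
    # timeoutMap_ entry lingered until pollTimedoutRpcs swept it, which
    # let the map grow unboundedly under large timeouts.
    future = pending_response_message.pop(message_id, None)
    timeout_map.pop(message_id, None)
    return future
```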
ghstack-source-id: 119925182

Test Plan:
1) Unit test added.
2) Tested with repro in https://github.com/pytorch/pytorch/issues/50522

#Closes: https://github.com/pytorch/pytorch/issues/50522

Reviewed By: mrshenli

Differential Revision: D25919650

fbshipit-source-id: a0a42647e706d598fce2ca2c92963e540b9d9dbb
2021-01-18 16:34:28 -08:00
f32b10e564 [BE] Fix the broken test caffe2/caffe2/python:lazy_dyndep_test - test_allcompare (#50696)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50696

Set no deadline for test_allcompare.

Test Plan: buck test mode/dev //caffe2/caffe2/python:lazy_dyndep_test -- --exact 'caffe2/caffe2/python:lazy_dyndep_test - test_allcompare (caffe2.caffe2.python.lazy_dyndep_test.TestLazyDynDepAllCompare)' --run-disabled

Reviewed By: hl475

Differential Revision: D25947800

fbshipit-source-id: d2043f97128e257ef06ebca9b68262bb1c0c5e6b
2021-01-18 16:21:06 -08:00
d140ca8b69 Optimize implementation of torch.pow (#46830)
Summary:
- Related with https://github.com/pytorch/pytorch/issues/44937
- Use `resize_output` instead of `resize_as`
- Tuning the `native_functions.yaml`, move the inplace variant `pow_` next to the other `pow` entries

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46830

Reviewed By: mrshenli

Differential Revision: D24567702

Pulled By: anjali411

fbshipit-source-id: a352422c9d4e356574dbfdf21fb57f7ca7c6075d
2021-01-18 14:19:35 -08:00
227acc2e51 Complex autograd support for torch.{baddbmm, addbmm, addmm, addmv} (#50632)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50632

I'll port the following method tests in follow-up PRs:
`'baddbmm', 'addbmm', 'addmv', 'addr'`
After the tests are ported to OpInfo based tests, it would also be much easier to add tests with complex alpha and beta values.
Edit- it seems like it's hard to port the broadcasting variant tests because one ends up skipping `test_inplace_grad` and `test_variant_consistency_eager` even for the case when inputs are not required to be broadcasted.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D25947471

Pulled By: anjali411

fbshipit-source-id: 9faa7f1fd55a1269bad282adac2b39d19bfa4591
2021-01-18 14:05:02 -08:00
7f3a407225 Multi label margin loss (#50007)
Summary:
Reopen PR for https://github.com/pytorch/pytorch/pull/46975

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50007

Reviewed By: mruberry

Differential Revision: D25850808

Pulled By: ngimel

fbshipit-source-id: a232e02949182b7d3799448d24ad54a9e0bcf95c
2021-01-18 01:48:05 -08:00
eae1b40400 Introduced operator variant to OpInfo (#50370)
Summary:
Introduced operator variant to OpInfo

Context: Split of https://github.com/pytorch/pytorch/issues/49158

cc mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50370

Reviewed By: mrshenli

Differential Revision: D25897821

Pulled By: mruberry

fbshipit-source-id: 4387ea10607dbd7209842b685f1794bcb31f434e
2021-01-18 00:05:01 -08:00
3f052ba07b Remove unnecessary dtype checks for complex types & disable complex dispatch for CPU min/max pointwise ops (#50465)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50064

**PROBLEM DESCRIPTION:**
1. dtype checks for complex types had not been removed in the previous PR (https://github.com/pytorch/pytorch/issues/50347) for this issue.
These type-checks were added in https://github.com/pytorch/pytorch/issues/36377, but are no longer necessary,
as we now rely upon dispatch macros to produce error messages.
2. dtype checks in `clamp_max()` and `clamp_min()` for complex inputs had not been removed either.
3. For min/max pointwise ops in TensorCompareKernel.cpp, complex dispatch had not been removed for min/max functions.

### **FIX DESCRIPTION:**
**FIX SUMMARY:**
1. Removed dtype checks added in https://github.com/pytorch/pytorch/issues/36377, and added 3 more in TensorCompare.cpp.
2. Removed dtype checks for complex inputs in `clamp_max()` and `clamp_min()`.
3. Disabled complex dispatch for min/max pointwise ops in TensorCompareKernel.cpp.
4. Error messages in the exceptions raised when min/max ops are not implemented are now checked for containing the text _not support_ (which is also contained in _not supported_) or _not implemented_, so that at least one of these phrases appears in the error message, keeping it informative.

**REASON FOR NOT CHANGING DISPATCH FOR CUDA AND CLAMP OPS**:

As for the CUDA min/max operations, their kernels do not seem to be compiled & dispatched for complex types anyway, so no further changes seem to be required. Basically, the dispatch macros currently being used don't have cases for complex types.

For example,

1. the reduce CUDA ops use [AT_DISPATCH_ALL_TYPES_AND2](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/Dispatch.h#L548-L575) in [ReduceMinMaxKernel.cu](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/ReduceMinMaxKernel.cu), and that macro doesn't allow complex types.

2. In [MinMaxElementwiseKernel.cu](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/MaxMinElementwiseKernel.cu), the CUDA pointwise ops use [`AT_DISPATCH_FLOATING_TYPES_AND2`](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/Dispatch.h#L240-L263) for non-integral & non-boolean types, and this macro doesn't have a case for complex types either.

3. [clamp CUDA ops](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/UnaryOpsKernel.cu#L170-L211) use `AT_DISPATCH_ALL_TYPES_AND2`, which doesn't have a case for complex types.

Similarly, [CPU clamp min/max ops](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp#L428-L458) use the `AT_DISPATCH_ALL_TYPES_AND` dispatch macro, which doesn't have a case for complex types.

**REASON FOR ADDING 3 dtype CHECKS:**
There are a few cases in which the methods corresponding to `min_stub()` or `max_stub()` are not called, so dispatch macros don't get invoked, resulting in no exceptions being raised. Hence, `dtype` checks are necessary at 3 places to raise exceptions:

1. 52dcc72999/aten/src/ATen/native/TensorCompare.cpp (L342)
2. 52dcc72999/aten/src/ATen/native/TensorCompare.cpp (L422)
3. 52dcc72999/aten/src/ATen/native/TensorCompare.cpp (L389)

The first dtype check requirement can be verified from the following example Python code based on `test_complex_unsupported()`:
```
import unittest
import torch

class MyTestCase(unittest.TestCase):

   def test_1(self):
      t = torch.tensor((1 + 1j), device='cpu', dtype=torch.complex128)
      with self.assertRaises(Exception):
         torch.max(t, dim=0)

if __name__ == '__main__':
    unittest.main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50465

Reviewed By: mruberry

Differential Revision: D25938106

Pulled By: ngimel

fbshipit-source-id: 95e2df02ba8583fa3ce87d4a2fdcd60b912dda46
2021-01-17 22:00:05 -08:00
1fdc35da2c [BE] Fix the broken test -- caffe2/caffe2/python:hypothesis_test - test_recurrent (#50668)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50668

GPU initialization is sometimes slow.

Test Plan: buck test mode/opt //caffe2/caffe2/python:hypothesis_test -- --exact 'caffe2/caffe2/python:hypothesis_test - test_recurrent (caffe2.caffe2.python.hypothesis_test.TestOperators)' --run-disabled

Reviewed By: hl475

Differential Revision: D25939037

fbshipit-source-id: 832700cf42ece848cda66dd629a06ecda207f086
2021-01-17 21:21:38 -08:00
534c82153e fix bn channels_last contiguity check (#50659)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42588
The contiguity check used to be for the memory format suggested by `grad_output->suggest_memory_format()`, but the invariant guaranteed by derivatives.yaml is `input->suggest_memory_format()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50659

Reviewed By: mruberry

Differential Revision: D25938921

Pulled By: ngimel

fbshipit-source-id: a945bfef6ce3d91b17e7ff96babe89ffd508939a
2021-01-17 21:10:12 -08:00
7e05d07ca7 [distributed_test_c10d]Enable disabled ROCm tests. (#50629)
Summary:
Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50629

Reviewed By: albanD

Differential Revision: D25935005

Pulled By: rohan-varma

fbshipit-source-id: e0969afecac2f319833189a7a8897d78068a2cda
2021-01-16 23:32:30 -08:00
2001f3a2c9 Finished fleshing out the tensor expr bindings in expr.cpp (#50643)
Summary:
Adds the rest of the ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50643

Reviewed By: pbelevich

Differential Revision: D25936346

Pulled By: Chillee

fbshipit-source-id: 4e2a7afbeabde51991c39d187a8c35e766950ffe
2021-01-16 13:37:51 -08:00
a469336292 Fix pytorch-doc build (#50651)
Summary:
Fixes `docstring of torch.distributed.rpc.RRef.remote:14: WARNING: Field list ends without a blank line; unexpected unindent.` by indenting the multiline field list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50651

Reviewed By: SplitInfinity

Differential Revision: D25935839

Pulled By: malfet

fbshipit-source-id: e2613ae75334d01ab57f4b071cb0fddf80c6bd78
2021-01-15 23:39:34 -08:00
da5d4396c5 remove duplicate newlines (#50648)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50648

Reviewed By: malfet

Differential Revision: D25935513

Pulled By: walterddr

fbshipit-source-id: 1a8419b4fdb25368975ac8e72181c2c4b6295278
2021-01-15 22:26:47 -08:00
0ea1abe07b [PyTorch] Add missing Dispatcher.h include in quantized_ops.cpp (#50646)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50646

Master build broke (see https://app.circleci.com/pipelines/github/pytorch/pytorch/260715/workflows/948c9235-8844-4747-b40d-c14ed33f8dbb/jobs/10195595)
ghstack-source-id: 119906225

(Note: this ignores all push blocking failures!)

Test Plan: CI?

Reviewed By: malfet

Differential Revision: D25935300

fbshipit-source-id: 549eba1af24305728a5a0a84cb84142ec4807d95
2021-01-15 19:44:46 -08:00
c99f356051 Stable sort for CPU (#50052)
Summary:
Fixes [https://github.com/pytorch/pytorch/issues/38681](https://github.com/pytorch/pytorch/issues/38681) for the CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50052

Reviewed By: mrshenli

Differential Revision: D25900823

Pulled By: glaringlee

fbshipit-source-id: 1a3fa336037d0aa2344d79f46dcacfd478a353d1
2021-01-15 19:34:27 -08:00
3df5f9c3b2 Revert D25843351: [pytorch][PR] Clarify, make consistent, and test the behavior of logspace when dtype is integral
Test Plan: revert-hammer

Differential Revision:
D25843351 (0ae0fac1bb)

Original commit changeset: 45237574d04c

fbshipit-source-id: fb5343d509b277158b14d1b61e10433793889842
2021-01-15 18:47:37 -08:00
0291f35b37 [FX] Make len traceable and scriptable with wrap (#50184)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50184

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D25819832

Pulled By: jamesr66a

fbshipit-source-id: ab16138ee26ef2f92f3478c56f0db1873fcc5dd0
2021-01-15 17:46:53 -08:00
585ee119cf Updated codecov config settings (#50601)
Summary:
- Do not generate inline comments on PRs
- Increase the number of signals to wait for before generating a comment to 5 (2 for codecov configs, 2 for onnx, and 1 for windows_test1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50601

Reviewed By: albanD

Differential Revision: D25928920

Pulled By: malfet

fbshipit-source-id: 8a4ff70024c948cb65a4bdf31d269080d2cff945
2021-01-15 17:41:24 -08:00
b832604ffb Fix caffe2 for llvm trunk
Summary: Fix build with llvm-trunk. With D25877605 (cb37709bee), we need to explicitly include `llvm/Support/Host.h` in `llvm_jit.cpp`.

Test Plan: `buck build mode/opt-clang -j 56 sigrid/predictor/v2:sigrid_remote_predictor -c cxx.extra_cxxflags="-Wforce-no-error" -c cxx.modules=False -c cxx.use_default_autofdo_profile=False`

Reviewed By: bertmaher

Differential Revision: D25920968

fbshipit-source-id: 4b80d5072907f50d01e8fbef41cda8a89dd66a96
2021-01-15 17:12:39 -08:00
2569dc71e1 Reapply D25859132: [te] Optimize allocation of kernel outputs (#50546)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50546

And fix the ROCm build
ghstack-source-id: 119837166

Test Plan: CI

Reviewed By: ZolotukhinM

Differential Revision: D25912464

fbshipit-source-id: 023e1f6c9fc131815c5a7a31f4860dfe271f7ae1
2021-01-15 17:02:49 -08:00
8e60bf9034 add RequiresGradCheck (#50392)
Summary:
This change improves perf by 3-4% on fastrnns.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50392

Reviewed By: izdeby

Differential Revision: D25891392

Pulled By: Krovatkin

fbshipit-source-id: 44d9b6907d3975742c9d77102fe6a85aab2c08c0
2021-01-15 16:50:42 -08:00
6e3e57095c Add complex support for torch.nn.L1Loss (#49912)
Summary:
Building on top of the work of anjali411 (https://github.com/pytorch/pytorch/issues/46640)

Things added in this PR:
1. Modify backward and double-backward formulas
2. Add complex support for `new module tests` and criterion tests (and add complex tests for L1)
3. Modify some existing tests to support complex
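A minimal usage sketch of the new complex path (shapes and dtype chosen arbitrarily):
```python
import torch

loss = torch.nn.L1Loss()
x = torch.randn(4, dtype=torch.complex64, requires_grad=True)
t = torch.randn(4, dtype=torch.complex64)
out = loss(x, t)  # |x - t| is real-valued, so the loss itself is real
out.backward()    # exercises the complex backward formula added here
```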

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49912

Reviewed By: zhangguanheng66

Differential Revision: D25853036

Pulled By: soulitzer

fbshipit-source-id: df619f1b71c450ab2818eb17804e0c55990aa8ad
2021-01-15 15:53:15 -08:00
d64184ef4c [RPC] Support timeout for RRef proxy functions (#50499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50499

Adds a timeout API to the following functions:
```
rref.rpc_sync()
rref.rpc_async()
rref.remote()
```
so that RPCs initiated by these proxy calls can be appropriately timed out similar to the regular RPC APIs. Timeouts are supported in the following use cases:

1. rpc.remote finishes in time and successfully, but the function run by rref.rpc_async() is slow and times out. A timeout error will be raised.
2. The rref.rpc_async() function is fast, but rpc.remote() is slow or hanging. When rref.rpc_async() is called, it will still time out with the passed-in timeout (and won't block waiting for rpc.remote() to succeed, which is what happens currently). Note that the timeout will occur during the future creation itself (and not the wait), since it calls `rref._get_type`, which blocks. We could consider making this nonblocking by modifying rref._get_type to return a future, although that is likely a larger change.
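A hedged usage sketch of the proxy timeouts (the worker name and the use of `sum` are made up; assumes RPC has already been initialized):
```python
import torch
from torch.distributed import rpc

rref = rpc.remote("worker1", torch.add, args=(torch.ones(2), 1))
value = rref.rpc_sync(timeout=5).sum()  # proxied RPC raises a timeout error after 5s
fut = rref.rpc_async(timeout=5).sum()   # async variant returns a future
result = fut.wait()
```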

Test Plan: Added UT

Reviewed By: wanchaol

Differential Revision: D25897495

fbshipit-source-id: f9ad5b8f75121f50537677056a5ab16cf262847e
2021-01-15 13:23:23 -08:00
ab1ba8f433 [RPC] Support timeout in rref._get_type() (#50498)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50498

This change is mostly needed for the next diff in this stack, where
rref._get_type() is called in the rpc_async/rpc_sync RRef proxy function and
can block indefinitely if there is no timeout. It will also be useful to have a
timeout argument when we publicize this API to keep it consistent with other
RPC APIs.
ghstack-source-id: 119859767

Test Plan: Added UT

Reviewed By: pritamdamania87

Differential Revision: D25897588

fbshipit-source-id: 2e84aaf7e4faecf80005c78ee2ac8710f387503e
2021-01-15 13:18:39 -08:00
c78e7db7ee [PyTorch] Remove unnecessary dispatcher.h include in mobile/interpreter.h (#50316)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50316

It's unused.
ghstack-source-id: 119798799

Test Plan: CI

Reviewed By: iseeyuan

Differential Revision: D25858961

fbshipit-source-id: 0f214f93dcdf99d0c22e6d8032ed7a10604c714a
2021-01-15 13:10:30 -08:00
60a1831e61 [PyTorch] Remove unnecessary dispatcher.h include in op_registration.h (#50315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50315

It's unused.
ghstack-source-id: 119798801

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D25858937

fbshipit-source-id: fe4fdb33c1a443fdd17644c3f7f34c897abf383f
2021-01-15 13:10:28 -08:00
687f6a513a [PyTorch] Remove unnecessary dispatcher.h include in builtin_function.h (#50314)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50314

It's unused.
ghstack-source-id: 119798800

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D25858900

fbshipit-source-id: 16107acb3df0de18ed16d92f1e2c1b0a72e3e43d
2021-01-15 13:05:47 -08:00
0ae0fac1bb Clarify, make consistent, and test the behavior of logspace when dtype is integral (#47647)
Summary:
The torch.logspace documentation doesn't explain how integral dtypes are handled. Add some clarification and tests for when dtype is integral.

The CUDA implementation is also updated to be consistent with the CPU implementation.
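A hedged illustration of the documented behavior (values are computed in floating point and then cast to the integral dtype, truncating; exact semantics per this PR):
```python
import torch

torch.logspace(0, 2, steps=3, dtype=torch.int64)  # tensor([  1,  10, 100])
# Non-integer powers truncate under the cast: 10**0.5 == 3.16... -> 3
torch.logspace(0, 1, steps=3, dtype=torch.int64)  # tensor([ 1,  3, 10])
```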

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47647

Reviewed By: gchanan

Differential Revision: D25843351

Pulled By: walterddr

fbshipit-source-id: 45237574d04c56992c18766667ff1ed71be77ac3
2021-01-15 12:31:20 -08:00
8e7402441d Move irange to c10 (#46414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46414

For loops are often written with mismatched data types which causes silent type and sign coercion in the absence of integer conversion warnings. Getting around this in templated code requires convoluted patterns such as
```
for(auto i=decltype(var){0};i<var;i++)
```
with this diff we can instead write
```
for(const auto i : c10::irange(var))
```
Note that this loop is type-safe and const-safe.

The function introduced here (`c10::irange`) allows for type-safety and const-ness within for loops, which prevents the accidental truncation or modification of integers and other types, improving code safety.

Test Plan:
```
buck test //caffe2/c10:c10_test_0
```

Reviewed By: ngimel

Differential Revision: D24334732

fbshipit-source-id: fec5ebda3643ec5589f7ea3a8e7bbea4432ed771
2021-01-15 11:44:55 -08:00
296e4a0b7f .circleci: Set +u for all conda install commands (#50505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50505

Even with +u set for the conda install, it still seems to fail with an
unbound variable error. Let's try giving it a default value instead.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D25913692

Pulled By: seemethere

fbshipit-source-id: 4b898f56bff25c7523f10b4933ea6cd17a57df80
2021-01-15 11:36:58 -08:00
0d981eea6c add type annotations to torch.nn.modules.conv (#49564)
Summary:
closes gh-49563

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49564

Reviewed By: albanD

Differential Revision: D25917441

Pulled By: walterddr

fbshipit-source-id: 491dc06cfc1bbf694dfd9ccefca4f55488a931b2
2021-01-15 11:16:11 -08:00
00d432a1ed Remove optional for view_fn during View Tracking (#50067)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50067

Fixes #49257

Using `Callgrind` to test the performance.
```python
import torch
import timeit
from torch.utils.benchmark import Timer

timer = Timer("x.view({100, 5, 20});", setup="torch::Tensor x = torch::ones({10, 10, 100});", language="c++", timer=timeit.default_timer)
res = timer.collect_callgrind(number=10)
```
### Nightly
```python
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f7949138c40>
x.view({100, 5, 20});
setup: torch::Tensor x = torch::ones({10, 10, 100});
                           All          Noisy symbols removed
    Instructions:        42310                      42310
    Baseline:                0                          0
10 runs per measurement, 1 thread
Warning: PyTorch was not built with debug symbols.
         Source information may be limited. Rebuild with
         REL_WITH_DEB_INFO=1 for more detailed results.
```
### Current
```python
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f78f271a580>
x.view({100, 5, 20});
setup: torch::Tensor x = torch::ones({10, 10, 100});
                           All          Noisy symbols removed
    Instructions:        42480                      42480
    Baseline:                0                          0
10 runs per measurement, 1 thread
Warning: PyTorch was not built with debug symbols.
         Source information may be limited. Rebuild with
         REL_WITH_DEB_INFO=1 for more detailed results.
```
### Compare
The instruction count is reduced by 170
```python
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.FunctionCounts object at 0x7f7941b7a7c0>
    970  ???:torch::autograd::as_view(at::Tensor const&, at::Tensor const&, bool, bool, std::function<at::Tensor (at::Tensor const&)>, torch::autograd::CreationMeta, bool)
    240  ???:torch::autograd::ViewInfo::~ViewInfo()
    180  ???:torch::autograd::ViewInfo::ViewInfo(at::Tensor, std::function<at::Tensor (at::Tensor const&)>)
    130  ???:torch::autograd::make_variable_differentiable_view(at::Tensor const&, c10::optional<torch::autograd::ViewInfo>, c10::optional<torch::autograd::ViewInfo>, torch::autograd::CreationMeta, bool)
    105  /tmp/benchmark_utils_jit_build_69e2f1710544485588feeca0719a3a57/timer_cpp_4435526292782672407/timer_src.cpp:main
    100  ???:std::function<at::Tensor (at::Tensor const&)>::function(std::function<at::Tensor (at::Tensor const&)> const&)
     70  ???:torch::autograd::DifferentiableViewMeta::~DifferentiableViewMeta()
     70  ???:torch::autograd::DifferentiableViewMeta::DifferentiableViewMeta(c10::TensorImpl*, c10::optional<torch::autograd::ViewInfo>, c10::optional<torch::autograd::ViewInfo>, torch::autograd::CreationMeta)
   -100  ???:c10::optional_base<torch::autograd::ViewInfo>::optional_base(c10::optional_base<torch::autograd::ViewInfo>&&)
   -105  /tmp/benchmark_utils_jit_build_2e75f38b553e42eba00523a86ad9aa05/timer_cpp_3360771523810516633/timer_src.cpp:main
   -120  ???:torch::autograd::ViewInfo::ViewInfo(at::Tensor, c10::optional<std::function<at::Tensor (at::Tensor const&)> >)
   -210  ???:c10::optional_base<std::function<at::Tensor (at::Tensor const&)> >::~optional_base()
   -240  ???:c10::optional_base<torch::autograd::ViewInfo>::~optional_base()
   -920  ???:torch::autograd::as_view(at::Tensor const&, at::Tensor const&, bool, bool, c10::optional<std::function<at::Tensor (at::Tensor const&)> >, torch::autograd::CreationMeta, bool)
```

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D25900495

Pulled By: ejguan

fbshipit-source-id: dedd30e69db6b48601a18ae98d6b28faeae30d90
2021-01-15 08:29:28 -08:00
070a30b265 [BE] add warning message to cmake against env var "-std=c++xx" (#50491)
Summary:
this was discovered when working on https://github.com/pytorch/pytorch/issues/50230.

Environment variables such as CXXFLAGS="-std=c++17" will not work because we use CMAKE_CXX_STANDARD 14.
This warning alerts users when such an environment variable is set.

See: [CMake env var usage](https://cmake.org/cmake/help/latest/manual/cmake-env-variables.7.html#id4) and [CXXFLAGS usage](https://cmake.org/cmake/help/latest/envvar/CXXFLAGS.html) for more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50491

Reviewed By: mrshenli

Differential Revision: D25907851

Pulled By: walterddr

fbshipit-source-id: 5af5eec76f79f9d35456af1f2663cafbc54e7dc8
2021-01-15 07:12:56 -08:00
a9db2f8e7a Revert D24924236: [pytorch][PR] [ONNX] Handle sequence output shape and type inference
Test Plan: revert-hammer

Differential Revision:
D24924236 (adc65e7c8d)

Original commit changeset: 506e70a38cfe

fbshipit-source-id: 78069a33fb3df825af1cb482da06a07f7b26ab48
2021-01-15 05:58:35 -08:00
366b00ab7b [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D25921551

fbshipit-source-id: df0445864751c18eaa240deff6a142dd791d32ff
2021-01-15 04:16:07 -08:00
ffefa44e20 Automated submodule update: tensorpipe (#50572)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: 161500fb09

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50572

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D25920888

fbshipit-source-id: fa73ba50a2d9429ea1e0beaac6edc2fd8d3ce244
2021-01-15 02:12:54 -08:00
d9f71b5868 [WIP][FX] new sections in docs (#50562)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50562

Adding new top-level sections to the docs to be filled out

![image](https://user-images.githubusercontent.com/4685384/104666703-5b778580-5689-11eb-80ab-7df07f816b5b.png)

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D25919592

Pulled By: jamesr66a

fbshipit-source-id: 45f564eb8fddc7a42abb5501e160cca0dd0745c8
2021-01-14 21:34:36 -08:00
6882f9cc1c [FX] Add wrap() docstring to docs and add decorator example (#50555)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50555

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D25917564

Pulled By: jamesr66a

fbshipit-source-id: 20c7c8b1192fa80c6a0bb9e18910791bd7167232
2021-01-14 21:31:51 -08:00
adc65e7c8d [ONNX] Handle sequence output shape and type inference (#46542)
Summary:
Handle sequence output shape and type inference.

This PR fixes the value type of sequence outputs. Prior to this, all sequence-type model outputs were unfolded in exported ONNX models.
This PR also enables shape inference for sequence outputs to represent the dynamic shape of these values.
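
A hedged sketch of the kind of model this affects; the module, file name, and use of scripting are illustrative assumptions, not code from the PR (ONNX sequence ops need opset >= 11, and recent PyTorch accepts a ScriptModule directly):

```python
import torch

class SplitModel(torch.nn.Module):
    def forward(self, x):
        # returns a List[Tensor] whose length depends on the input shape;
        # previously such a sequence output was unfolded into fixed outputs
        return list(torch.split(x, 2))

torch.onnx.export(torch.jit.script(SplitModel()), (torch.randn(6, 4),),
                  "split.onnx", opset_version=11)
```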

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46542

Reviewed By: ezyang

Differential Revision: D24924236

Pulled By: bzinodev

fbshipit-source-id: 506e70a38cfe31069191d7f40fc6375239c6aafe
2021-01-14 21:12:35 -08:00
e9dc8fc162 [TensorExpr] Add python bindings. (#49698)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49698

Reincarnation of #47620 by jamesr66a.

This is just an initial set of things that we're exposing to Python; more
is expected to come in the future. Some things can probably be done better,
but I'm putting this out anyway, since some other people were interested
in using and/or developing this.

Differential Revision: D25668694

Test Plan: Imported from OSS

Reviewed By: bertmaher

Pulled By: ZolotukhinM

fbshipit-source-id: fb0fd1b31e851ef9ab724686b9ac2d172fa4905a
2021-01-14 21:02:47 -08:00
9efe15313a Revert D25563542: Add batched grad testing to gradcheck, turn it on in test_autograd
Test Plan: revert-hammer

Differential Revision:
D25563542 (443412e682)

Original commit changeset: 125dea554abe

fbshipit-source-id: 0564735f977431350b75147ef209e56620dbab64
2021-01-14 19:19:02 -08:00
be51de4047 Minor doc improvement(?) on ArrayRef::slice (#50541)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50541

I found the current phrasing to be confusing

Test Plan: N/A

Reviewed By: ngimel

Differential Revision: D25909205

fbshipit-source-id: 483151d01848ab41d57b3f3b3775ef69f1451dcf
2021-01-14 18:09:34 -08:00
4de9d04f03 [TensorExpr] Hook Fuser Pass to JIT opt-limit utility. (#50518)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50518

This new feature makes it easy to bisect the pass by hard-stopping it
after a given number of hits.

Test Plan: Imported from OSS

Reviewed By: tugsbayasgalan

Differential Revision: D25908597

Pulled By: ZolotukhinM

fbshipit-source-id: 8ee547989078c7b1747a4b02ce6e71027cb3055f
2021-01-14 17:08:50 -08:00
08baffa8aa Drop blacklist from glow (#50480)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50480

Test Plan: Sandcastle tests

Reviewed By: xush6528

Differential Revision: D25893858

fbshipit-source-id: 297440997473c037e8f59a460306569d0a4aa67c
2021-01-14 16:06:34 -08:00
2ceaec704d Fix warnings in TensorShape (#50486)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50486

Compiling currently gives:
```
Jan 13 16:46:39 In file included from ../aten/src/ATen/native/TensorShape.cpp:12:
Jan 13 16:46:39 ../aten/src/ATen/native/Resize.h:37:24: warning: comparison of integers of different signs: 'int64_t' (aka 'long long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:39     if (new_size_bytes > self->storage().nbytes()) {
Jan 13 16:46:39         ~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:39 ../aten/src/ATen/native/TensorShape.cpp:32:24: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int64_t' (aka 'long long') [-Wsign-compare]
Jan 13 16:46:39   for (size_t i = 0; i < shape_tensor.numel(); ++i) {
Jan 13 16:46:39                      ~ ^ ~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:39 ../aten/src/ATen/native/TensorShape.cpp:122:25: warning: comparison of integers of different signs: 'int64_t' (aka 'long long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:39   for (int64_t i = 0; i < tensors.size(); i++) {
Jan 13 16:46:39                       ~ ^ ~~~~~~~~~~~~~~
Jan 13 16:46:39 ../aten/src/ATen/native/TensorShape.cpp:162:21: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:39   for (int i = 0; i < tensors.size(); i++) {
Jan 13 16:46:39                   ~ ^ ~~~~~~~~~~~~~~
Jan 13 16:46:39 ../aten/src/ATen/native/TensorShape.cpp:300:25: warning: comparison of integers of different signs: 'int64_t' (aka 'long long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:39   for (int64_t i = 0; i < s1.size(); ++i) {
Jan 13 16:46:39                       ~ ^ ~~~~~~~~~
Jan 13 16:46:39 ../aten/src/ATen/native/TensorShape.cpp:807:21: warning: comparison of integers of different signs: 'int64_t' (aka 'long long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:39     TORCH_CHECK(dim < self_sizes.size());
Jan 13 16:46:39                 ~~~ ^ ~~~~~~~~~~~~~~~~~
Jan 13 16:46:39 ../c10/util/Exception.h:361:31: note: expanded from macro 'TORCH_CHECK'
Jan 13 16:46:39   if (C10_UNLIKELY_OR_CONST(!(cond))) {                                 \
Jan 13 16:46:39                               ^~~~
Jan 13 16:46:39 ../c10/util/Exception.h:244:47: note: expanded from macro 'C10_UNLIKELY_OR_CONST'
Jan 13 16:46:39 #define C10_UNLIKELY_OR_CONST(e) C10_UNLIKELY(e)
Jan 13 16:46:39                                               ^
Jan 13 16:46:39 ../c10/macros/Macros.h:173:65: note: expanded from macro 'C10_UNLIKELY'
Jan 13 16:46:39 #define C10_UNLIKELY(expr)  (__builtin_expect(static_cast<bool>(expr), 0))
Jan 13 16:46:39                                                                 ^~~~
Jan 13 16:46:39 ../aten/src/ATen/native/TensorShape.cpp:855:24: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'const int64_t' (aka 'const long long') [-Wsign-compare]
Jan 13 16:46:39   for (size_t i = 0; i < num_blocks; ++i) {
Jan 13 16:46:39                      ~ ^ ~~~~~~~~~~
Jan 13 16:46:39 ../aten/src/ATen/native/TensorShape.cpp:2055:23: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:39     for (int i = 0; i < vec.size(); i++) {
Jan 13 16:46:39                     ~ ^ ~~~~~~~~~~
Jan 13 16:46:39 ../aten/src/ATen/native/TensorShape.cpp:2100:25: warning: comparison of integers of different signs: 'int64_t' (aka 'long long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:39   for (int64_t i = 0; i < src.size(); ++i) {
```
This fixes issues with loop iteration variable types.

Test Plan: Sandcastle tests

Reviewed By: xush6528

Differential Revision: D25901799

fbshipit-source-id: c68d9ab93ab0142b5057ce4ca9e75c620a1425f0
2021-01-14 15:24:46 -08:00
1908f56b3a Fix warnings in "ForeachOpsKernels" (#50482)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50482

Compiling currently shows:
```
Jan 13 16:46:28 In file included from ../aten/src/ATen/native/ForeachOpsKernels.cpp:2:
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachUtils.h:28:21: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28   for (int i = 0; i < tensors1.size(); i++) {
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachUtils.h:44:21: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28   for (int i = 0; i < tensors1.size(); i++) {
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachUtils.h:149:25: warning: comparison of integers of different signs: 'int64_t' (aka 'long long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28   for (int64_t i = 0; i < tensors1.size(); i++) {
Jan 13 16:46:28                       ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachUtils.h:164:25: warning: comparison of integers of different signs: 'int64_t' (aka 'long long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28   for (int64_t i = 0; i < tensors1.size(); i++) {
Jan 13 16:46:28                       ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachUtils.h:183:25: warning: comparison of integers of different signs: 'int64_t' (aka 'long long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28   for (int64_t i = 0; i < tensors1.size(); i++) {
Jan 13 16:46:28                       ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachUtils.h:198:25: warning: comparison of integers of different signs: 'int64_t' (aka 'long long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28   for (int64_t i = 0; i < tensors1.size(); i++) {
Jan 13 16:46:28                       ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:150:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_LIST_ALPHA(add);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:74:21: note: expanded from macro 'FOREACH_BINARY_OP_LIST_ALPHA'
Jan 13 16:46:28   for (int i = 0; i < tensors1.size(); i++) {                                                                           \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:150:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_LIST_ALPHA(add);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:84:21: note: expanded from macro 'FOREACH_BINARY_OP_LIST_ALPHA'
Jan 13 16:46:28   for (int i = 0; i < tensors1.size(); i++) {                                                                           \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:151:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_LIST_ALPHA(sub);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:74:21: note: expanded from macro 'FOREACH_BINARY_OP_LIST_ALPHA'
Jan 13 16:46:28   for (int i = 0; i < tensors1.size(); i++) {                                                                           \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:151:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_LIST_ALPHA(sub);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:84:21: note: expanded from macro 'FOREACH_BINARY_OP_LIST_ALPHA'
Jan 13 16:46:28   for (int i = 0; i < tensors1.size(); i++) {                                                                           \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:158:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_SCALARLIST(add);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:31:21: note: expanded from macro 'FOREACH_BINARY_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < tensors.size(); i++) {                                                                            \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:158:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_SCALARLIST(add);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:40:21: note: expanded from macro 'FOREACH_BINARY_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < tensors.size(); i++) {                                                                            \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:159:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_SCALARLIST(sub);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:31:21: note: expanded from macro 'FOREACH_BINARY_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < tensors.size(); i++) {                                                                            \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:159:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_SCALARLIST(sub);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:40:21: note: expanded from macro 'FOREACH_BINARY_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < tensors.size(); i++) {                                                                            \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:160:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_SCALARLIST(mul);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:31:21: note: expanded from macro 'FOREACH_BINARY_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < tensors.size(); i++) {                                                                            \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:160:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_SCALARLIST(mul);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:40:21: note: expanded from macro 'FOREACH_BINARY_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < tensors.size(); i++) {                                                                            \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:161:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_SCALARLIST(div);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:31:21: note: expanded from macro 'FOREACH_BINARY_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < tensors.size(); i++) {                                                                            \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:161:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_SCALARLIST(div);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:40:21: note: expanded from macro 'FOREACH_BINARY_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < tensors.size(); i++) {                                                                            \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:163:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_LIST(mul);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:53:21: note: expanded from macro 'FOREACH_BINARY_OP_LIST'
Jan 13 16:46:28   for (int i = 0; i < tensors1.size(); i++) {                                                             \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:163:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_LIST(mul);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:63:21: note: expanded from macro 'FOREACH_BINARY_OP_LIST'
Jan 13 16:46:28   for (int i = 0; i < tensors1.size(); i++) {                                                             \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:164:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_LIST(div);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:53:21: note: expanded from macro 'FOREACH_BINARY_OP_LIST'
Jan 13 16:46:28   for (int i = 0; i < tensors1.size(); i++) {                                                             \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:164:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_BINARY_OP_LIST(div);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:63:21: note: expanded from macro 'FOREACH_BINARY_OP_LIST'
Jan 13 16:46:28   for (int i = 0; i < tensors1.size(); i++) {                                                             \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:195:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_POINTWISE_OP_SCALAR(addcdiv);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:115:21: note: expanded from macro 'FOREACH_POINTWISE_OP_SCALAR'
Jan 13 16:46:28   for (int i = 0; i < input.size(); i++) {                                                                                           \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:195:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_POINTWISE_OP_SCALAR(addcdiv);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:125:21: note: expanded from macro 'FOREACH_POINTWISE_OP_SCALAR'
Jan 13 16:46:28   for (int i = 0; i < input.size(); i++) {                                                                                           \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:196:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_POINTWISE_OP_SCALAR(addcmul);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:115:21: note: expanded from macro 'FOREACH_POINTWISE_OP_SCALAR'
Jan 13 16:46:28   for (int i = 0; i < input.size(); i++) {                                                                                           \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:196:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_POINTWISE_OP_SCALAR(addcmul);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:125:21: note: expanded from macro 'FOREACH_POINTWISE_OP_SCALAR'
Jan 13 16:46:28   for (int i = 0; i < input.size(); i++) {                                                                                           \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:198:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_POINTWISE_OP_SCALARLIST(addcdiv);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:135:21: note: expanded from macro 'FOREACH_POINTWISE_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < input.size(); i++) {                                                                                                              \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:198:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_POINTWISE_OP_SCALARLIST(addcdiv);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:145:21: note: expanded from macro 'FOREACH_POINTWISE_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < input.size(); i++) {                                                                                                              \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:199:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_POINTWISE_OP_SCALARLIST(addcmul);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:135:21: note: expanded from macro 'FOREACH_POINTWISE_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < input.size(); i++) {                                                                                                              \
Jan 13 16:46:28                   ~ ^ ~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:199:1: warning: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
Jan 13 16:46:28 FOREACH_POINTWISE_OP_SCALARLIST(addcmul);
Jan 13 16:46:28 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jan 13 16:46:28 ../aten/src/ATen/native/ForeachOpsKernels.cpp:145:21: note: expanded from macro 'FOREACH_POINTWISE_OP_SCALARLIST'
Jan 13 16:46:28   for (int i = 0; i < input.size(); i++) {
```
This diff fixes that.

Test Plan: Sandcastle tests

Reviewed By: xush6528

Differential Revision: D25901744

fbshipit-source-id: 2cb665358a103d85e07c690d73b3f4a557d4c135
2021-01-14 15:21:39 -08:00
171f265d80 Back out "Revert D25717510: Clean up some type annotations in benchmarks/fastrnns" (#50556)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50556

Original commit changeset: 2bcc19cd4340

Test Plan: Soft revert hammer

Reviewed By: walterddr, seemethere

Differential Revision: D25917129

fbshipit-source-id: e5caad77655789d607b84eee820aa7c960e00f51
2021-01-14 15:15:03 -08:00
51157e802f Use separate mypy caches for TestTypeHints cases (#50539)
Summary:
Addresses one of the speed points in https://github.com/pytorch/pytorch/issues/50513 by making the `TestTypeHints` suite much faster when run incrementally. Also fixes an issue (at least on 5834438090a1b3206347e30968e48f44251a53a1) where running that suite repeatedly results in a failure every other run (see the test plan below).
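
A hedged sketch of the approach: give each mypy configuration its own cache directory (`--cache-dir` is a standard mypy flag; the config file names match the repo's, but the cache paths are illustrative):

```python
import subprocess

# Separate caches keep one config's incremental state from
# invalidating the other's on every alternate run.
subprocess.run(["mypy", "--config-file", "mypy.ini",
                "--cache-dir", ".mypy_cache/normal"], check=False)
subprocess.run(["mypy", "--config-file", "mypy-strict.ini",
                "--cache-dir", ".mypy_cache/strict"], check=False)
```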

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50539

Test Plan:
First clear your [`mypy` cache](https://mypy.readthedocs.io/en/stable/command_line.html#incremental-mode):
```
$ rm -r .mypy_cache
```
Then run this twice:
```
$ python test/test_type_hints.py
```

- *Before:*
  ```
  ....
  ----------------------------------------------------------------------
  Ran 4 tests in 212.340s

  OK
  ```
  ```
  .F..
  ======================================================================
  FAIL: test_run_mypy (__main__.TestTypeHints)
  Runs mypy over all files specified in mypy.ini
  ----------------------------------------------------------------------
  Traceback (most recent call last):
    File "test/test_type_hints.py", line 214, in test_run_mypy
      self.fail(f"mypy failed: {stdout} {stderr}")
  AssertionError: mypy failed: torch/quantization/fx/quantize.py:138: error: "Tensor" not callable  [operator]
  Found 1 error in 1 file (checked 1189 source files)

  ----------------------------------------------------------------------
  Ran 4 tests in 199.331s

  FAILED (failures=1)
  ```
- *After:*
  ```
  ....
  ----------------------------------------------------------------------
  Ran 4 tests in 212.815s

  OK
  ```
  ```
  ....
  ----------------------------------------------------------------------
  Ran 4 tests in 5.491s

  OK
  ```

Reviewed By: xuzhao9

Differential Revision: D25912363

Pulled By: samestep

fbshipit-source-id: dac38c890399193699c57b6c9fa8df06a88aee5d
2021-01-14 14:44:31 -08:00
468c99fba4 Reapply D25856891: [te] Benchmark comparing fused overhead to unfused (#50543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50543

Original commit changeset: 2d2f07f79986

Was part of a stack that got reverted.  This is just a benchmark.
ghstack-source-id: 119825594

Test Plan: CI

Reviewed By: navahgar

Differential Revision: D25912439

fbshipit-source-id: 5d9ca45810fff8931a3cfbd03965e11050180676
2021-01-14 14:17:45 -08:00
30e45bb133 Enable GPU-to-GPU comm in TensorPipeAgent (#44418)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44418

This commit uses TensorPipe's cuda_ipc channel to conduct
cross-process same-machine GPU-to-GPU communication. On the sender
side, `TensorPipeAgent` grabs a stream to each device used by the
message, let these streams wait for current streams, and passes
the streams to TensorPipe `CudaBuffer`. On the receiver side, it
also grabs a stream for each device used in the message, and uses
these streams to receive tensors and run user functions. After that,
these streams are then used for sending the response back to the
sender. When receiving the response, the sender will grab a new set
of streams and use them for TensorPipe's `CudaBuffer`.

If device maps are provided, `TensorPipeAgent::send` will return a
derived class of `CUDAFuture`, which is specifically tailored for
RPC Messages.
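
A hedged sketch of opting in from user code; the worker names and device indices are illustrative, and `set_device_map` is assumed to be the configuration entry point:

```python
import torch.distributed.rpc as rpc

opts = rpc.TensorPipeRpcBackendOptions()
# the caller's cuda:0 is delivered to the callee's cuda:1
opts.set_device_map("worker1", {0: 1})
rpc.init_rpc("worker0", rank=0, world_size=2, rpc_backend_options=opts)
```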

TODOs:
1. Enable sending CUDA RPC to the same process.
2. Add a custom CUDA stream pool.
3. When TensorPipe addresses the error for `cudaPointerGetAttributes()`,
remove the `cuda:0` context initialization code in `backend_registry.py`.
4. When TensorPipe can detect availability of peer access, enable all
tests on platforms without peer access.

Differential Revision: D23626207

Test Plan: Imported from OSS

Reviewed By: lw

Pulled By: mrshenli

fbshipit-source-id: d30e89e8a98bc44b8d237807b84e78475c2763f0
2021-01-14 13:55:41 -08:00
554a1a70c7 [quant] update embedding module to not store qweight (#50418)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50418

Previously we were storing the quantized weight as a module attribute, which
resulted in the weight getting stored as part of the model.
We don't need this, since we already store the unpacked weights as part of the model.
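
A hedged sketch of how the before/after listings below can be reproduced; the module construction and file name are illustrative, not taken from the diff:

```python
import zipfile

import torch

emb = torch.nn.quantized.Embedding(num_embeddings=10, embedding_dim=12)
torch.jit.save(torch.jit.script(emb), "tmp.pt")
# After this change the archive should no longer carry the packed qweight.
print(zipfile.ZipFile("tmp.pt").namelist())
```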

Test Plan:
Before
```
Archive:  tmp.pt
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
     586  Stored      586   0% 00-00-1980 00:00 5fefdda0  tmp/extra/producer_info.json
 1588700  Stored  1588700   0% 00-00-1980 00:00 04e0da4c  tmp/data/0
   63548  Stored    63548   0% 00-00-1980 00:00 0ceb1f45  tmp/data/1
   63548  Stored    63548   0% 00-00-1980 00:00 517bc3ab  tmp/data/2
 1588700  Stored  1588700   0% 00-00-1980 00:00 dbe88c73  tmp/data/3
   63548  Stored    63548   0% 00-00-1980 00:00 d8dc47c4  tmp/data/4
   63548  Stored    63548   0% 00-00-1980 00:00 b9e0c20f  tmp/data/5
    1071  Stored     1071   0% 00-00-1980 00:00 10dc9350  tmp/data.pkl
     327  Defl:N      203  38% 00-00-1980 00:00 dfddb661  tmp/code/__torch__/___torch_mangle_0.py
     185  Stored      185   0% 00-00-1980 00:00 308f580b  tmp/code/__torch__/___torch_mangle_0.py.debug_pkl
    1730  Defl:N      515  70% 00-00-1980 00:00 aa11f799  tmp/code/__torch__/torch/nn/quantized/modules/embedding_ops.py
    1468  Defl:N      636  57% 00-00-1980 00:00 779609a6  tmp/code/__torch__/torch/nn/quantized/modules/embedding_ops.py.debug_pkl
       0  Stored        0   0% 00-00-1980 00:00 00000000  tmp/code/__torch__/torch/classes/quantized.py
       6  Stored        6   0% 00-00-1980 00:00 816d0907  tmp/code/__torch__/torch/classes/quantized.py.debug_pkl
       4  Stored        4   0% 00-00-1980 00:00 57092f6d  tmp/constants.pkl
       2  Stored        2   0% 00-00-1980 00:00 55679ed1  tmp/version
--------          -------  ---                            -------
 3436971          3434800   0%                            16 files
```
After
```
Archive:  tmp.pt
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
 1588700  Stored  1588700   0% 00-00-1980 00:00 a4da6981  tmp/data/0
   63548  Stored    63548   0% 00-00-1980 00:00 74d9b607  tmp/data/1
   63548  Stored    63548   0% 00-00-1980 00:00 e346a0c2  tmp/data/2
     952  Stored      952   0% 00-00-1980 00:00 eff8706e  tmp/data.pkl
     375  Defl:N      227  40% 00-00-1980 00:00 96c77b68  tmp/code/__torch__/quantization/test_quantize/___torch_mangle_23.py
     228  Defl:N      162  29% 00-00-1980 00:00 6a378113  tmp/code/__torch__/quantization/test_quantize/___torch_mangle_23.py.debug_pkl
    1711  Defl:N      509  70% 00-00-1980 00:00 66d8fd61  tmp/code/__torch__/torch/nn/quantized/modules/embedding_ops.py
    1473  Defl:N      634  57% 00-00-1980 00:00 beb2323b  tmp/code/__torch__/torch/nn/quantized/modules/embedding_ops.py.debug_pkl
       0  Stored        0   0% 00-00-1980 00:00 00000000  tmp/code/__torch__/torch/classes/quantized.py
       6  Stored        6   0% 00-00-1980 00:00 816d0907  tmp/code/__torch__/torch/classes/quantized.py.debug_pkl
       4  Stored        4   0% 00-00-1980 00:00 57092f6d  tmp/constants.pkl
       2  Stored        2   0% 00-00-1980 00:00 55679ed1  tmp/version
--------          -------  ---                            -------
 1720547          1718292   0%                            12 files
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25879879

fbshipit-source-id: e09427a60d4c44dd1a190575e75f3ed9cde6358f
2021-01-14 10:38:06 -08:00
3dcf126c31 Validate args in HalfCauchy and HalfNormal (#50492)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50404
Complementary to https://github.com/pytorch/pytorch/issues/50403

This also fixes `HalfCauchy.cdf()`, `HalfNormal.log_prob()`, `HalfNormal.cdf()` and ensures validation is not done twice.
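
A hedged sketch of the validated behavior after this fix; the concrete values are illustrative:

```python
import torch
from torch.distributions import HalfCauchy

d = HalfCauchy(torch.tensor(1.0), validate_args=True)
try:
    d.log_prob(torch.tensor(-1.0))  # outside the support
except ValueError as e:
    print("rejected:", e)
```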

cc feynmanliang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50492

Reviewed By: mrshenli

Differential Revision: D25909541

Pulled By: neerajprad

fbshipit-source-id: 35859633bf5c4fd20995182c599cbcaeb863cf29
2021-01-14 10:16:56 -08:00
7fb935806d enable CPU tests back (#50490)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50490

Right now CPU tests are skipped because the check 'torch.cuda.device_count() < int(self.world_size)' always fails;
re-enable CPU tests by checking the device count only when CUDA is available.
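
A minimal sketch of the guard, written as a hypothetical helper (the real check lives in the distributed test fixture):

```python
import unittest

import torch

def require_gpus(world_size: int) -> None:
    # Only consult device_count() when CUDA is available, so CPU-only
    # machines no longer skip the test unconditionally.
    if torch.cuda.is_available() and torch.cuda.device_count() < world_size:
        raise unittest.SkipTest("not enough GPUs for this test")
```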

Test Plan: unit tests, CPU tests are not skipped with this diff

Reviewed By: rohan-varma

Differential Revision: D25901980

fbshipit-source-id: e6e8afe217604c5f5b3784096509240703813d94
2021-01-14 10:13:55 -08:00
1ea39094a8 Link to mypy wiki page from CONTRIBUTING.md (#50540)
Summary:
Addresses one of the documentation points in https://github.com/pytorch/pytorch/issues/50513 by making it easier to find our `mypy` wiki page. Also updates the `CONTRIBUTING.md` table of contents and removes some trailing whitespace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50540

Reviewed By: janeyx99

Differential Revision: D25912366

Pulled By: samestep

fbshipit-source-id: b305f974700a9d9ebedc0c2cb75c92e72d84882a
2021-01-14 10:05:48 -08:00
e05882d2a4 Back out "reuse consant from jit" (#50521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50521

Original commit changeset: 9731ec1e0c1d

Test Plan:
- run `arc focus2 -b pp-ios //xplat/arfx/tracking/segmentation:segmentationApple -a ModelRunner --force-with-bad-commit `
- build via Xcode, run it on an iOS device
- Click "Person Segmentation"
- Crash observed without the diff patched, and the segmentation image is able to be loaded with this diff patched

Reviewed By: husthyc

Differential Revision: D25908493

fbshipit-source-id: eef072a8a3434b932cfd0646ee78159f72be5536
2021-01-14 09:50:40 -08:00
0be1a24b48 Drop unused imports from caffe2/quantization (#50493)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50493

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49974

From
```
./python/libcst/libcst codemod remove_unused_imports.RemoveUnusedImportsWithGlean --no-format caffe2/
```

Test Plan: Sandcastle Tests

Reviewed By: xush6528

Differential Revision: D25902417

fbshipit-source-id: aeebafce2c4fb649cdce5cf4fd4c5b3ee19923c0
2021-01-14 09:15:19 -08:00
ef6be0ec50 Revert D25903846: [pytorch][PR] Structured kernel definition for upsample_nearest2d
Test Plan: revert-hammer

Differential Revision:
D25903846 (19a8e68d8c)

Original commit changeset: 0059fda9b7d8

fbshipit-source-id: b4a7948088c0329a3605c32b64ed77e060e63fca
2021-01-14 08:44:48 -08:00
443412e682 Add batched grad testing to gradcheck, turn it on in test_autograd (#49120)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49120

This adds a `check_batched_grad=False` option to gradcheck and gradgradcheck.
It defaults to False because gradcheck is a public API and I don't want
to break any existing non-pytorch users of gradcheck.
This:
- runs grad twice with two grad outputs, a & b
- runs a vmapped grad with torch.stack([a, b])
- compares the results of the above against each other.
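
A minimal sketch of turning the new check on for a single op; the use of double precision follows standard gradcheck practice:

```python
import torch
from torch.autograd import gradcheck

x = torch.randn(3, dtype=torch.double, requires_grad=True)
# Raises (or returns False) if the vmapped grad disagrees with the
# results of running grad separately on each grad output.
assert gradcheck(torch.sin, (x,), check_batched_grad=True)
```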

Furthermore:
- `check_batched_grad=True` is set to be the default for
gradcheck/gradgradcheck inside of test_autograd.py. This is done by
reassigning to the gradcheck object inside test_autograd
- I manually added `check_batched_grad=False` to gradcheck instances
that don't support batched grad.
- I added a denylist for operations that don't support batched grad.

Question:
- Should we have a testing only gradcheck (e.g.,
torch.testing.gradcheck) that has different defaults from our public
API, torch.autograd.gradcheck?

Future:
- The future plan for this is to repeat the above for test_nn.py (the
autogenerated test will require a denylist)
- Finally, we can repeat the above for all pytorch test files that use
gradcheck.

Test Plan: - run tests

Reviewed By: albanD

Differential Revision: D25563542

Pulled By: zou3519

fbshipit-source-id: 125dea554abefcef0cb7b487d5400cd50b77c52c
2021-01-14 08:13:23 -08:00
0abe7f5ef6 [BE] fix subprocess wrapped test cases reported as failure (#50515)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49901.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50515

Reviewed By: janeyx99

Differential Revision: D25907836

Pulled By: walterddr

fbshipit-source-id: f6f3aa4c1222bf866077275d28ba637eeaef10c5
2021-01-14 08:05:40 -08:00
d2c3733ca1 Reorder torch.distributed.rpc.init_rpc docstring arguments (#50419)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50419

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D25911561

Pulled By: pbelevich

fbshipit-source-id: 62c9a5c3f5ec5eddcbd149821ebdf484ff392158
2021-01-14 07:58:09 -08:00
2639f1d4a6 Revert D25717510: Clean up some type annotations in benchmarks/fastrnns
Test Plan: revert-hammer

Differential Revision:
D25717510 (7d0eecc666)

Original commit changeset: 4f6431d140e3

fbshipit-source-id: 2bcc19cd434047f3857e0d7e804d34f72e566c30
2021-01-14 07:23:45 -08:00
934805bc49 cleaned up ModuleAttributeError (#50298)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49726
Just cleaned up the unnecessary `ModuleAttributeError`.

BC-breaking note:
`ModuleAttributeError` was added in the previous unsuccessful [PR](https://github.com/pytorch/pytorch/pull/49879) and removed here. If a user catches `ModuleAttributeError` specifically, this will no longer work. They should catch `AttributeError` instead.
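
A short sketch of the migration described in the BC-breaking note:

```python
import torch.nn as nn

m = nn.Linear(2, 2)
try:
    _ = m.nonexistent_attribute
except AttributeError:  # formerly nn.modules.module.ModuleAttributeError
    print("missing attributes now surface as plain AttributeError")
```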

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50298

Reviewed By: mrshenli

Differential Revision: D25907620

Pulled By: jbschlosser

fbshipit-source-id: cdfa6b1ea76ff080cd243287c10a9d749a3f3d0a
2021-01-14 06:58:01 -08:00
4ee631cdf0 Revert D25856891: [te] Benchmark comparing fused overhead to unfused
Test Plan: revert-hammer

Differential Revision:
D25856891 (36ae3feb22)

Original commit changeset: 0e99515ec2e7

fbshipit-source-id: 2d2f07f79986ca7815b9eae63e734db76bdfc0c8
2021-01-14 04:33:35 -08:00
269193f5f5 Revert D25859132: [te] Optimize allocation of kernel outputs
Test Plan: revert-hammer

Differential Revision:
D25859132 (62f676f543)

Original commit changeset: 8753289339e3

fbshipit-source-id: 580069c7fa7565643d3204f3740e64ac94c4db39
2021-01-14 04:28:29 -08:00
19a8e68d8c Structured kernel definition for upsample_nearest2d (#50189)
Summary:
See the structured kernel definition [RFC](https://github.com/pytorch/rfcs/pull/9) for context.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50189

Reviewed By: mrshenli

Differential Revision: D25903846

Pulled By: soulitzer

fbshipit-source-id: 0059fda9b7d86f596ca35d830562dd4b859293a0
2021-01-13 22:48:23 -08:00
fc9f013cea HalfCauchy should ValueError if _validate_args (#50403)
Summary:
**Expected**: When I run `torch.distributions.HalfCauchy(torch.tensor(1.0), validate_args=True).log_prob(-1)`, I expect a `ValueError` because that is the behavior of other distributions (e.g. Beta, Bernoulli).

**Actual**: No run-time error is thrown, but a `-inf` log prob is returned.

Fixes https://github.com/pytorch/pytorch/issues/50404

 ---
This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/pytorch/pytorch/50403)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50403

Reviewed By: mrshenli

Differential Revision: D25907131

Pulled By: neerajprad

fbshipit-source-id: ceb63537e5850809c8b32cf9db0c99043f381edf
2021-01-13 22:07:49 -08:00
52ea372fcb [tools] Update clang-format linux hash (#50520)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50520

**Summary**
The new version of `clang-format` for linux64 that was uploaded to S3
earlier this week was dynamically linked to fbcode's custom platform.
A new binary has been uploaded that statically links against `libgcc`
and `libstdc++`, which seems to have fixed this issue. Ideally, all
libraries would be statically linked.

**Test Plan**
`clang-format` workflow passes on this PR and output shows that it
successfully downloaded, verified and ran.

```
Created directory /home/runner/work/pytorch/pytorch/.clang-format-bin for clang-format binary
Downloading clang-format to /home/runner/work/pytorch/pytorch/.clang-format-bin

Reference Hash: 9073602de1c4e1748f2feea5a0782417b20e3043
Actual Hash: 9073602de1c4e1748f2feea5a0782417b20e3043
Using clang-format located at /home/runner/work/pytorch/pytorch/.clang-format-bin/clang-format
no modified files to format
```

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D25908868

Pulled By: SplitInfinity

fbshipit-source-id: 5667fc5546e5ed0bbf9f36570935d245eb26629b
2021-01-13 20:50:56 -08:00
5ea9584400 Assemble technical overview of FX (#50291)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50291

Test Plan: Imported from OSS

Reviewed By: pbelevich, SplitInfinity

Differential Revision: D25908444

Pulled By: ansley

fbshipit-source-id: 9860143a0b6aacbed3207228183829c18d10bfdb
2021-01-13 19:31:58 -08:00
a3f9cf9497 Fix fastrnn benchmark regression introduced by 49946 (#50517)
Summary:
Simply add missing `from typing import List, Tuple` and `from torch import Tensor`

Fixes regression introduced by https://github.com/pytorch/pytorch/pull/49946

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50517

Reviewed By: gchanan

Differential Revision: D25908379

Pulled By: malfet

fbshipit-source-id: a44b96681b6121e61b69f960f81c0cad3f2a8d20
2021-01-13 19:10:11 -08:00
0b49778666 [package] mangle imported module names (#50049)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50049

Rationale and implementation are immortalized in a big comment in
`torch/package/mangling.md`.

This change also allows imported modules to be TorchScripted

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D25758625

Pulled By: suo

fbshipit-source-id: 77a99dd2024c76716cfa6e59c3855ed590efda8b
2021-01-13 16:32:36 -08:00
4a0d17ba2d [PyTorch][codemod] Replace immediately-dereferenced expect calls w/expectRef (#50228)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50228

`fastmod -m 'expect(<((at|c10)::)?\w+Type>\(\)\s*)->'
'expectRef${1}.'`
Presuming it builds, this is a safe change: the result of `expect()`
wasn't being saved anywhere, so we didn't need it and can take a
reference instead of a new `shared_ptr`.
ghstack-source-id: 119782961

Test Plan: CI

Reviewed By: SplitInfinity

Differential Revision: D25837374

fbshipit-source-id: 86757b70b1520e3dbaa141001e7976400cdd3b08
2021-01-13 16:13:55 -08:00
c6cb632c63 [PyTorch] Make SROpFunctor a raw function pointer (#50395)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50395

There's no need for these to be `std::function`.
ghstack-source-id: 119684828

Test Plan: CI

Reviewed By: hlu1

Differential Revision: D25874187

fbshipit-source-id: e9fa3fbc0dca1219ed13904ca704670ce24f7cc3
2021-01-13 15:51:14 -08:00
50256710a0 [PyTorch] Make TensorImpl::empty_tensor_restride non-virtual (#50301)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50301

I'm not sure why this is virtual. We don't seem to override it anywhere, and GitHub code search doesn't turn up anything either.
ghstack-source-id: 119622058

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D25856434

fbshipit-source-id: a95a8d738b109b34f2aadf8db5d4b733d679344f
2021-01-13 15:44:21 -08:00
9ebea77299 [PyTorch] Reapply D25687465: Devirtualize TensorImpl::dim() with macro (#50290)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50290

This was reverted because it landed after D24772023 (b73c018598), which
changed the implementation of `dim()`,  without rebasing on top of it,
and thus broke the build.
ghstack-source-id: 119608505

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D25852810

fbshipit-source-id: 9735a095d539a3a6dc530b7b3bb758d4872d05a8
2021-01-13 15:15:32 -08:00
21542b43a8 [FX] Update docstring code/graph printout (#50396)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50396

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D25874253

Pulled By: jamesr66a

fbshipit-source-id: 6217eadbcbe823db14df25070eef411e184c2273
2021-01-13 15:08:20 -08:00
08b6b78c51 [FX] Make FX stability warning reference beta (#50394)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50394

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D25874188

Pulled By: jamesr66a

fbshipit-source-id: 4fc4e72fec1f3fab770d870fe78cd4ad0f1d6888
2021-01-13 15:06:39 -08:00
aeefe2ce31 [ONNX] ONNX dev branch merge 01-06-2021 (#50163)
Summary:
[ONNX] ONNX dev branch merge 01-06-2021
- [ONNX] Support onnx if/loop sequence output in opset 13 - (https://github.com/pytorch/pytorch/issues/49270)
- Symbolic function for torch.square (https://github.com/pytorch/pytorch/issues/49446)
- [ONNX] Add checks in ONNXSetDynamicInputShape (https://github.com/pytorch/pytorch/issues/49783) …
- [ONNX] Enable export of aten::__derive_index (https://github.com/pytorch/pytorch/issues/49514) …
- [ONNX] Update symbolic for unfold (https://github.com/pytorch/pytorch/issues/49378) …
- [ONNX] Update the sequence of initializers in the exported graph so that it is the same as the inputs. (https://github.com/pytorch/pytorch/issues/49798)
- [ONNX] Enable opset 13 ops (https://github.com/pytorch/pytorch/issues/49612) …
- [ONNX] Improve error message for supported model input types in ONNX export API. (https://github.com/pytorch/pytorch/issues/50119)
- [ONNX] Add a post-pass for If folding (https://github.com/pytorch/pytorch/issues/49410)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50163

Reviewed By: pbelevich

Differential Revision: D25821059

Pulled By: SplitInfinity

fbshipit-source-id: 9f511a93d9d5812d0ab0a49d61ed0fa5f8066948
2021-01-13 13:51:21 -08:00
30a8ba93b1 Remove a blacklist reference (#50477)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50477

See task for context

Test Plan: Sandcastle+OSS tests

Reviewed By: xush6528

Differential Revision: D25893906

fbshipit-source-id: c9b86d0292aa751597d75e8d1b53f99b99c924b9
2021-01-13 13:39:06 -08:00
7426878981 Exclude test/generated_type_hints_smoketest.py from flake8 (#50497)
Summary:
Similar to https://github.com/pytorch/pytorch/issues/48201, this PR excludes a file that is auto-generated by [`test/test_type_hints.py`](5834438090/test/test_type_hints.py (L109-L111)), which doesn't happen to be run before the Flake8 check is done in CI. Also, because the `exclude` list in `.flake8` has gotten fairly long, this PR splits it across multiple lines.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50497

Test Plan:
Run this in your shell:

```sh
python test/test_type_hints.py TestTypeHints.test_doc_examples
flake8
```

- _Before:_ `flake8` prints [these 169 false positives](https://pastebin.com/qPJY24g8) and returns exit code 1
- _After:_ `flake8` prints no output and returns exit code 0

Reviewed By: mrshenli

Differential Revision: D25903177

Pulled By: samestep

fbshipit-source-id: 21f757ac8bfa626bb56ece2ecc55668912b71234
2021-01-13 12:30:19 -08:00
b89827b73f Drop unused imports (#49972)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49972

From
```
./python/libcst/libcst codemod remove_unused_imports.RemoveUnusedImportsWithGlean --no-format caffe2/
```

Test Plan: Standard sandcastle tests

Reviewed By: xush6528

Differential Revision: D25727352

fbshipit-source-id: 6b90717e161aeb1da8df30e67d586101d35d7d5f
2021-01-13 12:26:17 -08:00
62f676f543 [te] Optimize allocation of kernel outputs (#50318)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50318

We can skip the dispatcher and go to the device-specific
`at::native::empty_strided` implementation.

Also, unpacking the TensorOptions struct at kernel launch time actually takes a
bit of work, since the optionals are encoded in a bitfield.  Do this upfront
and use the optionals directly at runtime.
ghstack-source-id: 119735738

Test Plan:
Before:
```
-------------------------------------------------------
Benchmark                Time           CPU Iterations
-------------------------------------------------------
FusedOverhead         2143 ns       2142 ns     332946
UnfusedOverhead       2277 ns       2276 ns     315130
```

After:
```
-------------------------------------------------------
Benchmark                Time           CPU Iterations
-------------------------------------------------------
FusedOverhead         2175 ns       2173 ns     321877
UnfusedOverhead       2394 ns       2394 ns     307360
```

(The noise in the baseline makes this really hard to read; it seemed to be
about 3-5% faster in my local testing.)

Reviewed By: eellison

Differential Revision: D25859132

fbshipit-source-id: 8753289339e365f78c790bee076026cd649b8509
2021-01-13 12:12:43 -08:00
36ae3feb22 [te] Benchmark comparing fused overhead to unfused (#50305)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50305

That's it
ghstack-source-id: 119631533

Test Plan:
```
buck run //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -- --benchmark_filter=Overhead
```
```
Run on (24 X 2394.67 MHz CPU s)
2021-01-08 16:06:17
-------------------------------------------------------
Benchmark                Time           CPU Iterations
-------------------------------------------------------
FusedOverhead         2157 ns       2157 ns     311314
UnfusedOverhead       2443 ns       2443 ns     311221
```

Reviewed By: ZolotukhinM

Differential Revision: D25856891

fbshipit-source-id: 0e99515ec2e769a04929157d46903759c03182a3
2021-01-13 12:09:37 -08:00
48318eba40 Fix TestOpInfoCUDA.test_unsupported_dtypes_addmm_cuda_bfloat16 on ampere (#50440)
Summary:
The `TestOpInfoCUDA.test_unsupported_dtypes_addmm_cuda_bfloat16` test in `test_ops.py` is failing on Ampere. This is because addmm with bfloat16 is supported on Ampere, but the test asserts that it is not supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50440

Reviewed By: mrshenli

Differential Revision: D25893326

Pulled By: ngimel

fbshipit-source-id: afeec25fdd76e7336d84eb53ea36319ade1ab421
2021-01-13 11:25:43 -08:00
d2e96fcf17 Update loss module doc (#48596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48596

Reviewed By: izdeby

Differential Revision: D25889748

Pulled By: zou3519

fbshipit-source-id: 9f6e77ba2af4030c8b9ae4afcea6d002a4dae423
2021-01-13 10:41:20 -08:00
fc5db4265b [BE] replace unittest.main with run_tests (#50451)
Summary:
fix https://github.com/pytorch/pytorch/issues/50448.

This replaces `unittest.main()` with `run_tests()` in all `test/*.py` files. This PR does not address test files in the subdirectories because they seem unrelated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50451

Reviewed By: janeyx99

Differential Revision: D25899924

Pulled By: walterddr

fbshipit-source-id: f7c861f0096624b2791ad6ef6a16b1c4895cce71
2021-01-13 10:33:08 -08:00
a4383a69d4 Clean up some type annotations in caffe2/test (#49943)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49943

Upgrades type annotations from Python2 to Python3

Test Plan: Sandcastle tests

Reviewed By: xush6528

Differential Revision: D25717534

fbshipit-source-id: 5aedea4db07efca126ffb6daee79617c30a67146
2021-01-13 10:01:55 -08:00
7d0eecc666 Clean up some type annotations in benchmarks/fastrnns (#49946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49946

Upgrades type annotations from Python2 to Python3

Test Plan: Sandcastle tests

Reviewed By: xush6528

Differential Revision: D25717510

fbshipit-source-id: 4f6431d140e3032b4ca55587f9602aa0ea38c671
2021-01-13 09:57:14 -08:00
05542f6222 EMA op (#50393)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50393

Exponential Moving Average

Usage:

Add ema_options to the adagrad optimizer. For details, please refer to the test workflow setting.

If ema_end == -1, EMA never ends.
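
For reference, the core update rule of an exponential moving average, as a minimal PyTorch sketch (the decay value here is an illustrative assumption, not the op's exact semantics):

```python
import torch

def ema_update(ema_param: torch.Tensor, param: torch.Tensor, decay: float = 0.999):
    # ema <- decay * ema + (1 - decay) * param
    ema_param.mul_(decay).add_(param, alpha=1.0 - decay)

ema_w = torch.zeros(4)
w = torch.ones(4)
for step in range(5000):
    ema_update(ema_w, w)
print(ema_w)  # approaches w as updates accumulate
```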

Test Plan:
buck test caffe2/caffe2/fb/optimizers:ema_op_optimizer_test

buck test caffe2/caffe2/fb/optimizers:ema_op_test

f240459719

Differential Revision: D25416056

fbshipit-source-id: a25e676a364969e3be2bc47750011c812fc3a62f
2021-01-13 08:58:01 -08:00
4a2d3d1cfd MAINT: char class regex simplify (#50294)
Summary:
* remove some cases of single characters in
character classes -- these can incur the overhead
of a character class with none of the benefits
of a multi-character character class (see the
sketch after this list)

* for more details, see Chapter 6 of:
Friedl, Jeffrey. Mastering Regular Expressions. 3rd ed.,
O’Reilly Media, 2009.
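
A small illustration of the rewrite, using plain Python `re` just to show the equivalence being exploited:

```python
import re

text = "ab cb ab"

# "[a]" is a single-character class: it matches exactly what plain "a"
# matches, but pays the character-class overhead for nothing.
assert re.findall(r"[a]b", text) == re.findall(r"ab", text)

# Multi-character classes are the case that actually earns the syntax.
assert re.findall(r"[ac]b", text) == ["ab", "cb", "ab"]
```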

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50294

Reviewed By: zhangguanheng66

Differential Revision: D25870912

Pulled By: malfet

fbshipit-source-id: 9be5be9ed11fd49876213f0be8121b24739f1c13
2021-01-13 08:48:17 -08:00
664126bab5 Enables build with oneDNN (MKL-DNN) on AArch64 (#50400)
Summary:
Since version 1.6, oneDNN has provided limited support for AArch64 builds.

This minor change is to detect an AArch64 CPU and permit the use of
`USE_MKLDNN` in that case.

Build flags for oneDNN are also modified accordingly.

Note: oneDNN on AArch64, by default, will use oneDNN's reference C++ kernels.
These are not optimised for AArch64, but oneDNN v1.7 onwards provides support
for a limited set of primitives based on the Arm Compute Library.
See: https://github.com/oneapi-src/oneDNN/pull/795
and: https://github.com/oneapi-src/oneDNN/pull/820
for more details. Support for ACL-based oneDNN primitives in PyTorch
will require some further modification.
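
Once built this way, a quick sanity check from Python (using the existing `torch.backends.mkldnn` API) shows whether oneDNN was linked in:

```python
import platform
import torch

print(platform.machine())                    # e.g. 'aarch64'
print(torch.backends.mkldnn.is_available())  # True if the build picked up oneDNN
```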

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50400

Reviewed By: izdeby

Differential Revision: D25886589

Pulled By: malfet

fbshipit-source-id: 2c81277a28ad4528c2d2211381e7c6692d952bc1
2021-01-13 08:41:44 -08:00
deba3bd1d0 Fix TORCH_LIBRARIES variables when do static build (#49458)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/21737

With this fix, the TORCH_LIBRARIES variable can provide all necessary static libraries built from the PyTorch repo.
A user program (if doing a static build) can now just link against ${TORCH_LIBRARIES} + MKL + the CUDA runtime.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49458

Reviewed By: mrshenli

Differential Revision: D25895354

Pulled By: malfet

fbshipit-source-id: 8ff47d14ae1f90036522654d4354256ed5151e5c
2021-01-13 07:56:27 -08:00
2a603145d7 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D25896704

fbshipit-source-id: c6b112db889aaf31996929829e4989f9562964da
2021-01-13 04:22:15 -08:00
4a3a37886c Fix fft slow tests (#50435)
Summary:
The failure is:
```
______________________________________________________________________________________________________ TestCommonCUDA.test_variant_consistency_jit_fft_rfft_cuda_float64 _______________________________________________________________________________________________________
../.local/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py:889: in wrapper
    method(*args, **kwargs)
../.local/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py:889: in wrapper
    method(*args, **kwargs)
../.local/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py:267: in instantiated_test
    if op is not None and op.should_skip(generic_cls.__name__, name,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <torch.testing._internal.common_methods_invocations.SpectralFuncInfo object at 0x7f7375f9b550>, cls_name = 'TestCommon', test_name = 'test_variant_consistency_jit', device_type = 'cuda', dtype = torch.float64

    def should_skip(self, cls_name, test_name, device_type, dtype):
>       for si in self.skips:
E       TypeError: 'NoneType' object is not iterable

../.local/lib/python3.9/site-packages/torch/testing/_internal/common_methods_invocations.py:186: TypeError

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50435

Reviewed By: izdeby

Differential Revision: D25886650

Pulled By: mruberry

fbshipit-source-id: 722a45247dc79be86858306cd1b51b0a63df8b37
2021-01-13 01:31:37 -08:00
057be23168 [doc] Add note about torch.flip returning new tensor and not view. (#50041)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/38271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50041

Reviewed By: izdeby

Differential Revision: D25883870

Pulled By: mruberry

fbshipit-source-id: 33cc28a2176e98f2f29077958782291609c7999b
2021-01-13 01:01:47 -08:00
b54240d200 [PyTorch] Gate tls_local_dispatch_key_set inlining off for Android (#50450)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50450

See comment, seems to break things.
ghstack-source-id: 119753229

Test Plan: CI

Reviewed By: ljk53

Differential Revision: D25892759

fbshipit-source-id: 3b34a384713c77aa28b1ef5807828a08833fd86f
2021-01-12 23:32:12 -08:00
ca5d9617ba Fix remainder type promotion (#48668)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48668

Combine tests for `fmod` and `remainder`.

## BC-breaking Note:
In order to make `remainder` operator have type promotion, we have to introduce BC breaking.
### 1.7.1:
In the case where the second argument is a Python number, the result is cast to the dtype of the first argument.
```python
>>> x = torch.tensor([1, 2, 3, 4, 5], dtype=torch.int32)  # reconstructed example input
>>> torch.remainder(x, 1.2)
tensor([0, 0, 0, 0, 0], dtype=torch.int32)
```
### This PR:
In the case where the second argument is a Python number, the dtype of the result is determined by type promotion of both inputs.
```python
>>> x = torch.tensor([1, 2, 3, 4, 5], dtype=torch.int32)  # same input as above
>>> torch.remainder(x, 1.2)
tensor([1.0000, 0.8000, 0.6000, 0.4000, 0.2000])
```

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D25869136

Pulled By: ejguan

fbshipit-source-id: 8e5e87eec605a15060f715952de140f25644008c
2021-01-12 22:09:30 -08:00
a0f7b18391 Fix fmod type promotion (#48278)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48278

Remove various lines from tests due to no type promotion introduced from #47323

## BC-breaking Note:
In order to make `fmod` operator have type promotion, we have to introduce BC breaking.
### 1.7.1:
In the case where the second argument is a Python number, the result is cast to the dtype of the first argument.
```python
>>> x = torch.tensor([1, 2, 3, 4, 5], dtype=torch.int32)  # reconstructed example input
>>> torch.fmod(x, 1.2)
tensor([0, 0, 0, 0, 0], dtype=torch.int32)
```
### Prior PR:
Check the BC-breaking note of #47323

### This PR:
In the case where the second argument is a Python number, the dtype of the result is determined by type promotion of both inputs.
```python
>>> x = torch.tensor([1, 2, 3, 4, 5], dtype=torch.int32)  # same input as above
>>> torch.fmod(x, 1.2)
tensor([1.0000, 0.8000, 0.6000, 0.4000, 0.2000])
```

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D25869137

Pulled By: ejguan

fbshipit-source-id: bce763926731e095b75daf2e934bff7c03ff0832
2021-01-12 22:04:19 -08:00
dea529a779 Add torch.cuda.can_device_access_peer (#50446)
Summary:
And the underlying torch._C._cuda_canDeviceAccessPeer, which is a wrapper around cudaDeviceCanAccessPeer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50446

Reviewed By: mrshenli

Differential Revision: D25890405

Pulled By: malfet

fbshipit-source-id: ef09405f115bbe73ba301d608d56cd8f8453201b
2021-01-12 20:30:45 -08:00
4e248eb3f6 Change watchdog timeout logging from INFO to ERROR. (#50455)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50455

Certain systems only print logging messages for ERROR/WARN and the
error message that the watchdog is timing out a particular operation is pretty
important.

As a result, changing its level to ERROR instead of INFO.
ghstack-source-id: 119761029

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D25894795

fbshipit-source-id: 259b16c13f6cdf9cb1956602d15784b92aa53f17
2021-01-12 20:15:39 -08:00
4e76616719 [StaticRuntime][ATen] Add out variant for narrow_copy (#49502)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49502

It broke the OSS CI the last time I landed it, mostly cuda tests and python bindings.

Similar to permute_out, add the out variant of `aten::narrow` (slice in c2), which does an actual copy. `aten::narrow` creates a view; however, a copy is incurred when we call `input.contiguous` in the ops that follow `aten::narrow`: `concat_add_mul_replacenan_clip`, `casted_batch_one_hot_lengths`, and `batch_box_cox`.
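
A standalone sketch of why the copy shows up (not the Static Runtime code itself):

```python
import torch

x = torch.rand(4, 10)
v = x.narrow(1, 2, 5)      # aten::narrow: a view, no data copied here
assert v.storage().data_ptr() == x.storage().data_ptr()

c = v.contiguous()         # the view is non-contiguous, so this copies
assert c.storage().data_ptr() != x.storage().data_ptr()
# An out variant of narrow_copy fuses the two steps: it copies straight
# into a preallocated output instead of view-then-contiguous.
```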

Test Plan:
Unit test:

```
buck test //caffe2/aten:math_kernel_test
buck test //caffe2/test:sparse -- test_narrow
```
Benchmark with the adindexer model:
```
bs = 1 is neutral

Before:
I1214 21:32:51.919239 3285258 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0886948. Iters per second: 11274.6
After:
I1214 21:32:52.492352 3285277 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0888019. Iters per second: 11261

bs = 20 shows more gains probably because the tensors are bigger and therefore the cost of copying is higher

Before:
I1214 21:20:19.702445 3227229 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.527563. Iters per second: 1895.51
After:
I1214 21:20:20.370173 3227307 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.508734. Iters per second: 1965.67
```

Reviewed By: ajyu

Differential Revision: D25596290

fbshipit-source-id: da2f5a78a763895f2518c6298778ccc4d569462c
2021-01-12 19:35:32 -08:00
49896c48e0 Caffe2 Concat operator benchmark (#50449)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50449

Port caffe2 operator benchmark from torch.cat to caffe2 concat to measure the difference in performance.

Previous diff abandoned to rerun GitHub CI tests: D25738076

Test Plan:
Tested on devbig by running both pt and c2 benchmarks. Compiled with mode/opt

Inputs:
```
size, number of inputs, cat dimension, device
----------------------------------------------------
(1, 1, 1), N: 2, dim: 0, device: cpu
(512, 512, 2), N: 2, dim: 1, device: cpu
(128, 1024, 2), N: 2, dim: 1, device: cpu
(1024, 1024, 2), N: 2, dim: 0, device: cpu
(1025, 1023, 2), N: 2, dim: 1, device: cpu
(1024, 1024, 2), N: 2, dim: 2, device: cpu
[<function <lambda> at 0x7f922718e8c0>, 111, 65], N: 5, dim: 0, device: cpu
[96, <function <lambda> at 0x7f9226dad710>, 64], N: 5, dim: 1, device: cpu
[128, 64, <function <lambda> at 0x7f91a3625ef0>], N: 5, dim: 2, device: cpu
[<function <lambda> at 0x7f91a3625f80>, 32, 64], N: 50, dim: 0, device: cpu
[32, <function <lambda> at 0x7f91a3621050>, 64], N: 50, dim: 1, device: cpu
[33, 65, <function <lambda> at 0x7f91a36210e0>], N: 50, dim: 2, device: cpu
(64, 32, 4, 16, 32), N: 2, dim: 2, device: cpu
(16, 32, 4, 16, 32), N: 8, dim: 2, device: cpu
(9, 31, 5, 15, 33), N: 17, dim: 4, device: cpu
[<function <lambda> at 0x7f91a3621170>], N: 100, dim: 0, device: cpu
[<function <lambda> at 0x7f91a3621200>], N: 1000, dim: 0, device: cpu
[<function <lambda> at 0x7f91a3621290>], N: 2000, dim: 0, device: cpu
[<function <lambda> at 0x7f91a3621320>], N: 3000, dim: 0, device: cpu
```

```
pytorch: MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/cat_test.par --tag_filter=all
caffe2: MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/c2/concat_test.par --tag_filter=all
```
```
Metric: Forward Execution Time (us)

pytorch             | caffe2
--------------------------------
 4.066              | 0.312
 351.507            | 584.033
 184.649            | 292.157
 9482.895           | 6845.112
 9558.988           | 6847.511
 13730.016          | 14118.505
 6324.371           | 4840.883
 4613.497           | 3702.213
 7504.718           | 7889.751
 9882.978           | 7364.350
 10087.076          | 7483.178
 16849.556          | 18092.295
 19181.075          | 13363.742
 19296.508          | 13466.863
 34157.449          | 56320.073
 176.483            | 267.106
 322.247            | 352.782
 480.064            | 460.214
 607.381            | 476.908
```

Reviewed By: hlu1

Differential Revision: D25890595

fbshipit-source-id: f53e125c0680bc2ebf722d1da5ec964bec585fdd
2021-01-12 18:27:44 -08:00
af968cd672 [Pytorch Mobile] Remove caching (in code) of interned strings (#50390)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50390

Currently, there is a massive switch/case statement that is generated in the `InternedStrings::string()` method to speed up Symbol -> string conversion without taking a lock (mutex). The relative call rate of this on mobile is insignificant, so unlikely to have any material impact on runtime even if the lookups happen under a lock. Plus, parallelism is almost absent on mobile, which is where locks/mutexes cause the most problem (taking a mutex without contention is usually very fast and just adds a memory barrier iirc).

The only impact that caching interned strings has is avoiding taking a lock when interned strings are looked up. They are not looked up very often during training, and based on basic testing, they don't seem to be looked up much during inference either.

During training, the following strings were looked up at test startup:

```
prim::profile
prim::profile_ivalue
prim::profile_optional
prim::FusionGroup
prim::TypeCheck
prim::FallbackGraph
prim::ChunkSizes
prim::ConstantChunk
prim::tolist
prim::FusedConcat
prim::DifferentiableGraph
prim::MMBatchSide
prim::TensorExprGroup
```

Command used to trigger training: `buck test fbsource//xplat/papaya/client/executor/torch/store/transform/feature/test:test`

During inference, the only symbol that was looked up was `tolist`.
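
For intuition, the structure in question is roughly the following (a toy Python sketch of a lock-guarded intern table, not the actual `InternedStrings` code):

```python
import threading

class InternTable:
    def __init__(self):
        self._lock = threading.Lock()
        self._by_symbol = {}

    def register(self, symbol: int, name: str) -> None:
        with self._lock:
            self._by_symbol[symbol] = name

    def string(self, symbol: int) -> str:
        # The removed codegen'd switch/case avoided this lock for known
        # symbols; on mobile the lookup rate is too low for that to matter.
        with self._lock:
            return self._by_symbol[symbol]
```
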
ghstack-source-id: 119679831

Test Plan:
See the summary above + sandcastle tests.

### Size test: fbios

```
D25861786-V1 (https://www.internalfb.com/intern/diff/D25861786/?dest_number=119641372)

fbios: Succeeded
Change in Download Size for arm64 + 3x assets variation: -13.9 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -41.7 KiB

Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:747386759232352@base/bsb:747386759232352@diff/
```

### Size test: igios

```
D25861786-V1 (https://www.internalfb.com/intern/diff/D25861786/?dest_number=119641372)

igios: Succeeded
Change in Download Size for arm64 + 3x assets variation: -16.6 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -42.0 KiB

Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:213166470538954@base/bsb:213166470538954@diff/
```

Reviewed By: iseeyuan

Differential Revision: D25861786

fbshipit-source-id: 34a55d693edc41537300f628877a64723694f8f0
2021-01-12 17:53:18 -08:00
8c25b9701b Type annotations in test/jit (#50293)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50293

Switching to type annotations for improved safety and import tracking.

Test Plan: Sandcastle tests

Reviewed By: xush6528

Differential Revision: D25853949

fbshipit-source-id: fb873587bb521a0a55021ee4d34d1b05ea8f000d
2021-01-12 16:47:06 -08:00
4c97ef8d77 Create subgraph rewriter (#49540)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49540

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D25869707

Pulled By: ansley

fbshipit-source-id: 93d3889f7ae2ecc5e8cdd7f4fb6b0446dbb3cb31
2021-01-12 16:32:13 -08:00
374951d102 Add type annotations to torch.nn.modules.padding (#49494)
Summary:
Closes gh-49492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49494

Reviewed By: mruberry

Differential Revision: D25723837

Pulled By: walterddr

fbshipit-source-id: 92af0100f6d9e2bb25b259f5a7fe9d449ffb6443
2021-01-12 15:34:28 -08:00
cb37709bee [te] Create TargetMachine only once with correct options to fix perf (#50406)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50406

We were creating different TMs in PytorchLLVMJIT and LLVMCodeGen; the
one in LLVMCodeGen had the right target-specific options to generate fast AVX2
code (with FMAs, vbroadcastss, etc.), and that's what was showing up in the
debug output, but the LLVMJIT TM was the one that actually generated runtime
code, and it was slow.
ghstack-source-id: 119700110

Test Plan:
```
buck run mode/opt //caffe2/benchmarks/fb/tensorexpr:tensorexpr_bench
```

With this diff NNC is getting at least somewhat (5%) close to Pytorch with MKL,
for at least this one small-ish test case"

```
Run on (24 X 2394.67 MHz CPU s)
2021-01-11 15:57:27
----------------------------------------------------------------------------------------------------
Benchmark                                             Time           CPU Iterations UserCounters...
----------------------------------------------------------------------------------------------------
Gemm/Torch/128/128/128                            65302 ns      65289 ns      10734 GFLOPS=64.2423G/s
Gemm/TensorExprTile4x16VecUnroll/128/128/128      68602 ns      68599 ns      10256 GFLOPS=61.1421G/s
```

Reviewed By: bwasti

Differential Revision: D25877605

fbshipit-source-id: cd293bac94d025511f348eab5c9b8b16bf6505ec
2021-01-12 15:25:48 -08:00
7d28f1c81d [quant][refactor] Minor refactor of some typos (#50304)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50304

Does not include any functional changes -- purely for fixing minor typos in the `fuser_method_mappings.py`

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25857248

Pulled By: z-a-f

fbshipit-source-id: 3f9b864b18bda8096e7cd52922dc21be64278887
2021-01-12 15:23:13 -08:00
39aac65430 [quant][bug] Fixing the mapping getter to return a copy (#50297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50297

Current implementation has a potential bug: if a user modifies the quantization mappings returned by the getters, the changes will propagate.
For example, the bug will manifest itself if the user does the following:

```
my_mapping = get_default_static_quant_module_mappings()
my_mapping[nn.Linear] = UserLinearImplementation
model_A = convert(model_A, mapping=my_mapping)

default_mapping = get_default_static_quant_module_mappings()
model_B = convert(model_B, mapping=default_mapping)
```

In that case the `model_B` will be quantized with the modified mapping.
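
A minimal sketch of the fix (the dict contents below are hypothetical placeholders for the real mapping):

```python
import copy

_DEFAULT_STATIC_QUANT_MODULE_MAPPINGS = {"nn.Linear": "nnq.Linear"}  # placeholder

def get_default_static_quant_module_mappings():
    # Return a copy so caller-side edits cannot mutate the shared default.
    return copy.deepcopy(_DEFAULT_STATIC_QUANT_MODULE_MAPPINGS)

my_mapping = get_default_static_quant_module_mappings()
my_mapping["nn.Linear"] = "UserLinearImplementation"  # affects only this copy
assert get_default_static_quant_module_mappings()["nn.Linear"] == "nnq.Linear"
```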

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D25855753

Pulled By: z-a-f

fbshipit-source-id: 0149a0c07a965024ba7d1084e89157a9c8fa1192
2021-01-12 15:19:39 -08:00
412e3f46e9 Automated submodule update: tensorpipe (#50441)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: ac98f40758

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50441

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: mrshenli

Differential Revision: D25888666

fbshipit-source-id: fd447f81462f476c62aed0e43830a710f60187e1
2021-01-12 14:17:55 -08:00
50744cd0f7 [package] better error message when unpickling a mocked obj (#50159)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50159

Test Plan: Imported from OSS

Reviewed By: tugsbayasgalan

Differential Revision: D25809551

Pulled By: suo

fbshipit-source-id: 130587e650271cf158f5f5d9e688c622c9006631
2021-01-12 14:11:32 -08:00
6d947067c9 fixing autodiff to support Optional[Tensor] on inputs (#49430)
Summary:
This PR fixes two local issue for me:

1. Assert failure when passing `None` to `Optional[Tensor]` input that requires gradient in autodiff
2. Wrong vjp mapping on inputs when `requires_grad` flag changes on inputs stack.

This PR is to support autodiff on layer_norm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49430

Reviewed By: izdeby

Differential Revision: D25886211

Pulled By: eellison

fbshipit-source-id: 075af35a4a9c0b911838f25146f859897f9a07a7
2021-01-12 14:01:14 -08:00
c198e6c6fa Stop moving scalars to GPU for one computation in leaky_rrelu_backward. (#50115)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50115

There is no way this is performant and we are trying to minimize the usage of scalar_to_tensor(..., device) since it is an anti-pattern, see https://github.com/pytorch/pytorch/issues/49758.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D25790331

Pulled By: gchanan

fbshipit-source-id: 89d6f016dfd76197541b0fd8da4a462876dbf844
2021-01-12 13:44:30 -08:00
cf45d65f1c Clean up some type annotations in test/jit/...../test_class_type.py (#50156)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50156

Upgrades type annotations from Python2 to Python3

Test Plan: Sandcastle tests

Reviewed By: xush6528

Differential Revision: D25720035

fbshipit-source-id: 7e1aec34b21f3c9a3e8db9578258d99ffb87e6d4
2021-01-12 13:28:13 -08:00
725640ed84 Check CUDA kernel launches in caffe2/caffe2/utils/math (#50238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50238

Added `C10_CUDA_KERNEL_LAUNCH_CHECK();` after all kernel launches in caffe2/caffe2/utils/math

Test Plan:
```
buck build //caffe2/caffe2
```

{F356531214}

files in caffe2/caffe2/utils/math no longer show up when running
```
python3 caffe2/torch/testing/check_kernel_launches.py
```

Reviewed By: r-barnes

Differential Revision: D25773299

fbshipit-source-id: 28d67b4b9f57f1fa1e8699e43e9202bad4d42c5f
2021-01-12 13:09:15 -08:00
5cdc32bf1c [vmap] Add batching rules for comparisons ops (#50364)
Summary:
Related to https://github.com/pytorch/pytorch/issues/49562

This PR adds batching rules for the below comparison ops.
- torch.eq
- torch.gt
- torch.ge
- torch.le
- torch.lt
- torch.ne

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50364

Reviewed By: anjali411

Differential Revision: D25885359

Pulled By: zou3519

fbshipit-source-id: 58874f24f8d525d8fac9062186b1c9970618ff55
2021-01-12 13:00:56 -08:00
b2f7ff7d29 Fix MultiheadAttention docstring latex (#50430)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50430

Reviewed By: izdeby

Differential Revision: D25885695

Pulled By: zou3519

fbshipit-source-id: 7b017f9c5cdebbc7254c8193305c54003478c343
2021-01-12 12:45:42 -08:00
a389b30bfc Add Post Freezing Optimizations, turn on by default in torch.jit.freeze (#50222)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50222

This PR adds a pass which runs a set of optimizations to be done after freezing. Currently this encompasses Conv-BN folding, Conv->Add/Sub/Mul/Div folding, and I'm also planning on adding dropout removal.

I would like some feedback on the API. torch.jit.freeze is technically in ~prototype~ phase so we have some leeway around making changes. I think in the majority of cases, the user is going to want to freeze their model and then run inference. I would prefer if the optimization was opt-out instead of opt-in. All internal/framework use cases of freezing use `freeze_module`, not the python API, so this shouldn't break anything.

I have separated out the optimization pass as a separate API to make things potentially modular, even though I suspect that is an unlikely case. In a future PR I would like to add a `torch::jit::freeze` which follows the same API as `torch.jit.freeze`, intended for C++ use, and runs the optimizations.
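
For reference, the user-facing flow under discussion, using the public API (whether the optimizations run by default inside `torch.jit.freeze` is exactly what this PR changes):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.BatchNorm2d(8),
).eval()                                 # freezing requires eval mode

frozen = torch.jit.freeze(torch.jit.script(model))
out = frozen(torch.rand(1, 3, 16, 16))   # Conv-BN can now be folded away
```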

Test Plan: Imported from OSS

Reviewed By: tugsbayasgalan

Differential Revision: D25856264

Pulled By: eellison

fbshipit-source-id: 56be1f12cfc459b4c4421d4dfdedff8b9ac77112
2021-01-12 11:39:13 -08:00
30aeed7c2b Peephole Optimize out conv(x).dim(), which prevents BN fusion (#50221)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50221

Test Plan: Imported from OSS

Reviewed By: tugsbayasgalan

Differential Revision: D25856266

Pulled By: eellison

fbshipit-source-id: ef7054b3d4ebc59a0dd129116d29273be33fe12c
2021-01-12 11:39:09 -08:00
a69f008cb7 [JIT] Factor out peephole to own test file (#50220)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50220

Test Plan: Imported from OSS

Reviewed By: tugsbayasgalan

Differential Revision: D25856263

Pulled By: eellison

fbshipit-source-id: f3d918d860e64e788e0bb9b9cb85125660f834c6
2021-01-12 11:39:06 -08:00
6971149326 [JIT] Add Frozen Conv-> Add/Sub/Mul/Div fusion (#50075)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50075

Adds Conv - Add/Sub/Mul/Div fusion for frozen models. This helps cover models like torchvision maskrcnn, which use a hand-rolled batchnorm implementation: 90645ccd0e/torchvision/ops/misc.py (L45).

I haven't tested results yet, but I would expect a somewhat similar speedup to conv-bn fusion (maybe a little less).
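
The identity behind the Conv->Mul case, as a quick numeric check (a standalone sketch, not the pass itself): scaling a conv's output by a constant is the same as scaling its weight and bias.

```python
import torch

conv = torch.nn.Conv2d(3, 8, 3).eval()
c = 2.0

fused = torch.nn.Conv2d(3, 8, 3).eval()
with torch.no_grad():
    fused.weight.copy_(conv.weight * c)
    fused.bias.copy_(conv.bias * c)

x = torch.rand(1, 3, 16, 16)
assert torch.allclose(conv(x) * c, fused(x), atol=1e-6)
```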

Test Plan: Imported from OSS

Reviewed By: tugsbayasgalan

Differential Revision: D25856265

Pulled By: eellison

fbshipit-source-id: 2c36fb831a841936fe4446ed440185f59110bf68
2021-01-12 11:39:02 -08:00
035229c945 [JIT] Frozen Graph Conv-BN fusion (#50074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50074

Adds Conv-BN fusion for models that have been frozen. I haven't explicitly tested perf yet but it should be equivalent to the results from Chillee's PR [here](https://github.com/pytorch/pytorch/pull/47657) and [here](https://github.com/pytorch/pytorch/pull/47657#issuecomment-725752765). Click on the PR for details, but it's a good speedup.

In a later PR in the stack I plan on turning this optimization on by default as part of `torch.jit.freeze`. I will also in a later PR add a peephole so that conv->batchnorm2d doesn't generate a conditional checking the number of dims.

Zino was working on freezing and left the team, so I'm not really sure who should be reviewing this, but I don't care too much so long as I get a review.
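
The algebra being folded, as a standalone numeric sketch (eval-mode BN with running stats; the actual pass rewrites the frozen graph rather than module objects):

```python
import torch

conv = torch.nn.Conv2d(3, 8, 3).eval()
bn = torch.nn.BatchNorm2d(8).eval()
bn.running_mean.normal_()
bn.running_var.uniform_(0.5, 1.5)

# bn(conv(x)) == conv'(x) with, per output channel:
#   W' = W * gamma / sqrt(var + eps)
#   b' = (b - mean) * gamma / sqrt(var + eps) + beta
scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
fused = torch.nn.Conv2d(3, 8, 3).eval()
with torch.no_grad():
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    fused.bias.copy_((conv.bias - bn.running_mean) * scale + bn.bias)

x = torch.rand(1, 3, 16, 16)
assert torch.allclose(bn(conv(x)), fused(x), atol=1e-5)
```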

Test Plan: Imported from OSS

Reviewed By: tugsbayasgalan

Differential Revision: D25856261

Pulled By: eellison

fbshipit-source-id: da58c4ad97506a09a5c3a15e41aa92bdd7e9a197
2021-01-12 11:37:32 -08:00
b5d3826950 [PyTorch] Devirtualize TensorImpl::sizes() with macro (#50176)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50176

UndefinedTensorImpl was the only type that overrode this, and IIUC we don't need to do it.
ghstack-source-id: 119609531

Test Plan: CI, internal benchmarks

Reviewed By: ezyang

Differential Revision: D25817370

fbshipit-source-id: 985a99dcea2e0daee3ca3fc315445b978f3bf680
2021-01-12 10:33:46 -08:00
158c98ae49 Add new patterns for ConcatAddMulReplaceNaNClip (#50249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50249

Add a few new patterns for `ConcatAddMulReplaceNanClip`

Reviewed By: houseroad

Differential Revision: D25843126

fbshipit-source-id: d4987c716cf085f2198234651a2214591d8aacc0
2021-01-12 10:20:01 -08:00
5834438090 Enable fast pass tensor_fill for single element complex tensors (#50383)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50383

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D25879881

Pulled By: anjali411

fbshipit-source-id: a254cff48ea9a6a38f7ee206815a04c31a9bcab0
2021-01-12 08:40:30 -08:00
6420071b43 Disable complex dispatch on min/max functions (#50347)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50064

**PROBLEM:**
In issue https://github.com/pytorch/pytorch/issues/36377, min/max functions were disabled for complex inputs (via dtype checks).
However, min/max kernels are still being compiled and dispatched for complex.

**FIX:**
The aforementioned dispatch has been disabled, and we now rely on the errors produced
by the dispatch macro to keep those ops from running on complex, instead of doing redundant dtype checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50347

Reviewed By: zhangguanheng66

Differential Revision: D25870385

Pulled By: anjali411

fbshipit-source-id: 921541d421c509b7a945ac75f53718cd44e77df1
2021-01-12 07:55:18 -08:00
4411b5ac57 add type annotations to torch.nn.modules.normalization (#49035)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49035

Test Plan:
Imported from GitHub, without a `Test Plan:` line.
Force rebased to deal with merge conflicts

Reviewed By: zhangguanheng66

Differential Revision: D25767065

Pulled By: walterddr

fbshipit-source-id: ffb904e449f137825824e3f43f3775a55e9b011b
2021-01-12 07:40:15 -08:00
9384d31af5 Added linalg.pinv (#48399)
Summary:
This PR adds `torch.linalg.pinv`.

Changes compared to the original `torch.pinverse`:
 * New kwarg "hermitian": with `hermitian=True` eigendecomposition is used instead of singular value decomposition.
 * `rcond` argument can now be a `Tensor` of appropriate shape to apply matrix-wise clipping of singular values.
 * Added `out=` variant (allocates temporary and makes a copy for now)

Ref. https://github.com/pytorch/pytorch/issues/42666
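
Example usage of the new interface (double precision keeps the Moore-Penrose identity check tight):

```python
import torch

a = torch.randn(3, 5, dtype=torch.float64)
p = torch.linalg.pinv(a)
assert torch.allclose(a @ p @ a, a)        # A A+ A = A

h = torch.randn(4, 4, dtype=torch.float64)
h = h + h.T                                # symmetric, i.e. hermitian for real dtypes
ph = torch.linalg.pinv(h, hermitian=True)  # takes the eigendecomposition path
assert torch.allclose(h @ ph @ h, h)
```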

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48399

Reviewed By: zhangguanheng66

Differential Revision: D25869572

Pulled By: mruberry

fbshipit-source-id: 0f330a91d24ba4e4375f648a448b27594e00dead
2021-01-12 06:52:06 -08:00
314351d0ef Fix Error with torch.flip() for cuda tensors when dims=() (#50325)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49982

The method flip_check_errors, which is called from the CUDA file, had a condition that threw an exception when the dims size was <= 0. This change relaxes that condition to < 0 and adds a separate condition that returns from the method when the size equals zero; the early return is needed because past that point the method performs checks that expect a non-zero-size dims.

Also removed the comment/condition written to point to the issue.

mruberry kshitij12345 please review this once

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50325

Reviewed By: zhangguanheng66

Differential Revision: D25869559

Pulled By: mruberry

fbshipit-source-id: a831df9f602c60cadcf9f886ae001ad08b137481
2021-01-12 05:41:28 -08:00
5546a12fe3 remove redundant tests from tensor_op_tests (#50096)
Summary:
All these unary operators already have an entry in the OpInfo DB.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50096

Reviewed By: zhangguanheng66

Differential Revision: D25870048

Pulled By: mruberry

fbshipit-source-id: b64e06d5b9ab5a03a202cda8c22fdb7e4ae8adf8
2021-01-12 04:53:12 -08:00
53473985b8 test_ops: Only run complex gradcheck when complex is supported (#49018)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49018

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25868683

Pulled By: mruberry

fbshipit-source-id: d8c4d89c11939fc7d81db8190ac6b9b551e4cbf5
2021-01-12 04:48:30 -08:00
d25c673dfc Cleanup unnecessary SpectralFuncInfo logic (#48712)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48712

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25868675

Pulled By: mruberry

fbshipit-source-id: 90b32b27d9a3d79c3754c4a1c0747dbe0f140192
2021-01-12 04:48:27 -08:00
fb73cc4dc4 Migrate some torch.fft tests to use OpInfos (#48428)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48428

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25868666

Pulled By: mruberry

fbshipit-source-id: ca6d0c4e44f4c220675dc264a405d960d4b31771
2021-01-12 04:42:54 -08:00
4da9ceb743 [doc] fix doc formatting for torch.randperm and torch.repeat_interleave (#50254)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50207
Fixes https://github.com/pytorch/pytorch/issues/50208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50254

Reviewed By: zhangguanheng66

Differential Revision: D25865861

Pulled By: mruberry

fbshipit-source-id: 9ae45c443df7cce0d8bfb313f1667ff4d5f6262f
2021-01-12 04:33:59 -08:00
78e71ce627 warn user once for possible unnecessary find_unused_params (#50133)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50133

`find_unused_parameters=True` is only needed when the model has unused parameters that are not known at model definition time or differ due to control flow.

Unfortunately, many DDP users pass this flag in as `True` even when they do not need it, sometimes as a precaution to mitigate possible errors that may be raised (such as the error we raise when not all outputs are used). While this is a larger issue to be fixed in DDP, it would also be useful to warn once if we did not detect unused parameters.

The downside of this is that in the case of flow control models where the first iteration doesn't have unused params but the rest do, this would be a false warning. However, I think the warning's value exceeds this downside.
ghstack-source-id: 119707101

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D25411118

fbshipit-source-id: 9f4a18ad8f45e364eae79b575cb1a9eaea45a86c
2021-01-12 02:55:06 -08:00
8c5b0247a5 Fix PyTorch NEON compilation with gcc-7 (#50389)
Summary:
Apply sebpop's patch to correctly inform the optimizing compiler about the side effects of the missing NEON restrictions.
Allow vec256_float_neon to be used even when compiled by gcc-7.
Fixes https://github.com/pytorch/pytorch/issues/47098

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50389

Reviewed By: walterddr

Differential Revision: D25872875

Pulled By: malfet

fbshipit-source-id: 1fc5dfe68fbdbbb9bfa79ce4be2666257877e85f
2021-01-11 21:51:35 -08:00
c3b4b20627 [PyTorch] List::operator[] can return const ref for Tensor & string (#50083)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50083

This should supercede D21966183 (a371652bc8)
(https://github.com/pytorch/pytorch/pull/39763) and D22830381 (b44a10c179) as the way to get fast
access to the contents of a `torch::List`.
ghstack-source-id: 119675495

Reviewed By: smessmer

Differential Revision: D25776232

fbshipit-source-id: 81b4d649105ac9e08fc2c6563806f883809872f4
2021-01-11 20:27:03 -08:00
4fed585dfa [MacOS] Add unit tests for Metal ops (#50312)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50312

Integrate the operator tests to the MacOS playground app, so that we can run them on Sandcastle
ghstack-source-id: 119693035

Test Plan:
- `buck test pp-macos`
- Sandcastle tests

Reviewed By: AshkanAliabadi

Differential Revision: D25778981

fbshipit-source-id: 8b5770dfddba0ca19f662894757b2dff66df87e6
2021-01-11 20:15:17 -08:00
bee6b0be58 Fix warning when running scripts/build_ios.sh (#49457)
Summary:
* Fixes `cmake implicitly converting 'string' to 'STRING' type`
* Fixes `clang: warning: argument unused during compilation: '-mfpu=neon-fp16' [-Wunused-command-line-argument]`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49457

Reviewed By: zhangguanheng66

Differential Revision: D25871014

Pulled By: malfet

fbshipit-source-id: fa0c181ae7a1b8668e47f5ac6abd27a1c735ffce
2021-01-11 19:31:32 -08:00
72c1d9df75 Minor Fix: Double ";" typo in transformerlayer.h (#50300)
Summary:
Fix double ";" typo in transformerlayer.h

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50300

Reviewed By: zhangguanheng66

Differential Revision: D25857236

Pulled By: glaringlee

fbshipit-source-id: b9b21cfb3ddbff493f6d1c616abe21c5cfb9bce0
2021-01-11 19:25:22 -08:00
09f4844c1f Pytorch Distributed RPC Reinforcement Learning Benchmark (Throughput and Latency) (#46901)
Summary:
A Pytorch Distributed RPC benchmark measuring Agent and Observer Throughput and Latency for Reinforcement Learning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46901

Reviewed By: mrshenli

Differential Revision: D25869514

Pulled By: osandoval-fb

fbshipit-source-id: c3b36b21541d227aafd506eaa8f4e5f10da77c78
2021-01-11 19:02:36 -08:00
2193544024 [GPU] Clean up the operator tests (#50311)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50311

Code clean up
ghstack-source-id: 119693032

Test Plan: Sandcastle

Reviewed By: husthyc

Differential Revision: D25823635

fbshipit-source-id: 5205ebd8a5331c0d1825face034cca10e8b3b535
2021-01-11 18:39:46 -08:00
a72c6fd6e0 [GPU] Fix the broken strides value for 2d transpose (#50310)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50310

Swapping the stride values is OK as long as the output tensor's storage stays non-contiguous. However, when we copy the result back to CPU, we expect to see a contiguous tensor.

```
>>> x = torch.rand(2,3)
>>> x.stride()
(3, 1)
>>> y = x.t()
>>> y.stride()
(1, 3)
>>> z = y.contiguous()
>>> z.stride()
(2, 1)
```
ghstack-source-id: 119692581

Test Plan: Sandcastle CI

Reviewed By: AshkanAliabadi

Differential Revision: D25823665

fbshipit-source-id: 61667c03d1d4dd8692b76444676cc393f808cec8
2021-01-11 18:05:31 -08:00
5f8e1a1da9 add type annotations to torch.nn.modules.module (#49045)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49044

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49045

Reviewed By: malfet

Differential Revision: D25767092

Pulled By: walterddr

fbshipit-source-id: a81ba96f3495943af7bb9ee3e5fc4c94c690c405
2021-01-11 17:01:47 -08:00
f39f258dfd Ensure DDP + Pipe works with find_unused_parameters. (#49908)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49908

As described in https://github.com/pytorch/pytorch/issues/49891, DDP +
Pipe doesn't work with find_unused_parameters.

This PR adds a simple fix to enable this functionality. This only currently
works for Pipe within a single host and needs to be re-worked once we support
cross host Pipe.
ghstack-source-id: 119573413

Test Plan:
1) unit tests added.
2) waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D25719922

fbshipit-source-id: 948bcc758d96f6b3c591182f1ec631830db1b15c
2021-01-11 16:52:37 -08:00
b001c4cc32 Stop using an unnecessary scalar_to_tensor(..., device) call. (#50114)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50114

In this case, the function only dispatches on cpu anyway.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D25790155

Pulled By: gchanan

fbshipit-source-id: 799dc9a3a38328a531ced9e85ad2b4655533e86a
2021-01-11 16:37:04 -08:00
ba83aea5ee [GPU] Calculate strides for metal tensors (#50309)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50309

Previously, in order to unblock the dogfooding, we did some hacks to calculate the strides for the output tensor. Now it's time to fix that.
ghstack-source-id: 119673688

Test Plan:
1. Sandcastle CI
2. Person segmentation results

Reviewed By: AshkanAliabadi

Differential Revision: D25821766

fbshipit-source-id: 8c067f55a232b7f102a64b9035ef54c72ebab4d4
2021-01-11 16:26:17 -08:00
9a3305fdd5 Automated submodule update: tensorpipe (#50369)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: bc5ac93c56

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50369

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: mrshenli

Differential Revision: D25867976

Pulled By: lw

fbshipit-source-id: 5274aa424e3215b200dcb2c02f342270241dd77d
2021-01-11 16:21:02 -08:00
bb97503a26 [fix] Indexing.cu: Move call to C10_CUDA_KERNEL_LAUNCH_CHECK to make it reachable (#49283)
Summary:
Fixes Compiler Warning:
```
aten/src/ATen/native/cuda/Indexing.cu(233): warning: loop is not reachable

aten/src/ATen/native/cuda/Indexing.cu(233): warning: loop is not reachable

aten/src/ATen/native/cuda/Indexing.cu(233): warning: loop is not reachable
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49283

Reviewed By: zhangguanheng66

Differential Revision: D25874613

Pulled By: ngimel

fbshipit-source-id: 6e384e89533c1d80f241b7b98fda239c357d1a2c
2021-01-11 15:33:08 -08:00
d76176cc1f Raise warning during validation when arg_constraints not defined (#50302)
Summary:
After we merged https://github.com/pytorch/pytorch/pull/48743, we noticed that some existing code that subclasses `torch.Distribution` started throwing `NotImplementedError` since the constraints required for validation checks were not implemented.

```sh
File "torch/distributions/distribution.py", line 40, in __init__
  for param, constraint in self.arg_constraints.items():
File "torch/distributions/distribution.py", line 92, in arg_constraints
  raise NotImplementedError
```

This PR throws a UserWarning for such cases instead and gives a better warning message.
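
A minimal sketch of the subclass pattern that satisfies the check, declaring `arg_constraints` explicitly (the distribution itself is a made-up example):

```python
import torch
from torch.distributions import Distribution, constraints

class MyDist(Distribution):
    # Declaring arg_constraints (even an empty dict) lets argument
    # validation run instead of hitting NotImplementedError / the warning.
    arg_constraints = {"rate": constraints.positive}

    def __init__(self, rate: torch.Tensor, validate_args=None):
        self.rate = rate
        super().__init__(batch_shape=rate.shape, validate_args=validate_args)

d = MyDist(torch.tensor([1.0, 2.0]), validate_args=True)
```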

cc. Balandat

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50302

Reviewed By: Balandat, xuzhao9

Differential Revision: D25857315

Pulled By: neerajprad

fbshipit-source-id: 0ff9f81aad97a0a184735b1fe3a5d42025c8bcdf
2021-01-11 15:26:53 -08:00
e160362837 Add range assert in autograd engine queue lookup (#50372)
Summary:
Follow up to  https://github.com/pytorch/pytorch/issues/49652

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50372

Reviewed By: zhangguanheng66

Differential Revision: D25872203

Pulled By: albanD

fbshipit-source-id: 8d6f30f17fba856c5c34c08372767349a250983d
2021-01-11 15:16:35 -08:00
7efc212f1f Add link to tutorial in Timer doc (#50374)
Summary:
Because I have a hard time finding this tutorial every time I need it. So I'm sure other people have the same issue :D

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50374

Reviewed By: zhangguanheng66

Differential Revision: D25872173

Pulled By: albanD

fbshipit-source-id: f34f719606e58487baf03c73dcbd255017601a09
2021-01-11 15:06:00 -08:00
fd0927035e .circleci: Remove CUDA 9.2 binary build jobs (#50388)
Summary:
Now that we support CUDA 11 we can remove support for CUDA 9.2

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50388

Reviewed By: zhangguanheng66

Differential Revision: D25872955

Pulled By: seemethere

fbshipit-source-id: 1c10bcc8f4abbc1af1b3180b4cf4a9ea9c7104f9
2021-01-11 14:16:58 -08:00
a48640af92 [JIT] Update clang-format hashes (#50399)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50399

**Summary**
This commit updates the expected hashes of the `clang-format` binaries
downloaded from S3. These binaries themselves have been updated due to
having been updated inside fbcode.

**Test Plan**
Uploaded new binaries to S3, deleted `.clang-format-bin` and ran
`clang_format_all.py`.

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D25875184

Pulled By: SplitInfinity

fbshipit-source-id: da483735de1b5f1dab7b070f91848ec5741f00b1
2021-01-11 14:13:45 -08:00
4d3c12d37c [JIT] Print better error when class attribute IValue conversion fails (#50255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50255

**Summary**
TorchScript classes are copied attribute-by-attribute from a py::object into
a `jit::Object` in `toIValue`, which is called when copying objects from
Python into TorchScript. However, if an attribute of the class cannot be
converted, the error thrown is a standard pybind error that is hard to
act on.

This commit adds code to `toIValue` to convert each attribute to an
`IValue` inside a try-catch block, throwing a `cast_error` containing
the name of the attribute and the target type if the conversion fails.

**Test Plan**
This commit adds a unit test to `test_class_type.py`
based on the code in the issue that commit fixes.

**Fixes**
This commit fixes #46341.

Test Plan: Imported from OSS

Reviewed By: pbelevich, tugsbayasgalan

Differential Revision: D25854183

Pulled By: SplitInfinity

fbshipit-source-id: 69d6e49cce9144af4236b8639d8010a20b7030c0
2021-01-11 14:04:26 -08:00
080a097935 Add docstring for Proxy (#50145)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50145

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D25854281

Pulled By: ansley

fbshipit-source-id: d7af6fd6747728ef04e86fbcdeb87cb0508e1fd8
2021-01-11 13:47:55 -08:00
3d263d1928 Update op replacement tutorial (#50377)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50377

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D25870409

Pulled By: ansley

fbshipit-source-id: b873b89c2e62b57cd5d816f81361c8ff31be2948
2021-01-11 13:04:38 -08:00
ec51b67282 Fix elu backward operation for negative alpha (#49272)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49272

Test Plan:
```
x = torch.tensor([-2, -1, 0, 1, 2], dtype=torch.float32, requires_grad=True)
y = torch.nn.functional.elu_(x.clone(), alpha=-2)
grads = torch.ones_like(y)
y.backward(grads)
```

```
RuntimeError: In-place elu backward calculation is triggered with a negative slope which is not supported.
This is caused by calling in-place forward function with a negative slope, please call out-of-place
version instead.
```

Reviewed By: albanD

Differential Revision: D25569839

Pulled By: H-Huang

fbshipit-source-id: e3c6c0c2c810261566c10c0cc184fd81b280c650
2021-01-11 12:52:52 -08:00
559e2d8816 Implement optimization bisect (#49031)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49031

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D25691790

Pulled By: tugsbayasgalan

fbshipit-source-id: a9c4ff1142f8a234a4ef5b1045fae842c82c18bf
2021-01-11 12:25:28 -08:00
55ac7e53ae [quant][graphmode][fx] Support preserved_attributes in prepare_fx (#50306)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50306

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D25857747

fbshipit-source-id: fac132fb36ed9cf207aea40429b5bc3f7c72c35d
2021-01-11 12:10:02 -08:00
271240ae29 [JIT] Ensure offset is a multiple of 4 to fix "Philox" RNG in jitted kernels (#50169)
Summary:
Immediately-upstreamable part of https://github.com/pytorch/pytorch/pull/50148.

This PR fixes what I'm fairly sure is a subtle bug with custom `Philox` class usage in jitted kernels.  `Philox` [constructors in kernels](68a6e46379/torch/csrc/jit/codegen/cuda/codegen.cpp (L102)) take the cuda rng generator's current offset.  The Philox constructor then carries out [`offset/4`](74c055b240/torch/csrc/jit/codegen/cuda/runtime/random_numbers.cu (L13)) (a uint64_t division) to compute its internal offset in its virtual Philox bitstream of 128-bit chunks.  In other words, it assumes the incoming offset is a multiple of 4.  But (in current code) that's not guaranteed.  For example, the increments used by [these eager kernels](74c055b240/aten/src/ATen/native/cuda/Distributions.cu (L171-L216)) could easily make offset not divisible by 4.

I figured the easiest fix was to round all incoming increments up to the nearest multiple of 4 in CUDAGeneratorImpl itself.

Another option would be to round the current offset up to the next multiple of 4 at the jit point of use.  But that would be a jit-specific offset jump, so jit rng kernels wouldn't have a prayer of being bitwise accurate with eager rng kernels that used non-multiple-of-4 offsets.  Restricting the offset to multiples of 4 for everyone at least gives jit rng the chance to match eager rng.  (Of course, there are still many other ways the numerics could diverge, like if a jit kernel launches a different number of threads than an eager kernel, or assigns threads to data elements differently.)
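
The rounding itself is the standard round-up-to-a-multiple computation; a sketch of the arithmetic:

```python
def round_up_to_multiple_of_4(increment: int) -> int:
    # Same as ((increment + 3) // 4) * 4; the mask form works because 4 is a power of two.
    return (increment + 3) & ~3

assert [round_up_to_multiple_of_4(n) for n in (1, 2, 3, 4, 5, 8)] == [4, 4, 4, 4, 8, 8]
```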

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50169

Reviewed By: mruberry

Differential Revision: D25857934

Pulled By: ngimel

fbshipit-source-id: 43a75e2d0c8565651b0f12a5694c744fd86ece99
2021-01-11 11:53:48 -08:00
d390e3d8b9 [FX] Make graph target printouts more user-friendly (#50296)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50296

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D25855288

Pulled By: jamesr66a

fbshipit-source-id: dd725980fc492526861c2ec234050fbdb814caa8
2021-01-11 11:45:20 -08:00
a7e92f120c [FX} Implement wrap() by patching module globals during symtrace (#50182)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50182

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D25819730

Pulled By: jamesr66a

fbshipit-source-id: 274f4799ad589887ecf3b94f5c24ecbe1bc14b1b
2021-01-11 11:01:15 -08:00
f10e7aad06 [quant][graphmode][fx] Scope support for call_method in QuantizationTracer (#50173)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50173

Previously we did not set the qconfig for call_method nodes correctly, since doing so requires knowing
the scope (the module path of the module whose forward graph contains the node) of the node. This
PR modifies the QuantizationTracer to record the scope information and build a map from call_method
Node to module path, which will be used when we construct qconfig_map.

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_qconfig_for_call_method

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D25818132

fbshipit-source-id: ee9c5830f324d24d7cf67e5cd2bf1f6e0e46add8
2021-01-11 10:43:58 -08:00
6eb8e83c0b [aten] embedding_bag_byte_rowwise_offsets_out (#49561)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49561

Out variant for embedding_bag_byte_rowwise_offsets

Test Plan:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adindexer/merge/traced_merge_dper_fixes.pt --pt_inputs=/data/users/ansha/tmp/adindexer/merge/container_precomputation_bs1.pt --iters=30000 --warmup_iters=10000 --num_threads=1 --pred_net=/data/users/ansha/tmp/adindexer/precomputation_merge_net.pb --c2_inputs=/data/users/ansha/tmp/adindexer/merge/c2_inputs_precomputation_bs1.pb --c2_sigrid_transforms_opt=1 --c2_use_memonger=1 --c2_apply_nomnigraph_passes --c2_weights=/data/users/ansha/tmp/adindexer/merge/c2_weights_precomputation.pb --pt_enable_static_runtime --pt_cleanup_activations=true --pt_enable_out_variant=true --compare_results --do_profile
```

Check embedding_bag_byte_rowwise_offsets_out is called in perf

Before: 0.081438
After: 0.0783725

Reviewed By: supriyar, hlu1

Differential Revision: D25620718

fbshipit-source-id: 83d5d0dd2e1f60c46e6727f73d5d8b52661b6767
2021-01-11 10:21:05 -08:00
0f412aa293 Move scalar_to_tensor_default_dtype out of ScalarOps.h because it's only useful for torch.where. (#50111)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50111

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D25789638

Pulled By: gchanan

fbshipit-source-id: 4254e11e08606b64e393433ef2c169889ff2ac07
2021-01-11 09:36:29 -08:00
186fe48d6e Format RPC files with clang-format (#50367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50367

This had already been done by mrshenli on Friday (#50236, D25847892 (f9f758e349)) but over the weekend Facebook's internal clang-format version got updated and this changed the format, hence we need to re-apply it. Note that this update also affected the JIT files, which are the other module enrolled in clang-format (see 8530c65e25, D25849205 (8530c65e25)).
ghstack-source-id: 119656866

Test Plan: Shouldn't include functional changes. In any case, there's CI.

Reviewed By: mrshenli

Differential Revision: D25867720

fbshipit-source-id: 3723abc6c35831d7a8ac31f74baf24c963c98b9d
2021-01-11 08:59:19 -08:00
acaf091302 Vulkan convolution touchups. (#50329)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50329

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Differential Revision: D25869147

Pulled By: AshkanAliabadi

fbshipit-source-id: b8f393330b68912506fdaefaf62a455dc192e36c
2021-01-11 08:51:57 -08:00
e29082b2a6 Run mypy over test/test_utils.py (#50278)
Summary:
_resubmission of gh-49654, which was reverted due to a cross-merge conflict_

This caught one incorrect annotation in `cpp_extension.load`.

xref gh-16574.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50278

Reviewed By: walterddr

Differential Revision: D25865278

Pulled By: ezyang

fbshipit-source-id: 25489191628af5cf9468136db36f5a0f72d9d54d
2021-01-11 08:16:23 -08:00
eb87686511 svd_backward: more memory and computationally efficient. (#50109)
Summary:
As per title.

CC IvanYashchuk (unfortunately I cannot add you as a reviewer for some reason).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50109

Reviewed By: gchanan

Differential Revision: D25828536

Pulled By: albanD

fbshipit-source-id: 3791c3dd4f5c2a2917eac62e6527ecd1edcb400d
2021-01-11 05:28:43 -08:00
9d8bd216f9 Use Unicode friendly API in fused kernel related code (#49781)
Summary:
See https://github.com/pytorch/pytorch/issues/47422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49781

Reviewed By: gchanan

Differential Revision: D25847993

Pulled By: ezyang

fbshipit-source-id: e683a8d5841885857ea3037ac801432a1a3eda68
2021-01-10 20:03:00 -08:00
6a3fc0c21c Treat has_torch_function and object_has_torch_function as static False when scripting (#48966)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48966

This PR lets us skip the `if not torch.jit.is_scripting():` guards on `functional` and `nn.functional` by directly registering `has_torch_function` and `object_has_torch_function` to the JIT as statically False.

**Benchmarks**

The benchmark script is kind of long. The reason is that it's testing all four PRs in the stack, plus threading and subprocessing so that the benchmark can utilize multiple cores while still collecting good numbers. Both wall times and instruction counts were collected. This stack changes dozens of operators / functions, but very mechanically such that there are only a handful of codepath changes. Each row is a slightly different code path (e.g. testing in Python, testing in the arg parser, different input types, etc.)

<details>

<summary> Test script </summary>

```
import argparse
import multiprocessing
import multiprocessing.dummy
import os
import pickle
import queue
import random
import sys
import subprocess
import tempfile
import time

import torch
from torch.utils.benchmark import Timer, Compare, Measurement

NUM_CORES = multiprocessing.cpu_count()
ENVS = {
    "ref": "HEAD (current)",
    "torch_fn_overhead_stack_0": "#48963",
    "torch_fn_overhead_stack_1": "#48964",
    "torch_fn_overhead_stack_2": "#48965",
    "torch_fn_overhead_stack_3": "#48966",
}

CALLGRIND_ENVS = tuple(ENVS.keys())

MIN_RUN_TIME = 3
REPLICATES = {
    "longer": 1_000,
    "long": 300,
    "short": 50,
}

CALLGRIND_NUMBER = {
    "overnight": 500_000,
    "long": 250_000,
    "short": 10_000,
}

CALLGRIND_TIMEOUT = {
    "overnight": 800,
    "long": 400,
    "short": 100,
}

SETUP = """
    x = torch.ones((1, 1))
    y = torch.ones((1, 1))
    w_tensor = torch.ones((1, 1), requires_grad=True)
    linear = torch.nn.Linear(1, 1, bias=False)
    linear_w = linear.weight
"""

TASKS = {
    "C++: unary                 `.t()`": "w_tensor.t()",
    "C++: unary  (Parameter)    `.t()`": "linear_w.t()",
    "C++: binary (Parameter)    `mul` ": "x + linear_w",
    "tensor.py: _wrap_type_error_to_not_implemented `__floordiv__`": "x // y",
    "tensor.py: method          `__hash__`": "hash(x)",
    "Python scalar              `__rsub__`": "1 - x",
    "functional.py: (unary)     `unique`": "torch.functional.unique(x)",
    "functional.py: (args)      `atleast_1d`": "torch.functional.atleast_1d((x, y))",
    "nn/functional.py: (unary)  `relu`": "torch.nn.functional.relu(x)",
    "nn/functional.py: (args)   `linear`": "torch.nn.functional.linear(x, w_tensor)",
    "nn/functional.py: (args)   `linear (Parameter)`": "torch.nn.functional.linear(x, linear_w)",
    "Linear(..., bias=False)": "linear(x)",
}

def _worker_main(argv, fn):
    parser = argparse.ArgumentParser()
    parser.add_argument("--output_file", type=str)
    parser.add_argument("--single_task", type=int, default=None)
    parser.add_argument("--length", type=str)
    args = parser.parse_args(argv)
    single_task = args.single_task

    conda_prefix = os.getenv("CONDA_PREFIX")
    assert torch.__file__.startswith(conda_prefix)

    env = os.path.split(conda_prefix)[1]
    assert env in ENVS

    results = []
    for i, (k, stmt) in enumerate(TASKS.items()):
        if single_task is not None and single_task != i:
            continue

        timer = Timer(
            stmt=stmt,
            setup=SETUP,
            sub_label=k,
            description=ENVS[env],
        )
        results.append(fn(timer, args.length))

    with open(args.output_file, "wb") as f:
        pickle.dump(results, f)

def worker_main(argv):
    _worker_main(
        argv,
        lambda timer, _: timer.blocked_autorange(min_run_time=MIN_RUN_TIME)
    )

def callgrind_worker_main(argv):
    _worker_main(
        argv,
        lambda timer, length: timer.collect_callgrind(number=CALLGRIND_NUMBER[length], collect_baseline=False))

def main(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--long", action="store_true")
    parser.add_argument("--longer", action="store_true")
    args = parser.parse_args(argv)

    if args.longer:
        length = "longer"
    elif args.long:
        length = "long"
    else:
        length = "short"
    replicates = REPLICATES[length]

    num_workers = int(NUM_CORES // 2)
    tasks = list(ENVS.keys()) * replicates
    random.shuffle(tasks)
    task_queue = queue.Queue()
    for _ in range(replicates):
        envs = list(ENVS.keys())
        random.shuffle(envs)
        for e in envs:
            task_queue.put((e, None))

    callgrind_task_queue = queue.Queue()
    for e in CALLGRIND_ENVS:
        for i, _ in enumerate(TASKS):
            callgrind_task_queue.put((e, i))

    results = []
    callgrind_results = []

    def map_fn(worker_id):
        # Adjacent cores often share cache and maxing out a machine can distort
        # timings so we space them out.
        callgrind_cores = f"{worker_id * 2}-{worker_id * 2 + 1}"
        time_cores = str(worker_id * 2)
        _, output_file = tempfile.mkstemp(suffix=".pkl")
        try:
            loop_tasks = (
                # Callgrind is long running, and then the workers can help with
                # timing after they finish collecting counts.
                (callgrind_task_queue, callgrind_results, "callgrind_worker", callgrind_cores, CALLGRIND_TIMEOUT[length]),
                (task_queue, results, "worker", time_cores, None))

            for queue_i, results_i, mode_i, cores, timeout in loop_tasks:
                while True:
                    try:
                        env, task_i = queue_i.get_nowait()
                    except queue.Empty:
                        break

                    remaining_attempts = 3
                    while True:
                        try:
                            subprocess.run(
                                " ".join([
                                    "source", "activate", env, "&&",
                                    "taskset", "--cpu-list", cores,
                                    "python", os.path.abspath(__file__),
                                    "--mode", mode_i,
                                    "--length", length,
                                    "--output_file", output_file
                                ] + ([] if task_i is None else ["--single_task", str(task_i)])),
                                shell=True,
                                check=True,
                                timeout=timeout,
                            )
                            break

                        except subprocess.TimeoutExpired:
                            # Sometimes Valgrind will hang if there are too many
                            # concurrent runs.
                            remaining_attempts -= 1
                            if not remaining_attempts:
                                print("Too many failed attempts.")
                                raise
                            print(f"Timeout after {timeout} sec. Retrying.")

                    # We don't need a lock, as the GIL is enough.
                    with open(output_file, "rb") as f:
                        results_i.extend(pickle.load(f))

        finally:
            os.remove(output_file)

    with multiprocessing.dummy.Pool(num_workers) as pool:
        st, st_estimate, eta, n_total = time.time(), None, "", len(tasks) * len(TASKS)
        map_job = pool.map_async(map_fn, range(num_workers))
        while not map_job.ready():
            n_complete = len(results)
            if n_complete and len(callgrind_results):
                if st_estimate is None:
                    st_estimate = time.time()
                else:
                    sec_per_element = (time.time() - st_estimate) / n_complete
                    n_remaining = n_total - n_complete
                    eta = f"ETA: {n_remaining * sec_per_element:.0f} sec"

            print(
                f"\r{n_complete} / {n_total}  "
                f"({len(callgrind_results)} / {len(CALLGRIND_ENVS) * len(TASKS)})   "
                f"{eta}".ljust(40), end="")
            sys.stdout.flush()
            time.sleep(2)
    total_time = int(time.time() - st)
    print(f"\nTotal time: {int(total_time // 60)} min, {total_time % 60} sec")

    desc_to_ind = {k: i for i, k in enumerate(ENVS.values())}
    results.sort(key=lambda r: desc_to_ind[r.description])

    # TODO: Compare should be richer and more modular.
    compare = Compare(results)
    compare.trim_significant_figures()
    compare.colorize(rowwise=True)

    # Manually add master vs. overall relative delta t.
    merged_results = {
        (r.description, r.sub_label): r
        for r in Measurement.merge(results)
    }

    cmp_lines = str(compare).splitlines(False)
    print(cmp_lines[0][:-1] + "-" * 15 + "]")
    print(f"{cmp_lines[1]} |{'':>10}\u0394t")
    print(cmp_lines[2] + "-" * 15)
    for l, t in zip(cmp_lines[3:3 + len(TASKS)], TASKS.keys()):
        assert l.strip().startswith(t)
        t0 = merged_results[(ENVS["ref"], t)].median
        t1 = merged_results[(ENVS["torch_fn_overhead_stack_3"], t)].median
        print(f"{l} |{'':>5}{(t1 / t0 - 1) * 100:>6.1f}%")
    print("\n".join(cmp_lines[3 + len(TASKS):]))

    counts_dict = {
        (r.task_spec.description, r.task_spec.sub_label): r.counts(denoise=True)
        for r in callgrind_results
    }

    def rel_diff(x, x0):
        return f"{(x / x0 - 1) * 100:>6.1f}%"

    task_pad = max(len(t) for t in TASKS)
    print(f"\n\nInstruction % change (relative to `{CALLGRIND_ENVS[0]}`)")
    print(" " * (task_pad + 8)  + (" " * 7).join([ENVS[env] for env in CALLGRIND_ENVS[1:]]))
    for t in TASKS:
        values = [counts_dict[(ENVS[env], t)] for env in CALLGRIND_ENVS]

        print(t.ljust(task_pad + 3) + "  ".join([
            rel_diff(v, values[0]).rjust(len(ENVS[env]) + 5)
            for v, env in zip(values[1:], CALLGRIND_ENVS[1:])]))

        print("\033[4m" + "    Instructions per invocation".ljust(task_pad + 3) + "  ".join([
            f"{v // CALLGRIND_NUMBER[length]:.0f}".rjust(len(ENVS[env]) + 5)
            for v, env in zip(values[1:], CALLGRIND_ENVS[1:])]) + "\033[0m")
        print()

    import pdb
    pdb.set_trace()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", type=str, choices=("main", "worker", "callgrind_worker"), default="main")
    args, remaining = parser.parse_known_args()

    if args.mode == "main":
        main(remaining)

    elif args.mode == "callgrind_worker":
        callgrind_worker_main(remaining)

    else:
        worker_main(remaining)

```

</details>

**Wall time**
<img width="1178" alt="Screen Shot 2020-12-12 at 12 28 13 PM" src="https://user-images.githubusercontent.com/13089297/101994419-284f6a00-3c77-11eb-8dc8-4f69a890302e.png">

<details>

<summary> Longer run (`python test.py --long`) is basically identical. </summary>

<img width="1184" alt="Screen Shot 2020-12-12 at 5 02 47 PM" src="https://user-images.githubusercontent.com/13089297/102000425-2350e180-3c9c-11eb-999e-a95b37e9ef54.png">

</details>

**Callgrind**
<img width="936" alt="Screen Shot 2020-12-12 at 12 28 54 PM" src="https://user-images.githubusercontent.com/13089297/101994421-2e454b00-3c77-11eb-9cd3-8cde550f536e.png">

Test Plan: existing unit tests.

Reviewed By: ezyang

Differential Revision: D25590731

Pulled By: robieta

fbshipit-source-id: fe05305ff22b0e34ced44b60f2e9f07907a099dd
2021-01-10 19:23:38 -08:00
d31a760be4 move has_torch_function to C++, and make a special case object_has_torch_function (#48965)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48965

This PR pulls `__torch_function__` checking entirely into C++, and adds a special `object_has_torch_function` method for ops which only have one arg as this lets us skip tuple construction and unpacking. We can now also do away with the Python side fast bailout for `Tensor` (e.g. `if any(type(t) is not Tensor for t in tensors) and has_torch_function(tensors)`) because they're actually slower than checking with the Python C API.

Test Plan: Existing unit tests. Benchmarks are in #48966

Reviewed By: ezyang

Differential Revision: D25590732

Pulled By: robieta

fbshipit-source-id: 6bd74788f06cdd673f3a2db898143d18c577eb42
2021-01-10 19:23:35 -08:00
632a4401a6 clean up imports for tensor.py (#48964)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48964

Stop importing overrides within methods now that the circular dependency is gone, and also organize the imports while I'm at it because they're a jumbled mess.

Test Plan: Existing unit tests. Benchmarks are in #48966

Reviewed By: ngimel

Differential Revision: D25590730

Pulled By: robieta

fbshipit-source-id: 4fa929ce8ff548500f3e55d0475f3f22c1fccc04
2021-01-10 19:23:32 -08:00
839c2f235f treat Parameter the same way as Tensor (#48963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48963

This PR makes the binding code treat `Parameter` the same way as `Tensor`, unlike all other `Tensor` subclasses. This does change the semantics of `THPVariable_CheckExact`, but it isn't used much and it seemed to make sense for the half dozen or so places that it is used.

Test Plan: Existing unit tests. Benchmarks are in #48966

Reviewed By: ezyang

Differential Revision: D25590733

Pulled By: robieta

fbshipit-source-id: 060ecaded27b26e4b756898eabb9a94966fc9840
2021-01-10 19:18:31 -08:00
fd92bcfe39 Use FileStore in TorchScript for store registry (#50248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50248

Make the FileStore path also use TorchScript when it's needed.

Test Plan: wait for sandcastle.

Reviewed By: zzzwen

Differential Revision: D25842651

fbshipit-source-id: dec941e895a33ffde42c877afcaf64b5aecbe098
2021-01-10 18:50:56 -08:00
92fcb59feb Automated submodule update: tensorpipe (#50267)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: 03e0711889

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50267

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: gchanan

Differential Revision: D25848309

Pulled By: mrshenli

fbshipit-source-id: c77adbad73c5b3b4b7d4e79953a797621dc11e5c
2021-01-10 13:36:57 -08:00
26cc630789 Allow arbitrary docstrings to be inside torchscript interface methods (#50271)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50271

Test Plan:
new python test case

Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D25853916

fbshipit-source-id: adc31e11331a97d08b5bc3f535f185da268554d1
2021-01-10 10:56:30 -08:00
4774c6800b Added linalg.inv (#48261)
Summary:
This PR adds `torch.linalg.inv` for NumPy compatibility.

`linalg_inv_out` uses in-place operations on provided `result` tensor.

I modified `apply_inverse` to accept a tensor of Int instead of a std::vector; that way we can write a function similar to `linalg_inv_out` but without the error checks and device memory synchronization.

I fixed `lda` (leading dimension parameter which is max(1, n)) in many places to handle 0x0 matrices correctly.
Zero batch dimensions are also working and tested.

Ref https://github.com/pytorch/pytorch/issues/42666
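
For illustration, a minimal usage sketch of the new function based on the description above (including the zero-batch-dimension case); not part of the diff itself:

```
import torch

A = torch.randn(3, 3)
A_inv = torch.linalg.inv(A)
assert torch.allclose(A @ A_inv, torch.eye(3), atol=1e-5)

# Zero batch dimensions work too, per the note above:
B = torch.randn(0, 3, 3)
print(torch.linalg.inv(B).shape)  # torch.Size([0, 3, 3])
```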

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48261

Reviewed By: gchanan

Differential Revision: D25849590

Pulled By: mruberry

fbshipit-source-id: cfee6f1daf7daccbe4612ec68f94db328f327651
2021-01-10 04:00:51 -08:00
375c30a717 Avg pool 0 dim acceptance. (#50008)
Summary:
Reopening https://github.com/pytorch/pytorch/pull/47426, since it previously failed the XLA tests.
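
A minimal sketch of what this presumably enables, assuming (as the title and the linked PR suggest) that average pooling now accepts inputs with a zero-sized batch dimension:

```
import torch
import torch.nn.functional as F

x = torch.empty(0, 3, 8, 8)            # zero-sized batch dimension
out = F.avg_pool2d(x, kernel_size=2)   # should no longer error out
print(out.shape)                       # torch.Size([0, 3, 4, 4])
```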

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50008

Reviewed By: mruberry

Differential Revision: D25857687

Pulled By: ngimel

fbshipit-source-id: 8bd47a17b417b20089cf003173d8c0793be58c72
2021-01-09 21:46:05 -08:00
8530c65e25 [codemod][fbcode/caffe2] Apply clang-format update fixes
Test Plan: Sandcastle and visual inspection.

Reviewed By: igorsugak

Differential Revision: D25849205

fbshipit-source-id: ef664c1ad4b3ee92d5c020a5511b4ef9837a09a0
2021-01-09 14:37:36 -08:00
d4c1684cf5 reuse constant from jit (#49916)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49916

Test Plan:
1. Build pytorch locally. `MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ USE_CUDA=0 DEBUG=1 MAX_JOBS=16 python setup.py develop`
2. Run `python save_lite.py`
```
import torch

# ~/Documents/pytorch/data/dog.jpg
model = torch.hub.load('pytorch/vision:v0.6.0', 'shufflenet_v2_x1_0', pretrained=True)
model.eval()

# sample execution (requires torchvision)
from PIL import Image
from torchvision import transforms
import pathlib
import tempfile
import torch.utils.mobile_optimizer

input_image = Image.open('~/Documents/pytorch/data/dog.jpg')
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = preprocess(input_image)
input_batch = input_tensor.unsqueeze(0) # create a mini-batch as expected by the model

# move the input and model to GPU for speed if available
if torch.cuda.is_available():
    input_batch = input_batch.to('cuda')
    model.to('cuda')

with torch.no_grad():
    output = model(input_batch)
# Tensor of shape 1000, with confidence scores over Imagenet's 1000 classes
print(output[0])
# The output has unnormalized scores. To get probabilities, you can run a softmax on it.
print(torch.nn.functional.softmax(output[0], dim=0))

traced = torch.jit.trace(model, input_batch)
sum(p.numel() * p.element_size() for p in traced.parameters())
tf = pathlib.Path('~/Documents/pytorch/data/data/example_debug_map_with_tensorkey.ptl')

torch.jit.save(traced, tf.name)
print(pathlib.Path(tf.name).stat().st_size)
traced._save_for_lite_interpreter(tf.name)
print(pathlib.Path(tf.name).stat().st_size)
print(tf.name)

```

3. Run `python test_lite.py`
```
import torch
from torch.jit.mobile import _load_for_lite_interpreter
# sample execution (requires torchvision)
from PIL import Image
from torchvision import transforms

input_image = Image.open('~/Documents/pytorch/data/dog.jpg')
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = preprocess(input_image)
input_batch = input_tensor.unsqueeze(0) # create a mini-batch as expected by the model
reload_lite_model = _load_for_lite_interpreter('~/Documents/pytorch/experiment/example_debug_map_with_tensorkey.ptl')

with torch.no_grad():
    output_lite = reload_lite_model(input_batch)
# Tensor of shape 1000, with confidence scores over Imagenet's 1000 classes
print(output_lite[0])
# The output has unnormalized scores. To get probabilities, you can run a softmax on it.
print(torch.nn.functional.softmax(output_lite[0], dim=0))

```
4. Compare the result with pytorch in master and pytorch built locally with this change, and see the same output.
5. The model size was 16.1 MB and becomes 12.9 MB with this change.

Imported from OSS

Reviewed By: kimishpatel, iseeyuan

Differential Revision: D25731596

Pulled By: cccclai

fbshipit-source-id: 9731ec1e0c1d5dc76cfa374d2ad3d5bb10990cf0
2021-01-08 22:39:28 -08:00
ba1ce71cd1 Document single op replacement (#50116)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50116

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D25803457

Pulled By: ansley

fbshipit-source-id: de2f3c0bd037859117dde55ba677fb5da34ab639
2021-01-08 21:01:18 -08:00
ea087e2d92 JIT: guard DifferentiableGraph node (#49433)
Summary:
This adds guarding for DifferentiableGraph nodes so that execution does not depend on assumptions that may no longer hold for the current inputs. It also bails out for the CUDA fuser when inputs require gradients.

Fixes https://github.com/pytorch/pytorch/issues/49299

I still need to look into a handful of failing tests, but maybe it can be a discussion basis.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49433

Reviewed By: ngimel

Differential Revision: D25681374

Pulled By: Krovatkin

fbshipit-source-id: 8e7be53a335c845560436c0cceeb5e154c9cf296
2021-01-08 20:01:27 -08:00
36ddb00240 [fix] torch.cat: Don't resize out if it is already of the correct size. (#49937)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49878
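
A minimal sketch of the fixed behavior, assuming an `out` tensor that is already correctly sized:

```
import torch

a, b = torch.ones(2, 3), torch.zeros(2, 3)
out = torch.empty(4, 3)             # already the correct output size
torch.cat([a, b], dim=0, out=out)   # with this fix, `out` is reused as-is rather than resized
print(out.shape)                    # torch.Size([4, 3])
```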

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49937

Reviewed By: mruberry

Differential Revision: D25851564

Pulled By: ngimel

fbshipit-source-id: 9a78922642d5bace70d887a88fa9e92d88038120
2021-01-08 18:10:49 -08:00
c2d37cd990 Change CMake config to enable universal binary for Mac (#50243)
Summary:
This PR is a step towards enabling cross compilation from x86_64 to arm64.

The following has been added:
1. When cross compilation is detected, compile a local universal fatfile to use as protoc.
2. For the simple compile check in MiscCheck.cmake, make sure to compile the small snippet as a universal binary in order to run the check.

**Test plan:**

Kick off a minimal build on a mac intel machine with the macOS 11 SDK with this command:
```
CMAKE_OSX_ARCHITECTURES=arm64 USE_MKLDNN=OFF USE_QNNPACK=OFF USE_PYTORCH_QNNPACK=OFF BUILD_TEST=OFF USE_NNPACK=OFF python setup.py install
```
(If you run the above command before this change, or without macOS 11 SDK set up, it will fail.)

Then check the platform of the built binaries using this command:
```
lipo -info build/lib/libfmt.a
```
Output:
- Before this PR, running a regular build via `python setup.py install` (instead of using the flags listed above):
  ```
  Non-fat file: build/lib/libfmt.a is architecture: x86_64
  ```
- Using this PR:
  ```
  Non-fat file: build/lib/libfmt.a is architecture: arm64
  ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50243

Reviewed By: malfet

Differential Revision: D25849955

Pulled By: janeyx99

fbshipit-source-id: e9853709a7279916f66aa4c4e054dfecced3adb1
2021-01-08 17:26:08 -08:00
49bb0a30e8 Support scripting classmethod called with object instances (#49967)
Summary:
Currently, classmethods are compiled the same way as methods: the first argument is self.
This adds a fake statement that assigns the first argument to the class.
This is kind of hacky, but that's all it takes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49967

Reviewed By: gchanan

Differential Revision: D25841378

Pulled By: ppwwyyxx

fbshipit-source-id: 0f3657b4c9d5d2181d658f9bade9bafc72de33d8
2021-01-08 16:54:46 -08:00
1c12cbea90 Optimize Vulkan command buffer submission rate. (#49112)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49112

Differential Revision: D25729889

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Pulled By: AshkanAliabadi

fbshipit-source-id: c4ab470fdcf3f83745971986f3a44a3dff69287f
2021-01-08 16:39:22 -08:00
aa18d17455 add type annotations to torch.nn.modules.fold (#49479)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49478

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49479

Reviewed By: mruberry

Differential Revision: D25723838

Pulled By: walterddr

fbshipit-source-id: 45c4cbd6f147b6dc4a5f5419c17578c49c201022
2021-01-08 13:52:14 -08:00
2c4b6ec457 Unused exception variables (#50181)
Summary:
These unused variables were identified by [pyflakes](https://pypi.org/project/pyflakes/). They can be safely removed to simplify the code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50181

Reviewed By: gchanan

Differential Revision: D25844270

fbshipit-source-id: 0e648ffe8c6db6daf56788a13ba89806923cbb76
2021-01-08 13:33:18 -08:00
8f31621f78 Fix MKL builds on Ubuntu (#50212)
Summary:
This fixes https://github.com/pytorch/pytorch/issues/50211

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50212

Reviewed By: janeyx99

Differential Revision: D25850876

Pulled By: walterddr

fbshipit-source-id: be138db3ae370c45f5fbf3af486cf8b32518df87
2021-01-08 13:16:30 -08:00
1bb7d8ff93 Revert D25717504: Clean up some type annotations in test/jit
Test Plan: revert-hammer

Differential Revision:
D25717504 (a4f30d48d8)

Original commit changeset: 9a83c44db02e

fbshipit-source-id: e6e3a83bed22701d8125f5a293dfcd5093c1a2cd
2021-01-08 12:14:48 -08:00
f9f758e349 Apply clang-format to rpc cpp files (#50236)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50236

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25847892

Pulled By: mrshenli

fbshipit-source-id: b4af1221acfcaba8903c629869943abbf877e04e
2021-01-08 11:47:43 -08:00
0bb341daaa Dump state when hitting ambiguous_autogradother_kernel. (#50246)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50246

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D25843205

Pulled By: ailzhang

fbshipit-source-id: 66916ae477a4ae97e1695227fc6af78c4f328ea3
2021-01-08 11:31:54 -08:00
d78b638a31 Convert string => raw strings so char classes can be represented in Python regex (#50239)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50239

Convert regex strings that contain character classes (e.g. \d, \s, \w, \b) into raw strings so the backslash sequences won't be interpreted as string escape characters.
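
A minimal example of the two spellings; the raw string makes the backslash explicit rather than relying on Python passing unrecognized escapes through:

```
import re

# Both match the same pattern, but the raw-string form is unambiguous.
assert re.findall("\\d+", "a1b22") == ["1", "22"]
assert re.findall(r"\d+", "a1b22") == ["1", "22"]
```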

References:
Python RegEx - https://www.w3schools.com/python/python_regex.asp
Python Escape Chars - https://www.w3schools.com/python/gloss_python_escape_characters.asp
Python Raw String - https://www.journaldev.com/23598/python-raw-string
Python RegEx Docs - https://docs.python.org/3/library/re.html
Python String Tester - https://www.w3schools.com/python/trypython.asp?filename=demo_string_escape
Python Regex Tester - https://regex101.com/

Test Plan: To find occurrences of regex strings with the above issue in VS Code, search using the regex \bre\.[a-z]+\(['"], and under 'files to include', use /data/users/your_username/fbsource/fbcode/caffe2.

Reviewed By: r-barnes

Differential Revision: D25813302

fbshipit-source-id: df9e23c0a84c49175eaef399ca6d091bfbeed936
2021-01-08 11:17:17 -08:00
5d45140d68 [numpy] torch.{all/any} : output dtype is always bool (#47878)
Summary:
BC-breaking note:

This PR changes the behavior of the any and all functions to always return a bool tensor. Previously these functions were only defined on bool and uint8 tensors, and when called on uint8 tensors they would also return a uint8 tensor. (When called on a bool tensor they would return a bool tensor.)

PR summary:

https://github.com/pytorch/pytorch/pull/44790#issuecomment-725596687

Fixes 2 and 3

Also Fixes https://github.com/pytorch/pytorch/issues/48352

Changes
* Output dtype is always `bool` (consistent with numpy). **BC-breaking** (previously the output matched the input dtype)
* Uses vectorized version for all dtypes on CPU
* Enables test for complex
* Update doc for `torch.all` and `torch.any`

TODO
* [x] Update docs
* [x] Benchmark
* [x] Raise issue on XLA
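
A minimal sketch of the BC-breaking change described above:

```
import torch

t = torch.tensor([0, 1, 2], dtype=torch.uint8)
print(torch.any(t))  # tensor(True)  -- output dtype is now always bool
print(torch.all(t))  # tensor(False) -- previously a uint8 input produced a uint8 output
```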

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47878

Reviewed By: albanD

Differential Revision: D25714324

Pulled By: mruberry

fbshipit-source-id: a87345f725297524242d69402dfe53060521ea5d
2021-01-08 11:05:39 -08:00
a4f30d48d8 Clean up some type annotations in test/jit (#50158)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50158

Upgrades type annotations from Python2 to Python3

Test Plan: Sandcastle tests

Reviewed By: xush6528

Differential Revision: D25717504

fbshipit-source-id: 9a83c44db02ec79f353862255732873f6d7f885e
2021-01-08 10:56:55 -08:00
81778e2811 [onnx] Do not deref nullptr in scalar type analysis (#50237)
Summary:
Apply a little bit of defensive programming: `type->cast<TensorType>()` returns an optional pointer, so dereferencing it unconditionally can lead to a hard crash.

Fixes SIGSEGV reported in https://github.com/pytorch/pytorch/issues/49959

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50237

Reviewed By: walterddr

Differential Revision: D25839675

Pulled By: malfet

fbshipit-source-id: 403d6df5e2392dd6adc308b1de48057f2f9d77ab
2021-01-08 10:07:30 -08:00
b5ab0a7f78 Improve torch.linalg.qr (#50046)
Summary:
This is a follow-up to PR https://github.com/pytorch/pytorch/issues/47764 that fixes the remaining details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50046

Reviewed By: zou3519

Differential Revision: D25825557

Pulled By: mruberry

fbshipit-source-id: b8e335e02265e73484a99b0189e4cc042828e0a9
2021-01-08 09:52:31 -08:00
88bd69b488 Stop using c10::scalar_to_tensor in float_power. (#50105)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50105

There should be no functional change here.

A couple of reasons here:
1) This function is generally an anti-pattern (https://github.com/pytorch/pytorch/issues/49758) and it is good to minimize its usage in the code base.
2) pow itself has a fair amount of smarts, like not broadcasting scalar/tensor combinations, and we should defer to it.
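
For context, a minimal sketch of `torch.float_power`'s user-visible behavior, which this diff does not change:

```
import torch

x = torch.tensor([2, 3])                        # integer input
print(torch.float_power(x, 2))                  # tensor([4., 9.], dtype=torch.float64)
print(torch.float_power(x, torch.tensor(0.5)))  # scalar/tensor handling is deferred to pow
```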

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D25786172

Pulled By: gchanan

fbshipit-source-id: 89de03aa0b900ce011a62911224a5441f15e331a
2021-01-08 09:44:15 -08:00
55919a4758 add type annotations to torch.nn.quantized.modules.conv (#49702)
Summary:
closes gh-49700

No mypy issues were found in the first three entries deleted from `mypy.ini`:
```
[mypy-torch.nn.qat.modules.activations]
ignore_errors = True

[mypy-torch.nn.qat.modules.conv]
ignore_errors = True

[mypy-torch.nn.quantized.dynamic.modules.linear]
ignore_errors = True
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49702

Reviewed By: walterddr, zou3519

Differential Revision: D25767119

Pulled By: ezyang

fbshipit-source-id: cb83e53549a299538e1b154cf8b79e3280f7392a
2021-01-08 07:31:39 -08:00
54ce171f16 Fix persistent_workers + pin_memory (#48543)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48370 https://github.com/pytorch/pytorch/issues/47445

cc emcastillo who authored the original functionality.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48543

Reviewed By: bdhirsh

Differential Revision: D25277474

Pulled By: ejguan

fbshipit-source-id: 1967002124fb0fff57caca8982bc7df359a059a2
2021-01-08 07:04:10 -08:00
d00acebd14 Add tensor.view(dtype) (#47951)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42571

Note that this functionality is a subset of [`numpy.ndarray.view`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.view.html):
- this only supports viewing a tensor as a dtype with the same number of bytes
- this does not support viewing a tensor as a subclass of `torch.Tensor`
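
A minimal sketch, reflecting the same-itemsize constraint noted above:

```
import torch

x = torch.tensor([1.0, 2.0], dtype=torch.float32)
y = x.view(torch.int32)   # same bytes reinterpreted; both dtypes are 4 bytes wide
print(y.dtype, y.shape)   # torch.int32 torch.Size([2])

# x.view(torch.float64)   # would raise: element sizes differ (4 vs 8 bytes)
```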

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47951

Reviewed By: ngimel

Differential Revision: D25062301

Pulled By: mruberry

fbshipit-source-id: 9fefaaef77f15d5b863ccd12d836932983794475
2021-01-08 06:55:21 -08:00
5c5abd591d Implement torch.linalg.svd (#45562)
Summary:
This is related to https://github.com/pytorch/pytorch/issues/42666 .
I am opening this PR to have the opportunity to discuss things.
First, we need to consider the differences between `torch.svd` and `numpy.linalg.svd`:

1. `torch.svd` takes `some=True`, while `numpy.linalg.svd` takes `full_matrices=True`, which is effectively the opposite (and with the opposite default, too!)

2. `torch.svd` returns `(U, S, V)`, while `numpy.linalg.svd` returns `(U, S, VT)` (i.e., V transposed).

3. `torch.svd` always returns a 3-tuple; `numpy.linalg.svd` returns only `S` in case `compute_uv==False`

4. `numpy.linalg.svd` also takes an optional `hermitian=False` argument.

I think that the plan is to eventually deprecate `torch.svd` in favor of `torch.linalg.svd`, so this PR does the following:

1. Rename/adapt the old `svd` C++ functions into `linalg_svd`: in particular, now `linalg_svd` takes `full_matrices` and returns `VT`

2. Re-implement the old C++ interface on top of the new (by negating `full_matrices` and transposing `VT`).

3. The C++ version of `linalg_svd` *always* returns a 3-tuple (we can't do anything else). So, there is a python wrapper which manually calls `torch._C._linalg.linalg_svd` to tweak the return value in case `compute_uv==False`.

Currently, `linalg_svd_backward` is broken because it has not been adapted yet after the `V ==> VT` change, but before continuing and spending more time on it I wanted to make sure that the general approach is fine.
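
A minimal sketch of the new interface, reflecting points 1 and 2 above (the `full_matrices` flag and the transposed `VT` return value):

```
import torch

A = torch.randn(5, 3)
U, S, Vh = torch.linalg.svd(A, full_matrices=False)
print(U.shape, S.shape, Vh.shape)  # torch.Size([5, 3]) torch.Size([3]) torch.Size([3, 3])
assert torch.allclose(A, U @ torch.diag(S) @ Vh, atol=1e-5)
```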

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45562

Reviewed By: H-Huang

Differential Revision: D25803557

Pulled By: mruberry

fbshipit-source-id: 4966f314a0ba2ee391bab5cda4563e16275ce91f
2021-01-08 06:46:16 -08:00
006cfebf3d Update autograd related comments (#50166)
Summary:
Remove outdated comment and update to use new paths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50166

Reviewed By: zou3519

Differential Revision: D25824942

Pulled By: albanD

fbshipit-source-id: 7dc694891409e80e1804eddcdcc50cc21b60f822
2021-01-08 06:37:57 -08:00
9f832c8d3e [numpy] torch.exp: promote integer inputs to float (#50093)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515
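
A minimal sketch of the new promotion behavior:

```
import torch

x = torch.tensor([0, 1, 2])   # int64 input
y = torch.exp(x)              # promoted to the default floating-point dtype
print(y.dtype)                # torch.float32
```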

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50093

Reviewed By: H-Huang

Differential Revision: D25803549

Pulled By: mruberry

fbshipit-source-id: e6f245b5e728f2dca6072f8c359f03dff63aa14d
2021-01-08 06:30:18 -08:00
fc2ead0944 Autograd engine, only enqueue task when it is fully initialized (#50164)
Summary:
This solves a race condition where the worker thread might
see a partially initialized graph_task

Fixes https://github.com/pytorch/pytorch/issues/49652

I don't know how to reliably trigger the race, so I didn't add any test. But the ROCm build flakiness (it just happens to race more often on ROCm builds) should disappear after this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50164

Reviewed By: zou3519

Differential Revision: D25824954

Pulled By: albanD

fbshipit-source-id: 6a3391753cb2afd2ab415d3fb2071a837cc565bb
2021-01-08 05:30:11 -08:00
c215ffb6a2 Revert D25687465: [PyTorch] Devirtualize TensorImpl::dim() with macro
Test Plan: revert-hammer

Differential Revision:
D25687465 (4de6b279c8)

Original commit changeset: 89aabce165a5

fbshipit-source-id: fa5def17209d1691e68b1245fa0873fd03e88eaa
2021-01-07 22:07:42 -08:00
294b7867eb Address clang-tidy warnings in ProcessGroupNCCL (#50131)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50131

Noticed that in the internal diff for
https://github.com/pytorch/pytorch/pull/49069 there was a clang-tidy warning to
use emplace instead of push_back. This can save us a copy, since the element is
constructed in place rather than first being constructed and then copied into the container.
ghstack-source-id: 119560979

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D25800134

fbshipit-source-id: 243e57318f5d6e43de524d4e5409893febe6164c
2021-01-07 21:29:28 -08:00
5a63c452e6 Disable cuDNN persistent RNN on sm_86 devices (#49534)
Summary:
Excludes sm_86 GPU devices from using cuDNN persistent RNN.

This is because some hard-to-detect edge cases throw exceptions with cuDNN 8.0.5 on NVIDIA A40 GPUs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49534

Reviewed By: mruberry

Differential Revision: D25632378

Pulled By: mrshenli

fbshipit-source-id: cbe78236d85d4d0c2e4ca63a3fc2c4e2de662d9e
2021-01-07 21:20:21 -08:00
b73c018598 [PyTorch] Change representation of SizesAndStrides (#47508)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47508

This moves SizesAndStrides to a specialized representation
that is 5 words smaller in the common case of tensor rank 5 or less.
ghstack-source-id: 119313560

Test Plan:
SizesAndStridesTest added in previous diff passes under
ASAN + UBSAN.

Run framework overhead benchmarks. Looks more or less neutral.

Reviewed By: ezyang

Differential Revision: D24772023

fbshipit-source-id: 0a75fd6c2daabb0769e2f803e80e2d6831871316
2021-01-07 21:01:46 -08:00
882ddb2f2d [PyTorch] Introduce packed SizesAndStrides abstraction (#47507)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47507

This introduces a new SizesAndStrides class as a helper for
TensorImpl, in preparation for changing its representation.
ghstack-source-id: 119313559

Test Plan:
Added new automated tests as well.

Run framework overhead benchmarks. Results seem to be neutral-ish.

Reviewed By: ezyang

Differential Revision: D24762557

fbshipit-source-id: 6cc0ede52d0a126549fb51eecef92af41c3e1a98
2021-01-07 20:56:50 -08:00
c480eebf95 Completely remove FutureMessage type (#50029)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50029

Test Plan:
buck run mode/opt -c=python.package_style=inplace //caffe2/torch/fb/training_toolkit/examples:ctr_mbl_feed_april_2020 -- local-preset --flow-entitlement pytorch_ftw_gpu --secure-group oncall_pytorch_distributed

Before:

```
...

I0107 11:03:10.434000 3831111 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|total_examples 14000.0
I0107 11:03:10.434000 3831111 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|window_qps 74.60101318359375
I0107 11:03:10.434000 3831111 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|lifetime_qps 74.60101318359375

...

I0107 11:05:12.132000 3831111 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|total_examples 20000.0
I0107 11:05:12.132000 3831111 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|window_qps 64.0
I0107 11:05:12.132000 3831111 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|lifetime_qps 64.64917755126953

...
```

After:

```
...

I0107 11:53:03.858000 53693 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|total_examples 14000.0
I0107 11:53:03.858000 53693 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|window_qps 72.56404876708984
I0107 11:53:03.858000 53693 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|lifetime_qps 72.56404876708984

...

I0107 11:54:24.612000 53693 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|total_examples 20000.0
I0107 11:54:24.612000 53693 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|window_qps 73.07617950439453
I0107 11:54:24.612000 53693 print_publisher.py:23  master      ] Publishing batch metrics: qps-qps|lifetime_qps 73.07617950439453

...
```

Reviewed By: lw

Differential Revision: D25774915

Pulled By: mrshenli

fbshipit-source-id: 1128c3c2df9d76e36beaf171557da86e82043eb9
2021-01-07 19:50:57 -08:00
171648edaa Completely Remove FutureMessage from RPC agents (#50028)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50028

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25753887

Pulled By: mrshenli

fbshipit-source-id: 40718349c2def262a16aaa24c167c0b540cddcb1
2021-01-07 19:50:53 -08:00
098751016e Completely Remove FutureMessage from RPC cpp tests (#50027)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50027

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25753815

Pulled By: mrshenli

fbshipit-source-id: 85b9b03fec52b4175288ac3a401285607744b451
2021-01-07 19:50:50 -08:00
1f795e1a9b Remove FutureMessage from RPC request callback logic (#50026)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50026

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25753588

Pulled By: mrshenli

fbshipit-source-id: a6fcda7830901dd812fbf0489b001e6bd9673780
2021-01-07 19:50:47 -08:00
2831af9837 Completely remove FutureMessage from FaultyProcessGroupAgent (#50025)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50025

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25753587

Pulled By: mrshenli

fbshipit-source-id: a5d4106a10d1b0d3e4c406751795f19af8afd120
2021-01-07 19:50:43 -08:00
0684d07425 Remove FutureMessage from sender TensorPipeAgent (#50024)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50024

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25753386

Pulled By: mrshenli

fbshipit-source-id: fdca051b805762a2c88f965ceb3edf1c25d40a56
2021-01-07 19:50:40 -08:00
1deb895074 Remove FutureMessage from sender ProcessGroupAgent (#50023)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50023

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25753217

Pulled By: mrshenli

fbshipit-source-id: 5a98473c17535c8f92043abe143064e7fca4413b
2021-01-07 19:50:37 -08:00
0c943931aa Completely remove FutureMessage from distributed autograd (#50020)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50020

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25752968

Pulled By: mrshenli

fbshipit-source-id: 138d37e204b6f9a584633cfc79fd44c8c9c00f41
2021-01-07 19:50:33 -08:00
b2da0b5afe Completely remove FutureMessage from RPC TorchScript implementations (#50005)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50005

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25750663

Pulled By: mrshenli

fbshipit-source-id: 6d97156b61d82aa19dd0567ca72fe04bd7b5d1e7
2021-01-07 19:50:30 -08:00
2d5f57cf3b Completely remove FutureMessage from RRef Implementations (#50004)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50004

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25750602

Pulled By: mrshenli

fbshipit-source-id: 06854a77f4fb5cc4c34a1ede843301157ebf7309
2021-01-07 19:50:27 -08:00
d730c7e261 Replace FutureMessage with ivalue::Future in RpcAgent retry logic (#49995)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49995

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25745301

Pulled By: mrshenli

fbshipit-source-id: b5e3a7e0b377496924847d8d70d61de32e2d87f4
2021-01-07 19:50:23 -08:00
008206decc Replace FutureMessage with ivalue::Future in RRefContext (#49960)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49960

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25730530

Pulled By: mrshenli

fbshipit-source-id: 5d54572c653592d79c40aed616266c87307a1ad8
2021-01-07 19:50:19 -08:00
25ef605132 Replace FutureMessage with ivalue::Future in distributed/autograd/utils.* (#49927)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49927

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25724241

Pulled By: mrshenli

fbshipit-source-id: d608e448f5224e41fbb0b5be6b9ac51a587f25b4
2021-01-07 19:50:16 -08:00
84e3237a53 Let RpcAgent::send() return JitFuture (#49906)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49906

This commit modifies RPC Message to inherit from `torch::CustomClassHolder`,
and wraps a Message in an IValue in `RpcAgent::send()`.

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25719518

Pulled By: mrshenli

fbshipit-source-id: 694e40021e49e396da1620a2f81226522341550b
2021-01-07 19:47:14 -08:00
4de6b279c8 [PyTorch] Devirtualize TensorImpl::dim() with macro (#49770)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49770

Seems like the performance cost of making this commonly-called method virtual isn't worth having use of undefined tensors crash a bit earlier (they'll still fail to dispatch).
ghstack-source-id: 119528065

Test Plan: framework overhead benchmarks

Reviewed By: ezyang

Differential Revision: D25687465

fbshipit-source-id: 89aabce165a594be401979c04236114a6f527b59
2021-01-07 19:05:41 -08:00
1a1b665827 [PyTorch] validate that SparseTensorImpl::dim needn't be overridden (#49767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49767

I'm told that the base implementation should work fine. Let's validate that in an intermediate diff before removing it.
ghstack-source-id: 119528066

Test Plan: CI

Reviewed By: ezyang, bhosmer

Differential Revision: D25686830

fbshipit-source-id: f931394d3de6df7f6c5c68fe8ab711d90d3b12fd
2021-01-07 19:05:38 -08:00
2e7c6cc9df [PyTorch] Devirtualize TensorImpl::numel() with macro (#49766)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49766

Devirtualizing this seems like a decent performance improvement on
internal benchmarks.

The *reason* this is a performance improvement is twofold:
1) virtual calls are a bit slower than regular calls
2) virtual functions in `TensorImpl` can't be inlined

Test Plan: internal benchmark

Reviewed By: hlu1

Differential Revision: D25602321

fbshipit-source-id: d61556456ccfd7f10c6ebdc3a52263b438a2aef1
2021-01-07 19:00:45 -08:00
bf4fcab681 Fix SyncBatchNorm usage without stats tracking (#50126)
Summary:
In `batch_norm_gather_stats_with_counts_cuda` use `input.scalar_type()` if `running_mean` is not defined
In `SyncBatchNorm` forward function create count tensor with `torch.float32` type if `running_mean` is None
Fix a few typos

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50126

Test Plan:
```
python -c "import torch;print(torch.batch_norm_gather_stats_with_counts( torch.randn(1, 3, 3, 3, device='cuda'), mean = torch.ones(2, 3, device='cuda'), invstd = torch.ones(2, 3, device='cuda'), running_mean = None, running_var = None  , momentum = .1, eps = 1e-5, counts = torch.ones(2, device='cuda')))"
```

Fixes https://github.com/pytorch/pytorch/issues/49730

Reviewed By: ngimel

Differential Revision: D25797930

Pulled By: malfet

fbshipit-source-id: 22a91e3969b5e9bbb7969d9cc70b45013a42fe83
2021-01-07 18:31:13 -08:00
870ab04b64 add type annotations to torch._utils (#49705)
Summary:
closes gh-49704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49705

Reviewed By: mruberry

Differential Revision: D25725352

Pulled By: malfet

fbshipit-source-id: 05a7041c9caffde4a5c1eb8af0d13697075103af
2021-01-07 16:20:16 -08:00
ce370398cc [Gradient Compression] Remove the extra comma after "bucket" in PowerSGD hook signatures (#50197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50197

Remove the extra comma after "bucket".
ghstack-source-id: 119513484

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D25823117

fbshipit-source-id: acf048f7cb732c23cba3a81ccce1e70f6b9f4299
2021-01-07 15:56:20 -08:00
09eefec627 Clean up some type annotations in android (#49944)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49944

Upgrades type annotations from Python2 to Python3

Test Plan: Sandcastle tests

Reviewed By: xush6528

Differential Revision: D25717539

fbshipit-source-id: c621e2712e87eaed08cda48eb0fb224f6b0570c9
2021-01-07 15:42:55 -08:00
f83d57f99e [Don't review] Clean up type annotations in caffe2/torch/nn (#50079)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50079

Test Plan: Sandcastle tests

Reviewed By: xush6528

Differential Revision: D25718694

fbshipit-source-id: f535fb879bcd4cb4ea715adfd90bbffa3fcc1150
2021-01-07 15:39:20 -08:00
2bceee785f Clean up simple type annotations in nn/functional.py (#50106)
Summary:
Also reformats code to pass linters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50106

Test Plan: Sandcastle tests

Reviewed By: xush6528

Differential Revision: D25787566

fbshipit-source-id: 39c86b4021e279f92f8ccf30252a6cfae1063c3c
2021-01-07 15:33:40 -08:00
3b56e9d0ef [pytorch] prune based on custom importance scores (#48378)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48378

This commit adds support for accepting custom importance scores to use for pruning mask computation, rather than only using the parameter.

This is useful if one wants to prune based on scores from a different technique, such as activations, gradients, or a weighted scoring of parameters.

An alternative to the above approach would be to pass a custom mask to the already available interface. However, accepting importance scores is easier, since it can leverage the mask computation logic that has already been baked in.

In addition, the commit also makes some minor lint fixes.
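
A minimal sketch of how the new argument might be used, assuming the `importance_scores` keyword is exposed through the existing pruning entry points such as `l1_unstructured`:

```
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(4, 2)
# Hypothetical scores (e.g. derived from gradients or activations);
# the pruning mask is computed from these instead of the weights themselves.
scores = torch.rand_like(layer.weight)
prune.l1_unstructured(layer, name="weight", amount=0.5, importance_scores=scores)
print(layer.weight_mask)
```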

Test Plan:
* Unit tests
* Circle CI

Differential Revision: D24997355

fbshipit-source-id: 30797897977b57d3e3bc197987da20e88febb1fa
2021-01-07 15:21:43 -08:00
23cadb5d7b [PyTorch] Specialize list_element_from for IValue to avoid extra move/copy (#50124)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50124

This patch makes `list_element_from` avoid extra `IValue`
move/copies for `List<IValue>` by just forwarding the reference
argument.

We take advantage of this in `listConstruct` by using `push_back`
(which hits the `ListElementFrom` path) instead of `emplace_back`.
ghstack-source-id: 119478962

Test Plan:
Inspected generated assembly for vararg_functions.cpp in
optimized build. Rather than a call to `vector::emplace_back` and an extra
move, `vector::push_back` gets inlined.

Reviewed By: ezyang

Differential Revision: D25794277

fbshipit-source-id: 2354d8c08e0a0d6be2db3f0d0d6c90c3a455d8bd
2021-01-07 15:17:36 -08:00
7ce8f7e488 [quant] Backend string for the quantized types (#49965)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49965

Without this, checking the type of a quantized tensor using `type` would throw an error.

After this PR running the `type(qx)`, where `qx` is a quantized tensor would show something like `torch.quantized.QUInt8`.

Test Plan: Not needed -- this is just a string description for the quantized tensors

Differential Revision: D25731594

Reviewed By: ezyang

Pulled By: z-a-f

fbshipit-source-id: 942fdf89a1c50895249989c7203f2e7cc00df4c6
2021-01-07 14:57:34 -08:00
0c3bae6a89 docker: add environment variable PYTORCH_VERSION (#50154)
Summary:
The aim is to be able to inspect a container image and immediately determine
which version of PyTorch it contains.

Closes https://github.com/pytorch/pytorch/issues/48324

Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>

seemethere PTAL.
As you requested in https://github.com/pytorch/pytorch/issues/48324#issuecomment-754237156, I'm submitting the patch. But I could only do limited testing as I'm not sure these Makefile/Dockerfile are used for pushing the Docker Hub images (since the Makefile tags the image with a `v` prefix for the version, as in: `pytorch:v1.7.1`, but Docker Hub images don't have this prefix).

Also on the master branch we currently have the following:
```
$ git describe --tags
v1.4.0a0-11171-g68a6e46379
```
So it's a little off, but it behaves as expected on the `release/1.7` branch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50154

Reviewed By: walterddr

Differential Revision: D25828491

Pulled By: seemethere

fbshipit-source-id: 500ec96cb5f5da1321610002d5e3678f4b0b94b5
2021-01-07 14:12:54 -08:00
e12008d110 [quant] Mapping for the _LinearWithBias (#49964)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49964

`torch.nn.modules.linear._LinearWithBias` is only used in the transformers, and is completely identical to the `torch.nn.Linear`.
This PR creates a mapping so that this module would be treated the same as the Linear.

Test Plan:
```
python test/test_quantization.py TestDynamicQuantizedModule TestStaticQuantizedModule
```

Differential Revision: D25731589

Reviewed By: jerryzh168

Pulled By: z-a-f

fbshipit-source-id: 1b2697014e250e97d3010cdb542f9d130b71fbc3
2021-01-07 13:57:29 -08:00
160b4be60a [PyTorch] typeid: ensmallen scalarTypeItemSizes (#50165)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50165

There are currently 17 types, so this used to stretch across 3 cache lines and now it fits in one. All the types in question seem to be way under 255 bytes in size anyway.
ghstack-source-id: 119485090

Test Plan: CI, profiled internal benchmarks

Reviewed By: smessmer

Differential Revision: D25813574

fbshipit-source-id: c342d4f12a7b035503e1483b8301f68d98f3c503
2021-01-07 13:52:02 -08:00
0495180f6e Fix deprecation warning in scalar_type_analysis (#50218)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50218

Reviewed By: janeyx99

Differential Revision: D25827971

Pulled By: malfet

fbshipit-source-id: a4e467721435d7ae0db2195694053621eee8c2ee
2021-01-07 13:33:51 -08:00
7377bfb1bd Fix compiler warnings pertaining to uniform_int() (#49914)
Summary:
**PROBLEM DESCRIPTION:**
GitHub issue 46391 suggests that compiler warnings pertaining to _floating-point value does not fit in required integral type_ might cause some confusion.

These compiler warnings arise during compilation of the templated function `uniform_int()`. The warnings are misleading because they come from the way the compiler instantiates templated functions; the if-else statements in the function rule out the possibilities that the warnings describe. So, the purpose of a fix would only be to silence the compiler warnings, not to fix an actual bug.

**FIX DESCRIPTION:**
[EDITED, after inputs from malfet]: In the function `uniform_int()`, the if-else conditions pertaining to types `double` & `float` can be removed, and then an overloaded specialized function can be added for floating-point types. The current version of the function can be specialized to not have its return type as a floating point type.

An unrelated observation is that the if-else condition pertaining to the type `double` (line 57 in the original code) was redundant, as line 61 in the original code covered it (`std::is_floating_point<T>::value` would also have been true for the type `double`).

Fixes https://github.com/pytorch/pytorch/issues/46391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49914

Reviewed By: H-Huang

Differential Revision: D25808037

Pulled By: malfet

fbshipit-source-id: 3f94c4bca877f09720b0d6efa5e1788554aba074
2021-01-07 13:26:08 -08:00
e096449360 Adding MyPy daemon status file to gitignore (#50132)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50132

When running mypy command using `dmypy run`, it creates a status file.
This PR adds the file to the ignore list.

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D25834504

Pulled By: z-a-f

fbshipit-source-id: 6c5a8edd6d8eaf61983e3ca80e798e02d78e38ce
2021-01-07 12:55:31 -08:00
ec6d29d6fa Drop unused imports from test (#49973)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49973

From
```
./python/libcst/libcst codemod remove_unused_imports.RemoveUnusedImportsWithGlean --no-format caffe2/
```

Test Plan: Standard sandcastle tests

Reviewed By: xush6528

Differential Revision: D25727350

fbshipit-source-id: 237ec4edd85788de920663719173ebec7ddbae1c
2021-01-07 12:09:38 -08:00
fbdb7822c6 minor improvement: extract major version (#49393)
Summary:
1. Unify major version extraction. If there's an error, it will throw an exception in PR CI. So far, not all CUDA version tests run in PR CI.
2. Better readability.

passed cuda11 tests.
https://circleci.com/gh/pytorch/pytorch/9651144?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link
https://circleci.com/gh/pytorch/pytorch/9651145?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49393

Reviewed By: zou3519

Differential Revision: D25828318

Pulled By: seemethere

fbshipit-source-id: 5c6861f0ddafe9a77a9fe397e4e0f69ecce4b27f
2021-01-07 11:47:04 -08:00
8706187523 Fix #42271 (#50141)
Summary:
This pull request fixes #42271 by manually specifying the template data type of `tensor<template>.item()` in `aten/src/THC/generic/THCTensorMasked.cu`.

Changes in submodules are not expected since I have pulled the latest submodules from the PyTorch master branch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50141

Reviewed By: zou3519

Differential Revision: D25826104

Pulled By: ezyang

fbshipit-source-id: 80527a14786b36e4e520fdecc932e257d2520f89
2021-01-07 11:38:45 -08:00
45c0d64b33 Skip test_functional_autograd_benchmark during code coverage (#50183)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50183

Reviewed By: walterddr

Differential Revision: D25819825

Pulled By: malfet

fbshipit-source-id: 0a3e64d6b6aedb6e729e7d14167955fd2d89862c
2021-01-07 11:17:21 -08:00
ace1680b68 [static runtime] Remove register concept by giving ownership to the nodes (#50050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50050

Every node will now own its outputs.
I don't expect any big perf improvements from this diff; the only eliminated code is from `deallocate_registers`.
Largely, this is to enable more optimizations going forward.

Test Plan:
buck test mode/dev //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/test:static_runtime

Reviewed By: hlu1

Differential Revision: D25571181

fbshipit-source-id: 91fcfbd5cd968af963ba89c45656997650ca6d18
2021-01-07 10:19:58 -08:00
321b98830e [script] Validator for unsupported ops on accelerator
Summary:
ATT

Next step:
1. integrate with dper flow.
2. Support in bento after diff is pushed to prod.

Test Plan:
buck run mode/opt-clang sigrid/predictor/scripts:check_accelerator_unsupported_ops -- --model_entity_id=232891739

I0106 17:08:36.425796 1238141 pybind_state.cc:531] Unsupported ops: Fused8BitRowwiseQuantizedToFloat

Reviewed By: khabinov

Differential Revision: D25818253

fbshipit-source-id: 8d8556b0400c1747f154b0517352f1685f1aa8b1
2021-01-07 02:04:56 -08:00
968ad47b41 Fix error messages thrown when the padding size is not valid (#50135)
Summary:
Hi, I changed the error messages so that they correspond to the actual implementation.
According to the implementation, half of the kernel size is valid as a padding size.

This is minor, but here is an example where the padding size is exactly equal to half of the kernel size:

Input: 5 x 5
Kernel: 4 x 4
Stride: 4
Padding: 2
==> Output: 2 x 2

You don't get an error in the above case, as shown below:
```python
import torch
import torch.nn as nn

# no error
input = torch.randn(1, 1, 5, 5)
pool = nn.MaxPool2d(4, 4, padding=2)
print(pool(input).shape)
# >>> torch.Size([1, 1, 2, 2])
```

You get the error when you set the padding size larger than half of the kernel size:
```python
# it raises error
input = torch.randn(1, 1, 5, 5)
pool = nn.MaxPool2d(4, 4, padding=3)
print(pool(input).shape)
```

The error message is:
```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-43-2b09d1c5d79a> in <module>()
      1 input = torch.randn(1, 1, 5, 5)
      2 pool = nn.MaxPool2d(4, 4, padding=3)
----> 3 print(pool(input).shape)

3 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in _max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode, return_indices)
    584         stride = torch.jit.annotate(List[int], [])
    585     return torch.max_pool2d(
--> 586         input, kernel_size, stride, padding, dilation, ceil_mode)
    587
    588 max_pool2d = boolean_dispatch(

RuntimeError: pad should be smaller than half of kernel size, but got padW = 3, padH = 3, kW = 4, kH = 4
```

Thanks in advance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50135

Reviewed By: hl475

Differential Revision: D25815337

Pulled By: H-Huang

fbshipit-source-id: 98142296fa6e6849d2e1407d2c1d4e3c2f83076d
2021-01-06 22:21:48 -08:00
11cdb910b4 [fx] Add matrix multiplication fusion pass (#50151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50151

**Summary**
This commit adds a graph transformation pass that merges several matrix
multiplications that use the same RHS operand into one large matrix
multiplication. The LHS operands from all of the smaller matrix multiplications
are concatenated together and used as an input in the large matrix multiply,
and the result is split in order to obtain the same products as the original
set of matrix multiplications.
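
As an illustration of the rewrite (a hand-written sketch of the idea only, not the FX pass itself; `fused_mm` is a hypothetical helper):

```python
import torch

def fused_mm(lhs_list, rhs):
    # Concatenate the LHS operands along the row dimension...
    cat = torch.cat(lhs_list, dim=0)
    # ...perform one large matmul instead of len(lhs_list) small ones...
    big = torch.mm(cat, rhs)
    # ...and split the result back into the per-input products.
    return torch.split(big, [l.shape[0] for l in lhs_list], dim=0)

a, b = torch.randn(2, 4), torch.randn(3, 4)
rhs = torch.randn(4, 5)
assert all(torch.allclose(f, torch.mm(x, rhs))
           for f, x in zip(fused_mm([a, b], rhs), [a, b]))
```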

**Test Plan**
This commit adds a simple unit test with two matrix multiplications that share
the same RHS operand.

`python test/test_fx_experimental.py -k merge_matmul -v`

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25809409

Pulled By: SplitInfinity

fbshipit-source-id: fb55c044a54dea9f07b71aa60d44b7a8f3966ed0
2021-01-06 21:49:37 -08:00
838e73de20 enable alltoall_single torchscript support (#48345)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48345

Test Plan: wait for sandcastle

Differential Revision: D25074475

fbshipit-source-id: 04261f8453567154b0464f8348320e936ca06384
2021-01-06 18:37:00 -08:00
4e2ab2cd73 Move generator state APIs to ATen (#49589)
Summary:
## Rationale

While most of the `torch.Generator` properties and methods are implemented as a thin wrapper of the corresponding `at::Generator` methods, `torch.Generator.get_state()` and `torch.Generator.set_state()` are implemented in legacy Torch code and are not dispatched through the `c10::GeneratorImpl` interface. This is not structured well and makes implementing generators for new backends (e.g. `XLAGeneratorImpl` for the XLA backend) inconvenient. As such, this pull request seeks to move these generator state APIs to c10 and ATen.
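
At the Python level, the behavior these APIs back is the familiar state round-trip (an illustrative sketch, not part of this PR's diff):

```python
import torch

g = torch.Generator()
snapshot = g.get_state()         # ByteTensor snapshot of the RNG state
a = torch.randn(3, generator=g)
g.set_state(snapshot)            # restore the saved state
b = torch.randn(3, generator=g)
assert torch.equal(a, b)         # identical draws after restoring
```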

## What is being refactored?
* Interfaces
  - Added `c10::GeneratorImpl::set_state` and `c10::GeneratorImpl::state` for getting and setting the internal state of a random number generator.
  - `at::Generator::set_state` and `at::Generator::state` wraps the above-mentioned APIs, as it's basically a PIMPL.
  - Added helper function `at::detail::check_rng_state` for checking the validity of new RNG state tensor.
* CPU Generator
  - Renamed and moved `THTensor_(setRNGState)` and `THTensor_(getRNGState)` to `CPUGeneratorImpl::set_state` and `CPUGenerator::state`.
  - Renamed and moved `THGeneratorState` and `THGeneratorStateNew` to `CPUGeneratorStateLegacy` and `CPUGeneratorState`.
* CUDA Generator
  - Renamed and moved `THCRandom_setRNGState` and `THCRandom_getRNGState` to `CUDAGeneratorImpl::set_state` and `CUDAGeneratorImpl::state`.
* PyTorch Bindings
  - `THPGenerator_setState` and `THPGenerator_getState` now simply forward to `at::Generator::set_state` and `at::Generator::state`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49589

Reviewed By: H-Huang

Differential Revision: D25785774

Pulled By: pbelevich

fbshipit-source-id: 8ed79209c4ffb1a0ae8b19952ac8871ac9e0255f
2021-01-06 18:26:56 -08:00
b6b76a1055 Mod lists to neutral+descriptive terms in caffe2/caffe2/opt (#49801)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49801

Per "https://fb.workplace.com/groups/e/permalink/3320810064641820/" we can no longer use the terms "whitelist" and "blacklist", and editing any file containing them results in a critical error signal. Let's embrace the change.
This diff changes "blacklist" to "blocklist" in a number of non-interface contexts (interfaces would require more extensive testing and might interfere with reading stored data, so those are deferred until later).

Test Plan: Sandcastle

Reviewed By: xush6528

Differential Revision: D25686949

fbshipit-source-id: e07de4d228674ae61559719cbe4717f8044778d2
2021-01-06 18:13:42 -08:00
ef1fa547ba [PyTorch] Use expectRef() when calling listConstruct (#50062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50062

Avoids creating an extra shared_ptr.
ghstack-source-id: 119325645

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D25766631

fbshipit-source-id: f2ab8349dfea325054820fa2c1055180c740574e
2021-01-06 18:13:38 -08:00
fa160d18e7 [PyTorch][jit] Add Type::{castRaw,expectRef} (#50061)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50061

These are more efficient than creating an extra `shared_ptr`
when you just want to access the casted value.
ghstack-source-id: 119325644

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D25766630

fbshipit-source-id: 46f11f70333b44714cab708a4850922ab7486793
2021-01-06 18:12:05 -08:00
6838ecefb6 Clean up some type annotations in torch/jit (#49939)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49939

Upgrades type annotations from Python2 to Python3

Test Plan: Sandcastle tests

Reviewed By: xush6528

Differential Revision: D25717573

fbshipit-source-id: 7d5c98fafaa224e0504b73dc69b1e4a6410c0494
2021-01-06 16:39:57 -08:00
e49372d460 Bugfix nightly checkout tool to work on Windows (#49274)
Summary:
I am submitting this PR on behalf of Janne Hellsten (nurpax) from NVIDIA, for the convenience of CLA. Thanks Janne a lot for the contribution!

This fixes a bug when running `./tools/nightly.py checkout -b my-nightly-branch` on Windows. Before this fix, the command failed with the following error:

```
ERROR:root:Fatal exception
Traceback (most recent call last):
  File "./tools/nightly.py", line 166, in logging_manager
    yield root_logger
  File "./tools/nightly.py", line 644, in main
    install(
  File "./tools/nightly.py", line 552, in install
    spdir = _site_packages(pytdir.name, platform)
  File "./tools/nightly.py", line 325, in _site_packages
    os.path.join(pytdir.name, "Lib", "site-packages")
NameError: name 'pytdir' is not defined
log file: d:\pytorch\nightly\log\2020-12-11_16h10m14s_6867a21e-3c0e-11eb-878e-04ed3363a33e\nightly.log
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49274

Reviewed By: H-Huang

Differential Revision: D25808156

Pulled By: malfet

fbshipit-source-id: 00778016366ab771fc3fb152710c7849210640fb
2021-01-06 16:14:51 -08:00
eb8003d8e9 [FX] Remove extraneous newlines at end of code (#50117)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50117

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D25791847

Pulled By: jamesr66a

fbshipit-source-id: 9c0b296e117e6bcf69ed9624ad0b243fa3db0f76
2021-01-06 15:47:37 -08:00
dc41d17655 .circleci: Add option to not run build workflow (#50162)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50162

Adds an option to not run the build workflow when the `run_build`
parameter is set to false

Should reduce the amount of double workflows that are run by
pytorch-probot

Uses functionality introduced in https://github.com/pytorch/pytorch-probot/pull/18

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: yns88

Differential Revision: D25812971

Pulled By: seemethere

fbshipit-source-id: 4832170f6abcabe3f385f47a663d148b0cfe2a28
2021-01-06 15:42:17 -08:00
3270e661c3 [PyTorch Mobile] Skip signature check when converting to typed operator handle (#49469)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49469

In Functions.cpp, there is a call to `typed<...>` that converts to a `TypedOperatorHandle`. This isn't needed at runtime since it's already been exercised during development, and for mobile, there is no possibility of operators or kernels being registered by users (from Python code the way it is possible on server side).
ghstack-source-id: 118714246

Test Plan:
Sandcastle

### App testing results:

FBiOS fails with an error similar to this one: https://fb.workplace.com/groups/2102613013103952/permalink/3815085708523332/

Tested 2 AR effects (green screen and color shift) on IGiOS.

### BSB results:

D25581159-V1 (https://www.internalfb.com/intern/diff/D25581159/?dest_number=118689912)

**fbios: Succeeded**
Change in Download Size for arm64 + 3x assets variation: -7.2 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -27.1 KiB

Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:135971531636706@base/bsb:135971531636706@diff/

D25581159-V1 (https://www.internalfb.com/intern/diff/D25581159/?dest_number=118689912)

**fbios-pika: Succeeded**
Change in Download Size for arm64 + 3x assets variation: -11.0 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -7.4 KiB

Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:430379774665351@base/bsb:430379774665351@diff/

D25581159-V1 (https://www.internalfb.com/intern/diff/D25581159/?dest_number=118689912)

**igios: Succeeded**
Change in Download Size for arm64 + 3x assets variation: -5.3 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -17.3 KiB

Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:685843828784135@base/bsb:685843828784135@diff/

Reviewed By: iseeyuan

Differential Revision: D25581159

fbshipit-source-id: 4a62982829ec42c2d3f58f47f876f2543bc0099b
2021-01-06 14:56:07 -08:00
dde5b6e177 [PyTorch] Reapply D25547962: Make tls_local_dispatch_key_set inlineable (reapply) (#49763)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49763

This was reverted because it landed in a stack together with
D25542799 (9ce1df079f), which really was broken.
ghstack-source-id: 119063016

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D25685959

fbshipit-source-id: 514d8076eac67c760f119cfebc2ae3d0ddcd4e04
2021-01-06 14:41:43 -08:00
eef5eb05bf Remove backward and requires_grad from Autograd backend key (#49613)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49613

Just following a TODO in the code base...
ghstack-source-id: 119450484

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D25644597

fbshipit-source-id: 26f5fa6af480929d0468b0de3ab103813e40d78b
2021-01-06 14:22:58 -08:00
6643e9fbb3 Remove use_c10_dispatcher: full lines (#49259)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49259

Since `use_c10_dispatcher: full` is now the default, we can remove all those pesky lines mentioning it. Only the `use_c10_dispatcher: hacky_wrapper_for_legacy_signatures` lines are left.
ghstack-source-id: 119450485

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D25506526

fbshipit-source-id: 8053618120c0b52ff7c73cacb34bec7eb38f8fe0
2021-01-06 14:22:54 -08:00
249261ada7 Remove generated_unboxing_wrappers and setManuallyBoxedKernel (#49251)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49251

Since all ops are c10-full and use templated unboxing now, we don't need to codegenerate any unboxing logic anymore.
Since this codegen was the only code using setManuallyBoxedKernel, we can also remove that functionality from KernelFunction, OperatorEntry and Dispatcher.
ghstack-source-id: 119450486

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D25502865

fbshipit-source-id: 49d009df159fda4be41bd02457d4427e6e638c10
2021-01-06 14:22:50 -08:00
4a14020c0d Remove .impl_UNBOXED() and functionalities associated with it (#49220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49220

Since all ops are c10-full, we can remove .impl_UNBOXED now.
This also removes the ability of KernelFunction or CppFunction to store unboxedOnly kernels.
ghstack-source-id: 119450489

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D25490225

fbshipit-source-id: 32de9d591e6a842fe18abc82541580647e9cfdad
2021-01-06 14:22:46 -08:00
e4c41b6936 Remove codegen logic to support non-c10-full ops (#49164)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49164

This PR removes the logic paths in codegen that were responsible for handling non-c10-full ops.
This only goes through our basic codegen. It does not simplify C++ code yet and it does not remove the codegen for generated unboxing wrappers yet.
ghstack-source-id: 119450487

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D25462977

fbshipit-source-id: 7e70d14bea96948f5056d98125f3e6ba6bd78285
2021-01-06 14:17:36 -08:00
fcb69d2eba Add android.permission.INTERNET permission to Android test_app. (#49996)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49996

According to section 5.2.1 of Snapdragon Profiler User Guide
(https://developer.qualcomm.com/qfile/30580/snapdragon_profiler_user_guide_reva.pdf)
OpenGL ES, Vulkan, and OpenCL apps must include
android.permission.INTERNET in the app's AndroidManifest.xml to enable
API tracing and GPU metrics.

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Differential Revision: D25809555

Pulled By: AshkanAliabadi

fbshipit-source-id: c4d88a7ea98d9166efbc4157df7d822d99ba0df9
2021-01-06 12:58:28 -08:00
473e78c0fa Remove redundant code for unsupported Python versions (#49486)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49486

Remove code for Python 3.5 and lower.

There's more that can be removed/modernised, but sticking mainly to redundant version checks here, to keep the diff/PR smaller.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46579

Reviewed By: zou3519

Differential Revision: D24453571

Pulled By: ezyang

fbshipit-source-id: c2cfcf05d6c5f65df64d89c331692c9aec09248e
2021-01-06 12:45:46 -08:00
09eb468398 [vulkan] 2D prepacking for conv2d (#48816)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48816

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D25786280

Pulled By: SS-JIA

fbshipit-source-id: b41bf55dcff8f3dfbbf1994171e2ef62f16ff29a
2021-01-06 12:37:51 -08:00
9b519b4a3f [PyTorch Mobile] Generate Kernel dtype selection code in selected_mobile_ops.h during the build (#49279)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49279

Now that the YAML files for tracing based selective build optionally have the information regarding the selected kernel function dtypes, we can start generating constexpr selection code in the include file (`selected_mobile_ops.h`) to make the inclusion of code for specific dtypes selective based on compile time decisions.

The way this is done is that if we detect that the code for a specific dtype should not be in the binary, we add an exception (throw) statement just before the method is called (see the first diff in this stack) and allow the compiler to optimize away the rest of the function's body. This has the advantage of allowing the compiler to know the lambda's return type (since it's inferred from the `return` statements in the body of the method, and if we compile out all the cases, then the compiler won't know the return type and it will result in a compilation error).

The generated `<ATen/selected_mobile_ops.h>` is being used (included) in `Dispatch.h`. In case `XPLAT_MOBILE_BUILD` is not defined, then we should include code for all kernel dtypes (non-selective build).

When merging, we need to handle the case of both older and newer (tracing based) operator lists. If we detect any operator that includes all overloads, it indicates that an old style operator list is part of the build, and we need to `include_all_kernel_dtypes` for this build.
ghstack-source-id: 119439497

Test Plan:
For Segmentation v220, here is one of the intermediate generated YAML files (selected_operators.yaml): {P154480509}
and here is the generated `selected_mobile_ops.h` file: {P159808798}

Here is the `selected_mobile_ops.h` file for lite_predictor (which includes all ops and all dtypes): {P159806443}

Continuous build for ~8 checked-in models validates that the selection code works as expected when we build based on dtype selection.

Reviewed By: iseeyuan

Differential Revision: D25388949

fbshipit-source-id: 1c182a4831a7f94f7b152f02dbd3bc01c0d22443
2021-01-06 12:17:32 -08:00
ba691e1a42 Remove incorrect links to zdevito/ATen (#50065)
Summary:
Similar to https://github.com/pytorch/pytorch/issues/49028, this PR removes a few more references to https://github.com/zdevito/ATen.

- The links for Functions.h, Tensor.h, and Type.h are simply broken, probably because they refer to `master` rather than a specific commit (cf. https://github.com/pytorch/pytorch/issues/47066)
- I'm unsure about the change to the `about` section of `aten/conda/meta.yaml`; can someone comment on whether I am understanding that field correctly?
- The reference to https://github.com/zdevito/ATen/issues/163 remains [in `tools/autograd/derivatives.yaml`](cd608fe59b/tools/autograd/derivatives.yaml (L91)), because the contents of that issue discussion don't seem to be mirrored anywhere else.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50065

Reviewed By: ezyang, walterddr

Differential Revision: D25767353

Pulled By: samestep

fbshipit-source-id: 265f46f058bc54ef6d1a77f112cdfa1f115b3247
2021-01-06 11:49:26 -08:00
6eee2a0a9f [JIT] disable masked fill (#50147)
Summary:
There is an internal user who is experiencing a bug with masked_fill. While I am almost certain this corresponds to an old pytorch version with the bug, the model that is breaking is important and time-sensitive and we are covering all bases to try to get it to work again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50147

Reviewed By: nhsoukai

Differential Revision: D25806541

Pulled By: eellison

fbshipit-source-id: 131bd71b5db9717a8a9cb97973d0b4f0e96455d6
2021-01-06 11:36:30 -08:00
3ce539881a Back out "Revert D25757721: [pytorch][PR] Run mypy on more test files" (#50142)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50142

Original commit changeset: 58437d719285

Test Plan: OSS CI

Reviewed By: walterddr, ngimel

Differential Revision: D25803866

fbshipit-source-id: d6b83a5211e430c0451994391876103f1ad96315
2021-01-06 11:27:36 -08:00
638086950d Clean up type annotations in torch/nn/quantized/modules (#49941)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49941

Test Plan: Sandcastle

Reviewed By: jerryzh168

Differential Revision: D25718715

fbshipit-source-id: bbe450d937cf7ef634e003c09146e308180d1d58
2021-01-06 11:03:08 -08:00
7d9eb6c680 Implementation of torch::cuda::synchronize (#50072)
Summary:
Adding `torch::cuda::synchronize()` to libtorch. Note that the implementation here adds a new method to the `CUDAHooksInterface`. An alternative that was suggested to me is to add a method to the `DeviceGuard` interface.

Fixes https://github.com/pytorch/pytorch/issues/47722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50072

Reviewed By: H-Huang

Differential Revision: D25804342

Pulled By: jbschlosser

fbshipit-source-id: 45aa61d7c6fbfd3178caf2eb5ec053d6c01b5a43
2021-01-06 10:53:39 -08:00
e606e60331 [Needs Review] Convert some files to Python3 (#49351)
Summary:
Uses the Python standard library 2to3 script to convert a number of Python 2 files to Python 3. This facilitates code maintenance such as dropping unused imports in D25500422.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49351

Test Plan: Standard sandcastle tests

Reviewed By: xush6528

Differential Revision: D25499576

fbshipit-source-id: 0c44718ac734771ce0758b1cb30676cc3d76ac10
2021-01-06 10:48:16 -08:00
efe0533a24 Clean up some type annotations in torch/testing/_internal (#50078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50078

Upgrades type annotations from Python2 to Python3

Test Plan: Sandcastle tests

Reviewed By: pritamdamania87

Differential Revision: D25717560

fbshipit-source-id: cec631f3121ef9ab87ff8b3b00f1fae6df9a2155
2021-01-06 10:41:22 -08:00
74c055b240 Fix mypy type hint for AdaptiveAvgPool2,3d, AdaptiveMaxPool2,3d (#49963)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49963

Reviewed By: mrshenli, heitorschueroff

Differential Revision: D25760110

Pulled By: ezyang

fbshipit-source-id: aeb655b784689544000ea3b948f7d6d025aee441
2021-01-06 09:47:15 -08:00
68a6e46379 Push anonymous namespace into codegen, not template (#49498)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49498

In the near future, I want to code generate some functions that are
visible externally to this compilation unit.  I cannot easily do this
if all the codegen code is wrapped in a global anonymous namespace,
so push the namespace in.

Registration has to stay in an anonymous namespace to avoid name conflicts.
This could also have been solved by making the wrapper functions have
more unique names but I didn't do this in the end.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD, smessmer

Differential Revision: D25616104

Pulled By: ezyang

fbshipit-source-id: 323c0dda05a081502aab702f359a08dfac8c41a4
2021-01-06 08:44:49 -08:00
480a756194 [PyTorch] IValue::toTensor can now return const Tensor& (#48868)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48868

Building on the previous diff, we can make `toTensor()` return a
`const Tensor&`, which should make it easier to avoid reference
counting.
ghstack-source-id: 119327372

Test Plan: internal benchmarks.

Reviewed By: bwasti

Differential Revision: D25325379

fbshipit-source-id: ca699632901691bcee432f595f75b0a4416d55dd
2021-01-06 08:40:50 -08:00
1b31e13539 [PyTorch] Store Tensor explicitly in IValue (#48824)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48824

Enables following diff, which will make toTensor() return
`const Tensor&` and allow callers to avoid refcounting overhead.
ghstack-source-id: 119327370

Test Plan:
ivalue_test

Internal benchmark to ensure perf parity. Some interesting steps
during the debugging process:

- First version was about a 5% regression
- Directly implementing move construction instead of using swap
  lowered the regression to 2-3%
- Directly implementing move assign was maybe an 0.5% improvement
- Adding C10_ALWAYS_INLINE on move assign got our regression to
  negligible
- Fixing toTensor() to actually be correct regressed us again, but
  omitting the explicit dtor call as exhaustively spelled out in a
  comment fixed it.

Reviewed By: bwasti

Differential Revision: D25324617

fbshipit-source-id: 7518c1c67f6f2661f151b43310aaddf4fb6e511a
2021-01-06 08:40:47 -08:00
688992c775 [PyTorch] Additional IValue tests (#49718)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49718

Improving test coverage in preparation for updating the
implementation of IValue.
ghstack-source-id: 119327373

Test Plan: ivalue_test

Reviewed By: hlu1

Differential Revision: D25674605

fbshipit-source-id: 37a82bb135f75ec52d2d8bd929c4329e8dcc4d25
2021-01-06 08:35:42 -08:00
5f2ec6293d Unused variables in neural net classes and functions (#50100)
Summary:
These unused variables were identified by [pyflakes](https://pypi.org/project/pyflakes/). They can be safely removed to simplify the code and possibly improve performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50100

Reviewed By: ezyang

Differential Revision: D25797764

Pulled By: smessmer

fbshipit-source-id: ced341aee692f429d2dcc3a4ef5c46c8ee99cabb
2021-01-06 08:16:57 -08:00
c517e15d79 Add support for converting sparse bool tensors to dense (#50019)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49977
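
A minimal repro of what this enables (a sketch assuming the usual sparse COO construction):

```python
import torch

i = torch.tensor([[0, 1]])
v = torch.tensor([True, False])
s = torch.sparse_coo_tensor(i, v, (3,))
print(s.to_dense())  # previously raised for bool; now tensor([True, False, False])
```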

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50019

Reviewed By: smessmer

Differential Revision: D25782045

Pulled By: ezyang

fbshipit-source-id: a8389cbecb7e79099292a423a6fd8ac28631905b
2021-01-06 07:38:14 -08:00
2ac180a5dd Fix cl.exe detection in cpu/fused_kernel.cpp (#50085)
Summary:
The command used here is essentially `where cl.exe`. By using `system()` we will not be able to find cl.exe unless we are using a VS Developer Prompt, which makes `activate()` meaningless. Changing `system()` to `run()` fixes this.

Found during https://github.com/pytorch/pytorch/issues/49781.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50085

Reviewed By: smessmer

Differential Revision: D25782054

Pulled By: ezyang

fbshipit-source-id: e8e3cac903a73f3bd78def667ebe0e93201814c8
2021-01-06 07:16:41 -08:00
45ec35827e Set USE_RCCL cmake option (dependent on USE_NCCL) [REDUX] (#34683)
Summary:
Refiled duplicate of https://github.com/pytorch/pytorch/issues/31341 which was reverted in commit 63964175b52197a75e03b73c59bd2573df66b398.

This PR enables RCCL support when building Gloo as part of PyTorch for ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/34683

Reviewed By: glaringlee

Differential Revision: D25540578

Pulled By: ezyang

fbshipit-source-id: fcb02e5745d62e1b7d2e02048160e9e7a4b4df2d
2021-01-06 07:03:02 -08:00
0ad6f06684 drop an unneeded comma from CMakeLists.txt (#50091)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50091

Reviewed By: smessmer

Differential Revision: D25782083

Pulled By: ezyang

fbshipit-source-id: f90f57c6c9fc0c1e68ab30dd3b56dfe971798df2
2021-01-06 06:53:45 -08:00
ad7d208ba5 Revert D25239967: [fx] Add matrix multiplication fusion pass
Test Plan: revert-hammer

Differential Revision:
D25239967 (9b7f3fa146)

Original commit changeset: fb99ad25b7d8

fbshipit-source-id: 370167b5ade8bf2b3a6cccdf4290ea07b8347c79
2021-01-05 23:22:26 -08:00
282552dde2 [PyTorch] Reapply D25546409: Use .sizes() instead of .size() in cat_serial_kernel_impl (#49762)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49762

This was reverted because it landed in a stack together with
D25542799 (9ce1df079f), which really was broken.
ghstack-source-id: 119326870

Test Plan: CI

Reviewed By: maratsubkhankulov

Differential Revision: D25685905

fbshipit-source-id: f4ec9e114993f988d4af380677331c72dfe41c44
2021-01-05 22:59:22 -08:00
57d489e43a Fix for possible RNG offset calculation bug in cuda vectorized dropout with VEC=2 (#50110)
Summary:
The [offset calculation](e3c56ddde6/aten/src/ATen/native/cuda/Dropout.cu (L328)) (which gives an estimated ceiling on the maximum number of 32-bit values in the philox sequence that any thread in the launch will use) uses the hardcoded UNROLL value of 4, and assumes the hungriest threads can use every value (.x, .y, .z, and .w) their curand_uniform4 calls provide. However, the way fused_dropout_kernel_vec is currently written, that assumption isn't true in the VEC=2 case: on each iteration of the `grid x VEC` stride loop, each thread calls curand_uniform4 once, uses rand.x and rand.y, and discards rand.z and rand.w. This means (I _think_) curand_uniform4 may be called twice as many times per thread in the VEC=2 case as in the VEC=4 case or the fully unrolled code path, which means the offset calculation (which is a good estimate for the latter two cases) is probably wrong for the `fused_dropout_kernel_vec<..., /*VEC=*/2>` code path.

The present PR inserts some value-reuse in fused_dropout_kernel_vec to align the number of times curand_uniform4 is called for launches with the same totalElements in the VEC=2 and VEC=4 cases.  The diff should
- make the offset calculation valid for all code paths
- provide a very small perf boost by reducing the number of curand_uniform4 calls in the VEC=2 path
- ~~make results bitwise accurate for all code paths~~ nvm, tensor elements are assigned to threads differently in the unrolled, VEC 2 and VEC 4 cases, so we're screwed here no matter what.

ngimel what do you think?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50110

Reviewed By: smessmer

Differential Revision: D25790121

Pulled By: ngimel

fbshipit-source-id: f8f533ad997268c6f323cf4d225de547144247a8
2021-01-05 22:36:05 -08:00
f6f0fde841 [reland][quant][graphmode][fx] Standalone module support {input/output}_quantized_idxs (#49754) (#50058)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50058

This PR adds the support for {input/output}_quantized_idxs for standalone module.

If input_quantized_idxs = [] and output_quantized_idxs = [], the standalone module will expect float
input and produce float output, and will quantize the input and dequantize the output internally.

If input_quantized_idxs = [0] and output_quantized_idxs = [0], the standalone module will expect quantized
input and produce quantized output; the input will be quantized in the parent module, and the output will be
dequantized in the parent module as well. This is similar to current quantized modules like nn.quantized.Conv2d.

For more details, please see the test case

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_standalone_module

Imported from OSS

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D25768910

fbshipit-source-id: 96c21a3456cf192c8f1400afa4e86273ee69197b
2021-01-05 20:27:46 -08:00
05358332b3 Fix mypy typing check for test_dataset (#50108)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50108

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D25789184

Pulled By: ejguan

fbshipit-source-id: 0eeeeeda62533e7137d56f313b7bf11406b32611
2021-01-05 19:57:22 -08:00
def8aa5499 Remove cpu half and dead code from multinomial (#50063)
Summary:
Based on ngimel's (Thank you!) feedback, cpu half was only accidental, so I'm removing it.

This lets us ditch the old codepath for sampling without replacement in favour of the new, better one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50063

Reviewed By: mruberry

Differential Revision: D25772449

Pulled By: ngimel

fbshipit-source-id: 608729c32237de4ee6d1acf7e316a6e878dac7f0
2021-01-05 19:46:33 -08:00
9b7f3fa146 [fx] Add matrix multiplication fusion pass (#50120)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50120

This commit adds a graph transformation pass that merges several matrix
multiplications that use the same RHS operand into one large matrix
multiplication. The LHS operands from all of the smaller matrix multiplications
are concatenated together and used as an input in the large matrix multiply,
and the result is split in order to obtain the same products as the original
set of matrix multiplications.

Test Plan:
This commit adds a simple unit test with two matrix multiplications that share
the same RHS operand.

`buck test //caffe2/test:fx_experimental`

Reviewed By: jamesr66a

Differential Revision: D25239967

fbshipit-source-id: fb99ad25b7d83ff876da6d19dc4abd112d13001e
2021-01-05 19:37:08 -08:00
d80d38cf87 Clean up type annotations in caffe2/torch/nn/modules (#49957)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49957

Test Plan: Sandcastle

Reviewed By: xush6528

Differential Revision: D25729745

fbshipit-source-id: 85810e2c18ca6856480bef81217da1359b63d8a3
2021-01-05 19:08:40 -08:00
75028f28e1 [PyTorch] Reapply D25545777: Use .sizes() instead of .size() in _cat_out_cpu (#49761)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49761

This was reverted because it landed in a stack together with
D25542799 (9ce1df079f), which really was broken.
ghstack-source-id: 119361027

Test Plan: CI

Reviewed By: bwasti

Differential Revision: D25685855

fbshipit-source-id: b51f67ebe667199d15bfc6f8f131a6f1ab1b0352
2021-01-05 19:04:23 -08:00
574a15b6cc [PyTorch] Reapply D25544731: Avoid extra Tensor refcounting in _cat_out_cpu (#49760)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49760

This was reverted because it landed in a stack together with
D25542799 (9ce1df079f), which really was broken.
ghstack-source-id: 119361028

Test Plan: CI

Reviewed By: bwasti

Differential Revision: D25685789

fbshipit-source-id: 41e5abb4ff30acaa6f33f9c806acd652a6dd9646
2021-01-05 18:59:20 -08:00
5f875965c6 Fix doc for vmap levels (#50099)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50099

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D25783257

Pulled By: ejguan

fbshipit-source-id: 7d2c7614f87e1c8adc8aefe3fe312b6c98ff6788
2021-01-05 18:48:32 -08:00
70734f1260 Kill AT_SKIP_BFLOAT16_IF_NOT_ROCM (#48810)
Summary:
Dependency:
https://github.com/pytorch/pytorch/pull/48809 https://github.com/pytorch/pytorch/pull/48807 https://github.com/pytorch/pytorch/pull/48806 https://github.com/pytorch/pytorch/pull/48805 https://github.com/pytorch/pytorch/pull/48801 https://github.com/pytorch/pytorch/pull/44994 https://github.com/pytorch/pytorch/pull/44848

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48810

Reviewed By: mruberry

Differential Revision: D25772955

Pulled By: ngimel

fbshipit-source-id: 353f130eb701f8b338a826d2edaea69e6e644ee9
2021-01-05 18:10:23 -08:00
26391143b6 Support out argument in torch.fft ops (#49335)
Summary:
Ref https://github.com/pytorch/pytorch/issues/42175

This adds out argument support to all functions in the `torch.fft` namespace except for `fftshift` and `ifftshift` because they rely on `at::roll` which doesn't have an out argument version.

Note that there's no general way to do the transforms directly into the output since both cufft and mkl-fft only support single batch dimensions. At a minimum, the output may need to be re-strided which I don't think is expected from `out` arguments normally. So, on cpu this just copies the result into the out tensor. On cuda, the normalization is changed to call `at::mul_out` instead of an inplace multiply.

If it's desirable, I could add a special case to transform into the output when `out.numel() == 0` since there's no expectation to preserve the strides in that case anyway. But that would lead to the slightly odd situation where `out` having the correct shape follows a different code path from `out.resize_(0)`.
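
Usage follows the standard `out=` pattern (a sketch; as described above, the result is computed and then copied into `out` on CPU):

```python
import torch

x = torch.randn(8)
out = torch.empty(8, dtype=torch.complex64)
torch.fft.fft(x, out=out)  # fills `out` instead of allocating a new tensor
```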

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49335

Reviewed By: mrshenli

Differential Revision: D25756635

Pulled By: mruberry

fbshipit-source-id: d29843f024942443c8857139a2abdde09affd7d6
2021-01-05 17:17:49 -08:00
5d93e2b818 torch.flip and torch.flip{lr, ud}: Half support for CPU and BFloat16 support for CPU & CUDA (#49895)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49889

Also adds BFloat16 support for CPU and CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49895

Reviewed By: mrshenli

Differential Revision: D25746272

Pulled By: mruberry

fbshipit-source-id: 0b6a9bc13ae60c22729a0aea002ed857c36f14ff
2021-01-05 16:51:49 -08:00
d1c375f071 fix fork formatting (#49436)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49436

Test Plan: Imported from OSS

Reviewed By: tugsbayasgalan

Differential Revision: D25788166

Pulled By: eellison

fbshipit-source-id: e368b473ad64a1168be01fc674625415a07ff31c
2021-01-05 16:38:34 -08:00
7fe25af59d Revert D25746115: [pytorch][PR] Improve documentation and warning message for creation of a tensor with from_numpy()
Test Plan: revert-hammer

Differential Revision:
D25746115 (4a6c178f73)

Original commit changeset: 3e534a8f2bc1

fbshipit-source-id: 12c921cf2d062794ce45afcaed1fbedc28dcdd01
2021-01-05 16:21:26 -08:00
dcc83868c5 [PyTorch Mobile] Mark xnnpack operators selective
Summary: Mark the remaining operator registrations that are not yet selective. The size change is -12.2 KB for igios and -14 KB for fbios.

Test Plan: CI

Reviewed By: dhruvbird

Differential Revision: D25742543

fbshipit-source-id: 3e58789d36d216a52340c00b53e2f783ea2c9414
2021-01-05 15:53:01 -08:00
5e1c8f24d4 Make stft (temporarily) warn (#50102)
Summary:
When continuing the deprecation process for stft, it was made to throw an error when `use_complex` was not explicitly set by the user. Unfortunately that change missed a model relying on the historic stft functionality. Before re-enabling the error we'll need to write an upgrader for that model.

This PR turns the error back into a warning to allow that model to continue running as before.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50102

Reviewed By: ngimel

Differential Revision: D25784325

Pulled By: mruberry

fbshipit-source-id: 825fb38af39b423ce11b376ad3c4a8b21c410b95
2021-01-05 15:39:00 -08:00
4a6c178f73 Improve documentation and warning message for creation of a tensor with from_numpy() (#49516)
Summary:
Implements the very simple changes suggested in the short discussion of the issue. Updated the documentation to inform users that creating a tensor from a memory-mapped read-only NumPy array will probably crash the program. The displayed warning message was also updated to include information about the issues with using a memory-mapped read-only NumPy array. Closes https://github.com/pytorch/pytorch/issues/46741.
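
A sketch of the scenario the updated docs warn about (`data.npy` is a hypothetical file):

```python
import numpy as np
import torch

arr = np.load("data.npy", mmap_mode="r")  # read-only memory map
t = torch.from_numpy(arr)  # warns that the array is not writable
t += 1                     # writing through the tensor may crash the program
```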

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49516

Reviewed By: mrshenli

Differential Revision: D25746115

Pulled By: mruberry

fbshipit-source-id: 3e534a8f2bc1f083a2835440d324bd6f30798ad4
2021-01-05 15:25:15 -08:00
9529ae3776 Revert D25757721: [pytorch][PR] Run mypy on more test files
Test Plan: revert-hammer

Differential Revision:
D25757721 (b7bfc723d3)

Original commit changeset: 44c396d8da9e

fbshipit-source-id: 58437d719285a4fecd8c05e487cc86fc2cebadff
2021-01-05 15:18:14 -08:00
d1a56fcd9d [docs] add docstring in torch.cuda.get_device_properties (#49792)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49737

Added a docstring to `torch.cuda.get_device_properties`.
Added a `Returns` section to `torch.cuda.get_device_name`.
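
For reference, the documented call returns a device-properties object (illustrative only):

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, props.major, props.minor, props.total_memory)
```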

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49792

Reviewed By: mruberry

Differential Revision: D25784046

Pulled By: ngimel

fbshipit-source-id: f88da02147f92c889398957fcaf22961d3bb1062
2021-01-05 14:51:07 -08:00
abe1fa49e9 [JIT] Add __prepare_scriptable__ duck typing to allow replacing nn.modules with scriptable preparations (#45645) (#49242)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49242

Fixes https://github.com/pytorch/pytorch/issues/45072

As discussed with zdevito gchanan cpuhrsch and suo, this change allows developers to create custom preparations for their modules before scripting. This is done by adding a `__prepare_scriptable__` method to a module which returns the prepared scriptable module out-of-place. It does not expand the API surface for end users.
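
A minimal sketch of the duck-typed hook (`Wrapper` and `ScriptFriendly` are hypothetical):

```python
import torch

class ScriptFriendly(torch.nn.Module):
    def forward(self, x):
        return x + 1

class Wrapper(torch.nn.Module):
    def forward(self, x):
        return x + 1

    def __prepare_scriptable__(self):
        # Return an out-of-place, script-friendly stand-in for this module.
        return ScriptFriendly()

scripted = torch.jit.script(Wrapper())  # scripts ScriptFriendly, not Wrapper
```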

Prior art by jamesr66a: https://github.com/pytorch/pytorch/pull/42244

Test Plan: Imported from OSS

Reviewed By: dongreenberg

Differential Revision: D25500303

fbshipit-source-id: d3ec9005de27d8882fc29d02f0d08acd2a5c6b2c
2021-01-05 14:18:15 -08:00
e71a13e8a3 [pytorch][codegen] migrate gen_variable_type to new data model (#49735)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49735

This is the final wave of autograd codegen data model migration.

After this PR:
- autograd codegen no longer depends on Declarations.yaml;
- autograd codegen sources are fully type annotated and pass mypy-strict check;

To avoid potential merge conflicts with other pending PRs, some structural
changes are intentionally avoided, e.g. didn't move inner methods out, didn't
change all inner methods to avoid reading outer functions' variables, etc.

Confirmed byte-for-byte compatible with the old codegen:
```
Run it before and after this PR:
  .jenkins/pytorch/codegen-test.sh <baseline_output_dir>
  .jenkins/pytorch/codegen-test.sh <test_output_dir>

Then run diff to compare the generated files:
  diff -Naur <baseline_output_dir> <test_output_dir>
```

Confirmed clean mypy-strict run:
```
mypy --config mypy-strict.ini
```

Test Plan: Imported from OSS

Reviewed By: ezyang, bhosmer

Differential Revision: D25678879

Pulled By: ljk53

fbshipit-source-id: ba6e2eb6b9fb744208f7f79a922d933fcc3bde9f
2021-01-05 14:12:39 -08:00
a272a7eeab [PyTorch] Avoid heap allocations in inferUnsqueezeGeometry (#49497)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49497

Noticed in a perf profile that this function spends most of its time in
malloc. Optimize for typical tensor sizes.
ghstack-source-id: 119318388

Test Plan:
perf profile internal benchmark; saw inferUnsqueezeGeometry
go from 0.30% exclusive 0.47% inclusive to 0.11% exclusive 0.16%
inclusive.

Differential Revision: D25596549

fbshipit-source-id: 3bbd2031645a4b9fe6f49a77d41db46826d0f632
2021-01-05 14:06:03 -08:00
093aca082e Enable distribution validation if __debug__ (#48743)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47123
Follows https://github.com/pyro-ppl/pyro/pull/2701

This turns on `Distribution` validation by default. The motivation is to favor beginners by providing helpful error messages. Advanced users focused on speed can disable validation by calling
```py
torch.distributions.Distribution.set_default_validate_args(False)
```
or by disabling individual distribution validation via `MyDistribution(..., validate_args=False)`.

In practice I have found many beginners forget or do not know about validation. Therefore I have [enabled it by default](https://github.com/pyro-ppl/pyro/pull/2701) in Pyro. I believe PyTorch could also benefit from this change. Indeed validation caught a number of bugs in `.icdf()` methods, in tests, and in PPL benchmarks, all of which have been fixed in this PR.

## Release concerns
- This may slightly slow down some models. Concerned users may disable validation.
- This may cause new `ValueErrors` in models that rely on unsupported behavior, e.g. `Categorical.log_prob()` applied to continuous-valued tensors (only {0,1}-valued tensors are supported).

We should clearly note this change in release notes.
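
For example, the `Categorical.log_prob()` case mentioned above now fails fast under default validation (a minimal sketch; the exact error text may differ):

```python
import torch

d = torch.distributions.Categorical(probs=torch.tensor([0.5, 0.5]))
# A continuous value is outside the distribution's integer support,
# so this now raises ValueError instead of returning a meaningless result.
d.log_prob(torch.tensor(0.5))
```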

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48743

Reviewed By: heitorschueroff

Differential Revision: D25304247

Pulled By: neerajprad

fbshipit-source-id: 8d50f28441321ae691f848c55f71aa80cb356b41
2021-01-05 13:59:10 -08:00
e3c56ddde6 Revert D25757691: [pytorch][PR] Run mypy over test/test_utils.py
Test Plan: revert-hammer

Differential Revision:
D25757691 (c86cfcd81d)

Original commit changeset: 145ce3ae532c

fbshipit-source-id: 3dfd68f0c42fc074cde15c6213a630b16e9d8879
2021-01-05 13:40:13 -08:00
e442ac1e3f Update MultiHeadAttention docstring (#49950)
Summary:
Fixes MultiHeadAttention docstring.

Currently, https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html#torch.nn.MultiheadAttention
is

[screenshot: current (broken) rendering of the MultiheadAttention docstring]

and with the fix will be

[screenshot: fixed rendering of the MultiheadAttention docstring]

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49950

Reviewed By: mrshenli

Differential Revision: D25732573

Pulled By: zhangguanheng66

fbshipit-source-id: b362f3f617ab26b0dd25c3a0a7d4117e522e620c
2021-01-05 13:31:48 -08:00
9945fd7253 Drop unused imports from caffe2/python (#49980)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49980

From
```
./python/libcst/libcst codemod remove_unused_imports.RemoveUnusedImportsWithGlean --no-format caffe2/
```

Test Plan: Standard sandcastle tests

Reviewed By: xush6528

Differential Revision: D25727359

fbshipit-source-id: c4f60005b10546423dc093d31d46deb418352286
2021-01-05 13:17:46 -08:00
eee849be8c [caffe2][a10] Move down pragma pop to properly suppress warning 4522 (#49233)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49233

As the comments on line 160 say, we should suppress this overly aggressive warning with MSVC:
```
caffe2\tensorbody.h_ovrsource#header-mode-symlink-tree-only,headers\aten\core\tensorbody.h(1223): warning C4522: 'at::Tensor': multiple assignment operators specified
```

However, in order to remove the warning, the closing brace of the class must be between the`#pragma warning` push and its corresponding pop. Move the pop down to ensure that.

Test Plan: Built locally using clang for Windows without buck cache, confirmed the warning resolved

Reviewed By: bhosmer

Differential Revision: D25422447

fbshipit-source-id: c1e1c66fb8513af5f9d4e3c1dc48d0070c4a1f84
2021-01-05 13:13:22 -08:00
16e5af41da Fix store based barrier to only use 'add'. (#49930)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49930

Certain store implementations don't work well when we use get() and
add() on the same key. To avoid this issue, we only use add() in the store
based barrier. The buggy store implementations can't be properly fixed due to
legacy reasons.

Test Plan:
1) unit tests.
2) waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D25725386

fbshipit-source-id: 1535e2629914de7f78847b730f8764f92cde67e7
2021-01-05 12:46:24 -08:00
12ee7b61e7 support building with conda installed libraries (#50080)
Summary:
This should fix a bunch of shared-library compilation errors when libraries are installed in conda's lib or lib64 folders.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50080

Reviewed By: seemethere

Differential Revision: D25781923

Pulled By: walterddr

fbshipit-source-id: 78a74925981d65243b98bb99a65f1f2766e87a2f
2021-01-05 12:32:51 -08:00
e868825eb6 [RPC] Relax some profiling tests (#49983)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49983

We have observed very rare flakiness in some profiling tests recently. However, we were not able to reproduce these even with thousands of
runs on the CI machines where the failure was originally reported. As a result,
relaxing these tests and re-enabling them to reduce failure rates.
ghstack-source-id: 119352019

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D25739416

fbshipit-source-id: 4dbb6b30f20d3af94ba39f4a7ccf4fb055e440bc
2021-01-05 11:47:32 -08:00
c115957df0 [distributed] Provide parameter to pass GPU ID in barrier function (#49069)
Summary:
On a multi-GPU node, the rank and its corresponding GPU mapping can differ.
Provide an optional parameter to specify the GPU device number for the
allreduce operation in the barrier function.

Add test cases to validate barrier device_ids.
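
Intended usage looks like the following sketch (assumes an initialized process group, one process per GPU, and `local_rank` as this process's GPU index):

```python
import torch.distributed as dist

local_rank = 0  # assumed: provided by the launcher in real use
# Pin the barrier's allreduce to this process's device.
dist.barrier(device_ids=[local_rank])
```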

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Fixes https://github.com/pytorch/pytorch/issues/48110

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49069

Reviewed By: mrshenli

Differential Revision: D25658528

Pulled By: rohan-varma

fbshipit-source-id: 418198b6224c8c1fd95993b80c072a8ff8f02eec
2021-01-05 11:27:54 -08:00
3cd2f1f3a7 Add an option to disable aten::cat in TE (re-revert) (#50101)
Summary:
This reverts commit ace78ddb6a2bdbf03f08c69767eba57306dd69ed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50101

Reviewed By: eellison

Differential Revision: D25784785

Pulled By: Krovatkin

fbshipit-source-id: cbb3d377e03303f6c8c71f4c59c6d90ab40d55f7
2021-01-05 11:08:11 -08:00
bbae6774c1 [JIT] Remove buffer metadata serialization forward-compat gate (#49990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49990

**Summary**
This commit removes the forward-compatibility gate for buffer metadata
serialization. It was introduced to allow versions of fbcode
binaries statically linked against older versions of PyTorch (without
buffer metadata in JIT) to deserialize archives produced by new versions
of PyTorch. Enough time has probably passed that these old binaries
don't exist anymore, so it should be safe to remove the gate.

**Test Plan**
Internal tests.

Test Plan: Imported from OSS

Reviewed By: xw285cornell

Differential Revision: D25743199

Pulled By: SplitInfinity

fbshipit-source-id: 58d82ab4362270b309956826e36c8bf9d620f081
2021-01-05 11:03:28 -08:00
04e86be1a2 eager quant: fix error with removing forward hooks (#49813)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49813

https://github.com/pytorch/pytorch/issues/49739 reports a crash
where removing forward hooks results in a

```
RuntimeError: OrderedDict mutated during iteration
```

Unfortunately I cannot repro this inside the PyTorch module, but the issue
author has a good point, and we should not mutate the dict while iterating
over it.
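
The general fix pattern is to snapshot the keys before mutating (a sketch; `module` and `should_remove` are hypothetical):

```python
# Collect the hook keys to drop first...
to_remove = [k for k, hook in module._forward_hooks.items() if should_remove(hook)]
# ...then mutate the OrderedDict outside of the iteration.
for k in to_remove:
    del module._forward_hooks[k]
```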

Test Plan:
```
// test plan from https://github.com/pytorch/pytorch/pull/46871 which
// originally added this
python test/test_quantization.py TestEagerModeQATOps
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25698725

fbshipit-source-id: 13069d0d5017a84038c8f7be439a3ed537938ac6
2021-01-05 11:00:20 -08:00
113b7623d6 quant: throw a nice error message for allclose with quantized inputs (#49802)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49802

Currently `torch.allclose` is not supported with quantized inputs.
Throw a nice error message instead of a cryptic one.

Test Plan:
```
torch.allclose(x_fp32, y_fp32)

torch.allclose(x_int8, y_int8)
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D25693538

fbshipit-source-id: 8958628433adfca3ae6ce215f3e3ec3c5e29994c
2021-01-05 10:55:34 -08:00
44c17b28c6 quant: nice error message on convtranspose with per-channel weight (#49899)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49899

Per-channel weight observers in conv transpose are not supported yet. Adding an
error message that fails instantly instead of making the user wait until after
calibration/training finishes.

Test Plan:
```
python test/test_quantization.py TestPostTrainingStatic.test_convtranspose_per_channel_fails_early
python test/test_quantization.py TestQuantizeFx.test_convtranspose_per_channel_fails_early
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25717151

fbshipit-source-id: 093e5979030ec185e3e0d56c45d7ce7338bf94b6
2021-01-05 09:38:57 -08:00
72306378b4 quant: ensure observers do not crash for empty Tensors (#49800)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49800

Ensures that having a Tensor with 0 elements does not crash observers.
Note: it's illegal to pass Tensors with 0 elements to reductions such
as min and max, so we gate this out before the logic hits min/max.

This should not be hit often in practice, but it's coming up
during debugging of some RCNN models with test inputs.
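
A minimal sketch of the guarded case:

```python
import torch
from torch.quantization import MinMaxObserver

obs = MinMaxObserver()
obs(torch.empty(0))  # zero-element input is skipped instead of crashing in min/max
```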

Test Plan:
```
python test/test_quantization.py TestObserver.test_zero_numel
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25693230

fbshipit-source-id: d737559697c98bd923356edacba895835060bb38
2021-01-05 09:35:47 -08:00
c86cfcd81d Run mypy over test/test_utils.py (#49654)
Summary:
This caught one incorrect annotation in `cpp_extension.load`.

xref gh-16574.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49654

Reviewed By: heitorschueroff

Differential Revision: D25757691

Pulled By: ezyang

fbshipit-source-id: 145ce3ae532cc585d9ca3bbd5381401bad0072e2
2021-01-05 09:32:06 -08:00
b7bfc723d3 Run mypy on more test files (#49658)
Summary:
Improves one annotation for `augment_model_with_bundled_inputs`

Also add a comment not to work on caffe2 type annotations; that's not worth the effort, and those ignores can stay as they are.

xref gh-16574

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49658

Reviewed By: heitorschueroff

Differential Revision: D25757721

Pulled By: ezyang

fbshipit-source-id: 44c396d8da9ef3f41b97f9c46a528f0431c4b463
2021-01-05 09:28:38 -08:00
e35b822d7d fixes indices computation for trilinear interpolate backwards (#50084)
Summary:
https://github.com/pytorch/pytorch/issues/48675 had some typos in the indices computations, so results for trilinear interpolation where height is not equal to width were wrong. This PR fixes it.
cc xwang233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50084

Reviewed By: BIT-silence

Differential Revision: D25777083

Pulled By: ngimel

fbshipit-source-id: 71be545628735fe875b7ea30bf6a09df4f2fae5c
2021-01-05 09:20:59 -08:00
52933b9923 Patch death tests/fork use after D25292667 (part 3)
Summary: (Note: this ignores all push blocking failures!)

Test Plan: unit tests

Differential Revision: D25775357

fbshipit-source-id: 0ae3c59181bc123d763ed9c0d05c536998ae5ca0
2021-01-05 09:07:49 -08:00
ace78ddb6a Revert D25763758: [pytorch][PR] introduce a flag to disable aten::cat in TE
Test Plan: revert-hammer

Differential Revision:
D25763758 (9e0b4a96e4)

Original commit changeset: c4f4a8220964

fbshipit-source-id: 98775ad9058b81541a010e646b0cf4864854be3e
2021-01-05 08:45:50 -08:00
3845770349 Fixing error in Readme.md. (#50033)
Summary:
Fix an incorrect command in the readme.
Fix an incorrect URL in the readme.
Add a URL for the Dockerfile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50033

Reviewed By: ezyang

Differential Revision: D25759567

Pulled By: mrshenli

fbshipit-source-id: 2a3bc88c8717a3890090ddd0d6657f49d14ff05a
2021-01-05 08:22:49 -08:00
8c66aec435 Fix grammar typo in readme.md (#50000)
Summary:
The readme was missing a backtick.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50000

Reviewed By: ezyang

Differential Revision: D25759608

Pulled By: mrshenli

fbshipit-source-id: 4dbe06b8978ae5b2b9b66cde163dab4bd8ee2257
2021-01-05 08:14:48 -08:00
e4d596c575 Fix return value of _vmap_internals._get_name (#49951)
Summary:
This appears to have been a copy-paste error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49951

Reviewed By: mrshenli

Differential Revision: D25757099

Pulled By: zou3519

fbshipit-source-id: e47cc3b0694645bd0025326bfe45852ef0266adf
2021-01-05 07:00:48 -08:00
6e6231f9cd unit test for fc parallelization aot (#50056)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50056

buck test //caffe2/caffe2/contrib/fakelowp/test:test_chunkingnnpi -- --fallback-classic

Test Plan: https://our.intern.facebook.com/intern/testinfra/testrun/7036874446100155

Reviewed By: venkatacrc

Differential Revision: D25731079

fbshipit-source-id: 4aa4ffc641659cd90bf4670d28cb43e43ae76dcd
2021-01-05 00:27:43 -08:00
ee80b45843 [TensorExpr] Fix LLVM 10 build after LLVM API changes
Summary: Use `llvm::CodeGenFileType` for llvm-10+

Test Plan: local build

Reviewed By: asuhan

Differential Revision: D25694990

fbshipit-source-id: c35d973ef2669929715a94da5dd46e4a0457c4e8
2021-01-04 23:19:21 -08:00
c51455a7bb [FX] fix Graph python_code return type annotation (#49931)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49931

This fixes #49932. The `maybe_return_annotation` was not being passed by reference, so it was never getting modified.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D25725582

Pulled By: esqu1

fbshipit-source-id: 4136ff169a269d6b98f0b8e14d95d19e7c7cfa71
2021-01-04 19:55:33 -08:00
8fb5f16931 Complex backward for indexing, slicing, joining, and mutating ops (#49552)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49552

This PR:
1. Migrates independent autograd tests for `hstack`, `dstack`, `vstack`, `movedim`, `moveaxis` from `test_autograd.py` to the new `OpInfo` based tests.
2. Migrates autograd tests for `gather`, `index_select` from the method_tests to the new `OpInfo` based tests.
3. Enables complex backward for `stack, gather, index_select, index_add_` and adds tests for complex autograd for all the above mentioned ops (a short sketch follows below).
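
For illustration (not taken from the PR's test suite), a minimal sketch of complex autograd through `gather`:

```python
import torch

# gradcheck compares the analytic complex backward against finite differences.
x = torch.randn(3, 4, dtype=torch.cdouble, requires_grad=True)
idx = torch.tensor([[0, 1], [2, 3], [1, 0]])
torch.autograd.gradcheck(lambda t: torch.gather(t, 1, idx), (x,))
```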

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D25682511

Pulled By: anjali411

fbshipit-source-id: 5d8f89db4a9ec340ab99a6196987d44a23e2c6c6
2021-01-04 19:44:15 -08:00
9e0b4a96e4 introduce a flag to disable aten::cat in TE (#49579)
Summary:
introduce a flag to disable aten::cat in TE

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49579

Reviewed By: eellison

Differential Revision: D25763758

Pulled By: Krovatkin

fbshipit-source-id: c4f4a8220964813202369a3383057e77e7f10cb0
2021-01-04 19:17:29 -08:00
65122173ab [ONNX] Modified var_mean symbolic to support more combinations of dims (#48949)
Summary:
In the existing implementation of var_mean, the values of dim have to be sequential and start with zero. The formats listed below cause scenarios with an incompatible dimension for the Sub node:
-> dim[1, 2]
-> dim[0, 2]
-> dim[2, 0]

The changes in this PR allow such formats to be supported in var_mean.
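
For illustration (the actual fix is in the ONNX symbolic, exercised via torch.onnx.export), the dim combinations in question:

```python
import torch

x = torch.randn(2, 3, 4)
# Previously only sequential, zero-starting dims exported cleanly:
torch.var_mean(x, dim=[1, 2])
torch.var_mean(x, dim=[0, 2])
torch.var_mean(x, dim=[2, 0])
```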

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48949

Reviewed By: houseroad

Differential Revision: D25540272

Pulled By: SplitInfinity

fbshipit-source-id: 59813a77ff076d138655cc8c17953358f62cf137
2021-01-04 18:10:39 -08:00
d0369aabe1 Clean up some type annotations in caffe2/contrib/aten/gen_op (#49945)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49945

Upgrades type annotations from Python2 to Python3

Test Plan: Sandcastle tests

Reviewed By: xush6528

Differential Revision: D25717502

fbshipit-source-id: 718d93e8614e9d050f4da1c6bd4ac892bab98154
2021-01-04 17:32:38 -08:00
a5339b9d7c Drop unused imports from leftovers (#49953)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49953

From
```
./python/libcst/libcst codemod remove_unused_imports.RemoveUnusedImportsWithGlean --no-format caffe2/
```

Test Plan: Standard sandcastle tests

Reviewed By: xush6528

Differential Revision: D25727348

fbshipit-source-id: b3feef80b9b4b535f1bd4060dace5b1a50bd5e69
2021-01-04 16:31:48 -08:00
5acb1cc1df Drop unused imports from scripts (#49956)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49956

From
```
./python/libcst/libcst codemod remove_unused_imports.RemoveUnusedImportsWithGlean --no-format caffe2/
```

Test Plan: Standard sandcastle tests

Reviewed By: xush6528

Differential Revision: D25727347

fbshipit-source-id: 74d0a08aa0cfd0f492688a2b8278a0c65fd1deba
2021-01-04 16:08:28 -08:00
efe1fc21fc Dont inlinine intermediates on cpu (#49565)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49565

Test Plan: Imported from OSS

Reviewed By: Krovatkin, ZolotukhinM

Differential Revision: D25688271

Pulled By: eellison

fbshipit-source-id: 9ea7858e2db4fb31292e04440fc72ee04623c688
2021-01-04 15:46:20 -08:00
c439a6534d [ONNX] Handle Sub-block index_put in _jit_pass_onnx_remove_inplace_ops_for_onnx (#48734)
Summary:
For the added UT and existing UTs, this code is independent and ready for review.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48734

Reviewed By: izdeby

Differential Revision: D25502677

Pulled By: bzinodev

fbshipit-source-id: 788b4eaa5e5e8b5df1fb4956fbd25928127bb199
2021-01-04 15:11:10 -08:00
240c0b318a Suppress "statement is unreachable" warning (#49495)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49495

Compiling PyTorch currently generates a large number of warnings like this:
```
caffe2/aten/src/ATen/core/builtin_function.h(105): warning: statement is unreachable
```
The offending code
```
  std::string pretty_print_schema() const override {
    TORCH_INTERNAL_ASSERT(false);
    return "";
  }
```
has an unreachable return which prevents a "no return" warning.

We resolve the situation by using NVCC's pragma system to suppress this warning within this function.

Test Plan:
The warning appears when running:
```
buck build mode/dev-nosan //caffe2/torch/fb/sparsenn:test
```
As well as a number of other build commands.

Reviewed By: ngimel

Differential Revision: D25546542

fbshipit-source-id: 71cddd4fdb5fd16022a6d7b2daf0e6d55e6e90e2
2021-01-04 14:53:47 -08:00
f96ce3305c prohibit assignment to a sparse tensor (#50040)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48225 by prohibiting assignment to a sparse Tensor.
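
A minimal sketch of the newly prohibited pattern (the exact error type and message are whatever the PR raises):

```python
import torch

s = torch.sparse_coo_tensor(
    torch.tensor([[0, 1], [1, 0]]), torch.tensor([3.0, 4.0]), (2, 2)
)
try:
    s[0, 0] = 1.0  # index assignment is now rejected for sparse tensors
except Exception as e:
    print(type(e).__name__, e)
```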

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50040

Reviewed By: mrshenli

Differential Revision: D25757125

Pulled By: zou3519

fbshipit-source-id: 3db6f48932eb10bf6ca5e97a6091afcabb60e478
2021-01-04 14:38:35 -08:00
71766d89ea [BE] unified run_process_no_exception code (#49774)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49774

Reviewed By: janeyx99

Differential Revision: D25756811

Pulled By: walterddr

fbshipit-source-id: 4d2b3bd772572764ff96e5aad70323b58393e332
2021-01-04 13:43:09 -08:00
74dcb6d363 torch.xlogy: Use wrapped_scalar_tensor / gpu_with_scalars to speed up GPU kernel. (#49926)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49926

While investigating https://github.com/pytorch/pytorch/issues/49758, I changed the xlogy kernel to use the recommended wrapped_scalar_tensor pattern instead of moving the scalar to the GPU as a tensor.
While this doesn't avoid a synchronization (there is no synchronization in the move, as it's done via fill), it does significantly speed up the GPU kernel (~50%, benchmark in PR comments).

From looking at the nvprof output, it looks like this code path avoids broadcasting.  Aside: this seems unnecessary, as there is nothing special from the point-of-view of broadcasting whether the Tensor
is ()-sized or marked as a wrapped_scalar.  Still, this is a useful change to make as we avoid extra kernel launches and dispatches to create and fill the tensor.
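
For illustration (not from the PR), the call shape this affects; the scalar argument stays a wrapped scalar instead of becoming a GPU tensor:

```python
import torch

x = torch.randn(1024, device="cuda")
y = torch.xlogy(2.0, x)  # scalar argument handled via wrapped_scalar_tensor
```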

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D25724215

Pulled By: gchanan

fbshipit-source-id: 4adcd5d8b3297502672ffeafc77e8af80592f460
2021-01-04 12:42:08 -08:00
483670ff0f [pytorch] add threshold_backward batching for vmap (#49881)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49881

title

Test Plan: pytest test/test_vmap.py -v -k "BatchedGrad"

Reviewed By: zou3519

Differential Revision: D25711289

fbshipit-source-id: f1856193249fda70da41e36e15bc26ea7966b510
2021-01-04 12:24:05 -08:00
da790eca69 Add trace batching forward/backward rule (#49979)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49979

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D25734379

Pulled By: ejguan

fbshipit-source-id: 8f9346afaf324e7ab17bafd6ecc97eed8442fd38
2021-01-04 12:04:55 -08:00
0216366f0d Make use_c10_dispatcher: full mandatory for structured kernels (#49490)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49490

No reason to let people do the legacy thing for the brand new kernel.
This simplifies the codegen.  I have to port the two structured kernels
to this new format.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D25595406

Pulled By: ezyang

fbshipit-source-id: b5931873379afdd0f3b00a012e0066af05de0a69
2021-01-04 11:59:24 -08:00
6c833efd65 Move default or no default logic into native.argument (#49489)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49489

Previously, it was done at a use site, but that meant other use
sites don't get the right logic.  Pushing it in makes sure everyone
gets it.

I also fixed one case of confusion where defn() was used to define a decl().
If you want to define a declaration with no defaults, say no_default().decl()
which is more direct and will give us code reviewers a clue if you should
have pushed this logic in.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D25595407

Pulled By: ezyang

fbshipit-source-id: 89c664f0ed4d95699794a0d3123d54d0f7e4cba4
2021-01-04 11:59:20 -08:00
8eee8460f8 codegen: Resolve overload ambiguities created by defaulted arguments (#49348)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49348

This is a redux of #45666 post refactor, based off of
d534f7d4c5
Credit goes to peterbell10 for the implementation.

Fixes #43945.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D25594004

Pulled By: ezyang

fbshipit-source-id: c8eb876bb3348308d6dc8ba7bf091a2a3389450f
2021-01-04 11:59:16 -08:00
7202c0ec50 Tighten up error checking on manual_kernel_registration (#49341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49341

I noticed that #49097 was using manual_kernel_registration incorrectly,
so this diff tightens up the testing so that:

1. We don't generate useless wrapper functions when manual_kernel_registration
is on (it's not going to be registered, so it does nothing).

2. manual_kernel_registration shouldn't affect generation of functions in
Functions.h; if you need to stop bindings, use manual_cpp_binding

3. Combining structured and manual_kernel_registration is a hard error

4. We raise an error if you set dispatch and manual_kernel_registration at the
same time.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D25594003

Pulled By: ezyang

fbshipit-source-id: 655b10e9befdfd8bc95f1631b2f48f995a31a59a
2021-01-04 11:59:12 -08:00
8e20594b38 Construct CppSignatureGroup from NativeFunction (#49245)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49245

This will make it easier to implement the POC in
d534f7d4c5
see also https://github.com/pytorch/pytorch/pull/45666

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D25594005

Pulled By: ezyang

fbshipit-source-id: e458d3dc3a765ec77425761b9b17f23769cecf9e
2021-01-04 11:55:28 -08:00
f0945537af .circleci: Ignore unbound variables for conda (#50053)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50053

For some reason conda likes to re-activate the conda environment when attempting this install,
which means that a deactivate is run and some variables might not exist when that happens,
namely CONDA_MKL_INTERFACE_LAYER_BACKUP from libblas, so let's just ignore unbound variables when
it comes to the conda installation commands

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D25760737

Pulled By: seemethere

fbshipit-source-id: 9e7720eb8a4f8028dbaa7bcfc304e5c1ca73ad08
2021-01-04 11:34:28 -08:00
69ca5e1397 Enforce c10-fullness for all ops (#49619)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49619

This is a minimal-change PR that enforces that all operators are c10-full by making it the default.

This does not clean up any code yet, that will happen in PRs stacked on top. But this PR already ensures
that there are no non-c10-full ops left and there will be no non-c10-full ops introduced anymore.
ghstack-source-id: 119269182

Test Plan: waitforsandcastle

Reviewed By: bhosmer

Differential Revision: D25650198

fbshipit-source-id: efc53e884cb53193bf58a4834bf148453e689ea1
2021-01-04 11:26:53 -08:00
6e84a018be move to non-legacy magma v2 headers (#49978)
Summary:
We recently (https://github.com/pytorch/pytorch/issues/7582) dropped magma v1 support, but we were still including the legacy compatibility headers and using functions only provided by them.
This changes the includes to the new magma_v2 header and fixes the triangular solve functions to use the v2-style magma_queue-using API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49978

Reviewed By: mrshenli

Differential Revision: D25752499

Pulled By: ngimel

fbshipit-source-id: 26d916bc5ce63978b341aefb072af228f140637d
2021-01-04 11:18:53 -08:00
fdb81c538a Improve torch.flatten docs and add tests to test_view_ops (#49501)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/39474

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49501

Reviewed By: mrshenli

Differential Revision: D25740586

Pulled By: soulitzer

fbshipit-source-id: 3d7bdbab91eb208ac9e6832bb766d9d95a00c103
2021-01-04 11:11:34 -08:00
b76822eb49 Update update_s3_htmls.yml (#49934)
Summary:
It currently runs on forks as well, and generates a lot of failure messages for the owners of those forks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49934

Reviewed By: mruberry

Differential Revision: D25739552

Pulled By: seemethere

fbshipit-source-id: 0f9cc430316c0a5e9972de3cdd06d225528c81c2
2021-01-04 10:14:14 -08:00
22bd277891 Run test_type_hints first (#49748)
Summary:
Since it is essentially a linter check and fails frequently

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49748

Reviewed By: vkuzo

Differential Revision: D25682980

Pulled By: malfet

fbshipit-source-id: 7dba28242dced0277bad56dc887d3273c1e9e575
2021-01-04 09:33:13 -08:00
211f35631f Add type annotations to _tensorboard_vis.py and hipify_python.py (#49834)
Summary:
closes gh-49833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49834

Reviewed By: mruberry

Differential Revision: D25725341

Pulled By: malfet

fbshipit-source-id: 7454c7afe07a3ff829826afe02aba05b7f649d9b
2021-01-04 09:29:51 -08:00
c7e9abb66a Making ops c10-full: list of optional tensors (#49138)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49138

See for details: https://fb.quip.com/QRtJAin66lPN

We need to model optional types explicitly, mostly for schema inference. So we cannot pass a `Tensor?[]` as `ArrayRef<Tensor>`, instead we need to pass it as an optional type. This PR changes it to `torch::List<c10::optional<Tensor>>`. It also makes the ops c10-full that were blocked by this.

## Backwards Compatibility

- This should not break the Python API because the representation in Python is the same and python_arg_parser just transforms the python list into a `List<optional<Tensor>>` instead of into a `List<Tensor>`.
- This should not break serialized models because there's some logic that allows loading a serialized `List<Tensor>` as `List<optional<Tensor>>`, see https://github.com/pytorch/pytorch/pull/49138/files#diff-9315f5dd045f47114c677174dcaa2f982721233eee1aa19068a42ff3ef775315R57
- This will break backwards compatibility for the C++ API. There is no implicit conversion from `ArrayRef<Tensor>` (which was the old argument type) to `List<optional<Tensor>>`. One common call pattern is `tensor.index({indices_tensor})`, where indices_tensor is another `Tensor`, and that will continue working because the `{}` initializer_list constructor for `List<optional<Tensor>>` can take `Tensor` elements that are implicitly converted to `optional<Tensor>`. But another common call pattern was `tensor.index(indices_tensor)`, where previously the `Tensor` got implicitly converted to an `ArrayRef<Tensor>`; to implicitly convert `Tensor -> optional<Tensor> -> List<optional<Tensor>>` would be two implicit conversions, and C++ doesn't allow chaining two implicit conversions. So those call sites have to be rewritten to `tensor.index({indices_tensor})`.

ghstack-source-id: 119269131

Test Plan:
## Benchmarks (C++ instruction counts):
### Forward
#### Script
```py
from torch.utils.benchmark import Timer

counts = Timer(
    stmt="""
        auto t = {{op call to measure}};
    """,
    setup="""
        using namespace torch::indexing;
        auto x = torch::ones({4, 4, 4});
    """,
    language="cpp",
).collect_callgrind(number=1_000)
print(counts)
```
#### Results
|  Op call                                                              |before   |after   |delta  |delta %|
|------------------------------------------------------------------------|---------|--------|-------|------|
|x[0] = 1                                                                |11566015 |11566015|0      |0.00% |
|x.index({0})                                                            |6807019  |6801019 |-6000  |-0.09%|
|x.index({0, 0})                                                         |13529019 |13557019|28000  |0.21% |
|x.index({0, 0, 0})                                                      |10677004 |10692004|15000  |0.14% |
|x.index({"..."})                                                        |5512015  |5506015 |-6000  |-0.11%|
|x.index({Slice(None, None, None)})                                      |6866016  |6936016 |70000  |1.02% |
|x.index({None})                                                         |8554015  |8548015 |-6000  |-0.07%|
|x.index({false})                                                        |22400000 |22744000|344000 |1.54% |
|x.index({true})                                                         |27624088 |27264393|-359695|-1.30%|
|x.index({"...", 0, true, Slice(1, None, 2), torch::tensor({1, 2})})|123472000|123463306|-8694|-0.01%|

### Autograd
#### Script
```py
from torch.utils.benchmark import Timer

counts = Timer(
    stmt="""
        auto t = {{op call to measure}};
    """,
    setup="""
        using namespace torch::indexing;
        auto x = torch::ones({4, 4, 4}, torch::requires_grad());
    """,
    language="cpp",
).collect_callgrind(number=1_000)
print(counts)
```
Note: the script measures the **forward** path of an op call with autograd enabled (i.e. calls into VariableType). It does not measure the backward path.

#### Results
|  Op call                                                              |before   |after   |delta  |delta %|
|------------------------------------------------------------------------|---------|--------|-------|------|
|x.index({0})                                                            |14839019|14833019|-6000| 0.00% |
|x.index({0, 0})                                                         |28342019|28370019|28000| 0.00% |
|x.index({0, 0, 0})                                                      |24434004|24449004|15000| 0.00% |
|x.index({"..."})                                                       |12773015|12767015|-6000| 0.00% |
|x.index({Slice(None, None, None)})                                      |14837016|14907016|70000| 0.47% |
|x.index({None})                                                        |15926015|15920015|-6000| 0.00% |
|x.index({false})                                                        |36958000|37477000|519000| 1.40% |
|x.index({true})                                                         |41971408|42426094|454686| 1.08% |
|x.index({"...", 0, true, Slice(1, None, 2), torch::tensor({1, 2})}) |168184392|164545682|-3638710| -2.16% |

Reviewed By: bhosmer

Differential Revision: D25454632

fbshipit-source-id: 28ab0cffbbdbdff1c40b4130ca62ee72f981b76d
2021-01-04 05:04:02 -08:00
e44b2b72bd Back out "[pytorch][PR] Preserve memory format in qconv op" (#49994)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49994

Revert preserving memory format in qconv op because it is negatively affecting performance; we will revert the revert after fixing all issues

Test Plan: pytest fbcode/caffe2/test/quantization/test_quantized_op.py

Reviewed By: kimishpatel

Differential Revision: D25731279

fbshipit-source-id: 908dbb127210a93b27ada7ccdfa531177edf679a
2021-01-03 00:11:40 -08:00
8aad66a7bd [c10/**] Fix typos (#49815)
Summary:
All pretty minor. I avoided renaming `class DestructableMock` to `class DestructibleMock` and similar such symbol renames (in this PR).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49815

Reviewed By: VitalyFedyunin

Differential Revision: D25734507

Pulled By: mruberry

fbshipit-source-id: bbe8874a99d047e9d9814bf92ea8c036a5c6a3fd
2021-01-01 02:11:56 -08:00
749f8b7850 Remove flops warnings from the default profiler use case (#49896)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49896

Add missing check for with_flops option set

Test Plan:
python test/test_profiler.py
CI

Reviewed By: xuzhao9, ngimel

Differential Revision: D25716930

Pulled By: ilia-cher

fbshipit-source-id: 0da0bbb6c1a52328f665237e503406f877b41449
2020-12-30 23:49:29 -08:00
de3d8f8c35 Revert D25734450: [pytorch][PR] Improve torch.flatten docs and add tests to test_view_ops
Test Plan: revert-hammer

Differential Revision:
D25734450 (730965c246)

Original commit changeset: 993667dd07ac

fbshipit-source-id: 603af25311fc8b29bb033167f3b2704da79c3147
2020-12-30 22:04:43 -08:00
4677fc69a2 Fix inf norm grad (reland) (#48611)
Summary:
Reland of https://github.com/pytorch/pytorch/issues/48122

Does this result in a regression? No significant regression observed.

Timer script:
```
import torch
from torch.utils.benchmark import Timer

setup="""
a = torch.rand((2, 2), requires_grad=True)
gradient = torch.ones(2)
"""

stmt="""
torch.autograd.grad(torch.norm(a, dim=(0,), keepdim=False), a, gradient)
"""

timer = Timer(stmt, setup)

print(timer.timeit(10000))
print(timer.collect_callgrind(100))
```
Note: small matrix, keepdim is False, and dims is non-empty

Before change
```
Runtime   37.37 us
1 measurement, 10000 runs , 1 thread

                           All          Noisy symbols removed
    Instructions:     15279045                   15141710
    Baseline:             4257                       3851
100 runs per measurement, 1 thread
```

After change
```
Runtime 36.08 us
1 measurement, 10000 runs , 1 thread

                           All          Noisy symbols removed
    Instructions:     15296974                   15153534
    Baseline:             4257                       3851
100 runs per measurement, 1 thread
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48611

Reviewed By: albanD, mruberry

Differential Revision: D25309997

Pulled By: soulitzer

fbshipit-source-id: 5fb950dc9259234342985c0e84ada25a7e3814d6
2020-12-30 21:13:33 -08:00
730965c246 Improve torch.flatten docs and add tests to test_view_ops (#49501)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/39474

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49501

Reviewed By: mruberry

Differential Revision: D25734450

Pulled By: soulitzer

fbshipit-source-id: 993667dd07acd81a4616465e0a3b94bde449193e
2020-12-30 20:35:46 -08:00
cd608fe59b Revert D25719980: [pytorch][PR] Accept input tensor with 0-dim batch size for MultiLabelMarginLoss
Test Plan: revert-hammer

Differential Revision:
D25719980 (6b56b71e61)

Original commit changeset: 83414bad37c0

fbshipit-source-id: 27eddd711a2b9e0adbc08bfab12100562e63ac21
2020-12-30 17:06:28 -08:00
46afd7fc9f [PyTorch] Decouple version numbers from c10 and caffe2 targets (#49905)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49905

There's a size regression in model delivery in D25682312. Only the model version numbers are used; however, the dependency on the entire c10 (128 KB) is pulled in.

This diff is to decouple the version numbers to a separate header file, versions.h. Other targets referring to version numbers only can have deps of ```caffe2:version_headers```.
ghstack-source-id: 119161467

Test Plan: CI

Reviewed By: xcheng16, guangyfb

Differential Revision: D25716601

fbshipit-source-id: 07634bcf46eacfefa4aa75f2e4c9b9ee30c6929d
2020-12-30 15:34:01 -08:00
04a8412b86 [quant] Quantizable LSTM (#49671)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49671

- Introduces the `torch.nn.quantizable` namespace
- Adds the `torch.nn.quantizable.LSTM` module

The point of the `quantizable` namespace is to separate the purely quantized modules from the modules that could be quantized through a normal quantization flow, but are not using the quantized kernels explicitly.
That means the quantizable modules are functionally and numerically equivalent to the FP ones and can be used instead of the FP ones without any loss.

The main difference between the `torch.nn.LSTM` and the `torch.nn.quantizable.LSTM` is that the former one does not support observation for the linear layers, because all the computation is internal to the `aten` namespace.
The `torch.nn.quantizable.LSTM`, however, uses explicit linear layers that can be observed for further quantization.
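
A hedged usage sketch (the module mirrors `torch.nn.LSTM`'s interface; the sizes below are arbitrary):

```python
import torch

lstm = torch.nn.quantizable.LSTM(input_size=8, hidden_size=16)
x = torch.randn(5, 3, 8)  # (seq_len, batch, input_size)
out, (h, c) = lstm(x)     # the internal linear layers are now observable
```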

Test Plan: Imported from OSS

Differential Revision: D25663870

Reviewed By: vkuzo

Pulled By: z-a-f

fbshipit-source-id: 70ff5463bd759b9a7922571a5712d3409dfdfa06
2020-12-30 15:21:38 -08:00
ffbb68af8a quant docs: add common errors section (#49902)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49902

Adds a common errors section, and details the two errors
we see often on the discuss forums, with recommended solutions.

Test Plan: build the docs on Mac OS, the new section renders correctly.

Reviewed By: supriyar

Differential Revision: D25718195

Pulled By: vkuzo

fbshipit-source-id: c5ef2b24831d18d57bbafdb82d26d8fbf3a90781
2020-12-30 15:01:59 -08:00
a7e1f4f37a Remove incorrect usage of layout(std430) on uniform buffers, correctly now treated as error in the latest release of Vulkan SDK. (#49572)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49572

Differential Revision: D25729888

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Pulled By: AshkanAliabadi

fbshipit-source-id: 15dd4acef3dfae72f03e7e3085b1ff5936becf3d
2020-12-30 14:53:41 -08:00
6a951a6f4c Fix a KaTeX crash and many docstring issues (#49684)
Summary:
The first commit fixes the `MultiheadAttention` docstrings, which are causing a cryptic KaTeX crash.

The second commit fixes many documentation issues in `torch/_torch_docs.py`, and closes gh-43667 (missing "Keyword arguments" headers). It also fixes a weird duplicate docstring for `torch.argmin`; there are more of these, and it looks like they were written based on whether the C++ implementation has an overload. That makes little sense to a Python user though, and the content is simply duplicated.

The `Shape:` heading for https://pytorch.org/docs/master/generated/torch.nn.MultiheadAttention.html looked bad, here's what it looks like with this PR:

<img width="475" alt="image" src="https://user-images.githubusercontent.com/98330/102797488-09a44e00-43b0-11eb-8788-acdf4e936f2f.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49684

Reviewed By: ngimel

Differential Revision: D25730909

Pulled By: mruberry

fbshipit-source-id: d25bcf8caf928e7e8e918017d119de12e10a46e9
2020-12-30 14:17:39 -08:00
6b56b71e61 Accept input tensor with 0-dim batch size for MultiLabelMarginLoss (#46975)
Summary:
Fix for one of the layers listed in https://github.com/pytorch/pytorch/issues/12013 or https://github.com/pytorch/pytorch/issues/38115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46975

Reviewed By: mruberry

Differential Revision: D25719980

Pulled By: ngimel

fbshipit-source-id: 83414bad37c0b004bc7cced04df8b9c89bdba3e6
2020-12-30 13:29:26 -08:00
42d2e31cd6 [numpy] torch.rsqrt : promote integer inputs to float (#47909)
Summary:
Reference https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47909

Reviewed By: ngimel

Differential Revision: D25730876

Pulled By: mruberry

fbshipit-source-id: c87a8f686e1dd64e511640e0278021c4a584ccf2
2020-12-30 10:33:14 -08:00
b54ad08978 Enable test_fusions TanhQuantize (#49970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49970

enable test_fusions:test_tanhquantize

Test Plan: https://internalfb.com/intern/testinfra/testrun/6755399469176694

Reviewed By: hyuen

Differential Revision: D25732684

fbshipit-source-id: b8479e43b5248ba5510f0c78c993d534d3ffc2b0
2020-12-30 10:00:39 -08:00
cfc3db0ca9 Remove THPWrapper (#49871)
Summary:
Removes `THPWrapper` from the PyTorch C code since it is not used anymore; because we have dropped Python 2 compatibility, its usage can be replaced by capsule objects (`PyCapsule_New`, `PyCapsule_CheckExact`, `PyCapsule_GetPointer` and `PyCapsule_GetDestructor`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49871

Reviewed By: mruberry

Differential Revision: D25715038

Pulled By: albanD

fbshipit-source-id: cc3b6f967bbe0dc42c692adf76dff4e4b667fdd5
2020-12-30 03:01:52 -08:00
12b73fdbbf Adding JIT support for cuda streams and events (#48020)
Summary:
=======

This PR addresses the following:

 * Adds JIT support for CUDA Streams
 * Adds JIT support for CUDA Events
 * Adds JIT support for CUDA Stream context manager

Testing:
======

python test/test_jit.py -v TestCUDA
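
For illustration, the eager-mode stream/event pattern this PR makes scriptable (the exact scripted surface is defined by the PR; this sketch uses only the public eager API):

```python
import torch

s = torch.cuda.Stream()
e = torch.cuda.Event()
x = torch.randn(1024, device="cuda")
with torch.cuda.stream(s):      # enqueue work on a side stream
    y = x * 2
    e.record()                  # mark completion on that stream
e.synchronize()                 # host waits for the event
```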

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48020

Reviewed By: navahgar

Differential Revision: D25725749

Pulled By: nikithamalgifb

fbshipit-source-id: b0addeb49630f8f0c430ed7badeca43bb9d2535c
2020-12-29 20:24:57 -08:00
97c17b4772 Fix auto exponent issue for torch.pow (#49809)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49809

Fixes https://github.com/pytorch/xla/issues/2688 #46936

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D25724176

Pulled By: anjali411

fbshipit-source-id: 16287a1f481e9475679b99d6fb45de840da225be
2020-12-29 17:02:56 -08:00
e482c70a3d added List as an option to the unflattened_size (#49838)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49743
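
A minimal sketch of the newly accepted type:

```python
import torch
from torch import nn

m = nn.Unflatten(dim=1, unflattened_size=[2, 3])  # a list is now accepted, not just a tuple
print(m(torch.randn(4, 6)).shape)  # torch.Size([4, 2, 3])
```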

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49838

Reviewed By: mruberry

Differential Revision: D25727971

Pulled By: ngimel

fbshipit-source-id: 60142dae84ef107f0083676a2a78ce6b0472b7e1
2020-12-29 16:50:37 -08:00
01b57e1810 Revert D25718705: Clean up type annotations in caffe2/torch/nn/modules
Test Plan: revert-hammer

Differential Revision:
D25718705 (891759f860)

Original commit changeset: 6a9e3e6d17aa

fbshipit-source-id: 1a4ef0bfdec8eb8e7ce149bfbdb34a4ad8d964b6
2020-12-29 16:42:26 -08:00
14edc726d9 Clean up some type annotations in caffe2/torch/quantization (#49942)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49942

Upgrades type annotations from Python2 to Python3

Test Plan: Sandcastle tests

Reviewed By: vkuzo

Differential Revision: D25717551

fbshipit-source-id: 1b63dc485ecf6641641b05f7ce095ae1d2d87346
2020-12-29 15:43:50 -08:00
4c5a4dbb8c [Tensorexpr]Copying header files in tensorexpr dir (#49933)
Summary:
Previously, header files from jit/tensorexpr were not copied; this PR enables copying them.

This will allow other OSS projects like Glow to use TE.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49933

Reviewed By: Krovatkin, mruberry

Differential Revision: D25725927

Pulled By: protonu

fbshipit-source-id: 9d5a0586e9b73111230cacf044cd7e8f5c600ce9
2020-12-29 15:18:52 -08:00
891759f860 Clean up type annotations in caffe2/torch/nn/modules (#49938)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49938

Test Plan: Sandcastle tests

Reviewed By: xush6528

Differential Revision: D25718705

fbshipit-source-id: 6a9e3e6d17aa458726cd32aa0a71a63c51b601d9
2020-12-29 14:04:52 -08:00
a111a9291c added fuse_op and list_construct - list_unpack pass
Summary: Added fuse_op and list_construct and list_unpack pass

Test Plan:
jit_graph_opt_test.py
jit_graph_optimizer_test.cc
sparsenn_fused_operator_test.py

Reviewed By: qizzzh

Differential Revision: D25715079

fbshipit-source-id: fa976be53135a83f262b8f2e2eaedadd177f46c4
2020-12-29 12:29:53 -08:00
8d7338e820 Enable tests using named temp files on Windows (#49640)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49640

Reviewed By: ngimel

Differential Revision: D25681548

Pulled By: malfet

fbshipit-source-id: 0e2b25817c98d749920cb2b4079033a2ee8c1456
2020-12-29 09:57:35 -08:00
d434ac35e4 Update gather documentation to allow index.shape[k] <= input.shape[k] rather than ==. (#41887)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41887
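
For illustration, an index tensor that is smaller than the input in a non-gathered dimension:

```python
import torch

x = torch.arange(12).reshape(3, 4)
idx = torch.tensor([[0, 2]])    # shape (1, 2) vs. input (3, 4); dim=1
print(torch.gather(x, 1, idx))  # tensor([[0, 2]])
```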

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D22680014

Pulled By: gchanan

fbshipit-source-id: b162fccabc22a1403c0c43c1131f0fbf4689a79d
2020-12-29 07:28:48 -08:00
c619892482 Fix errata (#49903)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49903

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25718411

Pulled By: ansley

fbshipit-source-id: 0cc365c5a53077752dc1c5a5c4a65b873baa3604
2020-12-28 20:40:41 -08:00
361f5ed91d Implement torch.linalg.qr (#47764)
Summary:
I am opening this PR early to have a place to discuss design issues.
The biggest difference between `torch.qr` and `numpy.linalg.qr` is that the former takes a boolean parameter `some=True`, while the latter takes a string parameter `mode='reduced'` which can be one of the following:

`reduced`
this is completely equivalent to `some=True`, and both are the default.

`complete`
this is completely equivalent to `some=False`.

`r`
this returns only `r` instead of a tuple `(q, r)`. We have already decided that we don't want different return types depending on the parameters, so I propose to return `(r, empty_tensor)` instead. I **think** that in this mode it will be impossible to implement the backward pass, so we should raise an appropriate error in that case.

`raw`
in this mode, it returns `(h, tau)` instead of `(q, r)`. Internally, `h` and `tau` are obtained by calling lapack's `dgeqrf` and are later used to compute the actual values of `(q, r)`. The numpy docs suggest that these might be useful to call other lapack functions, but at the moment none of them is exposed by numpy and I don't know how often it is used in the real world.
I suppose implementing the backward pass needs attention: the most straightforward solution is to use `(h, tau)` to compute `(q, r)` and then use the normal logic for `qr_backward`, but there might be faster alternatives.

`full`, `f`
alias for `reduced`, deprecated since numpy 1.8.0

`economic`, `e`
similar to `raw`, but it returns only `h` instead of `(h, tau)`. Deprecated since numpy 1.8.0

To summarize:
  * `reduced`, `complete` and `r` are straightforward to implement.

  * `raw` needs a bit of extra care, but I don't know how high priority it is: since it is rarely used, we might want to not support it right now and maybe implement it in the future?

  * I think we should just leave `full` and `economic` out, and possibly add a note to the docs explaining what you need to use instead

/cc mruberry
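
A short sketch of the two uncontroversial modes as proposed (shapes follow the reduced/complete definitions above):

```python
import torch

a = torch.randn(5, 3)
q, r = torch.linalg.qr(a)                     # mode='reduced' (default): q is (5, 3), r is (3, 3)
q2, r2 = torch.linalg.qr(a, mode="complete")  # q2 is (5, 5), r2 is (5, 3)
```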

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47764

Reviewed By: ngimel

Differential Revision: D25708870

Pulled By: mruberry

fbshipit-source-id: c25c70a23a02ec4322430d636542041e766ebe1b
2020-12-28 17:28:17 -08:00
bc4ff7ba05 fx quant: split linear test cases (#49740)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49740

1. Separates the module and functional linear test cases.
2. Combines the test case which tests for linear bias observation into
the main linear test case, as requested in
https://github.com/pytorch/pytorch/pull/49628.

Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps.test_linear_module
python test/test_quantization.py TestQuantizeFxOps.test_linear_functional
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25681272

fbshipit-source-id: 0ed0ebd5afb8cdb938b530f7dbfbd79798eb9318
2020-12-28 14:30:25 -08:00
ea558b2135 fx quant: hook up ConvTranspose{n}d (#49717)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49717

Quantization of `ConvTranspose{n}d` is supported in Eager mode. This PR
adds the support for FX graph mode.

Note: this currently only works in `qnnpack` because per-channel weights
are not supported by quantized conv transpose. In a future PR we should throw
an error when someone tries to quantize a ConvTranspose model with per-channel
weight observers until this is fixed.

Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps.test_conv_transpose_1d
python test/test_quantization.py TestQuantizeFxOps.test_conv_transpose_2d
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25674636

fbshipit-source-id: b6948156123ed55db77e6337bea10db956215ae6
2020-12-28 14:27:07 -08:00
fc559bd6dc [JIT] Constant prop getattr (#49806)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49806

Fix for https://github.com/pytorch/pytorch/issues/47089

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D25696791

Pulled By: eellison

fbshipit-source-id: 914c17b8effef7f4f341775ac2b8150ee4703efd
2020-12-28 10:44:53 -08:00
268441c7d8 [NNC] masked fill (#49627)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49627

There was a bug in the test that was hidden by the `If eager mode doesn't support a dtype/op/device combo` try / catch, so CUDA wasn't being tested. The fix is just to rename `aten::masked_fill` to `aten_masked_fill`.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D25696409

Pulled By: eellison

fbshipit-source-id: 83de1f5a194df54fe317b0035d4a6c1aed1d19a0
2020-12-28 10:37:02 -08:00
58fe67967c Support the in operator with str (#47057)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47057
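
A minimal sketch of what now compiles:

```python
import torch

@torch.jit.script
def contains(s: str, sub: str) -> bool:
    return sub in s  # the `in` operator on strings is now supported in TorchScript

print(contains("pytorch", "torch"))  # True
```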

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D24863370

Pulled By: ansley

fbshipit-source-id: 5d17165b06052f0a4676537c5f6757083185a591
2020-12-28 10:26:24 -08:00
e6779d4357 [*.py] Rename "Arguments:" to "Args:" (#49736)
Summary:
I've written custom parsers and emitters for everything from docstrings to classes and functions. However, I recently came across an issue when I was parsing/generating from the TensorFlow codebase: inconsistent use of `Args:` and `Arguments:` in its docstrings.

```sh
(pytorch#c348fae)$ for name in 'Args:' 'Arguments:'; do
    printf '%-10s %04d\n' "$name" "$(rg -IFtpy --count-matches "$name" | paste -s -d+ -- | bc)"; done
Args:      1095
Arguments: 0336
```

It is easy enough to extend my parsers to support both variants, however it looks like `Arguments:` is wrong anyway, as per:

  - https://google.github.io/styleguide/pyguide.html#doc-function-args @ [`ddccc0f`](https://github.com/google/styleguide/blob/ddccc0f/pyguide.md)

  - https://chromium.googlesource.com/chromiumos/docs/+/master/styleguide/python.md#describing-arguments-in-docstrings @ [`9fc0fc0`](https://chromium.googlesource.com/chromiumos/docs/+/9fc0fc0/styleguide/python.md)

  - https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html @ [`c0ae8e3`](https://github.com/sphinx-contrib/napoleon/blob/c0ae8e3/docs/source/example_google.rst)

Therefore, only `Args:` is valid. This PR replaces them throughout the codebase.

PS: For related PRs, see tensorflow/tensorflow/pull/45420

PPS: The trackbacks automatically appearing below are sending the same changes to other repositories in the [PyTorch](https://github.com/pytorch) organisation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49736

Reviewed By: albanD

Differential Revision: D25710534

Pulled By: soumith

fbshipit-source-id: 61e8ff01abb433e9f78185c2d1d0cbd7c22c1619
2020-12-28 09:34:47 -08:00
9c64b9ffba early termination of CUDA tests (#49869)
Summary:
This is follow up on https://github.com/pytorch/pytorch/issues/49799.

* Uses `torch.cuda.synchronize()` to validate CUDA asserts instead of inspecting the error message.
* Removes non-CUDA tests.

Hopefully this can reproduce why slow_tests fails but the normal test does not, since the test still runs for >1 min.
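
For illustration (not the test itself), the validation pattern: a device-side assert only surfaces at the next synchronization point:

```python
import torch

x = torch.tensor([1.0], device="cuda")
try:
    y = x[torch.tensor([5], device="cuda")]  # out-of-bounds index -> device-side assert
    torch.cuda.synchronize()                 # the assert surfaces here as a RuntimeError
except RuntimeError as e:
    print("CUDA assert caught:", e)
```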

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49869

Reviewed By: mruberry

Differential Revision: D25714385

Pulled By: walterddr

fbshipit-source-id: 04f8ccb50d8c9ee42826a216c49baf90285b247f
2020-12-28 09:18:00 -08:00
963f7629b5 [numpy] torch.digamma : promote integer inputs to float (#48302)
Summary:
**BC-breaking Note:**

This PR updates PyTorch's digamma function to be consistent with SciPy's special.digamma function. This changes the result of the digamma function on the nonpositive integers, where the gamma function is not defined. Since the gamma function is undefined at these points, the (typical) derivative of the logarithm of the gamma function is also undefined at these points, and for negative integers this PR updates digamma to return NaN. For zero, however, it returns -inf to be consistent with SciPy.
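
For illustration of the new behavior:

```python
import torch

x = torch.tensor([0.0, -1.0, -2.0, -2.5])
print(torch.digamma(x))
# tensor([  -inf,    nan,    nan, 1.1032])  # -inf at zero, NaN at negative integers
```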

Interestingly, SciPy made a similar change, which was noticed by at least one user: https://github.com/scipy/scipy/issues/9663#issue-396587679.

SciPy's returning of negative infinity at zero is intentional:
59347ae8b8/scipy/special/cephes/psi.c (L163)

This change is consistent with the C++ standard for the gamma function:
https://en.cppreference.com/w/cpp/numeric/math/tgamma

**PR Summary:**
Reference https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48302

Reviewed By: ngimel

Differential Revision: D25664087

Pulled By: mruberry

fbshipit-source-id: 1168e81e218bf9fe5b849db0e07e7b22e590cf73
2020-12-24 22:42:55 -08:00
46cf6d332f Revert D25684692: [quant][graphmode][fx] Standalone module support {input/output}_quantized_idxs
Test Plan: revert-hammer

Differential Revision:
D25684692 (89b4899ea5)

Original commit changeset: 900360e01c0e

fbshipit-source-id: 8b65fa8fbc7b364fbddb5f23cc696cd9b7db98cd
2020-12-24 15:50:52 -08:00
ec6de6a697 Clip small scales to fp16 min
Summary: When the FC output min/max range is very small, we want to enforce a cutoff on the scale parameter to better generalize to future values that could fall beyond the original range.
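
A minimal sketch of the cutoff idea (the exact threshold is the diff's internal choice; using `float16`'s smallest positive normal here is an assumption):

```python
import torch

fp16_min = torch.finfo(torch.float16).tiny  # ~6.1e-5, smallest positive normal fp16
scale = 1e-9                                # a degenerate scale from a tiny min/max range
scale = max(scale, fp16_min)                # clip so future out-of-range values still quantize sanely
```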

Test Plan:
More analysis about the output distributions can be found in N425166

An example workflow using fp16 min clipping is f240972205

Reviewed By: jspark1105

Differential Revision: D25681249

fbshipit-source-id: c4dfbd3ee823886afed06e6c2eccfc29d612f7e6
2020-12-24 03:49:34 -08:00
89b4899ea5 [quant][graphmode][fx] Standalone module support {input/output}_quantized_idxs (#49754)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49754

This PR adds the support for {input/output}_quantized_idxs for standalone module.

If input_quantized_idxs = [] and output_quantized_idxs = [], the standalone module will expect float
input and produce float output, and will quantize the input and dequantize the output internally.

If input_quantized_idxs = [0] and output_quantized_idxs = [0], the standalone module will expect quantized
input and produce quantized output; the input will be quantized in the parent module, and the output will be
dequantized in the parent module as well. This is similar to current quantized modules like nn.quantized.Conv2d.

For more details, please see the test case

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_standalone_module

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D25684692

fbshipit-source-id: 900360e01c0e35b26fe85f4a887dc1fd6f7bfb66
2020-12-23 22:36:57 -08:00
69b1373587 Revert D25692616: [pytorch][PR] [reland] Early terminate when CUDA assert were thrown
Test Plan: revert-hammer

Differential Revision:
D25692616 (e6a215592e)

Original commit changeset: 9c5352220d63

fbshipit-source-id: dade8068cad265d15ee908d98abe0de5b81a195d
2020-12-23 17:48:12 -08:00
9552cc65d4 Creation of test framework for Sparse Operators (#48488)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48488

Reviewed By: ngimel

Differential Revision: D25696487

Pulled By: mruberry

fbshipit-source-id: dc4f57c6628f62b74dd321f3f6b0fff86f25b040
2020-12-23 15:42:26 -08:00
5acc27c00a Revert D25690129: [pytorch][PR] Added linalg.inv
Test Plan: revert-hammer

Differential Revision:
D25690129 (8554b58fbd)

Original commit changeset: edb2d03721f2

fbshipit-source-id: 8679ea18e637423d35919544d2b047a62ac3abd8
2020-12-23 15:27:52 -08:00
1833009202 Fix typo in complex autograd docs (#49755)
Summary:
Update complex autograd docs to fix a typo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49755

Reviewed By: mruberry

Differential Revision: D25692649

Pulled By: soulitzer

fbshipit-source-id: 43c2113b4c8f2d1828880102189a5a9b887dc784
2020-12-23 14:42:34 -08:00
e6a215592e [reland] Early terminate when CUDA assert were thrown (#49799)
Summary:
this is a reland of https://github.com/pytorch/pytorch/issues/49527.

Fixed the slow test not running properly on py36, because capture_output was introduced in py37.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49799

Reviewed By: janeyx99

Differential Revision: D25692616

Pulled By: walterddr

fbshipit-source-id: 9c5352220d632ec8d7464e5f162ffb468a0f30df
2020-12-23 14:25:14 -08:00
3f4b98d568 [numpy] torch.erfinv: promote integer inputs to float (#49155)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49155

Reviewed By: ngimel

Differential Revision: D25664234

Pulled By: mruberry

fbshipit-source-id: 630fd1d334567d78c8130236a67dda0f5ec02560
2020-12-23 14:22:03 -08:00
4d6110939a [pt][quant] Make the CUDA fake quantize logic consistent with CPU fake quantize logic (#49808)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49808

In PyTorch, it uses `dst = std::nearbyint(src * inv_scale) + zero_point` instead of the LEGACY  `dst = std::nearbyint(src * inv_scale + zero_point)`. However, the CUDA implementation doesn't match this. This diff makes the CPU and CUDA implementations consistent.
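
A numpy sketch of where the two formulas diverge (round-half-to-even resolves the tie differently depending on whether zero_point is added before or after rounding):

```python
import numpy as np

def quantize(src, inv_scale, zero_point):
    consistent = np.rint(src * inv_scale) + zero_point  # round first, then shift
    legacy = np.rint(src * inv_scale + zero_point)      # shift first, then round
    return consistent, legacy

print(quantize(2.5, 1.0, 1))  # (3.0, 4.0): rint(2.5) = 2 but rint(3.5) = 4
```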

- FBGEMM code pointer: https://github.com/pytorch/FBGEMM/blob/master/include/fbgemm/QuantUtils.h#L76-L80
- PyTorch code pointer:
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/affine_quantizer.cpp#L306

Test Plan: CI

Reviewed By: dskhudia

Differential Revision: D25694235

fbshipit-source-id: 0a615e559132aafe18543deac1ea5028dd840cb9
2020-12-23 12:47:44 -08:00
e163172904 removes more unused THC functions (#49788)
Summary:
per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49788

Reviewed By: mruberry

Differential Revision: D25693328

Pulled By: ngimel

fbshipit-source-id: 244a096214d110e4c1a94f2847ff8457f1afb0d1
2020-12-23 12:38:20 -08:00
d99a0c3b3e Improve docs for scatter and gather functions (#49679)
Summary:
- Add warning about non-unique indices
- Note that these functions don't broadcast
- Add missing `torch.scatter` and `torch.scatter_add` doc entries
- Fix parameter descriptions
- Improve code examples to make indexing behaviour easier to understand (see also the accumulation sketch below)
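
For illustration, the accumulation behavior the new warning is about:

```python
import torch

src = torch.ones(1, 4)
idx = torch.tensor([[0, 0, 1, 1]])  # non-unique indices along dim 1
out = torch.zeros(1, 2).scatter_add(1, idx, src)
print(out)  # tensor([[2., 2.]]): scatter_add accumulates; plain scatter_ would be order-dependent
```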

Closes gh-48214
Closes gh-26191
Closes gh-37130
Closes gh-34062
xref gh-31776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49679

Reviewed By: mruberry

Differential Revision: D25693660

Pulled By: ngimel

fbshipit-source-id: 4983e7b4efcbdf1ab9f04e58973b4f983e8e43a4
2020-12-23 12:23:15 -08:00
b3387139b4 Mod lists to neutral+descriptive terms in caffe2/docs (#49803)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49803

Per "https://fb.workplace.com/groups/e/permalink/3320810064641820/" we can no longer use the terms "whitelist" and "blacklist", and editing any file containing them results in a critical error signal. Let's embrace the change.
This diff changes "blacklist" to "blocklist" in a number of non-interface contexts (interfaces would require more extensive testing and might interfere with reading stored data, so those are deferred until later).

Test Plan: Sandcastle

Reviewed By: vkuzo

Differential Revision: D25686924

fbshipit-source-id: 117de2ca43a0ea21b6e465cf5082e605e42adbf6
2020-12-23 11:37:11 -08:00
8554b58fbd Added linalg.inv (#48261)
Summary:
This PR adds `torch.linalg.inv` for NumPy compatibility.

`linalg_inv_out` uses in-place operations on provided `result` tensor.

I modified `apply_inverse` to accept a tensor of Int instead of std::vector; that way we can write a function similar to `linalg_inv_out` but without the error checks and device memory synchronization.

I fixed `lda` (leading dimension parameter which is max(1, n)) in many places to handle 0x0 matrices correctly.
Zero batch dimensions are also working and tested.
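
A short usage sketch, including the zero-batch case:

```python
import torch

a = torch.randn(3, 3, dtype=torch.float64)
a_inv = torch.linalg.inv(a)
print(torch.allclose(a @ a_inv, torch.eye(3, dtype=torch.float64)))  # True

print(torch.linalg.inv(torch.randn(0, 3, 3)).shape)  # torch.Size([0, 3, 3])
```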

Ref https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48261

Reviewed By: ngimel

Differential Revision: D25690129

Pulled By: mruberry

fbshipit-source-id: edb2d03721f22168c42ded8458513cb23dfdc712
2020-12-23 11:29:00 -08:00
370350c749 Preserve memory format in qconv op (#49533)
Summary:
* qconv used to return NHWC no matter the input format
* this change returns NCHW format if the input was NCHW

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49533

Test Plan:
pytest test/quantization/test_quantized_op.py::\
TestQuantizedConv::test_qconv2d_preserve_mem_format

Fixes https://github.com/pytorch/pytorch/issues/47295

Reviewed By: kimishpatel

Differential Revision: D25609205

Pulled By: axitkhurana

fbshipit-source-id: 83f8ca4a1496a8a4612fc3da082d727ead257ce7
2020-12-23 10:58:57 -08:00
5171bd94d7 [lint doc] how to fix flake errors if pre-commit hook wasn't there (#49345)
Summary:
This PR adds instructions on what to do if one committed into a PR branch w/o having a pre-commit hook enabled and having CI report flake8 errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49345

Reviewed By: cpuhrsch

Differential Revision: D25683167

Pulled By: soumith

fbshipit-source-id: 3c45c866e1636c116d2cacec438d62c860e6b854
2020-12-23 09:17:40 -08:00
55b431b17a [Gradient Compression] Directly let world_size = group_to_use.size() (#49715)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49715

Address the comment on https://github.com/pytorch/pytorch/pull/49417#discussion_r545388351
ghstack-source-id: 119049598

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D25673997

fbshipit-source-id: 44eb2540e5a77331c34ba503285cbd0bd63c2c0a
2020-12-22 23:24:54 -08:00
88c33ff8ab [Gradient Compression] Explicitly restrict the scope of torch.cuda.synchronize to the current device (#49711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49711

`torch.cuda.synchronize` uses the current device by default. Explicitly specify this device for better readability.
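
For illustration:

```python
import torch

# Before: torch.cuda.synchronize()  -- implicitly the current device.
# After: the device argument is spelled out for readability.
torch.cuda.synchronize(torch.cuda.current_device())
```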

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 119017654

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D25672267

fbshipit-source-id: 62a2266727a2ea76175f3c438daf20951091c771
2020-12-22 23:21:45 -08:00
ee271047b5 torch.utils.checkpoint.checkpoint + torch.cuda.amp (#49757)
Summary:
Adds a test to orphaned original PR (https://github.com/pytorch/pytorch/pull/40221).

Should fix https://github.com/pytorch/pytorch/issues/49738 and https://github.com/pytorch/pytorch/issues/47183
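
A minimal sketch of the combination being fixed (module and sizes are arbitrary assumptions):

```python
import torch
from torch.utils.checkpoint import checkpoint

model = torch.nn.Linear(16, 16).cuda()
x = torch.randn(4, 16, device="cuda", requires_grad=True)
with torch.cuda.amp.autocast():
    y = checkpoint(model, x)  # recomputation in backward replays under autocast
y.float().sum().backward()
```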

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49757

Reviewed By: mruberry

Differential Revision: D25689609

Pulled By: ngimel

fbshipit-source-id: 0a6adc11eb98382048ef9a9775e185dcdeff6010
2020-12-22 22:25:11 -08:00
f474ffa1a9 [quant][graphmode][fx] Change standalone module api (#49719)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49719

We find there are multiple use cases for standalone modules: one use case requires the standalone module
to produce a module that takes a float Tensor as input and outputs a float Tensor; the other needs to
produce a module that takes a quantized Tensor as input and outputs a quantized Tensor.

This is similar to `quantized_input_idxs` and `quantized_output_idxs`, so we want to nest
prepare_custom_config_dict in the standalone module configuration. For maximum flexibility we also
include qconfig_dict for the standalone module, in case the user needs a special qconfig_dict for
the standalone module in the future.

Changed from
```python
prepare_custom_config_dict =
{
  "standalone_module_name": ["standalone_module"],
   "standalone_module_class": [StandaloneModule]
 }
```
to
```python
prepare_custom_config_dict =
{
  "standalone_module_name": [("standalone_module", qconfig_dict1, prepare_custom_config_dict1)],
  "standalone_module_class": [(StandaloneModule, qconfig_dict2, prepare_custom_config_dict2)]
 }
```
The entries in the config are:
1. name/module_class
2. optional qconfig_dict, when it is None, we'll use {"": qconfig} where qconfig is the one from parent qconfig_dict
3. optional prepare_custom_config_dict, when it is None, we'll use default value of prepare_custom_config_dict for prepare API (None)

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_standalone_module

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D25675704

fbshipit-source-id: 0889f519a3e55a7a677f0e2db4db9a18d87a93d4
2020-12-22 21:58:40 -08:00
af1b636b89 [Gradient Compression] Change wait() to value() in some callbacks of PowerSGD communication hook (#49709)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49709

Since wait() has already been called in the return statements of the precursor callbacks, no need to wait again.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 119015237

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D25672068

fbshipit-source-id: da136327db4c4c0e3b846ba8d6885629f1044374
2020-12-22 21:37:04 -08:00
68d438c9da Add PixelUnshuffle (#49334)
Summary:
Adds an implementation of `torch.nn.PixelUnshuffle` as the inverse operation of `torch.nn.PixelShuffle`. This addresses https://github.com/pytorch/pytorch/issues/2456
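
A short sketch of the inverse relationship:

```python
import torch

ps = torch.nn.PixelShuffle(2)
pu = torch.nn.PixelUnshuffle(2)
x = torch.randn(1, 8, 4, 4)       # channels divisible by 2**2
assert torch.equal(pu(ps(x)), x)  # PixelUnshuffle exactly inverts PixelShuffle
```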

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49334

Test Plan:
```
# Unit tests.
python test/test_nn.py TestNN.test_pixel_shuffle_unshuffle

# Module test.
python test/test_nn.py TestNN.test_PixelUnshuffle

# C++ API tests.
build/bin/test_api

# C++ / python parity tests.
python test/test_cpp_api_parity.py

# JIT test.
python test/test_jit.py TestJitGeneratedFunctional.test_nn_pixel_unshuffle

# Override tests.
python test/test_overrides.py

# Type hint tests.
python test/test_type_hints.py
```

Screenshots of rendered docs:
<img width="876" alt="Screen Shot 2020-12-18 at 12 19 05 PM" src="https://user-images.githubusercontent.com/75754324/102642255-6b07bb00-412b-11eb-88fa-e53e7e8ba720.png">
<img width="984" alt="Screen Shot 2020-12-18 at 12 19 26 PM" src="https://user-images.githubusercontent.com/75754324/102642276-70fd9c00-412b-11eb-8548-445082a2db02.png">
<img width="932" alt="Screen Shot 2020-12-18 at 12 19 34 PM" src="https://user-images.githubusercontent.com/75754324/102642704-19abfb80-412c-11eb-9546-95bdd1c3cf22.png">
<img width="876" alt="Screen Shot 2020-12-22 at 12 51 36 PM" src="https://user-images.githubusercontent.com/75754324/102918259-986aa680-4454-11eb-99e7-a0b4c8b3e283.png">
<img width="869" alt="Screen Shot 2020-12-22 at 12 51 44 PM" src="https://user-images.githubusercontent.com/75754324/102918274-9ef91e00-4454-11eb-94bb-91b58aff47d3.png">

Reviewed By: mruberry

Differential Revision: D25401439

Pulled By: jbschlosser

fbshipit-source-id: 209d92ce7295e51699e83616d0c62170a7ce75c8
2020-12-22 20:14:55 -08:00
461aafe389 [numpy] torch.angle: promote integer inputs to float (#49163)
Summary:
**BC-Breaking Note:**

This PR updates PyTorch's angle operator to be consistent with NumPy's. Previously angle would return zero for all floating point values (including NaN). Now angle returns `pi` for negative floating point values, zero for non-negative floating point values, and propagates NaNs.
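
For illustration of the new behavior:

```python
import torch

x = torch.tensor([-1.0, 0.0, 2.0, float("nan")])
print(torch.angle(x))
# tensor([3.1416, 0.0000, 0.0000,    nan])
```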

**PR Summary:**

Reference: https://github.com/pytorch/pytorch/issues/42515

TODO:

* [x] Add BC-breaking note (previously all real numbers returned `0`, even `nan`) -> fixed to match the correct behavior of NumPy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49163

Reviewed By: ngimel

Differential Revision: D25681758

Pulled By: mruberry

fbshipit-source-id: 54143fe6bccbae044427ff15d8daaed3596f9685
2020-12-22 18:43:14 -08:00
46b83212d1 Remove unused six code for Python 2/3 compatibility (#48077)
Summary:
This is basically a reborn version of https://github.com/pytorch/pytorch/issues/45254 .

Ref: https://github.com/pytorch/pytorch/issues/42919

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48077

Reviewed By: ngimel

Differential Revision: D25687042

Pulled By: bugra

fbshipit-source-id: 05f20a6f3c5212f73d0b1505b493b720e6cf74e5
2020-12-22 18:07:08 -08:00
abacf27038 Revert D25623219: [pytorch][PR] early terminate when CUDA assert were thrown
Test Plan: revert-hammer

Differential Revision:
D25623219 (be091600ed)

Original commit changeset: 1b414623ecce

fbshipit-source-id: ba304c57eea29d19550ac1e864ccfcd0cec68bec
2020-12-22 17:57:19 -08:00
010b9c52f4 Skip None submodule during JIT-tracing (#49765)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49765

Some PyTorch modules can have None as a submodule, which causes the following error during JIT tracing:

Repro script:
```
import torch

class TestModule(torch.nn.Module):
  def __init__(self):
    super().__init__()
    self.submod = torch.nn.Linear(3, 4)
    self.submod = None  # overwrite the registered submodule with None

  def forward(self, inputs):
    return inputs

m = TestModule()
tm = torch.jit.trace(m, torch.tensor(1.))
```
Error:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/miniconda3/envs/master_nightly/lib/python3.7/site-packages/torch/jit/_trace.py", line 742, in trace
    _module_class,
  File "/data/miniconda3/envs/master_nightly/lib/python3.7/site-packages/torch/jit/_trace.py", line 928, in trace_module
    module = make_module(mod, _module_class, _compilation_unit)
  File "/data/miniconda3/envs/master_nightly/lib/python3.7/site-packages/torch/jit/_trace.py", line 560, in make_module
    return _module_class(mod, _compilation_unit=_compilation_unit)
  File "/data/miniconda3/envs/master_nightly/lib/python3.7/site-packages/torch/jit/_trace.py", line 1039, in __init__
    submodule, TracedModule, _compilation_unit=None
  File "/data/miniconda3/envs/master_nightly/lib/python3.7/site-packages/torch/jit/_trace.py", line 560, in make_module
    return _module_class(mod, _compilation_unit=_compilation_unit)
  File "/data/miniconda3/envs/master_nightly/lib/python3.7/site-packages/torch/jit/_trace.py", line 988, in __init__
    assert isinstance(orig, torch.nn.Module)
AssertionError
```

This pull request changes the JIT-tracing logic to skip None submodules during tracing.

Test Plan: `buck test mode/dev //caffe2/test:jit -- test_trace_skip_none_submodule`

Reviewed By: wanchaol

Differential Revision: D25670948

fbshipit-source-id: 468f42f5ddbb8fd3de06d0bc224dc67bd7172358
2020-12-22 17:45:35 -08:00
62f9b03b7c [lint] Apply whitespace linter to all gradle files
Summary: Run whitespace and license linters on gradle build files.

Reviewed By: zertosh

Differential Revision: D25687355

fbshipit-source-id: 44330daac7582fed6c05680bffc74e855a9b1dbc
2020-12-22 17:01:51 -08:00
27f0dd36d9 add type annotations to torch.nn.parallel._functions (#49687)
Summary:
Closes gh-49686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49687

Reviewed By: ngimel

Differential Revision: D25680210

Pulled By: zou3519

fbshipit-source-id: 221f7c9a4d3a6213eac6983030b0be51ee1c5b60
2020-12-22 16:56:16 -08:00
de07d07600 fx quant: improve types on convert (#49688)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49688

Adds more type annotations to the FX quantization convert pass, fixing issues as they
are uncovered by mypy.

Test Plan:
```
mypy torch/quantization
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25667231

fbshipit-source-id: 262713c6ccb050a05e3119c0457d0335dde82d25
2020-12-22 16:53:23 -08:00
19f972b696 fx quant: do not observe bias on F.linear (#49628)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49628

Ensures that the linear bias is not observed in an `F.linear` call. This should
give a small speedup in PTQ, and will change numerics (in a good way) for
QAT if someone is using `F.linear`.

Note: the implementation is slightly more verbose compared to conv
because bias is a keyword argument in Linear.

Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps.test_linear_functional_bias_not_observed
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25653532

fbshipit-source-id: c93501bf6b55cbe4a11cfdad6f79313483133a39
2020-12-22 16:53:21 -08:00
c3a7591cef fx quant: do not observe bias on F.conv (#49623)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49623

(not ready for review)

Ensures that conv bias is not observed in an `F.conv{n}d` call.

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25652856

fbshipit-source-id: 884f87be1948d3e049a557d79bec3c90aec34340
2020-12-22 16:49:50 -08:00
b414123264 Update is_floating_point() docs to mention bfloat16 (#49611)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49610 . Explicitly mentions that `is_floating_point()` will return `True` if passed a `bfloat16` tensor.
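
For illustration:
```python
import torch

print(torch.tensor(1.0, dtype=torch.bfloat16).is_floating_point())  # True
print(torch.tensor(1, dtype=torch.int64).is_floating_point())       # False
```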

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49611

Reviewed By: mrshenli

Differential Revision: D25660723

Pulled By: VitalyFedyunin

fbshipit-source-id: 04fab2f6c1c5c2859c6efff1976a92a676b9efa3
2020-12-22 15:54:27 -08:00
67d0c18241 [FX] Try to make it more clear that _update_args_kwargs should not be called (#49745)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49745

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D25682177

Pulled By: jamesr66a

fbshipit-source-id: 4910577541c4d41e1be50a7aa061873f061825b6
2020-12-22 15:20:02 -08:00
2780400904 [numpy] Add torch.xlogy (#48777)
Summary:
Reference https://github.com/pytorch/pytorch/issues/38349
Fixes https://github.com/pytorch/pytorch/issues/22656

TODO:
* [x] Add docs
* [x] Add tests
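
For context, a hedged sketch of the semantics of the operator added here, assuming the NumPy/SciPy xlogy convention that the result is 0 where the first argument is 0:
```python
import torch

x = torch.tensor([0.0, 1.0, 2.0])
y = torch.tensor([0.0, 2.0, 4.0])
print(torch.xlogy(x, y))  # tensor([0.0000, 0.6931, 2.7726])
print(x * torch.log(y))   # tensor([   nan, 0.6931, 2.7726]) -- naive form hits 0 * -inf
```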

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48777

Reviewed By: ngimel

Differential Revision: D25681346

Pulled By: mruberry

fbshipit-source-id: 369e0a29ac8a2c44de95eec115bf75943fe1aa45
2020-12-22 15:05:59 -08:00
be091600ed early terminate when CUDA assert were thrown (#49527)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49019

I marked the test_testing function as slow since it took ~1 minute to finish the subprocess test suite.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49527

Reviewed By: malfet

Differential Revision: D25623219

Pulled By: walterddr

fbshipit-source-id: 1b414623ecce14aace5e0996d5e4768a40e12e06
2020-12-22 14:33:41 -08:00
9b6fb856e8 Update NNPACK (#49749)
Summary:
This update enables NNPACK cross-compilation on macOS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49749

Reviewed By: janeyx99

Differential Revision: D25683056

Pulled By: malfet

fbshipit-source-id: c7a6b7f49d61a9a0697d67f6319f06bd252b66a5
2020-12-22 14:20:37 -08:00
6f9532dd53 only upload s3 stats on master, nightly, and release branch (#49645)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49645

Reviewed By: malfet

Differential Revision: D25665851

Pulled By: walterddr

fbshipit-source-id: 1cf50f6e3657f70776aaf3c5d3823c8a586bf22d
2020-12-22 14:15:18 -08:00
04e04abd06 remove unused THCBlas (#49725)
Summary:
Removes the unused THCBlas and calls `at::cuda::blas::gemm` directly where needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49725

Reviewed By: mruberry

Differential Revision: D25680831

Pulled By: ngimel

fbshipit-source-id: d826f3f558b156f45f2a4864daf3f6d086bda78c
2020-12-22 13:55:22 -08:00
1451d84766 Minor doc fix: change truncating to rounding in TF32 docs (#49625)
Summary:
Minor doc fix clarifying that the input data is rounded, not truncated.

CC zasdfgbnm ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49625

Reviewed By: mruberry

Differential Revision: D25668244

Pulled By: ngimel

fbshipit-source-id: ac97e41e0ca296276544f9e9f85b2cf1790d9985
2020-12-22 13:46:33 -08:00
21398fb6cb Fix get_overlap_status for tensors without storage (#49638)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49638

Reviewed By: ngimel

Differential Revision: D25681908

Pulled By: asuhan

fbshipit-source-id: 2ea8623614f2f0027f6437cf2819ba1657464f54
2020-12-22 12:38:59 -08:00
c23808d8e8 Reland: Add base forward grad logic (#49734)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49734

RFC: https://github.com/pytorch/rfcs/pull/11

This PR adds the basic logic to handle forward grads as dual Tensors.
It contains the following:
- Mechanism to save dual state on a Tensor and clear it up when the dual level ends
- C++ and python user facing API (see the sketch after this list)
- Updated view system that is able to track both forward and backward views
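
A minimal sketch of the Python user-facing API mentioned above (assuming it matches today's `torch.autograd.forward_ad` module; only level 0 is supported, per the limitations below):
```python
import torch
import torch.autograd.forward_ad as fwAD

primal = torch.randn(3)
tangent = torch.randn(3)
with fwAD.dual_level():
    dual = fwAD.make_dual(primal, tangent)  # attach the forward grad state
    out = dual * 2
    _, jvp = fwAD.unpack_dual(out)  # forward gradient of out: 2 * tangent
    assert torch.allclose(jvp, 2 * tangent)
```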

The current PR has the following limitations:
- Extensive tests are in the next PR in the stack as formulas are needed to write full tests.
- Only the manual formulas have been audited and no other formula is actually implemented here (they are in the next PR in the stack)
- Only level 0 is allowed for now. It was discussed and agreed that more levels are not needed for the first version of this PR.
- We can save one ViewInfo creation when both the forward and backward views have the same base. This can be done by adding a boolean flag to the DifferentiableViewMeta and extra logic in the `as_view` method. This is left out to keep this PR concise.
- We can skip tracking forward views if the base has a forward grad. This can be done by adding extra logic in the `as_view` method. This is left out to keep this PR concise.

Reading guide:
- Updated view handling in [gen_variable_type.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-f6553cec68caeaea36f6c8b14ff76a6d39dfd774e0ea9ef2f76e8d81fd9af5df), [VariableTypeUtils.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-ec71cfa45954dece1236c661d170e6341879c5be637f4abf52e826d61b40695a), [variable.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-60e3bfe444e89efc7149f25b38e472710525984789934ab83f1bd5671b8ff285) (skip code below "[Forward Grad View]" for now), [variable.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-1604bcd0e4350ed99ec45e437cee7ac9ebe337392c9ea16a236247aeeb35b02bR266-R542) and [custom_function.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-dd85f452082b5bb6612bbc12adb496f8827defa228509f7b493de1d517522d5d). This introduces the new ViewInfo to hold view information shared for forward and backward. It also updates the differentiable view meta to use this. And it updates the as_view function to handle both forward and backward views.
- New forward grad class that handles storing gradients and tracking at each level [forward_grad.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-c6c5b9ab2d7e5dde4102495faa1b6bbbfc23aa3e47deb7359c0bfe1eb004c0cb), [forward_grad.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-de2ab54ade7312701850d71a119a4f4ee4b9fc5a9c42a467cdd4e73c033531dd) and [build_variables.bzl](https://github.com/pytorch/pytorch/pull/49097/files#diff-dfdfa2efb17beddfd9094524f95351fd197db6c8857e96b436fb599870359325). EDIT: These files also contain the new flag to globally disable forward AD that allows us to reduce performance issues while this is in development.
- Lowest level API and binding between Tensor and AutogradMeta in [TensorBody.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-7554853205392fa743357bf845ecc350a974ec049383248c12daaf2f4de04911), [TensorImpl.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-052bd9150ef8e09289ddf644b5a6830ede49207201cd41728f6d7cc6d9cead94), [TensorImpl.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-a15aae4cf23da44970db7cece62ff981265575c798c62f7b52d87c8809dfe2e1) and the rest of [variable.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-60e3bfe444e89efc7149f25b38e472710525984789934ab83f1bd5671b8ff285R557-R677)
- API to access the forward primal that needs to be a differentiable function (and so in native_functions.yaml) [native_functions.yaml](https://github.com/pytorch/pytorch/pull/49097/files#diff-2f3dbd85efb9b5172f2264eedd3be47dd765e6ab7cc8bf3ade5e62c28ae35991) [NamedRegistrations.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-69bd3bea510c9b64e1633fa18c3ea63d4b8348dbad3a78ad9de844ab3e43dc1d), [VariableMethodsStub.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-23f5fcb737a2b289811fe0f4b65aef775e7c824b2e629ecd343df51405cd434f), [derivatives.yaml](https://github.com/pytorch/pytorch/pull/49097/files#diff-e4c2f99a2404e98c3586e07425da73008f36b1bada790648a7297af141d37f8c), [gen_python_functions.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-e4c2f99a2404e98c3586e07425da73008f36b1bada790648a7297af141d37f8c), [gen_trace_type.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-54e0b976027bf8debefb959ff360b89ae93466970c843365b1b3a03806d868ce), [TraceTypeManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-f34636741ad4a23d018e0c289bc750c3bad887b45660e1d6eaf440d234a78fbf) and [part of VariableTypeManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-6e19a1bce8cbdba8714b6e2c794a76bc0864b64a49cfa757cb0b5afdc937d1a4R198-R243)
- c++ API [autograd.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-349028fbe8291a965a7a263c323b208fe071c35c66179ee997ef84fa81aa4b1e), [autograd.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-a3fe908d67dfec16a1fcde300de68b0701bf68b88db7451f29f2bee255cf30c9)
- python binding [init.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-c58a67c85191c22c9b3bb439117d8053edfd9dea839fa010cf967d404c3c630d)
- python API [forward_ad.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-a4efad4ba18fffdfb264c21e5475997a24a743089a899f8ec1a5ff962c6738d9), [autograd/__init__.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-743abcafd32ad0e69f39ac5a91df4197b7e1921c135cacee7ef6dc829a8a7af8)
- c++ and python printing [Formatting.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-881dba501e71662e2e4818b4b016f739b344c8aed2f5edc6b871eda47a2aced0), [_tensor_str.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-a7911f8d5e73adbff914d99fd7818ace2a7030b6a3748abe06ec6fc6e3df9cc3)
- Utility for formulas and updated manual functions to respect new view system as well as forward grad [FunctionsManual.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-6378bb6dc81a64dab676d61731341fa5d1088418f32a1473a33a0ccfc2357dc1), [FunctionsManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-4adbd88239afcd60e8198aab65d4f5e43b62314e34b80551e997a1ea503adea5) [rest of VariableTypeManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-6e19a1bce8cbdba8714b6e2c794a76bc0864b64a49cfa757cb0b5afdc937d1a4R264-R433)
- Ensure SavedVariable save forward grad properly [saved_variable.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-c1b8039d776241abe177d5aa99b79dd9489a9b3e529da8ab24c2e386c1238ae2), [saved_variable.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-cc9fba479b5beae06b2eea2e390d17796e0341c5b037a20b5bcaccbb0c341030)

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D25678797

Pulled By: albanD

fbshipit-source-id: 3d58550c11b5f58b9b73fd30596d042b857fb9dd
2020-12-22 12:11:27 -08:00
eabe05ab72 [onnxifi] Get rid of class member (#49380)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49380

Couldn't resist removing a class member that is only used in one function.

Reviewed By: yinghai

Differential Revision: D25547366

fbshipit-source-id: 74e61c6a0068566fb7956380862999163e7e94bf
2020-12-22 12:02:52 -08:00
7b4a7661d6 Make PyTorch partially cross-compilable for Apple M1 (#49701)
Summary:
Update CPUINFO to include https://github.com/pytorch/cpuinfo/pull/51
Update sleef to include https://github.com/shibatch/sleef/pull/376
Modify aten/src/ATen/native/quantized/cpu/qnnpack/CMakeLists.txt to recognize CMAKE_OSX_ARCHITECTURES

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49701

Test Plan: `cmake -DCMAKE_OSX_ARCHITECTURES=x86_64 -DPYTHON_EXECUTABLE=/usr/bin/python3  -DUSE_XNNPACK=NO -DBUILD_TEST=YES .. -G Ninja; ninja basic` finishes successfully on Apple M1

Reviewed By: janeyx99

Differential Revision: D25669219

Pulled By: malfet

fbshipit-source-id: 5ee36b64e3a7ac76448f2a300ac4993375a26de5
2020-12-22 09:33:12 -08:00
42b5601f30 [ROCm] add 4.0 to nightly builds (#49632)
Summary:
Depends on https://github.com/pytorch/builder/pull/614.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49632

Reviewed By: ngimel

Differential Revision: D25665880

Pulled By: walterddr

fbshipit-source-id: b37a55b7e3028648453b422683fa4a72e0ee04a4
2020-12-22 08:41:13 -08:00
4d9d03fe47 Complex backward for torch.sqrt (#49461)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49461

resolves https://github.com/pytorch/pytorch/issues/48398

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D25589454

Pulled By: anjali411

fbshipit-source-id: 46e9f913c8ab3e18c98d6f623b2394044b6fe079
2020-12-22 07:58:42 -08:00
2df249f0ab [fix] inplace remainder/% (#49390)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49214

**BC-Breaking**
Before this PR, `%=` didn't actually do the operation in place and returned a new tensor.
After this PR, the `%=` operation is actually in place and the modified input tensor is returned.

Before PR,
```python
>>> import torch
>>> a = torch.tensor([11,12,13])
>>> id(a)
139627966219328
>>> a %= 10
>>> id(a)
139627966219264
```

After PR,
```python
>>> import torch
>>> a = torch.tensor([11,12,13])
>>> id(a)
139804702425280
>>> a %= 10
>>> id(a)
139804702425280
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49390

Reviewed By: izdeby

Differential Revision: D25560423

Pulled By: zou3519

fbshipit-source-id: 2b92bfda260582aa4ac22c4025376295e51f854e
2020-12-22 07:30:03 -08:00
dfb7520c47 NewModuleTest: Don't call both check_jacobian and gradcheck (#49566)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49566

Fixes #49422.

check_jacobian and gradcheck do roughly the same thing: they both
compute an analytic jacobian and a numeric jacobian and check that
they are equivalent. Furthermore, NewModuleTest will (by default) call
both check_jacobian and gradcheck, leading to some redundant checks that
waste CI resources.

However, there is one subtle difference: `check_jacobian` can handle the
special case where a Module takes in dense inputs and dense parameters
but returns sparse gradients, but that is not something gradcheck can
handle. This is only used in the tests for nn.Embedding and
nn.EmbeddingBag.

This PR does the following:
- have NewModuleTest call gradcheck instead of check_jacobian by default
- add a new "has_sparse_gradients" flag to NewModuleTest. These are True
for the nn.Embedding and nn.EmbeddingBag sparse gradient tests. If
`has_sparse_gradients` is True, then we call check_jacobian, otherwise,
we call gradcheck.
- Kills the "jacobian_input" flag. This flag was used to tell
NewModuleTest to not attempt to compute the jacobian for the inputs to
the module. This is only desirable if the input to the module isn't
differentiable and was only set in the case of nn.Embedding /
nn.EmbeddingBag that take a LongTensor input. `gradcheck` handles these
automatically by not checking gradients for non-differentiable inputs.

Test Plan:
- Code reading
- run test_nn.py

Reviewed By: albanD

Differential Revision: D25622929

Pulled By: zou3519

fbshipit-source-id: 8d831ada98b6a95d63f087ea9bce1b574c996a22
2020-12-22 06:48:31 -08:00
c348faedc4 [Gradient Compression] Warm-start of PowerSGD (#49451)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49451

Reuse the low-rank tensors P(s) and Q(s) from the previous iteration if possible.

This can give a better compression performance in terms of both accuracy and speed.

Also add a unit test for batched PowerSGD to test_c10d.py.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 119014132

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D25583086

fbshipit-source-id: a757df3c4cfcc0ead4647f7de2f43198f1e063ee
2020-12-22 01:19:14 -08:00
590e7168ed [PyTorch] Remove direct reference to native symbols in sparse related non-native codes (#49721)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49721

As part of the per-app selective build refactoring effort, we are decoupling ATen/native from the rest of ATen (D25413998).
All symbols of ATen/native could only be referenced through dispatcher (https://github.com/pytorch/pytorch/issues/48684).

This diff is to decouple the native reference recently introduced for sparse tensors.
ghstack-source-id: 119028080

Test Plan: CI

Reviewed By: dhruvbird, ngimel

Differential Revision: D25675711

fbshipit-source-id: 381cbb3b361ee41b002055399d4996a9ca21377c
2020-12-21 22:16:20 -08:00
d54cf2aa27 [pt][ATen] Optimize bmm (#49506)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49506

- Get rid of expensive stuff like `TensorArg`, `checkBackend`, `checkSize`, and `TensorAccessor`.
- Add `checkDim` that does not require creating a `TensorArg` which incurs a refcount bump
- Avoid unnecessary calls to `torch.select`, which goes through the dispatcher, in the cases we care about: mat1 and mat2 either not permuted or permuted with dims = [0, 2, 1]. The PyTorch version of bmm supports unusual cases, such as inputs permuted with dims = [1, 2, 0], which are uncommon in SparseNNs.

Test Plan:
Unit test:
```
buck test //caffe2/test:linalg
```

Benchmark with the adindexer model:
```
Before:
I1216 14:02:24.155516 2595800 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0847197. Iters per second: 11803.6
After:
I1216 14:02:26.583878 2595939 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.082051. Iters per second: 12187.5
```

Reviewed By: bwasti

Differential Revision: D25577574

fbshipit-source-id: 8aba69b950e7b4d9d1b14ba837931695a908c068
2020-12-21 22:08:39 -08:00
11598da229 [FX] Fix python code having spurious newlines from placeholders (#49720)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49720

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D25675825

Pulled By: jamesr66a

fbshipit-source-id: a9028acad9c8feb877fff5cd09aedabed52a3f4b
2020-12-21 21:41:24 -08:00
edce6b138d fx quant: fix types on _find_quants (#49616)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49616

Add types to `_find_quants` I/O and fix resulting errors,
needed for an upcoming bug fix.

Test Plan:
```
mypy torch/quantization
python test/test_quantization.py TestQuantizeFx
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25645719

fbshipit-source-id: 4bf788b55fd4fd086c83a4438b9c2df22b9cff49
2020-12-21 21:05:57 -08:00
7c90b20f38 fx quant: add types to observed_module.py (#49607)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49607

Readability

Test Plan:
```
mypy torch/quantization
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25643895

fbshipit-source-id: b4b8741b07ac4827c3bacd2084df81fbfdd0c2d5
2020-12-21 21:05:53 -08:00
9d5d193704 fx quant: types for fusion_patterns.py (#49606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49606

Adds more types, for readability.

Test Plan:
```
mypy torch/quantization
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25643894

fbshipit-source-id: 4aad52fe4e59ad74b6e0e3acd0f98fba91561a29
2020-12-21 21:05:49 -08:00
ab2194f912 unbreak mypy torch/quantization (#49549)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49549

Somehow `mypy torch/quantization` got broken in the past couple of days:
https://gist.github.com/vkuzo/07af454246f0a68e6fa8929beeec7e0d
.  I didn't see any relevant PRs other than
https://github.com/pytorch/pytorch/pull/47725, which doesn't seem
related. The error doesn't seem real, as the arguments to
`_cudnn_rnn_flatten_weight` seem correct. For now,
ignoring the failure so we have a clean `mypy` run on
`torch/quantization`.

Test Plan:
```
mypy torch/quantization
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25616972

fbshipit-source-id: 46c207fe1565ec949c0b1f57d6cd0c93f627e6bd
2020-12-21 21:02:48 -08:00
a5b27d7a31 [TensorExpr] Move SimpleIREval implementation from .h to .cpp. (#49697)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49697

Mostly mechanical move. This refactoring helps to hide unnecessary
details from the SimpleIREval interface and make it more similar to a
pure 'codegen'.

Test Plan: Imported from OSS

Reviewed By: nickgg

Differential Revision: D25668696

Pulled By: ZolotukhinM

fbshipit-source-id: 423247bfcdfa88403e8ec92152f00110bb9da19c
2020-12-21 20:20:15 -08:00
e1f73ced1e [TensorExpr] Change LoopNest::vectorize to accept For* instead of Stmt*. (#49696)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49696

And make it static.

Test Plan: Imported from OSS

Reviewed By: navahgar, nickgg

Differential Revision: D25668695

Pulled By: ZolotukhinM

fbshipit-source-id: 8d7fb507d6f3beca70e868d9e0f4c46247311a99
2020-12-21 20:17:20 -08:00
f5178bf151 Revert D25607503: Add base forward grad logic
Test Plan: revert-hammer

Differential Revision:
D25607503 (fdf02eff3d)

Original commit changeset: f1396290de1d

fbshipit-source-id: 057206e28ff48ee288856adfe3ca577d4880789f
2020-12-21 19:56:28 -08:00
aa2782b9ec replacing THC_CLASS and THC_API with TORCH_CUDA_API (#49690)
Summary:
THC_API and THC_CLASS were leftover macros from before the consolidation of caffe2, aten, and torch. Now that they're combined, these are misleading and should just be TORCH_CUDA_API. The only file I manually edited was `THCGeneral.h.in`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49690

Reviewed By: malfet

Differential Revision: D25667982

Pulled By: janeyx99

fbshipit-source-id: 2fdf7912b2a0537b7c25e1fed21cc301fa59d57f
2020-12-21 19:21:22 -08:00
7eb392d73f Fix TCPStore type coercion (#49685)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49052

The TCPStore example with 4 arguments was working because the datetime value was being implicitly converted to a bool. Modified the pybind definition and updated documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49685

Test Plan:
```
import torch.distributed as dist
from datetime import timedelta

dist.TCPStore("127.0.0.1", 0, True, timedelta(seconds=30))
```

Now fails with
```
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
    1. torch._C._distributed_c10d.TCPStore(host_name: str, port: int, world_size: int, is_master: bool, timeout: datetime.timedelta = datetime.timedelta(seconds=300))

Invoked with: '127.0.0.1', 0, True, datetime.timedelta(seconds=30)
```

Reviewed By: mrshenli, ngimel

Differential Revision: D25668021

Pulled By: H-Huang

fbshipit-source-id: ce40b8648d0a414f0255666fbc680f1a66fae090
2020-12-21 19:04:15 -08:00
1043ecf68d Use store based barrier only for certain store types. (#49694)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49694

The store based barrier introduced in
https://github.com/pytorch/pytorch/pull/49419 broke for certain store types.
This is a quick fix to resolve the issues for other store types.
ghstack-source-id: 119006874

Test Plan: 1) waitforbuildbot

Reviewed By: ppwwyyxx, rohan-varma

Differential Revision: D25668404

fbshipit-source-id: 751fb8b229ad6f50ee9c50f63a70de5a91c9eda5
2020-12-21 18:41:28 -08:00
7e1356db7b Move device guard from MultiTensorApply.cuh (#46664)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46664

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D24453343

Pulled By: izdeby

fbshipit-source-id: b82a658af50ededc985195ed02dbf60e792c7a13
2020-12-21 18:08:54 -08:00
5b163e230a [jit][tracer] allow traced modules to return dicts with tuple values when strict=False (#49568)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49568

We have some inference use cases where the expected output of a module is of the form `{"key": (t1, t1)}`, and we are currently jit-tracing the modules until we can reach jit script compatibility.
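
A hedged sketch of such a use case (the module and key names are made up):
```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return {"key": (x, x)}

# strict=False allows the traced module to return a dict with tuple values.
traced = torch.jit.trace(M(), torch.randn(2), strict=False)
print(traced(torch.randn(2)))
```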

Test Plan: buck test mode/dev caffe2/test:jit -- 'test_trace_returning_complex_dict'

Reviewed By: houseroad

Differential Revision: D25624152

fbshipit-source-id: 5adef0e3c9d54cd31ad5fece4ac6530d541fd673
2020-12-21 15:35:46 -08:00
46c9a0e679 Do not use negative values in GCD computation. (#49379)
Summary:
GCD should always return positive integers. When negative values are used, we hit a corner case that results in an infinite recursion during simplification.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49379

Reviewed By: ezyang

Differential Revision: D25597115

Pulled By: navahgar

fbshipit-source-id: b0e8ac07ee50a5eb775c032628d4840df7424927
2020-12-21 15:08:43 -08:00
fdf02eff3d Add base forward grad logic (#49097)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49097

RFC: https://github.com/pytorch/rfcs/pull/11

This PR adds the basic logic to handle forward grads as dual Tensors.
It contains the following:
- Mechanism to save dual state on a Tensor and clear it up when the dual level ends
- C++ and python user facing API
- Updated view system that is able to track both forward and backward views

The current PR has the following limitations:
- Extensive tests are in the next PR in the stack as formulas are needed to write full tests.
- Only the manual formulas have been audited and no other formula is actually implemented here (they are in the next PR in the stack)
- Only level 0 is allowed for now. It was discussed and agreed that more levels are not needed for the first version of this PR.
- We can save one ViewInfo creation when both the forward and backward views have the same base. This can be done by adding a boolean flag to the DifferentiableViewMeta and extra logic in the `as_view` method. This is left out to keep this PR concise.
- We can skip tracking forward views if the base has a forward grad. This can be done by adding extra logic in the `as_view` method. This is left out to keep this PR concise.

Reading guide:
- Updated view handling in [gen_variable_type.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-f6553cec68caeaea36f6c8b14ff76a6d39dfd774e0ea9ef2f76e8d81fd9af5df), [VariableTypeUtils.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-ec71cfa45954dece1236c661d170e6341879c5be637f4abf52e826d61b40695a), [variable.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-60e3bfe444e89efc7149f25b38e472710525984789934ab83f1bd5671b8ff285) (skip code below "[Forward Grad View]" for now), [variable.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-1604bcd0e4350ed99ec45e437cee7ac9ebe337392c9ea16a236247aeeb35b02bR266-R542) and [custom_function.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-dd85f452082b5bb6612bbc12adb496f8827defa228509f7b493de1d517522d5d). This introduces the new ViewInfo to hold view information shared for forward and backward. It also updates the differentiable view meta to use this. And it updates the as_view function to handle both forward and backward views.
- New forward grad class that handles storing gradients and tracking at each level [forward_grad.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-c6c5b9ab2d7e5dde4102495faa1b6bbbfc23aa3e47deb7359c0bfe1eb004c0cb), [forward_grad.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-de2ab54ade7312701850d71a119a4f4ee4b9fc5a9c42a467cdd4e73c033531dd) and [build_variables.bzl](https://github.com/pytorch/pytorch/pull/49097/files#diff-dfdfa2efb17beddfd9094524f95351fd197db6c8857e96b436fb599870359325). EDIT: These files also contain the new flag to globally disable forward AD that allows us to reduce performance issues while this is in development.
- Lowest level API and binding between Tensor and AutogradMeta in [TensorBody.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-7554853205392fa743357bf845ecc350a974ec049383248c12daaf2f4de04911), [TensorImpl.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-052bd9150ef8e09289ddf644b5a6830ede49207201cd41728f6d7cc6d9cead94), [TensorImpl.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-a15aae4cf23da44970db7cece62ff981265575c798c62f7b52d87c8809dfe2e1) and the rest of [variable.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-60e3bfe444e89efc7149f25b38e472710525984789934ab83f1bd5671b8ff285R557-R677)
- API to access the forward primal that needs to be a differentiable function (and so in native_functions.yaml) [native_functions.yaml](https://github.com/pytorch/pytorch/pull/49097/files#diff-2f3dbd85efb9b5172f2264eedd3be47dd765e6ab7cc8bf3ade5e62c28ae35991) [NamedRegistrations.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-69bd3bea510c9b64e1633fa18c3ea63d4b8348dbad3a78ad9de844ab3e43dc1d), [VariableMethodsStub.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-23f5fcb737a2b289811fe0f4b65aef775e7c824b2e629ecd343df51405cd434f), [derivatives.yaml](https://github.com/pytorch/pytorch/pull/49097/files#diff-e4c2f99a2404e98c3586e07425da73008f36b1bada790648a7297af141d37f8c), [gen_python_functions.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-e4c2f99a2404e98c3586e07425da73008f36b1bada790648a7297af141d37f8c), [gen_trace_type.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-54e0b976027bf8debefb959ff360b89ae93466970c843365b1b3a03806d868ce), [TraceTypeManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-f34636741ad4a23d018e0c289bc750c3bad887b45660e1d6eaf440d234a78fbf) and [part of VariableTypeManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-6e19a1bce8cbdba8714b6e2c794a76bc0864b64a49cfa757cb0b5afdc937d1a4R198-R243)
- c++ API [autograd.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-349028fbe8291a965a7a263c323b208fe071c35c66179ee997ef84fa81aa4b1e), [autograd.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-a3fe908d67dfec16a1fcde300de68b0701bf68b88db7451f29f2bee255cf30c9)
- python binding [init.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-c58a67c85191c22c9b3bb439117d8053edfd9dea839fa010cf967d404c3c630d)
- python API [forward_ad.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-a4efad4ba18fffdfb264c21e5475997a24a743089a899f8ec1a5ff962c6738d9), [autograd/__init__.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-743abcafd32ad0e69f39ac5a91df4197b7e1921c135cacee7ef6dc829a8a7af8)
- c++ and python printing [Formatting.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-881dba501e71662e2e4818b4b016f739b344c8aed2f5edc6b871eda47a2aced0), [_tensor_str.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-a7911f8d5e73adbff914d99fd7818ace2a7030b6a3748abe06ec6fc6e3df9cc3)
- Utility for formulas and updated manual functions to respect new view system as well as forward grad [FunctionsManual.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-6378bb6dc81a64dab676d61731341fa5d1088418f32a1473a33a0ccfc2357dc1), [FunctionsManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-4adbd88239afcd60e8198aab65d4f5e43b62314e34b80551e997a1ea503adea5) [rest of VariableTypeManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-6e19a1bce8cbdba8714b6e2c794a76bc0864b64a49cfa757cb0b5afdc937d1a4R264-R433)
- Ensure SavedVariable save forward grad properly [saved_variable.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-c1b8039d776241abe177d5aa99b79dd9489a9b3e529da8ab24c2e386c1238ae2), [saved_variable.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-cc9fba479b5beae06b2eea2e390d17796e0341c5b037a20b5bcaccbb0c341030)

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D25607503

Pulled By: albanD

fbshipit-source-id: f1396290de1d75760f3d380c43cdd56e86fa6099
2020-12-21 14:39:43 -08:00
befe337072 Fix test_cuda_init_race skip rules (#49693)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49693

Reviewed By: walterddr, janeyx99

Differential Revision: D25668027

Pulled By: malfet

fbshipit-source-id: 802cbd39e4ebe585709179f332b680f5f7978814
2020-12-21 14:30:00 -08:00
983bfc79ed Enable product for bool tensor (#48637)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48351

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48637

Reviewed By: mrshenli

Differential Revision: D25658596

Pulled By: mruberry

fbshipit-source-id: ff3ada74b6d281c8e4753ed38339a1c036f722ee
2020-12-21 14:11:26 -08:00
49c9994fb7 Clean up backward compatibility skip list (#49691)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49691

Quite a few stale items, let's make the list short.

Test Plan: oss ci

Reviewed By: hl475

Differential Revision: D25667464

fbshipit-source-id: cff1be8b5e0068470b3f621acf6bf4fbd414233e
2020-12-21 13:40:30 -08:00
92f37ae263 change block codegen to handle new inlining in NNC (#47687)
Summary:
Minor changes to block codegen to handle the new inlining in NNC.
For Block code generation we need to delay inlining until after we have collected dimension data about the tensors.
We need to collect the tensors' dimensions before they are flattened; we don't have this information after the inlining pass, so for Block we run inlining after we have collected this data using the `CreateBufferMap` analysis.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47687

Reviewed By: ZolotukhinM

Differential Revision: D24864869

Pulled By: protonu

fbshipit-source-id: 9574c0599f7d959a1cf0eb49d4e3e541cbe9b1d3
2020-12-21 13:36:25 -08:00
476cabdfff added macros in jit logging to check whether logging is enabled; replaced similar checks in LLVM codegen with such macros (#49121)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49121

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D25445971

Pulled By: huiguoo

fbshipit-source-id: 980775a94159aa0b3b66fae938962761b38703d5
2020-12-21 13:01:22 -08:00
aebb7d1836 converted current debugging statements in LLVM codegen to jit-logging statements #48771 (#49040)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49040

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D25407356

Pulled By: huiguoo

fbshipit-source-id: 1c1f893ed8d0877bee27e9a673a5dce2203c2bad
2020-12-21 12:58:12 -08:00
f7a085af98 Dynamic GRU quantization support (#49448)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49448

ghstack-source-id: 118982171

Test Plan:
buck test caffe2/test:quantization --  'test_qlstmGRU \(quantization\.test_quantized_op\.TestDynamicQuantizedRNNOp\)' --print-passing-details
buck test caffe2/test:quantization --  'test_quantized_rnn \(quantization\.test_quantize\.TestPostTrainingDynamic\)' --print-passing-details
buck test caffe2/test:quantization --  'test_qrnncell \(quantization\.test_quantized_op\.TestDynamicQuantizedRNNOp\)' --run-disabled --print-passing-details

Reviewed By: vkuzo

Differential Revision: D25579815

fbshipit-source-id: 413cc8888eb8058230b94c9576d2fa54b0ed1416
2020-12-21 12:36:59 -08:00
a84b93a6f8 add close() method to tqdm mock (#46040)
Summary:
In `torchvision` we use [`torch.hub.tqdm`](2cc20d7485/torchvision/datasets/utils.py (L11)) to display the dataset download. One of our methods uses [`tqdm().close()`](2cc20d7485/torchvision/datasets/utils.py (L188)), which is [not included in the mock](283ae1998c/torch/hub.py (L22-L49)). This PR adds a `close()` method to the mock.

Cc fmassa

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46040

Reviewed By: mrshenli

Differential Revision: D25619429

Pulled By: fmassa

fbshipit-source-id: a137f2417d8a47923ccb1ec6b7d5298c1545245c
2020-12-21 12:24:30 -08:00
12942ea52b [BE] Introduce set_cwd context manager (#49657)
Summary:
Introduces a context manager used to temporarily change the working directory and restore it even if an exception is raised.
Use it in test_type_hints and during code coverage collection
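
A minimal sketch of such a context manager (the actual implementation lives in the PR; this only illustrates the idea):
```python
import os
from contextlib import contextmanager

@contextmanager
def set_cwd(path):
    old_cwd = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(old_cwd)  # restored even if the body raises
```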

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49657

Reviewed By: walterddr

Differential Revision: D25660543

Pulled By: malfet

fbshipit-source-id: 77f08d57e4b60b95daa4068d0dacf7c25f978526
2020-12-21 12:08:48 -08:00
44ce0b8883 Sparse-sparse matrix multiplication (CPU/CUDA) (#39526)
Summary:
This PR implements matrix multiplication support for 2-d sparse tensors using the COO sparse format.

The current implementation of `torch.sparse.mm` supports the configuration
`torch.sparse.mm(sparse_matrix1, sparse_matrix2.to_dense())`, but this can use a lot of memory when sparse_matrix2's shape is large.

This implementation extends the `torch.sparse.mm` function to support `torch.sparse.mm(sparse_matrix1, sparse_matrix2)`.
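
A hedged sketch of the newly supported sparse-sparse path:
```python
import torch

a = torch.randn(4, 6).relu().to_sparse()
b = torch.randn(6, 3).relu().to_sparse()
c = torch.sparse.mm(a, b)  # sparse result; b is never densified
print(c.to_dense())
```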

Resolves  #[20988](https://github.com/pytorch/pytorch/issues/20988) for CPU/CUDA.

- [x] sparse matmul
  - [x] CPU/CUDA C++ implementation
  - [x] unittests
  - [x] update torch.sparse.mm documentation
  - [x] autograd support

The CPU sparse-sparse matmul was implemented taking the work "Sparse Matrix Multiplication Package (SMMP)" as a reference. The GPU sparse-sparse matmul is based on cuSPARSE; there is specific code for CUSPARSE_VERSION >= 11 and for older versions of cuSPARSE. Both CPU and CUDA rely on the sparse-sparse matmul algorithm using the CSR indices format, as it is one of the fastest algorithms.

Here are the latest benchmark results (script is here) for torch.sparse.mm (CUDA), torch.sparse.mm (CPU) and scipy; values are float32 scalars:

size | density | sparse.mm(CUDA) | sparse.mm(CPU) | scipy_coo_matmul
-- | -- | -- | -- | --
(32, 10000) | 0.01 | 822.7 | 79.4 | 704.1
(32, 10000) | 0.05 | 1741.1 | 402.6 | 1155.3
(32, 10000) | 0.1 | 2956.8 | 840.8 | 1885.4
(32, 10000) | 0.25 | 6417.7 | 2832.3 | 4665.2
(512, 10000) | 0.01 | 1010.2 | 3941.3 | 26937.7
(512, 10000) | 0.05 | 2216.2 | 26903.8 | 57343.7
(512, 10000) | 0.1 | 4868.4 | 87773.7 | 117477.0
(512, 10000) | 0.25 | 16639.3 | 608105.0 | 624290.4
(1024, 10000) | 0.01 | 1224.8 | 13088.1 | 110379.2
(1024, 10000) | 0.05 | 3897.5 | 94783.9 | 236541.8
(1024, 10000) | 0.1 | 10559.1 | 405312.5 | 525483.4
(1024, 10000) | 0.25 | 57456.3 | 2424337.5 | 2729318.7

A new backward algorithm was implemented using only `sparse @ sparse` and `sparse_mask` operations. Here is some benchmarking:

```
[------------------------- sparse.mm-backward -------------------------]
                            |   sparse.backward   |  dense.backward
 -----------------------------------------------------------------------
      (32, 10000) | 0.01    |            13.5          |         2.4
      (32, 10000) | 0.05    |            52.3          |         2.4
      (512, 10000) | 0.01   |          1016.8          |       491.5
      (512, 10000) | 0.05   |          1604.3          |       492.3
      (1024, 10000) | 0.01  |          2384.1          |      1963.7
      (1024, 10000) | 0.05  |          3965.8          |      1951.9
```

I added new benchmark tests. Now I am using a real dataset used in recent studies [1, 2] with different sparsity levels.

```
[---------------------------------- matmul ---------------------------------]
                        |   0.5   |  0.7   |  0.8   |  0.9   |  0.95  |  0.98
1 threads: ------------------------------------------------------------------
  (cpu)   torch         |    5.4  |   5.4  |   5.2  |   5.3  |   5.3  |   5.4
          torch.sparse  |  122.2  |  51.9  |  27.5  |  11.4  |   4.9  |   1.8
          scipy         |  150.1  |  87.4  |  69.2  |  56.8  |  38.4  |  17.1
  (cuda)  torch         |    1.3  |   1.1  |   1.1  |   1.1  |   1.1  |   1.1
          torch.sparse  |   20.0  |   8.4  |   5.1  |   2.5  |   1.5  |   1.1

[----------------------------------- backward -----------------------------------]
                        |   0.5   |   0.7   |   0.8   |   0.9   |   0.95  |   0.98
1 threads: -----------------------------------------------------------------------
  (cpu)   torch         |   17.7  |   17.9  |   17.7  |   17.7  |   17.6  |   17.9
          torch.sparse  |  672.9  |  432.6  |  327.5  |  230.8  |  176.7  |  116.7
  (cuda)  torch         |    3.8  |    3.6  |    3.5  |    3.5  |    3.6  |    3.5
          torch.sparse  |   68.8  |   46.2  |   35.6  |   24.2  |   17.8  |   11.9

Times are in milliseconds (ms).
```

In summary, the new `sparse @ sparse` backward algorithm is the better choice, as it is more about saving memory than raw performance. Moreover, it is better than the other options tested before.

## **References**

1. Trevor Gale, Matei Zaharia, Cliff Young, Erich Elsen. **Sparse GPU Kernels for Deep Learning.**  Proceedings of the International Conference for High Performance Computing, 2020. [https://github.com/google-research/google-research/tree/master/sgk](https://github.com/google-research/google-research/tree/master/sgk)
2. Trevor Gale, Erich Elsen, Sara Hooker. **The State of Sparsity in Deep Neural Networks.** [https://github.com/google-research/google-research/tree/master/state_of_sparsity](https://github.com/google-research/google-research/tree/master/state_of_sparsity)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39526

Reviewed By: mruberry

Differential Revision: D25661239

Pulled By: ngimel

fbshipit-source-id: b515ecd66d25f347d637e159d51aa45fb43b6938
2020-12-21 11:53:55 -08:00
3779bdec56 Implementing NumPy-like function torch.broadcast_to (#48997)
Summary:
Related https://github.com/pytorch/pytorch/issues/38349

Implement NumPy-like function `torch.broadcast_to` to broadcast the input tensor to a new shape.
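
For illustration:
```python
import torch

x = torch.tensor([1, 2, 3])
print(torch.broadcast_to(x, (2, 3)))
# tensor([[1, 2, 3],
#         [1, 2, 3]])
```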

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48997

Reviewed By: anjali411, ngimel

Differential Revision: D25663937

Pulled By: mruberry

fbshipit-source-id: 0415c03f92f02684983f412666d0a44515b99373
2020-12-21 11:24:50 -08:00
db2e9c1e7f [NNC] Intermediate allocs flattened and dependency support (#49554)
Summary:
Makes two changes in NNC for intermediate buffer allocations:
1. Flattens dimensions of buffers allocated in LoopNest::prepareForCodegen() to match their flattened usages.
2. Adds support for tracking memory dependencies of Alloc/Free to the MemDependencyChecker, which will allow us to check safety of accesses to intermediate buffers (coming in a future diff).

I didn't add any new tests as the mem dependency checker tests already cover it pretty well, particularly the GEMM test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49554

Reviewed By: VitalyFedyunin

Differential Revision: D25643133

Pulled By: nickgg

fbshipit-source-id: 66be3054eb36f0a4279d0c36562e63aa2dae371c
2020-12-21 10:35:15 -08:00
a3aafea076 Fixed a typo in dataloader.py. (#49437)
Summary:
This small PR fixes a one character typo in the docstring for `DataLoader`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49437

Reviewed By: ngimel

Differential Revision: D25665971

Pulled By: mrshenli

fbshipit-source-id: b60f975f1e3bf0bb8f88e39f490f716c602f087e
2020-12-21 10:27:24 -08:00
b1a1271f68 Fix typo in add_pr_curve docstrings. (#49648)
Summary:
Very small PR to fix a typo.

### Description
Fixed 1 typo in the documentation of `torch/utils/tensorboard/writer.py` (replaced "_should in_" by "_should be in_")

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49648

Reviewed By: ngimel

Differential Revision: D25665831

Pulled By: mrshenli

fbshipit-source-id: a4e733515603bb9313c1267fdf2cfcc2bc2773c6
2020-12-21 10:21:55 -08:00
b80a36614f Fix return type Any for Ternary ops (#49165)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49165

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D25463694

Pulled By: ejguan

fbshipit-source-id: 5cf907e8de6eeb0171d61175a60fac9812b76c6c
2020-12-21 10:12:41 -08:00
8be205ae13 Added linalg.solve (#48456)
Summary:
This PR adds `torch.linalg.solve`.

`linalg_solve_out` uses in-place operations on the provided result tensor.

I modified `apply_solve` to accept a tensor of Int instead of a std::vector; that way we can write a function similar to `linalg_solve_out` but without the error checks and device memory synchronization.

In comparison to `torch.solve`, this routine accepts 1-dimensional tensors and batches of 1-dim tensors for the right-hand-side term; `torch.solve` requires it to be at least 2-dimensional.
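
A short sketch of the 1-dimensional right-hand-side case:
```python
import torch

A = torch.randn(3, 3)
b = torch.randn(3)            # 1-D rhs, accepted by linalg.solve
x = torch.linalg.solve(A, b)
assert torch.allclose(A @ x, b, atol=1e-5)
```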

Ref. https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48456

Reviewed By: izdeby

Differential Revision: D25562222

Pulled By: mruberry

fbshipit-source-id: a9355c029e2442c2e448b6309511919631f9e43b
2020-12-21 10:11:12 -08:00
5ce94991eb Fix sinc docs typo (#49667)
Summary:
Fix small typo in sinc docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49667

Reviewed By: ngimel

Differential Revision: D25665721

Pulled By: soulitzer

fbshipit-source-id: 5f78b9e34bb0084e51ae79d1afc450bcb0ae3d75
2020-12-21 09:52:09 -08:00
ef172e138c [Mask R-CNN]Add Int8 AABB Generate proposals Op (#49574)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49574

Adds support for additional Eigen Utils for custom type defs.

Reviewed By: linbinyu

Differential Revision: D25624556

fbshipit-source-id: 0ffa90aaf8cbf1d08825e95156fb40d966ca7042
2020-12-21 09:43:33 -08:00
7ed140a1a0 [WIP][DataLoader] Prototype of SamplerIterableDataset (#49363)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49363

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D25623637

Pulled By: ejguan

fbshipit-source-id: 9155d27d1fc91996b74110795cc73f1da0eedd44
2020-12-21 07:09:34 -08:00
554f79acb9 [WIP][DataLoader] Prototype of BatchIterableDataset (#49186)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49186

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D25623636

Pulled By: ejguan

fbshipit-source-id: 01a08cccb69301481c55b46358203354b9b4f5fa
2020-12-21 07:09:31 -08:00
1b6fc1fd42 [WIP][DataLoader] CollateIterableDataset prototype (#48933)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48933

Prototype for CollateIterableDataset.
Move `collate_batch_fn` to BatchIterableDataset

- CollateIterableDataset
  - [x] Prototype
  - [x] Tests
- BatchIterableDataset
  - [x] Prototype
  - [x] Tests
- SamplerIterableDataset
  - [x] Prototype
  - [x] Tests

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D25623635

Pulled By: ejguan

fbshipit-source-id: 99ba077619f672551ac15367baaba985db35a9c2
2020-12-21 07:04:25 -08:00
bab732a3a3 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D25662961

fbshipit-source-id: f5811a5797fd6dc8733fdf86f35c93d12a08d53a
2020-12-21 04:14:44 -08:00
5c3788d5d7 Add support for torch.tensor_split to accept a tensor for indices argument (#49169)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49169

Trying to address the feature request in https://github.com/pytorch/pytorch/issues/47479.
This diff overloads the `torch.tensor_split` method to also accept a tensor for the argument `split_size_or_sections`, which currently accepts a Python list or int. The motivation is to avoid converting a tensor to a list, so that the tensor operations can be recorded when tracing a model/module.
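
A hedged illustration of the tensor-index form, assuming it mirrors the existing list form:
```python
import torch

x = torch.arange(8)
print(torch.tensor_split(x, torch.tensor([2, 5])))
# (tensor([0, 1]), tensor([2, 3, 4]), tensor([5, 6, 7]))
```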

Implementation is following the diff that originally added the `tensor_split` method D24166164 (ef4817fe5a).

Test Plan:
```
buck test caffe2/test:torch -- tensor_split
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/5910974550563805/

```
buck test caffe2/test:others -- tensor_split
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/1688849905082678/

Reviewed By: mruberry

Differential Revision: D25440885

fbshipit-source-id: 6705dc551279e3a5eb1e5ec1ede2728eab85ffb1
2020-12-20 21:43:44 -08:00
96aed203bf [Gradient Compression] Replace the assertions in PowerSGD comm hook by stream syncrhonization (#49435)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49435

Previously, the assertion prevented illegal memory access only as a side effect: torch.any returns a boolean value, which initiates a data transfer from the device to the host and forces a synchronization.

An explicit synchronization is more to the point.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118664204

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D25573484

fbshipit-source-id: 516d0d502da2863b516c15332702335ee662f072
2020-12-20 17:24:06 -08:00
342bfd892f [Gradient Compression] Add error feedback to layerwise PowerSGD (#49418)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49418

Add error feedback to the original implementation of PowerSGD.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118670930

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D25555538

fbshipit-source-id: c01145cc9acf574a4c6aa337dbbba0ba7d9350b2
2020-12-20 17:22:39 -08:00
5c25f8faf3 stft: Change require_complex warning to an error (#49022)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49022

**BC-breaking note**:

Previously torch.stft took an optional `return_complex` parameter that indicated whether the output would be a floating point tensor or a complex tensor. By default `return_complex` was False to be consistent with the previous behavior of torch.stft. This PR changes this behavior so `return_complex` is a required argument.
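
A minimal sketch of the now-required argument:
```python
import torch

x = torch.randn(1024)
spec = torch.stft(x, n_fft=256, return_complex=True)  # complex output
print(spec.dtype)  # torch.complex64
```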

**PR Summary**:

* **#49022 stft: Change require_complex warning to an error**

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25658906

Pulled By: mruberry

fbshipit-source-id: 11932d1102e93f8c7bd3d2d0b2a607fd5036ec5e
2020-12-20 14:48:25 -08:00
f5ee619d2a Updated derivative rules for complex svd and pinverse (#47761)
Summary:
Updated `svd_backward` to work correctly for complex-valued inputs.
Updated `common_methods_invocations.py` to take dtype, device arguments for input construction.
Removed `test_pinverse` from `test_autograd.py`, it is replaced by entries to `common_methods_invocations.py`.
Added `svd` and `pinverse` to list of complex tests.

References for complex-valued SVD differentiation:

- https://giggleliu.github.io/2019/04/02/einsumbp.html
- https://arxiv.org/abs/1909.02659

The derived rules assume gauge invariance of loss functions, so the result would not be correct for loss functions that are not gauge invariant.
https://re-ra.xyz/Gauge-Problem-in-Automatic-Differentiation/

The same rule is implemented in Tensorflow and [BackwardsLinalg.jl](https://github.com/GiggleLiu/BackwardsLinalg.jl).

Ref. https://github.com/pytorch/pytorch/issues/33152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47761

Reviewed By: ngimel

Differential Revision: D25658897

Pulled By: mruberry

fbshipit-source-id: ba33ecbbea3f592238c01e62c7f193daf22a9d01
2020-12-20 14:39:31 -08:00
8b61fbdac9 Resubmit: [Gradient Compression] Implement the original layerwise PowerSGD (#49639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49639

Resubmit #49417 with a fix for distributed_test.

The previous submission broke a multi-gpu test that runs on 4 GPUs. Since this test only runs on master, the breakage couldn't be detected before submission.

The real diff is:
4ca1014bb5

This time I have verified that the previous failed test `pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test` could pass after creating a PR (#49651) from a separate branch:
https://app.circleci.com/pipelines/github/pytorch/pytorch/253644/workflows/c1c02b70-0877-40e6-8b4c-61f60f6b70ed/jobs/9768079

ghstack-source-id: 118969912

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: mrshenli

Differential Revision: D25654961

fbshipit-source-id: 2a45c8ceb9bdb54ff7309a8b66ec87e913e0150e
2020-12-20 13:02:52 -08:00
c0deb231db disable kthvalue overlap (#48254)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48254

Reviewed By: bdhirsh

Differential Revision: D25276689

Pulled By: VitalyFedyunin

fbshipit-source-id: a70774e31c269b41786170e99ec1ede42596ba7b
2020-12-19 11:30:27 -08:00
1ac05cfe01 Remove DataPtr extractor from CUDAFuture (#48840)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48840

The CUDAFuture class needs to inspect the values it contains in order to extract its tensors (in fact, the DataPtrs backing those). These are needed first to determine what CUDA devices back those tensors, so that an event for each such device can be recorded; and later to record these DataPtrs with the CUDA caching allocator if they are used in other streams.

This became complicated when Python was added to the mix, because to inspect a Python object we need to acquire the GIL, but we couldn't do so from code that was supposed to also work in C++-only mode. The solution was for users to provide a custom way to extract DataPtrs, so that the PythonFutureWrapper could install such a custom Python-aware one. This was the DataPtr extractor.

In https://github.com/pytorch/pytorch/pull/48502 a different suggestion was proposed. At its root, it consists of adding support for IValues of type PyObject to the visit() and getSubValues() methods. In order to deal with the GIL, we do this through a virtual method: PyObjectHolder, which is the base class, is also available in C++-only mode, and thus defines this method but leaves it unimplemented; ConcretePyObjectHolder, which is the subclass, is only included in Python mode, and thus it can implement that method, acquire the GIL, and do what it's supposed to.

In my opinion, this approach is just brilliant! Thanks to wanchaol for proposing it! It hides the complexity of dealing with Python inside getSubValues(), where it can be done properly, thus enormously simplifying the CUDAFuture and PythonFutureWrapper classes.
ghstack-source-id: 118704935

Test Plan: Unit tests

Reviewed By: wanchaol

Differential Revision: D25334355

fbshipit-source-id: 3f1d3bf6e6e8505a114c877fb9a6fcc3f68d91d3
2020-12-19 11:03:45 -08:00
e0f60c9720 Disable test on windows (#49636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49636

test_export_stacks fails with permission errors

Test Plan:
CI

Imported from OSS

Reviewed By: robieta

Differential Revision: D25654680

fbshipit-source-id: 5689289e06eebc0686030f90ed56483a072b6850
2020-12-18 22:09:52 -08:00
e2d2d9bb0c [PyTorch Mobile] Preserve bundled input related methods when calling optimize_for_mobile (#49170)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49170

Added an extra step to **always** preserve the bundled inputs methods if they are present in the input module.

Also added a check that all the methods in `preserved_methods` actually exist. If not, we now throw an exception. This can hopefully stop hard-to-debug inputs from getting into downstream functions.

~~Add an optional argument `preserve_bundled_inputs_methods=False` to the `optimize_for_mobile` function. If set to be True, the function will now add three additional functions related with bundled inputs to be preserved: `get_all_bundled_inputs`, `get_num_bundled_inputs` and `run_on_bundled_input`.~~
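A short usage sketch, assuming the 1.8-era `optimize_for_mobile` signature:

```
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

scripted = torch.jit.script(M())
# bundled-input methods, if present on the module, are now preserved automatically;
# any other extra methods must still be listed in preserved_methods
optimized = optimize_for_mobile(scripted, preserved_methods=[])
```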

Test Plan:
`buck test mode/dev //caffe2/test:mobile -- 'test_preserve_bundled_inputs_methods \(test_mobile_optimizer\.TestOptimizer\)'`

or

`buck test caffe2/test:mobile` to run some other related tests as well.

Reviewed By: dhruvbird

Differential Revision: D25463719

fbshipit-source-id: 6670dfd59bcaf54b56019c1a43db04b288481b6a
2020-12-18 22:01:46 -08:00
ad9923e5d5 Revert D25511543: [Gradient Compression] Implement the original layerwise PowerSGD
Test Plan: revert-hammer

Differential Revision:
D25511543 (71f3399e19)

Original commit changeset: 19ef188bc2d4

fbshipit-source-id: a363641a059aeacc57684884998cf8fb7363d748
2020-12-18 20:30:29 -08:00
5cde23fdd4 [quant][graphmode][fx] Allow user to specify qconfig for call_method (#49621)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49621

This adds support for configuring a qconfig for a call_method, e.g. x.chunk; this helps work around a problem in our internal model.

TODO: since call_method is also a string and we flatten the qconfig, might need to resolve namespace conflict between
call_method and module_name
TODO: Add scope support to set the qconfig for call_method correctly with original qconfig

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D25651828

fbshipit-source-id: 82d66b121d37c8274fd481b6a2e9f9b54c5ca73d
2020-12-18 20:21:52 -08:00
e4eaa6de5f Fix lint (#49629)
Summary:
Fix lint on master

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49629

Reviewed By: rohan-varma

Differential Revision: D25654199

Pulled By: mrshenli

fbshipit-source-id: 2ab5669ad47996c0ca0f9b6611855767d5af0506
2020-12-18 19:26:06 -08:00
7278e3bd29 Bump tensorpipe version (#49599)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49599

Reviewed By: lw

Differential Revision: D25639036

Pulled By: mrshenli

fbshipit-source-id: 595b396a01d7fa9049d88447ab9079e286637afe
2020-12-18 18:52:41 -08:00
159de1f1d6 Add benchmark for torch.distributed.pipeline.sync.Pipe (#49577)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49577

Repurposing the benchmarking from
https://github.com/facebookresearch/fairscale/blob/master/benchmarks/pipe.py
and pulling in a stripped down version of the benchmark into PyTorch.

Sample output:
```
Running benchmark with args: Namespace(batch_size=8, checkpoint='never', chunks=4, host='localhost', max_batch=10, num_decoder_layers=10, num_devices=4)
Number of parameters for model: 292833040
| batch     1 | wps 3593.07 | loss 25.98 | ppl 192556591553.37
| batch     2 | wps 4405.16 | loss 19.36 | ppl 256201548.33
| batch     3 | wps 4404.98 | loss 23.56 | ppl 17111244076.37
| batch     4 | wps 4413.25 | loss 27.11 | ppl 594561327825.83
| batch     5 | wps 4408.53 | loss 25.92 | ppl 181277705101.33
| batch     6 | wps 4385.64 | loss 24.92 | ppl 66592883598.50
| batch     7 | wps 4434.11 | loss 24.75 | ppl 56113635884.68
| batch     8 | wps 4441.25 | loss 24.88 | ppl 63666024212.82
| batch     9 | wps 4425.49 | loss 25.35 | ppl 101959669008.98
| batch    10 | wps 4421.05 | loss 25.34 | ppl 101597621863.94
Peak memory usage for GPUs: cuda:0: 2.38GiB, cuda:1: 3.04GiB, cuda:2: 3.04GiB, cuda:3: 3.67GiB,
```
ghstack-source-id: 118939686

Test Plan: sentinel

Reviewed By: rohan-varma

Differential Revision: D25628721

fbshipit-source-id: 41c788eed4f852aef019aec18a84cb25ad254f3a
2020-12-18 18:33:47 -08:00
8c52fdf522 Improve documentation for pipeline parallelism. (#48638)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48638

Polishing up some of the docs for the main `Pipe` class and its
`forward` method.
ghstack-source-id: 118820804

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D25237705

fbshipit-source-id: ba3d8737b90a80024c827c0887fc56f14bf678b7
2020-12-18 18:28:26 -08:00
71f3399e19 [Gradient Compression] Implement the original layerwise PowerSGD (#49417)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49417

The existing implementation applies PowerSGD to a batch of flattened tensors, which is a coarse-grained compression. This hook is now renamed "batched_powerSGD_hook".

Now implement the original algorithm from the paper, which applies PowerSGD to each per-parameter tensor. This is a layerwise, fine-grained compression. Although this original implementation is slower, it is expected to achieve higher accuracy, especially when the shapes of per-param tensors cannot be aligned.

Also add a test in distributed_test.py.
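A usage sketch, assuming the hook module layout from this PR stack (`ddp_model` is an existing DistributedDataParallel instance):

```
import torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook as powerSGD

state = powerSGD.PowerSGDState(process_group=None, matrix_approximation_rank=1)
ddp_model.register_comm_hook(state, powerSGD.powerSGD_hook)           # layerwise variant
# ddp_model.register_comm_hook(state, powerSGD.batched_powerSGD_hook) # batched variant
```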

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118921275

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D25511543

fbshipit-source-id: 19ef188bc2d4c7406443c8fa233c1f2c2f27d93c
2020-12-18 18:02:15 -08:00
6f381de006 Inline coverage report combining/reporting (#49615)
Summary:
Instead of calling the coverage frontend, import the coverage module and call combine() and html_report() directly.

Fixes https://github.com/pytorch/pytorch/issues/49596 by not using a strict mode when combining those reports

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49615

Reviewed By: seemethere

Differential Revision: D25645196

Pulled By: malfet

fbshipit-source-id: be55b5c23a3569a331cbdf3f86d8c89bc27d5fe1
2020-12-18 17:08:46 -08:00
e2e44bb10a [Issue #46210] added torch.fx.len() to provide support for len(); added a test case for torch.fx.len() (#49532)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49532

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D25608804

Pulled By: huiguoo

fbshipit-source-id: 93ac02ab57db5d200d92443062286c34782ec0ef
2020-12-18 16:43:57 -08:00
3659560fba [NNC] Disable masked fill (#49622)
Summary:
There's a bug internally, disable as quick fix before investigation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49622

Test Plan:
Imported from GitHub, without a `Test Plan:` line.
build

Reviewed By: zheng-xq, PursueHappinessDirectly

Differential Revision: D25651897

Pulled By: eellison

fbshipit-source-id: dd1454f2ef7506d7844016128aa6320d7e69aa6e
2020-12-18 16:28:00 -08:00
5ab9593098 torch.reciprocal: promote integer inputs to float (#49102)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49091

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49102

Reviewed By: VitalyFedyunin

Differential Revision: D25639541

Pulled By: soulitzer

fbshipit-source-id: 1dd360bd7b77f106d606143d8d3961610bac8cb7
2020-12-18 16:17:30 -08:00
485aee7a22 Output stacks (support for SVG visualization) (#48438)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48438

Outputting stacks in a format suitable for SVG visualization
(e.g. with the https://github.com/brendangregg/FlameGraph tool)
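A short usage sketch:

```
import torch

with torch.autograd.profiler.profile(with_stack=True) as prof:
    torch.mm(torch.randn(64, 64), torch.randn(64, 64))
prof.export_stacks("/tmp/profiler_stacks.txt", "self_cpu_time_total")
# then, e.g.: flamegraph.pl --title "CPU stacks" /tmp/profiler_stacks.txt > perf.svg
```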

Test Plan:
python test/test_profiler.py -k test_export_stacks

e.g. resnet18 (note: actual SVG is interactive):

<img width="1193" alt="Screen Shot 2020-11-24 at 7 06 27 PM" src="https://user-images.githubusercontent.com/30845429/100178160-397f3500-2e88-11eb-81c4-34b19c5fcb87.png">

Reviewed By: dzhulgakov

Differential Revision: D25174270

Pulled By: ilia-cher

fbshipit-source-id: 6b60084071b209441805c468f5ff777318e42d1a
2020-12-18 16:10:41 -08:00
d0a12c5a47 Add sinc operator (#48740)
Summary:
Implements the sinc operator.
See https://numpy.org/doc/stable/reference/generated/numpy.sinc.html

![image](https://user-images.githubusercontent.com/13428986/101653855-cdffa080-3a0d-11eb-8426-ecc81c152ebd.png)
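A quick usage example:

```
import torch

x = torch.tensor([0.0, 0.5, 1.0, 2.0])
print(torch.sinc(x))  # sin(pi*x) / (pi*x), with sinc(0) defined as 1
# -> tensor([1.0000, 0.6366, 0.0000, 0.0000]) (approximately)
```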

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48740

Reviewed By: ezyang

Differential Revision: D25597565

Pulled By: soulitzer

fbshipit-source-id: 6dbcf282ee4eba34930bc9e5c85c0c5e79cf0322
2020-12-18 15:52:24 -08:00
d088359e5a [torchscript] Fix constant propagation schemas (#49605)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49605

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D25643157

Pulled By: IvanKobzarev

fbshipit-source-id: c5440622f6cf559afadca853e1eb7a9fbb8edf7f
2020-12-18 15:28:42 -08:00
9d91360b5d Cleanup APIs for pipeline parallelism. (#48630)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48630

1) Make torch.distributed.pipeline package public.
2) Make several helper methods private.
ghstack-source-id: 118820803

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D25235688

fbshipit-source-id: c32833ebf090ddbd4eaf06fcb5e3f9d421623a60
2020-12-18 15:17:13 -08:00
39d89e06e0 Upload test times to S3 (#49190)
Summary:
This PR currently just modifies the `test/print_test_stats.py` script (run in the `pytorch_linux_test` job) so that now it uploads test times to the new `ossci-metrics` S3 bucket (rather than just to Scribe) if passed the `--upload-to-s3` parameter.

The next step is to add an additional step to that `pytorch_linux_test` job which checks if it's being run on a PR, and if so, finds the `master` commit to compare against (similar to what's done in the now-unused `.jenkins/pytorch/short-perf-test-{c,g}pu.sh` scripts) and adds test time info to the Dr CI comment if the PR is significantly different from the base revision.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49190

Test Plan:
An "integration test" would be to just look in [the `ossci-metrics` S3 bucket](https://s3.console.aws.amazon.com/s3/buckets/ossci-metrics) to confirm that the CI run(s) for this PR did indeed upload their test time data successfully.

To test this locally, first make sure you have all the packages you need, such as these:
```
$ conda install -c anaconda boto3
$ conda install -c conda-forge unittest-xml-reporting
```
Then run whatever tests you want; these are the ones I used for my local smoke test, for no particular reason:
```
$ python test/test_spectral_ops.py --save-xml=/tmp/reports/spectral_ops
```
Once the tests finish, run the script to upload their times to S3:
```
$ CIRCLE_SHA1="$(git rev-parse HEAD)" CIRCLE_JOB=foo test/print_test_stats.py --upload-to-s3 /tmp/reports/spectral_ops
```
Now check that they uploaded successfully:
```
$ aws s3 cp "s3://ossci-metrics/test_time/$(git rev-parse HEAD)/foo/" /tmp/reports --recursive
```
And that it's a valid `*.json.bz2` file:
```
$ bzip2 -kdc /tmp/reports/*Z.json.bz2 | jq . | head -n21
{
  "build_pr": null,
  "build_tag": null,
  "build_sha1": "e46f43621b910bc2f18dd33c08f5af18a542d5ed",
  "build_branch": null,
  "build_job": "foo",
  "build_workflow_id": null,
  "total_seconds": 0.9640000000000003,
  "suites": {
    "TestFFTCPU": {
      "total_seconds": 0.9640000000000003,
      "cases": [
        {
          "name": "test_fft_invalid_dtypes_cpu",
          "seconds": 0.022,
          "errored": false,
          "failed": false,
          "skipped": false
        },
        {
          "name": "test_istft_throws_cpu",
```

Reviewed By: walterddr, malfet

Differential Revision: D25618035

Pulled By: samestep

fbshipit-source-id: 4d8013859a38a49e5bba700c5134951ca1a9d8b7
2020-12-18 14:46:37 -08:00
b361e33a66 [package] implicitly extern stdlib before mocking (#49306)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49306

This allows you to mock out everything except for specific patterns while
still correctly externing the python standard library. This makes it less
likely that you will need to override require_module.
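A hedged sketch of the intended usage (the exact pattern syntax is illustrative):

```
import torch
from torch.package import PackageExporter

my_model = torch.nn.Linear(4, 2)
with PackageExporter("out.pt") as exporter:
    exporter.mock("**")  # stdlib modules are implicitly externed first, so this stays safe
    exporter.save_pickle("model", "model.pkl", my_model)
```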

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D25526212

Pulled By: zdevito

fbshipit-source-id: 7339f4c7f12af883496f79de95e57d452bb32dc2
2020-12-18 14:16:46 -08:00
fb755ad33e [FX] Emit named tuple construction node when NamedTuple appears as an arg (#49553)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49553

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D25618577

Pulled By: jamesr66a

fbshipit-source-id: 042f742f9ca02e59bbceda97bfcf47f9bac07873
2020-12-18 14:10:17 -08:00
27f355f87e Test pipeline parallelism works with DDP. (#48470)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48470

Adding a unit test to test this works as expected. Although, this
doesn't work with other checkpointing modes of the pipe and checkpoint=never
needs to be set for this to work.
ghstack-source-id: 118820806

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D25182668

fbshipit-source-id: 85e69e338bf388c132a303ad93e29ec2cc4a0ed8
2020-12-18 13:34:44 -08:00
e17f0fd676 Adding support for bitwise augassignment operators (#44621)
Summary:
========
Fixes #42915

This commit adds support for bitwise augmented-assignment shorthands in TorchScript, i.e. |=, &=, ^=, <<=, >>=, **=
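A small scriptable example:

```
import torch

@torch.jit.script
def bit_ops(x: int, y: int) -> int:
    x |= y     # bitwise-or assignment
    x &= 0xFF  # bitwise-and assignment
    x ^= 3     # bitwise-xor assignment
    x <<= 2    # left-shift assignment
    x >>= 1    # right-shift assignment
    return x

print(bit_ops(5, 2))  # 8
```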

Testing:
======
This commit also adds a test for the above fix in test_jit.py.
The test can be invoked by:
pytest -k augassign test/test_jit.py

Here is a snapshot of the testing:
<img width="1238" alt="image" src="https://user-images.githubusercontent.com/70345919/93105141-8f9f5300-f663-11ea-836b-3b52da6d2be5.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44621

Reviewed By: mrshenli

Differential Revision: D23906344

Pulled By: nikithamalgifb

fbshipit-source-id: 4c93a7430a625f698b163609ccec15e51417d564
2020-12-18 12:07:54 -08:00
daaf932a99 New profiler API (#48280)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48280

Adding new API for the kineto profiler that supports enable predicate
function

Test Plan: unit test

Reviewed By: ngimel

Differential Revision: D25142220

Pulled By: ilia-cher

fbshipit-source-id: c57fa42855895075328733d7379eaf3dc1743d14
2020-12-18 11:49:02 -08:00
4a870f6518 [PyTorch Mobile] Export Operator List from Mobile CompilationUnit instead of from TorchScript Model (#49385)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49385

Currently, the API to export operator lists accepts a `torch::jit::Module` object and spits out an operator list. The operator list is practically used only for mobile. This is not ideal because the set of root operators may change by the time the model is subsequently optimized and exported for mobile.

What we need to do instead is glean the list of operators from the mobile model itself (`bytecode.pkl` specifically), and expose that instead.

Also updated the logic in `converter`.

### Before this change:
1. Get operator List from Torch Script Model
2. Convert to bytecode mobile model

### After this change:
1. Convert to bytecode mobile model
2. Use this converted mobile model to get the list of operators for each method on the model

ghstack-source-id: 118796752

Test Plan:
Added a unit test in `test_lite_interpreter.cpp` to ensure that all model referenced operators show up in the exported operator list. Also make `test_lite_interpreter.cpp` runnable from `xplat/caffe2/BUCK` since this is where the production code will be built from.

Verified that the list of operators produced before and after this change for an example model (segmentation) are the same.

{P147863234}

Also verified that the operator lists for BI-Xray model is different (we have been having problems with missing operators for this one): {P154903132}

Reviewed By: iseeyuan

Differential Revision: D24690094

fbshipit-source-id: 0426a6ef90456a811010cfe337c415882ae2deff
2020-12-18 11:17:57 -08:00
71ca600af9 Renaming CAFFE2_API to TORCH_API (#49496)
Summary:
Since caffe2 and torch have been consolidated, CAFFE2_API should be merged with TORCH_API. Addresses a TODO.

Manually edited some references of the removed `CAFFE2_API`:
* `CONTRIBUTING.md`
* `caffe2/proto/CMakeLists.txt`
* `cmake/ProtoBuf.cmake`
* `c10/macros/Export.h`
* `torch/csrc/WindowsTorchApiMacro.h`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49496

Reviewed By: malfet, samestep

Differential Revision: D25600726

Pulled By: janeyx99

fbshipit-source-id: 7e068d959e397ac183c097d7e9a9afeca5ddd782
2020-12-18 10:54:50 -08:00
c9e052130a [FX] Enforce args is tuple and kwargs is dict (#49526)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49526

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D25606115

Pulled By: jamesr66a

fbshipit-source-id: f2a21d02a2cf8c08cbd618efc5a6a28d34806851
2020-12-18 10:21:19 -08:00
faf6032945 Remove deadlines for Caffe2 hypothesis_test when running on GPU. (#49591)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49591

A bunch of these tests are marked flaky, and have been since time immemorial. (Read: as far back as Buck will build.) However, closer inspection reveals that they fail if and only if run on a GPU worker. What seems to be going on is that there are more jobs than GPUs, so the contention causes waits which register as timeouts on the test.

This diff is kind of hacky, but it basically just drops deadlines if a GPU is present. Because Caffe2 is going away I'm not too terribly concerned about a beautiful solution, but we may as well keep some test coverage if it's easy.
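The idea, as a minimal sketch (illustrative, not the exact diff):

```
import torch
from hypothesis import settings

# wall-clock deadlines are unreliable under GPU contention, so disable them there
settings.register_profile("gpu", deadline=None)
if torch.cuda.is_available():
    settings.load_profile("gpu")
```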

CC Sebastian, Ilia, Min, and Hongzheng who also have tasks for what seems to be the same flakiness.

Test Plan: Turn the tests back on and see if they fall over. (The failure repros reliably on an OnDemand GPU and is fixed by this change, so it's not really just a hail Mary.)

Reviewed By: ngimel

Differential Revision: D25632981

fbshipit-source-id: 43dcce416fea916ba91f891e9e5b59b2c11cca1a
2020-12-18 10:00:24 -08:00
ccd646696b Fix Module backward hooks for all Tensor inputs/outputs (#46163)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/598

This is BC-breaking as we now explicitly don't call the hook when there are no Tensors at the top level of the output.
This feature was not working anyways as the returned grad_input/grad_output were wrong (not respecting the output structure and wrong inputs for multi-Node Module).

This is also BC-breaking as we now report the correct gradients for `nn.Module`s that contain multiple autograd `Node`s, while we used to return incorrect results before.
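A minimal usage sketch of the fixed behavior:

```
import torch
import torch.nn as nn

def hook(module, grad_input, grad_output):
    # grad_output now matches the Tensor outputs of the Module
    print([None if g is None else tuple(g.shape) for g in grad_output])

m = nn.Linear(4, 2)
m.register_backward_hook(hook)
m(torch.randn(3, 4, requires_grad=True)).sum().backward()  # prints [(3, 2)]
```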

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46163

Reviewed By: ailzhang, mruberry

Differential Revision: D24894180

Pulled By: albanD

fbshipit-source-id: e1b5d193d2818eb2f51e2a2722c7405c8bd13c2b
2020-12-18 09:04:36 -08:00
0b27d57062 fixed the first line of torch.rst to match the __init__.py file's first line (#49584)
Summary:
Changed the first line of the torch.rst file to match that of the __init__.py file

Fixes https://github.com/pytorch/pytorch/issues/49228

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49584

Reviewed By: VitalyFedyunin

Differential Revision: D25639260

Pulled By: mrshenli

fbshipit-source-id: a0bafd945ff92115eed932662feedc46d29dfaab
2020-12-18 08:55:58 -08:00
7545ff6619 Refactor VmapPhysicalView::newLogicalToPhysical (#49482)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49482

Motivation
==========
Batching rules always invoke newLogicalToPhysical at the very end to turn
a physical tensor into a logical BatchedTensor (an example is below):
```
Tensor select_backward_batching_rule(const Tensor& grad, IntArrayRef input_sizes, int64_t dim, int64_t index) {
  auto grad_physical = MultiBatchVmapTransform::logicalToPhysical(grad);
  auto grad_input = at::zeros(grad_physical.getPhysicalShape(input_sizes), grad.options());
  auto physical_dim = getGradInputPhysicalDim(dim, input_sizes, grad_physical.numBatchDims());
  grad_input.select(physical_dim, index).copy_(grad_physical.tensor());
  return grad_physical.newLogicalFromPhysical(grad_input);
}
```
However, albanD noted that this function is confusing and ambiguous
because it's unclear which physical tensor is being turned into the logical
(in this case, grad_physical is a VmapPhysicalView, but we're really transforming
grad_input and returning it).
https://github.com/pytorch/pytorch/pull/44505#discussion_r487144018

I didn't want to make too many changes to the batching rule API because
I think we'll change it even more in the future, but this PR attempts to
remove the ambiguity by applying one of the suggestions in
https://github.com/pytorch/pytorch/pull/44505#discussion_r487144018

This PR
=======

The diagnosis of the problem is that we were conflating
"VmapPhysicalView", which maps logical attributes on a Tensor (like
dimension and shape) to physical attributes, with the reverse
physical-to-logical map. This PR creates a new VmapPhysicalToLogicalMap
object that handles the latter.

Instead of calling `grad_physical.newLogicalFromPhysical(grad_input)`,
an author of batching rules should now retrieve the VmapPhysicalToLogicalMap
object and apply it to their physical input. So the above code becomes:
```
grad_physical.getPhysicalToLogicalMap().apply(grad_input)
```

I've also moved VmapPhysicalView::makeLogicalFromPhysicalListInplace
to VmapPhysicalToLogicalMap::applyInplace.

Test Plan
=========
wait for tests

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D25592645

Pulled By: zou3519

fbshipit-source-id: 9c6ede9901ec6b70e5763193064658a8f91e6d48
2020-12-18 08:48:02 -08:00
f975f99d1d add checkout PR tip step for quick checks (#49590)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49590

Reviewed By: samestep

Differential Revision: D25633341

Pulled By: walterddr

fbshipit-source-id: 6e8db1f628f562d7632390bdb7788437cb1bf63d
2020-12-18 08:41:27 -08:00
2de345d44d Add op bench for caffe2 quantile op (#49598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49598

Add op bench for caffe2 quantile op

Test Plan: `buck run mode/opt caffe2/benchmarks/operator_benchmark/c2:quantile_op_test -- --warmup_iterations=10000  --iterations=10000`

Reviewed By: radkris-git

Differential Revision: D25590085

fbshipit-source-id: 0db58ac87c595b2bf2958f6299a1bf2ccea019db
2020-12-18 08:32:59 -08:00
6568572712 Support integral types for kAbs in SimpleIREvaluator (#49357)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49357

This is a follow-up fix for PR #48679, which added support for integer
inputs to aten::abs by promoting integers to float and then demoting the
result back to integers. This PR supports integer inputs to aten::abs
more efficiently in the SimpleIREvaluator by implementing integer inputs
for kAbs (renamed from kFabs) directly.
- Rename kFabs to kAbs
- Add support for integer input to kAbs in SimpleIREvaluator (note that
llvm_codegen and cuda_codegen already support integer inputs to kAbs)

Test Plan:
- `PYTORCH_TENSOREXPR_DONT_USE_LLVM=1 python test/test_jit_fuser_te.py
TestTEFuser.test_unary_ops`
- `python test/test_jit_fuser_te.py TestTEFuser.test_unary_ops`

Imported from OSS

Reviewed By: eellison

Differential Revision: D25545791

fbshipit-source-id: e52f51a352d149f66ce8341fb3beb479be08a230
2020-12-18 07:57:58 -08:00
72b00a8a52 Revert D25480770: Set USE_KINETO=1
Test Plan: revert-hammer

Differential Revision:
D25480770 (1a92802bde)

Original commit changeset: 037cd774f554

fbshipit-source-id: 6a6062195033ca91fcc0cfa1e890e47efc774ac1
2020-12-18 07:06:28 -08:00
1a92802bde Set USE_KINETO=1 (#49201)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49201

This unblocks kineto profiler for 1.8 release.
This PR supercedes https://github.com/pytorch/pytorch/pull/48391
Note: this will somewhat increase the size of linux server binaries, bc
we add libkineto.a and libcupti_static.a:
-rw-r--r-- 1 jenkins jenkins 1107502 Dec 10 21:16 build/lib/libkineto.a
-rw-r--r-- 1 root root 13699658 Nov 13  2019 /usr/local/cuda/lib64/libcupti_static.a

Test Plan:
CI
https://github.com/pytorch/pytorch/pull/48391

Imported from OSS

Reviewed By: ngimel

Differential Revision: D25480770

fbshipit-source-id: 037cd774f5547d9918d6055ef5cc952a54e48e4c
2020-12-18 01:48:10 -08:00
020c443fd1 Fix CustomAutogradTest.ReentrantPriority rerun failures (#49581)
Summary:
Clear static variable at the end of the test to ensure test passes after re-runs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49581

Test Plan:
`./bin/test_api "--gtest_filter=CustomAutogradTest.ReentrantPriority" --gtest_repeat=50`
Before the change all subsequent runs of the test failed with
```
../test/cpp/api/autograd.cpp:681: Failure
Expected equality of these values:
  order.size()
    Which is: 310
  10
```

Reviewed By: mrshenli

Differential Revision: D25632374

Pulled By: malfet

fbshipit-source-id: 4814d22b5dff15e1b38a0187e51070771fd58370
2020-12-18 00:34:06 -08:00
43f6da787e Use store based barrier in init_process_group. (#49419)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49419

As described in https://github.com/pytorch/pytorch/issues/48110, the
newly introduced `barrier()` in `init_process_group` messes up NCCL
communicator state, since it uses a bunch of default devices to perform an
allreduce which simulates a barrier(). As a result, subsequent NCCL operations
might not behave as expected.
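The gist of a store-based barrier, as a hedged sketch (the key name and helper are illustrative, not the actual implementation):

```
import time

def store_based_barrier(store, world_size, timeout=30.0):
    store.add("store_based_barrier_key", 1)  # each rank checks in exactly once
    start = time.time()
    # add(key, 0) reads the counter without changing it
    while store.add("store_based_barrier_key", 0) < world_size:
        if time.time() - start > timeout:
            raise RuntimeError("store-based barrier timed out")
        time.sleep(0.01)
```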
ghstack-source-id: 118861776

Test Plan:
1) unit test added.
2) waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D25566550

fbshipit-source-id: ab083b67b634d7c515f4945deb228f959b27c936
2020-12-18 00:02:54 -08:00
5fcfebd84a Disables method variant grad and grad grad checks (#49576)
Summary:
These are redundant with the functional variant checks and can be very costly, as some grad and gradgrad testing takes minutes to run per variant. Maybe in the future we'll add them back for operations with divergent method implementations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49576

Reviewed By: albanD, ngimel

Differential Revision: D25631691

Pulled By: mruberry

fbshipit-source-id: 247f750979d9dafab2454cdbfa992a2aa6da724a
2020-12-17 23:46:40 -08:00
573f4aa352 FLOPS Roofline Analysis Feature for PyTorch Profiler. (#46506)
Summary:
FLOPs Roofline Analysis Feature for PyTorch Profiler.

Currently, the PyTorch Profiler lacks the ability to measure the FLOPs of operators such as mm and conv.
FLOPs are helpful for estimating the computational complexity of the operators.
For now, we use input shapes to estimate the number of floating point operations.
In the future, we may compute this information by tracking hardware counters.
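A short usage sketch, assuming the `with_flops` flag added by this PR:

```
import torch

a, b = torch.randn(128, 128), torch.randn(128, 128)
with torch.autograd.profiler.profile(with_flops=True) as prof:
    torch.mm(a, b)
print(prof.key_averages().table(sort_by="cpu_time_total"))  # includes the MFLOPS column
```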

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46506

Test Plan:
Run `python test/test_profiler_flops.py -k test_flops`. The test will print a profiler table with "FLOPS" column, like the following:
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------  ------------
                        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                   Input Shapes        MFLOPS
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------  ------------
                aten::matmul         0.06%      57.653us        82.97%      79.310ms      79.310ms             1                 [[40, 33, 1, 243], [243, 243]]            --
                    aten::mm        82.84%      79.186ms        82.86%      79.204ms      79.204ms             1                      [[1320, 243], [243, 243]]       984.323
                aten::conv2d         0.04%      36.345us        16.06%      15.347ms      15.347ms             1  [[40, 16, 18, 260], [33, 16, 18, 18], [33], [  44065010.318
           aten::convolution         0.02%      16.016us        16.02%      15.310ms      15.310ms             1  [[40, 16, 18, 260], [33, 16, 18, 18], [33], [            --
          aten::_convolution         0.07%      63.855us        16.00%      15.294ms      15.294ms             1  [[40, 16, 18, 260], [33, 16, 18, 18], [33], [            --
    aten::mkldnn_convolution        15.89%      15.188ms        15.93%      15.225ms      15.225ms             1  [[40, 16, 18, 260], [33, 16, 18, 18], [33], [            --
                  aten::relu         0.10%      98.223us         0.64%     612.157us     306.079us             2                             [[40, 33, 1, 243]]            --
             aten::threshold         0.49%     465.416us         0.54%     513.934us     256.967us             2                     [[40, 33, 1, 243], [], []]            --
                  aten::add_         0.29%     279.301us         0.29%     279.301us     279.301us             1                  [[40, 33, 1, 243], [243], []]            --
                 aten::empty         0.10%      99.113us         0.10%      99.113us      24.778us             4                       [[], [], [], [], [], []]            --
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------  ------------
Self CPU time total: 95.584ms

.
----------------------------------------------------------------------
Ran 1 test in 0.176s

For now, we only provide FLOPs calculation for aten::conv2d and aten::mm operators.

Reviewed By: ezyang

Differential Revision: D25214452

Pulled By: xuzhao9

fbshipit-source-id: 0ae841bd8dbdeb032346dc3d9d38e19875aa1da3
2020-12-17 21:19:25 -08:00
5db12b6811 Add type inference for dequantization.tensors (#49517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49517

We should add concrete type info for Tensor List case as well.

Test Plan: ci

Reviewed By: qizzzh

Differential Revision: D25599223

fbshipit-source-id: 3614e9ec25fc963a8d6a0bd641735fcca6c87032
2020-12-17 21:01:35 -08:00
ed0489c11a disable concat nested namespace check (#49571)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49571

Disable nested namespace check since OSS standard is
```
set(CMAKE_CXX_STANDARD 14)
```
and it's currently causing confusion in clang-tidy internally, such as in D25214452

Test Plan: clang-tidy

Reviewed By: xuzhao9

Differential Revision: D25626392

fbshipit-source-id: 1fb472c89ebe9b83718ae27f2c1d77b8b2412b5e
2020-12-17 20:45:37 -08:00
9058040527 Add more list peephole idioms (#48268)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48268

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D25104617

Pulled By: eellison

fbshipit-source-id: b41c03d5da6e9b88acf21a859f61c5c70608c150
2020-12-17 20:25:41 -08:00
39d3578e91 [ddp launch] solve zombie problem (#49305)
Summary:
I was exhausted with needing to hunt down zombies when working with ddp launcher, so this PR solves the various zombie issues.

This PR addresses 2 distinct zombie scenarios caused by ddp launch.py:

1. When the main process is killed, the child processes aren't killed and continue running
2. When any of the child processes dies (e.g. OOM), the rest of the children and the parent remain running, but are really stuck

To solve these problems this PR switches from `wait` to `poll` and uses signal handlers.

The main problem with `wait()` was that it's not async; I was having a 2nd process OOM, and the code was stuck waiting for the first process to finish, which will never happen since the first process is now blocked waiting for the 2nd process - a sort of deadlock. My 2nd card is smaller than the first one, so it occasionally OOMs.

Using `asyncio` would probably be the cleanest solution, but as it's relatively new in python, perhaps polling is good enough.
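Conceptually, the parent loop looks something like this sketch (illustrative, not the actual launch.py code):

```
import signal
import subprocess
import sys
import time

procs = [subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
         for _ in range(2)]

def sigkill_handler(signum, frame):
    for p in procs:
        p.kill()              # propagate death to every child
    sys.exit(1)

for sig in (signal.SIGINT, signal.SIGTERM):
    signal.signal(sig, sigkill_handler)

while procs:
    for p in list(procs):
        ret = p.poll()        # non-blocking, unlike wait()
        if ret is None:
            continue
        procs.remove(p)
        if ret != 0:          # one child failed -> take everyone down
            sigkill_handler(signal.SIGTERM, None)
    time.sleep(1)
```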

I wrote this little script to reproduce 2 problematic scenarios and a normal running setup, it does 3 different things according to the `--mode` arg

- `oom` - causes the 2nd process to exit prematurely emulating OOM
- `clean-finish` - just exit normally in both processes
- `False` (lack of arg) just keep on running - emulating multiple normally running processes

```
# oom.py
import argparse
from time import sleep
import sys

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", default=False, type=int)
    parser.add_argument("--mode", default=False, type=str)
    args, _ = parser.parse_known_args()

    print(f"{args.local_rank} is starting")
    sleep(3)

    if args.mode == "oom":
        # emulate OOM in 2nd card
        if args.local_rank == 1:
            raise RuntimeError("OOM")

    if args.mode == "clean-finish":
        sleep(1)
        print(f"{args.local_rank} is cleanly finishing")
        sys.exit(0)

    while (True):
        # emulate long running process
        print(f"{args.local_rank} is running")
        sleep(1)

if __name__ == "__main__":
    main()
```

Let's begin:

###  1. Normal execution

```
python -m torch.distributed.launch --nproc_per_node=2 ./oom.py --mode=clean-finish
```

All the processes exit upon completion - I won't bother pasting the log here - just testing that my code didn't break the normal running

### 2. OOM

```
python -m torch.distributed.launch --nproc_per_node=2 ./oom.py --mode=oom
```

```
POLLING FOR 17547
POLLING FOR 17548
0
0 is starting
1
1 is starting
POLLING FOR 17547
POLLING FOR 17548
POLLING FOR 17548
POLLING FOR 17547
POLLING FOR 17547
POLLING FOR 17548
0 is running
Traceback (most recent call last):
  File "./oom.py", line 33, in <module>
    main()
  File "./oom.py", line 20, in main
    raise RuntimeError("OOM")
RuntimeError: OOM
POLLING FOR 17548
process 17548 is no more
Killing subprocess 17547
Killing subprocess 17548
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/launch.py", line 341, in <module>
    main()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/launch.py", line 327, in main
    sigkill_handler(signal.SIGTERM, None) # not coming back
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/stas/anaconda3/envs/main-38/bin/python', '-u', './oom.py', '--local_rank=1', '--mode=oom']' returned non-zero exit status 1.
```

All processes exited and the trace was printed

### 3. Exit on SIGINT/SIGTERM

If I started a process and then realized I made a mistake I want to be able to kill it cleanly and if any sub-processes have already been spawned I want them to be killed too. Here the sighandler takes care of trapping the SIGTERM/SIGINT.

```
python -m torch.distributed.launch --nproc_per_node=2 ./oom.py
```

Here the processes emulate a long normal run.

So let's Ctrl-C the process as soon as it started and see:

```
POLLING FOR 18749
POLLING FOR 18750
0
0 is starting
1
1 is starting
POLLING FOR 18749
POLLING FOR 18750
POLLING FOR 18750
POLLING FOR 18749
POLLING FOR 18749
POLLING FOR 18750
0 is running
1 is running
POLLING FOR 18750
POLLING FOR 18749
0 is running
1 is running
^CTraceback (most recent call last):
Killing subprocess 18749
Traceback (most recent call last):
  File "./oom.py", line 33, in <module>
  File "./oom.py", line 33, in <module>
Killing subprocess 18750
Parent got kill signal=SIGINT, exiting
```

all processes got killed

--------------------------------

So this covered the 2 problematic cases and 1 normal case

Notes:
- we could probably switch to `sleep(3)` - `1` is probably too fast
- all the debug prints will be removed once you are happy - I left them so that it's easier for you to test that my PR does the right thing.

Thank you!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49305

Reviewed By: izdeby

Differential Revision: D25565617

Pulled By: rohan-varma

fbshipit-source-id: 1ea864113f283d4daac5eef1131c8d745aae4c99
2020-12-17 20:07:59 -08:00
1047957831 [te][reapply] Add fast log approximation based on sleef (#49575)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49575

This is a fast log implementations

benchmark:

```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
```

Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat

Reviewed By: bertmaher

Differential Revision: D25627157

fbshipit-source-id: a4920f4f4005ce617d372b375e790ca966275cd9
2020-12-17 17:02:00 -08:00
c78fd76f18 Revert D25542799: [PyTorch] Merge CoinflipTLS into RecordFunctionTLS
Test Plan: revert-hammer

Differential Revision:
D25542799 (9ce1df079f)

Original commit changeset: 310f9fd15710

fbshipit-source-id: 51777914422a560e94430a786c86f5de4007a00b
2020-12-17 16:43:52 -08:00
625bc40def Revert D25544731: [PyTorch] Avoid extra Tensor refcounting in _cat_out_cpu
Test Plan: revert-hammer

Differential Revision:
D25544731 (1a0510463a)

Original commit changeset: 7b9656d0371a

fbshipit-source-id: 0f7ea74eca282cadf269bbd284d59650a431ed65
2020-12-17 16:43:49 -08:00
385f6b4807 Revert D25545777: [PyTorch] Use .sizes() instead of .size() in _cat_out_cpu
Test Plan: revert-hammer

Differential Revision:
D25545777 (c1879b573e)

Original commit changeset: b2714fac95c8

fbshipit-source-id: f534f8fc312943f1e6ba3d4029d6cf69b006aca8
2020-12-17 16:43:45 -08:00
52b3775914 Revert D25546409: [PyTorch] Use .sizes() isntead of .size() in cat_serial_kernel_impl
Test Plan: revert-hammer

Differential Revision:
D25546409 (953f9922ec)

Original commit changeset: 196034716b6e

fbshipit-source-id: 0e80f06a98c2842d2f11db7057ffcdcaea85f3bf
2020-12-17 16:43:42 -08:00
19dc5e94a6 Revert D25547962: [PyTorch] Make tls_local_dispatch_key_set inlineable (reapply)
Test Plan: revert-hammer

Differential Revision:
D25547962 (6f928a4a53)

Original commit changeset: 58424b1da230

fbshipit-source-id: 10ff9f45f6587f67e1c88886f977930b4f7e396a
2020-12-17 16:38:40 -08:00
d17dc37112 Add dict comprehension (#47774)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47774

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D25615464

Pulled By: ansley

fbshipit-source-id: 10bba6f70e812fa580cbbbf097e93de7142484cc
2020-12-17 15:25:30 -08:00
ea4ccc730e Revert D25445815: [te] Add fast log approximation based on sleef
Test Plan: revert-hammer

Differential Revision:
D25445815 (1329066b69)

Original commit changeset: 20696eacd12a

fbshipit-source-id: 38830a6abd16260d60e5dd9a5594e65736a9c782
2020-12-17 15:03:17 -08:00
6db5e85726 [FileStore] Updating Docs to Reflect FileStore changes (#49557)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49557

Updating the PyTorch docs to reflect that FileStore now supported the
num_keys API. Also included a note to describe the behavior of the API.
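A short usage sketch:

```
import torch.distributed as dist

store = dist.FileStore("/tmp/filestore_example", 1)
store.set("key0", "value0")
# note: the count may include keys the store itself writes for initialization,
# so it can be greater than the number of keys set explicitly
print(store.num_keys())
```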

Test Plan: build and rendered docs.

Reviewed By: jiayisuse

Differential Revision: D25619000

fbshipit-source-id: 6c660d7ceb32d1d61024df8394aff3fcd0b752c1
2020-12-17 14:54:29 -08:00
31fcbbdf35 [FileStore] Implemented numKeys and Added Tests (#49556)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49556

Implemented the missing Store functionality (specifically numKeys) in the FileStore.

Test Plan: Added both C++ and Python tests to verify functionality.

Reviewed By: jiayisuse

Differential Revision: D25619001

fbshipit-source-id: 9146d0da9e0903622be3035880f619bbb2cc3891
2020-12-17 14:54:24 -08:00
ad4467b93c .github: Add action workflow to update S3 HTMLS (#49509)
Summary:
Successful run: https://github.com/pytorch/pytorch/runs/1572315901

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49509

Reviewed By: walterddr

Differential Revision: D25619133

Pulled By: seemethere

fbshipit-source-id: 092ab12535f3bf4fc85bbfc690d3f5b10a5f8791
2020-12-17 14:50:59 -08:00
4b85239532 [quant][eagermode][fix] Fix quantization for DeQuantStub (#49428)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49428

Previously, DeQuantStub would be swapped with nn.quantized.DeQuantize regardless of qconfig.
The reason is that we skipped attaching a qconfig to DeQuantStub to avoid adding a fake-quantize module to it,
but the correct fix is to skip it when inserting observers. This PR fixes the issue.
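A minimal eager-mode sketch where the stub placement matters:

```
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(1, 1, 1)
        self.dequant = torch.quantization.DeQuantStub()  # now skipped during observer insertion

    def forward(self, x):
        return self.dequant(self.conv(self.quant(x)))

m = M()
m.qconfig = torch.quantization.default_qconfig
torch.quantization.prepare(m, inplace=True)
m(torch.randn(1, 1, 4, 4))  # calibrate
torch.quantization.convert(m, inplace=True)
```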

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D25569991

fbshipit-source-id: d44a08c6e64c7a49509687dc389b57de1cbb878c
2020-12-17 14:42:40 -08:00
1329066b69 [te] Add fast log approximation based on sleef
Summary:
This is a fast log implementations

benchmark:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
```

Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat

Reviewed By: bertmaher

Differential Revision: D25445815

fbshipit-source-id: 20696eacd12a55e797f606f4a6dbbd94c9652888
2020-12-17 14:28:34 -08:00
2b61e4d84c Revert D25152559: T66557700 Support default argument values of a method
Test Plan: revert-hammer

Differential Revision:
D25152559 (6bde0ca6d3)

Original commit changeset: bbf52f1fbdbf

fbshipit-source-id: 592fdb3078b1ac86cd394adc6c1bfd6b10d829e1
2020-12-17 14:05:49 -08:00
0d411c4216 Test distributed collectives profiling with Gloo on GPU (#49072)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49072

As per the title, we should enable these tests for Gloo when run on GPU and the profiler is enabled with `use_cuda=True`. Enabling ProcessGroupNCCL profiling test to work with `use_cuda=True` is being tracked in https://github.com/pytorch/pytorch/issues/48987.
ghstack-source-id: 118789003

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D25388986

fbshipit-source-id: 664d922ac2e10c77299daebdc6d3c92bb70eb56e
2020-12-17 13:43:06 -08:00
20b90f3909 Set is_non_overlapping_and_dense_ flag in OpaqueTensorImpl constructor (#49470)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49470

https://github.com/pytorch/pytorch/pull/48625 changes the default contiguous settings for `TensorImpl` causing the Vulkan backend to crash. Therefore, add argument that can set `is_non_overlapping_and_dense_` back to false for `OpaqueTensorImpl` constructor.

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D25592826

Pulled By: SS-JIA

fbshipit-source-id: e5d9de9a733875cb00c0546a3bc3271e5c6e23a3
2020-12-17 13:36:34 -08:00
eb131cf484 Revert D25105217: [pytorch][PR] Fix bad error message when int overflow
Test Plan: revert-hammer

Differential Revision:
D25105217 (c675727adf)

Original commit changeset: a5aa7c026694

fbshipit-source-id: ddb4c93f9317e1747def8842a8072c84776cd487
2020-12-17 11:59:39 -08:00
a727bf2851 Refactor RPC matchBuiltInOp to get rid of exception swallowing (#49009)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49009

As per the title, we should generally not have exception swallowing, and
this commit makes it so that if there is a true error in JIT operator
resolution, it is propagated back to the RPC callee and we don't silently
swallow any other exceptions that may happen. Swallowing the exceptions
previously resulted in hard to debug issues such as unexpected ops showing up
in profiler, and flaky tests which were fixed by
https://github.com/pytorch/pytorch/pull/41287

Added a unittest that validates the error that comes from `jit/pybind_utils.h`.
ghstack-source-id: 118794661

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D25392905

fbshipit-source-id: 6f93251635740bcf902824548b2bc6f9249be5f0
2020-12-17 11:37:21 -08:00
b8d98f05e7 [reland][quant][docs] Add fx graph mode quantization to quantization docs (#49211) (#49515)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49515

Test Plan:
Imported from OSS

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D25601061

fbshipit-source-id: 74e917d57895e9b4131a01fdcea8df3e94322bec
2020-12-17 10:30:10 -08:00
815d38395a PyLong_{As/From}{Long/UnsignedLong} lint checks (#49280)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45581

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49280

Reviewed By: mruberry

Differential Revision: D25592330

Pulled By: ezyang

fbshipit-source-id: 5c16d6aed88ad1feaa7f129b4cd44c0561be2de2
2020-12-17 09:32:08 -08:00
c20b916cbd [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D25609974

fbshipit-source-id: 4db8f8100336a2f0f2af8bc7b960d3711a5d1d7d
2020-12-17 05:32:07 -08:00
f5a26a554b [C2] Revive unsafe CoalesceOp (#49402)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49402

In the case of NCCLAllReduce operations there can be non-trivial overhead for
launching cooperative kernels (especially with async execution of
different parts of the model). This diff revives this operator to make it
possible to fuse multiple operations into a single kernel.

Test Plan:
Unit-test.
Used in a later diff.

Reviewed By: xianjiec

Differential Revision: D25531206

fbshipit-source-id: 64b1c161233a726f9e2868f1059316e42a8ea1fc
2020-12-17 04:31:29 -08:00
26974e6b28 Remove set_quantizer_ from native_functions.yaml (#49463)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49463

set_quantizer_ takes a ConstQuantizerPtr argument, which is neither supported by JIT nor by c10.
Also, it doesn't get dispatched (CPU and CUDA have the same implementation) and it is excluded from python bindings generation.
So there is no real reason why this needs to be in native_functions.yaml

Removing it unblocks the migration to c10-fullness since this is an op that would have been hard to migrate. See https://fb.quip.com/QRtJAin66lPN
ghstack-source-id: 118710663

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D25587763

fbshipit-source-id: 8fab921f4c256c128d48d82dac731f04ec9bad92
2020-12-17 03:28:00 -08:00
f5b68e74d7 Revert D25574962: [pytorch][PR] Updated derivative rules for complex svd and pinverse
Test Plan: revert-hammer

Differential Revision:
D25574962 (9955355853)

Original commit changeset: 832b61303e88

fbshipit-source-id: d73f77f3e51b0f535dad6d21c5bebf8d41a6bfbd
2020-12-17 00:59:43 -08:00
c18af03a41 [pt] fuse ClipRangesGatherSigridHash (#49181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49181

Fuse ClipRangesGatherSigridHash

Test Plan:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adindexer/merge/traced_merge_dper_fixes.pt --pt_inputs=/data/users/ansha/tmp/adindexer/merge/container_precomputation_bs1.pt --iters=30000 --warmup_iters=10000  --num_threads=1 --pred_net=/data/users/ansha/tmp/adindexer/precomputation_merge_net.pb --c2_inputs=/data/users/ansha/tmp/adindexer/merge/c2_inputs_precomputation_bs1.pb --c2_sigrid_transforms_opt=1 --c2_use_memonger=1 --c2_weights=/data/users/ansha/tmp/adindexer/merge/c2_weights_precomputation.pb --pt_enable_static_runtime --pt_cleanup_activations=true --pt_enable_out_variant=true --do_profile --compare_results
```

Verify op fused:
Node #3: 0.00104917 ms/iter, %173 : Tensor, %174 : Tensor = fb::clip_ranges_gather_sigrid_hash_offsets(%75, %76, %39, %40, %41, %38, %26)

Before: 0.0919786
After: 0.0911792

Reviewed By: hlu1

Differential Revision: D25468225

fbshipit-source-id: 36bd91c140eaa57cb42cdaad46d878b94f162a9d
2020-12-17 00:42:46 -08:00
26e076d19e Adding fix for invalid annotation types for dictionary (#49425)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49362

**Summary:**
This PR fixes the issue where invalid annotation types are used for a dictionary.
An "Unsupported" assertion message is now generated for all invalid annotations.

**Test Case**:
python test/test_jit.py TestJit.test_dict_invalid_annotations

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49425

Reviewed By: navahgar

Differential Revision: D25601578

Pulled By: nikithamalgifb

fbshipit-source-id: 91633e3d0891bdcb5402f044a74d02fe352ecd6f
2020-12-17 00:28:29 -08:00
65876d3f51 Change aten::native_layer_norm signature to match torch.layer_norm definition (#48971)
Summary:
This PR changes the `aten::native_layer_norm` and `aten::native_layer_norm_backward` signatures to match the `torch.layer_norm` definition. The current definition doesn't provide enough information for the PyTorch JIT to fuse layer_norm during training.

`native_layer_norm(X, gamma, beta, M, N, eps)` =>
`native_layer_norm(input, normalized_shape, weight, bias, eps)`

`native_layer_norm_backward(dY, X, mean, rstd, gamma, M, N, grad_input_mask)` =>
`native_layer_norm_backward(dY, input, normalized_shape, mean, rstd, weight, bias, grad_input_mask)`
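For reference, the public entry point that the new signatures mirror:

```
import torch
import torch.nn.functional as F

x = torch.randn(2, 5, 10)
w, b = torch.ones(10), torch.zeros(10)
out = F.layer_norm(x, normalized_shape=(10,), weight=w, bias=b, eps=1e-5)
```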

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48971

Reviewed By: izdeby

Differential Revision: D25574070

Pulled By: ngimel

fbshipit-source-id: 23e2804295a95bda3f1ca6b41a1e4c5a3d4d31b4
2020-12-16 23:09:18 -08:00
2ea1d97e3b Add BFloat16 support for isinf and isfinite (#49356)
Summary:
Also fix some tests.
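A quick check:

```
import torch

t = torch.tensor([1.0, float("inf"), float("nan")], dtype=torch.bfloat16)
print(torch.isinf(t))     # tensor([False,  True, False])
print(torch.isfinite(t))  # tensor([ True, False, False])
```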

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49356

Reviewed By: mruberry

Differential Revision: D25604364

Pulled By: ngimel

fbshipit-source-id: 9efdd83aaa96cacc66e9689db9f9d8c24175a693
2020-12-16 22:36:14 -08:00
ede0b169ea [quant][be] Add typing for quantization_mappings.py (#49179)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49179

Test Plan: Imported from OSS

Reviewed By: vkuzo, wat3rBro

Differential Revision: D25470520

fbshipit-source-id: 16e35fec9a5f3339860bd2305ae8ffdd8e2dfaf7
2020-12-16 21:36:00 -08:00
4edaf4d759 Bring back math_silu_backward which works for all backends. (#49439)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49439

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb, ngimel

Differential Revision: D25594129

Pulled By: ailzhang

fbshipit-source-id: 627bbea9ba478ee3a8edcc6695abab6431900192
2020-12-16 21:06:12 -08:00
6230e337d5 Add torch._foreach_zero_ API (#47286)
Summary:
**In this PR**
- add `_foreach_zero_` API
- Update all optimizers under /_multi_tensor/ to use `_foreach_zero_` in `zero_grad` method

Performance improvement
----------------- OP:  zero_  -----------------
for-loop: 630.36 us
foreach: 90.84 us

script

```
import torch
import torch.optim as optim
import torch.nn as nn
import torchvision
import torch.utils.benchmark as benchmark_utils

inputs = [torch.rand(3, 200, 200, device="cuda") for _ in range(100)]

def main():
    for op in [
            "zero_"
        ]:
        print("\n\n----------------- OP: ", op, " -----------------")
        stmt = "[torch.{op}(t) for t in inputs]"
        timer = benchmark_utils.Timer(
            stmt=stmt.format(op = op),
            globals=globals(),
            label="str(optimizer)",
        )
        print(f"autorange:\n{timer.blocked_autorange()}\n\n")

        stmt = "torch._foreach_{op}(inputs)"
        timer_mta = benchmark_utils.Timer(
            stmt=stmt.format(op = op),
            globals=globals(),
            label="str(optimizer_mta)",
        )
        print(f"autorange:\n{timer_mta.blocked_autorange()}\n\n")

if __name__ == "__main__":
    main()

```
**TODO**
- Refactor zero_grad once foreach APIs are stable.

**Tested** via unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47286

Reviewed By: ngimel

Differential Revision: D24706240

Pulled By: izdeby

fbshipit-source-id: aac69d6d134d65126ae8e5916f3627b73d8a94bf
2020-12-16 20:04:25 -08:00
4ce2b0b0ac Set caffe2::pthreadpool() size in ParallelOpenMP (#45566)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/45418.

This is probably not the best solution, but it's a rebase of the solution we're considering until https://github.com/pytorch/pytorch/issues/45418 is solved. If you can outline a better one I'm willing to implement it (:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45566

Reviewed By: ezyang

Differential Revision: D24621568

Pulled By: glaringlee

fbshipit-source-id: 89dad5c61d8b5c26984d401551a1fe29df1ead04
2020-12-16 19:53:08 -08:00
db2ecefc01 [reland] Support torch.distributed.irecv(src=None, ...) (#49383)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49383

Reland of https://github.com/pytorch/pytorch/pull/47137
ghstack-source-id: 118735407
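
A minimal sketch of the newly supported call (assumes an initialized process group and a peer that sends; setup omitted):

```
import torch
import torch.distributed as dist

buf = torch.zeros(4)
work = dist.irecv(buf)  # src omitted (None): receive from any rank
work.wait()
```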

Test Plan: waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D25551910

fbshipit-source-id: 2e1f2f77e7c69204056dfe6ed178e8ad7650ab32
2020-12-16 19:39:23 -08:00
df2337097d add files to SLOW_TESTS for target determinator (#49500)
Summary:
- test_torch was split into 6 files in https://github.com/pytorch/pytorch/issues/47356.
- also, test_linalg has 10 slowtest markings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49500

Reviewed By: ezyang, malfet

Differential Revision: D25598085

Pulled By: walterddr

fbshipit-source-id: 74b0b433897721db86c00e236d1dd925d7a6d3d0
2020-12-16 19:10:56 -08:00
82ac6c75af fx quant: make sure observer is inserted before a quantized output (#49420)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49420

Before: even if an output was marked as quantized, it could end up
unquantized if the previous node was not quantized.

After: if an output is marked as quantized, it will be quantized
regardless of the quantization status of the previous node.

Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps.test_quant_output_always_observed
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25566834

fbshipit-source-id: 84755a1605fd3847edd03a7887ab9f635498c05c
2020-12-16 18:53:37 -08:00
84506e0316 fx quant: fix fq when input is quantized and node does not need fq (#49382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49382

Fixes an edge case: if the input to the graph is quantized and the
first node does not need activation observation, this change makes sure
that the observer is not inserted.

Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps.test_int8_input_no_unnecessary_fq
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25551041

fbshipit-source-id: a6cba235c63ca7f6856e4128af7c1dc7fa0085ea
2020-12-16 18:53:33 -08:00
7542076097 fx quant: do not insert observers at quantized inputs (#49239)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49239

Context: the existing implementation of `input_quantized_idxs` is convert-only.
Therefore, observers are inserted between the input and the first
quantized node.  This is a problem during QAT, because the initial
input is a fake_quant, and it starts with scale=1 and zp=0.  This does
not match the quantization parameters of the graph input, which can
lead to incorrect numerics.

Fix: do not insert observer for a quantized input.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25499486

fbshipit-source-id: 303b49cc9d95a9fd06fef3b0859c08be34e19d8a
2020-12-16 18:53:30 -08:00
92df8706a0 fx quant: move {input|output}_quantized_idxs cfg from convert to prepare (#49238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49238

Moves the `input_quantized_idxs` and `output_quantized_idxs` options
from the convert config to the prepare config.  This is done because
these operations are related to placing observers, which is numerics
changing during QAT.

The next PR will adjust the behavior of `input_quantized_idxs` in
prepare in QAT to prevent placing a fake_quant at the input if the
input is marked quantized.  Placing a fake_quant there can lead to
numerical inaccuracies during calibration, as it would start with
scale=1 and zp=0, which may be different from the quantization
parameters of the incoming quantized input.
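
A hedged sketch of the prepare-time usage after this move (the dict keys and imports follow the FX graph mode quantization API of this era; treat exact names as assumptions):

```
import torch
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx

model = torch.nn.Sequential(torch.nn.Linear(4, 4)).eval()
qconfig_dict = {"": get_default_qconfig("fbgemm")}

# Declaring input 0 as already quantized at prepare time means no
# observer/fake_quant is placed at that graph input.
custom_config = {"input_quantized_idxs": [0]}
prepared = prepare_fx(model, qconfig_dict, prepare_custom_config_dict=custom_config)
```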

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25498762

fbshipit-source-id: 17ace8f803542155652b310e5539e1882ebaadc6
2020-12-16 18:53:27 -08:00
36b20923ba eager quant: remove fake_quant after add/mul nodes during QAT (#49213)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49213

Changes behavior of Eager mode quantization to remove observation after add_scalar/mul_scalar.
This is not used, and it removes one difference between Eager and FX modes.

Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps.test_quantized_add_qat
python test/test_quantization.py TestQuantizeFxOps.test_quantized_mul_qat
python test/test_quantization.py TestQuantizationAwareTraining.test_add_scalar_uses_input_qparams
python test/test_quantization.py TestQuantizationAwareTraining.test_mul_scalar_uses_input_qparams
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25486276

fbshipit-source-id: 34a5d6ce0d08739319ec0f8b197cfc1309d71040
2020-12-16 18:50:11 -08:00
904586271b Add fusion support of aten::to (#48976)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48976

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D25413164

Pulled By: eellison

fbshipit-source-id: 0c31787e8b5e1368b0cba6e23660799b652389cd
2020-12-16 18:36:16 -08:00
80b508f207 [NNC] add support for masked_fill (#48974)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48974

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D25413165

Pulled By: eellison

fbshipit-source-id: 8cece1dc3692389be90c0d77bd71b103254d5ad3
2020-12-16 18:36:13 -08:00
50386b9988 [NNC] Add Support For is_nan (#48973)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48973

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D25413166

Pulled By: eellison

fbshipit-source-id: 0c79258345df18c60a862373fa16931228fb92ef
2020-12-16 18:31:01 -08:00
60b4c40101 [extensions] fix is_ninja_available during cuda extension building (#49443)
Summary:
tldr: the current version of `is_ninja_available` in `torch/utils/cpp_extension.py` fails to run under recent incarnations of pip with the new build-isolation feature, which is now the default. This PR fixes this problem.

The full story follows:

--------------------------

Currently, trying to build https://github.com/facebookresearch/fairscale/, which builds CUDA extensions, fails with recent pip versions. The build fails in `is_ninja_available`, which runs `ninja --version` in a simple subprocess but redirects its output to /dev/null, and that override seems to break under the new pip versions. Currently I have `pip==20.3.3`. Recent pip performs build isolation: it first fetches all dependencies to somewhere under /tmp/pip-install-xyz and then builds the package.

If I build:

```
pip install fairscale --no-build-isolation
```
everything works.

When building normally (i.e. without `--no-build-isolation`), the failure is a long long trace,
<details>
<summary>Full log</summary>
<pre>
pip install fairscale
Collecting fairscale
  Downloading fairscale-0.1.1.tar.gz (83 kB)
     |████████████████████████████████| 83 kB 562 kB/s
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  ERROR: Command errored out with exit status 1:
   command: /home/stas/anaconda3/envs/main-38/bin/python /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py get_requires_for_build_wheel /tmp/tmpjvw00c7v
       cwd: /tmp/pip-install-1wq9f8fp/fairscale_347f218384a64f24b8d5ce846641213e
  Complete output (55 lines):
  running egg_info
  writing fairscale.egg-info/PKG-INFO
  writing dependency_links to fairscale.egg-info/dependency_links.txt
  writing requirements to fairscale.egg-info/requires.txt
  writing top-level names to fairscale.egg-info/top_level.txt
  Traceback (most recent call last):
    File "/home/stas/anaconda3/envs/main-38/bin/ninja", line 5, in <module>
      from ninja import ninja
  ModuleNotFoundError: No module named 'ninja'
  Traceback (most recent call last):
    File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py", line 280, in <module>
      main()
    File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py", line 263, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py", line 114, in get_requires_for_build_wheel
      return hook(config_settings)
    File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 149, in get_requires_for_build_wheel
      return self._get_build_requires(
    File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 130, in _get_build_requires
      self.run_setup()
    File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 145, in run_setup
      exec(compile(code, __file__, 'exec'), locals())
    File "setup.py", line 56, in <module>
      setuptools.setup(
    File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/setuptools/__init__.py", line 153, in setup
      return distutils.core.setup(**attrs)
    File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/core.py", line 148, in setup
      dist.run_commands()
    File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/dist.py", line 966, in run_commands
      self.run_command(cmd)
    File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/setuptools/command/egg_info.py", line 298, in run
      self.find_sources()
    File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/setuptools/command/egg_info.py", line 305, in find_sources
      mm.run()
    File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/setuptools/command/egg_info.py", line 536, in run
      self.add_defaults()
    File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/setuptools/command/egg_info.py", line 572, in add_defaults
      sdist.add_defaults(self)
    File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/command/sdist.py", line 228, in add_defaults
      self._add_defaults_ext()
    File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/command/sdist.py", line 311, in _add_defaults_ext
      build_ext = self.get_finalized_command('build_ext')
    File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/cmd.py", line 298, in get_finalized_command
      cmd_obj = self.distribution.get_command_obj(command, create)
    File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/dist.py", line 858, in get_command_obj
      cmd_obj = self.command_obj[command] = klass(self)
    File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 351, in __init__
      if not is_ninja_available():
    File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1310, in is_ninja_available
      subprocess.check_call('ninja --version'.split(), stdout=devnull)
    File "/home/stas/anaconda3/envs/main-38/lib/python3.8/subprocess.py", line 364, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['ninja', '--version']' returned non-zero exit status 1.
  ----------------------------------------
ERROR: Command errored out with exit status 1: /home/stas/anaconda3/envs/main-38/bin/python /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py get_requires_for_build_wheel /tmp/tmpjvw00c7v Check the logs for full command output.
</pre>

</details>

and the middle of it is what we want:

```
    File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 351, in __init__
      if not is_ninja_available():
    File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1310, in is_ninja_available
      subprocess.check_call('ninja --version'.split(), stdout=devnull)
    File "/home/stas/anaconda3/envs/main-38/lib/python3.8/subprocess.py", line 364, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['ninja', '--version']' returned non-zero exit status 1.
```

For some reason pytorch fails to run this simple code:

```
# torch/utils/cpp_extension.py
def is_ninja_available():
    r'''
    Returns ``True`` if the `ninja <https://ninja-build.org/>`_ build system is
    available on the system, ``False`` otherwise.
    '''
    with open(os.devnull, 'wb') as devnull:
        try:
            subprocess.check_call('ninja --version'.split(), stdout=devnull)
        except OSError:
            return False
        else:
            return True
```

I suspect that pip does something to `os.devnull` and that's why it fails.

This PR proposes a simpler code which doesn't rely on anything but `subprocess.check_output`:

```
def is_ninja_available():
    r'''
    Returns ``True`` if the `ninja <https://ninja-build.org/>`_ build system is
    available on the system, ``False`` otherwise.
    '''
    try:
        subprocess.check_output('ninja --version'.split())
    except Exception:
        return False
    else:
        return True
```

which doesn't use `os.devnull` and performs the same function. There could be a whole bunch of different exceptions there I think, so I went for the generic one - we don't care why it failed, since this function's only purpose is to suggest whether ninja can be used or not.

Let's check

```
python -c "import torch.utils.cpp_extension; print(torch.utils.cpp_extension.is_ninja_available())"
True
```

Look ma - no std noise to take care of. (i.e. no need for /dev/null).

I was editing the installed environment-wide `cpp_extension.py` file directly, so I didn't need to tweak `PYTHONPATH`. I made sure to replace `'ninja --version'` with something that should fail, and I did get `False` for the above command line.

I next did a somewhat elaborate cheat to re-package an already existing binary wheel with this corrected version of `cpp_extension.py`, rather than building from source:
```
mkdir /tmp/pytorch-local-channel
cd /tmp/pytorch-local-channel

# get the latest nightly wheel
wget https://download.pytorch.org/whl/nightly/cu110/torch-1.8.0.dev20201215%2Bcu110-cp38-cp38-linux_x86_64.whl

# unpack it
unzip torch-1.8.0.dev20201215+cu110-cp38-cp38-linux_x86_64.whl

# edit torch/utils/cpp_extension.py to fix the python code with the new version as in this PR
emacs torch/utils/cpp_extension.py &

# pack the files back
zip -r torch-1.8.0.dev20201215+cu110-cp38-cp38-linux_x86_64.whl caffe2 torch torch-1.8.0.dev20201215+cu110.dist-info
```

Now I tell pip to use my local channel, plus `--pre` for it to pick up the pre-release as an acceptable wheel
```
# install using this local channel
git clone https://github.com/facebookresearch/fairscale/
cd fairscale
pip install -v --disable-pip-version-check -e . -f file:///tmp/pytorch-local-channel --pre
```
and voila all works.

```
[...]
Successfully installed fairscale
```

I noticed a whole bunch of ninja-not-found errors in the log. I think this is the same problem in other build-system packages, which also use this old check, copied all over various projects and build tools, and which the recent pip breaks.

```
    writing manifest file '/tmp/pip-modern-metadata-_nsdesbq/fairscale.egg-info/SOURCES.txt'
    Traceback (most recent call last):
      File "/home/stas/anaconda3/envs/main-38/bin/ninja", line 5, in <module>
        from ninja import ninja
    ModuleNotFoundError: No module named 'ninja'
    [...]
    /tmp/pip-build-env-fqflyevr/overlay/lib/python3.8/site-packages/torch/utils/cpp_extension.py:364: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
      warnings.warn(msg.format('we could not find ninja.'))
```

but these don't prevent the build from completing and installing.

I suppose these need to be identified and reported to various other projects, but that's another story.

The new pip seems to do something to `os.devnull` that breaks any code relying on it. I haven't tried to figure out what happens to that stream object, but this PR, which removes its usage, solves the problem.

Also do notice that:

```
git clone https://github.com/facebookresearch/fairscale/
cd fairscale
python setup.py bdist_wheel
pip install dist/fairscale-0.1.1-cp38-cp38-linux_x86_64.whl
```
works too. So it is really a pip issue.

Apologies if the notes are too many; I tried to give the complete picture, and other projects will probably need these details as well.

Thank you for reading.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49443

Reviewed By: mruberry

Differential Revision: D25592109

Pulled By: ezyang

fbshipit-source-id: bfce4420c28b614ead48e9686f4153c6e0fbe8b7
2020-12-16 18:02:11 -08:00
d409da0677 Fix CUDA extension ninja build (#49344)
Summary:
I am submitting this PR on behalf of Janne Hellsten (nurpax) from NVIDIA, for the convenience of the CLA. Thanks Janne a lot for the contribution!

Currently, the ninja build decides whether to rebuild a .cu file more or less at random, and there are actually two issues:

First, the arch list in the build command is ordered randomly. When the order changes, ninja will rebuild unconditionally regardless of the timestamp.

Second, the header files are not included in the dependency list, so if a header file changes, it is possible that ninja will not rebuild.

This PR fixes both issues. The fix for the second issue requires nvcc >= 10.2. nvcc < 10.2 can still build CUDA extensions as before, but it will be unable to see changes in header files.
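
A hedged sketch of the two ingredients of the fix (the nvcc flag names assume nvcc >= 10.2; `$in`/`$out` are ninja placeholders):

```
# Deterministic arch-flag order avoids spurious full rebuilds caused by a
# command line that changes only in ordering.
archs = {"70", "75", "80"}
arch_flags = ["-gencode=arch=compute_%s,code=sm_%s" % (a, a) for a in sorted(archs)]

# Header tracking: ask nvcc to emit a depfile that ninja can consume.
dep_flags = ["--generate-dependencies-with-compile", "--dependency-output", "$out.d"]
nvcc_cmd = ["nvcc"] + arch_flags + dep_flags + ["-c", "$in", "-o", "$out"]
```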

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49344

Reviewed By: glaringlee

Differential Revision: D25540157

Pulled By: ezyang

fbshipit-source-id: 197541690d7f25e3ac5ebe3188beb1f131a4c51f
2020-12-16 17:45:12 -08:00
1c6e179b38 Relax the atol/rtol of layernorm math kernel test. (#49507)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49507

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D25598424

Pulled By: ailzhang

fbshipit-source-id: b3f43e84f177cf7c14831b0b83a399b155c813c4
2020-12-16 17:37:51 -08:00
c675727adf Fix bad error message when int overflow (#48250)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48114

Before:
```
>>> torch.empty(2 * 10 ** 20)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: empty(): argument 'size' must be tuple of ints, but found element of type int at pos 1
```

After fix:
```
>>> torch.empty(2 * 10 ** 20)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Overflow when unpacking long
```

Unclear whether we need a separate test for this case; I can add one if necessary...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48250

Reviewed By: linbinyu

Differential Revision: D25105217

Pulled By: ezyang

fbshipit-source-id: a5aa7c0266945c8125210a2fd34ce4b6ba940c92
2020-12-16 17:30:45 -08:00
a5cc0a6f4c .circleci: Only downgrade if we have conda (#49519)
Summary:
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49519

Reviewed By: robieta

Differential Revision: D25603779

Pulled By: seemethere

fbshipit-source-id: ca8d811925762a5a413ca906d94c974a4ac5b132
2020-12-16 17:14:17 -08:00
872f6486b1 Prevent accidentally writing old style ops (#49510)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49510

Adding old style operators with out arguments will break XLA. This prevents that. See for background: https://fb.workplace.com/groups/pytorch.dev/permalink/809934446251704/

This is a temporary change that will prevent this breakage for the next couple of days until the problem is resolved for good.
It will be deleted in https://github.com/pytorch/pytorch/pull/49164 then.
ghstack-source-id: 118756437

(Note: this ignores all push blocking failures!)

Test Plan: waitforsandcastle

Reviewed By: bhosmer

Differential Revision: D25599112

fbshipit-source-id: 6b0ca4da4b55da8aab9d1b332cd9f68e7602301e
2020-12-16 16:34:49 -08:00
9056173acc [NNC] Dont inline outputs buffers on cpu (#49488)
Summary:
In https://github.com/pytorch/pytorch/pull/48967/ we enabled output buffer inlining, which results in duplicate computation if one output depends on another. This was done to fix correctness for CUDA, but it is not needed for correctness on CPU and results in a perf slowdown.

The output buffer inlining solution for CUDA is intended to be an interim solution because it does not work with reductions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49488

Reviewed By: ezyang

Differential Revision: D25596071

Pulled By: eellison

fbshipit-source-id: bc3d987645da5ce3c603b4abac3586b169656cfd
2020-12-16 16:28:25 -08:00
47c65f8223 Revert D25569586: stft: Change require_complex warning to an error
Test Plan: revert-hammer

Differential Revision:
D25569586 (5874925b46)

Original commit changeset: 09608088f540

fbshipit-source-id: 6a5953b327a4a2465b046e29bb007a0c5f4cf14a
2020-12-16 16:21:52 -08:00
3efd5d8f01 Introduce tools.codegen.api.translate (#49122)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49122

cpparguments_exprs has induced a lot of head scratching in many recent PRs about how to structure the code well. This PR replaces the old algorithm with an entirely new algorithm inspired by logic programming. The net result is shorter, cleaner and should be more robust to future changes.

This PR is a bit of a whopper.  Here is the order to review it.

- tools/codegen/api/types.py
  - Deleted CppArgument, CppArgumentPackIface (and subclasses), CppExpr, DispatcherExpr, DispatcherArgument, NativeExpr, NativeArgument, MetaArgument. All things previously called XArgument are now Binding. All things previously called XExpr are now Expr. I deleted the `__str__` implementation on Binding and fixed all call sites not to use it. On Binding, I renamed `str_no_default` and `str_default` to `defn` and `decl` for better symmetry with the corresponding signature concepts, although I'm open to naming them back to their original versions.
  - Obviously, things are less type safe without the class distinctions. So I introduce a new ADT called CType. CType represents the *semantic C++ type* of a binding: it is both the C++ type (e.g., `const Tensor&`) as well as the argument name that specifies what the  binding denotes (e.g., `other`). Every binding now records its CType. The key observation here is that you don't actually care if a given expression is from the cpp or dispatcher or native API; what you care is having enough information to know what the expression means, so you can use it appropriately. CType has this information. For the most part, ArgNames are just the string names of the arguments as you see them in JIT schema, but there is one case (`possibly_redundant_memory_format`) where we encode a little extra information. Unlike the plain strings we previously used to represent C++ types, CType have a little bit of structure around optional and references, because the translation code needs to work around these concepts.
  - I took the opportunity to kill all of the private fields like `_arguments` and `_returns_type` (since the argument types don't make sense anymore). Everything is computed for you on the fly. If this is a perf problem in codegen we can start using `cached_property` decorator.
  - All of the heavy lifting in CppSignature.argument_packs has been moved to the cpp module. We'll head over there next. Similarly, all of the exprs methods are now calling translate, the new functionality which we haven't gotten to yet
- tools/codegen/api/cpp.py
   - We refactor all of the type computation functions to return CType instead of str. Because CTypes need to know the denotation, there is a new `binds: ArgName` argument to most functions that provides the denotation, so we can slot it in. (An alternative would have been to construct CTypes without denotations and then fill them in post-facto, but I didn't do it this way. One downside is there are some places where I need a CType without denotation, so I fill these in with `__placeholder__` whenever this happens).
  - `argument` and `arguments` are now extremely simple. There is no more Pack business, just produce one or more Bindings. The one thing of note is that when both a `memory_format` and `options` are in scope, we label the memory format as `possibly_redundant_memory_format`. This will be used in translation
- tools/codegen/api/dispatcher.py and tools/codegen/api/native.py - same deal as cpp.py. One thing is that `cpparguments_exprs` is deleted; that is in the translator
- tools/codegen/api/translate.py - the translator! It uses a very simple backwards deduction engine to work out how to fill in the arguments of functions. There are comments in the file that explain how it works. A toy sketch of the search idea follows this list.
- Everything else: just some small call-site tweaks for places where I changed the API.
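
A toy model (not the real tools/codegen classes) of that backward deduction: to produce a C++ expression of a goal CType, search the in-scope bindings for one whose semantic type matches the goal.

```
from dataclasses import dataclass

@dataclass(frozen=True)
class CType:
    cpp_type: str   # e.g. "const Tensor&"
    name: str       # denotation, e.g. "self"

@dataclass(frozen=True)
class Binding:
    ctype: CType
    expr: str       # C++ expression carrying this semantic type

def translate(ctx, goal):
    for b in ctx:
        if b.ctype == goal:
            return b.expr
    raise RuntimeError("cannot deduce an expression for %r" % (goal,))

ctx = [Binding(CType("const Tensor&", "self"), "self")]
print(translate(ctx, CType("const Tensor&", "self")))  # -> self
```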

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D25455887

Pulled By: ezyang

fbshipit-source-id: 90dc58d420d4cc49281aa8647987c69f3ed42fa6
2020-12-16 16:18:40 -08:00
f66147ebca BFloat16: add explicit dtype support for to_mkldnn and to_dense (#48881)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48881

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25537190

Pulled By: VitalyFedyunin

fbshipit-source-id: a61a433c638e2e95576f88f081b64ff171b2316e
2020-12-16 16:09:42 -08:00
6f928a4a53 [PyTorch] Make tls_local_dispatch_key_set inlineable (reapply) (#49412)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49412

FLAGS_disable_variable_dispatch had to go, but it looks like the only user was some benchmarks anyway.
ghstack-source-id: 118669590

Test Plan:
Small (order of 0.1% improvement) on Internal benchmarks. Wait for
GitHub CI since this was reverted before due to CI break

Reviewed By: ezyang

Differential Revision: D25547962

fbshipit-source-id: 58424b1da230fdc5d27349af762126a5512fce43
2020-12-16 16:04:35 -08:00
953f9922ec [PyTorch] Use .sizes() instead of .size() in cat_serial_kernel_impl (#49371)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49371

As with the previous diff, .sizes() is strictly more efficient.
ghstack-source-id: 118627223

Test Plan: internal benchmark

Differential Revision: D25546409

fbshipit-source-id: 196034716b6e11efda1ec8cb1e0fce7732d73eb4
2020-12-16 16:04:32 -08:00
c1879b573e [PyTorch] Use .sizes() instead of .size() in _cat_out_cpu (#49368)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49368

The former is faster because it doesn't allow negative indexing (which we don't use).
ghstack-source-id: 118624598

Test Plan: internal benchmark

Reviewed By: hlu1

Differential Revision: D25545777

fbshipit-source-id: b2714fac95c801fd735fac25b238b4a79b012993
2020-12-16 16:04:29 -08:00
1a0510463a [PyTorch] Avoid extra Tensor refcounting in _cat_out_cpu (#49364)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49364

We had a local `Tensor` when we only needed a `const Tensor&`.
ghstack-source-id: 118624595

Test Plan: Internal benchmark.

Reviewed By: hlu1

Differential Revision: D25544731

fbshipit-source-id: 7b9656d0371ab65a6313cb0ad4aa1df707884c1c
2020-12-16 16:04:26 -08:00
9ce1df079f [PyTorch] Merge CoinflipTLS into RecordFunctionTLS (#49359)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49359

This should be both slightly more efficient (1 less TLS guard
check in at::shouldRunRecordFunction) and definitely more correct
(CoinflipTLS is now saved whenever RecordFunctionTLS is saved), fixing
a bad merge that left RecordFunctionTLS::tries_left dead.
ghstack-source-id: 118624402

Test Plan: Review, CI

Reviewed By: hlu1

Differential Revision: D25542799

fbshipit-source-id: 310f9fd157101f659cea13c331b2a0ee6db2db88
2020-12-16 16:00:49 -08:00
6bde0ca6d3 T66557700 Support default argument values of a method (#48863)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48863

Support default arguments when invoking a module via PyTorch Lite (`mobile::Module`).

Test Plan:
buck test mode/dbg //caffe2/test/cpp/jit:jit -- LiteInterpreterTest.MethodInvocation

buck test mode/dbg caffe2/test:mobile -- test_method_calls_with_optional_arg

Reviewed By: raziel, iseeyuan

Differential Revision: D25152559

fbshipit-source-id: bbf52f1fbdbfbc6f8fa8b65ab524b1cd4648f9c0
2020-12-16 15:55:03 -08:00
d0fb55454b Refine ConvParams::use_nnpack() (#49464)
Summary:
NNPACK convolution algorithms can only be used for kernels up to 16x16

Fixes https://github.com/pytorch/pytorch/issues/49462

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49464

Reviewed By: xuzhao9

Differential Revision: D25587879

Pulled By: malfet

fbshipit-source-id: 658197f23c08cab97f0849213ecee3f91f96c932
2020-12-16 15:42:43 -08:00
399b07a8f9 Add note to torch docs for sinh/cosh (#49413)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/48641

Documents the behavior of sinh and cosh in the edge cases
```
>>> b = torch.full((15,), 89, dtype=torch.float32)
>>> torch.sinh(b)
tensor([2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38,
        2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38,
        2.2448e+38, 2.2448e+38, 2.2448e+38])
>>> b = torch.full((16,), 89, dtype=torch.float32)
>>> torch.sinh(b)
tensor([inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf])
>>> b = torch.full((17,), 89, dtype=torch.float32)
>>> torch.sinh(b)
tensor([       inf,        inf,        inf,        inf,        inf,        inf,
               inf,        inf,        inf,        inf,        inf,        inf,
               inf,        inf,        inf,        inf, 2.2448e+38])
>>> b = torch.full((32,), 89, dtype=torch.float32)[::2]
>>> torch.sinh(b)
tensor([2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38,
        2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38,
        2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38])
```

See https://sleef.org/purec.xhtml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49413

Reviewed By: ezyang

Differential Revision: D25587932

Pulled By: soulitzer

fbshipit-source-id: 6db75c45786f4b95f82459d0ce5efa37ec0774f0
2020-12-16 14:51:08 -08:00
f0217e2f52 Fix link in distributed contributing doc and add link (#49141)
Summary:
One of the links for ramp-up tasks wasn't showing any results and the other showed only RPC results. Instead, I just changed it to one link that has `pt_distributed_rampup`, which seems reasonable since the developer will be able to see both RPC and distributed tasks.

Also added test command for DDP tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49141

Reviewed By: ezyang

Differential Revision: D25597560

Pulled By: rohan-varma

fbshipit-source-id: 85d7d2964a19ea69fe149c017cf88dff835b164a
2020-12-16 14:38:56 -08:00
676bfa6dbd Revert D25507480: [quant][docs] Add fx graph mode quantization to quantization docs
Test Plan: revert-hammer

Differential Revision:
D25507480 (7729581414)

Original commit changeset: 9e9e4b5fef97

fbshipit-source-id: fdb08d824209b97defaba2e207d1a914575a6ae7
2020-12-16 14:26:18 -08:00
09173ae65e Allow zero annealing epochs (#47579)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47578.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47579

Reviewed By: H-Huang

Differential Revision: D25429403

Pulled By: vincentqb

fbshipit-source-id: c42fbcd71b46e07c672a1e9661468848ac16de38
2020-12-16 14:09:43 -08:00
4431731c68 Making ops c10-full: Storage arguments (#49146)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49146

Add support for Storage arguments to IValue and the JIT typing system, and make ops that were blocked on that c10-full.
ghstack-source-id: 118710665

(Note: this ignores all push blocking failures!)

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D25456799

fbshipit-source-id: da14f125af352de5fcf05a83a69ad5a69d5a3b45
2020-12-16 14:00:34 -08:00
7767dcfc8d Revert D25564477: [pytorch][PR] Add sinc operator
Test Plan: revert-hammer

Differential Revision:
D25564477 (bbc71435b7)

Original commit changeset: 13f36a2b84da

fbshipit-source-id: 58cbe8109efaf499dd017531878b9fbbb27976bc
2020-12-16 13:19:16 -08:00
5874925b46 stft: Change require_complex warning to an error (#49022)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49022

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25569586

Pulled By: mruberry

fbshipit-source-id: 09608088f540c2c3fc70465f6a23f2aec5f24f85
2020-12-16 12:47:56 -08:00
7729581414 [quant][docs] Add fx graph mode quantization to quantization docs (#49211)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49211

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D25507480

fbshipit-source-id: 9e9e4b5fef979f5621c1bbd1b49e9cc6830da617
2020-12-16 12:40:02 -08:00
9955355853 Updated derivative rules for complex svd and pinverse (#47761)
Summary:
Updated `svd_backward` to work correctly for complex-valued inputs.
Updated `common_methods_invocations.py` to take dtype, device arguments for input construction.
Removed `test_pinverse` from `test_autograd.py`, it is replaced by entries to `common_methods_invocations.py`.
Added `svd` and `pinverse` to list of complex tests.

References for complex-valued SVD differentiation:

- https://giggleliu.github.io/2019/04/02/einsumbp.html
- https://arxiv.org/abs/1909.02659

The derived rules assume gauge invariance of loss functions, so the result would not be correct for loss functions that are not gauge invariant.
https://re-ra.xyz/Gauge-Problem-in-Automatic-Differentiation/

The same rule is implemented in Tensorflow and [BackwardsLinalg.jl](https://github.com/GiggleLiu/BackwardsLinalg.jl).

Ref. https://github.com/pytorch/pytorch/issues/33152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47761

Reviewed By: izdeby

Differential Revision: D25574962

Pulled By: mruberry

fbshipit-source-id: 832b61303e883ad3a451b84850ccf0f36763a6f6
2020-12-16 12:32:22 -08:00
39a23c797b Add docs/README.md to make existing doc build info more discoverable (#49286)
Summary:
Closes gh-42003

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49286

Reviewed By: glaringlee

Differential Revision: D25535250

Pulled By: ezyang

fbshipit-source-id: a7790bfe4528fa6a31698126cc687793fdf7ac3f
2020-12-16 11:55:45 -08:00
6f814d45aa Update TensorPipe submodule (#49467)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49467

Credit to beauby for the Bazel fixes.

Test Plan: Export and run on CI

Reviewed By: beauby

Differential Revision: D25588027

fbshipit-source-id: efe1c543eb7438ca05254de67cf8b5cee625119a
2020-12-16 11:33:17 -08:00
2ec3e803eb Update accumulate_grad to support vmap (#49119)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49119

I don't know how the accumulate_grad code gets hit via calling
autograd.grad, so I went through all places in accumulate_grad
that are definitely impossible to vmap through and changed them.

To support this:
- I added vmap support for Tensor::strides(). It returns the strides
that correspond to the public dimensions of the tensor (not the ones
being vmapped over).
- Changed an instance of empty_strided to new_empty_strided.
- Replaced an in-place operation in accumulate_grad.h

Test Plan:
- added a test for calling strides() inside of vmap
- added tests that exercise all of the accumulate_grad code path.
NB: I don't know why these tests exercise the code paths, but I've
verified that they do via gdb.

Suggestions for some saner test cases are very welcome.

Reviewed By: izdeby

Differential Revision: D25563543

Pulled By: zou3519

fbshipit-source-id: 05ac6c549ebd447416e6a07c263a16c90b2ef510
2020-12-16 11:30:16 -08:00
f98d8c6237 Move inplace_is_vmap_compatible to BatchedTensorImpl.h (#49118)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49118

I need this in the next stack up. It seems useful to have as a helper
function.

Test Plan: - run tests

Reviewed By: izdeby

Differential Revision: D25563546

Pulled By: zou3519

fbshipit-source-id: a4031fdc4b2373cc230ba3c66738d91dcade96e2
2020-12-16 11:30:13 -08:00
1b6d18aa7c Adding support for CuDNN-based LSTM with projections (#47725)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46213

I didn't yet update the documentation, will add those change soon. A few other things that I didn't do, but want to clarify if I maybe should.

1. I didn't expose projections in c++ API: torch/csrc/api/src/nn/modules/rnn.cpp. Let me know if this is desirable and I will add those changes.
2. I didn't expose projections in "lstm_cell" function and "_thnn_differentiable_lstm_cell_backward" functions from aten/src/ATen/native/RNN.cpp. As far as I understand, they are not needed for nn.LSTM CPU execution. For lstm_cell, projections don't bring any real benefit, since if cell is used separately, it can be easily added in Python. For "_thnn_differentiable_lstm_cell_backward", I'm actually not sure where exactly that function is used, so I also disabled projections there for now. Please let me know if I should change that.
3. I added check that projections are not supported for quantized LSTMs to quantized_lstm_<data/input> functions. But I didn't add any checks to LSTMCell code. It seems that since I disabled projections in "lstm_cell" function, they should also not be available for quantized models through any other API than quantized_lstm_<data/input>. Please let me know if I'm not correct and I will add checks to other places.
4. Projections are not supported for CuDNN versions < 7.1.2. Should I add the check for CuDNN version and disable projections in that case? If so, what will be the best way to do that?
5. Currently I added projection weight as the last weight, so the layout is "w_ih, w_hh, b_ih, b_hh, w_hr". This breaks the assumption that biases come after weights and thus I had to add additional if-s in various places. Alternative way would be to have "w_ih, w_hh, w_hr, b_ih, b_hh" layout, in which case the assumption will be true. But in that case I will need to split the loop in get_parameters function from aten/src/ATen/native/cudnn/RNN.cpp. And in some cases, I will still need to add an "undefined" tensor in the 3rd position, because we get all 5 weights from CuDNN most of the time. So I'm not sure which way is better. Let me know if you think I should change to the weights-then-biases layout.
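
For illustration, a minimal sketch of the resulting Python-level API (using the `proj_size` constructor argument on `nn.LSTM`; sizes are illustrative):

```
import torch

# proj_size shrinks the per-step output (and h) from hidden_size to
# proj_size, while the cell state c keeps hidden_size.
lstm = torch.nn.LSTM(input_size=10, hidden_size=20, proj_size=5, batch_first=True)
x = torch.randn(2, 7, 10)
out, (h, c) = lstm(x)
print(out.shape, h.shape, c.shape)
# torch.Size([2, 7, 5]) torch.Size([1, 2, 5]) torch.Size([1, 2, 20])
```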

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47725

Reviewed By: zou3519

Differential Revision: D25449794

Pulled By: ngimel

fbshipit-source-id: fe6ce59e481d1f5fd861a8ff7fa13d1affcedb0c
2020-12-16 11:27:02 -08:00
48d1ad1ada Reland "Add test for empty tensors for batch matmuls" (#48797)
Summary:
This reverts commit c7746adbc6e6ace9d4c2b54e32c8d36a7b7b0e31.

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48797

Reviewed By: mruberry

Differential Revision: D25575264

Pulled By: ngimel

fbshipit-source-id: c7f3b384db833d727bb5bd8a51f1493a13016d09
2020-12-16 11:19:27 -08:00
afce5890ff Revert D25421263: [pytorch][PR] [numpy] torch.{all/any} : output dtype is always bool
Test Plan: revert-hammer

Differential Revision:
D25421263 (c508e5b1bf)

Original commit changeset: c6c681ef9400

fbshipit-source-id: 4c0c9acf42b06a3ed0af8f757ea4512ca35b6c59
2020-12-16 11:11:13 -08:00
d7659be58d [caffe2][autograd] Avoid extensive -Wunused-variable warnings on _any_requires_grad (#49167)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49167

Building with clang and a fair warning level can result in hundreds of lines of compiler output of the form:
```
caffe2\gen_aten_libtorch\autograd\generated\VariableType_1.cpp(2279,8): warning: unused variable '_any_requires_grad' [-Wunused-variable]
   auto _any_requires_grad = compute_requires_grad( self );
        ^
caffe2\gen_aten_libtorch\autograd\generated\VariableType_1.cpp(2461,8): warning: unused variable '_any_requires_grad' [-Wunused-variable]
   auto _any_requires_grad = compute_requires_grad( grad_output, self );
        ^
caffe2\gen_aten_libtorch\autograd\generated\VariableType_1.cpp(2677,8): warning: unused variable '_any_requires_grad' [-Wunused-variable]
   auto _any_requires_grad = compute_requires_grad( self );
        ^
...
```
This happens when requires_derivative == False. Let's mark `_any_requires_grad` as potentially unused. If this were C++17 we would use `[[maybe_unused]]` but to retain compatibility with C++11 we just mark it with `(void)`.

Test Plan: CI + locally built

Reviewed By: ezyang

Differential Revision: D25421548

fbshipit-source-id: c56279a184b1c616e8717a19ee8fad60f36f37d1
2020-12-16 10:38:11 -08:00
45b33c83f1 Revert "Revert D24923679: Fixed einsum compatibility/performance issues (#46398)" (#49189)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49189

This reverts commit d307601365c3b848072b8b8381208aedc1a0aca5 and fixes the bug with diagonals and ellipsis combined.

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D25540722

Pulled By: heitorschueroff

fbshipit-source-id: 86d0c9a7dcfda600b546457dad102af2ff33e353
2020-12-16 10:38:07 -08:00
bbc71435b7 Add sinc operator (#48740)
Summary:
Implements the sinc operator.
See https://numpy.org/doc/stable/reference/generated/numpy.sinc.html

![image](https://user-images.githubusercontent.com/13428986/101653855-cdffa080-3a0d-11eb-8426-ecc81c152ebd.png)
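
A short sketch of the semantics (normalized sinc, as in `numpy.sinc`): sinc(x) = sin(pi*x) / (pi*x), with the removable singularity at x = 0 defined to be 1.

```
import torch

x = torch.tensor([0.0, 0.5, 1.0])
print(torch.sinc(x))  # approximately tensor([1.0000, 0.6366, 0.0000])
```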

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48740

Reviewed By: izdeby

Differential Revision: D25564477

Pulled By: soulitzer

fbshipit-source-id: 13f36a2b84dadfb4fd1442a2a40a3a3246cbaecb
2020-12-16 10:33:02 -08:00
09c741868c [c10d Store] Store Python Docs Fixes (#49130)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49130

The Python Store API docs had some typos where boolean values were
lower-case, which is incorrect Python syntax. This diff fixes those typos.

Test Plan: Built and Rendered Docs

Reviewed By: mrshenli

Differential Revision: D25411492

fbshipit-source-id: fdbf1e6b8f81e9589e638286946cad68eb7c9252
2020-12-16 10:29:09 -08:00
4b3f05a471 [Docs] Updating init_process_group docs to indicate correct rank range (#49131)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49131

Users frequently assume the correct range of ranks is 1 ...
`world_size`. This PR updates the docs to indicate that the correct rank range
users should specify is 0 ... `world_size` - 1.

Test Plan: Rendering and Building Docs

Reviewed By: mrshenli

Differential Revision: D25410532

fbshipit-source-id: fe0f17a4369b533dc98543204a38b8558e68497a
2020-12-16 10:26:04 -08:00
c52f1dc365 .circleci: downgrade conda-package-handling to 1.6.0 (#49434)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49434

There was a bug introduced in conda-package-handling >= 1.6.1 that makes archives
above a certain size fail when attempting to extract;
see: https://github.com/conda/conda-package-handling/issues/71

coincides with https://github.com/pytorch/builder/pull/611

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: xuzhao9, janeyx99, samestep

Differential Revision: D25573390

Pulled By: seemethere

fbshipit-source-id: 82173804f1b30da6e4b401c4949e2ee52065e149
2020-12-16 10:17:47 -08:00
f2ee8c6241 Instantiate PackedConvWeight to avoid linking error (#49442)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49442

When moving ATen/native to the app level, symbols from native/quantized may sit in a target away from some of their call sites. As a result, there are linking errors about missing symbols for instantiations of PackedConvWeight::prepack. The solution is to instantiate PackedConvWeight in the same compilation unit. It's similar to D24941989 (fe6bb2d287).
ghstack-source-id: 118676374

Test Plan: CI

Reviewed By: dhruvbird

Differential Revision: D25576703

fbshipit-source-id: d6e3d11d51d8172ab8487ce44ec8c042889f0f11
2020-12-16 10:09:09 -08:00
86902f84bf CUDA BFloat embedding (#44848)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44848

Reviewed By: izdeby

Differential Revision: D25574204

Pulled By: ngimel

fbshipit-source-id: b35f7253a6ad2b83f7b6b06862a5ab77295373e0
2020-12-16 09:24:46 -08:00
001ff3acf6 webdataset prototype - LoadFilesFromDiskIterableDataset (#48955)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48955

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D25541393

Pulled By: glaringlee

fbshipit-source-id: dea6ad64a7ba40abe45612d99f078b14d1da8bbf
2020-12-16 08:39:17 -08:00
6786b2b966 webdataset prototype - ListDirFilesIterableDataset (#48944)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48944

This is a stacked PR series for the webdataset prototype; I am trying to make each PR in the stack a separate dataset.
To make the implementation simple, each dataset will only support the basic functionality.

- [x] ListDirFilesDataset
- [x] LoadFilesFromDiskIterableDataset
- [x] ReadFilesFromTarIterableDataset
- [x] ReadFilesFromZipIterableDataset
- [x] RoutedDecoderIterableDataset

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D25541277

Pulled By: glaringlee

fbshipit-source-id: 9e738f6973493f6be1d5cc1feb7a91513fa5807c
2020-12-16 08:34:20 -08:00
efc090652e Enhanced generators with grad-mode decorators (#49017)
Summary:
This PR addresses the feature request outlined in https://github.com/pytorch/pytorch/issues/48713 for two-way communication with enhanced generators from [pep-342](https://www.python.org/dev/peps/pep-0342/).

Briefly, the logic of the patch resembles `yield from` [pep-380](https://www.python.org/dev/peps/pep-0380/), which cannot be used, since the generator **must be interacted with from within the grad-mode context**, while yields from the decorator **must take place outside of the context**. Hence any interaction with the wrapped generator, be it via [.send](https://docs.python.org/3/reference/expressions.html?highlight=throw#generator.send), [.throw](https://docs.python.org/3/reference/expressions.html?highlight=throw#generator.throw), and even [.close](https://docs.python.org/3/reference/expressions.html?highlight=throw#generator.close) must be wrapped by a `with` clause. The patch is compatible with `for i in gen: pass` and `next(gen)` use cases and allows two-way communication with the generator via `.send <-> yield` points.

### Logic
At lines [L37-L38](2d40296c0c/torch/autograd/grad_mode.py (L37-L38)) we (the decorator) **start the wrapped generator** (coroutine) by issuing `None` into it (equivalently, we can use `next(get)` here). Then we **dispatch responses of the generator** to our ultimate caller and **relay the latter's requests** into the generator in the loop on lines [L39-L52](2d40296c0c/torch/autograd/grad_mode.py (L39-L52)).

We yield the most recent response on [L40-L41](2d40296c0c/torch/autograd/grad_mode.py (L40-L41)), at which point we become **paused**, waiting for the next ultimate caller's interaction with us. If the caller **sends us a request**, then we become unpaused and move to [L51-L52](2d40296c0c/torch/autograd/grad_mode.py (L51-L52)) and **forward it into the generator**, at which point we pause, waiting for its response. The response might be a value, an exception or a `StopIteration`. In the case of an exception from the generator, we let it **bubble up** from the immediately surrounding [except clause](https://docs.python.org/3/reference/compound_stmts.html#the-try-statement)  to the ultimate caller through the [outer try-except](2dc287bba8/torch/autograd/grad_mode.py (L36-L54)). In the case of a `StopIteration`, we **take it's payload and propagate it** to the caller via [return](2d40296c0c/torch/autograd/grad_mode.py (L54)). In the case of a value, the flow and the loop continues.

The caller **throwing an exception at us** is handled much like a proper request, except for the exception playing the role of the request. In this case we **forward it into the generator** on lines [L47-L49](2d40296c0c/torch/autograd/grad_mode.py (L47-L49)) and await its response. We explicitly **advance** the traceback one frame up, in order to indicate the **source of the exception within the generator**.

Finally the `GeneratorExit` is handled on lines [L42-L45](2d40296c0c/torch/autograd/grad_mode.py (L42-L45)) and closes the generator.

Updates: clarified exception propagation
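
A usage sketch of the behavior this enables (two-way `.send` communication with a grad-mode-decorated generator):

```
import torch

@torch.no_grad()
def accumulate():
    total = torch.zeros(1)
    while True:
        x = yield total    # the caller .send()s tensors in
        total = total + x  # executed with grad disabled

gen = accumulate()
next(gen)  # prime the generator: runs up to the first yield
out = gen.send(torch.ones(1, requires_grad=True))
print(out.requires_grad)  # False: the generator body ran under no_grad
```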

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49017

Reviewed By: izdeby

Differential Revision: D25567796

Pulled By: albanD

fbshipit-source-id: 801577cccfcb2b5e13a08e77faf407881343b7b0
2020-12-16 07:15:33 -08:00
76d09ec33e [PyTorch] Avoid move-constructing a List in listConstruct (#49355)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49355

List's move ctor is a little bit more expensive than you might expect, but we can easily avoid it.
ghstack-source-id: 118624596

Test Plan: Roughly 1% improvement on internal benchmark.

Reviewed By: hlu1

Differential Revision: D25542190

fbshipit-source-id: 08532642c7d1f1604e16c8ebefd1ed3e56f7c919
2020-12-16 07:07:12 -08:00
ec8e9d31cf Making ops c10-full: optional lists (#49088)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49088

We had special case logic to support `int[]?` and `double[]?` but nothing for `DimnameList[]?`.
This PR generalizes the logic to support optional lists so it should now work with all types.
It also enables c10-fullness for ops that were blocked by this.

Note that using these arguments in a signature was always and still is expensive because the whole list needs to be copied.
We should probably consider alternatives in the future like for example using `torch::List` instead of `ArrayRef`, that could work without copying the list.
ghstack-source-id: 118660071

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D25423901

fbshipit-source-id: dec58dc29f3bb4cbd89e2b95c42da204a9da2e0a
2020-12-16 02:55:11 -08:00
d69d42db78 Making ops c10 full: optional out arguments (#49083)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49083

We have some (but very few) ops that take optional out arguments `Tensor(a!)? out`.
This PR makes them non-optional mandatory arguments and enables c10-fullness for them.
There is only a very small number of ops affected by this.

Putting this up for discussion.

Alternatives considered:
If we keep them optional, we run into lots of issues in the dispatcher. We have to decide what the dispatcher calling convention for this argument type should be.
1) If we keep passing them in as `Tensor&` arguments and return them as `tuple<Tensor&, Tensor&, Tensor&>`, so basically same as currently, then the schema inference check will say "Your kernel function got inferred to have a `Tensor` argument but your native_functions.yaml declaration says `Tensor?`. This is a mismatch, you made an error". We could potentially disable that check, but that would open the door for real mistakes to not be reported anymore in the future. This sounds bad.
2) If we change them to a type that schema inference could differentiate from `Tensor`, say we pass them in as `const optional<Tensor>&` and return them as `tuple<const optional<Tensor>&, const optional<Tensor>&, const optional<Tensor>&>`, then our boxing logic fails because it can't recognize those as out overloads anymore and shortcut the return value as it is doing right now. We might be able to rewrite the boxing logic, but that could be difficult and could easily develop into a rabbit hole of having to clean up `Tensor&` references throughout the system where we use them.

Furthermore, having optional out arguments in C++ doesn't really make sense. the C++ API puts them to the front of the argument list, so you can't omit them anyways when calling an op.
You would be able to omit them when calling from Python with out kwargs, but not sure if we want that discrepancy between the c++ and python API.
ghstack-source-id: 118660075

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D25422197

fbshipit-source-id: 3cb25c5a3d93f9eb960d70ca014bae485be9f058
2020-12-16 02:53:42 -08:00
306bab220e Revert D25554109: [StaticRuntime][ATen] Add out variant for narrow_copy
Test Plan: revert-hammer

Differential Revision:
D25554109 (ed04b71651)

Original commit changeset: 6bae62e6ce34

fbshipit-source-id: bfa038e150166d0116bcae8f7a6415d98d4146de
2020-12-16 02:44:45 -08:00
ed04b71651 [StaticRuntime][ATen] Add out variant for narrow_copy (#49449)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49449

Similar to permute_out, add the out variant of `aten::narrow` (slice in c2), which does an actual copy. `aten::narrow` creates a view; however, a copy is incurred when we call `input.contiguous` in the ops that follow `aten::narrow`, in `concat_add_mul_replacenan_clip`, `casted_batch_one_hot_lengths`, and `batch_box_cox`.
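
A hedged Python-level sketch of the behavior being optimized away (the out variant itself lives in C++):

```
import torch

x = torch.arange(12.).reshape(3, 4)
v = x.narrow(1, 0, 2)     # a view: no data is copied here
print(v.is_contiguous())  # False
y = v.contiguous()        # the copy happens here, outside the MemoryPlanner
```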

Test Plan:
Unit test:

```
buck test //caffe2/aten:native_test
```
Benchmark with the adindexer model:
```
bs = 1 is neutral

Before:
I1214 21:32:51.919239 3285258 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0886948. Iters per second: 11274.6
After:
I1214 21:32:52.492352 3285277 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0888019. Iters per second: 11261

bs = 20 shows more gains probably because the tensors are bigger and therefore the cost of copying is higher

Before:
I1214 21:20:19.702445 3227229 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.527563. Iters per second: 1895.51
After:
I1214 21:20:20.370173 3227307 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.508734. Iters per second: 1965.67
```

Reviewed By: bwasti

Differential Revision: D25554109

fbshipit-source-id: 6bae62e6ce3456ff71559b635cc012fdcd1fdd0e
2020-12-16 01:47:46 -08:00
40d7c1091f Unescape string in RPC error message (#49373)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49373

Unescaping the string in the RPC error message to provide a better error message.

Test Plan: CI

Reviewed By: xush6528

Differential Revision: D25511730

fbshipit-source-id: 054f46d5ffbcb1350012362a023fafb1fe57fca1
2020-12-16 01:40:31 -08:00
a9137aeb06 quantized tensor: add preliminary support for advanced indexing, try 2 (#49346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49346

This is a less ambitious redo of
https://github.com/pytorch/pytorch/pull/49129/.

We make the

```
xq_slice = xq[:, [0], :, :]
```

indexing syntax work if `xq` is a quantized Tensor. For now, we are
making the code not crash, with an inefficient `dq -> index -> q`
implementation. A future PR can optimize performance by removing
the unnecessary memory copies (which will require some non-trivial
changes to TensorIterator).
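
A minimal sketch of the newly working pattern (scale/zero_point values are illustrative):

```
import torch

x = torch.randn(1, 3, 4, 4)
xq = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)
xq_slice = xq[:, [0], :, :]  # previously crashed; now runs via dq -> index -> q
print(xq_slice.shape)        # torch.Size([1, 1, 4, 4])
```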

Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_advanced_indexing
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25539365

fbshipit-source-id: 98485875aaaf5743e1a940e170258057691be4fa
2020-12-16 01:28:38 -08:00
8954eb3f72 [StaticRuntime] Fusion pass for ClipRanges/GatherRanges/LengthsToOffsets (#49113)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49113

Reviewed By: ajyu

Differential Revision: D25388512

fbshipit-source-id: 3daa5b9387a3a10b6c220688df06540c4d844aea
2020-12-16 00:34:49 -08:00
94e328c038 fix optimizer.pyi typo 'statue'->'state' (#49388)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49388

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D25553672

Pulled By: glaringlee

fbshipit-source-id: e9f2233bd678a90768844af2d8d5e2994d59e304
2020-12-15 23:41:56 -08:00
cbeb4c25e5 [StaticRuntime] Permute_out (#49447)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49447

Adding an out variant for `permute`. It's better than fixing the copy inside contiguous because 1) we can leverage the c2 math library, 2) contiguous creates a tensor inside the function which isn't managed by the MemoryPlanner in StaticRuntime
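
For context, a small sketch (not from this PR) of why `contiguous` implies an unmanaged allocation while `permute` alone does not:

```python
import torch

x = torch.randn(2, 3, 4)
p = x.permute(2, 0, 1)               # view: only re-strides, no data copied
print(p.is_contiguous())             # False
c = p.contiguous()                   # allocates a fresh buffer and copies
print(c.data_ptr() == x.data_ptr())  # False: that buffer is created inside
                                     # the op, invisible to the MemoryPlanner
```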

Test Plan:
Benchmark:
```
After:
I1214 12:35:32.218775 991920 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0902339. Iters per second: 11082.3

Before:
I1214 12:35:43.368770 992620 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0961521. Iters per second: 10400.2
```

Reviewed By: yinghai

Differential Revision: D25541666

fbshipit-source-id: 013ed0d4080cd01de4d3e1b031ab51e5032e6651
2020-12-15 23:09:31 -08:00
acd72e79a3 update breathe (#49407)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47462, but not completely.

Update breathe to the latest version to get fixes for the "Unable to resolve..." issues. There are still some build errors, but far fewer than before.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49407

Reviewed By: izdeby

Differential Revision: D25562163

Pulled By: glaringlee

fbshipit-source-id: 91bfd9e9ac70723816309f489022d72853f5fdc5
2020-12-15 21:47:07 -08:00
58551e52f0 [CMake] Use libtorch_cuda list defined in bzl file (#49429)
Summary:
Since NCCL is an optional CUDA dependency, remove nccl.cpp from the core filelist

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49429

Reviewed By: nikithamalgifb

Differential Revision: D25569883

Pulled By: malfet

fbshipit-source-id: 61371a4c6b0438e4e0a7f094975b9a9f9ffa4032
2020-12-15 20:51:16 -08:00
22c6dafd33 [PyTorch] Use plain old function pointer for RecordFunctionCallback (reapply) (#49408)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49408

Nearly every non-test callsite doesn't need to capture any variables anyway, and this saves 48 bytes per callback.
ghstack-source-id: 118665808

Test Plan:
Wait for GitHub CI since we had C++14-specific issues with
this one in previous PR https://github.com/pytorch/pytorch/pull/48629

Reviewed By: malfet

Differential Revision: D25563207

fbshipit-source-id: 6a2831205917d465f8248ca37429ba2428d5626d
2020-12-15 19:16:01 -08:00
e9d7d37ad0 [FX] Rename Node._uses and refactor Node.all_input_nodes (#49415)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49415

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D25565341

Pulled By: jamesr66a

fbshipit-source-id: 2290ab62572632788809ba16319578bf0c0260ee
2020-12-15 17:13:57 -08:00
46debe7f23 [DPER] Introduce barrier operation to force synchronization of threads in async execution (#49322)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49322

In some cases async execution might lose dependencies (alias-like ops) or produce suboptimal scheduling when there is a choice of which parts to schedule first. An example of the latter behavior can happen in ModelParallel training, where a copy can get lower priority compared to the rest of the execution on the given GPU, which will cause other GPUs to starve.

This operator allows us to address these issues by introducing extra explicit dependencies between ops.

Test Plan:
Unit test.
E2E testing in future diffs.

Reviewed By: xianjiec

Differential Revision: D24933471

fbshipit-source-id: 1668994c7856d73926cde022378a99e1e8db3567
2020-12-15 16:13:42 -08:00
7518f54611 Add flag torch_jit_disable_warning_prints to allow disabling all warnings.warn (#49313)
Summary:
Adding a flag torch_jit_disable_warning_prints to optimize interpreter performance by suppressing a (potentially large) number of warnings.warn calls.

This is to work around TorchScript's warning behavior mismatch with Python. Python by default triggers a warning once per location, but TorchScript doesn't support that. This causes the same warning to trigger and print once per inference run, hurting performance.
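
For reference, a plain-Python illustration of the "once per location" behavior that TorchScript lacks:

```python
import warnings

def f():
    warnings.warn("deprecated")

f()  # prints the warning
f()  # silent: Python's "default" filter warns once per source location
```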

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49313

Reviewed By: SplitInfinity

Differential Revision: D25534274

Pulled By: gmagogsfm

fbshipit-source-id: eaeb57a335c3e6c7eb259671645db05d781e80a2
2020-12-15 15:22:41 -08:00
aff0b68a58 Fix include files for out-of-tree compilation (#48827)
Summary:
Signed-off-by: caozhong <zhong.z.cao@intel.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48827

Reviewed By: agolynski

Differential Revision: D25375988

Pulled By: ailzhang

fbshipit-source-id: a8d5ab4572d991d6d96dfe758011517651ff0a6b
2020-12-15 14:40:44 -08:00
16f4b0ed6b Replace THError() check in THCTensorMathReduce.cu with C10_CUDA_KERNEL_LAUNCH_CHECK() (#49424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49424

As per conversation in this [comment](https://www.internalfb.com/intern/diff/D25541113 (e2510a0b60)/?dest_fbid=393026838623691&transaction_id=3818008671564312) on D25541113 (e2510a0b60), although THError does more than just log errors associated with CUDA kernel launches, we're going to go ahead and replace it with C10_CUDA_KERNEL_LAUNCH_CHECK, so as to be consistent throughout the code base.
Standardization FTW.

This commit is purposefully sent in as a single file change so it can be easily reverted if it introduces a regression.

Test Plan:
Checked that the code still builds with
```
buck build //caffe2/aten:ATen-cu
```
Also ran basic aten tests
```
buck test //caffe2/aten:atest
```

Reviewed By: r-barnes

Differential Revision: D25567863

fbshipit-source-id: 1093bfe2b6ca6b9a3bfb79dcdc5d713f6025eb77
2020-12-15 14:08:09 -08:00
c508e5b1bf [numpy] torch.{all/any} : output dtype is always bool (#47878)
Summary:
BC-breaking note:

This PR changes the behavior of the any and all functions to always return a bool tensor. Previously these functions were only defined on bool and uint8 tensors, and when called on uint8 tensors they would also return a uint8 tensor. (When called on a bool tensor they would return a bool tensor.)
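
A small before/after sketch of the new behavior (assuming a build containing this change):

```python
import torch

u = torch.tensor([0, 2], dtype=torch.uint8)
print(torch.any(u))  # tensor(True)  -- previously tensor(1, dtype=torch.uint8)
print(torch.all(u))  # tensor(False) -- previously tensor(0, dtype=torch.uint8)
```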

PR summary:

https://github.com/pytorch/pytorch/pull/44790#issuecomment-725596687

Fixes 2 and 3

Also Fixes https://github.com/pytorch/pytorch/issues/48352

Changes
* Output dtype is always `bool` (consistent with numpy). **BC-breaking** (previously used to match the input dtype)
* Uses vectorized version for all dtypes on CPU
* Enables test for complex
* Update doc for `torch.all` and `torch.any`

TODO
* [x] Update docs
* [x] Benchmark
* [x] Raise issue on XLA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47878

Reviewed By: H-Huang

Differential Revision: D25421263

Pulled By: mruberry

fbshipit-source-id: c6c681ef94004d2bcc787be61a72aa059b333e69
2020-12-15 13:59:32 -08:00
38a59a67f3 [JIT] Support multiple outputs in subgraph matcher. (#48992)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48992

Differential Revision: D25388100

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Pulled By: ZolotukhinM

fbshipit-source-id: d95713af2220cf4f99ac92f59f8e5b902f2f3822
2020-12-15 13:09:24 -08:00
3ffe9e0f43 [static runtime] refine fusion group (#49340)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49340

This refines the fusion group to include only certain types of operations. We cannot safely handle "canRunNatively" types, and the memonger pass causes regressions on some internal models, so it was disabled (to be revisited with proper memory optimization once Tensor pools are implemented)

Test Plan:
```
buck test mode/no-gpu caffe2/test:static_runtime
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```

Reviewed By: ZolotukhinM

Differential Revision: D25520105

fbshipit-source-id: add61d103e4f8b4615f5402e760893ef759a60a9
2020-12-15 12:57:35 -08:00
f4e15c4a23 [te] Fix bugs with shift operators (#49396)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49396

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49271

Two things:

1. These throw exceptions in their constructor, which causes a segfault (*), so
   move the exceptions to ::make.
2. They technically support FP types but the rules are complicated so let's not
   bother.

(*) The reason for the segfault: all Exprs including these inherit from
KernelScopedObject, whose constructor adds the object to a list for destruction
at the end of the containing KernelArena's lifetime.  But if the derived-class
constructor throws, the object is deleted even though it's still in the
KernelArena's list.  So when the KernelArena is itself deleted, it double-frees
the pointer and dies.  I've also fixed And, Or, and Xor in this diff.
ghstack-source-id: 118594998

Test Plan: `buck test //caffe2/test:jit`

Reviewed By: bwasti

Differential Revision: D25512052

fbshipit-source-id: 42670b3be0cc1600dc5cda6811f7f270a2c88bba
2020-12-15 12:44:59 -08:00
5912316cf7 Making ops c10-full: Generator arguments (#49013)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49013

I don't know why this works. I know, this is never a good way to start a PR description :P
I know that Generator is a dispatch relevant argument when called from an unboxed API and is ignored
for dispatch purposes when called from a boxed API. This should break something, but maybe we don't
have test cases for that.

We likely need to align the unboxed and boxed dispatch behavior before landing this.
The best solution would be to make Generator not dispatch relevant in unboxing. But that might be a bigger change.
An acceptable solution could be to make Generator dispatch relevant in boxing, but that needs perf measurements.

This PR needs further discussion.
ghstack-source-id: 118619230

(Note: this ignores all push blocking failures!)

Test Plan: waitforsandcastle

Reviewed By: bhosmer

Differential Revision: D25394998

fbshipit-source-id: f695c659ee6e3738f74cdf0af1a514ac0c30ebff
2020-12-15 11:21:43 -08:00
a6274c1278 Making ops c10 full: out overloads with default arguments (#49012)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49012

For some reason we apply default arguments to the functions in at::native too. So when an out overload had default arguments,
we couldn't move the out argument to the end because of those default arguments preceding it.
This PR fixes that and makes out overloads with default arguments c10-full
ghstack-source-id: 118619222

(Note: this ignores all push blocking failures!)

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D25394605

fbshipit-source-id: 2ed1c3ce0d04a548e3141df2dca517756428fe15
2020-12-15 11:21:40 -08:00
b47fa5e88b Making ops c10-full: Dimname arguments (#49008)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49008

ghstack-source-id: 118619229

(Note: this ignores all push blocking failures!)

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D25392590

fbshipit-source-id: 9a4c8917aaa254fac42f33973409f5497f878df2
2020-12-15 11:21:37 -08:00
c5f90a25c0 Making ops c10-full: ops blocked by manual registrations (#49007)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49007

Some ops had manual registrations, e.g. in VmapModeRegistrations, and those manual registrations had to be changed too when making the op c10-full.
This PR makes those ops c10-full and fixes the manual registrations.
ghstack-source-id: 118619231

(Note: this ignores all push blocking failures!)

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D25392591

fbshipit-source-id: f4124c0547594879646cb1778357f857ea951132
2020-12-15 11:21:33 -08:00
e391dbc1b5 Making ops c10 full: ops returning multiple out arguments (#49006)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49006

There was an issue in the unboxing logic with ops returning multiple out arguments.
This PR fixes that and makes those ops c10 full.

Additionally, it makes some ops c10 full that slipped through the cracks before.
ghstack-source-id: 118619224

(Note: this ignores all push blocking failures!)

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D25392592

fbshipit-source-id: 6947304f34c5658fc12dc6608a21aff7bc4491e2
2020-12-15 11:21:30 -08:00
40a02e2ded Make out ops c10-full (with hacky-wrapper) (#48912)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48912

ghstack-source-id: 118619234

(Note: this ignores all push blocking failures!)

Test Plan:
Benchmark:
 ---
Old (i.e. codegenerated unboxing wrapper + no hacky_wrapper):
```
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f64d03ebcd0>
torch.absolute(t, out=o)
setup:
  t = torch.empty([1])
  o = torch.empty([1])

                           All          Noisy symbols removed
    Instructions:       657204                     634396
    Baseline:             4192                       3786
100 runs per measurement, 1 thread
```

New (i.e. templated unboxing wrapper + hacky_wrapper):
```
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7fa7de211cd0>
torch.absolute(t, out=o)
setup:
  t = torch.empty([1])
  o = torch.empty([1])

                           All          Noisy symbols removed
    Instructions:       658160                     633996
    Baseline:             4210                       3786
100 runs per measurement, 1 thread
```

Reviewed By: bhosmer

Differential Revision: D25363335

fbshipit-source-id: ab9c122491e4209a49254dad0f7b3adb677b2c53
2020-12-15 11:16:00 -08:00
11334280bf Suppress warning: calling a constexpr __host__ function from a __host__ __device__ function is not allowed warning (#49197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49197

Compiling currently gives a number of these warnings:
```
caffe2/c10/util/TypeCast.h(27): warning: calling a constexpr __host__ function from a __host__ __device__ function is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.
          detected during:
            instantiation of "decltype(auto) c10::maybe_real<true, src_t>::apply(src_t) [with src_t=c10::complex<double>]"
(57): here
            instantiation of "uint8_t c10::static_cast_with_inter_type<uint8_t, src_t>::apply(src_t) [with src_t=c10::complex<double>]"
(157): here
            instantiation of "To c10::convert<To,From>(From) [with To=uint8_t, From=c10::complex<double>]"
(169): here
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=uint8_t, From=c10::complex<double>]"
caffe2/c10/co
```
Here we fix this by adding `C10_HOST_DEVICE` to the offending function.

Test Plan:
Compiling
```
buck build mode/dev-nosan -c=python.package_style=inplace dper3/dper3_models/experimental/pytorch/ads:ads_model_generation_script
```
shows this warning.

We rely on sandcastle for testing here.

Reviewed By: xw285cornell

Differential Revision: D25440771

fbshipit-source-id: 876c412eb06e8837978061cc4793abda42fac821
2020-12-15 10:49:07 -08:00
778006918c [WIP][FX] Add FX page to docs (#48814)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48814

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D25320051

Pulled By: jamesr66a

fbshipit-source-id: b1fdec9615a7a4eb97c557bb3cba7f90b0a4d933
2020-12-15 09:48:29 -08:00
9908b93dcf fix test_dispatch tests to error on duplicate def (#49254)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49254

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25505170

Pulled By: bdhirsh

fbshipit-source-id: 6796f4ce022c3141934ee69c7caaa08e663adf39
2020-12-15 08:27:52 -08:00
8ae9b46e20 Revert D25494735: Update TensorPipe submodule
Test Plan: revert-hammer

Differential Revision:
D25494735 (5a5e576ab9)

Original commit changeset: 3d6f326ca49d

fbshipit-source-id: 369a4519b5b2fec19a7a5faf324b9467177e27f6
2020-12-15 08:11:56 -08:00
9234f5026d Make WorkNCCL use CUDAEvent::query() rather than re-implement it (#49343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49343

at::cuda::CUDAEvent is "lazy" and only creates an event when it's first recorded. Until then, at::cuda::CUDAEvent is empty. If we use at::cuda::CUDAEvent::query() this is taken into account (an empty event is always ready), but WorkNCCL extracts the raw cudaEvent_t value from at::cuda::CUDAEvent and calls cudaEventQuery manually and doesn't check this. This could cause a failure.

It's unclear if this is ever supposed to happen, but we're seeing that failure, and we want to sort it out in order to see if there's something "deeper" going on.
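
A Python-level sketch of the lazy-event semantics in question (assumes a CUDA-enabled build; `torch.cuda.Event` mirrors the laziness of `at::cuda::CUDAEvent`):

```python
import torch

e = torch.cuda.Event()    # lazy: no cudaEvent_t has been created yet
print(e.query())          # True: an empty, never-recorded event reads as ready
e.record()                # the underlying event is created on first record
torch.cuda.synchronize()
print(e.query())          # True once the stream has passed the record point
```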
ghstack-source-id: 118532806

Test Plan: Unit tests

Reviewed By: SciPioneer

Differential Revision: D25537844

fbshipit-source-id: 506319f4742e1c0a02aa75ecc01112ea3be42d8f
2020-12-15 03:15:48 -08:00
5a5e576ab9 Update TensorPipe submodule (#49232)
Summary:
Credit to beauby for the Bazel fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49232

Test Plan: Export and run on CI

Reviewed By: beauby

Differential Revision: D25494735

fbshipit-source-id: 3d6f326ca49dcd28d0d19cb561818c3c2904cb55
2020-12-15 00:47:39 -08:00
98726119d9 Do not return uninitialized qscheme from getQSchemeAndQParamVector (#49391)
Summary:
Assign it by default to `kPerTensorAffine`

Fixes regressions accidentally discovered by https://app.circleci.com/pipelines/github/pytorch/pytorch/250370/workflows/6f38ae43-a9a5-43f3-8c1f-0f911df69d75/jobs/9589799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49391

Reviewed By: ngimel

Differential Revision: D25554180

Pulled By: malfet

fbshipit-source-id: f42a45e9d6743c665c62d057197d009f1542226e
2020-12-15 00:04:38 -08:00
39a10fb652 Fix check_kernel_launches.py for macros and provide extended context (#49365)
Summary:
`check_kernel_launches.py` currently gives a false positive in instances such as:
```
735:     <<<smallIndexGrid, smallIndexBlock, 0, stream>>>(                                   \
736:       outInfo, selfInfo, indicesInfo,                                                   \
737:       outSelectDim, selfSelectDim, static_cast<TYPE>(sliceSize),                        \
738:       selfSelectDimSize);                                                               \
739:     C10_CUDA_KERNEL_LAUNCH_CHECK();
```
because the newlines after the last `\` are not consumed by the regex. This fixes that.

In addition, the regex is modified to provide greater context for the start of the kernel launch. This changes the context from:
```
157:       (
158:           size, X_strides, Y_dims, X, Y);
```
to
```
157:       <<<M, CAFFE_CUDA_NUM_THREADS, 0, context->cuda_stream()>>>(
158:           size, X_strides, Y_dims, X, Y);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49365

Test Plan:
```
buck test //caffe2/test:kernel_launch_checks -- --print-passing-details
```

Reviewed By: aakshintala

Differential Revision: D25545402

Pulled By: r-barnes

fbshipit-source-id: 76feac6a002187239853752b892f4517722a77bf
2020-12-14 22:09:33 -08:00
25bc906281 Revert D25135415: [PyTorch] Use plain old function pointer for RecordFunctionCallback
Test Plan: revert-hammer

Differential Revision:
D25135415 (7e23ee1598)

Original commit changeset: 5e92dc79da64

fbshipit-source-id: 45b1634a100084c84dca158a1f16ca760fef6988
2020-12-14 21:04:27 -08:00
a419a3e25d Add assertion on any NaN error on the error feedback (#49374)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49374

After the assertion is added, the NaN error on certain trainings disappears.

It seems that the real error is caused by the underlying illegal memory access. This is a temporary workaround.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118572471

Test Plan:
Real run on Ads 10X model: scripts/wayi/mast_prof_gradient_compression.sh POWER_SGD 8

To reproduce the error, just comment out the assertion.

Reviewed By: rohan-varma

Differential Revision: D25548299

fbshipit-source-id: 039af7d94a27e0f47ef647c6163fd0e5064951d5
2020-12-14 20:15:39 -08:00
7e23ee1598 [PyTorch] Use plain old function pointer for RecordFunctionCallback (#48629)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48629

Nearly every non-test callsite doesn't need to capture any variables anyway, and this saves 48 bytes per callback.
ghstack-source-id: 118568240

Test Plan: CI

Reviewed By: dhruvbird

Differential Revision: D25135415

fbshipit-source-id: 5e92dc79da6473ed15d1e381a21ed315879168f3
2020-12-14 20:08:16 -08:00
900aa4ee97 [PyTorch] remove convenience RecordFunctionCallback interface (#48620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48620

In preparation for storing bare function pointer (8 bytes)
instead of std::function (32 bytes).
ghstack-source-id: 118568242

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D25132183

fbshipit-source-id: 3790cfb5d98479a46cf665b14eb0041a872c13da
2020-12-14 20:03:15 -08:00
bbeee481c3 Fix typo in torch.load docstring for the f parameter (#49350)
Summary:
No issue opened for this (that I can see) and it was a fairly small change, so just opening this PR directly!

The docstring for `torch.load` had some parameter descriptions with typos like ``:meth`readline` `` instead of ``:meth:`readline` ``. This PR corrects that :)

<img width="811" alt="image" src="https://user-images.githubusercontent.com/30357972/102128240-7fa33500-3e45-11eb-8f54-ce5ca7bba96c.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49350

Reviewed By: glaringlee

Differential Revision: D25543041

Pulled By: mrshenli

fbshipit-source-id: 10db04d58dd5b07777bdd51d3fcb3c45dea4c84b
2020-12-14 19:16:01 -08:00
626b8c0cf2 [te] Ban uint8 tensors from fusion groups (#49247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49247

uint8's expose all kinds of corner cases in type promotion.  As an example, consider:
```
>>> torch.tensor([1], dtype=torch.uint8).lt(-1)
tensor([True])
>>> torch.tensor([1], dtype=torch.uint8).lt(torch.tensor(-1))
tensor([True])
>>> torch.tensor([1], dtype=torch.uint8).lt(torch.tensor([-1]))
tensor([False])
```
the difference is how promotions involving scalars (or 0-dim tensors, which are treated like scalars) are prioritized compared to tensor dtypes.
Per eellison, the order is something like:
1. Tensor FP types
2. Scalar FP types
3. Tensor Int types
4. Scalar Int types

The logic for this is here: c73e97033a/aten/src/ATen/native/TypeProperties.cpp (L93)
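
A quick demonstration of that ordering via `torch.result_type` (not part of this PR):

```python
import torch

t = torch.tensor([1], dtype=torch.uint8)
print(torch.result_type(t, -1))                  # torch.uint8: scalar int loses
print(torch.result_type(t, torch.tensor(-1)))    # torch.uint8: 0-dim ~ scalar
print(torch.result_type(t, torch.tensor([-1])))  # torch.int64: tensor int wins
```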

AFAICT the effects are mainly visible for the unsigned byte type (the only unsigned type, besides bool) since the others degrade more or less gracefully.

It's hard to re-use this logic as is in TensorIterator/TypeProperties, and it's complicated enough that it's not worth re-implementing in TE unless there's evidence that it matters for real models.
ghstack-source-id: 118555597

Test Plan: `buck test //caffe2/test:jit`

Reviewed By: eellison

Differential Revision: D25489035

fbshipit-source-id: db3ab84286d472fd8a247aeb7b36c441293aad85
2020-12-14 17:40:15 -08:00
50b361a821 Enable BF16 for indexing on CUDA (#48801)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48801

Reviewed By: glaringlee

Differential Revision: D25542914

Pulled By: ngimel

fbshipit-source-id: 4113eb2729d15b40a89268172cc37122b5213624
2020-12-14 17:24:31 -08:00
23e98e73f6 Fix Windows CUDA-11.1 test jobs (#49376)
Summary:
Fixes typo introduced by  https://github.com/pytorch/pytorch/pull/49156

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49376

Reviewed By: seemethere

Differential Revision: D25548524

Pulled By: malfet

fbshipit-source-id: 6aa3d903f6105c576c009f05a6b9d29f32b35c47
2020-12-14 17:12:20 -08:00
e2510a0b60 Add Kernel Launch Checks to files under caffe2/aten/THC (#49358)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49358

Added the header file (`c10/cuda/CUDAException.h`), where `C10_CUDA_KERNEL_LAUNCH_CHECK` is defined, to files under `caffe2/aten/THC` as needed, then added `C10_CUDA_KERNEL_LAUNCH_CHECK()` calls after each kernel launch. In some cases, removed extraneous ErrorChecks.

Test Plan:
Checked that the code still builds with
```
buck build //caffe2/aten:ATen-cu
```

Also ran basic aten tests
```
buck test //caffe2/aten:atest
```

Reviewed By: r-barnes

Differential Revision: D25541113

fbshipit-source-id: df1a50e14d291a86b24ca1746ac27fa586f9757c
2020-12-14 16:21:50 -08:00
cb3169d7a8 [aten] index_select dim 1 (#47077)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47077

Add benchmarks for pt index_select, batch_index_select, and c2's BatchGather
Add batch_index_select implementation based on the C2 BatchGather implementation

This currently falls back to index_select for backwards and cuda implementations.

Alternatively, we can look into the specifics of why index_select is slower and
replace the original implementation instead.

Test Plan:
./buck-out/opt/gen/caffe2/benchmarks/operator_benchmark/c2/batch_gather_test.par
./buck-out/opt/gen/caffe2/benchmarks/operator_benchmark/pt/index_select_test.par

PT results comparing without fix, block_size 1 only, and all dim=1
```
# no optimization
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K1_dim1_cpu
# Input: M: 256, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 353.450

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K1_dim1_cpu
# Input: M: 512, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 862.492

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K2_dim1_cpu
# Input: M: 256, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 4555.344

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K2_dim1_cpu
# Input: M: 512, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 11003.279
```
```
# block size 1 only
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K1_dim1_cpu
# Input: M: 256, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 129.240

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K1_dim1_cpu
# Input: M: 512, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 266.776

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K2_dim1_cpu
# Input: M: 256, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 4508.593

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K2_dim1_cpu
# Input: M: 512, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 10391.655
```
```
# dim 1
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M8_N8_K1_dim1_cpu
# Input: M: 8, N: 8, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 3.736

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K1_dim1_cpu
# Input: M: 256, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 130.460

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K1_dim1_cpu
# Input: M: 512, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 267.706

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M8_N8_K2_dim1_cpu
# Input: M: 8, N: 8, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 4.187

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K2_dim1_cpu
# Input: M: 256, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 1739.550

# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K2_dim1_cpu
# Input: M: 512, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 3468.332
```
C2 results:

```# Benchmarking Caffe2: batch_gather
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1203 13:19:35.310904 782584 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: batch_gather_M8_N8_K1_devicecpu
# Input: M: 8, N: 8, K: 1, device: cpu
Forward Execution Time (us) : 0.308

# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M256_N512_K1_devicecpu
# Input: M: 256, N: 512, K: 1, device: cpu
Forward Execution Time (us) : 90.517

# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M512_N512_K1_devicecpu
# Input: M: 512, N: 512, K: 1, device: cpu
Forward Execution Time (us) : 200.009

# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M8_N8_K2_devicecpu
# Input: M: 8, N: 8, K: 2, device: cpu
Forward Execution Time (us) : 0.539

# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M256_N512_K2_devicecpu
# Input: M: 256, N: 512, K: 2, device: cpu
Forward Execution Time (us) : 1001.540

# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M512_N512_K2_devicecpu
# Input: M: 512, N: 512, K: 2, device: cpu
Forward Execution Time (us) : 2005.870
```

buck test dper3/dper3/modules/low_level_modules/tests:single_operators_test -- test_batch_gather

Reviewed By: hlu1

Differential Revision: D24630227

fbshipit-source-id: cd205a30d96a33d239f3266820ada9a90093cf91
2020-12-14 15:39:33 -08:00
220b91660f [pytorch] Expand PixelShuffle to support any number of batch dims (#49187)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49187

Expands the implementation of PixelShuffle to support any number of batch dimensions
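
For example (a sketch assuming a build containing this change):

```python
import torch

ps = torch.nn.PixelShuffle(upscale_factor=2)
# Two leading batch dims instead of the previously required single one:
x = torch.randn(5, 3, 8, 9, 9)  # (*batch, C * r**2, H, W) with batch = (5, 3)
print(ps(x).shape)              # torch.Size([5, 3, 2, 18, 18])
```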

Test Plan: `buck test caffe2/test:nn -- test_pixel_shuffle`

Reviewed By: mruberry

Differential Revision: D25399058

fbshipit-source-id: ab0a7f593b276cafc9ebb46a177e2c1dce56d0de
2020-12-14 14:52:57 -08:00
3a943e9f82 Use Unicode friendly API on Win32 in THAllocator (#47905)
Summary:
This replaces the narrow character set APIs with the wide character set ones in `THAllocator.cpp`. This fixes the potential crashes caused by passing non-ASCII characters in `torch::from_file` on Windows.

See: https://github.com/pytorch/pytorch/issues/47422
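
A sketch of the Python-level repro this fixes (the file name is arbitrary; it just needs non-ASCII characters):

```python
import torch

torch.arange(4, dtype=torch.float32).numpy().tofile("温度.bin")
t = torch.from_file("温度.bin", shared=False, size=4, dtype=torch.float32)
print(t)  # tensor([0., 1., 2., 3.]) instead of a crash on Windows
```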

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47905

Reviewed By: zhangguanheng66

Differential Revision: D25399146

Pulled By: ezyang

fbshipit-source-id: 0a183b65de171c48ed1718fa71e773224eaf196f
2020-12-14 14:24:20 -08:00
1e2d1d7242 Fixed cat transform to work with event_dim > 0 (#49111)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44530

As explained in the issue description, CatTransform does not work with event_dim > 0.
This PR fixes this. If this gets approved I am hoping to do the same for StackTransform as well.

fritzo Can you take a look at this ?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49111

Reviewed By: neerajprad

Differential Revision: D25526005

Pulled By: ezyang

fbshipit-source-id: e14430093f550d5e0da7a311f9cd44796807830f
2020-12-14 14:16:18 -08:00
d5a971e193 Add kernel launch checks in caffe2/aten/src/ATen/native/cuda/ (#49269)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49269

Added C10_CUDA_KERNEL_LAUNCH_CHECK(); after all kernel launches in caffe2/aten/src/ATen/native/cuda.

Several files in the directory still trigger the check_kernel_launches.py tool. These are false positives, as the tool doesn't seem to be parsing macros correctly.

Normalization.cuh <- This file is also highlighted by the check_kernel_launches.py tool, but the highlighted regions are device code where exception handling isn't allowed.

Test Plan:
Check that the code still builds with
```
buck build //caffe2/aten:ATen-cu
```
https://pxl.cl/1tLRB

Also ran
```
buck test //caffe2/aten:atest
```

https://pxl.cl/1tLSw

Reviewed By: r-barnes

Differential Revision: D25487597

fbshipit-source-id: 7a6689534f7ff85a5d2262831bf6918f1fe0b745
2020-12-14 13:46:25 -08:00
86cf1e1358 Add another way to verify ccache in CONTRIBUTING.md (#49337)
Summary:
In case people are confused about how to make sure ccache is working, I added another sentence to the documentation on how to check that the symlinks are correctly set up, in addition to waiting for 2 clean builds of PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49337

Reviewed By: walterddr

Differential Revision: D25535659

Pulled By: janeyx99

fbshipit-source-id: 435696255f517c074dd0d9f96534d22b60f795b2
2020-12-14 13:19:40 -08:00
6820745e28 Revert D25489030: [PyTorch] Make tls_local_dispatch_key_set inlineable
Test Plan: revert-hammer

Differential Revision:
D25489030 (be849ed1fd)

Original commit changeset: 63147bae783e

fbshipit-source-id: 6ce564979078f28ca9b7c80bc89ef492a2993806
2020-12-14 12:45:26 -08:00
4188c374ce Refactor: use version instead of major version in windows build (#49156)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49219

1. update version instead of major version for env var of CUDA_VERSION
2. update related scripts

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49156

Reviewed By: glaringlee

Differential Revision: D25535530

Pulled By: ezyang

fbshipit-source-id: 0712227f2b06b45ee68efc42717c4308fea1abdc
2020-12-14 12:25:05 -08:00
6cfd7c3811 Remove type annotations from signatures in html docs (#49294)
Summary:
One unintended side effect of moving type annotations inline was that those annotations now show up in signatures in the html docs. This is more confusing and ugly than it is helpful. An example for `MaxPool1d`:

![image](https://user-images.githubusercontent.com/98330/102010280-77f86900-3d3d-11eb-8f83-e7ee0991ed92.png)

This makes the docs readable again. The parameter descriptions often already have type information, and there will be many cases where the type annotations will make little sense to the user (e.g., returning typevar T, long unions).

Change to `MaxPool1d` example:

![image](https://user-images.githubusercontent.com/98330/102010304-91011a00-3d3d-11eb-860d-ffa174b4d43b.png)

Note that once we can build the docs with Sphinx 3 (which is far off right now), we have two options to make better use of the extra type info in the annotations (some of which is useful):
- `autodoc_type_aliases`, so we can leave things like large unions unevaluated to keep things readable
- `autodoc_typehints = 'description'`, which moves the annotations into the parameter descriptions.

Another, more labour-intensive option, is what vadimkantorov suggested in gh-44964: show annotations on hover. Could also be done with some foldout, or other optional way to make things visible. Would be nice, but requires a Sphinx contribution or plugin first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49294

Reviewed By: glaringlee

Differential Revision: D25535272

Pulled By: ezyang

fbshipit-source-id: 5017abfea941a7ae8c4595a0d2bdf8ae8965f0c4
2020-12-14 12:19:48 -08:00
9e3c25ff1d sls + layernorm test (#43799)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43799

Test Plan: https://www.internalfb.com/intern/testinfra/testconsole/testrun/3096224784866350/

Reviewed By: venkatacrc

Differential Revision: D23383351

fbshipit-source-id: c312d481ad15bded83bea90beaaae7742d0c54b8
2020-12-14 11:47:49 -08:00
be849ed1fd [PyTorch] Make tls_local_dispatch_key_set inlineable (#49264)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49264

FLAGS_disable_variable_dispatch had to go, but it looks like the only user was some benchmarks anyway.
ghstack-source-id: 118480532

Test Plan: Small improvement (on the order of 0.1%) on internal benchmarks

Reviewed By: smessmer

Differential Revision: D25489030

fbshipit-source-id: 63147bae783e7a45391dd70d86730e48d3e0cafc
2020-12-14 11:17:35 -08:00
c068180a17 [CUDA graphs] Cuda RNG-safe graph capture and replay bindings (#48875)
Summary:
Part 2 of https://github.com/pytorch/pytorch/pull/46148 refactor.  (part 1 was https://github.com/pytorch/pytorch/pull/48694.)
Contains
- a few more CUDAGeneratorImpl diffs to clean up graph capture interaction
- Capture and replay bindings that interact correctly with CUDAGeneratorImpl
- Tests.

Diffs compile and tests pass on my machine (ubuntu 20.04, cuda 11.0) but it needs finetuning for many CI builds.

See [Note [CUDA Graph-safe RNG states]](02d89f9f1d/aten/src/ATen/CUDAGeneratorImpl.h (L13-L85)) for the strategy, based on https://github.com/pytorch/pytorch/pull/46148#issuecomment-724414794.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48875

Reviewed By: zou3519

Differential Revision: D25482654

Pulled By: ngimel

fbshipit-source-id: 634dbc4c6c9d7d0d9a62dc81a52d430561f905fe
2020-12-14 10:51:58 -08:00
25833e5d1c [CrashFix] Make the dst tensor contiguous when copying from metal
Summary: Somehow the destination tensor becomes non-contiguous when copying from Metal. We need to call `.contiguous()` explicitly. See the crash log - https://www.internalfb.com/intern/logview/details/facebook_ios_crashes/1d865405fbc1a45f9517470906c9ec08/

Test Plan:
- verify the crash
- Sandcastle CIs

Reviewed By: dreiss

Differential Revision: D25502884

fbshipit-source-id: 46ee720bf6b6658e51cb56a4e4c16ce121eeabc7
2020-12-14 10:27:06 -08:00
a0432a7020 [AARCH64] Fix vst1q_f32_x2 implementation (#49273)
Summary:
Add memory operands to the inline asm, which informs the compiler that this instruction writes to memory.

Fixes https://github.com/pytorch/pytorch/issues/48901

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49273

Reviewed By: walterddr

Differential Revision: D25512921

Pulled By: malfet

fbshipit-source-id: 474d070e1f7c2167b9958cbeb4e401dc0e4a930b
2020-12-14 10:09:39 -08:00
87636c07bb CUDA BF16 sparse (#48807)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48807

Reviewed By: mruberry

Differential Revision: D25526752

Pulled By: ngimel

fbshipit-source-id: 9ff8e637486cfd67d46daf0c05142bbe611e08ec
2020-12-14 09:55:52 -08:00
690eaf9c43 add channels last for AdaptiveAvgPool2d (#48916)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48916

* optimize adaptive average pool2d forward path
* optimize adaptive average pool2d backward path
* remove unused headers
* rename the header; add adaptive max pooling in the future
* loosen adaptive_pool2d test on nhwc to cover both cuda and cpu devices
* assorted minor changes
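
A quick check of the channels-last propagation this enables (a sketch, not from the PR):

```python
import torch

m = torch.nn.AdaptiveAvgPool2d((5, 5))
x = torch.randn(2, 3, 9, 9).to(memory_format=torch.channels_last)
y = m(x)
print(y.is_contiguous(memory_format=torch.channels_last))  # True
```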

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25399469

Pulled By: VitalyFedyunin

fbshipit-source-id: 86f9fda35194f21144bd4667b778c861c05a5bac
2020-12-14 09:47:46 -08:00
8397a62a64 Fix cvtfp32_bf16 (#41280)
Summary:
For the `Vec256<bfloat16>::blendv()` operator to work correctly, float32 -nan (0xffffffff) must be converted to bfloat16 -nan (0xffff).
But cvtfp32_bf16 converts -nan to nan (0x7fc0).
TODO: Fix float32 +-nan conversion: i.e. float32 nan (0x7fffffff) must be converted to bfloat16 nan (0x7fff)

Closes https://github.com/pytorch/pytorch/issues/41238

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41280

Reviewed By: mruberry

Differential Revision: D23311585

Pulled By: malfet

fbshipit-source-id: 79499ce19f1ec3f6c954a874f1cd47f4ece6bdb5
2020-12-14 08:49:30 -08:00
bd322c8967 Update docstrings of torch.nn.modules.activation.MultiheadAttention (#48775)
Summary:
- Add the link to the original paper (Attention is All You Need)
- Fix indentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48775

Reviewed By: H-Huang

Differential Revision: D25465914

Pulled By: heitorschueroff

fbshipit-source-id: bbc296ec1523326e323587023c126e820e90ad8d
2020-12-14 08:34:33 -08:00
7d406b4a07 [PyTorch] Make TORCH_CHECK less likely to interfere with inlining (#49263)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49263

Now it is smaller and calls to an out-of-line function in
case of failure.
ghstack-source-id: 118480531

Test Plan:
1) Inspect perf profile of internal benchmark, much less
time spent in (for example) `c10::impl::getDeviceImpl`, which calls
TORCH_CHECK and should be inlined
2) Internal benchmarks

Reviewed By: smessmer

Differential Revision: D25481308

fbshipit-source-id: 0121ada779ca2518ca717f75920420957b3bb1aa
2020-12-14 08:11:23 -08:00
eb051afa78 [PyTorch] native_cpp_binding for size() and stride() (#49262)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49262

This uses the newly-added native_cpp_binding feature to avoid
dispatcher overhead for `size()` and `stride()`.
ghstack-source-id: 118480533

Test Plan: CI

Reviewed By: bwasti

Differential Revision: D25446275

fbshipit-source-id: 1215eaa530d5aa3d501f89da8c99d0a487d8c1b6
2020-12-14 08:09:35 -08:00
f54ab8fbfe Revert "Revert D25003113: make validate debug-only in Device copy ctr" (#49123)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49123

This reverts commit 7a4a2df2254b78d8c8d42b9f81b5b261a617466e.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25463531

Pulled By: bdhirsh

fbshipit-source-id: 7c7ecdc1d63ffd137b84a129887c424b2083a958
2020-12-14 07:33:37 -08:00
94a3d4b083 Remove unused operator at::_fft_with_size (#48905)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48905

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25480385

Pulled By: mruberry

fbshipit-source-id: 192d04a1b7e33b4e408cda8a82679c3ae3490a7d
2020-12-13 20:28:41 -08:00
fdadfb6e5d Fix formatting error in set_deterministic documentation (#49136)
Summary:
Fixes a formatting error that was preventing a bulleted list from being displayed properly

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49136

Reviewed By: zou3519

Differential Revision: D25493130

Pulled By: mruberry

fbshipit-source-id: 7fc21e0e2cfa9465a60d2d43b805164316375f01
2020-12-13 19:55:19 -08:00
38ed398580 [fx] Add constant folding pass (#48443)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48443

Add a constant folding pass in FX:
- Iterate over an input graph and tag what nodes are fully constant, i.e. either `get_attr` nodes, or nodes with all inputs that are either `get_attr` or constant
- Use `model_transform.split_by_tags()` to split the graph into two
- Look for the `output` node in the constant graph to get names of attrs that will be folded
- Iterate over the non-constant graph and replace placeholders that are using the same name as the attrs with a `get_attr` as well as a dummy attr on the module
- Return these two graphs in a new `FoldedGraphModule`, which is a normal GraphModule but also stores the constant graph on the side along with a `run_folding()` method that will run const folding and update the dummy parameters with the actual folded parameters
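
A rough sketch of the constant-tagging step only (an illustration, not this PR's implementation, which goes through `split_by_tags()` and `FoldedGraphModule`):

```python
import torch
import torch.fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(4, 4))

    def forward(self, x):
        # `self.w + 1` depends only on an attribute, so it is fully constant
        return x + (self.w + 1)

gm = torch.fx.symbolic_trace(M())

# Tag get_attr nodes, plus nodes whose node inputs are all already constant:
const_nodes = set()
for n in gm.graph.nodes:
    if n.op == "get_attr":
        const_nodes.add(n)
    elif n.op == "call_function" and n.all_input_nodes and all(
        i in const_nodes for i in n.all_input_nodes
    ):
        const_nodes.add(n)

print(sorted(n.name for n in const_nodes))  # ['add', 'w']: the foldable part
```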

Test Plan: Added a couple tests

Reviewed By: 842974287

Differential Revision: D25033996

fbshipit-source-id: 589c036751ea91bb8155d9be98af7dbc0552ea19
2020-12-13 18:06:07 -08:00
f2ba3c1621 Use group.WORLD appropriately in process group initialization. (#48767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48767

As part of investigating
https://github.com/pytorch/pytorch/issues/48464, I realized some weird
inconsistency in how we use `_default_pg` and `group.WORLD`. `group.WORLD`
apparently was an `object()` and never changed despite `_default_pg` changing.
In this sense, `group.WORLD` was being used a constant to refer to the default
pg, but wasn't of type PG at all. In fact the passed in group is also compared
via `==` to `group.WORLD` in many places, and it just worked since the default
argument was `group.WORLD`.

To clean this up, I got rid of `_default_pg` completely and instead used
`group.WORLD` as the default pg throughout the codebase. This also fixes the
documentation issues mentioned in
https://github.com/pytorch/pytorch/issues/48464.

#Closes: https://github.com/pytorch/pytorch/issues/48464
ghstack-source-id: 118459779

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D25292893

fbshipit-source-id: 9a1703c71610aee2591683ab60b010332e05e412
2020-12-13 17:53:42 -08:00
dc4db95540 Update pipeline API to accept arbitrary sequence of Tensors and not just Tuple (#48467)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48467

The current API's forward method only accepted a Tensor or a Tuple of
Tensors, making this more generic by accepting any Sequence of Tensors.
ghstack-source-id: 118436340

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D25181944

fbshipit-source-id: 4db251dad52c01abc69f3d327788f2e4289e6c9d
2020-12-12 17:13:05 -08:00
33b7970d9e fix slow windows test (#49258)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49258

Tested by adding `time.sleep(3)` in SubProcess.run and seeing the test print "test_inherit_tensor: SubProcess too slow".

Sample failure:
https://app.circleci.com/pipelines/github/pytorch/pytorch/249756/workflows/3605479e-1020-4325-9a4c-8bde5ae38262/jobs/9550663

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D25507209

Pulled By: agolynski

fbshipit-source-id: ec808f0f658d0fb4c8447f68ec5ceba2aa66b1b5
2020-12-12 06:48:38 -08:00
cd927875e0 [pt] Replace size(dim) with sizes()[dim] (#49255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49255

- Replace `size(dim)` with `sizes()[dim]` because `sizes()` does not go through the dispatcher and is marginally better.
- Remove unnecessary `size(dim)` and `sizes()` calls by saving the return value of `sizes()` to a temporary var.

Reviewed By: radkris-git

Differential Revision: D25488129

fbshipit-source-id: 4039e0609df20d5888666a71ad93b15e9a2182c5
2020-12-12 00:51:26 -08:00
717f31d984 Remove unused reconstruct_scopes function (#48822)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48822

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D25325012

Pulled By: cccclai

fbshipit-source-id: 86ea4c0b2926257c0f82aa05cbcd83278b1b67f7
2020-12-11 23:43:36 -08:00
dc92f25b38 [te] Use c10::ScalarType utility functions in te::Dtype (#49148)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49148

Instead of defining our own variants.  I'm pretty sure this fixes a bug too, in that Bfloat16 wasn't being considered FP.  Otoh, I don't think it's possible to create TEs with Bfloat so...
ghstack-source-id: 118415314

Test Plan: `buck test //caffe2/test:jit`

Reviewed By: robieta

Differential Revision: D25456767

fbshipit-source-id: bd5822114b76c4fde82f566308909bd2a55f4f21
2020-12-11 22:41:57 -08:00
eaac28192c [te] Use Dtype::is_signed instead of an ad hoc local predicate. (#49147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49147

D25456366 adds Dtype::is_signed (which is backed by c10::isSignedType), so use that instead of this one-off.
ghstack-source-id: 118415315

Test Plan:
```
buck test //caffe2/test{:jit,:tensorexpr,/cpp/tensorexpr:tensorexpr}
```

Reviewed By: robieta

Differential Revision: D25456683

fbshipit-source-id: 428f1e8bff21ea05730690226a44984995c4c138
2020-12-11 22:41:54 -08:00
ae88d25c23 [te] Fix clamp with uint8 args (#49143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49143

Riddle me this, batman: how could `torch.clamp(torch.tensor([0], dtype=torch.uint8), -10, 10)` equal `10`?  The answer: the min/max args are first cast to the dtype of the input, giving min=246 and max 10.  Then you have to apply Min and Max in the right order: `Min(Max(in, min), max)`.  Differ in any way and you're doomed.  Hooray.
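
Spelled out in eager mode (behavior as of this commit; later releases may promote scalar bounds differently):

```python
import torch

x = torch.tensor([0], dtype=torch.uint8)
# The bounds are first cast to uint8: min = -10 -> 246, max = 10 -> 10.
# Eager mode then computes Min(Max(x, 246), 10), which is 10:
print(torch.clamp(x, -10, 10))  # tensor([10], dtype=torch.uint8)
```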

This PR makes TE match eager mode for this operator, plus fixes a major facepalm in the llvm min/max codegen where we were always generating signed comparisons.
ghstack-source-id: 118415318

Test Plan: `buck test //caffe2/test:{jit,tensorexpr}`

Reviewed By: robieta

Differential Revision: D25456366

fbshipit-source-id: dde3c26c2134bdbe803227601fa3d23eaac750fb
2020-12-11 22:36:52 -08:00
8999915a86 Fix "Missing return statement" mypy error (#49276)
Summary:
Adds `return None` after `assert_never` in the inner `get_one` function
Without it, TestTypeHints.test_run_mypy_strict using mypy  0.770 fails with the above mentioned error, see https://app.circleci.com/pipelines/github/pytorch/pytorch/249909/workflows/597d8e34-ff04-4efa-9dde-9e28fbded341/jobs/9557705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49276

Reviewed By: jamesr66a

Differential Revision: D25513658

Pulled By: malfet

fbshipit-source-id: 318eaff7e0534b10eafe46c0b834b7f7cefea757
2020-12-11 22:18:50 -08:00
b5b8fe9876 Revert D25434956: [JIT] Use is_buffer in BufferPolicy::valid
Test Plan: revert-hammer

Differential Revision:
D25434956 (a480ca5302)

Original commit changeset: ff2229058abb

fbshipit-source-id: faba801e9b5e9fa0117624350518592868856eec
2020-12-11 21:10:15 -08:00
693e908656 [shape inference] fix ConstantFill
Test Plan: unit test

Reviewed By: yinghai

Differential Revision: D25326529

fbshipit-source-id: 1322635567f6661637cde90cadaac0197975e133
2020-12-11 19:40:42 -08:00
8d58362f59 [PyTorch] Remove native::zeros reference in TensorIndexing (#49117)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49117

Partially resolves https://github.com/pytorch/pytorch/issues/48684. It essentially calls the same functionality inside at::native::zeros().

After this diff, all references to aten::native symbols are removed.

ghstack-source-id: 118261305

Test Plan: CI

Reviewed By: dhruvbird

Differential Revision: D25444940

fbshipit-source-id: 7f782680daa3aedd1b7301cb08576da2ec70c188
2020-12-11 18:50:51 -08:00
635f1cd1a5 Enable LayerNorm test cases
Summary: Remove Skip from test defs.

Test Plan: https://our.intern.facebook.com/intern/testinfra/testrun/1407375060598951

Reviewed By: hyuen

Differential Revision: D25513174

fbshipit-source-id: 0ddfd1713cf7b9daf25f6e62df92d682cade350f
2020-12-11 17:58:24 -08:00
76d41c801e [JIT] Fix toIValue handling of AttributeError when casting ClassType (#49188)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49188

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D25476573

Pulled By: jamesr66a

fbshipit-source-id: cec296fae71cc0cdf36bde60417d7d3b1aa84198
2020-12-11 17:54:16 -08:00
29f0fa36b1 [Gradient Compression] Minor update of the comments on PowerSGD. (#49246)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49246

Previously the comment on matrix_approximation_rank was in the PowerSGD_hook function. Now it is moved into PowerSGDState, because the function arg has already been moved into this state as an attribute.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 118414247

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D25501091

fbshipit-source-id: 701e3109a9a3f2a5f9d18d5bf6d0a266518ee8ea
2020-12-11 17:45:53 -08:00
21c38e1799 Additional validation for DistributedSampler. (#48865)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48865

If DistributedSampler was provided an invalid rank (ex:
https://discuss.pytorch.org/t/distributed-datasets-on-multi-machines/105113),
it failed with a cryptic assertion failure.

To fix this issue, I've added an additional check to DistributedSampler to
validate we provide a valid rank.
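
A sketch of the fail-fast behavior (the exact message is illustrative):

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

ds = TensorDataset(torch.arange(10))
try:
    DistributedSampler(ds, num_replicas=2, rank=5)  # rank must be in [0, 1]
except ValueError as e:
    print(e)  # clear validation error instead of a cryptic assertion later
```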
ghstack-source-id: 117906769

Test Plan:
1) waitforbuildbot
2) Unit test added.

Reviewed By: malfet

Differential Revision: D25344945

fbshipit-source-id: 7685e00c8b2c200efbd2949fb32ee32ea7232a08
2020-12-11 17:22:22 -08:00
6b78644623 [te] Add BitCast to the IR (#49184)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49184

Adds BitCasting to NNC.  This will enable fast approximation algorithms implemented directly in TensorExpressions

Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr

Reviewed By: bertmaher

Differential Revision: D25466476

fbshipit-source-id: f063ab29ba7bab2dcce463e499f2d4a16bdc1f0e
2020-12-11 16:12:20 -08:00
5716b7db72 Enabled Scalar lists (#48222)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48222

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25074765

Pulled By: izdeby

fbshipit-source-id: 96ebe3c9907178c9338c03fb7993b2ecb26db8f4
2020-12-11 16:04:50 -08:00
bfce69d620 inline has function for DispatchKeySet (#49191)
Summary:
Inlines the `has` function for DispatchKeySet, which is frequently used in TensorImpl in calls such as `is_sparse`, `is_cuda`, etc.
This increases `empty` instruction count (1853228 -> 1937428) without appreciable effect on runtime, and noticeably reduces instruction counts for `copy_` and friends that have to rely on `is_sparse`, `is_cuda` and the like a lot to decide which path to take (3269114 -> 2634114).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49191

Reviewed By: H-Huang

Differential Revision: D25483011

Pulled By: ngimel

fbshipit-source-id: 2f3ab83e2c836a726b9284ffc50d6ecf3701aada
2020-12-11 15:55:40 -08:00
53aa9b8c82 [FX] Move none assignments to same line (#49209)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49209

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D25484975

Pulled By: jamesr66a

fbshipit-source-id: 44207be878f95ec9420e87af79833191d5cc0c7e
2020-12-11 15:45:40 -08:00
2f359e7d55 Add tensorpipe agent tests to multigpu tests. (#49210)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49210

The RPC tests use multiple gpus in some cases (ex: DDP + RPC and Pipe
+ DDP). We should enable multigpu tests for this purpose.
ghstack-source-id: 118366595

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D25485506

fbshipit-source-id: eabbf442471ebc700b5986bc751879b9cf72b752
2020-12-11 15:00:38 -08:00
df027bfd2c Modify Pipe to return an RRef. (#47829)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47829

As per proposal in https://github.com/pytorch/pytorch/issues/44827,
the API needs to return an RRef to support inter-host pipelining.

For now, we just return a local RRef and only support pipeline on a single
host. But having this change in the API upfront ensures we don't make any BC
breaking changes later.
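
A hedged sketch of the resulting call pattern; the module path, constructor arguments, and RPC-initialization requirement here are assumptions based on the pipeline API of this era, not statements from the PR:

```python
import torch
import torch.distributed.rpc as rpc
from torch.distributed.pipeline.sync import Pipe  # path is an assumption

rpc.init_rpc("worker0", rank=0, world_size=1)  # Pipe builds on the RPC framework

fc1 = torch.nn.Linear(16, 8).cuda(0)
fc2 = torch.nn.Linear(8, 4).cuda(1)
model = Pipe(torch.nn.Sequential(fc1, fc2), chunks=2)

out_rref = model(torch.randn(8, 16).cuda(0))  # forward() now returns an RRef
out = out_rref.local_value()                  # local RRef: single host for now
```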
ghstack-source-id: 118366784

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D24914022

fbshipit-source-id: e711e7d12efa45645f752f0e5e776a3d845f3ef5
2020-12-11 14:55:16 -08:00
c6147ae4c9 [PyTorch] Fix getCustomClassType() perf (#48981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48981

1) It was copying the entire hash table every time.
2) We don't need to do a hash lookup at all.
ghstack-source-id: 118164406

Reviewed By: dzhulgakov

Differential Revision: D25385543

fbshipit-source-id: 6be95c742d6713345c51859ce36a7791a9e2e3f0
2020-12-11 14:20:01 -08:00
6c1b405a3b Updated derivative rules for complex QR decomposition (#48489)
Summary:
Updated `qr_backward` to work correctly for complex-valued inputs.
Added `torch.qr` to list of complex tests.

The previous implementation for real-valued differentiation used equation 42 from https://arxiv.org/abs/1001.1654
The current implementation is a bit simpler but the result for the real-valued input case is the same and all tests still pass.
Derivation of complex-valued QR differentiation https://giggleliu.github.io/2019/04/02/einsumbp.html

Ref. https://github.com/pytorch/pytorch/issues/33152
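
A small smoke test of the updated rule (a sketch, assuming a build containing this change):

```python
import torch

a = torch.randn(3, 3, dtype=torch.cdouble, requires_grad=True)
q, r = torch.qr(a)
(q.abs().sum() + r.abs().sum()).backward()  # exercises complex qr_backward
print(a.grad.dtype)  # torch.complex128
```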

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48489

Reviewed By: bdhirsh

Differential Revision: D25272344

Pulled By: albanD

fbshipit-source-id: b53c1fca1683f4aee5f4d5ce3cab9e559170e7cf
2020-12-11 14:14:40 -08:00
e3542d2c12 [PyTorch] avoid unnecessary call to empty_tensor_restride in empty() (#48211)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48211

Our empty benchmark makes this call unconditionally. If
MemoryFormat::Contiguous is indeed a common case (or if workloads are
likely to use a consistent-ish memory format), then I'd expect
checking first to be a win.
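
In plain Python terms, the change amounts to something like the following sketch (the helper names are hypothetical, not the actual C++):

```python
def finish_empty(tensor, memory_format="contiguous"):
    if memory_format != "contiguous":        # common case: skip the call
        tensor = restride(tensor, memory_format)
    return tensor

def restride(tensor, memory_format):         # hypothetical stand-in
    return tensor
```
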
ghstack-source-id: 118224990

Test Plan:
Profiled empty benchmark with perf, saw time spent in empty_tensor_restride go down.

Ran framework overhead benchmarks. ~7% win on empty(), 0.5-1.5% regression on InPlace, ~2% win on OutOfPlace. Seems like both the In/Out of place ones are likely to be noise because they don't exercise empty?

Reviewed By: bhosmer

Differential Revision: D24914706

fbshipit-source-id: 916771b335143f9b4ec9fae0d8118222ab6e8659
2020-12-11 13:57:57 -08:00
4bc4ec2686 Reduce kineto logging (#49216)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49216

Libkineto is pretty verbose by default; this diff uses the libkineto API
to reduce the amount of logging.

Test Plan:
TORCH_CUDA_ARCH_LIST="6.0;7.0" USE_CUDA=1 USE_MKLDNN=1 BUILD_BINARY=1
python setup.py develop install --cmake

python test/test_profiler.py

Imported from OSS

Reviewed By: ngimel

Differential Revision: D25488109

fbshipit-source-id: 61b443bcf928db939f730ba32711385bb2b622d4
2020-12-11 13:50:13 -08:00
15200e385a Enable torch.where() to support Float16 & BFloat16 type inputs (#49004)
Summary:
Fixed https://github.com/pytorch/pytorch/issues/49075
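
A minimal check of the newly supported dtypes:

```python
import torch

cond = torch.tensor([True, False])
a = torch.tensor([1.0, 2.0], dtype=torch.float16)
b = torch.tensor([3.0, 4.0], dtype=torch.float16)
print(torch.where(cond, a, b))                       # float16 result
print(torch.where(cond, a.bfloat16(), b.bfloat16())) # bfloat16 result
```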

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49004

Reviewed By: zou3519

Differential Revision: D25495225

Pulled By: H-Huang

fbshipit-source-id: 09418ee5503f65c8862e40119c5802779505a4db
2020-12-11 13:36:41 -08:00
218eaf4bba pyi codegen refactor - no need to group python signatures by overload name (#49057)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49057

Now that all of the byte-for-byte hacks are removed in the pyi codegen, there's no reason for the codegen to group pyi signature overloads together. I updated the logic in `gen_pyi` that computes signatures (`generate_type_hints()` and `_generate_named_tuples()`) to operate per individual `PythonSignatureGroup`.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25410849

Pulled By: bdhirsh

fbshipit-source-id: 8c190035d7bfc06ed192468efbe7d902922ad1fa
2020-12-11 13:29:24 -08:00
33a9b14da0 pyi codegen - removing byte-for-byte-compatibility hacks (sorting overloads) (#49056)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49056

This removes another byte-for-byte compatibility hack: I'm now sorting pyi signature overloads (previously the codegen did not).

Mostly put this in a separate PR just to more easily reason about the diff in the codegen output.

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D25410846

Pulled By: bdhirsh

fbshipit-source-id: 06e5c32edbce610dd12ec7499014b41b23c646bd
2020-12-11 13:29:22 -08:00
b94ec8c9f7 pyi codegen - removing byte-for-byte compatibility hacks (#49055)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49055

Removed the majority of the TODO hacks that I added to the original pyi PR to maintain byte-for-byte compatibility.

I left a few of the divergences between pyi deprecated vs. native signatures, since (a) they're smaller and (b) it might make more sense to kill the deprecated functions at some point entirely.

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D25410847

Pulled By: bdhirsh

fbshipit-source-id: cf07cdda92f7492cd83d363cbb810e3810f6b8c8
2020-12-11 13:29:19 -08:00
9920adebfd pyi cleanup (#49054)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49054

These are some followups from the first pyi codegen PR. Still maintaining byte-for-byte compatibility in this one.

- Separated `argument_str()` with a pyi flag into two functions, `argument_str()` and `argument_str_pyi()`
- Added a notes section for pyi at the top of `python.py`
- Added a `Python Interface` section that I moved the free-standing pyi functions to

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D25410848

Pulled By: bdhirsh

fbshipit-source-id: db83a80af900c32b5e32d67ce27767f6e7c2adfb
2020-12-11 13:27:41 -08:00
db5e5b439c Extra sampling of record function events [resend] (#49114)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49114

resend of https://github.com/pytorch/pytorch/pull/48289

Test Plan: see 48289

Reviewed By: robieta

Differential Revision: D25443365

Pulled By: ilia-cher

fbshipit-source-id: c15ac312222bb4d744e10199ed79801cccae8227
2020-12-11 12:53:37 -08:00
1cb5aa6c60 Fix structured kernel codegen (#49244)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49244

see https://fb.quip.com/ceEdANd5iVsO

RegisterMkldnnCPU kernels incorrectly used makeUnboxedOnly() calls to register add_.Tensor kernels. This is because the codegen incorrectly thought they're not c10-full.
This PR fixes that.
ghstack-source-id: 118411117

Test Plan: After this PR, RegisterMkldnnCPU doesn't contain the makeUnboxedOnly() calls anymore.

Reviewed By: ezyang

Differential Revision: D25500246

fbshipit-source-id: 8a8c2be9c4f4a5ce7eaae94257c2f8cbd176e92e
2020-12-11 12:37:35 -08:00
2a3bb1cea0 [quant][graphmode][fx][fix] Fix typo in fusion (#49183)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49183

Test Plan: Imported from OSS

Reviewed By: hx89

Differential Revision: D25473367

fbshipit-source-id: 0cd5e6769eeea0923dd104ea90b0192e3475b3ad
2020-12-11 12:14:53 -08:00
796b267763 fix backwards compatibility for #48711 and its revert (#49240)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49240

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D25500727

Pulled By: agolynski

fbshipit-source-id: 6a690f52fe671267862b159b6330d37ef08ee291
2020-12-11 12:07:55 -08:00
f965b0fcfb Expose run_async function on torch::jit::Method (#48607)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48607

This change builds on top of
https://github.com/pytorch/pytorch/pull/46865

further exposing the async interface to `torch::jit::Method`.

added unit test for new `run_async`

Test Plan: `buck test caffe2/test/cpp/jit/...`

Reviewed By: dzhulgakov

Differential Revision: D25219726

fbshipit-source-id: 89743c82a0baa1affe0254c1e2dbf873de8e5c76
2020-12-11 11:17:58 -08:00
42c78ed745 Tuple Slice with both negative and positive stepped size (#48660)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48660

We previously supported tuple slicing only without a step size; this PR extends the feature to arbitrary step sizes. We do this by manually reconstructing a new tuple in the IR instead of relying on the TupleSlice prim.
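
A sketch of what should now compile (post-change behavior; assumes TorchScript infers the sliced tuple type statically):

```python
from typing import Tuple
import torch

@torch.jit.script
def every_other(t: Tuple[int, int, int, int]) -> Tuple[int, int]:
    return t[::2]    # arbitrary (here positive) step sizes now work

print(every_other((1, 2, 3, 4)))   # (1, 3)
```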

Test Plan:
python tests

Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D25359336

fbshipit-source-id: 28cde536f28dd8a00607814b2900765e177f0ed7
2020-12-11 11:00:38 -08:00
c0a0845019 Improve new_group example in the context of SyncBatchNorm (#48897)
Summary:
Closes https://github.com/pytorch/pytorch/issues/48804
Improves the documentation/example in the SyncBN docs to clearly show that each rank must call into all `new_group()` calls when creating process subgroups, even if it is not going to be part of a particular subgroup.
We then pick the right group, i.e. the group that the rank is part of, and pass that into the SyncBN APIs.
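
A minimal sketch of the documented pattern (assumes a world of 4 ranks split into two subgroups and an already-initialized process group):

```python
import torch
import torch.distributed as dist

def subgroup_for(rank):
    g01 = dist.new_group(ranks=[0, 1])   # every rank calls both
    g23 = dist.new_group(ranks=[2, 3])   # new_group() invocations...
    return g01 if rank < 2 else g23      # ...but keeps only its own group

# model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(
#     model, process_group=subgroup_for(dist.get_rank()))
```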

Doc rendering:

<img width="786" alt="syncbn_update" src="https://user-images.githubusercontent.com/8039770/101271959-b211ab80-373c-11eb-8b6d-d56483fd9f5d.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48897

Reviewed By: zou3519

Differential Revision: D25493181

Pulled By: rohan-varma

fbshipit-source-id: a7e93fc8cc07ec7797e5dbc356f1c3877342cfa3
2020-12-11 10:28:08 -08:00
f10b53d9ea [PyTorch Mobile] Record dtypes for tensors used in kernel function implementations (#48826)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48826

This change updates various macros to pass in the kernel tag string (`const char*`) into the macro that sets up the `case` statement for the dtype switch. This macro already receives the dtype (enum) which we also need.

There are 2 phases we need to build out for the `dtype` tracing to work:
1. Recording Phase
2. Conditional Compilation Phase

For this most part, this change is trying to focus on [1] (The Recording Phase) and sets up a new `RecordScope` enum value to track kernel dtypes. This code is compiled in only if a specific macro is defined (since this is an **extremely** hot code path, and even the slightest regression here can cause tremendous slow down overall).

I have only added a skeleton of the phase [2] (Conditional Compilation Phase) and there is a no-op `constexpr` method that selects every dtype in the kernel implementation. In subsequent diffs, this will be updated to point to a code-generated function based on the result of tracing the models that were requested.
ghstack-source-id: 118336675

Test Plan: See the next few diff in the stack for the application of this change to both record triggered dtypes (in kernel functions) as well as select dtype specific portions of kernel functions.

Reviewed By: ezyang

Differential Revision: D24220926

fbshipit-source-id: d7dbf21c7dcc6ce981d0fd4dcb62ca829fe3f69d
2020-12-11 09:41:52 -08:00
f204f77e6d Drop FutureNCCL in favor of vanilla CUDAFuture (#49014)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49014

We extracted a generic and reusable CUDAFuture class from FutureNCCL, but we had left FutureNCCL around, as a subclass of CUDAFuture, in order to deal with some peculiarity of ProcessGroupNCCL, namely that the future would be completed right away when constructed and that its CUDA events would be _shared_ with the ones of the WorkNCCL. This required some "hacks" in CUDAFuture itself (protected members, fields wrapped in shared_ptrs, ...).

My understanding is that creating CUDA events is a rather cheap operation, which means we can afford to record the events _twice_ after each NCCL call, once for the WorkNCCL and once for the future. By doing so, we can use the CUDAFuture class directly and revert all its hacks.
ghstack-source-id: 118391217

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25355272

fbshipit-source-id: 3a2a0891724928221ff0f08600675d2f5990e674
2020-12-11 09:25:05 -08:00
dcd1e3d78d [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D25490983

fbshipit-source-id: b24a11214a485a4a24ccf7da1e72715b450d3a81
2020-12-11 08:43:24 -08:00
2bb2f641c4 Bring fast_nvcc.py to PyTorch OSS (#48934)
Summary:
This PR adds `tools/fast_nvcc/fast_nvcc.py`, a mostly-transparent wrapper over `nvcc` that parallelizes compilation of CUDA files when building for multiple architectures at once.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48934

Test Plan: Currently this script isn't actually used in PyTorch OSS. Coming soon!

Reviewed By: walterddr

Differential Revision: D25286030

Pulled By: samestep

fbshipit-source-id: 971a404cf57f5694dea899a27338520d25191706
2020-12-11 08:17:21 -08:00
88b3d3371b add additional arm64 checker in cmake files (#48952)
Summary:
tentatively fixes https://github.com/pytorch/pytorch/issues/48873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48952

Reviewed By: H-Huang

Differential Revision: D25463266

Pulled By: walterddr

fbshipit-source-id: 40afefffe8ab98ae7261c770316cb9c25225285f
2020-12-11 08:10:09 -08:00
2f1d1eb7df Revert D25428587: [pytorch][PR] add additional interpolation modes for torch.quantile
Test Plan: revert-hammer

Differential Revision:
D25428587 (25a8397bf3)

Original commit changeset: e98d24f6a651

fbshipit-source-id: fb217b8a19e853e83779a4edd312be86b26eb26d
2020-12-11 07:50:16 -08:00
5ab90b2fda Make CUDAFuture remember and restore current device in callback (#48789)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48789

CUDAFuture aims to "capture" the current state of CUDA-related stuff when the future is marked complete (e.g., by looking at current streams and recording events on them) and then "replicate" a similar state when users synchronize with the result of the future (by synchronizing the current streams with these events).

However, one "contextual" aspect of CUDA that we weren't capturing/replicating was the current device. This diff tries to fix that. I must mention that we can only do this for callbacks, while we cannot do it for the wait() method. I don't know if such a discrepancy between the two actually makes the overall behavior _worse_. I'd love to hear people's opinions on this.
ghstack-source-id: 118081338

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25210335

fbshipit-source-id: 1d1a3f80b1cc42e5114bc88554ed50617f1aaa90
2020-12-11 03:35:53 -08:00
2b1057b0cf [RPC Framework] Support retrieving the RRef to the remote module (#48983)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48983

Expose an API for users to retrieve the RRef for the underlying module.

This would be useful if users would like to run custom code on the remote end for the nn.Module.
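
A hedged usage sketch (assumes an initialized RPC agent; `get_module_rref` is the accessor this summary describes):

```python
from torch.distributed.nn import RemoteModule

def module_rref_of(remote_module: RemoteModule):
    rref = remote_module.get_module_rref()  # RRef to the remote nn.Module
    # `rref` can be passed to rpc.remote()/rpc.rpc_async() calls that run
    # custom code on the owning worker
    return rref
```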

Original PR issue: RemoteModule enhancements #40550
ghstack-source-id: 118378601

Test Plan: buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule

Reviewed By: pritamdamania87

Differential Revision: D25386042

fbshipit-source-id: 2dff33e8d5c9770be464eacf0b26c3e82f49a943
2020-12-10 23:53:44 -08:00
8669f02573 Saves a copy of vector<Tensor> in view ops returning TensorList. (#49149)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49149

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D25480104

Pulled By: ailzhang

fbshipit-source-id: 749345164662b15ec56b7b85a64011929e90c0b2
2020-12-10 23:42:26 -08:00
fce059d4ff [te] Don't throw when re-registering a CodeGen factory (#49174)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49174

We've seen this happening when libtorch is loaded repeatedly on macOS. Tbh I'm not sure I understand why this happens; why do we re-construct these static objects but re-use the static registry itself? But it's fairly straightforward to just overwrite the factory method, and there's no harm in doing so.
ghstack-source-id: 118306581

Test Plan: compile

Reviewed By: ZolotukhinM

Differential Revision: D25466642

fbshipit-source-id: 4c456a57407f23fa0c9f4e74975ed1186e790c74
2020-12-10 23:37:29 -08:00
56a157fc79 hacky_wrapper_for_legacy_signatures reorders out arguments (#48911)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48911

This enables us to use hacky_wrapper_for_legacy_signatures for ops with out arguments so they can use templated unboxing logic without having to be rewritten.

This only actually enables it for one op as a proof of concept. There will be a separate PR enabling it for more ops.
ghstack-source-id: 118379659

Test Plan: waitforsandcastle

Reviewed By: bhosmer

Differential Revision: D25363336

fbshipit-source-id: da075d2cc58814f886a25d52652511dbbe990cec
2020-12-10 23:29:00 -08:00
da6f249a10 [caffe2] DeserializeToNDArray (#49135)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49135

Differential Revision: D25417845

fbshipit-source-id: 4d8efd440bc2577fb717f911a401e7b81d48b907
2020-12-10 21:59:25 -08:00
59e822026c Add manual_cpp_binding to native_functions.yaml (#49092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49092

Functions which specify manual_cpp_binding don't automatically
get C++ bindings generated for them in TensorBody.h or
Functions.h.  This lets end users manually define the bindings
themselves, which may be helpful if there is a way to
short circuit the dispatcher entirely.  contiguous() is switched
to use this mechanism.

Although manual_cpp_binding suggests that we don't generate the
binding at all, it is often the case that there is some "fast
path", but when this path is not satisfied, we should go back
to the slow dispatch.  So we still generate a fallback method/function
which the user-defined binding can call into in case that we
have to go slowpath.
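
In Python-flavored terms, the pattern looks roughly like this (a sketch with hypothetical helper names, not the generated C++):

```python
def contiguous(self, memory_format="contiguous"):
    if is_contiguous(self, memory_format):   # fast path: same tensor back
        return self
    # slow path: go through the dispatcher (and thus autograd)
    return contiguous_fallback(self, memory_format)

def is_contiguous(t, fmt):                   # hypothetical stand-ins so
    return True                              # the sketch is self-contained

def contiguous_fallback(t, fmt):
    return t
```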

The correctness conditions for bindings manually written in this
way are subtle.  Here are the ones I can think of off the top
of my head:

- Whatever condition is tested in the C++ body, must ALSO be
  tested again in the native:: implementation on the other
  side of the dispatcher.  This is because you are NOT GUARANTEED
  to hit the native:: implementation through the C++ binding,
  you may go straight to the implementation via a boxed call.

- If a binding is written in this way, it is only safe to
  skip dispatch if you would have returned the same tensor as
  before.  In any situation you would return a fresh tensor,
  you MUST go to the slow path, because you need to actually
  get to the autograd kernel.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D25428440

Pulled By: swolchok

fbshipit-source-id: 6e71767cb8d1086d56cd827c1d2d56cac8f6f5fe
2020-12-10 21:56:53 -08:00
743a4ef0ae [PyTorch] Enable AutoNonVariableTypeMode in static runtime (#49199)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49199

This should save us an extra round of dispatch for resize_,
resize_as_, detach_, and copy_, at the cost of disabling profiling and
tracing. I'm told that static runtime has its own per-op profiling and
we don't need tracing.
ghstack-source-id: 118348314

Test Plan:
Code review to confirm lack of need for profiling &
tracing, and that there isn't a different switch we should be using
instead.

Internal benchmarks -- seeing 11-12% improvement in overall runtime

Reviewed By: hlu1

Differential Revision: D25476819

fbshipit-source-id: 71e2c919b386b25c41084e2e4a54fe765a4f8f22
2020-12-10 21:51:59 -08:00
696e30af6e Fix ProcessGroupNCCL profiling when profiler is not run with use_cuda (#48946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48946

Move recordFunctionEndCallback to after the blocking portion of launching the NCCL kernel, and remove addCallback, since it runs the lambda inline anyway and triggers unnecessary CUDA stream logic. If we want CUDA operations such as NCCL kernels accurately profiled, we should use the profiler with use_cuda=True. However, we are currently debugging a deadlock for the use_cuda=True case; the fix is being tracked in #48987.

To ensure that the tests are no longer flaky, I submitted this PR to ci-all (#48947) and ran the test a bunch of times while ssh'd into the CI machine.

ghstack-source-id: 118330130

Test Plan: Ci

Reviewed By: mrzzd

Differential Revision: D25368322

fbshipit-source-id: 7d17036248a3dcd855e58addc383bba64d6bc391
2020-12-10 21:09:41 -08:00
cc3b59f6df [package] use bazel-style glob matching for mock/extern (#49066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49066

This PR tweaks mock_module and extern_module. They are now renamed
mock and extern, and now only edit the package when a module matching
the pattern specified is required through dependency analysis.

save_extern_module and save_mock_module are added to explicitly modify
the package, but should not be needed by most users of the API unless they
are overriding require_package.

mock and extern now use bazel-style glob matching rules
(https://docs.bazel.build/versions/master/be/functions.html#glob).
i.e. `torch.**` matches `torch` and `torch.bar` but not `torchvision`.
mock and extern also now take an exclude list to filter out packages
that should not apply to the action.
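
A hedged usage sketch (the exact signatures are assumptions based on this summary):

```python
from torch.package import PackageExporter

exporter = PackageExporter("out.pt")
# `torch.**` matches `torch` and `torch.bar`, but not `torchvision`
exporter.extern(["torch.**"], exclude=["torch.testing.**"])
exporter.mock(["numpy.**"])
exporter.save_pickle("my_pkg", "obj.pkl", {"answer": 42})
exporter.close()
```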

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D25413935

Pulled By: zdevito

fbshipit-source-id: 5c06b417bee94ac8e72c13985b5ec42fcbe00817
2020-12-10 21:01:11 -08:00
159f258415 Update Kineto revision (#49200)
Summary:
Updating to a newer revision

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49200

Test Plan:
USE_KINETO=1 TORCH_CUDA_ARCH_LIST="6.0;7.0" USE_CUDA=1 USE_MKLDNN=1
BUILD_BINARY=1 python setup.py develop install --cmake
python test/test_profiler.py

python test/test_autograd.py -k test_profile
python test/test_autograd.py -k test_record

Reviewed By: ngimel

Differential Revision: D25480439

Pulled By: ilia-cher

fbshipit-source-id: bca1f708f5e4a052028304b918a3adae9324318f
2020-12-10 19:51:10 -08:00
5469aa5e7f [NNC] Add a non functional Tensor kind (#48750)
Summary:
Adds the CompoundTensor, a specialisation of the NNC Tensor which allows arbitrary production statements. This will allow lowering of aten ops into specific NNC IR patterns (which don't need to be functional) - allowing us to shortcut to the optimized form of common patterns.

This is part 1 of trying to clean up the lowering of aten::cat so it is easier to optimize.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48750

Reviewed By: tugsbayasgalan

Differential Revision: D25433517

Pulled By: nickgg

fbshipit-source-id: de13c4719f8f87619ab254e5f324f13b5be1c9da
2020-12-10 19:43:50 -08:00
9b0ffb9fb3 Delete cpp.group_arguments (#49043)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49043

Previously, this function had nontrivial algorithmic content,
but after #48195, this was just a swiss army knife for pasting
together arguments while maintaining structure.  I added some
more properties for Arguments for convenient access in this way,
and then inlined the implementation of group_arguments into all of its call
sites, simplifying whenever contextual.  This might be controversial, but I
think the resulting code is easier to understand.

You may notice that there is some modest code duplication between
dispatcher.cpparguments_exprs and CppSignature.argument_packs.
This is a known problem and I will be attempting to fix it in
a follow up PR.

Confirmed to be byte-for-byte compatible.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D25455885

Pulled By: ezyang

fbshipit-source-id: 8fbe066e8c3cb7ee8adb5b87296ec5bd7b49e01f
2020-12-10 18:20:46 -08:00
267641a245 Rename positional and kwarg_only to have flat prefix (#49042)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49042

I want the names positional and kwarg_only to give the unflat
representation (e.g., preserving TensorOptionsArguments in the
returned Union).  So I regret my original naming choice when
I moved grouping to model.  This renames them to have flat_ prefix
and also adds a flat_non_out argument for cases where you just
want to look at non-out arguments.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D25455884

Pulled By: ezyang

fbshipit-source-id: f923f8881267a3e3e8e9521519412f7cc25034fc
2020-12-10 18:20:43 -08:00
0dea76ecda Delete some dead functions from tools.codegen.api.meta (#49041)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49041

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D25455886

Pulled By: ezyang

fbshipit-source-id: 5d7834d52f7032820ac2c73358bda77187c17224
2020-12-10 18:16:09 -08:00
882eb0f646 [quant][graphmode][fx] Add support for dynamic quant for RNN and RNNCell (#49126)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49126

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_rnn
python test/test_quantization.py TestQuantizeFxOps.test_rnn_cell

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D25449047

fbshipit-source-id: 532bf9ad2839958dde8c6f2d9399fac96b2b8bd4
2020-12-10 18:11:40 -08:00
a47a087a43 [NNC] Add missing data type support for abs and frac (#48679)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48679

This addresses the remaining problem reported in issue #48053

Data type support for aten kernels in SimpleIREvaluator is not
consistent w/ the aten::native library implementation. In SimpleIREvaluator,
  - only float/double are supported on aten::abs (integral types and half
are missing)
  - only float/double are supported on aten::frac (half are missing)

It is also not clear from the kernel.cpp source code what the expected
input data types for an aten kernel are, leading to potential missing data
type issues down the road.

This commit addresses both issues in a limited way by
 - Added type promotion ops from half/integral input types to float
 - Added skeleton support for some type checking for aten kernels;
   currently this only checks for valid data types for frac and abs, to limit
   the scope of the change, but the utility function can be used to
   consistently add type checking for all aten functions

Known limitations:
 - abs support for integral types can be made more effective by invoking
 std::abs for integral tensors (currently kFabs maps to std::fabs).
 Since that change is a bit more involved (e.g., changing IntrinsicsOp
 kFabs to kAbs and other code generators accordingly), will leave it to
 another issue
 - other aten kernels may need similar type checking and some scrutiny
 on the use of promoteToFloat to detect invalid data types early on.
 That is also left for another issue

Test Plan:
test_jit_fuser_te.test_unary_ops

Imported from OSS

Reviewed By: asuhan

Differential Revision: D25344839

fbshipit-source-id: 95aca04c99b947dc20f11e4b3bae002f0ae37044
2020-12-10 17:47:15 -08:00
7feec06dfe Only 1 TensorImpl allocation in differentiable views. (#48896)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48896

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D25380895

Pulled By: ailzhang

fbshipit-source-id: 4d565e6312e860a2ff185a3f8b552005ddd29695
2020-12-10 17:39:40 -08:00
5e8cfec332 Add a newline before dependency graph output (#49127)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49127

Small change, but useful: it means that double-clicking the line lets
you copy the url easily

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D25450408

Pulled By: suo

fbshipit-source-id: 8b13b971b444187a8de59c89cc8f60206035b2ad
2020-12-10 17:03:23 -08:00
57145c910f Revert D24711613: [pytorch][PR] Preserve submodule with __set_state__ in freezing
Test Plan: revert-hammer

Differential Revision:
D24711613 (a3e1bd1fb9)

Original commit changeset: 22e51417454a

fbshipit-source-id: c2090b15fdba2d6c9dc1fbd987d32229dd898608
2020-12-10 16:26:38 -08:00
80f7510d92 [FX] Fix create_arg for NamedTuple (#48986)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48986

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D25387156

Pulled By: jamesr66a

fbshipit-source-id: 0d38c43e02088fb7afb671683c88b6e463fe7c76
2020-12-10 15:32:04 -08:00
69522410fa add user vs internal msg support in common_utils.TestCase (#48935)
Summary:
Should fix https://github.com/pytorch/pytorch/issues/48879.

To test the effect of the messages, make a test break, e.g., by adding `self.assertEqual(1, 2, "user_msg")` to any test:
* Before:
```
AssertionError: False is not true : user_msg
```
* After
```
AssertionError: False is not true : Scalars failed to compare as equal! Comparing 1 and 2 gives a difference of 1, but the allowed difference with rtol=0 and atol=0 is only 0!
user_msg;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48935

Reviewed By: samestep

Differential Revision: D25382153

Pulled By: walterddr

fbshipit-source-id: 95633a9f664f4b05a28801786b12a10bd21ff431
2020-12-10 15:25:46 -08:00
84fce6d29a [AARCH64] Fix HAS_VST1 check if compiled by clang (#49182)
Summary:
Use the `UL` suffix, supported by all C99-compatible compilers, instead of `__AARCH64_UINT64_C`, which is a gcc-specific extension.

Before this change, the check would have failed even with a bug-free clang compiler, with the following errors:
```
$ clang has_vst1.c
has_vst1.c:5:41: warning: implicit declaration of function '__AARCH64_UINT64_C' is invalid in C99 [-Wimplicit-function-declaration]
  v.val[0] = vcombine_f32 (vcreate_f32 (__AARCH64_UINT64_C (0)), vcreate_f32 (__AARCH64_UINT64_C (0)));
                                        ^
has_vst1.c:5:79: warning: implicit declaration of function '__AARCH64_UINT64_C' is invalid in C99 [-Wimplicit-function-declaration]
  v.val[0] = vcombine_f32 (vcreate_f32 (__AARCH64_UINT64_C (0)), vcreate_f32 (__AARCH64_UINT64_C (0)));
                                                                              ^
has_vst1.c:6:41: warning: implicit declaration of function '__AARCH64_UINT64_C' is invalid in C99 [-Wimplicit-function-declaration]
  v.val[1] = vcombine_f32 (vcreate_f32 (__AARCH64_UINT64_C (0)), vcreate_f32 (__AARCH64_UINT64_C (0)));
                                        ^
has_vst1.c:6:79: warning: implicit declaration of function '__AARCH64_UINT64_C' is invalid in C99 [-Wimplicit-function-declaration]
  v.val[1] = vcombine_f32 (vcreate_f32 (__AARCH64_UINT64_C (0)), vcreate_f32 (__AARCH64_UINT64_C (0)));
                                                                              ^
4 warnings generated.
/tmp/has_vst1-b1e162.o: In function `main':
has_vst1.c:(.text+0x30): undefined reference to `__AARCH64_UINT64_C'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49182

Reviewed By: walterddr

Differential Revision: D25471994

Pulled By: malfet

fbshipit-source-id: 0129a6f7aabc46aa117ef719d3a211449cb410f1
2020-12-10 15:19:12 -08:00
f4226b5c90 [static runtime] add static subgraph fusion pass (#49185)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49185

This diff adds a fusion feature that will let us use static runtime for *parts* of the graph.  This will prove useful in cases where fully eliminating control flow is hard etc.

TODO:
[x] factor out into separate fusion file
[x] add python test case
[x] add graph that isn't fully lowered test case
[x] add graph that has weird list/tuple outputs test case

the loop example looks quite good:
```
graph(%a.1 : Tensor,
      %b.1 : Tensor,
      %iters.1 : int):
  %12 : bool = prim::Constant[value=1]() # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:110:4
  %c.2 : Tensor = prim::StaticSubgraph_0(%a.1, %b.1)
  %c : Tensor = prim::Loop(%iters.1, %12, %c.2) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:110:4
    block0(%i : int, %c.12 : Tensor):
      %c.10 : Tensor = prim::StaticSubgraph_1(%a.1, %c.12, %b.1)
      -> (%12, %c.10)
  return (%c)
with prim::StaticSubgraph_0 = graph(%0 : Tensor,
      %4 : Tensor):
  %5 : int = prim::Constant[value=2]()
  %6 : Tensor = aten::mul(%4, %5) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:109:12
  %2 : int = prim::Constant[value=1]()
  %c.2 : Tensor = aten::add(%0, %6, %2) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:109:8
  return (%c.2)
with prim::StaticSubgraph_1 = graph(%1 : Tensor,
      %7 : Tensor,
      %8 : Tensor):
  %9 : int = prim::Constant[value=1]()
  %c.4 : Tensor = aten::add(%7, %8, %9) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:111:12
  %5 : int = prim::Constant[value=2]()
  %c.7 : Tensor = aten::mul_(%c.4, %5) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:112:8
  %2 : int = prim::Constant[value=1]()
  %c.10 : Tensor = aten::sub_(%c.7, %1, %2) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:113:8
  return (%c.10)
```

(Note: this ignores all push blocking failures!)

Test Plan:
buck test mode/no-gpu //caffe2/benchmarks/static_runtime:static_runtime_cpptest

buck test mode/no-gpu caffe2/test:static_runtime

Reviewed By: bertmaher

Differential Revision: D25385702

fbshipit-source-id: 2f24af4f11d92a959167facd03fbd24f464a6098
2020-12-10 14:03:11 -08:00
95a1725a4a Vsx initial support issue27678 (#41541)
Summary:
### Pytorch Vec256 ppc64le support
implemented types:

- double
- float
- int16
- int32
- int64
- qint32
- qint8
- quint8
- complex_float
- complex_double

Notes:
All basic vector operations are implemented. There are a few problems:
- minimum/maximum NaN propagation for ppc64le is missing and was not checked
- complex multiplication, division, sqrt, and abs are implemented as in PyTorch x86; they can overflow and have worse precision than the std ones, which is why they were either excluded or tested in a smaller domain range
- precision of the implemented float math functions

~~Besides, I added CPU_CAPABILITY for power, but because of quantization errors for DEFAULT I had to undef and use vsx for DEFAULT too~~

#### Details
##### Supported math functions

- `+` means vectorized; `-` means missing (implementation notes are added inside parentheses). Example: `-(both)` means it was also missing on the x86 side
- `f(func_name)` means the vectorization uses func_name
- `sleef` means the call is redirected to the Sleef library
- `unsupported` means the operation is not supported for that type

function_name | float | double | complex float | complex double
|-- | -- | -- | -- | --|
acos | sleef | sleef | f(asin) | f(asin)
asin | sleef | sleef | +(pytorch impl) | +(pytorch impl)
atan | sleef | sleef | f(log) | f(log)
atan2 | sleef | sleef | unsupported | unsupported
cos | +((ppc64le:avx_mathfun) ) | sleef | -(both) | -(both)
cosh | f(exp)   | -(both) | -(both) |
erf | sleef | sleef | unsupported | unsupported
erfc | sleef | sleef | unsupported | unsupported
erfinv | - (both) | - (both) | unsupported | unsupported
exp | + | sleef | - (x86:f()) | - (x86:f())
expm1 | f(exp)  | sleef | unsupported | unsupported
lgamma | sleef | sleef |   |
log | +  | sleef | -(both) | -(both)
log10 | f(log)  | sleef | f(log) | f(log)
log1p | f(log)  | sleef | unsupported | unsupported
log2 | f(log)  | sleef | f(log) | f(log)
pow | + f(exp)  | sleef | -(both) | -(both)
sin | +((ppc64le:avx_mathfun) ) | sleef | -(both) | -(both)
sinh | f(exp)  | sleef | -(both) | -(both)
tan | sleef | sleef | -(both) | -(both)
tanh | f(exp)  | sleef | -(both) | -(both)
hypot | sleef | sleef | -(both) | -(both)
nextafter | sleef  | sleef | -(both) | -(both)
fmod | sleef | sleef | -(both) | -(both)

[Vec256 test cases PR #42685](https://github.com/pytorch/pytorch/pull/42685)
Current list:

- [x] Blends
- [x] Memory: UnAlignedLoadStore
- [x] Arithmetics: Plus,Minu,Multiplication,Division
- [x] Bitwise: BitAnd, BitOr, BitXor
- [x] Comparison: Equal, NotEqual, Greater, Less, GreaterEqual, LessEqual
- [x] MinMax: Minimum, Maximum, ClampMin, ClampMax, Clamp
- [x] SignManipulation: Absolute, Negate
- [x] Interleave: Interleave, DeInterleave
- [x] Rounding: Round, Ceil, Floor, Trunc
- [x] Mask: ZeroMask
- [x] SqrtAndReciprocal: Sqrt, RSqrt, Reciprocal
- [x] Trigonometric: Sin, Cos, Tan
- [x] Hyperbolic: Tanh, Sinh, Cosh
- [x] InverseTrigonometric: Asin, ACos, ATan, ATan2
- [x] Logarithm: Log, Log2, Log10, Log1p
- [x] Exponents: Exp, Expm1
- [x] ErrorFunctions: Erf, Erfc, Erfinv
- [x] Pow: Pow
- [x] LGamma: LGamma
- [x] Quantization: quantize, dequantize, requantize_from_int
- [x] Quantization: widening_subtract, relu, relu6
Missing:
- [ ] Constructors, initializations
- [ ] Conversion , Cast
- [ ] Additional: imag, conj, angle (note: imag and conj only checked for float complex)

#### Notes on tests and testing framework
- some math functions are tested within a domain range
- the testing framework mostly tests randomly against the std implementation, within the domain (or within the implementation domain for some math functions)
- some functions are tested against the local version. ~~For example, std::round and the vector version of round differ, so it was tested against the local version~~
- round was tested against PyTorch's at::native::round_impl. ~~For the double type on **VSX, vec_round failed for (even)+0.5 values**~~. This was solved by using vec_rint
- ~~**complex types are not tested**~~ **After enabling complex testing, due to precision and domain issues some of the complex functions failed for VSX and x86 AVX as well. I will either test them against the local implementation or check within the accepted domain**
- ~~quantizations are not tested~~ Added tests for the quantize, dequantize, requantize_from_int, relu, relu6, and widening_subtract functions
- the testing framework should be improved further
- ~~For now `-DBUILD_MOBILE_TEST=ON` will be used for Vec256Test too~~
Vec256 Test cases will be built for each CPU_CAPABILITY

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41541

Reviewed By: zhangguanheng66

Differential Revision: D23922049

Pulled By: VitalyFedyunin

fbshipit-source-id: bca25110afccecbb362cea57c705f3ce02f26098
2020-12-10 13:42:39 -08:00
a3e1bd1fb9 Preserve submodule with __set_state__ in freezing (#47308)
Summary:
This PR does the following:

-  fails freezing if the input module has a `__set_state__` method
-  preserves the attributes of submodules with a `__set_state__` method

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47308

Reviewed By: eellison

Differential Revision: D24711613

Pulled By: bzinodev

fbshipit-source-id: 22e51417454aaf85cc0ae4acb2dc7fc822f149a2
2020-12-10 13:36:34 -08:00
a480ca5302 [JIT] Use is_buffer in BufferPolicy::valid (#49053)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49053

**Summary**
`BufferPolicy::valid` uses `!typ->is_parameter(i)` to check if an
attribute is a buffer or not; it should use `type->is_buffer(i)` instead.

**Test Plan**
It is difficult to write an additional test that would have failed before this
commit because the two booleans `is_parameter` and `is_buffer` are never set
to `true` at the same time.

**Fixes**
This commit fixes #48746.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D25434956

Pulled By: SplitInfinity

fbshipit-source-id: ff2229058abbafed0b67d7b26254d406e5f7b074
2020-12-10 13:10:51 -08:00
c892c3ac9a remove hacky_wrapper from BackendSelect (#49079)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49079

BackendSelect kernels have been changed to be written the new way, so this hacky_wrapper here isn't needed anymore.
This PR is not expected to change perf; it just simplifies the code a bit. The hacky_wrapper here was a no-op that did not create any actual wrappers,
because it short-circuits and skips creating a wrapper when none is needed.
ghstack-source-id: 118318436

Test Plan: waitforsandcastle

Reviewed By: bhosmer

Differential Revision: D25421633

fbshipit-source-id: 7a6125613f465dabed155dd892c8be6af5c617cf
2020-12-10 12:54:29 -08:00
21dba8c1ad Make aten::div.out c10-full (#47793)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47793

This migrates aten::div.out to be c10-full (without hacky wrapper) and fixes everything that needed to be fixed to make it work.
This is a prerequisite step to making out ops c10-full. Diffs stacked on top of this will introduce a hacky_wrapper for out ops and use it to make more ops c10-full.
ghstack-source-id: 118318433

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D24901944

fbshipit-source-id: e477cb41675e477808c76af01706508beee44752
2020-12-10 12:52:50 -08:00
e1c1a7e964 [ONNX] Changes to export API to better handle named arguments (#47367)
Summary:
The `args` parameter of ONNX export is changed to better support optional arguments. `args` is now represented as:
args (tuple of arguments or torch.Tensor, optionally ending with a dictionary of named arguments):
            the dictionary specifies the input for each named parameter:
            - KEY: str, the named parameter
            - VALUE: the corresponding input
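
A sketch of this convention (the model and names are illustrative): positional inputs go in the tuple, and named inputs go in a trailing dictionary.

```python
import torch

class M(torch.nn.Module):
    def forward(self, x, y=None):
        return x if y is None else x + y

m = M()
x = torch.randn(2)
# the trailing dict supplies the named argument `y`
torch.onnx.export(m, (x, {"y": torch.randn(2)}), "m.onnx")
```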

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47367

Reviewed By: H-Huang

Differential Revision: D25432691

Pulled By: bzinodev

fbshipit-source-id: 9d4cba73cbf7bef256351f181f9ac5434b77eee8
2020-12-10 12:31:00 -08:00
0c70585505 fix #49064 (invalid escape) by using raw strings (#49065)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49064 by using raw strings

I removed `# noqa: W605` because that's the "invalid escape sequence" check: https://www.flake8rules.com/rules/W605.html

I wrote a quick test to make sure the strings are the same before and after this PR. This block should print `True` (it does for me).

```
convolution_notes1 = \
    {"groups_note": r"""* :attr:`groups` controls the connections between inputs and outputs.
      :attr:`in_channels` and :attr:`out_channels` must both be divisible by
      :attr:`groups`. For example,

        * At groups=1, all inputs are convolved to all outputs.
        * At groups=2, the operation becomes equivalent to having two conv
          layers side by side, each seeing half the input channels
          and producing half the output channels, and both subsequently
          concatenated.
        * At groups= :attr:`in_channels`, each input channel is convolved with
          its own set of filters (of size
          :math:`\frac{\text{out\_channels}}{\text{in\_channels}}`).""",

        "depthwise_separable_note": r"""When `groups == in_channels` and `out_channels == K * in_channels`,
        where `K` is a positive integer, this operation is also known as a "depthwise convolution".

        In other words, for an input of size :math:`(N, C_{in}, L_{in})`,
        a depthwise convolution with a depthwise multiplier `K` can be performed with the arguments
        :math:`(C_\text{in}=C_\text{in}, C_\text{out}=C_\text{in} \times \text{K}, ..., \text{groups}=C_\text{in})`."""}  # noqa: B950

convolution_notes2 = \
    {"groups_note": """* :attr:`groups` controls the connections between inputs and outputs.
      :attr:`in_channels` and :attr:`out_channels` must both be divisible by
      :attr:`groups`. For example,

        * At groups=1, all inputs are convolved to all outputs.
        * At groups=2, the operation becomes equivalent to having two conv
          layers side by side, each seeing half the input channels
          and producing half the output channels, and both subsequently
          concatenated.
        * At groups= :attr:`in_channels`, each input channel is convolved with
          its own set of filters (of size
          :math:`\\frac{\\text{out\_channels}}{\\text{in\_channels}}`).""",  # noqa: W605

        "depthwise_separable_note": """When `groups == in_channels` and `out_channels == K * in_channels`,
        where `K` is a positive integer, this operation is also known as a "depthwise convolution".

        In other words, for an input of size :math:`(N, C_{in}, L_{in})`,
        a depthwise convolution with a depthwise multiplier `K` can be performed with the arguments
        :math:`(C_\\text{in}=C_\\text{in}, C_\\text{out}=C_\\text{in} \\times \\text{K}, ..., \\text{groups}=C_\\text{in})`."""}  # noqa: W605,B950

print(convolution_notes1 == convolution_notes2)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49065

Reviewed By: agolynski

Differential Revision: D25464507

Pulled By: H-Huang

fbshipit-source-id: 88a65a24e3cc29774af25e09823257b2136550fe
2020-12-10 12:22:49 -08:00
3b57be176e [NNC] Preserve strided output (#48264)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48264

Preserves the strided representation of NNC Tensor outputs by transforming them into the right layout at the end of the kernel.

Fix for https://github.com/pytorch/pytorch/issues/45604

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D25286213

Pulled By: eellison

fbshipit-source-id: 64d94ac463741e2568a1c9d44174e15ea26e511f
2020-12-10 12:19:51 -08:00
0b9d5e65e4 Remove inferred from tensor type ctors (#48263)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48263

The inferred type is only used once in `getInferred` and is confusing next to the other parameters. It has nothing to do with runtime values, it just means the type was inferred in type-checking. There are a bunch of parameters and overloads of Tensor instantiation as is.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D25286211

Pulled By: eellison

fbshipit-source-id: 3dfc44ab7ff4fbf0ef286ae8716a4afac646804b
2020-12-10 12:19:49 -08:00
71ddc0ba19 [TensorExpr Fuser] Add support for nodes which have tensor constant inputs (#47814)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47814

Previously, we would bail completely if a node had a constant tensor input. This PR adds support for this case by lifting the constant out of the fusion graph after we've done fusion. It might be nice to add support for Tensor Constants in NNC itself, but it looked kind of tricky and this is an easy enough temporary solution.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D25286215

Pulled By: eellison

fbshipit-source-id: 9ff67f92f5a2d43fd3ca087569898666525ca8cf
2020-12-10 12:19:47 -08:00
413caa7fd2 [NNC] Compute Tensor Output Properties in ininitialization (#47813)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47813

We have some code paths that at kernel invocation seem to handle dynamic sizes, but I'm not sure how well they work, because other parts of our code base assume that tensor shapes are always fully specified. https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/kernel.cpp#L1572

As with some other PRs in the stack, I think it would be good to remove the features that aren't on/actively being worked on while they are not used.

I initially did this PR to try to speed up perf. I couldn't observe much of a speedup, so we can decide to keep or drop this PR.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D25286212

Pulled By: eellison

fbshipit-source-id: 4ae66e0af88d649dd4e592bc78686538c2fdbaeb
2020-12-10 12:19:45 -08:00
0e666a9f5a [TensorExpr] Cache use of fallback in kernel invocation (#47812)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47812

Previously we were checking the environment every kernel invocation for `tensorExprFuserEnabled`, which checks the environment for `PYTORCH_TENSOREXPR`. This is only a dev-exposed API, so I think it is fine to only check once when the kernel is initialized. The `disable_optimization` flag which is user-exposed more or less covers the same functionality.

For fun, some benchmarking. I compared scripted before and after of
```
def foo(x, y):
    return x + y
```
for x, y = torch.tensor([1]). I also removed the prim::TypeCheck node to better
isolate the kernel (I cheated). Here is gist: https://gist.github.com/eellison/39f3bc368f5bd1f25ded4827feecd15e

Without Changes Run 1:
no fusion: sum 6.416894399004377 min: 0.6101883250012179 median 0.6412974080012646
with fusion: sum 6.437897570998757 min: 0.6350401220006461 median 0.6446951820034883

Without Changes Run2:
no fusion: sum 6.601341788002173 min: 0.6292048720024468 median 0.6642187059987918
with fusion: sum 6.734651455997664 min: 0.6365462899993872 median 0.6755226659988693

With Changes Run1:
no fusion: sum 6.097717430002376 min: 0.5977709550024883 median 0.613631643998815
with fusion: sum 6.1299369639964425 min: 0.5857932209983119 median 0.6159247440009494

With Changes Run2:
no fusion: sum 6.5672018059995025 min: 0.6245676209982776 median 0.6386050750006689
with fusion: sum 6.489086147994385 min: 0.6236886289989343 median 0.6535737619997235

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D25286210

fbshipit-source-id: a18b4918a7f7bed8a39112ae04b678e79026d39b
2020-12-10 12:19:42 -08:00
70853c5021 Dont use symbolic shapes check (#47810)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47810

`bindSymbolicShapes` wasn't checking device or dtype at all, so it wasn't correct. It also isn't being used anywhere (num_profiles is always 1 and we don't use symbolic shapes). We shouldn't have it on until we are actually using symbolic shapes.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D25286214

Pulled By: eellison

fbshipit-source-id: 10fb175d0c75bd0159fb63aafc3b59cc5fd6c5af
2020-12-10 12:14:58 -08:00
18c03b9f00 make duplicate def() calls an error in the dispatcher (#48098)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48098

Test Plan:
Imported from OSS
***
make duplicate def() calls an error in the dispatcher. Updating all fb operators to use the new dispatcher registration API

Reviewed By: ezyang

Differential Revision: D25056089

Pulled By: bdhirsh

fbshipit-source-id: 8d7e381f16498a69cd20e6955d69acdc9a1d2791
2020-12-10 11:38:52 -08:00
2519348f60 [Binary Push] Update the awscli installation, use conda install rather than brew install (#49175)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49175

As title
ghstack-source-id: 118306312

Test Plan: CI

Reviewed By: xta0

Differential Revision: D25466577

fbshipit-source-id: 67a521947db3744695f0ab5f421483ab96d8ed9f
2020-12-10 11:10:51 -08:00
edbf9263ad [iOS] Bump up the cocoapods version (#49176)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49176

Bump up the cocoapods version
ghstack-source-id: 118305636

Test Plan: CI

Reviewed By: xta0

Differential Revision: D25466321

fbshipit-source-id: 916adc514c5edc8971445da893362a160cfc092b
2020-12-10 11:07:49 -08:00
909a9060e9 [vmap] implement batching rule for fill_ and zero_ (#48516)
Summary:
Fix https://github.com/pytorch/pytorch/issues/47755

- This PR implements batching rules for the in-place operators `fill_` and `zero_` (see the sketch after this list).
- Testcases are added to `test/test_vmap.py`.
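
A sketch of what the new rules enable (`torch.vmap` was a prototype API at this point; the intended behavior is per-example application of the in-place op):

```python
import torch

x = torch.randn(3, 5)
torch.vmap(torch.Tensor.zero_)(x)   # zero_ applied per batch element
print(x.abs().sum())                # tensor(0.)
```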

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48516

Reviewed By: H-Huang

Differential Revision: D25431557

Pulled By: zou3519

fbshipit-source-id: 437b0534dc0b818fbe05f7fcfcb649aa677483dc
2020-12-10 10:59:05 -08:00
840e71f4e6 Check CUDA kernel launches (/fbcode/caffe2/) (#49145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49105

(1) Add a safety check `C10_CUDA_KERNEL_LAUNCH_CHECK()` after each kernel launch. This diff only changes files inside the directories /fbsource/fbcode/caffe2/modules/, /fbsource/fbcode/caffe2/fb/, and /fbsource/fbcode/caffe2/test/.

(2) Get rid of the old check `AT_CUDA_CHECK(cudaGetLastError())` where necessary.

Test Plan:
Test build:
```
buck build mode/dev-nosan //caffe2/modules/detectron:
buck test mode/dev-nosan //caffe2/modules/detectron:
buck build mode/dev-nosan //caffe2/torch/fb/:
buck test mode/dev-nosan //caffe2/torch/fb/:
```

To check for launches without checks:
```
python3 caffe2/torch/testing/check_kernel_launches.py
```
Make sure none of the updated files are in the returned list.

Reviewed By: r-barnes

Differential Revision: D25452852

fbshipit-source-id: d6657edab612c9e0fa99b29c68460be8b1a20064
2020-12-10 10:43:03 -08:00
524adfbffd Use new FFT operators in stft (#47601)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47601

Fixes https://github.com/pytorch/pytorch/issues/42175#issuecomment-719933913

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25457217

Pulled By: mruberry

fbshipit-source-id: 455d216edd0b962eb7967ecb47cccc8d6865975b
2020-12-10 10:31:50 -08:00
54f0556ee4 Add missing complex support for torch.norm and torch.linalg.norm (#48284)
Summary:
**BC-breaking note:**

Previously, when given a complex input, `torch.linalg.norm` and `torch.norm` would return a complex output. `torch.linalg.cond` would sometimes return a complex output and sometimes return a real output when given a complex input, depending on its `p` argument. This PR changes this behavior to match `numpy.linalg.norm` and `numpy.linalg.cond`, so that a complex input will result in the downgraded real number type, consistent with NumPy.
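
A quick illustration of the new behavior:

```python
import torch

z = torch.randn(4, dtype=torch.cfloat)
n = torch.linalg.norm(z)   # vector 2-norm of a complex input
print(n.dtype)             # torch.float32: real output, matching NumPy
```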

**PR Summary:**

The following cases were previously unsupported for complex inputs, and this commit adds support:

- Frobenius norm
- Norm order 2 (vector and matrix)
- CUDA vector norm

Part of https://github.com/pytorch/pytorch/issues/47833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48284

Reviewed By: H-Huang

Differential Revision: D25420880

Pulled By: mruberry

fbshipit-source-id: 11f6a2f3cad57d66476d30921c3f6ab8f3cd4017
2020-12-10 10:23:45 -08:00
25a8397bf3 add additional interpolation modes for torch.quantile (#48711)
Summary:
Fix https://github.com/pytorch/pytorch/issues/48523
Related  https://github.com/pytorch/pytorch/issues/38349

**BC-breaking Note:**

This PR updates PyTorch's quantile function to add the additional interpolation methods `lower`, `higher`, `nearest`, and `midpoint`, all of which NumPy already supports.

New parameter `interpolation` is added to the signature for both `torch.quantile` and `torch.nanquantile` functions.

- `quantile(input, q, dim=None, interpolation='linear', keepdim=False, *, out=None) -> Tensor`
- `nanquantile(input, q, dim=None, interpolation='linear', keepdim=False, *, out=None) -> Tensor`

Function signatures followed the NumPy-like style for the moment, keeping `out` at the end to be consistent with PyTorch.
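
A usage sketch following the signatures above:

```python
import torch

x = torch.arange(10.0)
print(torch.quantile(x, 0.4, interpolation="nearest"))
print(torch.nanquantile(x, 0.4, interpolation="midpoint"))
```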

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48711

Reviewed By: H-Huang

Differential Revision: D25428587

Pulled By: heitorschueroff

fbshipit-source-id: e98d24f6a651d302eb94f4ff4da18e38bdbf0124
2020-12-10 10:10:51 -08:00
45473ffe23 Refactor cudnn convolution (#49109)
Summary:
The cuDNN v7 API has been deprecated, so we need to migrate to the cuDNN v8 API. The v8 API does not exist on cuDNN 7, so both APIs will have to coexist for a long time.

This is step 0 of adding cuDNN v8 API. There is no real code change in this PR. It just copy-pastes existing code. The original `Conv.cpp` is split into `ConvPlaceholders.cpp`, `ConvShared.cpp`, `ConvShared.h`, `Conv_v7.cpp`, `Conv_v8.cpp`. Currently `Conv_v8.cpp` is empty, and will be filled in the future.

The `ConvPlaceholders.cpp` contains placeholder implementation of cudnn convolution when cudnn is not enabled. These operators only raise errors and do no real computation. This file also contains deprecated operators. These operators are implemented using current operators.

The `ConvShared.cpp` and `ConvShared.h` contains code that will be shared by the v7 and v8 API, these include the definition of struct `ConvolutionParams` and `ConvolutionArgs`. As well as ATen exposed API like `cudnn_convolution` and intermediate `cudnn_convolution_forward`. These exposed functions will call raw API like `raw_cudnn_convolution_forward_out` in `Conv_v7.cpp` or `Conv_v8.cpp` for the real implementation.

The `Conv_v7.cpp`, `Conv_v8.cpp` contains the implementation of raw APIs, and are different for v7 and v8.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49109

Reviewed By: H-Huang

Differential Revision: D25463783

Pulled By: ezyang

fbshipit-source-id: 1c80de8e5d94d97a61e45687f6193e8ff5481e3e
2020-12-10 10:06:12 -08:00
d5c4a80cfd Allow ROCm CI to use non-default stream. (#48424)
Summary:
Revert https://github.com/pytorch/pytorch/issues/26394. Fixes https://github.com/pytorch/pytorch/issues/27356.  Not all MIOpen handles were setting their stream to the current stream prior to running the op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48424

Reviewed By: H-Huang

Differential Revision: D25420384

Pulled By: mruberry

fbshipit-source-id: 051683ba9e3d264b71162bd344031a0c58bf6a41
2020-12-10 09:55:11 -08:00
195b92bfa6 Revert D25441716: [te] Add BitCast to the IR
Test Plan: revert-hammer

Differential Revision:
D25441716 (3384145418)

Original commit changeset: c97b871697bc

fbshipit-source-id: e6eff02e28e1ae8c826dd2cfed79f869839ed2ba
2020-12-10 09:31:35 -08:00
3384145418 [te] Add BitCast to the IR
Summary: Adds BitCasting to NNC.  This will enable fast approximation algorithms implemented directly in TensorExpressions

Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr

Reviewed By: bertmaher

Differential Revision: D25441716

fbshipit-source-id: c97b871697bc5931d09cda4a9cb0a81bb420f4e2
2020-12-10 09:25:46 -08:00
21c04b4438 make AT_FFTW_ENABLED available to fb internal
Summary: Follow-up to D25375320 (b89c328493).

Test Plan: buck build

Reviewed By: samestep

Differential Revision: D25410973

fbshipit-source-id: 6c2627951a98d270d341b33538431644d03bed16
2020-12-10 07:36:35 -08:00
33bc7918e8 fix some comments in accelerator_partitioner.py (#49104)
Summary:
Fix some comments in accelerator_partitioner.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49104

Reviewed By: gcatron

Differential Revision: D25434999

Pulled By: scottxu0730

fbshipit-source-id: ce83b411cf959aabec119532ad42a892a2223286
2020-12-10 07:06:05 -08:00
c7b8f3e2cd Decouple direct access to native::scalar_tensor from TensorIndexing.h (#48761)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48761

Targeting one of the items in https://github.com/pytorch/pytorch/issues/48684. For performance purposes we don't use at::scalar_tensor. Since scalar_tensor_static is available for CPU, we can use it at least there. One uncertainty is CUDA performance, but since there's no fast path for CUDA under native::scalar_tensor either, I assume perf on CUDA is not affected.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25410975

Pulled By: iseeyuan

fbshipit-source-id: 160d21ffeefc9a2e8f00a55043144eebcada2aac
2020-12-10 05:34:35 -08:00
2255e68da8 Revert D25433268: [PyTorch Mobile] Preserve bundled input related methods when calling optimize_for_mobile
Test Plan: revert-hammer

Differential Revision:
D25433268 (95233870f2)

Original commit changeset: 0bf9b4afe64b

fbshipit-source-id: bba97e48ce0e72f9d1db5159065bb6495d62666c
2020-12-10 04:39:30 -08:00
b5a7e25059 Cache the DataPtrs in CUDAFuture (#48788)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48788

CUDAFuture needs to inspect the value it contains in order to first determine what devices its tensors reside on (so that it can record events on those devices), and then to record these tensors with the caching allocator when they are used in other streams. Extracting data ptrs can become somewhat expensive (especially if we resort to using the pickler to do that), hence it's probably a good idea to cache the result the first time we compute it.
ghstack-source-id: 118180023

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25303486

fbshipit-source-id: 5c541640f6d19249dfb5489ba5e8fad2502836fb
2020-12-10 03:54:29 -08:00
030fa6cfba Split out reusable CUDAFuture from FutureNCCL (#48506)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48506

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

FutureNCCL is now a general-purpose type-agnostic multi-device class, so in this commit I extract it from ProcessGroupNCCL to make it available for wider use (notably by the RPC module). We'll call this new class CUDAFuture. We'll keep FutureNCCL as a subclass of CUDAFuture to deal with some NCCL peculiarity, namely the fact that the future becomes complete immediately upon creation. We can clean this up for good once we're done merging Future and Work.

I'm not exactly sure where to put CUDAFuture. It needs to be available to both c10d and RPC (which lives under torch/csrc). If I figured CMake out correctly (and that's a big if), I think c10d can only depend on ATen (I'll maybe add a comment with how I tracked that down). Hence we cannot put CUDAFuture in torch/csrc. On the other hand, RPC currently depends on c10d, because RPC agents use ProcessGroups internally, so it would be "ok" to put CUDAFuture in c10d. However, we want to get rid of ProcessGroups in RPC, and at that point RPC should in principle not depend on c10d. In that case, the only shared dep between the two that I see is ATen itself.

While I'm a bit wary of putting it right in ATen, I think it might actually make sense. CUDAFuture is intended to be a general-purpose component that can be reused in all settings and is not particularly tied to c10d or RPC. Moreover, ATen already contains ivalue::Future, and it contains a lot of CUDA helpers, so CUDAFuture definitely belongs to the "closure" of what's already there.
ghstack-source-id: 118180030

Test Plan: Unit tests?

Reviewed By: wanchaol

Differential Revision: D25180532

fbshipit-source-id: 697f655240dbdd3be22a568d5102ab27691f86d4
2020-12-10 03:54:26 -08:00
4c425e8da0 Merge common parts of FutureNCCL into at::ivalue::Future (#48505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48505

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

FutureNCCL isn't just adding CUDA support to ivalue::Future, it's also reimplementing a lot of the latter's logic (by overriding plenty of its methods). That's brittle: whenever a new method is added to ivalue::Future, there's a risk of forgetting to add it to FutureNCCL, in which case calling that method on FutureNCCL would defer to the base class and give inconsistent results (e.g., the future not being completed when it actually is). This _is already happening_, for example with waitAndThrow or hasError, which are not implemented by FutureNCCL. In addition, this creates duplication between the two classes, which could lead to inconsistencies of behavior, bugs, missing features, and so on.

The best solution would be to keep the core future logic in ivalue::Future, and have _only_ the CUDA additions in FutureNCCL. That's what we're going to do, in two steps. In the previous commit, I split the CUDA features into separate hooks, which are called by FutureNCCL's other methods. In this commit, I'm removing these latter methods, and invoke the hooks directly from ivalue::Future.
ghstack-source-id: 118180032

Test Plan: Unit tests

Reviewed By: wanchaol

Differential Revision: D25180535

fbshipit-source-id: 19181fe133152044eb677062a9e31e5e4ad3c03c
2020-12-10 03:54:22 -08:00
9078088edb Split FutureNCCL's CUDA-specific parts from generic future logic (#48504)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48504

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

FutureNCCL isn't just adding CUDA support to ivalue::Future, it's also reimplementing a lot of the latter's logic (by overriding plenty of its methods). That's brittle: whenever a new method is added to ivalue::Future, there's a risk of forgetting to add it to FutureNCCL, in which case calling that method on FutureNCCL would defer to the base class and give inconsistent results (e.g., the future not being completed when it actually is). This _is already happening_, for example with waitAndThrow or hasError, which are not implemented by FutureNCCL. In addition, this creates duplication between the two classes, which could lead to inconsistencies of behavior, bugs, missing features, and so on.

The best solution would be to keep the core future logic in ivalue::Future, and have _only_ the CUDA additions in FutureNCCL. That's what we're going to do, in two steps. In this commit, I'll split the CUDA features into separate hooks, which are called by FutureNCCL's other methods. In the next commit, I'll remove these latter methods, and invoke the hooks directly from ivalue::Future.
ghstack-source-id: 118180025

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25180534

fbshipit-source-id: 7b3cd374aee78f6c07104daec793c4d248404c61
2020-12-10 03:54:19 -08:00
a6778989d1 Support wider range of types in FutureNCCL (#48502)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48502

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

FutureNCCL restricted values to tensors, (singleton) lists of tensors, or Python objects that could be converted to either of those types. We need a CUDA future that can handle more generic types, though.

The main challenge is extracting all DataPtrs from an arbitrary object. I think I found some ways of doing so, but I'd like some JIT experts to look into this and tell me if there are better ways. I'll add inline comments for where their input would be appreciated.
ghstack-source-id: 118180026

Test Plan: Unit tests (I should probably add new ones)

Reviewed By: wanchaol

Differential Revision: D25177562

fbshipit-source-id: 1ef18e67bf44543c70abb4ca152f1610dea4e533
2020-12-10 03:54:15 -08:00
9fe3ac3650 Don't store device indices separately on FutureNCCL (#48501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48501

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

FutureNCCL stores a set of devices (on which the tensors in the data reside) and a CUDA event for each of those devices. In fact, each event instance already contains the device it belongs to, which means we can avoid storing that information separately (and avoid the risk of it becoming mismatched and/or inaccurate).
ghstack-source-id: 118180024

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25177554

fbshipit-source-id: 64667c176efc2a7dafe99457a1fbba5d142cb06c
2020-12-10 03:54:12 -08:00
e294c2d841 Add multi-GPU support to FutureNCCL (#48500)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48500

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

After the previous changes, this is now much simpler than it sounds. For the most part it just consists of repeating some operations multiple times, once per device (e.g., recording and blocking on events). Funnily, we already had a vector of events, even though we only ever stored one element in it (this probably comes from the fact that it is shared with WorkNCCL, which can hold more than one event). Here, we now also store a vector of device indices.

Perhaps the only non-trivial part is that, for "follow-up" futures (for callbacks), we can't know in advance which device the result will be on, so we must determine it dynamically when we receive the result, by inspecting it. That's also easier than it sounds, because we already have a dataptr extractor.
ghstack-source-id: 118180022

Test Plan: Unit tests (I should probably add new ones)

Reviewed By: mrshenli

Differential Revision: D25177556

fbshipit-source-id: 41ef39ec0dc458e341aa1564f2b9f2b573d7fa9f
2020-12-10 03:54:09 -08:00
91ad3ed831 Fix FutureNCCL not recording dataptrs with caching alloc in wait() (#48563)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48563

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

The CUDA caching allocator requires us to register all streams in which a DataPtr is used. We already do so when we invoke a callback, for which we obtain streams from the ATen pool. However, we didn't do so when the user waits for the Future and then uses the results in their current streams. This was probably fine in most cases, because the outputs of the NCCL ops (which are the tensors we're dealing with here) were user-provided, and thus already registered in some user streams, but in principle the user could use different streams when waiting than the ones they used to create the tensors. (If they use the same streams, registering becomes a no-op.) More importantly, this change will help us turn FutureNCCL into a more general-purpose class: in RPC, for example, the tensors of the result are allocated by PyTorch itself, and thus we need to record their usage on the user's streams with the caching allocator.
ghstack-source-id: 118180033

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25210338

fbshipit-source-id: e0a4ba157653b74dd84cf5665c992ccce2dea188
2020-12-10 03:54:06 -08:00
003c30ba82 Fix FutureNCCL's completed() disagreeing with wait() (#48503)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48503

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

My impression is that one property of the upstream Future class is that once .wait() returns, or once a callback is invoked, .completed() should return True. This was not the case for FutureNCCL: .wait() would return immediately, and callbacks would be invoked inline, but .completed() could return False if the CUDA async operations hadn't completed yet.

That was odd and confusing. Since there are other ways for users to check the status of CUDA operations (if they really need to, which I don't think is common), it's probably best to avoid checking the status of CUDA events in .completed().
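
A Python-level sketch of the invariant this restores, assuming the usual `torch.futures.Future` bindings:

```
import torch

fut = torch.futures.Future()
fut.set_result(42)
fut.wait()
assert fut.done()  # once wait() has returned, the future reports completed
```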
ghstack-source-id: 118180028

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25180531

fbshipit-source-id: e1207f6b91f010f278923cc5fec1190d0fcdab30
2020-12-10 03:54:02 -08:00
b91b0872a1 Record CUDA events for "follow-up" FutureNCCL inside markCompleted (#48499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48499

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

We can merge and "hide" a whole bunch of CUDA-related logic if we store and record the CUDA events that correspond to the completion of a FutureNCCL when we call markCompleted (rather than splitting it between the constructor, the `then` method, and a wrapper around the callback).

A more concrete reason for this change is that soon I'll add support for multi-device, and in that case we can't necessarily know in advance which devices a value will be on until we get that value (and we don't want to record an event on all devices as then we might "over-synchronize").

To me, this also makes more conceptual sense: the moment when we store a value on the future, which is the "signal" that the future is now ready, should also be the time at which we record the events needed to synchronize with that value. Though this may just be personal preference.
ghstack-source-id: 118180034

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25177557

fbshipit-source-id: 53d4bcdfb89fa0d11bb7b1b94db5d652edeb3b7b
2020-12-10 03:53:59 -08:00
6157f8aeb5 Use fresh stream from pool for each FutureNCCL callback (#48498)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48498

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

FutureNCCL has a dedicated CUDA stream that it sets as current when running callbacks. This stream is initialized by the ProcessGroupNCCL by extracting it from the global ATen pool.

In order to decouple FutureNCCL from that specific ProcessGroup and make it more generic, in this commit we make FutureNCCL extract a fresh stream from the ATen pool each time it needs one.

This introduces a functional change, because it removes the implicit synchronization and ordering between the callbacks of the same Future. In fact, such an ordering is hard to guarantee in the general case: for example, a user could attach a new callback just after the future becomes completed, and that callback would be run inline, immediately, out of order w.r.t. the other callbacks. (There are ways to "fix" this, but they are complicated.) NCCL got around this because its futures are already marked complete when they're returned, but in fact it could also run into issues if multiple threads were adding callbacks simultaneously.

Note that it is still possible to enforce ordering between callbacks, but one must now do so explicitly. Namely, instead of this:
```
fut.then(cb1)
fut.then(cb2)
```
one must now do:
```
fut.then(cb1).then(cb2)
```
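
As a rough Python-level illustration of the new contract (a sketch using `torch.futures.Future`, which exposes the same chaining semantics; this is not code from the commit):

```
import torch

def cb1(f):
    return f.wait() + 1

def cb2(f):
    return f.wait() * 2

fut = torch.futures.Future()
chained = fut.then(cb1).then(cb2)  # cb2 is guaranteed to run after cb1
fut.set_result(3)
print(chained.wait())  # 8, i.e. (3 + 1) * 2
```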
ghstack-source-id: 118180029

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25177559

fbshipit-source-id: 4d4e73ea7bda0ea65066548109b9ea6d5b465599
2020-12-10 03:53:56 -08:00
8fb52e7fa2 Make FutureNCCL record events in current stream (#48497)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48497

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

When we record the events that mark a "follow-up" future complete (for a callback), we used to record them onto the dedicated stream; but since that stream is the current stream at that time, we can just record them onto the current stream instead. This introduces no functional difference. The reason I'm adding this additional layer of indirection is so that the dedicated stream is only referenced inside the `addCallback` method, which will later allow us to more easily change how that stream works.
ghstack-source-id: 118180035

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25177553

fbshipit-source-id: c6373eddd34bd399df09fd4861915bf98fd50681
2020-12-10 03:53:53 -08:00
e4267eb424 Have FutureNCCL record streams w/ allocator in addCallback (#48496)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48496

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

There are two ways to add a callback to a Future: `then` and `addCallback` (with the former deferring to the latter). FutureNCCL only "patched" `then`, which caused `addCallback` to be unsupported. By patching `addCallback`, on the other hand, we cover both.

The high-level goal of this change though is to remove all CUDA-specific stuff from `then`, and move it to either `markCompleted` or to a wrapper around the callback. This will take a few more steps to achieve.
ghstack-source-id: 118180031

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25177558

fbshipit-source-id: ee0ad24eb2e56494c353db700319858ef9dcf32b
2020-12-10 03:53:50 -08:00
868a1a48c6 Add some safeguards to FutureNCCL (#48562)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48562

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

In this commit I'm adding a few asserts to the constructors of FutureNCCL to make sure that what's passed in is what we expect (fun fact: until two commits ago that wasn't the case, as we were passed some empty events).

I'm also making the second constructor private, as it's only supposed to be used by the then() method.
ghstack-source-id: 118180036

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25210333

fbshipit-source-id: d2eacf0f7de5cc763e3cdd1ae5fd521fd2eec317
2020-12-10 03:53:47 -08:00
b7f5aa9890 Remove NCCL dependency from PythonFutureWrapper (#48495)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48495

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

PythonFutureWrapper needs to provide a GIL-aware way to extract tensors from an IValue of type PyObject. Since this was only used by FutureNCCL it was guarded by #ifdef USE_C10D_NCCL. However, we will need to use it with CUDA-aware futures other than the NCCL one. This might have been achieved simply by replacing USE_C10D_NCCL with USE_CUDA, but I wanted to clean this up better.

We're dealing with two independent dimensions: C++-vs-Python and CPU-vs-CUDA. To make the code more modular, the two dimensions should be dealt with by orthogonal solutions: the user setting a custom callback to handle Python, and the subclass being CUDA-aware. Mixing these two axes makes it more complicated.

Another reason for changing how this works is that later on, when we'll introduce multi-device support, we'll need to extract dataptrs for other reasons too (rather than just recording streams with the caching allocator), namely to inspect the value to determine which devices it resides on.
ghstack-source-id: 118180038

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25177560

fbshipit-source-id: 3a424610c1ea191e8371ffee0a26d62639895884
2020-12-10 03:53:44 -08:00
7f7f0fa335 Avoid using FutureNCCL before it's ready (#48561)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48561

This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).

 ---

WorkNCCL allows extracting a FutureNCCL through getFuture(). There is one instance of this method being called by ProcessGroupNCCL itself, in order to attach a callback to it. This was happening _before_ the work was actually launched; however, FutureNCCL _always_ invokes its callbacks immediately inline. The events that the FutureNCCL was using hadn't been recorded yet, so blocking on them was a no-op. Moreover, the function being called was installed by the generic ProcessGroup superclass, which is not CUDA-aware and thus probably didn't make any use of the CUDA events or streams.

383abf1f0c/torch/lib/c10d/ProcessGroup.cpp (L66)

In short: I believe that creating a FutureNCCL and attaching a callback was equivalent to just invoking that function directly, without any CUDA-specific thing. I'm thus converting the code to do just that, in order to simplify it.

Note that, given the comment, I don't think this was the original intention of that code. It seems that the function was intended to be run once the work finished. However, I am not familiar with this code, and I don't want to introduce any functional changes.
ghstack-source-id: 118180037

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D25210337

fbshipit-source-id: 54033c814ac77641cbbe79b4d01686dfc2b45495
2020-12-10 03:48:43 -08:00
eb9516eaa4 [numpy] torch.exp{2, m1}: promote integer inputs to float (#48926)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515
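
A quick sketch of the behavior after this change (assuming the default dtype is float32):

```
import torch

x = torch.arange(4)          # integer (int64) input
print(torch.exp2(x))         # promoted to float: tensor([1., 2., 4., 8.])
print(torch.expm1(x).dtype)  # torch.float32
```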

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48926

Reviewed By: zhangguanheng66

Differential Revision: D25392344

Pulled By: mruberry

fbshipit-source-id: ddbabcfd58cc4c944153b1a224cc232efa022104
2020-12-10 00:14:22 -08:00
27f7d1c286 Port eig CPU from TH to ATen (#43215)
Summary:
Also consolidates shared logic between `eig` CPU and CUDA implementations

Fixes https://github.com/pytorch/pytorch/issues/24693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43215

Reviewed By: VitalyFedyunin, zhangguanheng66

Differential Revision: D23862622

Pulled By: ngimel

fbshipit-source-id: ca1002428850520cd74cd5b7ed8cb4d12dbd9c52
2020-12-09 23:27:35 -08:00
95233870f2 [PyTorch Mobile] Preserve bundled input related methods when calling optimize_for_mobile
Summary:
Added an extra step to **always** preserve the bundled inputs methods if they are present in the input module.

Also added a check to see if all the methods in `preserved_methods` exist. If not, we will now throw an exception. This can hopefully stop hard-to-debug inputs from getting into downstream functions.

~~Add an optional argument `preserve_bundled_inputs_methods=False` to the `optimize_for_mobile` function. If set to be True, the function will now add three additional functions related with bundled inputs to be preserved: `get_all_bundled_inputs`, `get_num_bundled_inputs` and `run_on_bundled_input`.~~

Test Plan:
`buck test mode/dev //caffe2/test:mobile -- 'test_preserve_bundled_inputs_methods \(test_mobile_optimizer\.TestOptimizer\)'`

or

`buck test caffe2/test:mobile` to run some other related tests as well.

Reviewed By: dhruvbird

Differential Revision: D25433268

fbshipit-source-id: 0bf9b4afe64b79ed1684a3db4c0baea40ed3cdd5
2020-12-09 22:53:56 -08:00
9417e92722 op to gen quant params from min-max thresholds
Summary: Adds support for generating the qparams needed to quantize a tensor from its min and max thresholds.
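
A generic sketch of the underlying math (a hypothetical helper, not the operator's actual implementation), deriving affine qparams for a uint8 target:

```
def qparams_from_min_max(t_min, t_max, qmin=0, qmax=255):
    # The representable range must contain zero.
    t_min, t_max = min(t_min, 0.0), max(t_max, 0.0)
    scale = (t_max - t_min) / (qmax - qmin) or 1.0  # guard against a zero scale
    zero_point = int(round(qmin - t_min / scale))
    return scale, max(qmin, min(qmax, zero_point))  # clamp into [qmin, qmax]

print(qparams_from_min_max(-1.0, 1.0))  # roughly (0.00784, 128)
```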

Test Plan:
```
buck test mode/opt caffe2/caffe2/quantization/server:int8_gen_quant_params_min_max_test
```
```
Started reporting to test run: https://our.intern.facebook.com/intern/testinfra/testrun/5629499573509506
    ✓ ListingSuccess: caffe2/caffe2/quantization/server:int8_gen_quant_params_min_max_test - main (2.522)
    ✓ Pass: caffe2/caffe2/quantization/server:int8_gen_quant_params_min_max_test - test_int8_gen_quant_params_min_max_op (caffe2.caffe2.quantization.server.int8_gen_quant_params_min_max_test.TestInt8GenQuantParamsMinMaxOperator) (1.977)
Summary
  Pass: 1
  ListingSuccess: 1
```

Reviewed By: hx89

Differential Revision: D24485985

fbshipit-source-id: 18dee193f7895295d85d31dc013570e5d5d97357
2020-12-09 19:13:53 -08:00
c5bc6b40ab [NNC] Dead Store Elimination (#49030)
Summary:
Adds a new optimization method to LoopNest which eliminates stores that do not contribute to any output. It's unlikely that any of the lowerings of aten operators produce such stores yet, but this creates some wiggle room for future transformations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49030

Reviewed By: tugsbayasgalan

Differential Revision: D25434538

Pulled By: nickgg

fbshipit-source-id: fa1ead82e6f7440cc783c6116b23d0b7a5b5db4b
2020-12-09 18:49:53 -08:00
7a2abbd8fd Revert D25416620: [pytorch][PR] Add version_info tuple
Test Plan: revert-hammer

Differential Revision:
D25416620 (e69c2f85f6)

Original commit changeset: 20b561a0c76a

fbshipit-source-id: 4d73c7ed9191137d5be92236c18c312ce25a1471
2020-12-09 18:41:24 -08:00
3123f878dd [PyTorch] Avoid storage refcount bump in copy_tensor_metadata (#48877)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48877

Setting `Storage` in the TensorImpl ctor only to set it again in
`copy_tensor_metadata` wastes one refcount bump.
ghstack-source-id: 117937872

Test Plan:
internal benchmark. compared results with perf, saw 0.15%
reduction in percent of total time spent in
`TensorImpl::shallow_copy_and_detach`.

Reviewed By: bhosmer

Differential Revision: D25353529

fbshipit-source-id: e85d3a139ccd44cbd059c14edb19b22b962881a9
2020-12-09 17:51:07 -08:00
e69c2f85f6 Add version_info tuple (#48414)
Summary:
Add a `version_info` tuple, similar to `sys.version_info`, to enable version tests. Example generated `version.py`:

```
__version__ = '1.8.0a0'
version_info = (1, 8, 0, 'a0')
# or version_info = (1, 8, 0, 'a0', 'deadbeef') if you're in a Git checkout
debug = False
cuda = None
git_version = '671ee71ad4b6f507218d1cad278a8e743780b716'
hip = None
```
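
A hypothetical usage sketch, assuming the generated file is importable as `torch.version`:

```
from torch.version import version_info

if version_info >= (1, 8):
    print("running 1.8 or newer")
```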

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48414

Reviewed By: zhangguanheng66

Differential Revision: D25416620

Pulled By: malfet

fbshipit-source-id: 20b561a0c76ac0b16ff92f4bd43f8b724971e444
2020-12-09 17:44:35 -08:00
5375a479aa Add type annotations to conv-relu (#47680)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47680

Reviewed By: zhangguanheng66

Differential Revision: D25416628

Pulled By: malfet

fbshipit-source-id: 103bea1e8c300990f74689787a71b1cfe916cfef
2020-12-09 17:12:26 -08:00
e9ef1fe309 [PyTorch Mobile] Add continuous build config for xplat/caffe2
Summary:
Currently this folder isn't covered by continuous build, and ideally it should be.

I've made everything that is actually used build, but there are test failures (commented out). Specifically:

### Build Failures

1. [Resolved] Vulkan stuff doesn't build because codegen doesn't generate files that Vulkan expects.
2. [Resolved] Vulkan relies on an Android dev environment being set up, which doesn't exist on sandcastle machines. I think the resolution should be to restrict Vulkan stuff to the ANDROID platform, but I will let AshkanAliabadi (who is the expert on all things Vulkan) provide the appropriate resolution.
3. [Resolved] Older caffe2 stuff didn't have the deps set up correctly for zlib.
4. [Resolved] Some Papaya stuff didn't have the QPL deps set up correctly.
5. [Resolved] Some tests include cuda, which isn't available on xplat PyTorch Mobile.
6. [Resolved] Missing NNPACK dep on platforms other than ANDROID and MACOS.
7. [Resolved] Maskrcnn binary missing header includes.
8. [Resolved] Braces around scalar initializers in Vulkan Tests.
9. [Resolved] Incorrect header `<vulkan/vulkan.h>` and incorrect BUCK glob path to include it - seems like some completely different header was being included by libvulkan-stub.

### Test Failures

1. [Resolved] Memory Leak on exit in multiple (all?) QNNPACK tests.
2. [Unresolved] The Lite Trainer test doesn't explicitly specify a dep on its input `.ptl` file, resulting in the file not being found when the test attempts to open it.
3. [Resolved] Heap Use after free errors in old caffe2 tests.
4. [Resolved] Heap buffer overflow errors in old caffe2 tests.
5. [Unresolved] Something related to an overload of `at::Tensor` accepting C2 Tensor not being found (new PyTorch test I think).

Everything marked `[Unresolved]` above results in stuff that is commented out so that it isn't triggered. It is already broken, so this doesn't represent a regression - merely an explicit indication of the fact that it's broken.

Everything marked `[Resolved]` above means that it was fixed to function as intended based on my understanding of the intent.

Test Plan: Sandcastle.

Reviewed By: iseeyuan

Differential Revision: D25093853

fbshipit-source-id: e0dda4f3d852ef158cd088ae2cfd44019ade1573
2020-12-09 16:58:20 -08:00
16b8e6ab01 Class-based structured kernels, with migration of add to framework (#48718)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48718

This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check https://github.com/pytorch/rfcs/pull/9 for a mostly up-to-date high level description of what's going on here.

High level structure of this PR (the order you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
* tools/codegen/gen.py - Now generates all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old-style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by and large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the foreseeable future, as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.

TODO:
* Work out an appropriate entry point for static runtime, since native:: function stubs are no longer generated
* Refactor TensorIteratorConfig construction into helper functions, like before
* Make Tensor-Scalar addition structured to fix perf regression
* Fix `verify_api_visibility.cpp`
* Refactor tools/codegen/gen.py for clarity
* Figure out why header changes resulted in undefined reference to `at::Tensor::operator[](long) const`

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D25278031

Pulled By: ezyang

fbshipit-source-id: 57c43a6e5df21929b68964d485995fbbae4d1f7b
2020-12-09 15:39:12 -08:00
a6fa3b2682 adding profile_ivalue (#47666)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47666

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D25255573

Pulled By: Krovatkin

fbshipit-source-id: 5d8753e4040a3d96105d28d26728125947c7a638
2020-12-09 15:29:15 -08:00
f431e47a2e [collect_env] Acquire windows encoding using OEMCP (#49020)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49010.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49020

Reviewed By: zhangguanheng66

Differential Revision: D25398064

Pulled By: janeyx99

fbshipit-source-id: c7fd1e7d1f3dd82613d7f2031439503188b144fd
2020-12-09 15:22:18 -08:00
5765bbd78c Review memory overlap checks for advanced indexing operations (#48651)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45964

Indexing operators, e.g. `scatter`/`gather`, use tensor restriding, so TensorIterator's built-in overlap checking needs to be disabled. This adds the missing overlap checks for these operators.

In addition, some indexing operators don't work well with `MemOverlapStatus::FULL`, which is explicitly allowed by `assert_no_partial_overlap`. So I've introduced `assert_no_overlap`, which raises an error on partial _or_ full overlap.
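
A hedged illustration of the new check (the exact error text may differ):

```
import torch

x = torch.zeros(4)
idx = torch.tensor([0, 1])
try:
    # out= partially overlaps the input's storage; this is now rejected
    # instead of silently reading and writing the same memory:
    torch.gather(x, 0, idx, out=x[2:4])
except RuntimeError as e:
    print('refused:', e)
```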

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48651

Reviewed By: zhangguanheng66

Differential Revision: D25401047

Pulled By: ngimel

fbshipit-source-id: 53abb41ac63c4283f3f1b10a0abb037169f20b89
2020-12-09 15:10:52 -08:00
dfa3808704 [PyTorch] Remove aten::native::empty usage in TensorIndexing (#49074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49074

Try to resolve part of the github issue https://github.com/pytorch/pytorch/issues/48684. ```aten::native::empty()``` is referenced in TensorIndexing.h; however, its definition is nothing but checks that eventually call ```at::empty()```.

In this diff, ```at::empty()``` is used directly to avoid referencing native symbols.
ghstack-source-id: 118165999

Test Plan: CI

Reviewed By: dhruvbird

Differential Revision: D25417854

fbshipit-source-id: 7e4af411ae63642c8470e78cf8553400dc9a16c9
2020-12-09 14:50:19 -08:00
c7cc8a48c0 migrating some straggler pytorch ops in fbcode to the new registration API (#48954)
Summary:
I already migrated the majority of fbcode ops to the new registration API, but there are a few stragglers (mostly new files that were created in the last two weeks).

The goal is mostly to stamp out as much of the legacy registration API usage as possible, so that people only see the new API when they look around the code for examples of how to register their own ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48954

ghstack-source-id: 118140663

Test Plan: Ran buck targets for each file that I migrated

Reviewed By: ezyang

Differential Revision: D25380422

fbshipit-source-id: 268139a1d7b9ef14c07befdf9e5a31f15b96a48c
2020-12-09 14:42:29 -08:00
67d12c9582 Pass shape hints for AOT case (#48989)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48989

1. Pass shape hints at model export time.
2. A bit of logging to show if passed shape hints are loaded by OnnxifiOp.

From jfix71:
> for AOT we skip onnxifi on the predictor side. We do onnxifi at model export time

Test Plan:
Temporarily added extra logging to verify that we use the passed shape hints for the AOT scenario. Here are the test results:
1. AOT model generation https://fburl.com/paste/1dtxrdsr shows that pybind_state.cc is called.
2. Running predictor service https://fburl.com/paste/d4qcizya with more logging in onnxifi_op.cc D25344546 shows that we use provided shape hints instead of doing shape inference every time.

Reviewed By: jfix71

Differential Revision: D25344546

fbshipit-source-id: 799ca4baea23ed4d81d89d00cb3a52a1cbf69a44
2020-12-09 14:15:57 -08:00
bfa95f90a0 Revert D25325039: Check CUDA kernel launches (/fbcode/caffe2/)
Test Plan: revert-hammer

Differential Revision:
D25325039 (f5e9ffbc27)

Original commit changeset: 2043d6e63c7d

fbshipit-source-id: 5377dd2aa7c6f58c8641c956b7642c7c559bbc40
2020-12-09 14:07:16 -08:00
7a4a2df225 Revert D25003113: make validate debug-only in Device copy ctr
Test Plan: revert-hammer

Differential Revision:
D25003113 (4b26cafb8f)

Original commit changeset: e17e6495db65

fbshipit-source-id: fd636c954a97bd80892464feb974a11b9dd96899
2020-12-09 13:58:11 -08:00
fc0a3a1787 Improve torch.fft n-dimensional transforms (#46911)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46911

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25420647

Pulled By: mruberry

fbshipit-source-id: bf7e6a2ec41f9f95ffb05c128ee0f3297e34aae2
2020-12-09 12:40:06 -08:00
f5e9ffbc27 Check CUDA kernel launches (/fbcode/caffe2/) (#49105)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49105

(1) Add a safety check `C10_CUDA_KERNEL_LAUNCH_CHECK()` after each kernel launch. This diff only changes files inside the directories /fbsource/fbcode/caffe2/modules/, /fbsource/fbcode/caffe2/fb/, and /fbsource/fbcode/caffe2/test/.

(2) Get rid of old check `AT_CUDA_CHECK(cudaGetLastError())` when necessary.

Test Plan:
Test build:
```
buck build //caffe2/modules/detectron:
buck build //caffe2/torch/fb/:
```

To check for launches without checks:
```
python3 caffe2/torch/testing/check_kernel_launches.py
```
Make sure none of the updated files are in the returned list.

Reviewed By: r-barnes

Differential Revision: D25325039

fbshipit-source-id: 2043d6e63c7d029c35576d3101c18247ffe92f01
2020-12-09 12:34:55 -08:00
7584161dfa Enhance new_group doc to mention using NCCL concurrently. (#48872)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48872

Using NCCL communicators concurrently is not safe and this is
documented in NCCL docs.

However, this is not documented in PyTorch and we should add documentation for
ProcessGroupNCCL so that users are aware of this limitation.
ghstack-source-id: 118148014

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D25351778

fbshipit-source-id: f7f448dc834c47cc1244f821362f5437dd17ce77
2020-12-09 12:29:15 -08:00
c62f3fc40b fix clang-tidy warning - make global TorchLibraryInit objects const (#48956)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48956

ghstack-source-id: 118140666

Test Plan: GitHub CI

Reviewed By: ezyang

Differential Revision: D25381418

fbshipit-source-id: 1726ed233b809054cb9e5ba89e02c84fb868c1eb
2020-12-09 12:22:17 -08:00
b98e62f8eb [te] Add gflag for fast intrinsic expansion (#49060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49060

TE contains a fast tanh/sigmoid implementation that may be slightly less precise than the eager implementation (I measured 1 ulp in some test cases).  We disabled it by default using an #ifdef but that may be too conservative.  Adding a gflag allows more testing without recompilation.
ghstack-source-id: 118140487

Test Plan: `buck test //caffe2/test:jit`

Reviewed By: eellison

Differential Revision: D25406421

fbshipit-source-id: 252b64091edfff878d2585e77b0a6896aa096ea5
2020-12-09 12:15:47 -08:00
44f33596d3 [pe] Add gflags for num_profiled_runs and bailout_depth, laint (#49059)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49059

We'd like to be able to change these defaults without rebuilding the library.
ghstack-source-id: 118140486

Test Plan: `buck build //caffe2/test:jit`

Reviewed By: eellison

Differential Revision: D25405568

fbshipit-source-id: 5d0561a64127adc44753e48d3b6c7f560c8b5820
2020-12-09 12:14:00 -08:00
e5a98c5ab0 [ONNX] Remove usage of isCompleteTensor() in symbolic functions (#48162)
Summary:
`isCompleteTensor()` only returns true when both the scalar type and shape are present, and all dimensions in the shape must be static. This strict requirement is unnecessary for many use cases, such as when only the rank or the scalar type needs to be known.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48162

Reviewed By: malfet

Differential Revision: D25340823

Pulled By: bzinodev

fbshipit-source-id: 1fef61f44918f4339dd6654fb725b18cd58d99cf
2020-12-09 11:37:19 -08:00
41fd51d7d8 [PyTorch] Reference to c10::GetCPUAllocator() directly (#49068)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49068

The TH folder has some kernel implementations referenced by ATen/native; it goes with ATen/native in the follow-up diff for per-app selective build. ATen/Context.cpp stays at the lib level and should not reference symbols in TH directly.

It's a simple change in this diff, as ```getTHDefaultAllocator()``` did nothing but return ```c10::GetCPUAllocator()```. Use ```c10::GetCPUAllocator()``` directly instead of taking the extra route through ```getTHDefaultAllocator()```.
ghstack-source-id: 118151905

Test Plan: CI

Reviewed By: dhruvbird

Differential Revision: D24147914

fbshipit-source-id: 37efb43adc9b491c365df0910234fa6a8a34ec25
2020-12-09 10:37:39 -08:00
b3ab25aefa [numpy] torch.cosh: promote integer inputs to float (#48923)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48923

Reviewed By: zhangguanheng66

Differential Revision: D25393679

Pulled By: mruberry

fbshipit-source-id: 2151ee0467b50175f84ac492c219a46ef6bd66c3
2020-12-09 10:15:58 -08:00
492580b855 [te] Remove vestigial __init__.py from test/cpp/tensorexpr (#49061)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49061

We don't use the python harness for cpp tests anymore.
ghstack-source-id: 118140485

Test Plan: Careful thinking.

Reviewed By: navahgar

Differential Revision: D25410290

fbshipit-source-id: 879e3c6fb296298d567e1d70b18bde96b5cac90d
2020-12-09 10:09:46 -08:00
9f7fb54693 Revert D25111515: Extra sampling of record function events
Test Plan: revert-hammer

Differential Revision:
D25111515 (09b974c2d5)

Original commit changeset: 0d572a3636fe

fbshipit-source-id: d558d8052924d937d86db7dd40dc6388e6d28823
2020-12-09 08:37:17 -08:00
73f7178445 remove redundant sccache wrappers from build.sh scripts (#47944)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47944

Reviewed By: zhangguanheng66

Differential Revision: D25406873

Pulled By: walterddr

fbshipit-source-id: 5441b0a304e0be1213b4e14adf26118b3e7e330b
2020-12-09 08:20:44 -08:00
4b26cafb8f make validate debug-only in Device copy ctr (#47854)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47854

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25003113

Pulled By: bdhirsh

fbshipit-source-id: e17e6495db65c48c7daf3429acbd86742286a1f3
2020-12-09 08:11:24 -08:00
71cfb73755 Add complex support to broadcast_coalesced (#48686)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47330

Add support for DataParallel complex tensors by handling them as `torch.view_as_real` for `broadcast_coalesced`, `scatter` and `gather`
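
A small sketch of the trick using the public APIs (not the internal code path itself):

```
import torch

z = torch.randn(2, 3, dtype=torch.complex64)
r = torch.view_as_real(z)                 # real view with a trailing dim of 2
print(r.shape)                            # torch.Size([2, 3, 2])
print(torch.view_as_complex(r).equal(z))  # True: the view round-trips losslessly
```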

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48686

Reviewed By: osalpekar

Differential Revision: D25261533

Pulled By: sidneyfletcher

fbshipit-source-id: 3a25e05deee43e053f40d1068fc5c7867cfa9686
2020-12-09 05:11:40 -08:00
09b974c2d5 Extra sampling of record function events (#48289)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48289

Adding extra sampling step when dispatching RecordFunction.

(Note: this ignores all push blocking failures!)

Reviewed By: swolchok

Differential Revision: D25111515

Pulled By: ilia-cher

fbshipit-source-id: 0d572a3636fe649a47ec47901826bbfc08368937
2020-12-09 02:29:13 -08:00
a20d4511e4 [PyTorch] TensorImpl::is_non_overlapping_and_dense_ should default to true (#48625)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48625

The default TensorImpl is contiguous. Therefore, it is non-overlapping and dense per refresh_contiguous().
ghstack-source-id: 118035410

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D25232196

fbshipit-source-id: 1968d9ed444f2ad5414a78d0b11e5d3030e3109d
2020-12-09 00:49:31 -08:00
a849f38222 skip cuda test_cholesky_solve_batched_many_batches due to illegal memory access (#48999)
Summary:
See https://github.com/pytorch/pytorch/issues/48996

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48999

Reviewed By: zhangguanheng66

Differential Revision: D25390070

Pulled By: mruberry

fbshipit-source-id: cf59130f6189ab8c2dade6a6a4de2f69753a5e36
2020-12-09 00:47:55 -08:00
e8b00023b2 [ROCm] restore autograd tests (#48431)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/30845.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48431

Reviewed By: zhangguanheng66

Differential Revision: D25393323

Pulled By: mruberry

fbshipit-source-id: 339644abf4ad52be306007f4040c692a45998052
2020-12-09 00:40:40 -08:00
1c31f76297 Add high level profiling trace for dataloading and optimizer (#47655)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47441

To give users more information about Python-level functions in profiler traces, we propose to instrument the following functions:

```
_BaseDataLoaderIter.__next__
Optimizer.step
Optimizer.zero_grad
```

Because record_function already uses `if (!active)` to check whether the profiler is enabled, we don't explicitly call torch.autograd._profiler_enabled() before each instrumented call.
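
A toy sketch of where the new events surface (the model, data, and optimizer here are illustrative only):

```
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loader = DataLoader(TensorDataset(torch.randn(8, 4), torch.randn(8, 2)), batch_size=4)

with torch.autograd.profiler.profile() as prof:
    for x, y in loader:  # _BaseDataLoaderIter.__next__ is now traced
        opt.zero_grad()  # traced
        torch.nn.functional.mse_loss(model(x), y).backward()
        opt.step()       # traced
print(prof.key_averages().table(sort_by='cpu_time_total', row_limit=5))
```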

Acknowledgement: nbcsm, guotuofeng, gunandrose4u , guyang3532 , mszhanyi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47655

Reviewed By: smessmer

Differential Revision: D24960386

Pulled By: ilia-cher

fbshipit-source-id: 2eb655789e2e2f506e1b8f95ad3d470c83281102
2020-12-09 00:13:56 -08:00
2d9585a6a1 [quant][graphmode][fx] Add test for ResnetBase (#48939)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48939

Add numerical test for fx graph mode for resnet base, comparing with eager mode

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D25375342

fbshipit-source-id: 08f49b88daede47d44ee2ea96a02999fea246cb2
2020-12-08 22:27:03 -08:00
59a3e76641 [pt][quant] Remove contiguous calls in qembeddingbag (#48993)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48993

I don't see any reason that we need to call contiguous on the embedding tables; non-contiguous tables should not exist in the first place. The indices and lengths/offsets are actually generated in the model, most likely by SigridTransform -> ClipRanges -> GatherRanges -> SigridHash (sometimes), and none of these ops produce non-contiguous tensors. It should be fine to enforce tensor.is_contiguous().

Reviewed By: radkris-git

Differential Revision: D25266756

fbshipit-source-id: f15ecb67281c9ef0c7ac6637f439e538e77e30a2
2020-12-08 20:14:20 -08:00
7c0a3e3a06 Annotate torch._tensor_str (#48584)
Summary:
This is a follow-up PR to https://github.com/pytorch/pytorch/issues/48463

> Rather than requiring that users write import numbers and then use numbers.Float etc., this PEP proposes a straightforward shortcut that is almost as effective: when an argument is annotated as having type float, an argument of type int is acceptable; similarly, for an argument annotated as having type complex, arguments of type float or int are acceptable.
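
A minimal illustration of the shortcut the quote describes:

```
def scale(x: float) -> float:
    return x * 2.0

scale(3)  # an int argument is acceptable where float is annotated
```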

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48584

Reviewed By: zhangguanheng66

Differential Revision: D25411080

Pulled By: malfet

fbshipit-source-id: e00dc1e9e6e46a8cfae77da4f2cf159c0c2b9bcc
2020-12-08 20:06:40 -08:00
34cc77a811 Torch onnx (#48980)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45215

This is a follow-up PR to https://github.com/pytorch/pytorch/issues/45258 and https://github.com/pytorch/pytorch/issues/48782

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48980

Reviewed By: zhangguanheng66

Differential Revision: D25399823

Pulled By: ezyang

fbshipit-source-id: 798055f4abbbffecdfab0325884193c81addecec
2020-12-08 19:41:44 -08:00
5450614cf6 Correctly apply WIN32_LEAN_AND_MEAN to the whole repo (#49025)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48895

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49025

Reviewed By: zhangguanheng66

Differential Revision: D25399912

Pulled By: ezyang

fbshipit-source-id: 9b7225b0e43511e0b8981c39035d814a4406c523
2020-12-08 19:38:23 -08:00
4434c07a2c [quant][fix] Support quantization of ops where input is quantizable (#49027)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49027

For cat followed by linear, since the output of cat is not quantized, we didn't quantize the linear.
This change checks the uses of the cat op to insert observers.

Test Plan:
python test/test_quantization.py TestQuantizeJitOps.test_cat_linear

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25403412

fbshipit-source-id: 5875db259bf75f08ce672ce341a67005ed2f8a04
2020-12-08 19:21:41 -08:00
993ce4b206 [quant][graphmode][fx] Add MatchAllNode in pattern matching (#48979)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48979

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D25385459

fbshipit-source-id: 43adffc9e2242d099cecd38d1902f9900158f51e
2020-12-08 18:53:55 -08:00
107c31f2f5 Add a pass to fetch attributes of nn.Module to fx.node (#47935)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47935

Fetch the parameters that are needed for lowering from nn.Module to fx.node for leaf_modules.

Test Plan: A test `test_fetch` is added to test_fx_experimental.py.

Reviewed By: jfix71

Differential Revision: D24957142

fbshipit-source-id: a349bb718bbcb7f543a49f235e071a079da638b7
2020-12-08 18:06:37 -08:00
3f9ff48ebb [JIT] Allow del statements with multiple targets (#48876)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48876

**Summary**
This commit adds support for `del` statements with multiple targets.
Targets are deleted left-to-right just like Python.

**Test Plan**
This commit updates the `TestBuiltins.test_del_multiple_operands` unit
test to actually test that multiple deletion works instead of asserting
that an error is thrown.

**Fixes**
This commit fixes #48635.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D25386285

Pulled By: SplitInfinity

fbshipit-source-id: c0fbd8206cf98b2bd1b695d0b778589d58965a74
2020-12-08 15:39:42 -08:00
d033e185ed fx quant: move more functions to utils (#48908)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48908

No logic change, improving readability

Test Plan:
CI

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25363080

fbshipit-source-id: 1d73a875bd7abf671b544ebc835432fea5306dc3
2020-12-08 15:37:04 -08:00
2668ea8087 fx quant: move qconfig utils to utils file (#48907)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48907

Improving readability

Test Plan:
CI

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25363078

fbshipit-source-id: 6b0161db14ccf8c3b47edf4fc760ca9a399254b2
2020-12-08 15:37:00 -08:00
17e71509a6 fx quant: quick cleanup for model_device (#48906)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48906

As titled, removing some code which is no longer
needed after refactors.

Test Plan:
CI

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25363079

fbshipit-source-id: 9e4bcf63f4f1c2a2d3fb734688ba593d72495349
2020-12-08 15:35:18 -08:00
e538bd6695 [collect_env] Add candidate paths for nvidia-smi on Windows (#49021)
Summary:
Recently, Nvidia has started putting nvidia-smi under SystemRoot.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49021

Reviewed By: zhangguanheng66

Differential Revision: D25399831

Pulled By: ezyang

fbshipit-source-id: b1ea12452012e0a3fb4703996b6104e7115a8a7f
2020-12-08 15:02:15 -08:00
02b63858f2 [CUDAExtension] support all visible cards when building a cudaextension (#48891)
Summary:
Currently CUDAExtension assumes that all cards on a machine are of the same type and builds the extension with the compute capability of the 0th card. This breaks later at runtime if the machine has cards of different types.

Specifically resulting in:
```
RuntimeError: CUDA error: no kernel image is available for execution on the device
```
when cards of types that weren't compiled for are used (and the error is far from telling the uninitiated what the problem is).

My current setup is:
```
$ CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())"
(8, 6)
$ CUDA_VISIBLE_DEVICES=1 python -c "import torch; print(torch.cuda.get_device_capability())"
(6, 1)
```
but the extension was getting built with `-gencode=arch=compute_80,code=sm_80`.

This PR:
* [x] introduces a loop over all devices visible at build time to ensure the extension will run on all of them (it sorts the new list generated by the loop, so that the output is easier to debug should a card with a lower capability come last)
* [x] adds `+PTX` to the last entry of the compute capabilities derived from local cards (`if not _arch_list:`) to support other archs
* [x] adds a digest of my conversation with ptrblck on slack in the form of docs, which can hopefully help others know which archs to support, how to override the defaults, and when and how to add PTX (see the sketch below)

Please kindly review that my prose is clear and easy to understand.

ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48891

Reviewed By: ngimel

Differential Revision: D25358285

Pulled By: ezyang

fbshipit-source-id: 8160f3adebffbc8e592ddfcc3adf153a9dc91557
2020-12-08 14:57:10 -08:00
6000481473 add a unit test for large node error (#48938)
Summary:
add a unit test to test the situation where a node is too large to fit into any device

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48938

Reviewed By: zhangguanheng66

Differential Revision: D25402967

Pulled By: scottxu0730

fbshipit-source-id: a2e2a3dc70d139fa678865ef03e67fa57eff4a1d
2020-12-08 14:45:44 -08:00
5960581148 CUDA BFloat16 batchnorm (non-cuDNN) (#44994)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44994

Reviewed By: ailzhang

Differential Revision: D25377525

Pulled By: ngimel

fbshipit-source-id: 42d583bbc364532264a4d3ebaa6b4ae02a0413de
2020-12-08 14:25:42 -08:00
e8ec84864f [StaticRuntime] Add aten::narrow (#48991)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48991

Add a native impl of `aten::narrow` to skip the dispatcher. Because `aten::narrow` calls `aten::slice` in its implementation, we cut the dispatcher overhead twice over by calling the native impl of `aten::slice` as well.

Reviewed By: bwasti

Differential Revision: D25387119

fbshipit-source-id: c020da2556a35bc57a8a2e21fa45dd491ea516a0
2020-12-08 13:48:21 -08:00
d1fb4b4ffc Put Flake8 requirements into their own file (#49032)
Summary:
This PR moves the list of Flake8 requirements/versions out of `.github/workflows/lint.yml` and into its own file `requirements-flake8.txt`. After (if) this PR is merged, I'll modify the Flake8 installation instructions on [the "Lint as you type" wiki page](https://github.com/pytorch/pytorch/wiki/Lint-as-you-type) (and its internal counterpart) to just say to install from that new file, rather than linking to the GitHub Actions YAML file and/or giving a command with a set of packages to install that keeps becoming out-of-date.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49032

Test Plan:
Either look at CI, or run locally using [act](https://github.com/nektos/act):
```sh
act -P ubuntu-latest=nektos/act-environments-ubuntu:18.04 -j flake8-py3
```

Reviewed By: janeyx99

Differential Revision: D25404037

Pulled By: samestep

fbshipit-source-id: ba4d1e17172a7808435df06cba8298b2b91bb27c
2020-12-08 13:29:10 -08:00
2b70bcd014 [TensorExpr] Enable inlining for output tensors too. (#48967)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48967

We previously didn't inline output tensors, which resulted in correctness
issues like #48533. This PR allows inlining for output tensors too -
this could result in duplicated computations, but we can address that
later once correctness is ensured.

Performance results on FastRNNS:
Before the fix:
```
Benchmarking LSTMs...
            name          avg_fwd          std_fwd          avg_bwd          std_bwd
           cudnn            10.09          0.05431            17.55           0.2108
            aten            21.52           0.1276             26.7            1.471
             jit            13.25           0.8748            22.47             1.73
      jit_premul            11.43           0.3226            19.43            2.231
 jit_premul_bias            11.84           0.2245            20.33            2.205
      jit_simple            13.27           0.9906            22.15           0.9724
  jit_multilayer            13.38           0.8748            22.82             1.01
              py            33.55            4.837            46.41            6.333
```
After the fix:
```
Benchmarking LSTMs...
            name          avg_fwd          std_fwd          avg_bwd          std_bwd
           cudnn            10.09          0.05979            17.45           0.1987
            aten            21.21            0.144            26.43           0.7356
             jit            13.01           0.2925            23.21           0.8454
      jit_premul             11.4           0.3905            19.62            2.448
 jit_premul_bias            11.85           0.2461            20.29           0.6592
      jit_simple            13.08           0.8533            22.81            1.315
  jit_multilayer            12.93           0.1095            23.57            1.459
              py            31.21            2.783            44.63            6.073
```

Differential Revision: D25383949

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Pulled By: ZolotukhinM

fbshipit-source-id: 16f5727475109a278499bef7905f6aad18c8527a
2020-12-08 13:24:40 -08:00
0fb9d36660 Delete ATen mirror stuff (#49028)
Summary:
These files refer to https://travis-ci.org/github/zdevito/ATen and https://github.com/zdevito/ATen which were last updated in 2018 and 2019 respectively. According to zdevito:

> yeah, all of that stuff can be deleted
> was from a time when ATen was a separate repo from pytorch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49028

Reviewed By: zdevito

Differential Revision: D25401810

Pulled By: samestep

fbshipit-source-id: a8eea7382f91e1aee6f45552645e6d53825fe5a7
2020-12-08 13:19:30 -08:00
dee82ef3ea Add LKJCholesky distribution (#48798)
Summary:
As a follow-up to https://github.com/pytorch/pytorch/issues/48041, this adds the `LKJCholesky` distribution, which samples the Cholesky factor of positive definite correlation matrices.

This also relaxes the check on `tril_matrix_to_vec` so that it works for 2x2 matrices with `diag=-2`.
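
A minimal usage sketch, assuming the class is exposed as `torch.distributions.LKJCholesky`:

```python
import torch
from torch.distributions import LKJCholesky

d = LKJCholesky(dim=3, concentration=1.0)
L = d.sample()   # lower-triangular Cholesky factor, shape (3, 3)
corr = L @ L.T   # positive definite correlation matrix with unit diagonal
```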

cc. fehiepsi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48798

Reviewed By: zhangguanheng66

Differential Revision: D25364635

Pulled By: neerajprad

fbshipit-source-id: 4abf8d83086b0ad45c5096760114a2c57e555602
2020-12-08 11:27:48 -08:00
c92c8598a3 [FX][2/2] Make docstrings pretty when rendered (#48871)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48871

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D25351588

Pulled By: jamesr66a

fbshipit-source-id: 4c6fd341100594c204a35d6a3aab756e3e22297b
2020-12-08 11:14:43 -08:00
b89c328493 Add fftw3 cmake as alternative for FFT/DFT (#48808)
Summary:
Added CMake discovery for FFTW3 in Dependencies.cmake.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48808

Reviewed By: janeyx99

Differential Revision: D25375320

Pulled By: walterddr

fbshipit-source-id: cde3afc51eef9c621c7d19be7ad7573fc8b838c2
2020-12-08 10:35:33 -08:00
b0e919cf60 Avoid initializing gradInput twice in the backward phase of replication (#48890)
Summary:
https://github.com/pytorch/pytorch/issues/48889

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48890

Reviewed By: zhangguanheng66

Differential Revision: D25375697

Pulled By: ezyang

fbshipit-source-id: fd6f6089be44e68c4557b923550c7cadb90d739a
2020-12-08 10:15:24 -08:00
274ce26fd8 [static runtime] Add Internal Ops to the registry (#48616)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48616

This adds a couple of _out variants and registers them in the registry.

I also added the concept of "canReuse{Input,Output}" so that we can annotate tensors that are not optimizable (specifically, non-float tensors).

In the future we can change this (with D25062301).

After removing `RecordFunction`, we see these results:

```
BS=20
 ---
caffe2:           0.651617 ~ 0.666354
static runtime:   0.753481
pytorch:          0.866658

BS=1
 ---
caffe2:           0.0858684 ~ 0.08633
static runtime:   0.209897
pytorch:          0.232694
```

Test Plan: standard internal test of ads model against caffe2 reference (see the scripts in this quip: https://fb.quip.com/ztERAYjuzdlr)

Reviewed By: hlu1

Differential Revision: D25066823

fbshipit-source-id: 25ca181c62209a4c4304f7fe73832b13e314df80
2020-12-08 09:32:38 -08:00
ad3fed8b90 [BE] Fix signed-unsigned warnings (#48848)
Summary:
Switch to range loops where possible.
Replace `ptrdiff_t` (signed type) with `size_t` (unsigned type).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48848

Reviewed By: walterddr

Differential Revision: D25376591

Pulled By: malfet

fbshipit-source-id: 9835f83b7a17b6acc20731cc89c1c11c2aa01a78
2020-12-08 08:58:11 -08:00
c29f51642e Modify NEON check for ARM64 on OS X (#48982)
Summary:
Use CMAKE_SYSTEM_PROCESSOR rather than running sysctl

Fixes https://github.com/pytorch/pytorch/issues/48874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48982

Reviewed By: walterddr

Differential Revision: D25385883

Pulled By: malfet

fbshipit-source-id: 47b6dc5be8d75f6d4a66a11c564abdfe31ac90b4
2020-12-08 07:58:22 -08:00
58c13cf685 Back out "Revert D25375885: [pytorch][PR] Reenable some BF16 tests on CUDA"
Summary: Revert D25397144 69829f3fff4d4a2d1a71bb52e90d3c7f16b27fa3

Test Plan: Revert Hammer

Reviewed By: janeyx99

Differential Revision: D25397572

fbshipit-source-id: 625ca2a32e4558ae4582a15697b6e1cc57cc1573
2020-12-08 07:52:59 -08:00
e2befb84bc minor README change to fix #25464 (#48970)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/25464

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48970

Reviewed By: walterddr

Differential Revision: D25396284

Pulled By: janeyx99

fbshipit-source-id: 8355c417b5c8b8865f208d7d8e8154048423afd9
2020-12-08 07:48:52 -08:00
39445f718c Revert D25375885: [pytorch][PR] Reenable some BF16 tests on CUDA
Test Plan: revert-hammer

Differential Revision:
D25375885 (e3893b867f)

Original commit changeset: 2e19fe725ae9

fbshipit-source-id: 69829f3fff4d4a2d1a71bb52e90d3c7f16b27fa3
2020-12-08 07:05:33 -08:00
07978bd62e [static runtime] fuse inference ops (1) (#48948)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48948

Fuse inference ops for the following inside static runtime:
ConcatAddMulReplaceNaNClip
CastedBatchOneHotLengths
ConcatBatchMatMulBatchGather

TODO:
1. add unit tests
2. add more restrictions on the graph transform (e.g. check inputs, check outputs not used elsewhere)

Test Plan:
Run adindexer model with static runtime and fusion; check ops
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adindexer/traced_precomputation2.pt --pt_inputs=/data/users/ansha/tmp/adindexer/merge/container_precomputation_bs1.pt --iters=3000 --warmup_iters=10000  --num_threads=1 --pred_net=/data/users/ansha/tmp/adindexer/precomputation_merge_net.pb --c2_inputs=/data/users/ansha/tmp/adindexer/merge/c2_inputs_precomputation_bs1.pb --c2_sigrid_transforms_opt=1 --c2_use_memonger=1 --c2_weights=/data/users/ansha/tmp/adindexer/merge/c2_weights_precomputation.pb --pt_enable_static_runtime
```
transformed model graph contains the fused ops: P151559641

Results before fusion: P151567611
Results after fusion: P151566783 (8% speedup for bs=20, 14% speedup for bs=1)

Reviewed By: hlu1

Differential Revision: D25224107

fbshipit-source-id: c8442e8ceb018879c61ce564367b1c1b9412601b
2020-12-08 05:54:49 -08:00
b643dbb8a4 VariableType calls faithful C++ API for c10-full out ops (#47792)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47792

For operators with out arguments, VariableType previously called the out overload of the C++ API because that's all we had.
We introduced a faithful C++ API that takes out arguments in schema-order in D24835252 and this PR changes VariableType to use that API instead.

Note that this only applies to c10-full ops. Non-c10-full ops still call the unfaithful API. There aren't any c10-full out ops at the moment.
So this PR can only be tested and evaluated together with PRs on top that make ops with out arguments c10-full.
ghstack-source-id: 118068088

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D24901945

fbshipit-source-id: a99db7e4d96fcc421f9664504f87df68fe1c482f
2020-12-08 03:48:45 -08:00
3ef36dca8e Faithful out arguments (#47712)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47712

This adds a faithful API for ops with out arguments, as described in https://docs.google.com/document/d/1h7nBibRwkRLQ8rsPhfALlwWR0QbkdQm30u4ZBwmaps8/edit# .

After this, an op will generate the following overloads for the C++ API:

```cpp
// Generated from the aten::abs operator (NOT from aten::abs.out)
Tensor at::abs(const Tensor& self)

// Generated from the aten::abs.out operator
Tensor& at::abs(const Tensor& self, Tensor& out)
Tensor& at::abs_out(Tensor& out, const Tensor& self)

```

This is an important step towards making those ops c10-full (it allows VariableType, XLA and other backends to ignore reordering and just call through with the same argument order), but this does not make any of those ops c10-full yet.
It enables the faithful API independent from c10-fullness. That means the API is more consistent with the same API for all ops and making an op c10-full in the future will not trigger future C++ API changes.
ghstack-source-id: 118068091

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D24835252

fbshipit-source-id: dedfabd07140fc8347bbf16ff219aad3b20f2870
2020-12-08 03:48:42 -08:00
046ea6696d Enable faithful API for all ops (#47711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47711

It seems we generated the declaration for all ops but the definition only for c10-full ops. We should also generate the definition for non-c10-full ops.
This makes future migrations of ops from non-c10-full to c10-full have a lower impact on the C++ API.
ghstack-source-id: 118064755

Test Plan: waitforsandcastle

Reviewed By: bhosmer

Differential Revision: D24835006

fbshipit-source-id: 8f5c3c0ffcdc9b479ca3785d57da16db508795f5
2020-12-08 03:43:48 -08:00
32b098baf9 Add and adjust kernel launch checks (#46727)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46727

This adds kernel launch safety checks to a number of kernels. See D24309971 (353e7f940f) for context.

Test Plan: The existing pre-commit test rigs are used.

Reviewed By: ngimel

Differential Revision: D24334303

fbshipit-source-id: b6433f6be109fc8dbe789e91f3cbfbc31fd15951
2020-12-08 00:36:56 -08:00
cb6233aa53 Fix some convoluted(?) code (#48893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48893

This simplifies some code which is written in an interesting way. It may be that this was intentional, but I don't recognize the pattern being used.

Test Plan: N/A - Sandcastle

Reviewed By: igorsugak

Differential Revision: D25358283

fbshipit-source-id: 19bcf01cbb117843e08df0237e6a03ea77958078
2020-12-07 22:48:23 -08:00
c3a90bedd4 Move aten::__contains__.int_list for lite jit (#48950)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48950

Needed by noise suppression model

Test Plan: build

Reviewed By: linbinyu

Differential Revision: D25321582

fbshipit-source-id: fbc67fc35087c5f44b7ab68d1485b2b916747723
2020-12-07 21:27:34 -08:00
881e9583b2 docker: Add make variable to add docker build args (#48942)
Summary:
Adds an extra make variable 'EXTRA_DOCKER_BUILD_FLAGS' that allows us to
add extra docker build flags to the docker build command.

Example:

    make -f docker.Makefile EXTRA_DOCKER_BUILD_FLAGS=--no-cache devel-image

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48942

Reviewed By: walterddr

Differential Revision: D25376288

Pulled By: seemethere

fbshipit-source-id: 9cf2c2a5e01d505fa54447604ecd653dcbdd42e1
2020-12-07 20:15:24 -08:00
5533be5170 CUDA BF16 backwards (#48809)
Summary:
Looks like there's no test?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48809

Reviewed By: mruberry

Differential Revision: D25378998

Pulled By: ngimel

fbshipit-source-id: d16789892902b5a20828e8c7b414b478de33c4a5
2020-12-07 19:48:53 -08:00
3aeb9cc85d [DOCS]Correct docs for torch.lu_solve (#47762)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43498 by correcting the function signature of `torch.lu_solve`
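
For reference, a sketch of the call pattern the corrected docs describe:

```python
import torch

A = torch.randn(3, 3)
b = torch.randn(3, 2)
LU, pivots = torch.lu(A)           # LU factorization of A
x = torch.lu_solve(b, LU, pivots)  # solves A @ x = b
```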

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47762

Reviewed By: ljk53

Differential Revision: D24900259

Pulled By: ailzhang

fbshipit-source-id: 2a43170bde57e03d44025b23e3abcda169cfc9e2
2020-12-07 19:35:23 -08:00
bea88ee1d0 Added entry for torch.linalg.cond to linalg.rst (#48941)
Summary:
This PR makes documentation for `cond` available at https://pytorch.org/docs/master/linalg.html
I forgot to include this change in https://github.com/pytorch/pytorch/issues/45832.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48941

Reviewed By: ngimel

Differential Revision: D25379244

Pulled By: mruberry

fbshipit-source-id: c8c0a0b8a05c17025d6c3cea405b2add369e2019
2020-12-07 19:01:05 -08:00
c876d4f477 [Gradient Compression] Let the dtype of created low-rank tensors P and Q be the same type as the input tensor (#48902)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48902

Previously, if the dtype of the input gradients was FP16, matrix multiplications would fail, because the created low-rank tensors P and Q used the FP32 dtype.

Now the dtype of P and Q is the same as that of the input tensor.
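
A sketch of the effect (variable names are illustrative, not the hook's actual code; assumes a CUDA device since CPU FP16 matmul support is limited):

```python
import torch

grad = torch.randn(1024, 128, dtype=torch.float16, device="cuda")  # FP16 gradient
rank = 4
# P and Q now inherit the gradient's dtype instead of defaulting to FP32.
p = torch.empty(grad.shape[0], rank, dtype=grad.dtype, device=grad.device)
q = torch.randn(grad.shape[1], rank, dtype=grad.dtype, device=grad.device)
torch.matmul(grad, q, out=p)  # would raise a dtype mismatch if P/Q were FP32
```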

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117962078

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

Reviewed By: rohan-varma

Differential Revision: D25362071

fbshipit-source-id: e68753ff23bb480605b02891e128202ed0f8a587
2020-12-07 17:40:06 -08:00
533c837833 Register OpInfos for torch.fft transforms (#48427)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48427

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25266218

Pulled By: mruberry

fbshipit-source-id: 406e7ed5956bc7445daf8c027c9b4d2c8ff88fa1
2020-12-07 17:19:29 -08:00
adbb74ded9 [package] pre-emptively install submodules (#48799)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48799

Python's IMPORT_FROM bytecode bypasses the import infrastructure
when a package being loaded as part of a circular dependency is
accessed from its parent module _before_ that package has finished loading
and been installed on the module. Since we cannot override the lookup
on sys.modules, this PR pre-emptively does the module assignment before
running the submodule's initialization code.

Note: this appears to work, but it is not clear to me why Python doesn't
do this by default. It is possible that the logic for creating modules
is flexible enough in generic Python that this interception between creating
the module and running its code is not always possible.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D25312467

Pulled By: zdevito

fbshipit-source-id: 6fe3132af29364ccb2b3cabdd2b847d0a09eb515
2020-12-07 17:12:04 -08:00
e3893b867f Reenable some BF16 tests on CUDA (#48805)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48805

Reviewed By: agolynski

Differential Revision: D25375885

Pulled By: ailzhang

fbshipit-source-id: 2e19fe725ae9450bd1a2bc4e2d308c59b9f94fac
2020-12-07 16:16:07 -08:00
7629612f9f Update torch.randint documentation to include missing note (#48787)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46497

Includes a note that the returned dtype defaults to torch.int64.
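
In brief:

```python
import torch

torch.randint(0, 10, (3,)).dtype                     # torch.int64 by default
torch.randint(0, 10, (3,), dtype=torch.int32).dtype  # torch.int32 when overridden
```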

Current documentation: https://pytorch.org/docs/stable/generated/torch.randint.html?highlight=randint#torch.randint
New documentation:
![image](https://user-images.githubusercontent.com/14858254/101196939-48977d00-3616-11eb-90a5-a7b706e8505f.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48787

Test Plan: Built documentation and checked generated docs

Reviewed By: ailzhang

Differential Revision: D25339421

Pulled By: H-Huang

fbshipit-source-id: c2ecaacaeb57971fe7fba0d9d54f3c61b0fd04ce
2020-12-07 16:11:28 -08:00
f67259fe89 Fix CI by removing gen_pyi from mypy-strict.ini (#48961)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48961

Reviewed By: janeyx99

Differential Revision: D25383152

Pulled By: malfet

fbshipit-source-id: ce0226398522342256d0d701edc13955d1095a0d
2020-12-07 15:26:27 -08:00
b77ca9e829 [Docs] Add examples for new object-based c10d APIs (#43932)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43932

Adds some basic examples to the documentation for each of the newly added
object-based collectives.
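
For instance, an example of this flavor (a hedged sketch; assumes an already-initialized process group):

```python
import torch.distributed as dist

# Gather an arbitrary picklable object from every rank.
gathered = [None for _ in range(dist.get_world_size())]
dist.all_gather_object(gathered, {"rank": dist.get_rank()})
```
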
ghstack-source-id: 117965966

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23441838

fbshipit-source-id: 91344612952cfcaa71f08ccf2a2c9ed162ca9c89
2020-12-07 14:35:14 -08:00
d6b5f3ad98 Add object-based collective APIs to public docs (#48909)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48909

Adds these new APIs to the documentation
ghstack-source-id: 117965961

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D25363279

fbshipit-source-id: af6889d377f7b5f50a1a77a36ab2f700e5040150
2020-12-07 14:30:25 -08:00
88ebf6f894 Revert D25304229: [pytorch][PR] Add type annotations to torch.onnx.* modules
Test Plan: revert-hammer

Differential Revision:
D25304229 (8bc6023d7a)

Original commit changeset: b01b21ddbf86

fbshipit-source-id: bc3308176e2c70423f29f694e9db94828213e7d6
2020-12-07 11:58:03 -08:00
d307601365 Revert D24923679: Fixed einsum compatibility/performance issues (#46398)
Test Plan: revert-hammer

Differential Revision:
D24923679 (ea2a568cca)

Original commit changeset: 47e48822cd67

fbshipit-source-id: 52f17b66a4aa075d0159bdf1c98616e6098091b8
2020-12-07 11:48:36 -08:00
924b001b71 #48733 added logging statements to LLVM codegen using JIT logging (#48758)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48758

Test Plan: PYTORCH_JIT_LOG_LEVEL=">>llvm_codegen" python test/test_jit_fuser_te.py -k test_lerp

Reviewed By: ZolotukhinM

Differential Revision: D25295995

Pulled By: huiguoo

fbshipit-source-id: 8927808932ef3657da26508d0f6574c9e5fbbb25
2020-12-07 11:14:53 -08:00
dad74e58fc [WIP] Added foreach_trunc, foreahc_reciprocal, foreach_sigmoid APIs (#47385)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47385

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D24737051

Pulled By: izdeby

fbshipit-source-id: ed259d9184b2b784d8cc1983a8b85cc6cbf930ba
2020-12-07 10:47:23 -08:00
ba6511b304 pyi codegen update - remove Declarations.yaml (#48754)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48754

The goal of this PR is to kill Declarations.yaml in the pyi codegen, in favor of native_functions + the existing python object model.

**High-level design**

Since the python signatures used by the `python_arg_parser` are “supposed” to resemble the corresponding pyi type hint signatures, I re-used the existing python object model that Jiakai defined in `tools/codegen/api/python.py`. This means that the pyi codegen now reads `native_functions.yaml`, parses it into a bunch of `PythonSignatureGroup` objects, and emits corresponding method + function variants of type-hint signatures for each one, respectively into `__init__.pyi` and `_VariableFunctions.pyi`.

What makes this uglier is that pyi and the python arg parser have a number of differences in how they’re emitted. I expressed that through a `pyi` flag on the `PythonSignature` dataclass, that tells it whether or not to print itself as a pyi vs. arg_parser signature.

One thing worth noting is how pyi generates signatures differently for native / deprecated op signatures.

For native ops:
- The pyi codegen fuses functional and out variants of each op into a single signature with an optional `out` argument. Ops without an `out` variant just get an ordinary functional signature.
- Some ops that fit certain criteria also get a second “varargs” signature - basically ops with a single positional argument of type List[int].

For deprecated signatures:
- Functional and out variants are not fused - they each get their own signature entry
- There are no varargs signatures

This is currently implemented through the `signature_str()` and `signature_str_vararg()` methods on the `PythonSignature`/`PythonSignatureDeprecated` classes.  `signature_str()` knows how to print itself with/without out arguments, differently for native/deprecated ops. `signature_str_vararg()` optionally returns a vararg variant of the signature if one exists.

**Calling out the gap between python_arg_parser vs. pyi**

The two formats are notably different, so I don’t think we can expect to unify them completely. That said, I encountered a number of differences in the pyi codegen that looked wrong; I tried to call them out in the PR, to be removed later. Just as an example, looking at the `svd` signature in the python_arg_parser vs. the pyi type hint:

python_arg_parser
```
static PythonArgParser parser({
  "svd(Tensor input, bool some=True, bool compute_uv=True, *, TensorList[3] out=None)",
}, /*traceable=*/true);
```

Pyi
```
def svd(input: Tensor, some: _bool=True, compute_uv: _bool=True, *, out: Optional[Tensor]=None) -> namedtuple_U_S_V: ...
```

The two have obvious syntactic differences that we probably don’t plan on changing: the python_arg_parser doesn’t include `def` or return types, and it includes the type hint before the variable name. But the type of `out` in pyi is probably wrong, since `svd` has multiple output params. I tried to clearly call out any instances of the pyi codegen diverging in a way that looks buggy, so we can clean it up in a later PR (see the comments for details).

Another particularly ugly “bug” that I kept in to maintain byte-for-byte compatibility is the fact that the pyi codegen groups operator overloads together. It turns out that the only reason it does this (as far as I can tell) is because it tacks on an out argument to signatures that don’t have one, if ANY overloads of that op have an out variant.

E.g. consider the pyi type hints generated for `nanmedian` in `_VF.pyi`:
```
@overload
def nanmedian(input: Tensor, *, out: Optional[Tensor]=None) -> Tensor: ...
@overload
def nanmedian(input: Tensor, dim: _int, keepdim: _bool=False, *, out: Optional[Tensor]=None) -> namedtuple_values_indices: ...
@overload
def nanmedian(input: Tensor, dim: Union[str, ellipsis, None], keepdim: _bool=False, *, out: Optional[Tensor]=None) -> namedtuple_values_indices: ...
```

And the corresponding native_functions.yaml entries:
```
- func: nanmedian(Tensor self) -> Tensor
- func: nanmedian.dim(Tensor self, int dim, bool keepdim=False) -> (Tensor values, Tensor indices)
- func: nanmedian.dim_values(Tensor self, int dim, bool keepdim=False, *, Tensor(a!) values, Tensor(b!) indices) -> (Tensor(a!) values, Tensor(b!) indices)
- func: nanmedian.names_dim(Tensor self, Dimname dim, bool keepdim=False) -> (Tensor values, Tensor indices)
- func: nanmedian.names_dim_values(Tensor self, Dimname dim, bool keepdim=False, *, Tensor(a!) values, Tensor(b!) indices) -> (Tensor(a!) values, Tensor(b!) indices)
```

Signature 2 corresponds to entries 2 and 3 in native_functions, and Signature 3 corresponds to entries 4 and 5. But signature 1 has an optional out argument, even though entry 1 in native_functions.yaml has no out variant.

I’d like to delete that logic in a later PR; that will also have the added benefit of no longer requiring the pyi codegen to group overloads together. We can just operate independently on each PythonSignatureGroup.

**More detailed accounting of the changes**

Per file:

gen_python_functions.py
- `load_signatures()` can now skip deprecated signatures. Needed because pyi only includes deprecated functions, and skips their method variants (maybe we should add them in…?)
- Moved `namedtuple_fieldnames` into python.py
- `group_overloads()` can now opt to not sort the overloads (needed for byte-for-byte compatibility; pyi doesn’t sort for some reason)

python.py:
- Gave `PythonSignature`and `PythonSignatureDeprecated` a `pyi` flag that tells it whether or not to print itself in pyi vs. python_arg_parser format
- Added a `PythonReturns` dataclass, which is now a member of PythonSignature. It is only used by pyi. I found this useful because python returns need to know how to deal with named tuple returns properly. I also moved `namedtuple_fieldnames` into this file from gen_python_functions

gen_pyi.py
- Merged `get_py_torch_functions` and `get_py_variable_methods` into a single function, since they’re very similar
- Lifted out all of the pyi type hint type-mapping mess and dropped it into python.py. This required updating the mapping to deal with NativeFunction objects instead of the outputs of Declarations.yaml (this was most of the logic in `type_to_python`, `arg_to_type_hint`, and `generate_type_hints`).  `generate_type_hints` is now a small orchestration function that gathers the different signatures for each PythonSignatureGroup.
- NamedTuples are now generated by calling `PythonReturn.named_tuple()` (in `generate_named_tuples()`), rather than appending to a global list

A lot of hardcoded pyi signatures still live in `gen_pyi.py`. I didn’t look too closely into whether or not any of that can be removed as part of this PR.

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D25343802

Pulled By: bdhirsh

fbshipit-source-id: f73e99e1afef934ff41e4aca3dabf34273459a52
2020-12-07 10:39:38 -08:00
f2c3efd51f Fix generator exhaustion in SparseAdam (#47724)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47594

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47724

Reviewed By: heitorschueroff

Differential Revision: D25304131

Pulled By: albanD

fbshipit-source-id: 67c058b0836b9b4fba4f7b966396e4f3fa61f939
2020-12-07 09:38:07 -08:00
21ba48fe49 [vulkan] test_app for mobilenetV2 on vulkan api (#48924)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48924

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Differential Revision: D25365000

Pulled By: IvanKobzarev

fbshipit-source-id: 79295b5781d2494681dbb4e4a741de49ff9c058c
2020-12-07 08:44:43 -08:00
36df25334f Fix incorrect usage of CUDACachingAllocator [v2] (#48817)
Summary:
This is similar to https://github.com/pytorch/pytorch/issues/46605, where the c10::complex part of the code had not yet been merged at the time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48817

Reviewed By: malfet

Differential Revision: D25333179

Pulled By: ezyang

fbshipit-source-id: a92bdad5ad4b36bef7f050b21a59676c38e7b1fc
2020-12-07 08:27:59 -08:00
8bc6023d7a Add type annotations to torch.onnx.* modules (#48782)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45215

This is a follow up PR of https://github.com/pytorch/pytorch/issues/45258

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48782

Reviewed By: heitorschueroff

Differential Revision: D25304229

Pulled By: ezyang

fbshipit-source-id: b01b21ddbf86f908ca08173e68b81fb25851bc81
2020-12-07 08:23:02 -08:00
1febd2225b Add explicit cast to cuda_atomic_ops_test.cu (#48886)
Summary:
Should fix linking error reported in https://github.com/pytorch/pytorch/issues/48870

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48886

Reviewed By: walterddr

Differential Revision: D25356601

Pulled By: malfet

fbshipit-source-id: 25282d4606251b27d047917f096868ddb662a723
2020-12-07 08:07:10 -08:00
00f01791a3 [Caffe2] Add more error messages in ComputeBinaryBroadcastForwardDims
Summary: Add more error messages in ComputeBinaryBroadcastForwardDims

Test Plan:
buck test mode/opt caffe2/caffe2/python/operator_test:gather_ranges_op_test

buck test mode/opt caffe2/caffe2/python/operator_test:reduce_ops_test

buck test mode/opt caffe2/caffe2/python/operator_test:elementwise_ops_test

Reviewed By: BIT-silence

Differential Revision: D24949525

fbshipit-source-id: 762d913a6615a6394072f5bebbcb5cc36f0b8603
2020-12-07 07:42:49 -08:00
a39398b9e5 CUDA BF16 norm (#48806)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48806

Reviewed By: mruberry

Differential Revision: D25358465

Pulled By: ngimel

fbshipit-source-id: 1a2afd86f39e96db0754d04bf81de045b1e1235c
2020-12-06 23:41:05 -08:00
19f4c5110e Add another torch::jit::load API to load PyTorch model with shared_ptr PyTorchStreamReader input (#48802)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48802

The current torch::jit::load API only supports a unique_ptr ReadAdapterInterface input, but in some cases torch::jit::load may not be the only consumer of the reader adapter. This diff adds an overload of torch::jit::load that accepts a shared_ptr PyTorchStreamReader.

Reviewed By: malfet, houseroad

Differential Revision: D25241904

fbshipit-source-id: aa403bac9ed820cc0e94342aebfe524a1d5bf913
2020-12-06 18:09:25 -08:00
e429d05015 Fixing error: "member may not be initialized" due to constexpr at Windows (#48836)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48835
Fixes https://github.com/pytorch/pytorch/issues/48716

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48836

Reviewed By: malfet

Differential Revision: D25335829

Pulled By: datumbox

fbshipit-source-id: 807182e9afa3bb314dbb85bfcd9589a2c319a7db
2020-12-06 10:22:48 -08:00
ea2a568cca Fixed einsum compatibility/performance issues (#46398) (#47860)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47860

This PR makes torch.einsum compatible with numpy.einsum except for the sublist input option, as requested here https://github.com/pytorch/pytorch/issues/21412. It also fixes two performance issues linked below and adds a check for reducing to torch.dot instead of torch.bmm, which is faster in some cases.

fixes #45854, #37628, #30194, #15671

fixes #41467 with benchmark below
```python
import torch
from torch.utils.benchmark import Timer

a = torch.randn(10000, 100, 101, device='cuda')
b = torch.randn(10000, 101, 3, device='cuda')

c = torch.randn(10000, 100, 1, device='cuda')
d = torch.randn(10000, 100, 1, 3, device='cuda')

print(Timer(
    stmt='torch.einsum("bij,bjf->bif", a, b)',
    globals={'a': a, 'b': b}
).blocked_autorange())

print()

print(Timer(
    stmt='torch.einsum("bic,bicf->bif", c, d)',
    globals={'c': c, 'd': d}
).blocked_autorange())
```
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7fa37c413850>
torch.einsum("bij,bjf->bif", a, b)
  Median: 4.53 ms
  IQR:    0.00 ms (4.53 to 4.53)
  45 measurements, 1 runs per measurement, 1 thread

<torch.utils.benchmark.utils.common.Measurement object at 0x7fa37c413700>
torch.einsum("bic,bicf->bif", c, d)
  Median: 63.86 us
  IQR:    1.52 us (63.22 to 64.73)
  4 measurements, 1000 runs per measurement, 1 thread
```

fixes #32591 with benchmark below
```python
import torch
from torch.utils.benchmark import Timer

a = torch.rand(1, 1, 16, 2, 16, 2, 16, 2, 2, 2, 2, device="cuda")
b = torch.rand(729, 1, 1, 2, 1, 2, 1, 2, 2, 2, 2, device="cuda")

print(Timer(
    stmt='(a * b).sum(dim = (-3, -2, -1))',
    globals={'a': a, 'b': b}
).blocked_autorange())

print()

print(Timer(
    stmt='torch.einsum("...ijk, ...ijk -> ...", a, b)',
    globals={'a': a, 'b': b}
).blocked_autorange())
```
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7efe0de28850>
(a * b).sum(dim = (-3, -2, -1))
  Median: 17.86 ms
  2 measurements, 10 runs per measurement, 1 thread

<torch.utils.benchmark.utils.common.Measurement object at 0x7efe0de286a0>
torch.einsum("...ijk, ...ijk -> ...", a, b)
  Median: 296.11 us
  IQR:    1.38 us (295.42 to 296.81)
  662 measurements, 1 runs per measurement, 1 thread
```

TODO

- [x] add support for ellipsis broadcasting
- [x] fix corner case issues with sumproduct_pair
- [x] update docs and add more comments
- [x] add tests for error cases

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24923679

Pulled By: heitorschueroff

fbshipit-source-id: 47e48822cd67bbcdadbdfc5ffa25ee8ba4c9620a
2020-12-06 08:02:37 -08:00
17f53bffef [Gradient Compression] Replace the key of error_dict in PowerSGD state with bucket index (#48867)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48867

Previously the key of error_dict was the hashcode of the tensor; it is now replaced with the bucket index.

The bucket index has a few advantages over the tensor's hashcode:
1) The error dict in the state never removes any key, so if the bucket rebuild process occurred frequently, the size of the error dict could grow. For now, such rebuilds are infrequent, so this is probably fine.

2) An integer index is more readable than a hashcode, and it can facilitate debugging.
If the user wants to debug the tensor values, usually only a specific bucket needs to be targeted. It's easy to specify such a condition (e.g., bucket_index = 0), but it's hard to specify a hashcode in advance, as it can only be determined at runtime.

Note that sometimes the buckets can be rebuilt in the forward pass. In this case, the shape of the bucket with the same index will not be consistent with the one in the previous iteration, and hence the error tensor will be re-initialized as a zero tensor of the new shape. Therefore, `and state.error_dict[bucket_index].shape[0] == padded_total_length` is added to the condition for applying the local error from the previous iteration.

Deleted the arg type of `dist._GradBucket` in powerSGD_hook.py, because somehow test_run_mypy - TestTypeHints failed:
AssertionError: mypy failed: torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py:128: error: "_GradBucket" has no attribute "get_index"  [attr-defined]

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117951402

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

Reviewed By: rohan-varma

Differential Revision: D25346347

fbshipit-source-id: 8348aa103002ec1c69e3ae759504b431140b3b0d
2020-12-05 23:53:27 -08:00
2e600feda9 [numpy] torch.sinh: promote integer inputs to float (#48644)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515
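
The change in brief:

```python
import torch

torch.sinh(torch.arange(3)).dtype  # integer input now promotes to torch.float32
```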

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48644

Reviewed By: heitorschueroff

Differential Revision: D25298436

Pulled By: mruberry

fbshipit-source-id: 675ad8e3c34e61fbbab77eca15048df09b09c1ed
2020-12-05 22:04:31 -08:00
195ab5e864 remove non-default settings in fuser.py (#48862)
Summary:
I've noticed we sometimes set `_jit_set_num_profiled_runs` to 2 (which isn't our default) and sometimes we don't. We also set `_jit_set_bailout_depth` to 20, which **is** our default. I suggest we remove this logic altogether.
I did a quick run to see if there's any impact and, thankfully, the numbers seem consistent, but we should avoid testing configurations that aren't the default or aren't expected to become the default.

 numactl -C 3 python -m fastrnns.bench --fuser=te --executor=profiling

non-defaults:

```
Namespace(cnns=None, cuda_pointwise_block_count=None, cuda_pointwise_block_size=None, cuda_pointwise_loop_level=None, device='cuda', executor='profiling', fuser='te', group=['cnns', 'rnns'], hiddenSize=512, inputSize=512, miniBatch=64, nloops=100, numLayers=1, print_json=None, rnns=None, sep=' ', seqLength=100, variable_lstms=False, warmup=10)
Benchmarking LSTMs...
            name          avg_fwd          std_fwd         info_fwd          avg_bwd          std_bwd         info_bwd
           cudnn            5.057          0.06287             None            7.322          0.07404             None
            aten            5.602          0.06303             None            13.64           0.4078             None
             jit            7.019          0.07995             None            13.77            0.554             None
      jit_premul            5.324          0.06203             None            12.01           0.2996             None
 jit_premul_bias            5.148          0.08061             None            11.62           0.4104             None
      jit_simple             6.69           0.2317             None            13.37           0.3791             None
  jit_multilayer            7.006            0.251             None            13.67           0.2239             None
              py            19.05           0.1119             None            28.28           0.6346             None

Benchmarking ResNets...
            name          avg_fwd          std_fwd         info_fwd          avg_bwd          std_bwd         info_bwd
        resnet18            8.712          0.01628             None            19.93          0.03512             None
    resnet18_jit            8.688          0.01374             None            19.79          0.07518             None
        resnet50            31.04          0.08049             None            66.44          0.08187             None
    resnet50_jit            31.11          0.07171             None            66.45          0.09157             None
```

defaults:
```
Namespace(cnns=None, cuda_pointwise_block_count=None, cuda_pointwise_block_size=None, cuda_pointwise_loop_level=None, device='cuda', executor='profiling', fuser='te', group=['cnns', 'rnns'], hiddenSize=512, inputSize=512, miniBatch=64, nloops=100, numLayers=1, print_json=None, rnns=None, sep=' ', seqLength=100, variable_lstms=False, warmup=10)
Benchmarking LSTMs...
            name          avg_fwd          std_fwd         info_fwd          avg_bwd          std_bwd         info_bwd
           cudnn            5.086            0.115             None            7.394           0.1743             None
            aten            5.611           0.2559             None            13.54            0.387             None
             jit            7.062           0.3358             None            13.24           0.3688             None
      jit_premul            5.379           0.2086             None            11.57           0.3987             None
 jit_premul_bias            5.202           0.2127             None            11.13          0.06748             None
      jit_simple            6.648          0.05794             None            12.84           0.3047             None
  jit_multilayer            6.964           0.1104             None            13.24           0.3283             None
              py            19.14          0.09959             None            28.17           0.4946             None

Benchmarking ResNets...
            name          avg_fwd          std_fwd         info_fwd          avg_bwd          std_bwd         info_bwd
        resnet18            8.713          0.01563             None            19.93          0.02759             None
    resnet18_jit            8.697          0.01792             None            19.78          0.06916             None
        resnet50            31.14          0.07431             None            66.57          0.07418             None
    resnet50_jit            31.21           0.0677             None            66.56          0.08655             None

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48862

Reviewed By: bertmaher

Differential Revision: D25342097

Pulled By: Krovatkin

fbshipit-source-id: 8d2f72c2770793ec8cecee9dfab9aaaf2e1ad2b1
2020-12-05 20:58:39 -08:00
85121a7a0f Added CUDA support for complex input for torch.cholesky_solve (#47047)
Summary:
`torch.cholesky_solve` now works for complex inputs on GPU.
I moved the existing tests to `test_linalg.py` and modified them to test complex and float32 dtypes.
Differentiation also works correctly with complex inputs now.

Ref. https://github.com/pytorch/pytorch/issues/33152
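
A usage sketch on GPU (assumes complex Cholesky factorization is already available on CUDA, per the tracking issue above):

```python
import torch

A = torch.randn(3, 3, dtype=torch.complex64, device="cuda")
A = A @ A.conj().t() + 3 * torch.eye(3, dtype=torch.complex64, device="cuda")  # Hermitian PD
L = torch.cholesky(A)           # lower-triangular Cholesky factor
b = torch.randn(3, 2, dtype=torch.complex64, device="cuda")
x = torch.cholesky_solve(b, L)  # solves A @ x = b
```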

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47047

Reviewed By: ngimel

Differential Revision: D24730020

Pulled By: mruberry

fbshipit-source-id: 95402da5789c56e5a682019790985207fa28fa1f
2020-12-05 20:18:30 -08:00
5de22d3f69 Removes redundant method_test entries (#48828)
Summary:
Now that Lilyjjo's [stack of OpInfo updates](https://github.com/pytorch/pytorch/pull/48627) has landed, we can port method_test entries to OpInfos. This PR doesn't port any method_test entries, but it removes redundant entries. These entries previously tested both multi-dim and zero-dim tensors, so a new zero-dim tensor input is added to UnaryUfuncInfo's sample inputs.

To recap, this PR:

- removes method_test() entries that are redundant with OpInfo entries
- adds a new sample input to unary ufunc OpInfos that tests them on 0d tensors

cc kshitij12345 as an fyi. Going forward we should have a goal of not only porting all the MathTestMeta objects to use the OpInfo pattern but also all the current method_test entries. For each entry the function needs to be added as an OpInfo and the inputs need to be added as sample inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48828

Reviewed By: malfet

Differential Revision: D25336071

Pulled By: mruberry

fbshipit-source-id: 6b3f6c347195233d6b8ad57e2be68fd772663d9b
2020-12-05 19:25:29 -08:00
0185a05ceb Revert D25338250: [pytorch][PR] [BE] Fix signed-unsigned warnings
Test Plan: revert-hammer

Differential Revision:
D25338250 (6317e0b2f1)

Original commit changeset: e840618b113b

fbshipit-source-id: dbecb068892dc118f257fe5c50692ede2b2462ca
2020-12-05 18:08:22 -08:00
ae9f39eb58 [FX][1/2] Make docstrings pretty when rendered (#48738)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48738

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D25280867

Pulled By: jamesr66a

fbshipit-source-id: d08641c19a6c69b4042389c800a48e699f0be628
2020-12-05 17:23:40 -08:00
0fb58d76a1 Support ArgMin in c2_pt_converter
Summary:
+ Add ArgMin support to the Caffe2-to-PyTorch converter
+ Use hypothesis to parameterize different test conditions

Test Plan: buck test //caffe2/torch/fb/model_transform/c2_convert:c2_pt_converter_test

Reviewed By: houseroad

Differential Revision: D25016203

fbshipit-source-id: 94489fcf1ed3183ec96f9796a5b4fb348fbde5bc
2020-12-05 16:35:34 -08:00
251398acca Force a sync on non-CPU tensors for the benchmark to reflect the timing accurately. (#48856)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48856

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D25339803

Pulled By: AshkanAliabadi

fbshipit-source-id: fdfd9a0e0cc37245d7671419f492e445396fbdb8
2020-12-05 10:47:44 -08:00
0923d19601 fx quant: add types to quantization_patterns (#48851)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48851

Adding typing to improve readability.

Note: this uncovered a few missing return statements; we should
fix those before landing.

Test Plan:
```
mypy torch/quantization/
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25338644

fbshipit-source-id: 0ac4405db05fdd2737bc3415217bc1937c2db684
2020-12-05 08:47:18 -08:00
fa5f7d87bf fx quant: add typing for fuser (#48844)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48844

Add types to function I/O for `Fuser` to improve readability

Test Plan:
```
mypy torch/quantization/
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25337314

fbshipit-source-id: e5074d71c7834f24975169d36bf49357e53650ff
2020-12-05 08:44:32 -08:00
63a71a82cf [ROCm] add 3.10 to nightly builds (#48866)
Summary:
Depends on https://github.com/pytorch/builder/pull/603.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48866

Reviewed By: malfet, janeyx99

Differential Revision: D25345895

Pulled By: walterddr

fbshipit-source-id: 5d1c754b36fa7ebd60832af58cbcbed2bc0da3bd
2020-12-05 06:56:17 -08:00
799b700ada add a unit test for lack of devices (#48858)
Summary:
add a unit test for the situation where devices do not have enough memory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48858

Reviewed By: malfet, gcatron

Differential Revision: D25341254

Pulled By: scottxu0730

fbshipit-source-id: c0524c22717b6c8afd67f5b0ad0f1851b973e4b7
2020-12-05 06:09:04 -08:00
5180caeeb4 Remove deprecated spectral ops from torch namespace (#48594)
Summary:
Ref https://github.com/pytorch/pytorch/issues/42175

This removes the 4 deprecated spectral functions: `torch.{fft,rfft,ifft,irfft}`. The `torch.fft` module is also now imported by default.

The actual `at::native` functions are still used in `torch.stft`, so they can't be fully removed yet, but they will be once https://github.com/pytorch/pytorch/issues/47601 has been merged.
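
After this change the name resolves to the new module without an explicit import:

```python
import torch

t = torch.randn(8)
freq = torch.fft.rfft(t)            # torch.fft now refers to the module
recon = torch.fft.irfft(freq, n=8)  # round-trips back to the original length
```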

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48594

Reviewed By: heitorschueroff

Differential Revision: D25298929

Pulled By: mruberry

fbshipit-source-id: e36737fe8192fcd16f7e6310f8b49de478e63bf0
2020-12-05 04:12:32 -08:00
7439bc4dd6 [Gradient Compression] Add an index field to GradBucket for PowerSGD (#48757)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48757

Add an index field to GradBucket, so error_dict is keyed by this index instead of the hashcode of the input tensor. The replacement will be done in a separate diff, as the definition of this new method somehow couldn't be recognized in the OSS version.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117939208

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

Reviewed By: rohan-varma

Differential Revision: D25288496

fbshipit-source-id: 6f71977809690a0367e408bd59601ee62c9c03ea
2020-12-05 01:39:58 -08:00
6317e0b2f1 [BE] Fix signed-unsigned warnings (#48848)
Summary:
Switch to range loops where possible.
Replace `ptrdiff_t` (signed type) with `size_t` (unsigned type).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48848

Reviewed By: walterddr

Differential Revision: D25338250

Pulled By: malfet

fbshipit-source-id: e840618b113b8bc0d8bb067c2fdf06e3ec9233d4
2020-12-04 23:15:28 -08:00
55b93735ac [PyTorch] Save refcount decrements in StaticRuntime::deallocate_registers (#48859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48859

Code comment should explain what's going on. If not, please request changes.
ghstack-source-id: 117889942

Test Plan: Internal benchmarks

Reviewed By: hlu1

Differential Revision: D25288842

fbshipit-source-id: 6bddebb99c4744e2f7aceb279fdf995821404606
2020-12-04 21:47:00 -08:00
af30a89068 [caffe2][a10] Remove unreferenced local variable e (#48601)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48601

Fix this spurious warning:
```
caffe2\aten\src\aten\core\ivalue_inl.h(412): warning C4101: 'e': unreferenced local variable
```

Test Plan: Local build & continuous integration

Reviewed By: gmagogsfm

Differential Revision: D25194281

fbshipit-source-id: 3ba469d1cbff6f16394b95c4c33d95efcaea5e3e
2020-12-04 21:14:25 -08:00
f0f315c33b [PyTorch] Inline RecordFunctionCallback::shouldRun (#48286)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48286

RecordFunction initialization is a hot path. shouldRun often does little enough work that the function prologue takes a significant proportion of its time. So, this diff forces it to be inline.
ghstack-source-id: 117892387

Test Plan: FB-internal benchmarks

Reviewed By: ezyang

Differential Revision: D25108879

fbshipit-source-id: 7121413e714c5ca22c8bf10c1d2535a878c15aec
2020-12-04 20:48:39 -08:00
02d89f9f1d scatter_object_list API for c10d (#43930)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43930

Closes #23232. As part of addressing #23232, this PR adds support for scatter_object_list which is an API to scatter arbitrary picklable objects to all the other ranks.

The implementation approach follows a similar approach as https://github.com/pytorch/pytorch/pull/42189. The result of the `scatter` is stored as the first element of `scatter_object_output_list`, and the src rank is expected to provide an input list `scatter_object_input_list` which contains the objects to scatter.

Note that this API requires 1 broadcast and 2 scatters. This is because we must communicate the maximum object size to be scattered, which only the src rank knows about. After that, we also need to communicate the objects themselves as well as their true sizes.

Note that the API is designed to match the tensor-based collectives, except that it does not support async_op. For now, it is a blocking call. If we see demand for async_op, we will have to make more progress on merging work/future to support this.

It only works for Gloo because NCCL doesn't support scatter.
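
A hedged usage sketch (assumes an already-initialized Gloo process group):

```python
import torch.distributed as dist

output = [None]  # the result lands in output[0]
if dist.get_rank() == 0:
    inputs = [{"payload": r} for r in range(dist.get_world_size())]
else:
    inputs = None  # non-src ranks pass None
dist.scatter_object_list(output, inputs, src=0)
print(output[0])  # this rank's scattered object
```
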
ghstack-source-id: 117904065

Reviewed By: mrshenli

Differential Revision: D23430686

fbshipit-source-id: f033b89cd82dadd194f2b036312a98423449c26b
2020-12-04 18:55:57 -08:00
a3298c2f64 Implement JIT serialization of ProcessGroup (#48544)
Summary:
This diff enables JIT serialization of `ProcessGroup`, including both base `ProcessGroup` class and derived classes like `ProcessGroupNCCL`.

If a `ProcessGroup` is created via high-level APIs like `dist_c10d.frontend().new_process_group_helper()`, they are automatically serializable. If a `ProcessGroup` is created via its derived class TorchBind APIs like `dist_c10d.ProcessGroupNCCL()`, then it has to be given a name and registered with `dist_c10d.frontend().register_process_group_name` to be uniquely identifiable and serializable.

* Fixed a minor bug in the new dist_c10d frontend, which failed to check whether a process group was already in use
* Fixed an issue where `test_jit_c10d.py` wasn't really run due to a configuration bug. The tests now run as slow tests (requires a ci-all/* branch)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48544

Reviewed By: wanchaol

Differential Revision: D25298309

Pulled By: gmagogsfm

fbshipit-source-id: ed27ce37373c88277dc0c78704c48d4c19d46d46
2020-12-04 18:44:38 -08:00
3f10518def [PyTorch] Add VariableVersion&& overload for TensorImpl::shallow_copy_and_detach (#48681)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48681

This should reduce reference counting traffic when creating views.

The code duplication here is unfortunate and I'm open to suggestions on how to reduce it. It's especially regrettable that we create a footgun for subclasses of TensorImpl: they can accidentally override only one of the two overloads and get confusing behavior.
ghstack-source-id: 117896685

Test Plan: internal benchmarks

Reviewed By: ezyang

Differential Revision: D25259741

fbshipit-source-id: 55f99b16b50f9791fdab85cbc81d7cd14e31c4cf
2020-12-04 18:41:43 -08:00
9e10e3b74f [PyTorch] Move TensorImpl::shallow_copy_and_detach to .cpp file (#48680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48680

It seems a bit long to put into the header (and is virtual anyway).
ghstack-source-id: 117894350

Test Plan: CI

Reviewed By: bhosmer

Differential Revision: D25259848

fbshipit-source-id: e3eed1f2483fc3c1ff51459159bf3bfed9d6f363
2020-12-04 18:36:56 -08:00
092e52a4da [fx]added prototype of to_folder (#47544)
Summary:
Given an `FxModule foo`, you can call `foo.to_folder('foo_folder', 'Foo')` to dump the current FX module into runnable Python code.

That is
```
foo = <fxModule>
foo.to_folder('bar', 'Foo')
from bar import Foo
foo2 = Foo()

# for all x: foo2(x) == foo(x)
```

This has several use cases, largely lifted from jamesr66a's doc here: https://fb.quip.com/U6KHAFaP2cWa (FB-internal).

1. As we apply more heavy-weight function transformations with FX, figuring out what's going on can be quite a difficult experience. In particular, things that can typically be used for debugging (like `print` or `import pdb; pdb.set_trace()`) no longer work. This is particularly necessary if you're using an FX transform like `grad` or `vmap`. With this, you simply open up the dumped file and add `print`/`pdb` statements wherever you'd like.

2. This also provides an immense amount of user control. Some potential use-cases:
-  Let's say an existing FX transform has some bug, or generates suboptimal code. Instead of needing to modify that FX transform, writing another FX pass that fixes the suboptimal code, or simply giving up on FX, users can work around it by simply modifying the resulting code themselves.
- This allows users to check in their FX modules into source control.
- You could even imagine using this as part of some code-gen type workflow, where you write a function, `vmap` it to get the function you actually want, and then simply copy the output of the `vmap` function without needing FX at all in the final code.

An example:
```python
class Test(nn.Module):
    def __init__(self):
        super(Test, self).__init__()
        self.W = torch.nn.Parameter(torch.randn(2))
        self.linear = nn.Linear(2, 2)
        self.attr = torch.randn(2)
        self.attr2 = torch.randn(2)

    def forward(self, x):
        return self.linear(self.W + (self.attr + self.attr2) + x)

mod = fx.symbolic_trace(Test())
mod.to_folder('foo', 'Foo')
```
results in
```python
import torch
class Foo(torch.nn.Module):
    def __init__(self):
        super().__init__()
        state_dict = torch.load('foo/state_dict.pt')
        self.linear = torch.load('foo/linear.pt') # Linear(in_features=2, out_features=2, bias=True)
        self.__tensor_constant0 = state_dict['__tensor_constant0']
        self.W = torch.nn.Parameter(state_dict['W'])

    def forward(self, x):
        w = self.W
        tensor_constant0 = self.__tensor_constant0
        add_1 = w + tensor_constant0
        add_2 = add_1 + x
        linear_1 = self.linear(add_2)
        return linear_1
```
Some current issues:
1. How do you actually ... save things like modules or parameters? I don't think FX is in the business of tracking initializations and such. Thus, the only way I see to do it is to dump the parameters/modules as blobs, and then load them in the generated initialization. This is a somewhat subpar user experience, and perhaps rules it out for some use cases (i.e., you would need to check the blobs into source control to save the model).

2. Currently, the only "atomic" modules we have are those in `torch.nn`. However, if we want to allow flexibility in this, and for example, allow "atomic" modules that are user-defined, then it's not clear how to allow those to be dumped in a way that we can then load elsewhere.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47544

Reviewed By: jamesr66a

Differential Revision: D25232917

Pulled By: Chillee

fbshipit-source-id: fd2b61a5f40e614fc94256a2957ed1d57fcf5492
2020-12-04 18:33:27 -08:00
03abd81b8d [ROCm] Enable skipped distributed global tests (#48023)
Summary:
The PR https://github.com/pytorch/pytorch/issues/47898 fixes the global tests. Hence enabling the tests.

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48023

Reviewed By: malfet, H-Huang

Differential Revision: D25347289

Pulled By: rohan-varma

fbshipit-source-id: 2b519a3046eae1cf1bfba98a125c09b4a6b01fde
2020-12-04 18:16:02 -08:00
9bb87fa58b [te] Fix spacing in graph dump (#48829)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48829

The first line was a run-on.
ghstack-source-id: 117845927

Test Plan: visual inspection

Reviewed By: ZolotukhinM

Differential Revision: D25326136

fbshipit-source-id: 3f46ad20aee5ed523b64d852d382eb06f4d60369
2020-12-04 18:10:44 -08:00
2d07d5b50a [te] Don't fuse integer fmod or remainder (#48700)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48700

fmod and remainder on int tensors will raise ZeroDivisionError if their divisors are 0.  I don't think we should try to generate code that raises exceptions.  If at some point we really wanted to fuse these, I might lean towards calling a C++ helper function from the generated code.
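For reference, the eager-mode behavior the generated code would otherwise have to mimic (a small illustration; the exact exception type may differ by device):

```python
import torch

a = torch.tensor([4, 7, 9])
b = torch.tensor([2, 0, 3])
torch.fmod(a, b)  # raises on integer tensors with a zero divisor,
                  # which is why the fuser now skips integer fmod/remainder
```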
ghstack-source-id: 117845642

Test Plan: `buck test //caffe2/test:jit -- test_binary_ops`

Reviewed By: eellison

Differential Revision: D25265792

fbshipit-source-id: 0be56ba3feafa1dbf3c37f6bb8c1550cb6891e6d
2020-12-04 18:02:29 -08:00
5654fc8edd Revert D25293474: [pytorch][PR] Server connects to its listen socket addr
Test Plan: revert-hammer

Differential Revision:
D25293474 (7c9ba62130)

Original commit changeset: 15f75dab48a4

fbshipit-source-id: 71ca136f2aa3204ad49f76c604f51c477cba270a
2020-12-04 17:08:03 -08:00
4b8d965f18 Revert D25292656: [pytorch][PR] Support torch.distributed.irecv(src=None, ...)
Test Plan: revert-hammer

Differential Revision:
D25292656 (4eb4db7c30)

Original commit changeset: beb018ba0b67

fbshipit-source-id: 5a13055e50ed90731fee431e81c09a1871f6cc03
2020-12-04 16:57:06 -08:00
212ec07cb7 Support torchbind as attribute in torch.fx symbolic tracing (#48732)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48732

add support for ScriptObject as attributes in symbolic trace.

Test Plan: OSS CI

Reviewed By: jamesr66a

Differential Revision: D25116185

fbshipit-source-id: c61993c84279fcb3c91f1d44fb952a8d80d0e552
2020-12-04 16:21:44 -08:00
b9cd774e29 Get rid of printf in cuda fuser debugPrint() (#46994)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46994

Reviewed By: raghuramank100, mruberry

Differential Revision: D25342954

Pulled By: malfet

fbshipit-source-id: 549b5b072f7f70877261a155e989a21072ec49d8
2020-12-04 15:13:26 -08:00
ca3ae7dc73 [DI] create a new key for threadLocalDebugInfo (#48762)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48762

In distributed inference, we want to use a new debug info type to pass some information to operators. This adds a new key to threadLocalDebugInfo to unblock that development.

Test Plan: Only adds a new key. Should have no effect on the current build.

Reviewed By: dzhulgakov

Differential Revision: D25291242

fbshipit-source-id: c71565ff7a38cc514d7cd65246c7d5f6b2ce3b8b
2020-12-04 15:05:45 -08:00
0f9823d888 [PyTorch] Save some space in ProcessedNode (#48861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48861

`std::function` already has an empty state; no need to wrap
it in `c10::Optional`.
ghstack-source-id: 117891382

Reviewed By: hlu1

Differential Revision: D25296912

fbshipit-source-id: 8291bcf11735d49db17415b5de915591ee65f781
2020-12-04 14:42:20 -08:00
142b21fd44 Add SparseLengthsSum4BitRowwiseSparse in c2_pt_converter (#48240)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48240

Adds support for converting the SparseLengthsSum4BitRowwiseSparse operator from caffe2 to pytorch as a part of c2_pt_converter

Test Plan:
Added a unit test:

buck test //caffe2/torch/fb/model_transform/c2_convert:c2_pt_converter_test

Tests passed:
https://our.intern.facebook.com/intern/testinfra/testrun/2251799856412296

Reviewed By: houseroad

Differential Revision: D25067833

fbshipit-source-id: 45cbc331ca35bee27e083714e65a1e87a2a2d2e0
2020-12-04 14:16:25 -08:00
4eb4db7c30 Support torch.distributed.irecv(src=None, ...) (#47137)
Summary:
Calling torch.distributed.irecv(src=None) fails with "The global rank None is not part of the group". This change calls recv_anysource if src is None. Tested locally with the MPI backend.
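A minimal usage sketch of the new behavior (assumes an initialized process group):

```python
import torch
import torch.distributed as dist

# assumes dist.init_process_group("mpi") has been called on every rank
buf = torch.zeros(4)
req = dist.irecv(buf, src=None)  # now dispatches to recv_anysource
req.wait()                       # buf holds data from whichever rank sent first
```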

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47137

Reviewed By: heitorschueroff

Differential Revision: D25292656

fbshipit-source-id: beb018ba0b676924aeaabeb4a4d6acf96e4a1926
2020-12-04 13:56:36 -08:00
e1f9542d00 Revert D23898398: [Mask R-CNN]Add Int8 AABB Generate proposals Op
Test Plan: revert-hammer

Differential Revision:
D23898398 (714c7020ee)

Original commit changeset: fb5f6d6ed8a5

fbshipit-source-id: 05284ff4db6c05fff3f4a6bb80f665e87c0bf085
2020-12-04 13:34:55 -08:00
7c9ba62130 Server connects to its listen socket addr (#46801)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46800

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46801

Reviewed By: heitorschueroff

Differential Revision: D25293474

fbshipit-source-id: 15f75dab48a4360645436360c216885cf3bd5667
2020-12-04 13:21:57 -08:00
42e6951e62 Remove save_state_warning in LambdaLR (#46813)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46405, https://github.com/pytorch/pytorch/issues/43352

I updated the docstring in the local file (function level comments). Do I also need to edit somewhere else or recompile docstrings?

Also, though I didn't change any types here, how is typing documentation (for IDE type checking) generated / used?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46813

Reviewed By: ezyang

Differential Revision: D24923112

Pulled By: vincentqb

fbshipit-source-id: be7818e0d4593bfc5d74023b9c361ac2a538589a
2020-12-04 13:19:59 -08:00
714c7020ee [Mask R-CNN]Add Int8 AABB Generate proposals Op
Summary: Adds support for additional Eigen Utils for custom type defs.

Reviewed By: vkuzo

Differential Revision: D23898398

fbshipit-source-id: fb5f6d6ed8a56e6244f4f0cb419140b365ff7a82
2020-12-04 13:00:34 -08:00
ba3962f5f0 [Onnxifi] Warmup cache of output shapes (#48346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48346

Onnxifi now accepts output shape info for all possible batch sizes. This is used to avoid doing shape inference inside `OnnxifiOp::extractOutputBatchSizes()`.

FB:
In this diff we try to pre-calculate output shapes for all possible batch sizes inside `PredictorContainer` where we supposedly have enough data to do so. This data is then passed down to OnnxifiOp.

Here is the dependency graph that I built manually trying to understand the entire flow.
https://pxl.cl/1rQRv

Test Plan:
Strobelight data https://fburl.com/strobelight/jlhhgt21 shows that `OnnxifiOp::RunOnDevice()` now takes only 2.17% of CPU instead of ~20% CPU with the current implementation.

Also, the current implementation takes dozens of milliseconds according to ipiszy:
> After adding more logs, I found each shapeinference call actually takes 40~50ms.

I also temporarily added time measurements for `OnnxifiOp::extractOutputBatchSizes()`. The new implementation typically consumes 1 to 4 microseconds, and, when data for the current batch size is not yet present in `output_reshape_info_`, it takes 20-40 microseconds, which is still much better than the current implementation.

AF canary https://www.internalfb.com/intern/ads/canary/431357944274985799
AI canary https://www.internalfb.com/intern/ads/canary/431365503038313840

Verifying using test tier https://pxl.cl/1sZ4S

Reviewed By: yinghai, ipiszy

Differential Revision: D25047110

fbshipit-source-id: 872dc1578a1e8e7c3ade5f5e2711e77ba290a671
2020-12-04 12:54:41 -08:00
0a42003f8f [TensorExpr Fuser] Handle fusing values with un-profiled uses (#48689)
Summary:
Copying myself from the code comments:

A value can be profiled with differently typed uses.
This can occur from:
- having a use which is not executed, so the type will be
TensorType::get()
- control-flow that depends on tensor type:
  if x.size() == 2 op(x) else op(x)
- mutation of the value on a field represented in the tensor type
  op(x); x.resize_([...]); op(x)

The most common case today with num_profiles = 1 is from the first case. Here we can just ignore non-profiled uses, and choose any of the profiled uses. Because we guard all tensor types in the runtime, even if we set a Value to have a profiled type from one use and then execute a use with a different profiled type, we will still be correct. In the future we could consider unifying the types of uses, or adding a type refinement node so uses can have the correct corresponding type.

Fix for https://github.com/pytorch/pytorch/issues/48043. I think there's probably too much context required for that to be a good bootcamp task...

There was an observed missed fusion opportunity in detectron2 because of this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48689

Reviewed By: ngimel

Differential Revision: D25278791

Pulled By: eellison

fbshipit-source-id: 443e5e1254446a31cc895a275b5f1ac3798c327f
2020-12-04 12:48:10 -08:00
31808dcdd8 [RELAND] [CUDA graphs] Make CUDAGeneratorImpl capturable (ci-all edition) (#48694)
Summary:
Resubmission of https://github.com/pytorch/pytorch/pull/47989 with attempted fix for the unexpected context creation that caused revert (https://github.com/pytorch/pytorch/pull/47989#issuecomment-736689145).

Submitting from a ci-all branch because the failing test isn't public.

Diffs relative to master should be the same as https://github.com/pytorch/pytorch/pull/47989 's approved diffs, aside from the fix itself a5c80f63d3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48694

Reviewed By: mruberry

Differential Revision: D25291431

Pulled By: ngimel

fbshipit-source-id: 8c27f85c64eecaf1f5cb925020fa6d38a07ff095
2020-12-04 12:35:46 -08:00
9af627fda1 fix some typos in the fx ir test_fx_experimental (#48847)
Summary:
fix some typos in test_fx_experimental.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48847

Reviewed By: malfet, gcatron

Differential Revision: D25339391

Pulled By: scottxu0730

fbshipit-source-id: 388d9da94259d2b306d59f3f4a167e486ac06d60
2020-12-04 12:18:36 -08:00
a5fb12d168 RRef proxy support for ScriptModule methods (#48339)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48339

Closes https://github.com/pytorch/pytorch/issues/48294
https://github.com/pytorch/pytorch/pull/48293 added creation and transfer of ScriptModule over RPC in Python, but the RRef proxy functions did not work with ScriptModule methods.

This PR makes the above work with ScriptModule as per a discussion with mrshenli:
1) We remove the `hasattr()` check and just let Python throw the exception as it would when accessing the py function with `getattr`
2) We condition on `issubclass(type, ScriptModule)` when checking if it is wrapped with async_function, because `ScriptModule` does not have getattr implemented (its forward is not a Python function, but a TorchScript-specific function):
```
torch/jit/_script.py", line 229, in __get__
    return self.__getattr__("forward")  # type: ignore
AttributeError: '_CachedForward' object has no attribute '__getattr__'
```
ghstack-source-id: 117631795

Test Plan: Modified ut

Reviewed By: wanchaol

Differential Revision: D25134423

fbshipit-source-id: 918ca88891c7b0531325f046b61f28947575cff0
2020-12-04 11:33:16 -08:00
fadec77c30 [quant][fx][graphmode] Renable torchvision test (#48602)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48602

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D25224917

fbshipit-source-id: efc73f425253c4eb7ae51064b6760416097f0437
2020-12-04 10:13:38 -08:00
07d185ef05 [ROCm] add 3.10 docker image (#48791)
Summary:
Add a ROCm 3.10 docker image for CI.  Keep the 3.9 image and remove the 3.8 image.  Plan is to keep two ROCm versions at a time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48791

Reviewed By: janeyx99

Differential Revision: D25307102

Pulled By: walterddr

fbshipit-source-id: 88371aafd07db7c5d0dd210759bb7c3aac1f0187
2020-12-04 08:37:31 -08:00
bc2352e8c3 [NNC] Complete SimpleIREvaluator support for bitwise ops (#48053) (#48179)
Summary:
Add missing types for bitwise_ops in `SimpleIREvaluator`

This is the first part of fixes for issue https://github.com/pytorch/pytorch/issues/48053.
- The original implementation of bitwise_ops supported only int operands; the fix adds support for all integral types supported by the IR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48179

Test Plan: `python test/test_jit_fuser_te.py TestTEFuser.test_bitwise_ops`

Reviewed By: ZolotukhinM

Differential Revision: D25126944

Pulled By: penguinwu

fbshipit-source-id: 04dc7fc00c93b2bf1bd9f9cd09f7252357840b85
2020-12-04 08:10:18 -08:00
3a0d4240c3 Fix broadcast_all crashing on Tensor-likes (#48169)
Summary:
This ensures Tensor-likes that implement `__torch_function__` are properly handled by `torch.distributions.utils.broadcast_all`.  See Issue https://github.com/pytorch/pytorch/issues/37141 .

In this implementation, Numbers will not be cast to the dtype of Tensor-likes.
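For reference, the basic contract of `broadcast_all` with plain tensors; the fix extends the same behavior to Tensor-likes implementing `__torch_function__`:

```python
import torch
from torch.distributions.utils import broadcast_all

a, b = broadcast_all(torch.randn(3, 1), torch.randn(1, 4))
print(a.shape, b.shape)  # torch.Size([3, 4]) torch.Size([3, 4])
```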

Fixes https://github.com/pytorch/pytorch/issues/37141

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48169

Reviewed By: izdeby

Differential Revision: D25091414

Pulled By: walterddr

fbshipit-source-id: c5c99374b02409393a68dcb85e2f8feab154318f
2020-12-04 07:32:22 -08:00
eb43e12ee4 Revert D25277886: [pytorch][PR] Replace constexpr with CONSTEXPR_EXCEPT_WIN_CUDA
Test Plan: revert-hammer

Differential Revision:
D25277886 (0484b048d0)

Original commit changeset: eb845db35d31

fbshipit-source-id: 133b938ff8ae1aa54878a03ea5a7e732c6bd5901
2020-12-04 07:08:35 -08:00
6ab84ca0f3 Implement NumPy-like function torch.msort() (#48440)
Summary:
- Related with https://github.com/pytorch/pytorch/issues/38349
- Implementing the NumPy-like function `torch.msort()`.
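A short usage example; like np.msort, it sorts along the first dimension:

```python
import torch

x = torch.tensor([[3., 1.],
                  [2., 4.]])
torch.msort(x)  # tensor([[2., 1.], [3., 4.]])
# equivalent to torch.sort(x, dim=0).values
```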

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48440

Reviewed By: bdhirsh

Differential Revision: D25265753

Pulled By: mruberry

fbshipit-source-id: 7709ac5e5667e7541a3dc9048b9c9896b1a6dfa1
2020-12-04 04:32:09 -08:00
cb285080b0 Added computing matrix condition numbers (linalg.cond) (#45832)
Summary:
This PR adds `torch.linalg.cond` for NumPy compatibility.

Ref https://github.com/pytorch/pytorch/issues/42666.
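A short usage example:

```python
import torch

A = torch.tensor([[1., 0.],
                  [0., 10.]])
torch.linalg.cond(A)         # tensor(10.) -- 2-norm condition number by default
torch.linalg.cond(A, 'fro')  # Frobenius-norm condition number
```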

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45832

Reviewed By: ngimel

Differential Revision: D25183690

Pulled By: mruberry

fbshipit-source-id: a727959bfec2bc2dc36df59d9ef79c0534b68194
2020-12-04 02:23:57 -08:00
4cc163f8ec Add deadline to fakelowp tests (#48823)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48823

deadline=None is not good because:

Sandcastle returns success for tests that time out (the default flag behavior), so we cannot efficiently detect broken tests if there are any.

In addition, the return signal for a timeout is 64, which is the same as for a skipped test.
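A hedged sketch of the kind of change this implies for a hypothesis-based test; the concrete deadline value used in the diff is not shown here:

```python
from hypothesis import given, settings, strategies as st

@settings(deadline=1000)  # explicit per-example deadline in ms, not deadline=None
@given(st.floats(allow_nan=False))
def test_fakelowp_op(x):
    ...  # test body elided; the point is the explicit deadline
```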

Test Plan: Sandcastle, and run tests on card

Reviewed By: hyuen

Differential Revision: D25318184

fbshipit-source-id: de1b55a259edb2452fb51ba4c598ab8cca9e76b7
2020-12-04 00:45:33 -08:00
2181ff89bb [vulkan][test] Not use non 1 dilation for conv2d (#48800)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48800

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Differential Revision: D25312276

Pulled By: IvanKobzarev

fbshipit-source-id: edb36c284ddb79969cbc4e774f11d85f14b39343
2020-12-03 23:45:01 -08:00
5fd61de99e [ONNX] Added hardswish symbolic in opset 9 (#48423)
Summary:
Adds support for the torch.nn.Hardswish operator in ONNX export.

Fixes https://github.com/pytorch/pytorch/issues/43665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48423

Reviewed By: heitorschueroff

Differential Revision: D25309868

Pulled By: bzinodev

fbshipit-source-id: f5583eb01b1b0e8f0bc95d5054941dd29605d6a5
2020-12-03 23:22:21 -08:00
15bc21c280 [ONNX] Track and list model params for scripting (#47348)
Summary:
List model parameters as inputs after freezing the script module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47348

Reviewed By: heitorschueroff

Differential Revision: D25309756

Pulled By: bzinodev

fbshipit-source-id: cbe679ece934d5e6c418a22f08c1662256914c4c
2020-12-03 23:07:28 -08:00
f065087567 [ONNX] Handle dynamic input axes for prim_ConstantChunk (#48176)
Summary:
Converting a model that uses `torch.chunk` does not work with dynamic input axes, because the `Split` attribute is static in opset 11. Therefore, we convert it using `Slice` (supported in opset 11+). This PR also handles the case where the input axis cannot be divided evenly by the number of outputs: PyTorch gives the first (n-1) outputs the same size, and the remainder goes to the last one. Added a UT for it.

The existing `sequence` `split` code cannot be leveraged here, because the `start` and `end` of `Slice` are static there but dynamic here.
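To illustrate the uneven-split semantics described above:

```python
import torch

x = torch.arange(10)
torch.chunk(x, 3)
# (tensor([0, 1, 2, 3]), tensor([4, 5, 6, 7]), tensor([8, 9]))
# the first n-1 outputs share the same size; the remainder goes to the last one
```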

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48176

Reviewed By: bdhirsh

Differential Revision: D25274862

Pulled By: bzinodev

fbshipit-source-id: 7d213a7605ad128aca133c057d6dd86c65cc6de9
2020-12-03 21:59:26 -08:00
86540dbf41 Fix jit doc model loading example (#48104)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48104

Reviewed By: jamesr66a

Differential Revision: D25028353

Pulled By: suo

fbshipit-source-id: aaf74a40e7150a278d100e129740cfe1cef99af2
2020-12-03 20:47:20 -08:00
c55d45f04b [qnnpack] Fix unused var warning when building for different archs. (#48730)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48730

.

Test Plan: CI

Reviewed By: kimishpatel

Differential Revision: D25273068

fbshipit-source-id: 3a0cea633bf1c02fa3176b3b3f43db46d2beb861
2020-12-03 19:46:06 -08:00
f5d94244b2 fx quant: more typehints, part 3 (#48794)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48794

Adds typehints to function I/O in `torch/quantization/quantize_fx.py`,
for readability.

Test Plan:
```
mypy torch/quantization/
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25307084

fbshipit-source-id: 67bdf95b78836dcabc7d829e1854ca5b8ceb8346
2020-12-03 19:28:16 -08:00
54da2dadd8 fx quant: more typehints, part 2 (#48792)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48792

Adds some more typehints throughout quantization/fx/quantize.py,
to help with readability.

Test Plan:
```
mypy torch/quantization/fx/quantize.py
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25306683

fbshipit-source-id: fc38b885a2cb5bf2c6d23b6305658704c6eb7811
2020-12-03 19:28:12 -08:00
f5bcf45e3b fx quant: add more typehints (#48774)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48774

Adds some more typehints throughout `quantization/fx/quantize.py`.

More are needed, ran out of time for now, we can continue in
future PRs.

Test Plan:
```
mypy torch/quantization/fx/quantize.py
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25295836

fbshipit-source-id: 4029aa8ea5b07ce9a57e4be6a66314d7a4e19585
2020-12-03 19:28:09 -08:00
c98c617b44 fx quant: clean up functions in _prepare (#48773)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48773

Makes util functions in `_prepare` have no side effects,
all dependencies are now in arguments.

Note: arg names are added in the order they appeared in the function
code. It's not the most readable, but it is the lowest risk. This can
be cleaned up in future PRs if needed.

```
python test/test_quantization.py TestQuantizeFx
```

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25295839

fbshipit-source-id: 60c687f6b64924473f969541c8703118e4f7d16e
2020-12-03 19:28:06 -08:00
536352e86f fx quant: clean up functions in _generate_qconfig_map (#48772)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48772

Makes util functions in `_generate_qconfig_map` have no side
effects, all dependencies are now in arguments.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25295837

fbshipit-source-id: 49399abef626234e34bb5ec8c6d870da3c1760e7
2020-12-03 19:25:38 -08:00
16fd1c32c5 [ONNX] Update batch_norm symbolic to handle track_running_stats=False (#47903)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47903

Reviewed By: ejguan

Differential Revision: D25097509

Pulled By: bzinodev

fbshipit-source-id: 5584dac1150b13d4e0a6e0c39ac2f2caf41d3b38
2020-12-03 17:31:03 -08:00
cf1e5d7d2b Ignore MSVC's pdb file (#47963)
Summary:
These files are generated by MSVC when building with debug symbols `REL_WITH_DEB_INFO=1`:
```
PS C:\Users\Xiang Gao\source\repos\pytorch> git status
On branch master
Your branch is up to date with 'origin/master'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        torch/lib/asmjit.pdb
        torch/lib/c10.pdb
        torch/lib/c10_cuda.pdb
        torch/lib/caffe2_detectron_ops_gpu.pdb
        torch/lib/caffe2_module_test_dynamic.pdb
        torch/lib/caffe2_observers.pdb
        torch/lib/fbgemm.pdb
        torch/lib/shm.pdb
        torch/lib/torch_cpu.pdb
        torch/lib/torch_cuda.pdb

nothing added to commit but untracked files present (use "git add" to track)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47963

Reviewed By: heitorschueroff

Differential Revision: D25311564

Pulled By: malfet

fbshipit-source-id: 1a7125f3c6ff296b4bb0975ee97b59c23586b1cb
2020-12-03 16:11:24 -08:00
cc1c3063c5 Add test binary to compare torch model outputs (#47933)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47933

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D25309199

Pulled By: SS-JIA

fbshipit-source-id: adc3fc7db33c251f6b661916265b86b7b8c68fc2
2020-12-03 15:29:56 -08:00
b3ac628081 [JIT] Fix bug in get_annotation_str for ast.Subscript (#48741)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48741

**Summary**
This commit fixes a bug in the handling of `ast.Subscript` inside
`get_annotation_str`. `annotation.value` (which contains the AST node
representing the container name) should also be processed using
`get_annotation_str`.

**Test Plan**
This commit adds a unit test to `TestClassType` based on the test case
from the issue that reported this bug.

**Fixes**
This commit fixes #47570.

Test Plan: Imported from OSS

Reviewed By: ppwwyyxx

Differential Revision: D25286013

Pulled By: SplitInfinity

fbshipit-source-id: 61a9e5dc16d9f87b80578f78d537f91332093e52
2020-12-03 14:41:02 -08:00
e7038a7725 Improve an autograd warning (#48765)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48765

Reviewed By: heitorschueroff

Differential Revision: D25304145

Pulled By: albanD

fbshipit-source-id: e818413bf92ad0aa382eda77448183b9fd7d5e77
2020-12-03 12:39:10 -08:00
1eed54d17a Upgrade oneDNN (mkl-dnn) to v1.7 (#47853)
Summary:
Bump oneDNN (mkl-dnn) to 1.7 for bug fixes and performance optimizations
- Fixes https://github.com/pytorch/pytorch/issues/42115. Fixed build issue on Windows for the case when oneDNN is built as submodule
- Fixes https://github.com/pytorch/pytorch/issues/45746. Fixed segmentation fault for convolution weight gradient on systems with Intel AVX512 support

This PR also contains a few changes in ideep for follow-up update (not enabled in current PR yet):
- Performance improvements for the CPU path of Convolution
- Channel-last support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47853

Reviewed By: bdhirsh

Differential Revision: D25275268

Pulled By: VitalyFedyunin

fbshipit-source-id: 75a589d57e3d19a7f23272a67045ad7494f1bdbe
2020-12-03 11:54:31 -08:00
47aa253632 [Feature] Allow user to specify a fraction of the GPU memory. (#48172)
Summary:
Add a new function, torch.cuda.set_per_process_memory_fraction(fraction, device), to torch.cuda. Related: https://github.com/pytorch/pytorch/issues/18626
The fraction (a float from 0 to 1) limits the memory available to the caching allocator on the given GPU device. One can set it on any visible GPU. The allowed memory equals total memory * fraction; an OOM error is raised when an allocation would exceed the allowed value. This function is similar to TensorFlow's per_process_gpu_memory_fraction.
Note that this setting only limits the caching allocator within one process. If you are using multiprocessing, you need to apply this setting in each subprocess to limit its GPU memory, because each subprocess can have its own allocator.

## usage
In some cases, one needs to split a GPU device into two parts; the limit can be set before any GPU memory is used.
E.g., on device 0, each part takes half the memory, as follows:
```
torch.cuda.set_per_process_memory_fraction(0.5, 0)
```
There is an example to show what it is.
```python
import torch
torch.cuda.set_per_process_memory_fraction(0.5, 0)
torch.cuda.empty_cache()
total_memory = torch.cuda.get_device_properties(0).total_memory
# less than 0.5 will be ok:
tmp_tensor = torch.empty(int(total_memory * 0.499), dtype=torch.int8, device='cuda')
del tmp_tensor
torch.cuda.empty_cache()
# this allocation will raise a OOM:
torch.empty(total_memory // 2, dtype=torch.int8, device='cuda')

"""
It raises an error as follows:
RuntimeError: CUDA out of memory. Tried to allocate 5.59 GiB (GPU 0; 11.17 GiB total capacity; 0 bytes already allocated; 10.91 GiB free; 5.59 GiB allowed; 0 bytes reserved in total by PyTorch)
"""
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48172

Reviewed By: bdhirsh

Differential Revision: D25275381

Pulled By: VitalyFedyunin

fbshipit-source-id: d8e7af31902c2eb795d416b57011cc8a22891b8f
2020-12-03 11:45:56 -08:00
c134f32835 Implemented torch.inner (#46716)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46716

Implemented torch.inner similar to [numpy.inner](https://numpy.org/doc/stable/reference/generated/numpy.inner.html). For now it's implemented as a composite op.
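A short usage example:

```python
import torch

torch.inner(torch.tensor([1., 2.]), torch.tensor([3., 4.]))  # tensor(11.)

a = torch.randn(2, 3)
b = torch.randn(4, 3)
torch.inner(a, b).shape  # torch.Size([2, 4]); contracts over the last dimension
```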

TODO

- [x] Add documentation

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D24860351

Pulled By: heitorschueroff

fbshipit-source-id: de5c82f285893495491fdba73b35634f4d00bac8
2020-12-03 11:37:55 -08:00
b726a1bbf8 quantize bias of the quantization parameters (#48749)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48749

This change reverts D25179863 (55e225a2dc) because this behavior was reintroduced in 1.0.0.14.
We believe this was already working pre-1.0.0.9; then Intel regressed it, which is
why we had to remove this quantization section, and in 1.0.0.14 they fixed it.

Test Plan:
We tested ctr_instagram_5x, which now passes with bitwise matching.
hl475 will test the top 6 models, and if they match, we will use this point
to lock in any further changes in the future.

Reviewed By: venkatacrc

Differential Revision: D25283605

fbshipit-source-id: 33aa9af008c113d4d61e3461a44932b502bf42ea
2020-12-03 11:20:56 -08:00
dabc286ab3 Remove output used only by sizes (#448) (#47665)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47665

Re-enabled the pass that removes fusion outputs that are only used by aten::size.
Added size computation for reduction ops via the new operator prim::ReductionSizes.

Test Plan: Imported from OSS

Reviewed By: navahgar, jamesr66a

Differential Revision: D25254675

Pulled By: Krovatkin

fbshipit-source-id: e9a057b0287ed0ac93b415647fd8e5e836ba9856
2020-12-03 11:14:30 -08:00
2cb9204159 Add nondeterministic alert to index_copy, median CUDA and kthvalue CUDA (#46942)
Summary:
Also fixes an issue where skipped tests did not properly restore the deterministic flag.

Fixes https://github.com/pytorch/pytorch/issues/46743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46942

Reviewed By: heitorschueroff

Differential Revision: D25298020

Pulled By: mruberry

fbshipit-source-id: 14b1680e1fa536ec72018d0cdb0a3cf83b098767
2020-12-03 11:03:07 -08:00
c2ad3c4e6a Add scary comment in cpp_custom_type_hack.h (#48737)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48737

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Differential Revision: D25280542

Pulled By: jamesr66a

fbshipit-source-id: 67c3b8c82def848ba3059dd6f6a23f9c5e329c0f
2020-12-03 10:58:12 -08:00
416dc68341 [Pytorch][Annotation] Update inlined callstack with module instance info (#47416)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47416

Test Plan: Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D24752846

Pulled By: cccclai

fbshipit-source-id: 94d3c18c56161d1de3a16bb7c93502fedf71644c
2020-12-03 10:44:46 -08:00
5c9cef9a6c [numpy] Add torch.moveaxis (#48581)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/38349 #36048 https://github.com/pytorch/pytorch/pull/41480#issuecomment-734398262
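A short usage example; `moveaxis` is the NumPy-compatible alias of `torch.movedim`:

```python
import torch

x = torch.randn(2, 3, 4)
torch.moveaxis(x, 0, -1).shape           # torch.Size([3, 4, 2])
torch.moveaxis(x, (0, 1), (1, 0)).shape  # torch.Size([3, 2, 4])
```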

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48581

Reviewed By: bdhirsh

Differential Revision: D25276307

Pulled By: mruberry

fbshipit-source-id: 3e3e4df1343c5ce5b71457badc43f08c419ec5c3
2020-12-03 10:34:33 -08:00
befab0d9d4 [ONNX] Cast Gather index to Long if needed (#47653)
Summary:
The ONNX Gather op's index needs to be int32 or int64. However, we don't insert this Cast in our converter, so the following UT fails (for opset 11+):
`seq_length.type().scalarType()` is None, so `_arange_cast_helper()` cannot treat it as all-integral and instead casts everything to float. This float value is then used as the Gather index, hence ORT throws an error about a float-typed index.
The fix is to cast the Gather index to Long if it is not already int/long.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47653

Reviewed By: heitorschueroff

Differential Revision: D25298056

Pulled By: mruberry

fbshipit-source-id: 05e3a70ccfd74612233c63ec5bb78e060b211909
2020-12-03 09:34:59 -08:00
92f376147c Enable TCPStore on Windows (#47749)
Summary:
Enable TCPStore for DDP on the Windows platform, in order to improve DDP performance when running across machines.

Related RFC is https://github.com/pytorch/pytorch/issues/47659

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47749

Reviewed By: bdhirsh

Differential Revision: D25220401

Pulled By: mrshenli

fbshipit-source-id: da4b46b42296e666fa7d8ec8040093de7443a529
2020-12-03 08:32:01 -08:00
93973ee699 Header cleanup (#48728)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48728

Mostly removing unnecessary includes so that TensorIterator.h can be
included from NativeFunctions.h without causing cycles. There are some
cases where I moved code around so that I didn't have to pull in other
unnecessary stuff.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D25278030

Pulled By: ezyang

fbshipit-source-id: 5f6b95a6bc734e452e9bd7bee8fe5278f5e45be2
2020-12-03 08:26:20 -08:00
f9a0abfc43 Fix code review from #48659 and #48116 (#48731)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48731

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D25278034

Pulled By: ezyang

fbshipit-source-id: 73652311b48d8d80c06e9385b7ff18ef3a158ae8
2020-12-03 08:26:17 -08:00
d6f9e8562b Generalize some TensorIterator consumers to take TensorIteratorBase (#48727)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48727

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D25278033

Pulled By: ezyang

fbshipit-source-id: 77f125ddb8446edf467a22130227d90583884bca
2020-12-03 08:24:48 -08:00
c01e5b8827 Simplify CachingAllocator. (#48752)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48752

Reviewed By: linbinyu

Differential Revision: D25285292

fbshipit-source-id: 17679ccda5279ab426e50e4266c50aac74f92a13
2020-12-03 07:30:01 -08:00
ef50c94e7c reenabling MPI test (#48725)
Summary:
fixes https://github.com/pytorch/pytorch/issues/47443.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48725

Reviewed By: mrshenli

Differential Revision: D25278758

Pulled By: walterddr

fbshipit-source-id: a02d0fef99a7941c8e98da16a45d840e12b8b0c3
2020-12-03 06:50:36 -08:00
0484b048d0 Replace constexpr with CONSTEXPR_EXCEPT_WIN_CUDA (#48717)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48716

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48717

Reviewed By: ezyang

Differential Revision: D25277886

Pulled By: datumbox

fbshipit-source-id: eb845db35d31b64d3e4401ed56843814192ce5a6
2020-12-03 05:36:38 -08:00
5489a98cd3 Add support for CorrCholeskyTransform (#48041)
Summary:
This adds a transform to convert a real vector of dimension D*(D-1)/2 into the Cholesky factor of a D x D correlation matrix. This follows the implementation in [NumPyro](https://github.com/pyro-ppl/numpyro/blob/master/numpyro/distributions/transforms.py) by fehiepsi. It is needed for the LKJ distribution, which will be added in a subsequent PR.
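A minimal sketch of the resulting transform, using the shapes from the description above (constructor defaults assumed):

```python
import torch
from torch.distributions.transforms import CorrCholeskyTransform

t = CorrCholeskyTransform()
x = torch.randn(6)  # D * (D - 1) / 2 = 6  =>  D = 4
L = t(x)            # 4x4 lower-triangular Cholesky factor of a correlation matrix
print(torch.allclose((L @ L.T).diagonal(), torch.ones(4)))  # unit diagonal
```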

Also in line with the ongoing effort to refactor distributions test, this moves the transforms test into its own file that uses pytest with parametrized fixtures.

For review:
 fehiepsi - could you help review the math?
 fritzo - do you have any suggestions for what to do about the event dimension (more details are in the comment below)?
 ezyang - could you review the changes in `run_test.py`? Instead of a separate `PYTEST_TESTS`, I have clubbed these tests in `USE_PYTEST_LIST` to avoid duplicate logic. The only difference is that we do not anymore check if pytest is not installed and exclude the tests in the list. I figured that if existing tests are already using pytest, this should not matter.

TODOs (probably not all can be satisfied at the same time):
 - [x] Use operations that are JIT friendly, i.e. the transform works with different sized input under JIT.
 - [x] Resolve test failures - currently `arange(scalar_tensor)` fails on certain backends but this is needed for JIT. Maybe we should only support same sized tensor under JIT?
 - [x] Add tests to check that the transform gives correct gradients and is in agreement with the `log_det_jacobian`.
 - [x] Add `input_event_dim` and `output_event_dim` to `CorrCholeskyTransform`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48041

Reviewed By: zhangguanheng66

Differential Revision: D25262505

Pulled By: neerajprad

fbshipit-source-id: 5a57e1c19d8230b53592437590b9169bdf2f71e9
2020-12-03 03:21:08 -08:00
313e77fc06 Add broadcast_shapes() function and use it in MultivariateNormal (#43935)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43837

This adds a `torch.broadcast_shapes()` function similar to Pyro's [broadcast_shape()](7c2c22c10d/pyro/distributions/util.py (L151)) and JAX's [lax.broadcast_shapes()](https://jax.readthedocs.io/en/test-docs/_modules/jax/lax/lax.html). This helper is useful e.g. in multivariate distributions that are parameterized by multiple tensors, where we want to `torch.broadcast_tensors()` the parameters but they have different "event shapes" (e.g. mean vectors and covariance matrices). This helper is already heavily used in Pyro's distribution codebase, and we would like to start using it in `torch.distributions`.
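A short usage example:

```python
import torch

torch.broadcast_shapes((2, 1), (3, 1, 1), (4,))  # torch.Size([3, 2, 4])
```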

- [x] refactor `MultivariateNormal`'s expansion logic to use `torch.broadcast_shapes()`
- [x] add unit tests for `torch.broadcast_shapes()`
- [x] add docs

cc neerajprad

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43935

Reviewed By: bdhirsh

Differential Revision: D25275213

Pulled By: neerajprad

fbshipit-source-id: 1011fdd597d0a7a4ef744ebc359bbb3c3be2aadc
2020-12-03 02:42:04 -08:00
c7746adbc6 Revert D24874754: [pytorch][PR] Add test for empty tensors for batch matmuls
Test Plan: revert-hammer

Differential Revision:
D24874754 (5f105e2aa6)

Original commit changeset: 41ba837740ff

fbshipit-source-id: d6cb31cbc4a2a386aab0a5f24710f218f9a561ca
2020-12-03 00:29:07 -08:00
79b9c03465 Optimize torch zeros (#45636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45636

After creating an empty tensor, 'memset' is used to zero out the items of the tensor.

Test Plan:
pytorch benchmark tool results:

timer = benchmark_utils.Timer(stmt="torch.zeros((1024, 4096))")

Before: 1007 us
After:     841.26 us
1 measurement, 10000 runs , 1 thread

timer = benchmark_utils.Timer(stmt="torch.zeros((128))")

Before: 4 - 7.6 us
After:     2.4 - 2.8 us
1 measurement, 10000 runs , 1 thread

           torch.int8     |   1   |  4096  |  8192  |  16384  |  32768  |
1 threads: --------------------------------------------------------------
  (PR #45636)  x.zero_()  |  500  |   500  |   600  |    700  |   2000  |
  (Reference)  x.zero_()  |  800  |  1000  |  1000  |   2000  |   2000  |
2 threads: --------------------------------------------------------------
  (PR #45636)  x.zero_()  |  500  |   500  |   600  |    700  |   2000  |
  (Reference)  x.zero_()  |  800  |  1000  |  1000  |   2000  |   3000  |
4 threads: --------------------------------------------------------------
  (PR #45636)  x.zero_()  |  500  |   500  |   600  |    700  |   2000  |
  (Reference)  x.zero_()  |  800  |  1000  |  1000  |   2000  |   3000  |

           torch.int32    |   1   |  4096  |  8192  |  16384  |  32768  |
1 threads: --------------------------------------------------------------
  (PR #45636)  x.zero_()  |  400  |   700  |  2000  |   2900  |   5500  |
  (Reference)  x.zero_()  |  800  |  2000  |  3000  |   4400  |   7300  |
2 threads: --------------------------------------------------------------
  (PR #45636)  x.zero_()  |  500  |   700  |  2000  |   3000  |   5600  |
  (Reference)  x.zero_()  |  900  |  2000  |  2000  |   3600  |   7200  |
4 threads: --------------------------------------------------------------
  (PR #45636)  x.zero_()  |  400  |   700  |  2000  |   3000  |   5700  |
  (Reference)  x.zero_()  |  800  |  2000  |  3100  |   4300  |   9000  |

           torch.float16  |   1   |  4096  |  8192  |  16384  |  32768  |
1 threads: --------------------------------------------------------------
  (PR #45636)  x.zero_()  |  500  |   500  |   700  |   2000  |   3000  |
  (Reference)  x.zero_()  |  800  |  1000  |  2000  |   2000  |   3300  |
2 threads: --------------------------------------------------------------
  (PR #45636)  x.zero_()  |  500  |   600  |   700  |   2000  |   3000  |
  (Reference)  x.zero_()  |  800  |  1000  |  2000  |   2000  |   4300  |
4 threads: --------------------------------------------------------------
  (PR #45636)  x.zero_()  |  500  |   600  |   700  |   2000  |   3300  |
  (Reference)  x.zero_()  |  900  |  1000  |  2000  |   2000  |   4400  |

           torch.float32  |   1   |  4096  |  8192  |  16384  |  32768  |
1 threads: --------------------------------------------------------------
  (PR #45636)  x.zero_()  |  500  |   700  |  2000  |   3200  |   6100  |
  (Reference)  x.zero_()  |  800  |  2000  |  2000  |   3500  |   6100  |
2 threads: --------------------------------------------------------------
  (PR #45636)  x.zero_()  |  500  |   700  |  2000  |   3100  |   5600  |
  (Reference)  x.zero_()  |  800  |  2000  |  2000  |   3300  |   7000  |
4 threads: --------------------------------------------------------------
  (PR #45636)  x.zero_()  |  500  |   700  |  2000  |   3000  |   5600  |
  (Reference)  x.zero_()  |  900  |  2000  |  2000  |   3600  |   7500  |

Reviewed By: ngimel

Differential Revision: D23925113

fbshipit-source-id: 04e97ff6d67c52a8e7a21449113e1a0a7443098f
2020-12-02 23:25:30 -08:00
1112773cf5 Fix unintended error when worker force kill happens #43455 (#43462)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43455

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43462

Reviewed By: bdhirsh

Differential Revision: D25277759

Pulled By: VitalyFedyunin

fbshipit-source-id: 0bb0d87374c0403853d71aac2c242374bfc7acf2
2020-12-02 21:42:16 -08:00
85c1e8acdc Replace kernel resource strings with real .cu source files (#48283)
Summary:
Convert NVFuser's runtime CUDA sources (under `.../jit/codegen/cuda/runtime`) to string literals, then include the headers with the generated literals.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48283

Reviewed By: mrshenli

Differential Revision: D25163362

Pulled By: ngimel

fbshipit-source-id: 4e6c181688ddea78ce6f3c754fee62fa6df16641
2020-12-02 21:22:29 -08:00
5f105e2aa6 Add test for empty tensors for batch matmuls (#47700)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47700

Reviewed By: malfet

Differential Revision: D24874754

Pulled By: ngimel

fbshipit-source-id: 41ba837740ff7d5bd49d5f7277ad2064985aba2f
2020-12-02 20:45:59 -08:00
ea573ea944 [quant][graphmode][fx] Standalone module takes float as input and output (#48671)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48671

A standalone module might be called separately, so it's better to use float
as the interface.

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D25256184

fbshipit-source-id: e209492a180ce1f81f31c8d6057956a74bad20b1
2020-12-02 20:34:25 -08:00
22c3ae8b57 Disable autocast cache for tensor views as fix for #48049 (#48696)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48049

Root cause of the issue explained [here](https://github.com/pytorch/pytorch/issues/48049#issuecomment-736701769).

This PR implements albanD's suggestion to add the `!t.is_view()` check and disable autocast caching for views of tensors.

The added test checks for an increase in memory usage by comparing the initially allocated memory with the memory after 3 iterations using a single `nn.Linear` layer in a `no_grad` and `autocast` context.
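A hedged repro sketch of the pattern the test exercises; the layer and batch sizes here are illustrative, not taken from the test:

```python
import torch

linear = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(8, 1024, device="cuda")
start = torch.cuda.memory_allocated()
with torch.no_grad(), torch.cuda.amp.autocast():
    for _ in range(3):
        y = linear(x)
# before the fix, cached casts of tensor views accumulated across iterations;
# after the fix this delta should stay flat
print(torch.cuda.memory_allocated() - start)
```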

After this PR the memory usage in the original issue doesn't grow anymore and yields:
```python
autocast: True
0: 0MB (peak 1165MB)
1: 0MB (peak 1264MB)
2: 0MB (peak 1265MB)
3: 0MB (peak 1265MB)
4: 0MB (peak 1265MB)
5: 0MB (peak 1265MB)
6: 0MB (peak 1265MB)
7: 0MB (peak 1265MB)
8: 0MB (peak 1265MB)
9: 0MB (peak 1265MB)
```

CC ngimel mcarilli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48696

Reviewed By: bdhirsh

Differential Revision: D25276231

Pulled By: ngimel

fbshipit-source-id: e2571e9f166c0a6f6f569b0c28e8b9ca34132743
2020-12-02 20:25:13 -08:00
0e4f9a7872 Refactored OpInfo testing to support custom SampleInputs, added addmm to op_db to test (#48627)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48627

Several changes to the OpInfo testing suite:
- Changed test_ops.py to support sample.inputs that are longer than a single element
- Changed OpInfo class to use custom sample_input generator functions, changed UnaryUfuncInfo to use new format
- Added mvp addmm op to operator database to test out sample.inputs with a length greater than a single element

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D25234178

Pulled By: Lilyjjo

fbshipit-source-id: cca2c60af7e6deb849a1cc3770c04ed88865016c
2020-12-02 19:59:40 -08:00
90faf43151 Support for OpInfo-based testing for operators in JIT (#47696)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47696

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D25212436

Pulled By: Lilyjjo

fbshipit-source-id: 1fd2884d86b2afd6321ae1599d755b4beae4670a
2020-12-02 19:59:37 -08:00
9c35a68094 Refactored assertAutodiff test to have better error message (#48567)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48567

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D25212435

Pulled By: Lilyjjo

fbshipit-source-id: eab3933bf4248dbfa20cd956d4a0106b10db5fc4
2020-12-02 19:59:33 -08:00
c465602d78 Refactor existing JIT testing utils to enable new OpInfo test suite to reuse existing logic (#47695)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47695

The method_tests from common_methods_invoations.py are being migrated into a new OpInfo class-based testing framework. The work in this commit pulls out the functions embedded in the old method_tests logic and places them in a location that both the old method_tests and OpInfo tests can use

Specifically: created torch/testing/_internal/common_jit.py from functions and methods in torch/testing/_internal/jit_utils.py and test/test_jit.py. Also created new intermediate class JitCommonTestCase to house moved methods. Also slightly modified jit_metaprogramming_utils.py to work for OpInfo tests

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D25212437

Pulled By: Lilyjjo

fbshipit-source-id: 97bc52c95d776d567750e7478fac722da30f4985
2020-12-02 19:54:30 -08:00
1195403915 [NNC] Add cpu fusion gflag (#48682)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48682

Reviewed By: Krovatkin, ngimel

Differential Revision: D25260205

Pulled By: eellison

fbshipit-source-id: df1655fd75f2a13bcf7c025b1f0a7becc2fd126a
2020-12-02 19:47:18 -08:00
0d39bd47cf only enable cudnn persistent RNN when batchsize % 8 == 0 (#48070)
Summary:
On A100, the cuDNN persistent RNN algorithm doesn't work well when the batch size is not a multiple of 8, so we need to disable it in that case.

Related: https://github.com/pytorch/pytorch/pull/43165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48070

Reviewed By: bdhirsh

Differential Revision: D25283953

Pulled By: ngimel

fbshipit-source-id: d7f33b1f43e2e3c46dc89ae046779175f6992569
2020-12-02 18:40:18 -08:00
a1daf1e678 Use fastAtomicAdd in GPU upsampling trilinear (#48675)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44206

This PR basically follows the diff in https://github.com/pytorch/pytorch/pull/21879 for upsampling bilinear.

For the script provided in https://github.com/pytorch/pytorch/issues/44206 , on my 2070 super GPU, the total timing I got (time in second)

| | non-amp | amp |
|---|---|---|
| before PR | 2.88 | 9.6 |
| after PR | 1.5 | 1.6 |

kernel time after PR
| | time | kernel |
| --- | --- | --- |
| non-amp | 0.37 ms | `void at::native::(anonymous namespace)::upsample_trilinear3d_backward_out_frame<float, float>(unsigned long, int, int, int, int, int, int, float, float, float, bool, float*, float const*) ` |
| amp | 0.61 ms | `void at::native::(anonymous namespace)::upsample_trilinear3d_backward_out_frame<c10::Half, float>(unsigned long, int, int, int, int, int, int, float, float, float, bool, c10::Half*, c10::Half const*)` |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48675

Reviewed By: bdhirsh

Differential Revision: D25284853

Pulled By: ngimel

fbshipit-source-id: 30f0d92e73050edd36013ce528d2e131effa3542
2020-12-02 18:25:28 -08:00
5f62308739 Hipify revamp [REDUX] (#48715)
Summary:
[Refiled version of earlier PR https://github.com/pytorch/pytorch/issues/45451]

This PR revamps the hipify module in PyTorch to overcome a long list of shortcomings in the original implementation. However, these improvements are applied only when using hipify to build PyTorch extensions, not for PyTorch or Caffe2 itself.

Correspondingly, changes are made to cpp_extension.py to match these improvements.

The list of improvements to hipify is as follows:

1. Hipify files in the same directory as the original file, unless there's a "cuda" subdirectory in the original file path, in which case the hipified file will be in the corresponding file path with "hip" subdirectory instead of "cuda".
2. Never hipify the file in-place if changes are introduced due to hipification i.e. always ensure the hipified file either resides in a different folder or has a different filename compared to the original file.
3. Prevent re-hipification of already hipified files. This avoids creation of unnecessary "hip/hip" etc. subdirectories and additional files which have no actual use.
4. Do not write out hipified versions of files if they are identical to the original file. This results in a cleaner output directory, with minimal number of hipified files created.
5. Update header rewrite logic so that it accounts for the previous improvement.
6. Update header rewrite logic so it respects the rules for finding header files depending on whether "" or <> is used.
7. Return a dictionary of mappings of original file paths to hipified file paths from hipify function.
8. Introduce a version for hipify module to allow extensions to contain back-compatible code that targets a specific point in PyTorch where the hipify functionality changed.
9. Update cuda_to_hip_mappings.py to account for the ROCm component subdirectories inside /opt/rocm/include. This also results in cleanup of the Caffe2_HIP_INCLUDE path to remove unnecessary additions to the include path.

The list of changes to cpp_extension.py is as follows:

1. Call hipify when building a CUDAExtension for ROCm.
2. Prune the list of source files passed to CUDAExtension to include only the hipified version of any source file (when both the original and hipified versions are in the list)
3. Add subdirectories of /opt/rocm/include to the include path for extensions, so that ROCm headers for subcomponent libraries are found automatically

cc jeffdaily sunway513 ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48715

Reviewed By: bdhirsh

Differential Revision: D25272824

Pulled By: ezyang

fbshipit-source-id: 8bba68b27e41ca742781e1c4d7b07c6f985f040e
2020-12-02 18:03:23 -08:00
780f2b9a9b torch: Stop using _nt_quote_args from distutils (#48618)
Summary:
This function was removed from distutils in Python 3.9, so we should just
remake the function here and use our own instead of relying on hidden
functions from the stdlib

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Fixes https://github.com/pytorch/pytorch/issues/48617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48618

Reviewed By: samestep

Differential Revision: D25230281

Pulled By: seemethere

fbshipit-source-id: 57216af40a4ae4dc8bafcf40d2eb3ba793b9b6e2
2020-12-02 16:53:25 -08:00
95311add49 Vulkan linear memory allocator. (#48569)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48569

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Differential Revision: D25277091

Pulled By: AshkanAliabadi

fbshipit-source-id: 0530832ce61432237976088cb72a8b7c3aee949c
2020-12-02 16:18:22 -08:00
90a3049a9a [fix] repr(torch.device) (#48655)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48585

In commit 4c9eb57914, the type of `DeviceIndex` was changed from `uint16_t` to `uint8_t`.
`uint8_t` is treated as an ASCII char by std::cout and other stream operators, hence the broken `repr`.

Stackoverflow Reference: https://stackoverflow.com/questions/19562103/uint8-t-cant-be-printed-with-cout

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48655

Reviewed By: bdhirsh

Differential Revision: D25272289

Pulled By: ezyang

fbshipit-source-id: a1549f5f8d417138cf38795e4c373e3a487d3691
2020-12-02 15:48:17 -08:00
b006c7a132 Add reparameterization support to OneHotCategorical (#46610)
Summary:
Add reparameterization support to the `OneHotCategorical` distribution. Samples are reparameterized based on the straight-through gradient estimator, which is proposed in the paper [Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation](https://arxiv.org/abs/1308.3432).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46610

Reviewed By: neerajprad

Differential Revision: D25272883

Pulled By: ezyang

fbshipit-source-id: 8364408fe108a29620694caeac377a06f0dcdd84
2020-12-02 15:39:32 -08:00
de46369af7 [vulkan] Distribute weight prepacking along y dimension for conv2d (#48266)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48266

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D25222752

Pulled By: SS-JIA

fbshipit-source-id: 973e7956cd372c657dbbc6c7835e77b5f4e35f01
2020-12-02 14:54:36 -08:00
a4e13fcf3f add type annotations to common_nn.py (#48190)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48189

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48190

Reviewed By: walterddr, zhangguanheng66

Differential Revision: D25245261

Pulled By: malfet

fbshipit-source-id: 0eabaed54996be83ead0fd7668f4d2be20adfc17
2020-12-02 14:46:00 -08:00
a49e2c5ce6 Remove "-b" option from pip install command (#48742)
Summary:
It has been deprecated for a while and was finally removed in pip 20.3.
Follow-up to https://github.com/pytorch/pytorch/pull/48722.
Fixes ONNX build failures after the Docker image update.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48742

Reviewed By: walterddr

Differential Revision: D25282017

Pulled By: malfet

fbshipit-source-id: 1dfa4eb57398f979107ca1544aafbc6d7b5e68a4
2020-12-02 14:28:18 -08:00
fc1153a8be [JIT] Fix clang-tidy warnings in jit/runtime (#47992)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47992

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D25258645

Pulled By: SplitInfinity

fbshipit-source-id: b3e4576400c101b247e80cb4044fc04471f39a47
2020-12-02 12:35:42 -08:00
a25d52f4e6 [JIT] Fix clang-tidy warnings in jit/serialization (#47991)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47991

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D25258639

Pulled By: SplitInfinity

fbshipit-source-id: 2492c5e3bfbe87600512988b7f31f11b7b014f5a
2020-12-02 12:35:40 -08:00
34b2304e34 [JIT] Fix clang-tidy warnings in jit/testing (#47986)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47986

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D25258642

Pulled By: SplitInfinity

fbshipit-source-id: 468b3751d6737c3262e72dfaa0cd7a1699e988a3
2020-12-02 12:35:38 -08:00
18eccfbe42 [JIT] Fix clang-tidy warnings in jit/python (#47985)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47985

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D25258644

Pulled By: SplitInfinity

fbshipit-source-id: dfc15dc62c148f79f4e99fd058a6bf2d071ccbb5
2020-12-02 12:35:36 -08:00
8746e1a1cc [JIT] Fix clang-tidy warnings in jit/passes (#47984)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47984

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D25258638

Pulled By: SplitInfinity

fbshipit-source-id: 0ed5ef6984ba988a2c67407efcc77355ca25bbee
2020-12-02 12:35:34 -08:00
9b973eb275 [JIT] Fix clang-tidy warnings jit/ir (#47983)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47983

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D25258643

Pulled By: SplitInfinity

fbshipit-source-id: b8e0ecfb3cc9ed928c564fb198b32c615e30eb5a
2020-12-02 12:35:31 -08:00
3039d24f4a [JIT] Fix clang-tidy warnings for jit/frontend (#47982)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47982

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D25258640

Pulled By: SplitInfinity

fbshipit-source-id: e2cf27130311904aa5b18e3232349604d01701a0
2020-12-02 12:35:28 -08:00
4aa5d68874 [JIT] Fix clang-tidy warnings for jit/api (#47981)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47981

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D25258641

Pulled By: SplitInfinity

fbshipit-source-id: 2cf2c1f5f02b7a64104d736f582ff6a15ba9b876
2020-12-02 12:30:39 -08:00
83c76611d5 [package] Support glob matching (#48633)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48633

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D25236016

Pulled By: zdevito

fbshipit-source-id: 5eca7b7f344a6c2f6a047bfabdb4da8cdd0dc7ec
2020-12-02 12:24:46 -08:00
88735f2cc9 [package] move importer logic into import pickler (#48632)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48632

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D25236017

Pulled By: zdevito

fbshipit-source-id: 57fd80d36ddf390ae35c58adf6dddbf15a1347c1
2020-12-02 12:24:44 -08:00
ce3484595e [packaging] missing quotation in graphviz printout (#48344)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48344

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D25236018

Pulled By: zdevito

fbshipit-source-id: cb69ec35b86228dfcd1f2823db2b2150a9d3e8b9
2020-12-02 12:23:09 -08:00
15fc66d6c8 fix nvrtc PTX architecture cap for CUDA toolkit (#48455)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48200

CUDA 11.0 only supports < sm_80 (https://docs.nvidia.com/cuda/archive/11.0/nvrtc/#group__options)

Note: the NVRTC documentation is not a reliable source for querying supported architectures. The rule of thumb is that NVRTC supports the same set of archs as nvcc, so the best way to query it is something like `nvcc -h | grep -o "compute_[0-9][0-9]" | sort | uniq`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48455

Reviewed By: zhangguanheng66

Differential Revision: D25255529

Pulled By: ngimel

fbshipit-source-id: e84cf51ab50519b4c97dad063cc43c9194942bb2
2020-12-02 11:50:22 -08:00
bdb68d9b0b [reland] [ROCm] remove versions less than 3.8 (#48723)
Summary:
First attempt to land https://github.com/pytorch/pytorch/issues/48118 failed due to the problem fixed by https://github.com/pytorch/pytorch/issues/48722.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48723

Reviewed By: bdhirsh

Differential Revision: D25274287

Pulled By: malfet

fbshipit-source-id: 3ff0be3b522012c647448e5173b3ae38446d4120
2020-12-02 11:24:49 -08:00
4d26941a9b Fix lite interpreter record function issue. (#47457)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47457

This fixes two issues.
1. The lite interpreter's record_function is intended to be used only for root-op
profiling. At the moment, if RECORD_FUNCTION is enabled via the Dispatcher, it
logs not just root ops but all ops.
2. Because the interpreter sets an op index that later gets picked up elsewhere
(decoupled design), the op index set in the lite interpreter ends up being
used by all record-function calls, not just the root op, so we don't really get
correct per-op profiling. This diff also fixes that issue.

Reviewed By: ilia-cher

Differential Revision: D24763689

fbshipit-source-id: 6c1f8bcaec9fb5ebacb2743a5dcf7090ceb176b9
2020-12-02 11:24:45 -08:00
4fcdbb824b Updating all call-sites of the legacy dispatcher registration API in fbcode to the new API. (#48178)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48178

I migrated all call sites that used the legacy dispatcher registration API (RegisterOperators()) to use the new API (TORCH_LIBRARY...). I found all call-sites by running `fbgs RegisterOperators()`. This touched several places, including other OSS code (nestedtensor, torchtext, torchvision). A few things to call out:

For simple ops that only had one registered kernel without a dispatch key, I replaced them with:
```
TORCH_LIBRARY_FRAGMENT(ns, m) {
   m.def("opName", fn_name);
}
```

For ops that registered to a specific dispatch key / had multiple kernels registered, I registered the common kernel (math/cpu) directly inside a `TORCH_LIBRARY_FRAGMENT` block, and registered any additional kernels from other files (e.g. cuda) in a separate `TORCH_LIBRARY_IMPL` block.

```
// cpu file
TORCH_LIBRARY_FRAGMENT(ns, m) {
  m.def("opName(schema_inputs) -> schema_outputs");
  m.impl("opName", torch::dispatch(c10::DispatchKey::CPU, TORCH_FN(cpu_kernel)));
}

// cuda file
TORCH_LIBRARY_IMPL(ns, CUDA, m) {
  m.impl("opName", torch::dispatch(c10::DispatchKey::CUDA, TORCH_FN(cuda_kernel)));
}
```
Special cases:

I found a few ops that used a (legacy) `CPUTensorId`/`CUDATensorId` dispatch key. Updated those to use CPU/CUDA; this seems safe because the keys are aliased to one another in `DispatchKey.h`

There were a handful of ops that registered a functor (function class) to the legacy API. As far as I could tell we don't allow this case in the new API, mainly because you can accomplish the same thing more cleanly with lambdas. Rather than delete the class I wrote a wrapper function on top of the class, which I passed to the new API.
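
A hedged sketch of that wrapper-function workaround (the kernel class, op name, and namespace here are hypothetical):

```cpp
#include <torch/library.h>

// Hypothetical legacy functor that used to be registered directly.
struct MyKernel {
  at::Tensor operator()(const at::Tensor& x) {
    return x.clone();
  }
};

// Thin free function registered in its place; the class is kept as-is.
at::Tensor my_kernel_wrapper(const at::Tensor& x) {
  return MyKernel()(x);
}

TORCH_LIBRARY_FRAGMENT(ns, m) {
  m.def("opName", my_kernel_wrapper);
}
```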

There were a handful of ops that were registered only to a CUDA dispatch key. I put them inside a TORCH_LIBRARY_FRAGMENT block, and used a `def()` and `impl()` call like in case two above.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25056090

Pulled By: bdhirsh

fbshipit-source-id: 8f868b45f545e5da2f21924046e786850eba70d9
2020-12-02 11:19:31 -08:00
022c929145 Revert "Revert D25199264: Enable callgrind collection for C++ snippets" (#48720)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48720

This reverts commit 6646ff122d3215b77909f669fc26cf6a927030db.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D25273994

Pulled By: malfet

fbshipit-source-id: 61743176dc650136622e1b8f2384bbfbd7a46294
2020-12-02 11:10:11 -08:00
b2ec21a05a [ROCm] Enable deterministic rocBLAS mode (#48654)
Summary:
This PR adds a feature to disable atomics in rocBLAS calls, thereby making the output deterministic when PyTorch expects it. This mode of rocBLAS can be exercised via the global setting `torch.set_deterministic(True)`

cc: ezyang jeffdaily sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48654

Reviewed By: bdhirsh

Differential Revision: D25272296

Pulled By: ezyang

fbshipit-source-id: 70400572b0ab37c6db52636584de0ae61bb5270a
2020-12-02 10:23:32 -08:00
52f0af03f8 [reland][quant][fix] Add bias once in conv_fused (#48593) (#48661)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48661

Previously, `_conv_forward` would add `self.bias` to the result, so the bias was added twice in the QAT ConvBn module. This PR adds a bias argument to `_conv_forward`, which the ConvBn module now calls with a zero bias.

fixes: https://github.com/pytorch/pytorch/issues/48514

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D25249175

fbshipit-source-id: 4536c7545d3dcd7e8ea254368ffb7cf15118d78c
2020-12-02 10:17:43 -08:00
0db73460db [quantization] fix run_arg tiny bug (#48537)
Summary:
This fix allows the calibration function to take in more than one positional argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48537

Reviewed By: zhangguanheng66

Differential Revision: D25255764

Pulled By: jerryzh168

fbshipit-source-id: 3ce20dbed95fd26664a186bd4a992ab406bba827
2020-12-02 10:07:33 -08:00
f61de25dfa Fix index_put doc. (#48673)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48673

fixes #48642

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D25257078

Pulled By: ailzhang

fbshipit-source-id: e5ebd6e07aafb262989fc12131546037fed8ebf6
2020-12-02 10:01:11 -08:00
071344debe Fix index parsing on Python-3.9 (#48676)
Summary:
In 3.9, `ast.Index` and `ast.ExtSlice` are deprecated, so:
- `ast.parse('img[3]', mode='eval')` evaluates to
`Expression(body=Subscript(value=Name(id='img'), slice=Constant(value=3)))` under 3.9,
but was previously evaluated to `Expression(body=Subscript(value=Name(id='img'), slice=Index(value=Num(n=3))))`
- and `ast.parse('img[..., 10:20]', mode='eval')` is now evaluated to
`Subscript(value=Name(id='img'), slice=Tuple(elts=[Constant(value=Ellipsis), Slice(lower=Constant(value=10), upper=Constant(value=20))]))`,
but was previously evaluated to
`Subscript(value=Name(id='img'), slice=ExtSlice(dims=[Index(value=Ellipsis()), Slice(lower=Num(n=10), upper=Num(n=20), step=None)]))`
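
A minimal sketch of the version check a consumer of these nodes needs (plain stdlib, nothing PyTorch-specific):

```python
import ast
import sys

tree = ast.parse('img[3]', mode='eval')
node = tree.body.slice
# Python >= 3.9 stores the index directly as ast.Constant; older versions
# wrap it in ast.Index, which must be unwrapped before inspection.
if sys.version_info >= (3, 9):
    const = node
else:
    const = node.value
print(type(const).__name__)  # Constant (Num on very old versions)
```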

Fixes https://github.com/pytorch/pytorch/issues/48674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48676

Reviewed By: seemethere, gmagogsfm

Differential Revision: D25261323

Pulled By: malfet

fbshipit-source-id: cc818ecc596a062ed5f1a1d11d3fdf0f22bf7f4a
2020-12-02 09:56:20 -08:00
3c5db30eaa Update magma to 2.5.4 for Windows (#48656)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48527

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48656

Reviewed By: zhangguanheng66

Differential Revision: D25261601

Pulled By: malfet

fbshipit-source-id: 4ba0036ca882bccd1990108d13596455d179d06e
2020-12-02 09:45:21 -08:00
c98c98d77d Migrate fmod and fmod_ from TH to ATen (CUDA) (#47323)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47323

Fixes #24565

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D24763086

Pulled By: ejguan

fbshipit-source-id: fa004baea19bbbdbeb44814903db29226805ef0e
2020-12-02 09:38:29 -08:00
dc367e7903 Delete "-b" flag from pip install command (#48722)
Summary:
"--build <dir>" flag has been deprecated for a while and finally removed in pip-20.3

Before this PR is applied, every change to docker images would result in ONNX failures

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48722

Reviewed By: janeyx99

Differential Revision: D25274020

Pulled By: malfet

fbshipit-source-id: 9e0f9daba58ceeec5474d649d1b22bfeca91d7bc
2020-12-02 09:16:20 -08:00
4abca9067b Fix dataloader hang with large sampler (#48669)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48669

Reviewed By: zhangguanheng66

Differential Revision: D25255763

Pulled By: VitalyFedyunin

fbshipit-source-id: d06421f52bb1d00cdf8025f1a2ba0d1f9284731a
2020-12-02 09:07:30 -08:00
3b25af02a4 matrix_exp + matrix_exp.backward complex support (#48363)
Summary:
As per title. Fixes https://github.com/pytorch/pytorch/issues/48299.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48363

Reviewed By: ejguan

Differential Revision: D25224498

Pulled By: albanD

fbshipit-source-id: 0c80ffb03ccfc46ab86398911edfba0b09049e55
2020-12-02 08:35:14 -08:00
e41e780f7a Added support for complex input for torch.lu_solve #2 (#48028)
Summary:
Relanding https://github.com/pytorch/pytorch/pull/46862
There was an issue with the simultaneous merge of two slightly conflicting PRs.

This PR adds `torch.lu_solve` for complex inputs both on CPU and GPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48028

Reviewed By: linbinyu

Differential Revision: D25003700

Pulled By: zou3519

fbshipit-source-id: 24cd1babe9ccdbaa4e2ed23f08a9153d40d0f0cd
2020-12-02 08:13:02 -08:00
6d6e9abe49 Delete NativeFunctions.h include from Functions.h (#48687)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48687

Only one header needed to be updated to now include NativeFunctions.h

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D25261845

Pulled By: ezyang

fbshipit-source-id: de778b5e014c812c52a307841827193ce823afcc
2020-12-02 07:57:25 -08:00
e097f8898c Move var and std overloads to Functions.cpp and remove native:: reference (#48683)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48683

I want to delete the NativeFunctions.h include from the Functions.h header. To do this I must remove all references to native::. However, I also must avoid trampling over iseeyuan's work of making ATen compilable without reference to ATen_cpu. In this particular case, I fix the Functions.h problem by moving the code to a cpp file and removing the native:: short-circuit (ostensibly there for performance). This also fixes a hypothetical correctness bug where these functions would not dispatch properly if the underlying functions no longer uniformly used a single native:: implementation.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D25261843

Pulled By: ezyang

fbshipit-source-id: 05ca6555fbf1062f9b22d868c8cb88fdf8e4c24b
2020-12-02 07:57:20 -08:00
6ba7709415 Refactor TensorIterator to do allocations via MetaBase::set_output (#48659)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48659

Detailed RFC at
https://github.com/pytorch/rfcs/blob/rfc-0005/RFC-0005-structured-kernel-definitions.md#handling-tensoriterator

What this diff does:
* Refactor allocation of outputs in TensorIterator into a call to a single function TensorIterator::set_output.  This nicely centralizes restriding logic and mostly eliminates the need for a separate named tensor propagation pass. The one exception is for inplace operations (`add_`), where previously we never actually call `set_output` when we determine resizing is not necessary; there's an extra propagate names in `allocate_or_resize_outputs` to handle this case (I audited all other `set_output` sites and found that we always hit this path in that situation). Although hypothetically this could cause problems for structured kernels (which require a `set_output` call in all cases), this codepath is irrelevant for structured kernels as a TensorIterator will never be constructed with an explicit out argument (remember, structured kernels handle out/functional/inplace variants). There's also a tricky case in `compute_types`; check the comments there for more details.
* Split TensorIterator into a TensorIteratorBase, which contains most of the logic but doesn't define `set_output`. A decent chunk of the diff is just the mechanical rename of TensorIterator to TensorIteratorBase. However, there are a few cases where we create fresh TensorIterator objects from another TensorIterator. In those cases, we always construct a fresh TensorIterator (rather than preserving the subclass of TensorIteratorBase that induced this construction). This makes sense, because a structured function class will contain metadata that isn't relevant for these downstream uses. This is done by *intentionally* permitting object slicing with the `TensorIterator(const TensorIteratorBase&)` constructor.
* Introduce a new `MetaBase` class which contains the canonical virtual method definition for `set_output`. This will allow structured classes to make use of it directly without going through TensorIterator (not in this PR).
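
A toy illustration of the intentional object slicing from the second bullet (stand-in names, not the real classes):

```cpp
struct IteratorBase {
  int shared_config = 0;        // state every consumer needs
  virtual void set_output() {}  // allocation policy supplied by subclasses
  virtual ~IteratorBase() = default;
};

struct Iterator : IteratorBase {
  Iterator() = default;
  // Intentional slicing: copy only the IteratorBase part, dropping any
  // subclass metadata that is irrelevant to downstream uses.
  explicit Iterator(const IteratorBase& other) : IteratorBase(other) {}
  void set_output() override {}
};

int main() {
  Iterator fresh;
  Iterator copy(static_cast<const IteratorBase&>(fresh));
  return copy.shared_config;
}
```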

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D25261844

Pulled By: ezyang

fbshipit-source-id: 34a9830cccbc07eaaf7c4f75114cd00953e3db7d
2020-12-02 07:57:15 -08:00
742903c0df Move argument grouping into FunctionSchema (#48195)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48195

The general approach is to change Arguments, splitting `positional`, `kwarg_only` and `out`, into `pre_self_positional`, `self_arg`, `post_self_positional`, and `pre_tensor_options_kwarg_only`, `tensor_options` and `post_tensor_options_kwarg_only`. The splits are as you'd expect: we extract out the self argument and the tensor options arguments, and record the other arguments that came before and after. To do this, we move the logic in `group_arguments` to the parsing process.

Some fuzz in the process:
* I renamed `ThisArgument` to `SelfArgument`, since we don't actually use the terminology "this" outside of C++ (and the model is Python-biased)
* I kept the `group_arguments` function, which now just reads the arguments out of the structured model in the correct order. In the long term we should get rid of this function entirely, but for now I kept it as-is to reduce churn.
* I decided to arbitrarily say that when self is missing, everything goes in "post-self", but when tensor options are missing, everything goes in "pre-tensor-options". This was based on where you typically find the argument in question: self is usually at the front (so most args come after it), while tensor options are typically at the end (so most args go before it).

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zhangguanheng66

Differential Revision: D25231166

Pulled By: ezyang

fbshipit-source-id: 25d77ad8319c4ce0bba4ad82e451bf536ef823ad
2020-12-02 07:57:11 -08:00
ba5686f8c5 Refactor argument fields in FunctionSchema to Arguments (#48182)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48182

I'm planning to add a bunch more argument fields following
https://github.com/pytorch/pytorch/pull/45890#discussion_r503646917 and
it will be a lot more convenient if the arguments get to live
in their own dedicated struct.  Type checker will tell you if
I've done it wrong.  No change to output.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D25057897

Pulled By: ezyang

fbshipit-source-id: dd377181dad6ab0c894d19d83408b7812775a691
2020-12-02 07:57:06 -08:00
b4f5efa7b2 Structured kernels generate Meta registrations (#48116)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48116

If you port kernels to be structured, you get Meta kernels automatically
generated for you.  This is one payoff of structured kernels.

Code generation was mercifully simple, although at risk of
"swiss cheese" syndrome: there are two new conditionals in the codegen
that tweak behavior when generating for meta keys. It's not too bad
right now, but there's a risk of things getting out of hand. One
way to rationalize the logic here would be to transmit "TensorMeta-ness"
inside the TensorOptions (so tensor_from_meta can deal with it); then
the "Meta" kernel magic would literally just be generating empty
out_impls to call after all the scaffolding is done. But I didn't
do this because it seemed like it would be more annoying in the short term.

Also had to teach resize_ to work on meta tensors, since we use them
to implement the out kernels.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer, ailzhang

Differential Revision: D25056640

Pulled By: ezyang

fbshipit-source-id: f8fcfa0dbb58a94d9b4196748f56e155f83b1521
2020-12-02 07:54:48 -08:00
47db191f0c Implement Kumaraswamy Distribution (#48285)
Summary:
This PR implements the Kumaraswamy distribution.

cc: fritzo alicanb sdaulton

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48285

Reviewed By: ejguan

Differential Revision: D25221015

Pulled By: ezyang

fbshipit-source-id: e621b25a9c75671bdfc94af145a4d9de2f07231e
2020-12-02 07:46:45 -08:00
9c6979a266 [Gradient Compression] Error feedback for PowerSGD (still need to fix the key in error_dict) (#48670)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48670

Support an optional error feedback for PowerSGD -- storing the difference (i.e., the local error caused by compression) between the input gradient (adjusted by the existing error) and the gradient after decompression, and reinserting it at the next iteration.

Still need to add an index field to GradBucket as the key of error_dict. This is because the current key, the bucket's input tensor, can change across steps, as the buckets may be rebuilt in the forward pass to save peak memory usage.

This is the first half of error feedback; the new index field will be added in a separate PR.
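
A hedged sketch of the error-feedback loop itself (`compress_decompress` is a hypothetical stand-in, not the hook's API):

```python
import torch

def compress_decompress(grad, rank=1):
    # Rank-`rank` approximation standing in for PowerSGD compression.
    u, s, v = torch.svd_lowrank(grad, q=rank)
    return u @ torch.diag(s) @ v.t()

error = torch.zeros(4, 4)        # persisted per bucket across iterations
for _ in range(3):
    grad = torch.randn(4, 4)
    adjusted = grad + error      # reinsert last iteration's local error
    approx = compress_decompress(adjusted)
    error = adjusted - approx    # store the compression error for next time
```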

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117636492

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

Reviewed By: rohan-varma

Differential Revision: D25240290

fbshipit-source-id: 5b6e11e711caccfb8984ac2767dd107dbf4c9b3b
2020-12-02 06:39:30 -08:00
463e5d2f12 Disable pruning on embedding look up operators when compressed_indices_mapping = {0} (#48672)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48672

When a user specifies `pruned_weights = True`, compressed_indices_mapping = {0}, it means they have not pruned the weights. In this case, we need to go through non-sparse kernels for embedding bag lookup.

Test Plan:
buck test //caffe2/test:quantization
https://www.internalfb.com/intern/testinfra/testconsole/testrun/3377699760676256/

Reviewed By: radkris-git

Differential Revision: D25252904

fbshipit-source-id: 3a97dfd41ec8113d61135f02d9f534df3419e81f
2020-12-02 06:28:03 -08:00
74330e0497 Added linalg.matrix_rank (#48206)
Summary:
This PR adds `torch.linalg.matrix_rank`.

Changes compared to the original `torch.matrix_rank`:
- input with the complex dtype is supported
- batched input is supported
- "symmetric" kwarg renamed to "hermitian"

Should I update the documentation for `torch.matrix_rank`?

For input with no elements (for example, a 0×0 matrix), the current implementation diverges from NumPy. NumPy stumbles on the undefined max for such input; here I chose to return an appropriately sized tensor of zeros, which I think is mathematically the correct thing to do.
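
A quick illustration of the empty-input behavior described above (expected output per this description, not verified against NumPy here):

```python
import torch

a = torch.empty(2, 0, 0)             # a batch of two 0x0 matrices
print(torch.linalg.matrix_rank(a))   # expected: tensor([0, 0])
```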

Ref https://github.com/pytorch/pytorch/issues/42666.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48206

Reviewed By: albanD

Differential Revision: D25211965

Pulled By: mruberry

fbshipit-source-id: ae87227150ab2cffa07f37b4a3ab228788701837
2020-12-02 03:29:25 -08:00
6646ff122d Revert D25199264: Enable callgrind collection for C++ snippets
Test Plan: revert-hammer

Differential Revision:
D25199264 (ff097299ae)

Original commit changeset: 529244054e4c

fbshipit-source-id: 7429d7154f92e097089bf51dc81042b766de9cc3
2020-12-02 02:26:58 -08:00
6299c870ee Revert D25254920: [pytorch][PR] Add type annotations to torch.onnx.* modules
Test Plan: revert-hammer

Differential Revision:
D25254920 (40a2dd7e1e)

Original commit changeset: dc9dc036da43

fbshipit-source-id: c17cb282ebf90ecbae4023aa63ecbb443a87037d
2020-12-02 02:25:31 -08:00
bcc85a363e [numpy] torch.sigmoid : promote integer inputs to float (#47551)
Summary:
Reference https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47551

Reviewed By: ngimel

Differential Revision: D25211953

Pulled By: mruberry

fbshipit-source-id: 9174cda401aeba0fd585a4c9bda166dbcf64f42f
2020-12-01 23:28:57 -08:00
44016e66c4 Revert D25097324: [pytorch][PR] [ONNX] Cast Gather index to Long if needed
Test Plan: revert-hammer

Differential Revision:
D25097324 (55fc0e9e53)

Original commit changeset: 42da1412d1b9

fbshipit-source-id: 491994a35a8aaf207dd5905191847171586aa4b7
2020-12-01 20:59:28 -08:00
15abf18b67 [MaskR-CNN] Add int8 aabb bbox_transform op
Summary: Adds support for Eigen Utils for custom type defs.

Reviewed By: vkuzo

Differential Revision: D23753697

fbshipit-source-id: de1cfb1c8176a08dd418364f2fce003344fe25bb
2020-12-01 20:51:50 -08:00
40a2dd7e1e Add type annotations to torch.onnx.* modules (#45258)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45215

Still need to resolve a few mypy issues before a review. In particular, there is an error which I don't know how to solve; see:
```python
torch/onnx/utils.py:437: error: Name 'is_originally_training' is not defined  [name-defined]
        if training is None or training == TrainingMode.EVAL or (training == TrainingMode.PRESERVE and not is_originally_training):
```

`is_originally_training` is used but never defined/imported in [`torch/onnx/utils.py`](ab5cc97fb0/torch/onnx/utils.py (L437)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45258

Reviewed By: zhangguanheng66

Differential Revision: D25254920

Pulled By: ezyang

fbshipit-source-id: dc9dc036da43dd56b23bd6141e3ab92e1a16e3b8
2020-12-01 20:41:39 -08:00
ff097299ae Enable callgrind collection for C++ snippets (#47865)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47865

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25199264

Pulled By: robieta

fbshipit-source-id: 529244054e4cc01e4703b7b9720833d991452943
2020-12-01 20:03:17 -08:00
0225d3dc9d Add support for timing C++ snippets. (#47864)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47864

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25199262

Pulled By: robieta

fbshipit-source-id: 1c2114628ed543fba4f403bf49c065f4d71388e2
2020-12-01 20:03:14 -08:00
17ea11259a Rework compat bindings. (#47863)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47863

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25199261

Pulled By: robieta

fbshipit-source-id: 0a4a0409ddb75c1bf66cd31d67b55080227b1679
2020-12-01 20:03:11 -08:00
07f038aa9d Add option for cpp_extensions to compile standalone executable (#47862)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47862

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25199265

Pulled By: robieta

fbshipit-source-id: eceb04dea60b82eb10434099639fa3afa61000ca
2020-12-01 20:03:08 -08:00
27905dfe9c Expose CXX_FLAGS through __config__ (#47861)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47861

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25199263

Pulled By: robieta

fbshipit-source-id: 3cfdb0485d686a03a68dd0907d1733634857963f
2020-12-01 19:58:29 -08:00
b824fc4de2 [pytorch] [PR] Rename cuda kernel checks to C10 (#48615)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48615

Convert the macro from `TORCH_CUDA_KERNEL_LAUNCH_CHECK` to `C10_CUDA_KERNEL_LAUNCH_CHECK`, since it is now accessible through c10, not just torch.

Test Plan:
```
buck build //caffe2/caffe2:caffe2_cu
buck build //caffe2/aten:ATen-cu
buck test //caffe2/test:kernel_launch_checks -- --print-passing-details
```

Reviewed By: jianyuh

Differential Revision: D25228727

fbshipit-source-id: 9c65feb3d0ea3fbd31f1dcaecdb88ef0534f9121
2020-12-01 18:19:07 -08:00
25e367ec48 Revert D25246563: [pytorch][PR] [ROCm] remove builds for versions less than 3.8
Test Plan: revert-hammer

Differential Revision:
D25246563 (c5f1117be2)

Original commit changeset: cd6142286813

fbshipit-source-id: fec302da9802736cb88ae25c3b58705d93cd9920
2020-12-01 17:50:13 -08:00
8b2ca28c1d Add an option to run RPC tests with TCP init (#48248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48248

We have found a few bugs where initializing/de-initializing/re-initializing RPC, and using RPC along with process groups, does not work as expected, usually under TCP/env initialization (which multi-machine scenarios use instead of the `file` init that our tests use).

Due to this motivation, this PR adds an environment variable `RPC_INIT_WITH_TCP` that allows us to run any RPC test with TCP initialization.

To avoid port collisions, we use `common.find_free_port()`.
ghstack-source-id: 117553039

Test Plan: CI

Reviewed By: lw

Differential Revision: D25085458

fbshipit-source-id: b5dbef2ff8ae88fa5bc1bb85a9e0fe077dbb552c
2020-12-01 17:42:32 -08:00
d0e9523c4f [TensorExpr] Add more operator tests. (#48677)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48677

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D25258656

Pulled By: ZolotukhinM

fbshipit-source-id: 173b87568f3f29f04d06b8621cbfbd53c38e4771
2020-12-01 17:34:09 -08:00
f7986969af [FX] Delete values after their last use (#48631)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48631

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D25235981

Pulled By: jamesr66a

fbshipit-source-id: f79d8873d3ad1ad90b5bd6367fc6119925f116e9
2020-12-01 17:20:49 -08:00
cff1ff7fb6 Suppress unsigned warning (#48272)
Summary:
Fixes a pointless comparison against zero warning that arises for some scalar types

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48272

Test Plan:
Arises with
```
xbuck test mode/dev-nosan //caffe2/torch/fb/sparsenn:gpu_test -- test_prior_correction_calibration_prediction_binary
```

Fixes issues raised by https://github.com/pytorch/pytorch/issues/47876: `std::is_signed` was a poor choice; `std::is_unsigned` is a better choice. Surprisingly, the two are not simply complements of each other.
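
A small illustration of that non-reciprocity (both traits are false for non-arithmetic types):

```cpp
#include <type_traits>

static_assert(std::is_signed<float>::value, "arithmetic: signed");
static_assert(!std::is_unsigned<float>::value, "arithmetic: not unsigned");

struct NotANumber {};
// For a non-arithmetic type both traits are false, so !is_signed<T>
// does not imply is_unsigned<T>.
static_assert(!std::is_signed<NotANumber>::value, "");
static_assert(!std::is_unsigned<NotANumber>::value, "");

int main() { return 0; }
```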

Reviewed By: zhangguanheng66

Differential Revision: D25256251

Pulled By: r-barnes

fbshipit-source-id: 31665f5b0bc7eebee7456b85c37c5bce3f738bea
2020-12-01 17:09:09 -08:00
18f1cb14d5 Avoid resizing ones array when bias is not used (#48540)
Summary:
https://github.com/pytorch/pytorch/issues/48539

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48540

Reviewed By: zhangguanheng66

Differential Revision: D25255175

Pulled By: ezyang

fbshipit-source-id: 755435a0adf9129a2edbffbad252e95a05e84a5f
2020-12-01 16:21:56 -08:00
f5788898a9 TensorIteratorConfig is not used by reorder_dimensions (#48613)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48613

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zhangguanheng66

Differential Revision: D25228679

Pulled By: ezyang

fbshipit-source-id: 06d57e89e7c9cfa84e2b0886c6e1f3a9fa06978a
2020-12-01 16:13:08 -08:00
75f38c2fa9 ret is never reassigned, return 0 directly (#48609)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48609

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zhangguanheng66

Differential Revision: D25228678

Pulled By: ezyang

fbshipit-source-id: b0b501866c9beb509b0c8c37d074e2d276085a56
2020-12-01 16:08:11 -08:00
30324d1e71 fix INTERNAL ASSERT FAILED for maximum (#48446)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48446

Reviewed By: zhangguanheng66

Differential Revision: D25240270

Pulled By: ngimel

fbshipit-source-id: 57fc223b98f2b6f96f2f24e1d9041644e3187262
2020-12-01 15:29:48 -08:00
1c02be1b6a Fix AttributeError in _get_device_attr (#48406)
Summary:
In PyTorch 1.5, when running `torch.cuda.reset_peak_memory_stats()` on a machine where `torch.cuda.is_available() is False`, I would get:
```
AssertionError:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx
```

In PyTorch 1.7, the same gets me a worse error (and a user warning about missing NVIDIA drivers if you look for it):
```
...
  File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 440, in _get_device_attr
    if device_type.lower() == "cuda":
AttributeError: 'NoneType' object has no attribute 'lower'
```

The formerly raised AssertionError is depended on by libraries like pytorch_memlab: ec9a72fc30/pytorch_memlab/line_profiler/line_profiler.py (L90)
It would be pretty gross if pytorch_memlab had to change that to catch an AttributeError.

With this patch, we get a more sensible:
```
...
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/memory.py", line 209, in reset_peak_memory_stats
    return torch._C._cuda_resetPeakMemoryStats(device)
RuntimeError: invalid argument to reset_peak_memory_stats
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48406

Reviewed By: mrshenli

Differential Revision: D25205630

Pulled By: ngimel

fbshipit-source-id: 7c505a6500d730f3a2da348020e2a7a5e1306dcb
2020-12-01 14:55:18 -08:00
4fe583e248 fix move default not compile correctly on cuda92 (#48257)
Summary:
Explicitly define the move constructor when using CUDA version <= 9.2 (build 9200).

This fixes https://github.com/pytorch/csprng/issues/84

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48257

Reviewed By: malfet, mrshenli

Differential Revision: D25123467

Pulled By: walterddr

fbshipit-source-id: 72deff82c421fbaada6f38b2b6288f7f2f833062
2020-12-01 14:23:20 -08:00
54022e4f9b add new build settings to torch.__config__ (#48380)
Summary:
Many newly added build settings are not saved in torch.__config__; this adds them to the mix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48380

Reviewed By: samestep

Differential Revision: D25161951

Pulled By: walterddr

fbshipit-source-id: 1d3dee033c93f2d1a7e2a6bcaf88aedafeac8d31
2020-12-01 14:16:36 -08:00
d9c76360b2 Add cuda_ipc channel to TensorPipe (#46791)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46791

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25237121

Pulled By: mrshenli

fbshipit-source-id: f1428175b260fb23c4e0e6f92651426f38beaca9
2020-12-01 14:12:00 -08:00
e3713ad706 Let JIT unpickler to accept CUDA DataPtr from read_record_ (#46827)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46827

TensorPipe RPC agent uses JIT pickler/unpickler to serialize/deserialize
tensors. Instead of saving tensors to a file, the agent can directly
invoke `cudaMemcpy` to copy tensors from the sender to the receiver
before calling into JIT unpickler. As a result, before unpickling,
the agent might already have allocated tensors and need to pass
them to the JIT unpickler. Currently, this is done by providing a
`read_record` lambda to unpickler for CPU tensors, but this is
no longer sufficient for zero-copy CUDA tensors, as the unpickler
always allocates the tensor on CPU.

To address the above problem, this commit introduces a `use_storage_device`
flag to unpickler ctor. When this flag is set, the unpickler will
use the device from the `DataPtr` returned by the `read_record`
lambda to override the pickled device information and therefore
achieves zero-copy.

Test Plan: Imported from OSS

Reviewed By: wanchaol

Differential Revision: D24533218

Pulled By: mrshenli

fbshipit-source-id: 35acd33fcfb11b1c724f855048cfd7b2991f8903
2020-12-01 14:09:09 -08:00
5f181e2e6e centos now installs cmake from conda (#48035)
Summary:
For the same reason that Ubuntu builds need conda's CMake to find MKL.

CC jaglinux

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48035

Reviewed By: zhangguanheng66

Differential Revision: D25246723

Pulled By: malfet

fbshipit-source-id: 9dac130eecd2f76764d8027b888404c87e7a954a
2020-12-01 13:07:20 -08:00
3ceec73db9 [PyTorch] Lazily construct guts of RecordFunction (#47550)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47550

I saw over 5% time spent in RecordFunction's ctor during one
of our framework overhead benchmarks in `perf`. Inspecting assembly,
it looks like we just create a lot of RecordFunctions and the
constructor has to initialize a relatively large number of member
variables.

This diff takes advantage of the observation that RecordFunction does
nothing most of the time by moving its state onto the heap and only
allocating it if needed. It does add the requirement that profiling is
actually active to use RecordFunction accessors, which I hope won't be
a problem.
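
A minimal sketch of the lazy-state pattern described here (names are illustrative, not the actual RecordFunction internals):

```cpp
#include <cstdint>
#include <memory>

// All the rarely-needed members move into a separate struct...
struct ProfilingState {
  int64_t handle = 0;
  bool needs_inputs = false;
  // ...many more fields in the real class...
};

class RecordFunctionSketch {
 public:
  explicit RecordFunctionSketch(bool profiling_active) {
    // ...which is only allocated when profiling is actually on, so the
    // common (inactive) case pays for a single null pointer.
    if (profiling_active) {
      state_ = std::make_unique<ProfilingState>();
    }
  }
  // Accessors now require an active state, per the description above.
  bool active() const { return state_ != nullptr; }

 private:
  std::unique_ptr<ProfilingState> state_;
};

int main() {
  RecordFunctionSketch rf(/*profiling_active=*/false);
  return rf.active() ? 1 : 0;
}
```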
ghstack-source-id: 117498489

Test Plan: Run framework overhead benchmarks. Savings ranging from 3% (InPlace_ndim_1) to 7.5% (empty_ndim_3) wall time.

Reviewed By: ilia-cher

Differential Revision: D24812213

fbshipit-source-id: 823a1e2ca573d9a8d7c5b7bb3972987faaacd11a
2020-12-01 13:07:17 -08:00
d1df4038ff [PyTorch] Make RecordFunctionCallback::should_run_ a function pointer (#48274)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48274

The std::function-ness of it was used only for tests. (std::function is huge at 32 bytes, and not particularly efficient.)
ghstack-source-id: 117498491

Test Plan: CI

Reviewed By: dzhulgakov

Differential Revision: D25102077

fbshipit-source-id: fd941ddf32235a9659a1a17609c27cc5cb446a54
2020-12-01 13:02:25 -08:00
9342b97363 change global_fp16_constants for test_fc_nnpi_fp16 (#48663)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48663

Enable the flag inside the test.

Test Plan:
GLOW_NNPI=1 USE_INF_API=1 buck-out/opt/gen/caffe2/caffe2/contrib/fakelowp/test/test_fc_nnpi_fp16nnpi#binary.par
buck test -c glow.nnpi_use_inf_api=true mode/opt //caffe2/caffe2/contrib/fakelowp/test:test_fc_nnpi_fp16nnpi

Reviewed By: hl475

Differential Revision: D25249575

fbshipit-source-id: bb0a64859fa8e70eeea458376998142f37361525
2020-12-01 12:54:43 -08:00
c5f1117be2 [ROCm] remove builds for versions less than 3.8 (#48118)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48118

Reviewed By: zhangguanheng66

Differential Revision: D25246563

Pulled By: malfet

fbshipit-source-id: cd6142286813411d542926284fbf65206bc371ae
2020-12-01 12:08:23 -08:00
aaf6582d02 fix issue by which pytorch_jni is not bundled in libtorch (#46466)
Summary:
Fixes issue with pytorch_jni.dll not being installed correctly in libtorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46466

Reviewed By: zhangguanheng66

Differential Revision: D25247564

Pulled By: ezyang

fbshipit-source-id: a509476ec4a0863fd67da3258e9300a9527d4f3b
2020-12-01 11:38:56 -08:00
7c73fda501 Remove balance and devices parameter from Pipe. (#48432)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48432

As per our design in https://github.com/pytorch/pytorch/issues/44827,
this changes the API such that the user places modules on the appropriate devices,
instead of having `balance` and `devices` parameters decide this.

This design allows us to use RemoteModule in the future.
ghstack-source-id: 117491992

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D25172970

fbshipit-source-id: 61ea37720b92021596f69788e45265ac9cd41746
2020-12-01 11:21:59 -08:00
74d6a6106c Fuzzing benchmark for FFT operators (#47872)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47872

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25237499

Pulled By: robieta

fbshipit-source-id: 44eb68c5989508f072b75526ae5dcef30898e4bd
2020-12-01 10:58:53 -08:00
df6fc3d83a Fix complex tensors and missing data in benchmark utility (#47871)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47871

- FuzzedTensor now supports complex data types
- Compare no longer calls min on empty ranges when a table has empty cells


Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25237500

Pulled By: robieta

fbshipit-source-id: 76248647313d4d81590a68297a5f6768fa7d3d82
2020-12-01 10:54:19 -08:00
f80aaadbae fx quantization: add option to leave graph inputs and/or outputs quantized (#48624)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48624

Before this PR, there was an assumption that all graph inputs
and outputs are in floating point, with some exceptions for
`standalone_module`.

This PR adds an option to specify either inputs or outputs
as being quantized.

This is useful for incremental migrations of models using Eager mode.

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25231833

fbshipit-source-id: 9f9da17be72b614c4c334f5c588458b3e726ed17
2020-12-01 10:39:51 -08:00
98fddc1f06 Revert D25172740: [pytorch][PR] [CUDA graphs] Make CUDAGeneratorImpl capturable
Test Plan: revert-hammer

Differential Revision:
D25172740 (2200e72293)

Original commit changeset: c4568605755c

fbshipit-source-id: 3ebc845856096f5707897bfabaf718c8e13e86f0
2020-12-01 09:10:14 -08:00
0066b941f1 Add CUDA kernel checks to fbcode/caffe2/caffe2/sgd (#48347)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48347

Add a safety check `TORCH_CUDA_KERNEL_LAUNCH_CHECK()` after each kernel launch. This only includes changes to `//caffe2/caffe2/sgd`; specifically, these files did not have any kernel launch checks before.
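
A minimal, self-contained sketch of the pattern (assuming the macro boils down to checking `cudaGetLastError()` after the launch; `TORCH_CUDA_KERNEL_LAUNCH_CHECK` itself lives in the PyTorch tree):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Stand-in for TORCH_CUDA_KERNEL_LAUNCH_CHECK(): surface async launch errors
// (bad configuration, missing kernel image, etc.) right at the launch site.
#define KERNEL_LAUNCH_CHECK()                                       \
  do {                                                              \
    cudaError_t err = cudaGetLastError();                           \
    if (err != cudaSuccess) {                                       \
      std::printf("launch failed: %s\n", cudaGetErrorString(err));  \
    }                                                               \
  } while (0)

__global__ void scale(float* x, float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;
}

int main() {
  float* d = nullptr;
  cudaMalloc(&d, 256 * sizeof(float));
  scale<<<1, 256>>>(d, 2.0f, 256);
  KERNEL_LAUNCH_CHECK();  // the line this diff adds after every launch
  cudaFree(d);
  return 0;
}
```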

Files changed were determined by running `python3 caffe2/torch/testing/check_kernel_launches.py`.

Other directories will be done in separate diffs.

Test Plan:
Check build status
```
buck build //caffe2/caffe2:caffe2_cu
```
- https://www.internalfb.com/intern/buck/build/5cb3185e-b481-4f83-9c3a-260827dde5ef

Running
```
python3 caffe2/torch/testing/check_kernel_launches.py
```
 results in no files within this subdirectory (as of now)
- P150434759

Reviewed By: jianyuh

Differential Revision: D24868557

fbshipit-source-id: 1ad02260bcc9d13710bfd577c8d93be52595845c
2020-12-01 08:55:36 -08:00
736e8965e5 Change the type hints of "pooling.py". (#48412)
Summary:
Change the type hints of "AvgPool2d" and "AvgPool3d".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48412

Reviewed By: ejguan

Differential Revision: D25221087

Pulled By: ezyang

fbshipit-source-id: 5fba2a8051a7b3d5508e97763bacfd2140a777bf
2020-12-01 07:27:37 -08:00
c81f2d9a2f Revert D25222215: [quant][fix] Add bias once in conv_fused
Test Plan: revert-hammer

Differential Revision:
D25222215 (d2e429864c)

Original commit changeset: 90c0ab79835b

fbshipit-source-id: 5c8eee107309cfa99cefdf439a62de0b388f9cfb
2020-12-01 07:17:45 -08:00
dc7ab46dcc Fix incorrect warnings in ParameterList/Dict (#48315)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46983.

The solution is based on two components:

1. The introduction of the `_initialized` attribute. This will be used during ParameterList/Dict creation methods `__init__` (introduced in https://github.com/pytorch/pytorch/issues/47772) and  `__setstate__` to not trigger warnings when setting general `Module` attributes.
2. The introduction of the `not hasattr(self, key)` check to avoid triggering warnings when changing general `Module` attributes such as `.training` during the `train()` and `eval()` methods.

Tests related to the fix are added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48315

Reviewed By: mrshenli

Differential Revision: D25130217

Pulled By: albanD

fbshipit-source-id: 79e2abf1eab616f5de74f75f370c2fe149bed4cb
2020-12-01 07:08:33 -08:00
492683bd42 Add LazyConvXd and LazyConvTransposeXd (#47350)
Summary:
This PR implements LazyConvXd and LazyConvTransposeXd based on https://github.com/pytorch/pytorch/issues/44538. (cc. emcastillo and albanD)
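
A hedged usage sketch (assuming the modules surface as `nn.LazyConv2d` etc., as in later releases):

```python
import torch
import torch.nn as nn

conv = nn.LazyConv2d(out_channels=8, kernel_size=3)  # in_channels not given
x = torch.randn(1, 5, 16, 16)
y = conv(x)                  # first forward materializes the parameters
print(conv.in_channels)      # 5, inferred from the input
print(y.shape)               # torch.Size([1, 8, 14, 14])
```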

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47350

Reviewed By: ejguan

Differential Revision: D25220645

Pulled By: albanD

fbshipit-source-id: b5e2e866d53761a3415fd762d05a81920f8b16c3
2020-12-01 07:00:28 -08:00
ccd20e995f [vulkan] convolution old prepacking via cpu-shader (#48330)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48330

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Differential Revision: D25131500

Pulled By: IvanKobzarev

fbshipit-source-id: b11edb94a78f5d6283c7be1887d72a4ca624a9ab
2020-11-30 22:52:43 -08:00
55fc0e9e53 [ONNX] Cast Gather index to Long if needed (#47653)
Summary:
The ONNX Gather op requires its index input to be int32 or int64; however, we don't insert this Cast in our converter.
Therefore, the following UT fails (for opset 11+):
`seq_length.type().scalarType()` is None, so `_arange_cast_helper()` cannot treat everything as integral and instead casts all inputs to float. That float value is then used as a Gather index, so ORT throws an error about a float-typed index.
The fix is to cast the Gather index type to Long if it is not already int/long.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47653

Reviewed By: ejguan

Differential Revision: D25097324

Pulled By: bzinodev

fbshipit-source-id: 42da1412d1b972d4d82c17fb525879c2575820c9
2020-11-30 21:36:17 -08:00
02e58aabe1 [ONNX] Support nonzero(*, as_tuple=True) export (#47421)
Summary:
Support exporting with `as_tuple = true`

Example:
`torch.nonzero(x, as_tuple=True)`

This is the same as

`torch.unbind(torch.nonzero(x), 1)`
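
A quick sanity check of that equivalence:

```python
import torch

x = torch.tensor([[0., 1.], [2., 0.]])
a = torch.nonzero(x, as_tuple=True)
b = torch.unbind(torch.nonzero(x), 1)
assert all(torch.equal(i, j) for i, j in zip(a, b))
print(a)  # (tensor([0, 1]), tensor([1, 0]))
```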

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47421

Reviewed By: malfet

Differential Revision: D24870760

Pulled By: bzinodev

fbshipit-source-id: 06ca1e7ecf95fbf7c28eebce800df958c83264c8
2020-11-30 21:27:43 -08:00
acd4fca376 [caffe2][torch] Clean up unused variable 'device' (#48600)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48600

Fix this warning that pops up with clang and `-Wunused-variable`:
```
caffe2\torch\csrc\jit\frontend\schema_type_parser.cpp(153,30): warning: unused variable 'device' [-Wunused-variable]
```

Test Plan: Locally built & continuous integration

Reviewed By: eellison

Differential Revision: D25194298

fbshipit-source-id: 3af2895fcc96807a9df0ced60ec0af6b14dc0817
2020-11-30 20:56:35 -08:00
9500e8a081 Testing: Improve interaction between dtypes and ops decorators (#48426)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48426

Tests are run on the intersection of the dtypes requested and the types that are
supported by the operator (or are _not_ if `unsupported_dtypes_only` is used).

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25205835

Pulled By: mruberry

fbshipit-source-id: 2c6318a1a3dc9836af7361f32caf9df28d8a792b
2020-11-30 20:46:22 -08:00
d2e429864c [quant][fix] Add bias once in conv_fused (#48593)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48593

Previously, `_conv_forward` would add `self.bias` to the result, so the bias was added twice in the QAT ConvBn module. This PR adds a bias argument to `_conv_forward`, which the ConvBn module now calls with a zero bias.

fixes: https://github.com/pytorch/pytorch/issues/48514

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D25222215

fbshipit-source-id: 90c0ab79835b6d09622dcfec9de4139881a60746
2020-11-30 19:26:17 -08:00
7a59a1b574 add aot_based_partition (#48336)
Summary:
This PR adds support for AOT-based partitioning. Given each node and its corresponding partition ID, it generates the partitions, submodules, and DAG.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48336

Reviewed By: gcatron

Differential Revision: D25226899

Pulled By: scottxu0730

fbshipit-source-id: 8afab234afae67c6fd48e958a42b614f730a61d9
2020-11-30 19:11:02 -08:00
ddb6594971 [Gradient Compression] Add a random generator to PowerSGD state for initializing low-rank matrix Q (#48507)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48507

Previously the random seed was the length of the input tensor, which is not guaranteed to differ across batches. Now a random generator is initialized in the PowerSGD state, and this generator is used to create a random seed that randomizes the low-rank tensor Q at every step.

Therefore, the initial tensor Q is the same across all replicas at a given step, but different at different steps.

`torch.manual_seed` is used in the same way as in https://github.com/epfml/powersgd/blob/master/gradient_reducers.py#L675
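
A sketch of the seeding scheme described above (illustrative, not the hook's exact code):

```python
import torch

rng = torch.Generator()
rng.manual_seed(0)  # every replica starts from the same generator state

def make_q(shape):
    # Same derived seed on all replicas at a given step, fresh every step.
    seed = int(torch.randint(2**31, (1,), generator=rng))
    torch.manual_seed(seed)
    return torch.randn(*shape)

q_step1 = make_q((16, 4))   # identical across replicas
q_step2 = make_q((16, 4))   # identical across replicas, != q_step1
```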

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117483639

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d --
test_powerSGD_ddp_comm_hook_nccl_grad_is_view

Also checked the initial Qs and input random seeds of torch.manual_seed() of different ranks for a few steps in real runs.

Example logs:
Exactly the same random seed across different ranks at the same step on two nodes, and the random seed varies at each step.

{F346971916}

Reviewed By: rohan-varma

Differential Revision: D25191589

fbshipit-source-id: f7f17df3ad2075ecae1a2a56ca082160f7c5fcfc
2020-11-30 18:46:45 -08:00
61936cb11e [PyTorch][JIT] Parameter passing & std::map API usage pass on ProfilingRecord::instrumentGraph (#47960)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47960

Audited this code path after seeing it in profiling. Found some issues:
- Multiple lookups in std::map can be avoided by using `std::map::insert`. It's really a find-or-insert, which is what this code wanted anyway (see the sketch after this list).
- Some unnecessary copying of arguments that could be moved from
- We can move from shared_ptrs that are going out of scope anyway
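
A sketch of the find-or-insert point from the first bullet:

```cpp
#include <map>
#include <string>

int main() {
  std::map<std::string, int> counts;

  // Two lookups: find() walks the tree, then operator[] walks it again.
  if (counts.find("key") == counts.end()) {
    counts["key"] = 0;
  }

  // One lookup: insert() is a find-or-insert; it returns the existing
  // element when the key is already present.
  auto result = counts.insert({"key", 0});
  result.first->second += 1;
  return 0;
}
```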
ghstack-source-id: 116914902

Test Plan: Please advise, as I'm new to this code. Does it have test coverage? Is there a way I can easily measure the performance impact of this change?

Reviewed By: Krovatkin

Differential Revision: D24971041

fbshipit-source-id: 881a45f8958854be0e95fba659e0b64bd341501e
2020-11-30 18:39:28 -08:00
dc7d8a889e caffe2: refactor context to allow being typed (#48340)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48340

This changes the context-managed classes from being defined via a decorator to using inheritance. Inheritance allows Python static type checking to work correctly.

```
context.define_context()
class Bar(object): ...

context.define_context(allow_default=True)
class Foo(object): ...
```

becomes
```
class Bar(context.Managed): ...

class Foo(context.DefaultManaged): ...
```

Behavior differences:
* arg_name has been removed since it's not used anywhere
* classes need to call `super()` in `__enter__/__exit__` methods if they override (none do)

This also defines a context.pyi file to add types for python3. python2 support should not be affected

Test Plan:
ci

  buck test //caffe2/caffe2/python:context_test //caffe2/caffe2/python:checkpoint_test

Reviewed By: dongyuzheng

Differential Revision: D25133469

fbshipit-source-id: 16368bf723eeb6ce3308d6827f5ac5e955b4e29a
2020-11-30 18:31:14 -08:00
adb4fd3f2f [te] Fix comparison ops on booleans (#48384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48384

As title

Test Plan: buck test //caffe2/test:jit -- test_binary_ops

Reviewed By: asuhan

Differential Revision: D25115773

fbshipit-source-id: c5f8ee21692bcf0d78f099789c0fc7c457a1e4a2
2020-11-30 18:21:35 -08:00
d9f5ac0805 [TensorExpr] Add a envvar to disable LLVM backend and use IR Eval instead. (#48355)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48355

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D25139668

Pulled By: ZolotukhinM

fbshipit-source-id: 34dfcceadb24446d103710f00526693a53f3750f
2020-11-30 18:16:28 -08:00
a6f0c3c4f0 [TensorExpr] IREval: fix div for Half dtype. (#48354)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48354

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D25139669

Pulled By: ZolotukhinM

fbshipit-source-id: a7eccad883d8b175d7d73db48bd366382eabea53
2020-11-30 18:14:08 -08:00
671a959233 Disable fast sigmoid since it causes divergence (#48623)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48623

The error introduced by fast sigmoid/tanh seems to accumulate in a way that's detectable in a macro-benchmark (unfortunately I don't have the model demonstrating it in a format that can be publicly committed).
ghstack-source-id: 117496822

Test Plan: Tbh not sure how to test this since I'm not super well-versed in numerics.  I can verify it fixes a model divergence locally.

Reviewed By: navahgar

Differential Revision: D25230376

fbshipit-source-id: c404a0439f190359b72ad65b3f42369c53cae340
2020-11-30 17:46:07 -08:00
29f0e1e2ce Fused8BitRowwiseQuantizedToFloat operator support (#48407)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48407

T79817692: Fused8BitRowwiseQuantizedToFloat operator support for c2_pt_converter.

Also refactored some repeated code from the existing test functions. (Initial commit only has refactoring.)

Test Plan: buck test //caffe2/torch/fb/model_transform/c2_convert:c2_pt_converter_test

Reviewed By: bugra

Differential Revision: D25069936

fbshipit-source-id: 72f6a845a1b4639b9542c6b230c8cd74b06bc5a0
2020-11-30 17:11:39 -08:00
c3bb3827f9 remove unused params in scalar_tensor_static (#48550)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48550

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D25229765

Pulled By: bhosmer

fbshipit-source-id: 220b0b4a85a3d83d947960851a7369f654b8b455
2020-11-30 17:01:22 -08:00
ea0ffbb6e6 [vulkan] Fix Addmm prepacking to persist after GPU flush (#48313)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48313

The previous prepacking method would cause stored data to be lost whenever data was flushed during a `.cpu()` call. I updated the weight/bias prepacking to use the same method as `conv2d` to avoid this.

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D25125405

Pulled By: SS-JIA

fbshipit-source-id: 2533994d522d90824fc25ee78c54016cfd0f3253
2020-11-30 16:09:46 -08:00
5b6b1495b9 Update Windows CI to CUDA 11.1, cuDNN 8.0.5 (#48469)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48469

Reviewed By: walterddr

Differential Revision: D25187095

Pulled By: malfet

fbshipit-source-id: 47e29a172ebe71e60447a5483e63ac59818a0474
2020-11-30 15:48:30 -08:00
7f869dca70 [ROCm] update debug flags (#46717)
Summary:
Improves support for rocgdb when setting DEBUG=1 and building for ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46717

Reviewed By: mrshenli

Differential Revision: D25171544

Pulled By: malfet

fbshipit-source-id: b4699ba2277dcb89f07efb86f7153fae82a80dc3
2020-11-30 15:27:30 -08:00
d6ddd78eb0 Fix multiple spelling and grammar mistakes (#48592)
Summary:
I found a number of spelling & grammatical mistakes in the repository. Previously I had these fixes submitted individually, but I saw that a single word change was apparently too small for a PR to be merged. Hopefully this new PR has a sufficient number of changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48592

Reviewed By: ejguan

Differential Revision: D25224216

Pulled By: mrshenli

fbshipit-source-id: 2af3db2aee486563efd0dffc4e8f777306a73e44
2020-11-30 15:18:44 -08:00
2200e72293 [CUDA graphs] Make CUDAGeneratorImpl capturable (#47989)
Summary:
Part 1 of https://github.com/pytorch/pytorch/pull/46148 refactor:  CUDAGeneratorImpl and eager mode kernel diffs.  See [Note [CUDA Graph-safe RNG states]](https://github.com/pytorch/pytorch/compare/master...mcarilli:cudagraphs_generator_diffs?expand=1#diff-0b7fb41bc872bb4d1b6480d4fbbb70e6871c16b8c439a97d9d8ecc6c8b893bc2R13) for the strategy, based on https://github.com/pytorch/pytorch/pull/46148#issuecomment-724414794.

By itself, this PR is a "no-op":  it's unusable without cooperation from CUDA graph capture and replay bindings.  Part 2 will add those bindings and tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47989

Reviewed By: mrshenli

Differential Revision: D25172740

Pulled By: ngimel

fbshipit-source-id: c4568605755c7b2d28d09d0fbb96837b494e6443
2020-11-30 15:11:23 -08:00
4976208e73 [caffe2] Register BlackBoxPredictor AllocationArenaPool as CPUCachingAllocator (#48161)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48161

- Register BlackBoxPredictor AllocationArenaPool as CPUCachingAllocator
- Use the AllocationArenaPool in both BlackBoxPredictor and StaticRuntime

Test Plan:
```
buck run //caffe2/caffe2/fb/predictor:black_box_predictor_test
buck run //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```
AF canary:
https://www.internalfb.com/intern/ads/canary/431021257540238874/

Reviewed By: dzhulgakov

Differential Revision: D24977611

fbshipit-source-id: 33ba596b43c1e558c3ab237a0feeae93565b2d35
2020-11-30 15:03:34 -08:00
d386d3323f [dper] suppress excessive msg (#48404)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48404

On Bento this prints a lot of messages like the following (see N408483 if you're an internal user):
```
W1123 120952.322 schema.py:811] Scalar should be considered immutable. Only call Scalar.set() on newly created Scalar with unsafe=True. This will become an error soon.
```
It also ignores the log level I set at the global level. Removing this line unless it is truly important.

Test Plan: build a local dper package and verify

Differential Revision: D25163808

fbshipit-source-id: 338d01c82b4e67269328bbeafc088987c4cbac75
2020-11-30 14:55:52 -08:00
d74f2d28a1 Fix bazel build after sleef update (#48614)
Summary:
In https://github.com/shibatch/sleef/pull/361, `src/libm/sleeflibm_header.h.org` was renamed to `src/libm/sleeflibm_header.h.org.in`.
Updating the bazel build rule for sleef accordingly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48614

Reviewed By: ezyang

Differential Revision: D25228160

Pulled By: malfet

fbshipit-source-id: dc0e56c2eb8990a19b5de14318e41ea9661c63f8
2020-11-30 14:47:11 -08:00
66440d1b29 Tweak Vulkan memory use. (#47728)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47728

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Differential Revision: D25032740

Pulled By: AshkanAliabadi

fbshipit-source-id: 7eb72538dc1aa3feb4e2f8c4ff9c675eb8e97057
2020-11-30 14:28:09 -08:00
8f8738ce5c [vmap] implement batching rules for clamp, clamp_min and clamp_max (#48449)
Summary:
Fix https://github.com/pytorch/pytorch/issues/47754

- This PR implements batching rules for the `clamp`, `clamp_min`, and `clamp_max` operators (see the sketch below).
- Test cases are added to `test/test_vmap.py`.
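
A minimal sketch of what these batching rules enable, assuming the experimental `torch.vmap` API of that era:

```
import torch

# vmap-ing clamp over the leading batch dimension should match calling
# clamp directly on the batched tensor.
x = torch.randn(8, 5)
out = torch.vmap(lambda t: t.clamp(min=-1.0, max=1.0))(x)
assert torch.allclose(out, x.clamp(min=-1.0, max=1.0))
```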

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48449

Reviewed By: ejguan

Differential Revision: D25219360

Pulled By: zou3519

fbshipit-source-id: 0b7e1b00f5553b4578f15a6cc440640e506b4918
2020-11-30 14:22:43 -08:00
b5149513ec migrate export_caffe2_op_to_c10.h macros to the new dispatcher registration API, update code_analyzer regex (#48308)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48308

The original regex that I added didn't correctly match namespaces that started with an underscore (e.g. `_test`), which caused a master-only test to fail.

The only change from the previous commit is that I updated the regex like so:

before: `^.*TORCH_LIBRARY_IMPL_init_([^_]+)_([^_]+)_[0-9]+(\(.*)?$`
after: `^.*TORCH_LIBRARY_IMPL_init_([_]*[^_]+)_([^_]+)_[0-9]+(\(.*)?$`

I added in a `[_]*` to the beginning of the namespace capture. I did the same for the `_FRAGMENT` regex.

Verified that running `ANALYZE_TEST=1 tools/code_analyzer/build.sh` (as the master-only test does) produces no diff in the output.
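
A quick standalone sanity check of the fix, using the two patterns quoted above:

```
import re

OLD = re.compile(r'^.*TORCH_LIBRARY_IMPL_init_([^_]+)_([^_]+)_[0-9]+(\(.*)?$')
NEW = re.compile(r'^.*TORCH_LIBRARY_IMPL_init_([_]*[^_]+)_([^_]+)_[0-9]+(\(.*)?$')

line = "TORCH_LIBRARY_IMPL_init__test_CPU_0"
print(OLD.match(line))           # None: `[^_]+` rejects the leading underscore
print(NEW.match(line).groups())  # ('_test', 'CPU', None)
```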

Fixing the regex pattern to allow for underscores at the beginning of the namespace.

This reverts commit 3c936ecd3c68f395dad01f42935f20ed8068da02.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25123295

Pulled By: bdhirsh

fbshipit-source-id: 54bd1e3f0c8e28145e736142ad62a18806bb9672
2020-11-30 13:05:33 -08:00
032e4f81a8 Fix test comparison ops check for scalar overflow (#48597)
Summary:
The test should verify that all listed conditions throw, not just the first one.
Refactor duplicated constants.
Use `self.assertTrue()` instead of suppressing the flake8 `B015: Pointless Comparison` warning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48597

Reviewed By: mruberry

Differential Revision: D25222734

Pulled By: malfet

fbshipit-source-id: 7854f755a84f23a1a52dc74402582e34d69ff984
2020-11-30 12:39:28 -08:00
b84d9b48d8 Fix the typo error on line #953 of the docs of 'torch/nn/modules/activation.py' (#48577)
Summary:
The title says it all.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48577

Reviewed By: ejguan

Differential Revision: D25224315

Pulled By: mrshenli

fbshipit-source-id: 8e34e9ec29b28768834972bfcdb443efd184f9ca
2020-11-30 12:02:40 -08:00
eba96b91cc Back out "[pytorch][PR] [JIT] Add __prepare_scriptable__ duck typing to allow replacing nn.modules with scriptable preparations"
Summary: Original commit changeset: 4ddff2d35312

Test Plan: sandcastle

Reviewed By: zhangguanheng66

Differential Revision: D25061862

fbshipit-source-id: 1d0cc5a34b8131ac88304f24394b677131d28e39
2020-11-30 11:49:36 -08:00
fe80638212 added docs to nn.rst (#48374)
Summary:
Fixes  https://github.com/pytorch/pytorch/issues/48198
Added the following functions to a subsection "Global Hooks For Module" in the containers section of nn.rst:
- register_module_forward_pre_hook
- register_module_forward_hook
- register_module_backward_hook

screenshots:
![image](https://user-images.githubusercontent.com/30429206/99903019-9ee7f000-2ce7-11eb-95dd-1092d5e57ce7.png)
![image](https://user-images.githubusercontent.com/30429206/99903027-ac04df00-2ce7-11eb-9983-42ce67de75ba.png)
![image](https://user-images.githubusercontent.com/30429206/99903039-c3dc6300-2ce7-11eb-81c4-a0240067fe23.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48374

Reviewed By: ejguan

Differential Revision: D25219507

Pulled By: albanD

fbshipit-source-id: 0dd9d65f562c001c993ebcb51465e8ddcf631231
2020-11-30 11:34:49 -08:00
4e15877d5c Add documentation for torch.overrides submodule. (#48170)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48170

Reviewed By: ejguan

Differential Revision: D25220942

Pulled By: ezyang

fbshipit-source-id: a2b7f7b565f5e77173d8ce2fe9676a8131f929b6
2020-11-30 11:25:31 -08:00
42e7cdc50a Improve libuv detection on Windows (#48571)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48304

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48571

Reviewed By: ejguan

Differential Revision: D25220903

Pulled By: mrshenli

fbshipit-source-id: a485568621c4e289c5439474c2651186bc63c2f0
2020-11-30 11:16:13 -08:00
0213a3858a .circleci: Add python 3.9 builds for windows (#48138)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48138

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D25039140

Pulled By: seemethere

fbshipit-source-id: d39885562bdd8078a9735f1bc20f9d81cb024edc
2020-11-30 10:44:25 -08:00
af520d9d04 [cmake] clean up blas discovery (#47940)
Summary:
remove useless variable changes in blas discovery

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47940

Reviewed By: malfet

Differential Revision: D25122228

Pulled By: walterddr

fbshipit-source-id: 12bc3ce9e4f89a72b6a92c10d14024e5941f4b96
2020-11-30 10:29:50 -08:00
0b66cdadb6 Pin the rest of flake8 dependencies. (#48590)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48590

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D25220976

Pulled By: ezyang

fbshipit-source-id: 15817f8c5db7fea6efe9b70a1d1e46b8ca36d12b
2020-11-30 10:00:17 -08:00
e41d8b3d3d [JIT] adding missing test cases for test_isinstance.py (#47396)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47396

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D24739765

Pulled By: Lilyjjo

fbshipit-source-id: 881521175c9a4cdcda4555431fdf6861317f2f40
2020-11-30 09:10:15 -08:00
3c9e71c9ad fix BUILD_MOBILE_BENCHMARK typo (#48515)
Summary:
BUILD_MOBILE_BENCHMARKS in CMakeLists.txt should be BUILD_MOBILE_BENCHMARK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48515

Reviewed By: albanD

Differential Revision: D25198724

Pulled By: mrshenli

fbshipit-source-id: 12765d10c272da04cb104202fcbabc6a0b007c5e
2020-11-30 08:38:43 -08:00
5bb2a87a94 Update sleef to fix build issues (#48529)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48532

After PR https://github.com/pytorch/pytorch/issues/48275 updated the sleef submodule, pytorch incremental builds started failing due to shibatch/sleef#349. This updates the submodule to include the CMake fix in shibatch/sleef#361.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48529

Reviewed By: mrshenli

Differential Revision: D25210746

Pulled By: malfet

fbshipit-source-id: b41ac8de94848413397a19259c6affed5b2cb25b
2020-11-30 06:48:32 -08:00
5cb688b714 Merge all vec256 tests into one framework (#47294)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47294

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D24707378

Pulled By: glaringlee

fbshipit-source-id: cc47ddb49bc2a3ecff9359e9623b0d7774743398
2020-11-30 05:13:09 -08:00
bdf360f9f2 [ONNX] Update onnx submodule (#47366)
Summary:
Update onnx submodule to 1.8 release

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47366

Reviewed By: hl475

Differential Revision: D24968733

Pulled By: houseroad

fbshipit-source-id: 2f0a3436ab3c9380ed8ff0887a483743c1209721
2020-11-30 00:05:46 -08:00
755b8158e2 Fix __config__ docs (#48557)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48287

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48557

Reviewed By: ngimel

Differential Revision: D25211872

Pulled By: mruberry

fbshipit-source-id: ac916e16722809e747bd8960675c1477e3a1084d
2020-11-29 23:57:06 -08:00
0e5682d26b Pruning codeowners who don't actual do code review. (#48109)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48109

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D25026754

Pulled By: ezyang

fbshipit-source-id: c8f77a05fad867427789f376ef9da3a697e25353
2020-11-29 19:46:32 -08:00
2fe382e931 annotate torch._tensor_str (#48463)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48462

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48463

Reviewed By: mrshenli

Differential Revision: D25187168

Pulled By: malfet

fbshipit-source-id: bb4ad1c6d376ad37995638615080452c71e36959
2020-11-29 10:09:19 -08:00
36c87f1243 Refactors test_torch.py to be fewer than 10k lines (#47356)
Summary:
Creates multiple new test suites to have fewer tests in test_torch.py, consistent with previous test suite creation like test_unary_ufuncs.py and test_linalg.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47356

Reviewed By: ngimel

Differential Revision: D25202268

Pulled By: mruberry

fbshipit-source-id: 75fde3ca76545d1b32b86d432a5cb7a5ba8f5bb6
2020-11-28 20:11:40 -08:00
272f4db043 Implement NumPy-like function torch.float_power() (#44937)
Summary:
- Related to https://github.com/pytorch/pytorch/issues/38349
- Implements the NumPy-like function `torch.float_power()` (see the example below).
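
A quick illustrative example (the output dtype is stated from my understanding of the NumPy-compatible semantics, not taken from the PR):

```
import torch

# float_power always computes in double (or complex double) precision,
# even for integer inputs, matching numpy.float_power.
print(torch.float_power(torch.tensor([2, 3]), 2))
# tensor([4., 9.], dtype=torch.float64)
```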

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44937

Reviewed By: ngimel

Differential Revision: D25192119

Pulled By: mruberry

fbshipit-source-id: 2e446b8e0c2825f045fe057e30c9419335557a05
2020-11-27 18:01:42 -08:00
25ab39acd0 [numpy] torch.asin : promote integer inputs to float (#48461)
Summary:
Reference https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48461

Reviewed By: ngimel

Differential Revision: D25192319

Pulled By: mruberry

fbshipit-source-id: fd5dffeca9cd98b86782bfa6a9ab367e425ee934
2020-11-27 15:26:58 -08:00
344918576c Migrate eig from the TH to Aten (CUDA) (#44105)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24553

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44105

Reviewed By: ngimel

Differential Revision: D25192116

Pulled By: mruberry

fbshipit-source-id: 87f1ba4924b9174bfe0d9e2ab14bbe1c6bae879c
2020-11-27 15:15:48 -08:00
f95af7a79a [numpy] torch.erf{c} : promote integer inputs to float (#48472)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48472

Reviewed By: ngimel

Differential Revision: D25192324

Pulled By: mruberry

fbshipit-source-id: 6ef2fec8a27425f9c4c917fc3ae25ac1e1f5f454
2020-11-27 15:08:40 -08:00
7df8445242 torch.fft: Remove complex gradcheck workaround (#48425)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48425

gradcheck now natively supports functions with complex inputs and/or outputs.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25176377

Pulled By: mruberry

fbshipit-source-id: d603e2511943f38aeb3b8cfd972af6bf4701ed29
2020-11-26 22:45:59 -08:00
5dfced3b0d work around #47028 until a proper fix is identified (#48405)
Summary:
Otherwise, this test will appear flaky for ROCm even though it is a generic PyTorch issue.

CC albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48405

Reviewed By: mrshenli

Differential Revision: D25183473

Pulled By: ngimel

fbshipit-source-id: 0fa19b5497a713cc6c5d251598e57cc7068604be
2020-11-26 18:33:19 -08:00
84fafbe49c [docs] docstring for no type checked meshgrid (#48471)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48395

I am not sure this is the correct way to fix it, though.

cc mruberry

Locally built preview:
![Screen Shot 2020-11-26 at 14 57 49](https://user-images.githubusercontent.com/32727188/100326034-d14f6100-2ff7-11eb-8abb-53317b9f518e.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48471

Reviewed By: mrshenli

Differential Revision: D25191033

Pulled By: mruberry

fbshipit-source-id: e5d9cb2748f7cb81923a1d4f204ffb330f6da1ee
2020-11-26 17:28:41 -08:00
c5ce995834 reintroduce deadline removal (#48481)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48481

We removed deadlines thinking that was the cause of the int8 test timeout, but it turns out that the int8 tests were failing because of a legitimate bug that was masked as a timeout.

Now that the bug has been fixed, the tests are failing because the test takes more than 10s to run; overriding the deadline is an option we have used before:
https://www.internalfb.com/intern/testinfra/diagnostics/2533274833752501.562949971168104.1606367728/

Test Plan: reran the test manually

Reviewed By: venkatacrc

Differential Revision: D25184573

fbshipit-source-id: 0b1b2eaa690472e80b9b0991618da8d792aeb42b
2020-11-26 11:10:29 -08:00
8b248af35d Alias _size_N_t to BroadcastingListN[int] (#48297)
Summary:
Because they are one and the same

Fixes https://github.com/pytorch/pytorch/issues/47528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48297

Reviewed By: eellison

Differential Revision: D25116203

Pulled By: malfet

fbshipit-source-id: 7edc2c89daa3f3302822b1f9b83b41b04658c6b7
2020-11-26 08:09:43 -08:00
e7ca62be08 Fix PyTorch compilation on Apple M1 (#48275)
Summary:
Update cpuinfo and sleef to contain build fixes for M1

Fixes https://github.com/pytorch/pytorch/issues/48145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48275

Reviewed By: walterddr

Differential Revision: D25135153

Pulled By: malfet

fbshipit-source-id: 2a82e14407d6f40c7dacd11109a8499d808c8ec1
2020-11-26 07:08:33 -08:00
18ae12a841 Refactor mkl fft planning to not use Tensor objects (#46910)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46910

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25119656

Pulled By: mruberry

fbshipit-source-id: 77943d6cdf629240c814dc8df530dd7ee4163963
2020-11-25 23:04:41 -08:00
6a37582162 Fix misleading doc string in quint8.h (#48418)
Summary:
The doc string suggested that `quint8` is for signed 8-bit values, when it is in fact the unsigned type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48418

Reviewed By: ngimel

Differential Revision: D25181705

Pulled By: mrshenli

fbshipit-source-id: 70e151b6279fef75505f80a7b0cd50032b4f1008
2020-11-25 20:48:39 -08:00
e56e21b775 Grammatically update the readme docs (#48328)
Summary:
Small grammatical update to the readme docs.

![Capture-py1](https://user-images.githubusercontent.com/65657554/99846018-9b475280-2b9b-11eb-84ab-37e129e4f3e6.PNG)

![Capture-py2](https://user-images.githubusercontent.com/65657554/99846023-9da9ac80-2b9b-11eb-9b3b-0998f53ec2ce.PNG)

![Capture-py3](https://user-images.githubusercontent.com/65657554/99846034-a0a49d00-2b9b-11eb-807e-7200c0b6fef4.PNG)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48328

Reviewed By: linbinyu

Differential Revision: D25132876

Pulled By: mrshenli

fbshipit-source-id: f1214b3098bec6713ef53f226f8d0d33946a5ec1
2020-11-25 19:56:32 -08:00
f1c985695c Enabled gloo backend in test_distributed unit tests for ROCm (#40395)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40395

Reviewed By: ngimel

Differential Revision: D25181692

Pulled By: mrshenli

fbshipit-source-id: 29f478c974791efc0acea210c8c9e574944746a5
2020-11-25 19:51:40 -08:00
db1b0b06c4 Flake8 fixes (#48453)
Summary:
Quiet errors from flake8. Only a couple of code changes for deprecated Python syntax from before 2.4. The rest is just adding noqa markers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48453

Reviewed By: mruberry

Differential Revision: D25181871

Pulled By: ngimel

fbshipit-source-id: f8d7298aae783b1bce2a46827b088fc390970641
2020-11-25 19:09:50 -08:00
55e225a2dc Int8 FC fix to match NNPI ICE-REF step-C (#48459)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48459

Bias should be kept in FP32. There is no need to convert the bias to FP16.

Test Plan: https://internalfb.com/intern/testinfra/testrun/562950128141661

Reviewed By: hyuen

Differential Revision: D25179863

fbshipit-source-id: e25d948c613d2b2d5adf2b674fc2ea4b4c8d3920
2020-11-25 14:58:06 -08:00
3858aaab37 Fix syntax issue in c++ cuda api note (#48434)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48434

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D25173692

Pulled By: glaringlee

fbshipit-source-id: bbd6fa7615200bf1eaea731a4ed251d423412593
2020-11-25 14:31:14 -08:00
4ab2055857 Re-enable only cuda tests wrongly disabled before (#48429)
Summary:
Close https://github.com/pytorch/pytorch/issues/46536

Re-enable only cuda tests wrongly disabled in https://github.com/pytorch/pytorch/pull/45332

See discussions https://github.com/pytorch/pytorch/issues/46536#issuecomment-721386038 and https://github.com/pytorch/pytorch/pull/45332#issuecomment-721350987

~~See also https://github.com/pytorch/pytorch/pull/47237 and https://github.com/pytorch/pytorch/pull/47642~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48429

Reviewed By: ngimel

Differential Revision: D25176368

Pulled By: mruberry

fbshipit-source-id: 3822f5a45e58c0e387624e70ea272d16218901a9
2020-11-25 13:26:35 -08:00
9ecaeb0962 [numpy] Add unary-ufunc tests for erf variants (#47155)
Summary:
Adding Unary Ufunc Test entry for `erf` variants.

We use scipy functions for reference implementation.

We can later update the tests once these functions will update integer input to float.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47155

Reviewed By: ngimel

Differential Revision: D25176654

Pulled By: mruberry

fbshipit-source-id: cb08efed1468b27650cec4f87a9a34e999ebd810
2020-11-25 13:20:14 -08:00
33cc1d6a64 [docs] fix torch.swap{dim/axes} to showup in docs (#48376)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48372

Verified locally that it is generated
![Screenshot from 2020-11-22 20-38-15](https://user-images.githubusercontent.com/19503980/99907517-298a1880-2d03-11eb-9a8f-9809609c2d2d.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48376

Reviewed By: ngimel

Differential Revision: D25176483

Pulled By: mruberry

fbshipit-source-id: 911b57d43319059cc9f809ea0396c3740ff81ff5
2020-11-25 13:15:39 -08:00
bc2c1d7d59 quant: make each line of fx/quantize.py <=80 chars (#48357)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48357

Cleans up the long lines in `torch/quantization/fx/quantize.py`
to fit the 80 character limit, so it's easier to read and looks
better on FB's tools.

In the future we can consider adding a linter for this.

Test Plan:
CI

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25140833

fbshipit-source-id: 78605d58eda0184eb82f510baec26685a34870e2
2020-11-25 09:04:23 -08:00
1d984410fb quant fx: fix typo (#48356)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48356

As titled

Test Plan:
CI

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25140834

fbshipit-source-id: e22f8d1ae77c7eb2ec8275b5fbca7dc5e503a4ca
2020-11-25 09:04:20 -08:00
8581c02a3f quant: add type annotations on quantization.fx.Quantizer matches (#48350)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48350

As titled, continuing to incrementally type quantization.fx.Quantizer.

Test Plan:
```
mypy torch/quantization/
python test/test_quantization.py TestQuantizeFx
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25138947

fbshipit-source-id: fd19bf360077b447ce2272bfd4f6d6b798ae05ac
2020-11-25 08:59:29 -08:00
f7a8bf2855 Use libkineto in profiler (#46470)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46470

Adding the ability to use Kineto (CUPTI) to profile CUDA kernels.
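
A rough usage sketch (this uses the `torch.profiler` front end that later exposed Kineto; it is assumed here for illustration, since this PR itself wires Kineto into the existing autograd profiler):

```
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(1024, 1024, device="cuda")
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    y = x @ x
    torch.cuda.synchronize()
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```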

Test Plan:
USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
python test/test_profiler.py

python test/test_autograd.py -k test_profile
python test/test_autograd.py -k test_record

```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                       Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                      sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                       Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                            aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                            aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                          aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                    aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                            aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                        cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                  cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                               aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                           aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                       cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                              aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
```

benchmark: https://gist.github.com/ilia-cher/a5a9eb6b68504542a3cad5150fc39b1a

Reviewed By: Chillee

Differential Revision: D25142223

Pulled By: ilia-cher

fbshipit-source-id: b0dff46c28da5fb0a8e01cf548aa4f2b723fde80
2020-11-25 04:32:16 -08:00
e9efd8df1b [numpy] torch.log1p : promote integer inputs to float (#48002)
Summary:
Reference https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48002

Reviewed By: ngimel

Differential Revision: D25148911

Pulled By: mruberry

fbshipit-source-id: 902d0ddf699debd6edd1b3d55f5c73932ca45e83
2020-11-24 22:01:07 -08:00
2e0a8b75d8 An implementation of torch.tile as requested in pytorch/pytorch#38349 (#47974)
Summary:
The approach is to simply reuse `torch.repeat` while adding one more piece of functionality to `tile`: prepending 1's to the reps array when the tensor has more dimensions than the reps given as input. Thus, for a tensor of shape (64, 3, 24, 24), reps of (2, 2) become (1, 1, 2, 2), which is what NumPy does (see the example below).
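
For illustration, the padding behavior described above:

```
import torch

# The (2, 2) reps are left-padded with 1's to (1, 1, 2, 2), as in NumPy.
x = torch.rand(64, 3, 24, 24)
print(torch.tile(x, (2, 2)).shape)  # torch.Size([64, 3, 48, 48])
```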

I've encountered some instability with the test on my end, where I could get a random failure of the test (due to, sometimes, a random value of `self.dim()`, and sometimes, segfaults). I'd appreciate any feedback on the test or an explanation for this instability so I can address it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47974

Reviewed By: ngimel

Differential Revision: D25148963

Pulled By: mruberry

fbshipit-source-id: bf63b72c6fe3d3998a682822e669666f7cc97c58
2020-11-24 18:07:25 -08:00
2dff0b3e91 Fix typos in comments (#48316)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48316

Reviewed By: walterddr, mrshenli

Differential Revision: D25125123

Pulled By: malfet

fbshipit-source-id: 6f31e5456cc078cc61b288191f1933711acebba0
2020-11-24 10:56:40 -08:00
671ee71ad4 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D25158667

fbshipit-source-id: 3b2a7facbfbfaaabc2cb5ac22906673b17fd0f15
2020-11-23 05:03:53 -08:00
4ed7f36ed1 Added linalg.eigh, linalg.eigvalsh (#45526)
Summary:
This PR adds `torch.linalg.eigh`, and `torch.linalg.eigvalsh` for NumPy compatibility.
The current `torch.symeig` uses (on CPU) a different LAPACK routine than NumPy (`syev` vs `syevd`). Even though it shouldn't matter in practice, `torch.linalg.eigh` uses `syevd` (as NumPy does).
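
A quick usage sketch of the new functions:

```
import torch

a = torch.randn(4, 4)
a = a + a.T  # eigh expects a symmetric (Hermitian) matrix
w, v = torch.linalg.eigh(a)
assert torch.allclose(torch.linalg.eigvalsh(a), w)
assert torch.allclose(v @ torch.diag(w) @ v.T, a, atol=1e-4)
```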

Ref https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45526

Reviewed By: gchanan

Differential Revision: D25022659

Pulled By: mruberry

fbshipit-source-id: 3676b77a121c4b5abdb712ad06702ac4944e900a
2020-11-22 04:57:28 -08:00
b6654906c7 Fix assertEqual's handling of numpy array inputs (#48217)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48217

Reviewed By: mrshenli

Differential Revision: D25119607

Pulled By: mruberry

fbshipit-source-id: efe84380d3797d242c2aa7d43d2209bcba89cee0
2020-11-22 00:13:42 -08:00
f2da18af14 Add USE_KINETO build option (#45888)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45888

Adding USE_LIBKINETO build option

Test Plan:
USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python
setup.py develop install --cmake

Reviewed By: Chillee

Differential Revision: D25142221

Pulled By: ilia-cher

fbshipit-source-id: d1634a8f9599604ff511fac59b9072854289510c
2020-11-21 20:20:32 -08:00
c5e380bfcb quant: add type annotations on quantization.fx.Quantizer class vars (#48343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48343

Annotates the 4 class variables on `Quantizer` with real types,
fixing the small things uncovered by this along the way.

Test Plan:
```
mypy torch/quantization/
python test/test_quantization.py TestQuantizeFx
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D25136212

fbshipit-source-id: 6ee556c291c395bd8d8765a99f10793ca738086f
2020-11-21 15:31:00 -08:00
6b80b664bb quant: enable mypy on torch/quantization/fx (#48331)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48331

Stops mypy from ignoring type errors in FX quantization files. Fixes the easy typing errors inline, and comments out the harder errors to be fixed at a later time.
After this PR, mypy runs without errors on `torch/quantization`.

Test Plan:
```
> mypy torch/quantization/
Success: no issues found in 25 source files
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25133348

fbshipit-source-id: 0568ef9405b292b80b3857eae300450108843e80
2020-11-21 15:29:27 -08:00
cac553cf34 [Gradient Compression] clang-format test_c10d.py (#48349)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48349

Apply clang-format only.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117305263

Test Plan: N/A

Reviewed By: pritamdamania87

Differential Revision: D25138833

fbshipit-source-id: 4ff112b579c0c5b8146495ebd2976d5faead2c1b
2020-11-21 09:28:39 -08:00
6400d27bbb [Gradient Compression] Define a customized state for PowerSGD comm hook (#48348)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48348

To support features like error feedback and warm start, the PowerSGD comm hook needs to maintain state beyond the process group. Currently this state only includes a process group and a matrix-approximation-rank config.

This diff is pure refactoring; the plan is to add more state fields later.
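
A minimal sketch of the state object described above (the field names are illustrative assumptions, not the exact internal definition):

```
class PowerSGDState:
    def __init__(self, process_group, matrix_approximation_rank=1):
        self.process_group = process_group
        self.matrix_approximation_rank = matrix_approximation_rank
        # Later diffs plan to add fields for error feedback, warm start, etc.
```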

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117305280

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d --
test_powerSGD_ddp_comm_hook_nccl_grad_is_view

Reviewed By: rohan-varma

Differential Revision: D25137962

fbshipit-source-id: cd72b8b01e20f80a92c7577d22f2c96e9eebdc52
2020-11-21 09:25:35 -08:00
b967119906 [TensorExpr] Fix lowering for aten::div. (#48329)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48329

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D25130750

Pulled By: ZolotukhinM

fbshipit-source-id: 7c6345adcaec5f92cd6ce78b01f6a7d5923c0004
2020-11-21 09:20:28 -08:00
5e1faa1d41 [TensorExpr] Fix aten::atan2 lowering and disable aten::pow lowering on CPU. (#48326)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48326

The PR introduces a set of 'cuda-only' ops into the `isSupported` function.
This is done to disable `pow` lowering on CPU, where it is tricky to support
integer versions.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D25129211

Pulled By: ZolotukhinM

fbshipit-source-id: c62ae466e1d9ba9b3020519aadaa2a7fe7942d84
2020-11-21 09:15:42 -08:00
f1d328633c Fix mypy error (#48359)
Summary:
Fixes error introduced in https://github.com/pytorch/pytorch/pull/47657

`node.target` can be either a str or a callable, but this is checked in the pattern matching portion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48359

Reviewed By: ilia-cher

Differential Revision: D25141885

Pulled By: Chillee

fbshipit-source-id: 94365a5a3dd351652ea7337077cd0e71b6ffe203
2020-11-21 03:32:54 -08:00
a00ba63023 Disable old fuser internally (#48322)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48322

Disable old fuser internally. I would like to find where we are inadvertently setting the old fuser, but in the meantime I would like to land a diff that I know will 100% cause it not to be run, and verify that it fixes the issue.

Test Plan: sandcastle

Reviewed By: ZolotukhinM

Differential Revision: D25126202

fbshipit-source-id: 5a4d0742f5f829e536f50e7ede1256c94dd05232
2020-11-21 00:42:23 -08:00
636fa8fda8 [quant] Add backend_independent option for quantized linear module (#48192)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48192

This is to allow producing a backend-independent quantized module,
since some backends don't have packed weights for linear.

Test Plan:
test_quantized_module.py

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D25061645

fbshipit-source-id: a65535e53f35af4f2926af0ee330fdaae6dae996
2020-11-21 00:32:27 -08:00
fdc62c74a6 Add Kineto submodule (separate PR) (#48332)
Summary:
Separate PR to add the Kineto submodule; it mirrors the one I used
in my stack (45887).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48332

Reviewed By: gdankel

Differential Revision: D25139969

Pulled By: ilia-cher

fbshipit-source-id: b9ca2be5f15647655eeb4b2fbf4c82f84eee3dd8
2020-11-20 23:46:34 -08:00
6615edaf9a [Pytorch Mobile] Disable OutOfPlace calls for mobile (#48255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48255

There's an effort to move aten::native files to the app level. After this effort, the operators and their kernels in aten::native can be selectively built per app.

There are some direct references to at::native symbols in jit/runtime/static/ops.cpp. Those symbols go missing if their implementations are moved up to the app level on Android.

Files in the jit/runtime/static folder belong to the full JIT and should not be built for mobile. However, since Federated Learning still uses the full JIT in fb4a, the current solution is to exclude those files from the mobile torch_core target.

ghstack-source-id: 117123663

Test Plan: CI

Reviewed By: dhruvbird

Differential Revision: D24822690

fbshipit-source-id: c599b10f35e8d42bd4ca272da1a0cddf88ad7c37
2020-11-20 22:26:00 -08:00
16d089733b Enable creation and transfer of ScriptModule over RPC (#48293)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48293

This PR enables a `ScriptModule` to be created on a remote worker, retrieved to the current worker, and have its methods run remotely.

In order to do this, we define custom pickling for `ScriptModule` in our `InternalRPCPickler`. The pickling essentially uses torch.save and torch.load to save/recover the ScriptModule (see the sketch below).

We test that we can create and retrieve a ScriptModule with rpc_sync, and create an RRef to a ScriptModule with rpc.remote. We can also run remote methods on the RRef and transfer it to the current worker.

Although we can run methods remotely on the RRef to the ScriptModule, this does not currently work with the RRef helper; an issue has been filed to track that.
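
A simplified sketch of that save/load round trip (the real logic lives inside `InternalRPCPickler`; the helper names and the use of `torch.jit.save`/`torch.jit.load` here are assumptions for illustration):

```
import io
import torch

def _serialize_script_module(m: torch.jit.ScriptModule) -> bytes:
    buf = io.BytesIO()
    torch.jit.save(m, buf)
    return buf.getvalue()

def _deserialize_script_module(payload: bytes) -> torch.jit.ScriptModule:
    return torch.jit.load(io.BytesIO(payload))
```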
ghstack-source-id: 117275954

Test Plan: CI

Reviewed By: wanchaol

Differential Revision: D25107773

fbshipit-source-id: daadccf7bd25fe576110ee6e0dba6ed2bcd3e7f3
2020-11-20 22:15:54 -08:00
44def9ad71 [quant][fix] Fix quantization for qat.ConvBnReLU1d (#48059)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48059

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D25006388

fbshipit-source-id: ce911ce5e9c51966311cdf9e57dd6eceb357c74a
2020-11-20 21:02:02 -08:00
50e42b9092 Explicitly cast an implicit conversion from some macro defined type to a double (#48290)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48290

`scalar_t` here is expanded from nested macros to be an input value and
`upper_bound` is templated upon it. Whatever it gives back is unconditionally
cast to a `double` via the fact that it is always passed to
`binary_kernel_reduce_vec` which has a `double` as the fourth argument.

Change it here to be an explicit `static_cast<double>` to do what the compiler
was implicitly doing.

Test Plan: this is an error with -Werror in llvm11. This allows it to build

Reviewed By: ezyang

Differential Revision: D25111258

fbshipit-source-id: 6837afec52821f1f57b8c8f2df2d0eb3fc9b58bd
2020-11-20 19:21:22 -08:00
0a3db1d460 [FX] Prototype Conv/BN fuser in FX (#47657)
Summary:
Some interesting stuff is going on. All benchmarks are run with both my implementation and the current quantized fuser.

For these benchmarks, things like using MKLDNN/FBGEMM make a big difference.

## Manual compilation (everything turned off)
In the small case, things look good
```
non-fused:  1.174886703491211
fused:  0.7494957447052002
```

However, for `torchvision.resnet18`, we see
```
non-fused:  1.2272708415985107
fused:  3.7183213233947754
```

This is because Conv (no bias) -> Batch Norm is actually faster than Conv (bias) if you don't have any libraries...
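
For reference, a sketch of the standard eval-mode Conv -> BN weight-folding math that such a fuser applies (the function and argument names here are illustrative, not the PR's actual helper):

```
import torch

def fuse_conv_bn_weights(conv_w, conv_b, bn_rm, bn_rv, bn_eps, bn_w, bn_b):
    # Fold y = bn(conv(x)) into a single conv with rescaled weight and bias.
    scale = bn_w * torch.rsqrt(bn_rv + bn_eps)        # per output channel
    fused_w = conv_w * scale.reshape(-1, 1, 1, 1)
    fused_b = (conv_b - bn_rm) * scale + bn_b
    return fused_w, fused_b
```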

## Nightly (CPU)
```
Toy
non-fused:  0.45807552337646484
fused:  0.34779977798461914

resnet18
non-fused:  0.14216232299804688
fused:  0.13438796997070312

resnet50
non-fused:  0.2999534606933594
fused:  0.29364800453186035

densenet161
non-fused:  0.6558926105499268
fused:  0.6190280914306641

inception_v3
non-fused:  1.2804391384124756
fused:  1.181272029876709
```
with MKLDNN.

We see a small performance gain across the board, with more significant performance gains for smaller models.

## Nightly (CUDA)

```
M
non-fused:  1.2220964431762695
fused:  1.0833759307861328

resnet18
non-fused:  0.09721899032592773
fused:  0.09089207649230957

resnet50
non-fused:  0.2053072452545166
fused:  0.19138741493225098

densenet161
non-fused:  0.6830024719238281
fused:  0.660109281539917
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47657

Reviewed By: eellison

Differential Revision: D25127546

Pulled By: Chillee

fbshipit-source-id: ecdf682038def046045fcc09faf9aeb6c459b5e3
2020-11-20 18:51:32 -08:00
6d0947c8cf Revert D25093315: [pytorch][PR] Fix inf norm grad
Test Plan: revert-hammer

Differential Revision:
D25093315 (ca880d77b8)

Original commit changeset: be1a7af32fe8

fbshipit-source-id: b383ec2a2c5884149b4fc7896f9d2856259794cd
2020-11-20 18:27:52 -08:00
f8722825b5 Compare Weights FX Implementation (#48056)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48056

PyTorch FX Quantization API:  Compare weights
ghstack-source-id: 117255311

Test Plan:
buck test mode/dev caffe2/test:quantization -- 'test_remove_qconfig_observer_fx'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_dynamic_fx'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_static_fx'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_conv_static_fx'

Reviewed By: hx89

Differential Revision: D24940516

fbshipit-source-id: 301c1958c0e64ead9072e0fd002e4b21e8cb5b79
2020-11-20 17:17:19 -08:00
fefd56c4db Remove an accidental copy in a range-based for loop (#48234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48234

This was copying the iteration variable instead of taking a reference.
Fix the trivial error here.

Test Plan:
clang11 catches this and throws a warning that we promote with
-Werror. This change fixes the error.

Reviewed By: smeenai

Differential Revision: D24970929

fbshipit-source-id: 335a1b53276467987bc27fa41326803e01e70c01
2020-11-20 17:10:30 -08:00
286cdf3cda [static runtime] add static registry (#48258)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48258

This will enable closed source contributions

Test Plan: buck test mode/no-gpu //caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: hlu1

Differential Revision: D25031586

fbshipit-source-id: def859fa2fb4f01910b040242662a51b85804f01
2020-11-20 17:05:24 -08:00
0984d3123a [static runtime] add more _out variants (#48260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48260

supporting a couple more operators

Test Plan:
use Ansha's test framework for e2e test

```
numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --pred_net=/home/bwasti/adindexer/precomputation_merge_net.pb --c2_inputs=/home/bwasti/adindexer/c2_inputs_precomputation_bs1.pb --c2_weights=/home/bwasti/adindexer/c2_weights_precomputation.pb --scripted_model=/home/bwasti/adindexer/traced_precomputation_partial_dper_fixes.pt --pt_inputs=/home/bwasti/adindexer/container_precomputation_bs1.pt --iters=30000 --warmup_iters=10000 --num_threads=1 --pt_enable_static_runtime=true --pt_cleanup_activations=true --pt_enable_out_variant=true --eps 1e-2
```

Reviewed By: hlu1

Differential Revision: D24767322

fbshipit-source-id: dce7f9bc0427632129f263bad509f0f00a21ccf3
2020-11-20 17:05:21 -08:00
87bfb2ff08 Automatically infer the type of the iterator in a range-based for loop (#48232)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48232

The declared type was not the type actually produced by the begin()
function, so a new object was being created and a reference taken to that
temporary. Instead, just take a reference to whatever the range-based loop
generates.

Test Plan: This fixes a build error from a new warning in llvm11.

Reviewed By: smeenai

Differential Revision: D24970920

fbshipit-source-id: f125dca900f7550eee505b4f94781b6637533be0
2020-11-20 17:03:51 -08:00
8f1af0947c [iOS] Fix the fbios armv7 pika build
Summary: UBN task - T80034029

Test Plan:
1. pika-armv7: ` buck build //xplat/caffe2:aten_metalApple --flagfile 'fbsource//fbobjc/mode/ios.py#iphoneos-armv7,pika10,apple_toolchain'`
2. pika-arm64: ` buck build //xplat/caffe2:aten_metalApple --flagfile 'fbsource//fbobjc/mode/ios.py#iphoneos,pika10,apple_toolchain'`

Differential Revision: D25134207

fbshipit-source-id: 68b32dab1fc382ec23d7602e34bb64786cb38254
2020-11-20 16:31:37 -08:00
dc843fe197 Fix test_ldexp on Windows (#48335)
Summary:
Force `torch.randint` to generate tensor of int32 rather than tensor of int64
Delete unneeded copies

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48335

Reviewed By: ranman

Differential Revision: D25133312

Pulled By: malfet

fbshipit-source-id: 70bfcb6b7ff3bea611c4277e6634dc7473541288
2020-11-20 15:41:59 -08:00
7be30d1883 Move CUDA kernel check to c10 (#48277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48277

We move `TORCH_CUDA_KERNEL_LAUNCH_CHECK` from `//caffe2/aten/src/ATen/cuda/Exceptions.h` to `//caffe2/c10/cuda/CUDAException.h`.

The primary reason is for allowing us to use this MACRO in other subdirectories of //caffe2, not just in ATen. Refer to D24309971 (353e7f940f) for context.

An example of this use case is D24868557, where we add these checks to `//caffe2/caffe2/sgd`.

Also, this should not affect current files, because `Exceptions.h` includes `CUDAException.h`.

Test Plan:
```
buck build //caffe2/aten:ATen-cu
```
- https://fburl.com/buck/oq3rxbir

Also wait for sandcastle tests.

Reviewed By: ngimel

Differential Revision: D25101720

fbshipit-source-id: e2b05b39ff1413a21e64949e26ca24c8f7d0400f
2020-11-20 14:58:15 -08:00
8177f63c91 Reorganize and refine the Windows.h import in C++ files (#48009)
Summary:
This PR aims to reduce the import overhead and symbol noises from the `windows.h` headers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48009

Reviewed By: gchanan

Differential Revision: D25045840

Pulled By: ezyang

fbshipit-source-id: 01fda70f433ba2dd0cd2d7cd676ab6ffe9d98b90
2020-11-20 14:21:09 -08:00
6d5d336a63 Revert D25108971: [pytorch][PR] enable cuda11.1 and cudnn 8.0.5 in CI
Test Plan: revert-hammer

Differential Revision:
D25108971 (84d4e9c4fa)

Original commit changeset: d836690e1d5d

fbshipit-source-id: 555d1b8ee046d4263920cba8859b6d58e11fccd7
2020-11-20 12:01:24 -08:00
d1b8da75e6 [JIT] Metacompile boolean constants (#46721)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46703

Previously, we would compile one side of an if-statement if the condition was a type-based expression we could statically resolve. I think it's reasonable to extend this metacompilation to booleans that are constant at compile time. There have been some instances where I've recommended unintuitive workarounds due to not having this behavior (see the sketch below).

This is also possibly needed if we add boolean literals to schema declarations, a feature that might be needed to clean up our `boolean_dispatch` mechanism.
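
A minimal sketch of the behavior this enables, assuming a module-level Python bool that scripting folds to a compile-time constant:

```
import torch

USE_BIAS = False  # a Python bool, resolved as a constant when scripting

@torch.jit.script
def linear(x: torch.Tensor, w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # With this change, only the taken branch is compiled, just like
    # statically resolvable type-based checks.
    if USE_BIAS:
        return x @ w + b
    return x @ w
```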

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46721

Reviewed By: ppwwyyxx

Differential Revision: D25008862

Pulled By: eellison

fbshipit-source-id: 5bc60a18f1021c010cb6abbeb5399c669fe04312
2020-11-20 11:17:15 -08:00
6eaf1e358c caffe2/core.Net: is_external_input rebuild lookup tables when necessary
Summary: is_external_input doesn't check whether the lookup tables are valid. Calling .Proto() should invalidate all lookup tables and have them rebuilt on calls to any method depending on them. This adds that check to is_external_input.

Test Plan: internal unit tests

Reviewed By: dzhulgakov, esqu1

Differential Revision: D25100464

fbshipit-source-id: d792dec7e5aa9ffeafda88350e05cb757f4c4831
2020-11-20 10:53:24 -08:00
ca880d77b8 Fix inf norm grad (#48122)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41779

Also fixes an issue with inf norm returning small non-zero values due to usage of `std::numeric_limits::min`, which actually "returns the minimum positive normalized value" when applied to floating-point types. See https://en.cppreference.com/w/cpp/types/numeric_limits/min.

```
>>> import torch
>>> with torch.enable_grad():
...     a = torch.tensor([
...         [9., 2., 9.],
...         [-2., -3., -4.],
...         [7., 8., -9.],
...     ], requires_grad=True)
...     b = torch.norm(a, p=float('inf'))
...     b.backward()
...     print(a.grad)
...
tensor([[ 0.3333,  0.0000,  0.3333],
        [-0.0000, -0.0000, -0.0000],
        [ 0.0000,  0.0000, -0.3333]])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48122

Reviewed By: izdeby

Differential Revision: D25093315

Pulled By: soulitzer

fbshipit-source-id: be1a7af32fe8bac0df877971fd75089d33e4bd43
2020-11-20 10:22:11 -08:00
63b04dc11d Update index.rst (#47282)
Summary:
Updating master to match changes we made to 1.7.

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47282

Reviewed By: zhangguanheng66

Differential Revision: D24727322

Pulled By: brianjo

fbshipit-source-id: 64e3f06eb32c965390f282b81084460903d872a2
2020-11-20 08:52:00 -08:00
68a50a7152 Replace GatherRangesToDense operator in Dper from c2 to pt.
Summary: Replace `GatherRangesToDense` operator in Dper from c2 to pt.

Test Plan:
```
buck test //caffe2/torch/fb/sparsenn:test mode/dev-sand -c fbcode.nvcc_arch=v100 -c fbcode.enable_nccl_a2a=1
```

```
Started reporting to test run: https://our.intern.facebook.com/intern/testinfra/testrun/3659174735981484
    ✓ ListingSuccess: caffe2/torch/fb/sparsenn:test - main (22.179)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_all_dropout_empty_input (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (27.738)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_batch_one_hot_lengths (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (27.764)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_gather_ranges (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (27.787)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_lengths_to_offsets (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (27.804)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_clip_chunks (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (27.806)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_offsets_to_ranges_empty_batch (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (27.947)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_multiple_runs (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (28.008)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_batch_one_hot (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (28.036)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_sort_id_score_list_by_score (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (28.080)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_gather_ranges_to_dense_caffe2_without_key (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (28.119)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_offsets_range (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (28.147)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_gather_ranges_to_dense_caffe2 (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (28.179)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_lengths_range (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (28.241)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_transform (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (28.252)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_all_dropout (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (28.265)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_bucketize (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (28.274)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_batch_box_cox (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (28.305)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_sigrid_hash_op (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (28.314)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_cumsum (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (28.314)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_ranges (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (28.393)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_rowwise_prune_op_32bit_indices (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (28.411)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_no_dropout (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (28.520)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_tracing (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (28.945)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_scale_gradient_backward (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (33.231)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_offsets_to_ranges (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (19.864)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_create (caffe2.torch.fb.sparsenn.tests.sigrid_transforms_test.SigridTransformsOpsTest) (19.634)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_prior_correction_calibration_accumulate (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (21.113)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_scale_gradient (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (21.204)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_offsets_to_lengths (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (21.533)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_offsets_to_lengths_empty_batch (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (21.487)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_clip_ranges (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (21.807)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_gather_ranges_to_dense_without_max_mismatched_ratio (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (21.576)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_rowwise_prune_op_64bit_indices (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (22.209)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_embedding_bag_4bit_rowwise_sparse (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (22.072)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_prior_correction_calibration_prediction (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (24.934)
Summary
  Pass: 35
  ListingSuccess: 1
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/3659174735981484
```

```
buck build mode/opt //caffe2/benchmarks/operator_benchmark/fb/pt:gather_ranges_to_dense_benchmark_test
aibench-cli adhoc -c 'buck run //caffe2/benchmarks/operator_benchmark/fb/pt:gather_ranges_to_dense_benchmark_test'
```

```
# Benchmarking PyTorch: gather_ranges_to_dense
# Mode: Eager
# Name: gather_ranges_to_dense_batch_size13_max_lengths14_opcaffe2_gather_ranges_to_dense
# Input: batch_size: 13, max_lengths: 14, op: caffe2_gather_ranges_to_dense
Forward Execution Time (us) : 10.428

# Benchmarking PyTorch: gather_ranges_to_dense
# Mode: Eager
# Name: gather_ranges_to_dense_batch_size13_max_lengths14_optorch_gather_ranges_to_dense
# Input: batch_size: 13, max_lengths: 14, op: torch_gather_ranges_to_dense
Forward Execution Time (us) : 8.986
```

Reviewed By: dzhulgakov

Differential Revision: D24831789

fbshipit-source-id: 110edc86335ae357da435babf87da1a3e537c631
2020-11-20 08:14:32 -08:00
55d5b27343 Refactor request_callback_no_python.cpp processRpc function (#47816)
Summary:
Addresses step 1 of https://github.com/pytorch/pytorch/issues/46564

Took processing logic for each case in request_callback_no_python.cpp and put it in a dedicated function.

cc: izdeby

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47816

Reviewed By: izdeby

Differential Revision: D25090207

Pulled By: H-Huang

fbshipit-source-id: bfa38e38db02e077d859125739aaede90ba492e7
2020-11-20 07:29:51 -08:00
562d4c3bc5 Add basic ldexp operator for numpy compatibility (#45370)
Summary:
Adds ldexp operator for https://github.com/pytorch/pytorch/issues/38349

I'm not entirely sure the changes to `NamedRegistrations.cpp` were needed, but I saw other operators in there, so I added ldexp as well.

Normally the ldexp operator is used along with frexp to construct and deconstruct floating-point values. This is useful for performing operations on either the mantissa or the exponent portion of floating-point values (see the example below).

Sleef, std math.h, and CUDA support both ldexp and frexp, but not for all data types. I wasn't able to figure out how to get the iterators to play nicely with a vectorized kernel, so I have left this with just the normal CPU kernel for now.

This is the first operator I'm adding so please review with an eye for errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45370

Reviewed By: mruberry

Differential Revision: D24333516

Pulled By: ranman

fbshipit-source-id: 2df78088f00aa9789aae1124eda399771e120d3f
2020-11-20 04:09:39 -08:00
ec256ab2f2 implement torch.addr using TensorIterator based kernels (#47664)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47313

This PR implements the `torch.addr` function using `TensorIterator` with `cpu_kernel_vec` and `gpu_kernel`.
It reduces memory usage, improves performance, and fixes a bug when `beta` or `alpha` is a complex number.
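
As a usage sketch (shapes and values are illustrative), `addr` computes `beta * M + alpha * outer(vec1, vec2)`:

```py
import torch

M = torch.zeros(2, 3)
v1 = torch.tensor([1., 2.])
v2 = torch.tensor([1., 2., 3.])
# with beta=1 and M zero, this is just the outer product of v1 and v2
print(torch.addr(M, v1, v2))
# tensor([[1., 2., 3.],
#         [2., 4., 6.]])
```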

Todo
- [x] benchmarking `torch.addr` for the change of this PR, as well as the legacy TH implementation used in PyTorch 1.6.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47664

Reviewed By: zhangguanheng66

Differential Revision: D25059693

Pulled By: ngimel

fbshipit-source-id: 20a90824aa4cb2240e81a9f17a9e2f16ae6e3437
2020-11-20 00:21:49 -08:00
eb49dabe92 [TensorExpr] Add even more operator tests. (#48292)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48292

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D25113397

Pulled By: ZolotukhinM

fbshipit-source-id: a8591006e1fb71b87d50c8a150739a9bca835928
2020-11-19 23:35:19 -08:00
efd41db32c [TensorExpr] Add more operator tests. (#48282)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48282

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D25108184

Pulled By: ZolotukhinM

fbshipit-source-id: ba8cdf6253533210a92348f475b8b9400d8ecb1a
2020-11-19 23:29:11 -08:00
56129bdea2 remove having no deadline for the test (#48226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48226

this test is timing out, I am removing the deadline argument to see if
things improve

Test Plan:
ran locally
https://fburl.com/p41ocvrs

Reviewed By: venkatacrc

Differential Revision: D25067867

fbshipit-source-id: 80065553e0bd9883ea80e70a6748de1012e0d4e3
2020-11-19 23:10:37 -08:00
de284b6d35 [pytorch][codegen] add autograd data model (#48249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48249

Introduced autograd related data models at tools.codegen.api.autograd.

Migrated load_derivatives.py to produce the new data models from derivatives.yaml.
It has clean mypy-strict result.

Changed both gen_autograd_functions.py and gen_variable_type.py to consume
the new data model.

Added type annotations to gen_autograd_functions.py - it has clean mypy-strict
result except for the .gen_autograd import (so haven't added it to the strict
config in this PR).

To limit the scope of the PR, gen_variable_type.py is not refactored, and the
main structure of load_derivatives.py / gen_autograd_functions.py is kept. We
only make necessary changes to make it work.

Confirmed byte-for-byte compatible with the old codegen:

```
Run it before and after this PR:
  .jenkins/pytorch/codegen-test.sh <baseline_output_dir>
  .jenkins/pytorch/codegen-test.sh <test_output_dir>

Then run diff to compare the generated files:
  diff -Naur <baseline_output_dir> <test_output_dir>
```

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25086561

Pulled By: ljk53

fbshipit-source-id: 1f43ab0931d9814c24683b9a48ca497c5fc3d729
2020-11-19 21:47:05 -08:00
fa41275899 [Pytorch] Weaker memory ordering for c10::intrusive_ptr (#48221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48221

load-acquire, acquire-release increment and decrement. (We
need acquire-release increment to make unique() and use_count()
reliable.) Note that this doesn't make a difference on x86, but we
should expect it to improve things on ARM and ARM64.
ghstack-source-id: 117065956

Test Plan: Careful review :)

Reviewed By: ezyang

Differential Revision: D24708209

fbshipit-source-id: 5e574115eee5c0a65047b638c5f9b1ec0124d04d
2020-11-19 20:59:30 -08:00
d6b374956f [JIT] Resolve torch.device in recursive compilation of classes (#47734)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47734

**Summary**
This commit allows `torch.device` to be resolved properly when used in
class types that are recursively scripted. This is accomplished by augmenting
the resolution callback used during recursively class scripting to include
the type annotations used on class method declarations.

Classes that are not explicitly annotated with `torch.jit.script` are
implicitly scripted during the compilation of a function or class method
that uses them. One key difference between this method of class type
compilation and explicit scripting is that the former uses a resolution callback
that can only resolve variables that class methods close over (see
`_jit_internal.createResolutionCallbackForClassMethods`). This does
not include type annotations and default arguments. This means that
builtin types like `torch.Tensor` and `torch.device` cannot be resolved
using the resolution callback. This issue does not arise when explicitly
scripting classes because the resolution callback for that code path is
constructed from scope of the class definition
(see `_jit_internal.createResolutionCallbackFromFrame`). `torch.Tensor`
and `torch.device` are almost always present in that scope, usually from
`import`ing `torch`.
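
A minimal sketch of the shape of code this fixes (hypothetical class, assuming post-fix behavior): the undecorated class is implicitly scripted when the function is compiled, and its `torch.device` annotation now resolves.

```py
import torch

class Holder:  # not decorated; implicitly scripted when `fn` compiles
    def __init__(self, d: torch.device):
        self.d = d

@torch.jit.script
def fn(t: torch.Tensor) -> torch.device:
    return Holder(t.device).d

print(fn(torch.zeros(1)))  # device(type='cpu')
```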

**Test Plan**
This commit adds a new unit test to `TestClassType`,
`test_recursive_script_builtin_type_resolution`.

**Fixes**
This commit closes #47405.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D24995374

Pulled By: SplitInfinity

fbshipit-source-id: db68212634cacf81cfaeda8095a1fe5105fa73b7
2020-11-19 20:40:09 -08:00
28580d3c0f Add TorchBind-based Python and TorchScript binding for ProcessGroup (#47907)
Summary:
Add TorchBind-binding for ProcessGroup class.

Currently there are a few limitations of TorchBind that prevent us from fully matching the existing PyBind binding of ProcessGroup:

- TorchBind doesn't support method overloading. The current PyBind binding uses overloading extensively to provide a flexible API, but TorchBind (and the TorchScript ClassType behind it) doesn't yet support it. Therefore, we can provide at most one version of each API under a given name.

- TorchBind doesn't support C++ enums yet. This prevents us from making real use of XXXOptions, which are widely used in many APIs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47907

Reviewed By: wanchaol

Differential Revision: D24945814

Pulled By: gmagogsfm

fbshipit-source-id: e103d448849ea838c10414068c3e4795db91ab1c
2020-11-19 20:25:56 -08:00
7828a22094 fix a bug in leakyReLU (#48265)
Summary:
The scale variable needs to be a scalar, otherwise it will report the following error: "RuntimeError: Cannot input a tensor of dimension other than 0 as a scalar argument"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48265

Test Plan: Tested locally and the error disappeared.

Reviewed By: zhizhengwu

Differential Revision: D25105423

Pulled By: jerryzh168

fbshipit-source-id: 2a0df24cf7e40278a950bffe6e0a9552f99da1d1
2020-11-19 20:15:05 -08:00
998c4cac9a [FX] Add Node.all_input_nodes (#48270)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48270

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D25100241

Pulled By: jamesr66a

fbshipit-source-id: f742f5a13debebb5be37f7c0045c121f6eaff1d5
2020-11-19 19:53:28 -08:00
aa8aa30a0b third_party: Update pybind to point to fork (#48117)
Summary:
There are specific patches we need for Python 3.9 compatibility, and that
process is currently hung up on separate issues.

Let's update to a newer version of our forked pybind to grab the Python
3.9 fixes while we wait for them to be upstreamed

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48117

Relates to: https://github.com/pybind/pybind11/pull/2657

Full comparison for this update looks like this: 59a2ac2745...seemethere:v2.6-fb

Fixes https://github.com/pytorch/pytorch/issues/47776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48120

Reviewed By: gchanan

Differential Revision: D25030688

Pulled By: seemethere

fbshipit-source-id: 10889c813aeaa70ef1298adad5c631e6b5a39d72
2020-11-19 19:30:09 -08:00
84d4e9c4fa enable cuda11.1 and cudnn 8.0.5 in CI (#48242)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48242

Reviewed By: walterddr

Differential Revision: D25108971

Pulled By: malfet

fbshipit-source-id: d836690e1d5d33c3395a44a86994a0a4bb381628
2020-11-19 19:27:36 -08:00
1a6666c967 [Gradient Compression] Add a comment on _orthogonalize. (#48253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48253

Explained why a hand-crafted orthogonalize function is used instead of `torch.qr`.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117132622

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D25088607

fbshipit-source-id: ebc228afcb4737bb8529e7143ea170086730520e
2020-11-19 19:22:04 -08:00
3c936ecd3c Revert D25056091: migrate export_caffe2_op_to_c10.h macros to the new dispatcher registration API
Test Plan: revert-hammer

Differential Revision:
D25056091 (0ea4982cf3)

Original commit changeset: 0f647ab9bc5e

fbshipit-source-id: e54047b91d82df25460ee00482373c4580f94d50
2020-11-19 19:10:14 -08:00
0ea4982cf3 migrate export_caffe2_op_to_c10.h macros to the new dispatcher registration API (#48097)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48097

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25056091

Pulled By: bdhirsh

fbshipit-source-id: 0f647ab9bc5e5aee497dac058df492f6e742cfe9
2020-11-19 17:56:56 -08:00
4b56aef05d add kl_based_partition (#48197)
Summary:
This is a partition search based on the Kernighan-Lin algorithm. First, the graph is partitioned using size_based_partition; then nodes from different partitions are swapped until the cost reaches a minimum (see the sketch below).
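
For intuition, a self-contained toy sketch of the KL-style refinement loop (illustrative only; the real partitioner swaps FX graph nodes under a latency/size cost, not integers under an imbalance cost):

```py
def kl_refine(p0, p1):
    # toy cost: the weight imbalance between the two partitions
    def cost():
        return abs(sum(p0) - sum(p1))

    improved = True
    while improved:
        improved = False
        for i in range(len(p0)):
            for j in range(len(p1)):
                before = cost()
                p0[i], p1[j] = p1[j], p0[i]  # tentative swap
                if cost() < before:
                    improved = True          # keep the swap
                else:
                    p0[i], p1[j] = p1[j], p0[i]  # undo

a, b = [8, 7, 6], [1, 2, 3]
kl_refine(a, b)
print(a, b)  # totals converge: [1, 7, 6] [8, 2, 3]
```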

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48197

Reviewed By: gcatron

Differential Revision: D25097065

Pulled By: scottxu0730

fbshipit-source-id: 3a11286bf4e5a712ab2848b92d0b98cd3d6a89be
2020-11-19 17:38:25 -08:00
c0723a0abf Add MessageTypeFlags enum for RPC Messages (#48143)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/47145

Adds a new MessageTypeFlags enum so that checking for certain properties (e.g. isResponse, isRequest) can be done with a BITWISE AND instead of checking for each MessageType enum individually.
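
A language-agnostic sketch of the bitmask idea (hypothetical names and values, not the actual C++ enum):

```py
# Hypothetical flag bits; the real MessageTypeFlags values live in C++.
REQUEST_TYPE = 0x100
RESPONSE_TYPE = 0x200

SCRIPT_CALL = REQUEST_TYPE | 0x01   # a request message type
SCRIPT_RET = RESPONSE_TYPE | 0x01   # a response message type

def is_request(message_type: int) -> bool:
    # one bitwise AND replaces checking every request enum value individually
    return bool(message_type & REQUEST_TYPE)

assert is_request(SCRIPT_CALL) and not is_request(SCRIPT_RET)
```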

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48143

Reviewed By: mrshenli

Differential Revision: D25091008

Pulled By: H-Huang

fbshipit-source-id: 56a823747748633c1ef3fa07817ca0f08c7399a8
2020-11-19 15:51:31 -08:00
feb6487acf Dont skip NCCL backend when testing all_reduce_cuda (#48231)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48231

Noticed that these tests were being skipped with the NCCL backend, but
there doesn't appear to be a valid reason to do so. Enabled these tests and verified
that they pass with 500 stress runs.
ghstack-source-id: 117085209

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D25079030

fbshipit-source-id: 8204288ffbd387375a1a86fe8c07243cfd855549
2020-11-19 15:26:57 -08:00
685cd9686f Refactor CuFFTConfig to not use tensor objects (#46909)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46909

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25083884

Pulled By: mruberry

fbshipit-source-id: 15f8ec1da1a457811cf118a3adf2941c4b0a6a37
2020-11-19 14:31:51 -08:00
2039ff3fbb [Caffe2] Optimize MishOp on CPU (#48212)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48212

Optimize MishOp on CPU

Test Plan: buck test mode/dev-nosan //caffe2/caffe2/python/operator_test:activation_ops_test -- "mish"

Reviewed By: houseroad

Differential Revision: D25071304

fbshipit-source-id: fe94bfab512188d60412d66962983eff4f37bc07
2020-11-19 14:17:27 -08:00
f98ab18445 [pytorch][codegen] move is_abstract property to NativeFunction model (#48252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48252

Moved to a shared place so that gen_variable_type.py can reuse it.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25087808

Pulled By: ljk53

fbshipit-source-id: 1f32e506956fc4eb08734cfde0add47b3e666bd9
2020-11-19 12:30:13 -08:00
9b19880c43 Fix collect_env.py with older version of PyTorch (#48076)
Summary:
Inspired by https://github.com/pytorch/pytorch/issues/47993, this fixes the import error in `collect_env.py` with older versions of PyTorch, where `torch.version` does not have the `hip` property.
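
A hedged sketch of the guard pattern such a fix typically uses (`getattr` with a default, so older builds without `torch.version.hip` don't raise; not necessarily the exact diff):

```py
import torch

# returns None instead of raising AttributeError on older PyTorch builds
hip_version = getattr(torch.version, 'hip', None)
print(hip_version)
```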

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48076

Reviewed By: seemethere, xuzhao9

Differential Revision: D25024352

Pulled By: samestep

fbshipit-source-id: 7dff9d2ab80b0bd25f9ca035d8660f38419cdeca
2020-11-19 12:18:08 -08:00
343b3e5cae Added linalg.tensorinv (#45969)
Summary:
This PR adds `torch.linalg.tensorinv` for NumPy compatibility.
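
As a usage sketch (mirroring the NumPy semantics; shapes are illustrative): `tensorinv` inverts a tensor treated as a matrix whose rows come from the first `ind` dimensions and whose columns come from the rest.

```py
import torch

a = torch.eye(24).reshape(4, 6, 8, 3)    # 4*6 == 8*3 == 24
ainv = torch.linalg.tensorinv(a, ind=2)  # invert w.r.t. the first 2 dims
print(ainv.shape)                        # torch.Size([8, 3, 4, 6])
```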

Ref https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45969

Reviewed By: zhangguanheng66

Differential Revision: D25060568

Pulled By: mruberry

fbshipit-source-id: 3b145ce64e4bd5021bc229f5ffdd791c572673a0
2020-11-19 11:54:50 -08:00
678fe9f077 Add blas compare example (#47058)
Summary:
Adds a standalone script which can be used to test different BLAS libraries. Right now I've deliberately kept it limited (only a couple BLAS libs and only GEMM and GEMV). It's easy enough to expand later.

CC ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47058

Reviewed By: zhangguanheng66

Differential Revision: D25078946

Pulled By: robieta

fbshipit-source-id: b5f7f7ec289d59c16c5370b7a6636c10a496b3ac
2020-11-19 11:27:27 -08:00
008f840e7a Implement in-place method torch.cumsum_ and torch.cumprod_ (#47651)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47193
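
A minimal usage sketch of the in-place variants this PR adds (values illustrative):

```py
import torch

x = torch.tensor([1., 2., 3.])
x.cumsum_(dim=0)   # in-place cumulative sum
print(x)           # tensor([1., 3., 6.])
x.cumprod_(dim=0)  # in-place cumulative product, applied to the new x
print(x)           # tensor([ 1.,  3., 18.])
```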

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47651

Reviewed By: zou3519

Differential Revision: D24992438

Pulled By: ezyang

fbshipit-source-id: c38bea55f4af1fc92be780eaa8e1d462316e6192
2020-11-19 11:20:12 -08:00
fe6bb2d287 [PyTorch] Declare the instantiation of PackedConvWeightsQnnp<2>::prepack (#48256)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48256

`PackedConvWeightsQnnp<2>::prepack` is referenced by both `quantized::conv_prepack` and fbgemm.cpp. Since `quantized::conv_prepack` is in the same compilation unit as the class template, it was fine. However, if we make operator registration selective, the reference from `quantized::conv_prepack` is gone. The reference from fbgemm.cpp is in another compilation unit, and there is a link error.

To avoid the link error, instantiate the symbol in the cpp file. It should also work to move all implementations to the .h file, but to keep the existing code structure and to avoid a (small chance of) code bloat, the implementations are kept as is.
ghstack-source-id: 117123564

Test Plan:
CI
buck build //fbandroid/apps/oculus/assistant:assistant_arm64_debug

Reviewed By: dhruvbird

Differential Revision: D24941989

fbshipit-source-id: adc96d0e55c89529fb71a43352aa68a1088a62a2
2020-11-19 10:54:58 -08:00
1dd4f4334c docker: Make CUDA_VERSION configurable (#48199)
Summary:
makes CUDA_VERSION configurable for the docker images:

make CUDA_VERSION=10.2 CUDNN_VERSION=7 official-image

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48199

Reviewed By: xuzhao9, janeyx99

Differential Revision: D25064256

Pulled By: seemethere

fbshipit-source-id: 25f52185097be647d11b5324f9f97cd41cdad75b
2020-11-19 10:06:45 -08:00
a7153a89a5 Exclude docs/cpp/src from flake8 (#48201)
Summary:
Currently when I run `flake8` locally I get [a bunch of extraneous warnings](https://pastebin.com/DMQevCtC) because the docs build puts a `pytorch-sphinx-theme` dir into `docs/cpp/src`. Those warnings don't show up in CI because the CI lint job doesn't generate that dir. This PR adds that to the Flake8 `exclude` list, similar to how `docs/src` is already present in that list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48201

Reviewed By: walterddr, zhangguanheng66

Differential Revision: D25069130

Pulled By: samestep

fbshipit-source-id: 2fda9e813f54092398525b7fc97d0a8f7f835ca6
2020-11-19 10:00:14 -08:00
975ff6624b DOC: backport doc build fix from 1.7, tweak link (#47349)
Summary:
xref gh-46927 to the 1.7 release branch

This backports a fix to the script to push docs to pytorch/pytorch.github.io. Specifically, it pushes to the correct directory when a tag is created here. This issue became apparent in the 1.7 release cycle and should be backported to here.

Along the way, fix the canonical link to the pytorch/audio documentation now that they use subdirectories for the versions, xref pytorch/audio#992. This saves a redirect.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47349

Reviewed By: zhangguanheng66

Differential Revision: D25073752

Pulled By: seemethere

fbshipit-source-id: c778c94a05f1c3e916217bb184f69107e7d2c098
2020-11-19 09:51:18 -08:00
c542614e53 Implement C++ ModuleDict (#47707)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47707

Fixes #45896

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24872641

Pulled By: ejguan

fbshipit-source-id: 3d1dc9148ba3bcf66ab9c44ddb5774060bbc365d
2020-11-19 08:07:51 -08:00
c4a6df989c Pass any verbosity from test/run_test.py to pytest (#48204)
Summary:
Previously it was only possible to pass up to one [verbosity level](https://adamj.eu/tech/2019/10/03/my-most-used-pytest-commandline-flags/) to `pytest` when running a test via `test/run_test.py`. Presumably that behavior was never added because `unittest` [doesn't do anything extra](https://stackoverflow.com/a/1322648/5044950) when given more than one `--verbose` flag. This PR removes that limitation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48204

Test Plan:
Make a dummy `pytest`-style file `test/test_foo.py`:
```py
def test_bar():
    assert 'hello\n' * 10 == 'hello\n' * 20
```
Then add `'test_foo'` to both `TESTS` and `USE_PYTEST_LIST` in `test/run_test.py`, and run this command:
```sh
test/run_test.py -vvi test_foo
```

Reviewed By: walterddr

Differential Revision: D25069147

Pulled By: samestep

fbshipit-source-id: 2765ee78d18cc84ea0e262520838993f9e9ee04f
2020-11-19 08:06:26 -08:00
370310bedb batched grad for binary_cross_entropy, symeig (#48057)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48057

This PR fixes batched grad computation for:
- binary_cross_entropy (i.e., vmap through binary_cross_entropy_double_backward)
- symeig (i.e. vmap through symeig_backward)

It was previously impossible to vmap through those functions because
they use in-place operations in a vmap-incompatible way.

See note at
233192be73/aten/src/ATen/BatchedFallback.cpp (L117-L122)
for what it means for an in-place operation to be vmap-incompatible.

This PR adds a check: if the in-place operations in e.g. symeig are
vmap-incompatible and we are inside of a vmap, then we do the
out-of-place variant of the operation. Ditto for binary_cross_entropy.

This is to avoid code duplication: the alternative would be to register
the backward formula as an operator and change just those lines to be
out-of-place!

This PR also adds some general guidelines for what to do if an in-place
operation is vmap-incompatible.

General guidelines
------------------

If an in-place operation used in a backward formula is vmap-incompatible,
then as developers we have the following options:

- If the in-place operation directly followed the creation of a tensor with
  a factory function like at::zeros(...), we should replace the factory with a
  corresponding grad.new_zeros(...) call. The grad.new_zeros(...) call
  propagates the batch dims to the resulting tensor.
  For example:
    Before: at::zeros(input.sizes(), grad.options()).copy_(grad)
    After:  grad.new_zeros(input.sizes()).copy_(grad)

- If the in-place operation followed some sequence of operations, and we
  want to be able to vmap over the backward formula as-is (this is
  usually the case for simple (<15 LOC) backward formulas), then use
  inplace_is_vmap_compatible to guard the operation. For example:
            c = a * b
    Before: c.mul_(grad)
    After:  c = inplace_is_vmap_compatible(c, grad) ? c.mul_(grad) : c * grad

- If we don't want to vmap directly over the backward formula (e.g., if the
  backward formula is too complicated or has a lot of vmap-incompatible
  operations), then register the backward formula as an operator and eventually
  write a batching rule for it.

Test Plan
---------
New tests

Test Plan: Imported from OSS

Reviewed By: zhangguanheng66

Differential Revision: D25069525

Pulled By: zou3519

fbshipit-source-id: e0dfeb5a812f35b7579fc6ecf7252bf31ce0d790
2020-11-19 07:59:02 -08:00
db767b7862 Add c10d new frontend to build (#48146)
Summary:
Stack from [ghstack](https://github.com/ezyang/ghstack):
* https://github.com/pytorch/pytorch/issues/48148 Add TorchBind-based Python and TorchScript binding for ProcessGroup
* https://github.com/pytorch/pytorch/issues/48147 Add process group creation logic in c10d new frontend
* **https://github.com/pytorch/pytorch/issues/48146 Add c10d new frontend to build**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48146

Reviewed By: wanchaol

Differential Revision: D25073969

Pulled By: gmagogsfm

fbshipit-source-id: d111649144a4de9f380e5f7a2ad936860de4bd7b
2020-11-19 04:47:02 -08:00
daff3a81a1 [Gradient Compression] PowerSGD comm hook (#48060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48060

Implement a PowerSGD variant that applies to a batched flattened tensor with zero paddings.

This version does not require handling 1D tensors and multi-dimensional tensors in the input separately, and hence it does not need to create two asynchronous future chains.

Potential optimizations:
1) Consider FP16 compression throughout PowerSGD.
2) Warm start and save one matrix multiplication per iteration.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117105938

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: jiayisuse

Differential Revision: D24843692

fbshipit-source-id: f44200b1fd6e12e829fc543d21ab7ae086769561
2020-11-19 02:59:11 -08:00
0d8ddb5ec2 Make softmax and log_softmax handle negative dims, add tests (#48156)
Summary:
Make softmax and log_softmax handle negative dims, add tests
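
For context, a small sketch of the negative-dim semantics being handled (a negative `dim` counts from the last dimension):

```py
import torch

x = torch.randn(2, 3)
# dim=-1 addresses the last dimension, i.e. dim=1 for a 2-D tensor
assert torch.allclose(torch.softmax(x, dim=-1), torch.softmax(x, dim=1))
assert torch.allclose(torch.log_softmax(x, dim=-1), torch.log_softmax(x, dim=1))
```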

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48156

Reviewed By: bertmaher

Differential Revision: D25059788

Pulled By: Krovatkin

fbshipit-source-id: 985963e7df400857c9774660c76be7d56201a1ad
2020-11-19 01:38:14 -08:00
46d846f5bb T78750158 Support varying size input in numeric suite at 10/30/2020, 3:55:01 PM (#47391)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47391

The current Numeric Suite will fail if it is collecting for multiple inputs that are not all the same size. This fix adds support for varying-size inputs in Numeric Suite.
ghstack-source-id: 117058862

Test Plan:
buck test mode/dev caffe2/test:quantization -- 'test_shadow_logger'
buck test mode/dev caffe2/test:quantization  -- 'test_output_logger'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_submodule_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_functional_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_functional_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_dynami

Reviewed By: hx89

Differential Revision: D24662271

fbshipit-source-id: 6908169ee448cbb8f33beedbd26104633632896a
2020-11-18 23:57:41 -08:00
8819bad86c Implement igammac (3rd PR) (#48171)
Summary:
Related: https://github.com/pytorch/pytorch/issues/46183 (torch.igamma)
This is the regularized upper incomplete gamma function.
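
As a quick numeric sketch: `igammac` is the regularized upper incomplete gamma function, so it complements `igamma` (the lower one) to 1.

```py
import torch

a = torch.tensor([1.0, 2.0])
x = torch.tensor([0.5, 1.5])
# regularized lower + regularized upper incomplete gamma == 1
total = torch.igamma(a, x) + torch.igammac(a, x)
print(torch.allclose(total, torch.ones(2)))  # True
```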

This is supposed to be exactly the same as https://github.com/pytorch/pytorch/issues/47463, but after rebasing the `viable/strict` branch.

cc: mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48171

Reviewed By: zhangguanheng66

Differential Revision: D25060107

Pulled By: mruberry

fbshipit-source-id: 89780dea21dbb2141cbc4f7f18192cb78a769b17
2020-11-18 23:44:32 -08:00
c5dae335e4 [PT][StaticRuntime] Move prim op impl to ops.cpp (#48210)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48210

- Move prim op implementation from `ProcessedNode::run` to `getNativeOperation`
- Add out variant for `prim::listConstruct`

Test Plan:
```
buck test //caffe2/test:static_runtime
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/caffe2/fb/predictor:pytorch_predictor_test

buck run mode/dev //caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench -- \
--scripted_model=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/traced_precomputation.pt \
--pt_inputs=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/container_precomputation_bs1.pt \
--iters=1 --warmup_iters=1 --num_threads=1 --pt_enable_static_runtime=true \
--pt_cleanup_activations=true --pt_enable_out_variant=true
```

Reviewed By: ajyu

Differential Revision: D24748947

fbshipit-source-id: 12caeeae87b69e60505a6cea31786bd96f5c8684
2020-11-18 23:07:39 -08:00
6da26fe79b [te] Fix pow (#48213)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48213

It was completely broken unless the RHS was a constant.

Test Plan: new unit test in test_jit_fuser_te.py

Reviewed By: eellison

Differential Revision: D25071639

fbshipit-source-id: ef1010a9fd551db646b83adfaa961648a5c388ae
2020-11-18 22:44:16 -08:00
ed57f804fa [quant][refactor] Move some util functions from torch/quantization/fx/utils.py to torch/quantization/utils.py (#48107)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48107

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D25026495

fbshipit-source-id: 3634b6b95a18670232600874b1e593180ea9f44c
2020-11-18 22:32:19 -08:00
4316bf98f5 [FX] Refactor unique name handling (#48205)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48205

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D25068934

Pulled By: jamesr66a

fbshipit-source-id: 04e02bbfd2cc9a8c3b963d9afdf40bac065c319b
2020-11-18 21:56:52 -08:00
bef460a803 [PyTorch] Return raw ptr from ThreadLocalDebugInfo::get() (#47796)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47796

`ThreadLocalDebugInfo::get()` is a hot function. For example, it is called by `DefaultCPUAllocator::allocate()`. Most callers do not even bother to keep the returned `shared_ptr` around, proving that they have no lifetime issues currently. For the rest, it appears that the only way that the returned pointer could become invalid is if they then called a function that swapped out `ThreadLocalDebugInfo` using `ThreadLocalStateGuard`. There are very few such paths, and it doesn't look like any current callers of `ThreadLocalDebugInfo::get()` needed a `shared_ptr` at all.
ghstack-source-id: 116979577

Test Plan:
1) reviewers to double-check audit of safety
2) run framework overhead benchmarks

Reviewed By: dzhulgakov

Differential Revision: D24902978

fbshipit-source-id: d684737cc2568534cac7cd3fb8d623b971c2fd28
2020-11-18 20:37:17 -08:00
5883e0b0e0 [quant][fix][ez] Fix quant_type classification for fp16, fp16 (#48073)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48073

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D25011799

fbshipit-source-id: a12f645d6be1c607898633225b02617283d37df1
2020-11-18 20:07:54 -08:00
773d1f3208 [Person Seg] Compress the person seg model (#48008)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48008

### Motivation

The idea is to quantize the weights during model export and dequantize them in `setstate` at runtime. To replicate exactly what caffe2 did, only 10 conv layers were quantized.

Since the code here is restricted to the unet model, I created a custom prepacking context to do the graph rewriting and registering custom ops.

To run on iOS/MacOS, we need to link `unet_metal_prepack` explicitly.
- buck build //xplat/caffe2/fb/custom_ops/unet_metal_prepack:unet_metal_prepackApple
- buck build //xplat/caffe2/fb/custom_ops/unet_metal_prepack:unet_metal_prepackAppleMac

On the server side, the `unet_metal_prepack.cpp` needs to be compiled into `aten_cpu` in order to do the graph rewrite via optimize_for_mobile. However, since we don't want to ship it to production, some local hacks were made to make this happen. More details can be found in the following diffs.

### Results

-rw-r--r--   1 taox  staff   1.1M Nov 10 22:15 seg_init_net.pb
-rw-r--r--   1 taox  staff   1.1M Nov 10 22:15 seg_predict_net.pb

Note that since we quantize the weights, some precision loss is expected, but overall the results are good.

### ARD

- Person seg - v229
- Hair seg - v105
ghstack-source-id: 117019547

Test Plan:
### Video eval results from macos

{F345324969}

Differential Revision: D24881316

fbshipit-source-id: b67811d6d06de82130f4c22392cc961c9dda7559
2020-11-18 20:01:51 -08:00
a97d059614 Get TestTorch.test_empty_meta working again (#48113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48113

Fix is simple: just treat Meta as a backend covered by AutogradOther.
This semantically makes sense, since meta kernels are just like regular
CPU/CUDA kernels, they just don't do any compute.
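
For context, a minimal sketch of what a meta tensor gives you (assuming a build where the `meta` device is exposed through the regular factory functions):

```py
import torch

x = torch.empty(2, 3, device='meta')  # shape/dtype tracked, no storage
print(x.shape, x.device)              # torch.Size([2, 3]) meta
```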

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zhangguanheng66

Differential Revision: D25056641

Pulled By: ezyang

fbshipit-source-id: 7b68911982352b3e0ee8616b38cd9c70bd58a740
2020-11-18 19:50:27 -08:00
4c9eb57914 [PyTorch] Narrow Device to 2 bytes by narrowing DeviceType and DeviceIndex (#47023)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47023

DeviceType pretty clearly only needs 1 byte. DeviceIndex only needs 1 byte given that machines don't have anywhere near 255 GPUs in them as far as I know.
ghstack-source-id: 116901430

Test Plan: Existing tests, added assertion to catch if my assumption about DeviceIndex is incorrect

Reviewed By: dzhulgakov

Differential Revision: D24605460

fbshipit-source-id: 7c9a89027fcf8eebd623b7cdbf6302162c981cd2
2020-11-18 19:39:40 -08:00
72918e475e [quant] FakeQuantize inherit from FakeQuantizeBase (#48072)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48072

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D25011074

fbshipit-source-id: 260f4d39299bc148b65c21e67b571dfa1d0fe2ad
2020-11-18 19:14:20 -08:00
efeb988518 Suppress "ioctl points to uninitialised" check (#48187)
Summary:
libcuda.so from CUDA-11.1 makes an ioctl() call that valgrind's memcheck tool considers dangerous.
This change instructs valgrind to suppress that check.

Fixes false positives reported in https://app.circleci.com/pipelines/github/pytorch/pytorch/240774/workflows/d4c66de8-f13b-47a2-ae62-2ec1bbe0664b/jobs/9026496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48187

Reviewed By: janeyx99

Differential Revision: D25059850

Pulled By: malfet

fbshipit-source-id: 982df5860524482b0fcb2bfc6bb490fb06694cf6
2020-11-18 18:45:46 -08:00
576fa09157 [quant][fix] Fix quant type classification for float_qparam qconfig (#48069)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48069

also renamed float_qparam_dynamic_qconfig to float_qparam_weight_only_qconfig
It's not used in user code yet so we only need to update the tests.

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D25010175

fbshipit-source-id: caa3eaa5358a8bc5c808bf5f64e6ebff3e0b61e8
2020-11-18 18:22:08 -08:00
f0f8b97d19 Introducing winograd transformed fp16 nnpack to PT for unet 106 (#47925)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47925

ghstack-source-id: 117004847

Test Plan:
buck run caffe2/fb/custom_ops/unet_106_pt:unet_106_rewrite
    buck run caffe2/fb/custom_ops/unet_106_pt:tests

Reviewed By: dreiss

Differential Revision: D24822418

fbshipit-source-id: 0c0bc0772e4c878e979ee3d2078105377e220c43
2020-11-18 18:05:53 -08:00
383abf1f0c [PyTorch] Make RecordFunction::active private (#47549)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47549

In preparation for moving state onto the heap.
ghstack-source-id: 117027862

Test Plan: CI

Reviewed By: ilia-cher

Differential Revision: D24812214

fbshipit-source-id: 1455c2782b66f6a59c4d45ba58e1c4c92402a323
2020-11-18 17:58:54 -08:00
1bafff2366 [PyTorch][JIT] Skip unnecessary refcounting in TensorType::merge (#47959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47959

Taking a shared_ptr by value incurs refcounting overhead and should only be done if the callee needs to take ownership. Otherwise, `const T&` is more efficient. (Specifically, you will have to do an atomic decrement when the argument is destroyed and probably an atomic increment as well. Passing by `const T&` also takes one less register than passing `std::shared_ptr<T>`, but that's less important.)

This diff fixes just this one function, but I'd be happy to audit & fix this whole file in future diffs. Thoughts?
ghstack-source-id: 116914899

Test Plan: build ATen-cpu

Reviewed By: Krovatkin

Differential Revision: D24970954

fbshipit-source-id: 6bdb4b710a94b8baf4ad63418fb38136134e0ef3
2020-11-18 17:49:16 -08:00
0f89be616a Removing non-thread-safe log statement from ReinitializeTensor (#48185)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48185

In a scenario where we have Caffe2 wrapped into a dynamic library, we were running into the memory corruption crash at program termination:

"corrupted size vs. prev_size in fastbins"

Turns out the crash occurs in glog's logging.cc, which is not thread-safe and has to initialize a static hostname string when flushing. If this ends up happening on multiple threads simultaneously, it can lead to memory corruption.

```
==1533667== Invalid free() / delete / delete[] / realloc()
==1533667==    at 0xA3976BB: operator delete(void*, unsigned long) (vg_replace_malloc.c:595)
==1533667==    by 0x37E36AE: std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() (basic_string.h:647)
==1533667==    by 0xAD87F6B: __run_exit_handlers (in /usr/lib64/libc-2.28.so)
==1533667==    by 0xAD8809F: exit (in /usr/lib64/libc-2.28.so)
==1533667==    by 0xAD71799: (below main) (in /usr/lib64/libc-2.28.so)
==1533667==  Address 0x165cd720 is 0 bytes inside a block of size 31 free'd
==1533667==    at 0xA3976BB: operator delete(void*, unsigned long) (vg_replace_malloc.c:595)
==1533667==    by 0x37E36AE: std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() (basic_string.h:647)
==1533667==    by 0xAD87F6B: __run_exit_handlers (in /usr/lib64/libc-2.28.so)
==1533667==    by 0xAD8809F: exit (in /usr/lib64/libc-2.28.so)
==1533667==    by 0xAD71799: (below main) (in /usr/lib64/libc-2.28.so)
==1533667==  Block was alloc'd at
==1533667==    at 0xA39641F: operator new(unsigned long) (vg_replace_malloc.c:344)
==1533667==    by 0x37F4E18: std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_mutate(unsigned long, unsigned long, char const*, unsigned long) (basic_string.tcc:317)
==1533667==    by 0x37F4F2E: std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_replace(unsigned long, unsigned long, char const*, unsigned long) (basic_string.tcc:466)
==1533667==    by 0x5170344: GetHostName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) (logging.cc:227)
==1533667==    by 0x51702D4: google::LogDestination::hostname[abi:cxx11]() (logging.cc:555)
==1533667==    by 0x5173789: google::(anonymous namespace)::LogFileObject::Write(bool, long, char const*, int) (logging.cc:1072)
==1533667==    by 0x51746DF: google::LogDestination::LogToAllLogfiles(int, long, char const*, unsigned long) (logging.cc:773)
==1533667==    by 0x5170BDC: google::LogMessage::SendToLog() (logging.cc:1386)
==1533667==    by 0x5171236: google::LogMessage::Flush() (logging.cc:1305)
==1533667==    by 0x517114D: google::LogMessage::~LogMessage() (logging.cc:1264)
==1533667==    by 0x108DC840: caffe2::ReinitializeTensor(caffe2::Tensor*, c10::ArrayRef<long>, c10::TensorOptions) (tensor.cc:0)
==1533667==    by 0x103BBED0: caffe2::int8::Int8GivenTensorFillOp::RunOnDevice() (int8_given_tensor_fill_op.h:29)
==1533667==
```

There doesn't seem to be an obvious easy solution here. The logging API being used by c10 is fundamentally not thread-safe, at least when it uses glog. Glog does have a threadsafe API (raw_logging), but this doesn't seem to be used by c10 right now. I suspect other callers are not running into this crash because:
- They have other libraries using glog in their module, so the static variable in glog gets initialized before getting into a race condition
- They don't use int8 network in a glog context, thus avoiding this problematic log statement

An alternative fix would be to correctly initialize the dtype of the int8 tensor, which is currently always uninitialized, making the log statement always trigger for int8 networks. Initializing the int8 tensor correctly in tensor_int8.h is proving to be challenging though, at least without knowledge of Caffe2's codebase. And even then, it wouldn't fix the issue for all use cases.

Test Plan: Ran my app with valgrind, I no longer get the crash and valgrind doesn't complain about  a memory corruption anymore

Reviewed By: thyu, qizzzh

Differential Revision: D25040725

fbshipit-source-id: 1392a97ccf9b4c9ade1ea713610ee44a1578ae7d
2020-11-18 17:42:22 -08:00
4360486346 pass strict_fuser_check for recursive fusion (#47221)
Summary:
We forgot to pass `strict_fuser_check` recursively to nested GraphFuser.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47221

Reviewed By: zhangguanheng66

Differential Revision: D25060095

Pulled By: Krovatkin

fbshipit-source-id: 31fe79c3bc080b637fce9aacc562d60708223321
2020-11-18 16:57:38 -08:00
ea1e78a0c5 Revert D24853669: [pytorch][PR] Migrate eig from the TH to Aten (CUDA)
Test Plan: revert-hammer

Differential Revision:
D24853669 (866f8591be)

Original commit changeset: a513242dc7f4

fbshipit-source-id: a0c8c424b61b1e627d9102de6b4c6d0717a6c06d
2020-11-18 16:53:18 -08:00
2fbd70d336 fft: Generalize fill with conjugate symmetry and use complex dtypes (#46908)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46908

Generalize to non-contiguous dimensions.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25048504

Pulled By: mruberry

fbshipit-source-id: a82545de17fc207fefea7fbd88d03042a3ca41fe
2020-11-18 15:39:14 -08:00
0639387ff1 move Tensor comparisons back to C (#48018)
Summary:
It seems that the machinery to handle comparison methods in C rather than Python already exists, unless I'm missing something. (There is a wrapper for `TypeError_to_NotImplemented_`, and the Python code gen handles `__torch_function__`; these are the two things `_wrap_type_error_to_not_implemented` does.) The performance change is quite stark:

```
import torch
from torch.utils.benchmark import Timer

global_dict = {
    "x": torch.ones((2, 2)),
    "y_scalar": torch.ones((1,)),
    "y_tensor": torch.ones((2, 1)),
}

for stmt in ("x == 1", "x == y_scalar", "x == y_tensor"):
    print(Timer(stmt, globals=global_dict).blocked_autorange(min_run_time=5), "\n")
```

### Before:
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f3d1289dc10>
x == 1
  Median: 12.86 us
  IQR:    0.65 us (12.55 to 13.20)
  387 measurements, 1000 runs per measurement, 1 thread

<torch.utils.benchmark.utils.common.Measurement object at 0x7f3d1289d1d0>
x == y_scalar
  Median: 6.03 us
  IQR:    0.33 us (5.91 to 6.24)
  820 measurements, 1000 runs per measurement, 1 thread

<torch.utils.benchmark.utils.common.Measurement object at 0x7f3d2b9e2050>
x == y_tensor
  Median: 6.34 us
  IQR:    0.33 us (6.16 to 6.49)
  790 measurements, 1000 runs per measurement, 1 thread
```

### After:
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7fbdba2a16d0>
x == 1
  Median: 6.88 us
  IQR:    0.40 us (6.74 to 7.14)
  716 measurements, 1000 runs per measurement, 1 thread

<torch.utils.benchmark.utils.common.Measurement object at 0x7fbdd2e07ed0>
x == y_scalar
  Median: 2.98 us
  IQR:    0.19 us (2.89 to 3.08)
  167 measurements, 10000 runs per measurement, 1 thread

<torch.utils.benchmark.utils.common.Measurement object at 0x7fbdd33e4510>
x == y_tensor
  Median: 3.03 us
  IQR:    0.13 us (2.97 to 3.10)
  154 measurements, 10000 runs per measurement, 1 thread
```

There's still a fair bit of work left: the equivalent NumPy call is still about 6x faster than the new overhead, and PyTorch 0.4 was about 1.25 us across the board (no scalar cliff). But it's a start.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48018

Reviewed By: gchanan

Differential Revision: D25026257

Pulled By: robieta

fbshipit-source-id: 093b06a1277df25b4b7cc0d4e585b558937b10a1
2020-11-18 15:25:41 -08:00
ed4dd86567 move aten::round to lite interpreter (#45931)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45931

move aten::round to lite interpreter. It's needed by TTS

Test Plan: build

Reviewed By: zhizhengwu

Differential Revision: D24149089

fbshipit-source-id: c8e292598dd04d7f0d40f121cb861f91d359e957
2020-11-18 12:30:32 -08:00
a36e646878 [pytorch][codegen] simplify python signature creation logic (#47977)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47977

Avoid calling CppSignatureGroup api - python signature shouldn't depend on
cpp signature. Still use cpp.group_arguments() to group TensorOptions.

Confirmed byte-for-byte compatible with the old codegen:

```
Run it before and after this PR:
  .jenkins/pytorch/codegen-test.sh <baseline_output_dir>
  .jenkins/pytorch/codegen-test.sh <test_output_dir>

Then run diff to compare the generated files:
  diff -Naur <baseline_output_dir> <test_output_dir>
```

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24976334

Pulled By: ljk53

fbshipit-source-id: 5df5a7bbfd2b8cb460153e5bea4d91e65f716390
2020-11-18 12:26:50 -08:00
5eaf8562cd [pytorch][codegen] simplify dunder method check in gen_python_functions.py (#47976)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47976

Confirmed byte-for-byte compatible with the old codegen:

```
Run it before and after this PR:
  .jenkins/pytorch/codegen-test.sh <baseline_output_dir>
  .jenkins/pytorch/codegen-test.sh <test_output_dir>

Then run diff to compare the generated files:
  diff -Naur <baseline_output_dir> <test_output_dir>
```

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24976273

Pulled By: ljk53

fbshipit-source-id: 6f8f20d18db20c3115808bfac0a8b8ad83dcf64c
2020-11-18 12:26:47 -08:00
5243456728 [pytorch][codegen] remove dead code in gen_variable_type.py (#47975)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47975

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24976274

Pulled By: ljk53

fbshipit-source-id: 8542471ee30f26592aad949fc17eef87a47df024
2020-11-18 12:26:44 -08:00
07657b6001 [tensorexpr] Switch cpp tests to pure gtest (#48160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48160

We no longer use the custom c++ test infra anyways, so move to pure
gtest.

Fixes #45703
ghstack-source-id: 116977283

Test Plan: `buck test //caffe2/test/cpp/tensorexpr`

Reviewed By: navahgar, nickgg

Differential Revision: D25046618

fbshipit-source-id: da34183d87465f410379048148c28e1623618553
2020-11-18 12:23:34 -08:00
464d23e6b4 [te][benchmark] Add more optimized versions of gemm (#48159)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48159

Test Plan: Imported from OSS

Reviewed By: Chillee, ngimel

Differential Revision: D25059742

Pulled By: bertmaher

fbshipit-source-id: f197347f739c5bd2a4182c59ebf4642000c3dd55
2020-11-18 12:21:08 -08:00
8a996dd139 [te] Make BUILD_TENSOREXPR_BENCHMARK a real CMake option (#48158)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48158

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D25059877

Pulled By: bertmaher

fbshipit-source-id: a98b6c18a91b4fe89d12bf5f7ead604e3cc0c8b0
2020-11-18 12:19:14 -08:00
866f8591be Migrate eig from the TH to Aten (CUDA) (#44105)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24553

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44105

Reviewed By: heitorschueroff

Differential Revision: D24853669

Pulled By: mruberry

fbshipit-source-id: a513242dc7f49f55dbc6046c18d8a9d9aa2aaf8d
2020-11-18 12:10:18 -08:00
8af9f2cc23 Revert D24924736: [pytorch][PR] Hipify revamp
Test Plan: revert-hammer

Differential Revision:
D24924736 (10b490a3e0)

Original commit changeset: 4af42b8ff4f2

fbshipit-source-id: 7f8f90d55d8a69a2890ec73622fcea559189e381
2020-11-18 11:48:30 -08:00
68a3a3f3b5 Add torch.swapdims and torch.swapaxes (#46041)
Summary:
Reference https://github.com/pytorch/pytorch/issues/38349

Delegates to `torch.transpose` (not sure what is the best way to alias)
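
A short usage sketch of the aliases (values illustrative):

```py
import torch

x = torch.randn(2, 3, 4)
# swapdims/swapaxes are NumPy-style aliases of transpose
assert torch.swapaxes(x, 0, 2).shape == (4, 3, 2)
assert torch.equal(torch.swapdims(x, 0, 2), x.transpose(0, 2))
```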

TODO:
* [x] Add test
* [x] Add documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46041

Reviewed By: gchanan

Differential Revision: D25022816

Pulled By: mruberry

fbshipit-source-id: c80223d081cef84f523ef9b23fbedeb2f8c1efc5
2020-11-18 11:35:53 -08:00
d256e38823 [JIT] Pass TypePtr by reference in Argument::type() and Type::isSubtypeOfExt(). (#48061)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48061

This results in a ~6% improvement on DeepAndWide model and would improve
other models as well.

Before the change:
```
393[ms]
458[ms]
413[ms]
390[ms]
430[ms]
399[ms]
426[ms]
392[ms]
428[ms]
399[ms]
```

After the change:
```
396[ms]
375[ms]
396[ms]
392[ms]
370[ms]
402[ms]
395[ms]
409[ms]
366[ms]
388[ms]
```

Differential Revision: D25006357

Test Plan: Imported from OSS

Reviewed By: suo

Pulled By: ZolotukhinM

fbshipit-source-id: c9cdc6354c42962b14207db31cf2580a4e2430b1
2020-11-18 11:29:46 -08:00
df88cc3f7f Document that remainder does not support complex inputs (#48024)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/34266

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48024

Reviewed By: ngimel

Differential Revision: D25028700

Pulled By: mruberry

fbshipit-source-id: 6d88c7d0930283455deb51d70708cc4919eeca55
2020-11-18 11:21:23 -08:00
0387f2a6fa Fix default value of num_replicas in DistributedSampler docstring (#48135)
Summary:
Change default value of `num_replicas` from `rank` to `world_size` in DistributedSampler docstring.

Addresses https://github.com/pytorch/pytorch/issues/48055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48135

Reviewed By: gchanan

Differential Revision: D25045328

Pulled By: rohan-varma

fbshipit-source-id: 6f84f7bb69087d8dae931cda51891b3cb1894306
2020-11-18 11:18:40 -08:00
140e946fec Disable distributed collectives profiling tests (#48129)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48129

It looks like all test failures in distributed_test have to do with
profiling, so disabling them in this PR (by setting `expect_event=False`
always), so that the distributed profiling tests don't run.

Created https://github.com/pytorch/pytorch/issues/48127 to track the fix.
Will verify with CI all that re-enabling distributed tests passes as expected.
ghstack-source-id: 116938304

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D25034888

fbshipit-source-id: c10bad3ca2425a2f2cde82232001dafcca152d1c
2020-11-18 11:12:09 -08:00
a6898cb5f4 Small documentation changes for RRef and Dist Autograd (#48123)
Summary:
Small wording changes and polishing documentation for:

https://pytorch.org/docs/master/rpc/rref.html
https://pytorch.org/docs/master/rpc/distributed_autograd.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48123

Reviewed By: zhangguanheng66

Differential Revision: D25059320

Pulled By: H-Huang

fbshipit-source-id: 7a0be56f062de06483b3bd3a5d617234101862ba
2020-11-18 10:57:59 -08:00
81b1673a21 Enable complex tests that depend on batched matmul on CUDA (#47910)
Summary:
Now when https://github.com/pytorch/pytorch/pull/42553 is merged we can delete a bit of code from the tests and enable some of the skipped complex tests.

Unfortunately, `test_pinverse_complex_xfailed` and `test_symeig_complex_xfailed` had bugs, and it wasn't caught automatically that these tests xpass. We need to be careful next time with `unittest.expectedFailure`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47910

Reviewed By: zhangguanheng66

Differential Revision: D25052130

Pulled By: mruberry

fbshipit-source-id: 29512995c024b882f9cb78b7bede77733d5762d0
2020-11-18 10:44:47 -08:00
3ca4c656de Install magma on CUDA 11.1 (#48164)
Summary:
cc: xwang233 janeyx99

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48164

Reviewed By: malfet, zhangguanheng66

Differential Revision: D25058068

Pulled By: janeyx99

fbshipit-source-id: ab136ba60e5dda6a2eb7ac76548875e1df75a242
2020-11-18 10:12:08 -08:00
10b490a3e0 Hipify revamp (#45451)
Summary:
This PR revamps the hipify module in PyTorch to overcome a long list of shortcomings in the original implementation. However, these improvements are applied only when using hipify to build PyTorch extensions, **not for PyTorch or Caffe2 itself**.

Correspondingly, changes are made to `cpp_extension.py` to match these improvements.

The list of improvements to hipify is as follows:

1. Hipify files in the same directory as the original file, unless there's a "cuda" subdirectory in the original file path, in which case the hipified file will be in the corresponding file path with "hip" subdirectory instead of "cuda".
2. Never hipify the file in-place if changes are introduced due to hipification i.e. always ensure the hipified file either resides in a different folder or has a different filename compared to the original file.
3. Prevent re-hipification of already hipified files. This avoids creation of unnecessary "hip/hip" etc. subdirectories and additional files which have no actual use.
4. Do not write out hipified versions of files if they are identical to the original file. This results in a cleaner output directory, with minimal number of hipified files created.
5. Update header rewrite logic so that it accounts for the previous improvement.
6. Update header rewrite logic so it respects the rules for finding header files depending on whether `""` or `<>` is used.
7. Return a dictionary of mappings of original file paths to hipified file paths from `hipify` function.
8. Introduce a version for the hipify module to allow extensions to contain backward-compatible code that targets a specific point in PyTorch where the hipify functionality changed.
9. Update `cuda_to_hip_mappings.py` to account for the ROCm component subdirectories inside `/opt/rocm/include`. This also results in cleanup of the `Caffe2_HIP_INCLUDE` path to remove unnecessary additions to the include path.

The list of changes to `cpp_extension.py` is as follows:
1. Call `hipify` when building a CUDAExtension for ROCm.
2. Prune the list of source files to CUDAExtension to include only the hipified versions of any source files in the list (if both original and hipified versions of the source file are in the list)
3. Add subdirectories of /opt/rocm/include to the include path for extensions, so that ROCm headers for subcomponent libraries are found automatically

cc jeffdaily sunway513 hgaspar lcskrishna ashishfarmer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45451

Reviewed By: ezyang

Differential Revision: D24924736

Pulled By: malfet

fbshipit-source-id: 4af42b8ff4f21c3782dedb8719b8f9f86b34bd2d
2020-11-18 08:37:49 -08:00
1454cbf087 Make numpy optional dependency for torch.cuda.amp (#48154)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48154

Test Plan:
Uninstall `numpy` and try importing `torch`

Discovered while working on https://github.com/pytorch/pytorch/issues/48145
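
A hedged sketch of the optional-dependency pattern this kind of change uses (illustrative, not the exact diff):

```py
# import numpy lazily and tolerate its absence
try:
    import numpy as np
    HAS_NUMPY = True
except ModuleNotFoundError:
    np = None
    HAS_NUMPY = False
```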

Reviewed By: walterddr

Differential Revision: D25046307

Pulled By: malfet

fbshipit-source-id: c1171a49e03bdc40e8dc1d65928c6c12626e33db
2020-11-18 08:31:44 -08:00
e2b4c63dd9 Enable the faster combined weight branch in MHA when query/key/value is same object with nan (#48126)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47979

For the MHA module, it is preferred to use the combined-weight branch as much as possible when query/key/value are the same (either equal values per `torch.equal`, or the exact same object per the `is` operator). This PR enables the faster branch when a single object containing `nan` is passed to MHA.

For the background knowledge
```
import torch
a = torch.tensor([float('NaN'), 1, float('NaN'), 2, 3])
print(a is a) # True
print(torch.equal(a, a)) # False
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48126

Reviewed By: gchanan

Differential Revision: D25042082

Pulled By: zhangguanheng66

fbshipit-source-id: 6bb17a520e176ddbb326ddf30ee091a84fcbbf27
2020-11-18 08:24:41 -08:00
9ead558899 Add max supported SM for nvrtc-11.0 (#48151)
Summary:
Should fix the regression when nvrtc from CUDA-11.0 is used on the system with RTX3080

Addresses issue described in https://github.com/pytorch/pytorch/issues/47669#issuecomment-725073808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48151

Reviewed By: ngimel

Differential Revision: D25043899

Pulled By: malfet

fbshipit-source-id: 998ded59387e3971c2c1a5df4af595630515a72e
2020-11-18 08:17:28 -08:00
21c823970e [ROCm] remove sccache wrappers post build (#47947)
Summary:
For ROCm, the CI images serve both the CI jobs and public releases. Without removing the sccache wrappers, end users are forced to use sccache. Our users have encountered sccache bugs when using our PyTorch images, so we choose to remove the wrappers after the CI build completes. Further, runtime compilation of MIOpen kernels still experiences errors due to sccache.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47947

Reviewed By: gchanan

Differential Revision: D24994031

Pulled By: malfet

fbshipit-source-id: 65c57ae98e28fc0ce79f754b792d504148c7fcd6
2020-11-18 08:09:15 -08:00
98722ab8a7 There should be a newline between BUILD WITH CUDA and NVTX (#48048)
Summary:
When you do want to insert a `<br />` break tag using Markdown, you end a line with two or more spaces, then type return.

From
https://stackoverflow.com/questions/33191744/how-to-add-new-line-in-markdown-presentation/33191810

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48048

Reviewed By: gchanan

Differential Revision: D25003623

Pulled By: walterddr

fbshipit-source-id: ab5f7267ae936f6f006b4afa43254afa690ef7f4
2020-11-18 08:00:05 -08:00
2ff748a680 Move kthvalue scalar test to separate method for XLA (#48042)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48042

Moving the scalar test to a separate method so the XLA team can continue to test the other cases without failing. Requested in https://github.com/pytorch/xla/issues/2620#issuecomment-725696108

Test Plan: Imported from OSS

Reviewed By: zhangguanheng66

Differential Revision: D25055677

Pulled By: heitorschueroff

fbshipit-source-id: 5da66bac78ea197821fee0b9b8a213ff2dc19c67
2020-11-18 07:49:14 -08:00
ca8b9437ab Add type annotations for a few torch.nn.modules (#46013)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46012

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46013

Reviewed By: gchanan

Differential Revision: D25012419

Pulled By: ezyang

fbshipit-source-id: 9fd8ad9fa3122efa294a08010171cb7ddf752778
2020-11-18 07:44:59 -08:00
8c00221fe2 Fix inconsistent environment variable naming for setting NVTOOLEXT_HOME in TorchConfig.cmake (#48012)
Summary:
When building libtorch with CUDA installed in some unconventional location,
the CMake files rely on environment variables to set CMake variables; in
particular, the NVTOOLSEXT_PATH environment variable is used to set
NVTOOLEXT_HOME in cmake/public/cuda.cmake. Later, when consuming such a build
through the generated CMake finder TorchConfig.cmake, a different convention
is used, relying on a completely new environment variable, NVTOOLEXT_HOME.
This feels rather inconsistent, since the former mechanism is still in place:
cmake/public/cuda.cmake is transitively called via Caffe2Config.cmake, which
is in turn called by TorchConfig.cmake.

Fixes https://github.com/pytorch/pytorch/issues/48032

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48012

Reviewed By: gchanan

Differential Revision: D25031260

Pulled By: ezyang

fbshipit-source-id: 0d6ab8ba9f52dd10be418b1a92b0f53c889f3f2d
2020-11-18 07:37:53 -08:00
2832e325dd [TensorPipe] Avoid using deprecated alias for error (#48168)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48168

TensorPipe deduplicated a set of errors (which existed under both the ::transport and the ::channel namespaces). The old names were kept as aliases, but we should migrate to the new ones.
ghstack-source-id: 116989010

Test Plan: CI

Reviewed By: beauby

Differential Revision: D25051218

fbshipit-source-id: caef27f1a0ff0e6f0b8b09fa92d6f79641c1e17a
2020-11-18 04:59:08 -08:00
df0ae244a9 [static runtime] Add out_ variant for aten::stack and aten::nan_to_num (#48150)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48150

With D24767322, the remaining ops without an out_ variant are pretty much the sparsenn-specific ops, which are a bit trickier to add.
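
For context, a small eager-mode illustration of the out_ pattern that static runtime relies on (my own sketch, not the static runtime code): the out variant writes into a preallocated buffer instead of allocating a fresh output on every call.

```
import torch

ts = [torch.rand(2, 3) for _ in range(4)]
buf = torch.empty(4, 2, 3)           # preallocated output buffer
torch.stack(ts, out=buf)             # out_ variant: writes into buf, no new allocation
```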

Test Plan:
```
buck run //caffe2/test:static_runtime
buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck run //caffe2/caffe2/fb/predictor:pytorch_predictor_test

buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \
--pred_net=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/precomputation_merge_net.pb \
--c2_inputs=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/c2_inputs_precomputation_bs1.pb \
--c2_weights=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/c2_weights_precomputation.pb \
--scripted_model=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/traced_precomputation_partial_dper_fixes.pt \
--pt_inputs=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/container_precomputation_bs1.pt \
--iters=1 --warmup_iters=1 --num_threads=1 --pt_enable_static_runtime=true \
--pt_cleanup_activations=true --pt_enable_out_variant=true \
--eps 1e-2
```

Reviewed By: bwasti

Differential Revision: D25016076

fbshipit-source-id: 59a7948d4cca60182b6755217571128c2fc51f4d
2020-11-17 23:06:17 -08:00
6049653c20 [quant][graphmode][fx] Keep linear op unchanged when qconfig is not supported (#48067)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48067

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D25008463

fbshipit-source-id: d0bfc6bf8d544824d0a55cd4bcd1f9301d75c935
2020-11-17 21:59:55 -08:00
a1f494cb8b Fix test_inverse_singular for cublas path; fix cusolver inverse multi-stream issue (#47026)
Summary:
### test_inverse_singular for cublas failure

Related
https://github.com/pytorch/pytorch/pull/46616#issuecomment-718102758
https://app.circleci.com/pipelines/github/pytorch/pytorch/232112/workflows/4131d4ca-cd51-44e3-8e6c-b1c3555c62fa/jobs/8523970/tests

The CUDA 11.1 CI container doesn't have the MAGMA library, so the cuBLAS matrix-inverse path is enabled.
```
Oct 27 23:13:47 -- MAGMA not found. Compiling without MAGMA support
```

test_inverse_singular was introduced in https://github.com/pytorch/pytorch/pull/46625, but I forgot to fix that functionality for the cuBLAS path as well.

### cusolver inverse multi-stream failure

fix https://github.com/pytorch/pytorch/issues/47272

The original CUDA event record/stream-blocking logic was wrong, which could cause NaN in the output tensor.

On my machine, the original code observes NaN in about 50k~500k loops. After this change, no NaN is observed in more than 2.5m loops.

The performance for batch-2 matrix inverse is still the same as in https://github.com/pytorch/pytorch/issues/42403.
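
For reference, a generic sketch of the record/wait pattern this kind of fix restores (my own illustration based on the description above, not the actual kernel code): the consuming stream must wait on an event recorded on the producing stream before reading the result.

```
import torch

producer = torch.cuda.Stream()
consumer = torch.cuda.Stream()
done = torch.cuda.Event()

with torch.cuda.stream(producer):
    x = torch.rand(2, 3, 3, device="cuda")
    inv = torch.inverse(x)
    done.record(producer)            # mark the point where inv is fully written

consumer.wait_event(done)            # consumer will not run ahead of the producer
with torch.cuda.stream(consumer):
    y = inv.sum()                    # safe: sees a fully written inv
```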

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47026

Reviewed By: mruberry

Differential Revision: D24838546

Pulled By: ngimel

fbshipit-source-id: 3b83e4ab8e6b47a8273cba277251765bd6d97911
2020-11-17 21:42:11 -08:00
bc484cfed1 [c10d][jit] initial torchbind bindings for ProcessGroupNCCL (#42944)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42944

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23228682

Pulled By: wanchaol

fbshipit-source-id: 30f4258ec2a90202264745511b897f4e1f5550f7
2020-11-17 21:01:55 -08:00
cc611280d3 Revert D24862372: [PyTorch Mobile] Fix for messenger: avoid error with [-Werror,-Wglobal-constructors]
Test Plan: revert-hammer

Differential Revision:
D24862372 (9392137dbe)

Original commit changeset: d07548645d5a

fbshipit-source-id: 973678b9afe64b68df774c327ba3b62ff252a141
2020-11-17 19:04:57 -08:00
4883d39c6f Avoid direct reference to at::native::tensor from TensorDataContainer (#47567)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47567

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24822517

Pulled By: iseeyuan

fbshipit-source-id: f69bfc029aae5199dbc63193fc7a5e5e6feb5790
2020-11-17 17:32:21 -08:00
c6c6a53ba0 [JIT] Fix function schema subtype checking (#47965)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47965

**Summary**
This commit fixes `FunctionSchema::isSubtypeOf` so that the subtyping rule it
implements for `FunctionSchema` instances is contravariant in argument
types and covariant in return type. At present, the rule is covariant in
argument types and contravariant in return type, which is not correct.

A brief but not rigorous explanation follows. Suppose there are two
`FunctionSchema`s, `M = (x: T) -> R` and `N = (x: U) -> S`. For `M <= N`
to be true (i.e. for `M` to be a subtype of `N`), it must be true that
`U <= T` and `R <= S`. This generalizes to functions with multiple
arguments.
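
A minimal Python illustration of the corrected rule (my own sketch, not the JIT's checking code): a subtype may accept wider argument types and return narrower results.

```
class Animal: ...
class Mammal(Animal): ...            # Mammal <= Animal

def n(x: Mammal) -> Animal: ...      # N = (x: Mammal) -> Animal

def m(x: Animal) -> Mammal:          # M = (x: Animal) -> Mammal, a valid subtype of N
    return Mammal()

# Any caller written against N stays correct when handed m: every Mammal it
# passes is accepted, and every Mammal it gets back is an Animal.
result: Animal = m(Mammal())
```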

**Test Plan**
This commit extends `TestModuleInterface.test_module_interface_subtype`
with two new tests cases that test the contravariance of argument types
and covariance of return types in determining whether a `Module`
implements an interface type.

Test Plan: Imported from OSS

Reviewed By: qizzzh

Differential Revision: D24970883

fbshipit-source-id: 2e4bda079c7062806c105ffcc14a28796b063525
2020-11-17 17:19:13 -08:00
94cd048bda Added foreach_frac API (#47384)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47384

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D24737052

Pulled By: izdeby

fbshipit-source-id: 8c94cc42bf22bfbb8f78bfeb2017a5756045763a
2020-11-17 16:56:30 -08:00
134bce7cd0 Adding bunch of unary foreach APIs (#47875)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47875

Implementing several unary operators for _foreach_ APIs.
### Planned list of ops
- [x]  abs
- [x]  acos
- [x]  asin
- [x]  atan
- [x]  ceil
- [x]  cos
- [x]  cosh
- [x]  erf
- [x]  erfc
- [x]  exp
- [x]  expm1
- [x]  floor
- [x]  log
- [x]  log10
- [x]  log1p
- [x]  log2
- [ ]  frac
- [x]  neg
- [ ]  reciprocal
- [x]  round
- [ ]  rsqrt
- [ ]  sigmoid
- [x]  sin
- [x]  sinh
- [x]  sqrt
- [x]  tan
- [x]  tanh
- [ ]  trunc
- [x]  lgamma
- [ ]  digamma
- [ ]  erfinv
- [ ]  sign
- [ ]  mvlgamma
- [ ]  clamp
- [ ]  clamp_min
- [ ]  clamp_max

### Perf results
```
----------------- OP:  sin  -----------------
  Median: 998.79 us
  300.84 us

----------------- OP:  abs  -----------------
  Median: 1.19 ms
  294.97 us

----------------- OP:  acos  -----------------
  Median: 982.30 us
  299.40 us

----------------- OP:  asin  -----------------
  Median: 1.16 ms
  298.09 us

----------------- OP:  atan  -----------------
  Median: 986.92 us
  295.64 us

----------------- OP:  ceil  -----------------
  Median: 1.17 ms
  297.25 us

----------------- OP:  cos  -----------------
  Median: 972.72 us
  294.41 us

----------------- OP:  cosh  -----------------
  Median: 1.17 ms
  294.97 us

----------------- OP:  erf  -----------------
  Median: 1.17 ms
  297.02 us

----------------- OP:  erfc  -----------------
  Median: 1.14 ms
  299.23 us

----------------- OP:  exp  -----------------
  Median: 1.15 ms
  298.79 us

----------------- OP:  expm1  -----------------
  Median: 1.17 ms
  291.79 us

----------------- OP:  floor  -----------------
  Median: 1.17 ms
  293.51 us

----------------- OP:  log  -----------------
  Median: 1.13 ms
  318.01 us

----------------- OP:  log10  -----------------
  Median: 987.17 us
  295.57 us

----------------- OP:  log1p  -----------------
  Median: 1.13 ms
  297.15 us

----------------- OP:  log2  -----------------
  Median: 974.21 us
  295.01 us

----------------- OP:  frac  -----------------
  Median: 1.15 ms
  296.01 us

----------------- OP:  neg  -----------------
  Median: 1.13 ms
  294.98 us

----------------- OP:  reciprocal  -----------------
  Median: 1.16 ms
  293.69 us

----------------- OP:  round  -----------------
  Median: 1.12 ms
  297.48 us

----------------- OP:  sigmoid  -----------------
  Median: 1.13 ms
  296.53 us

----------------- OP:  sin  -----------------
  Median: 991.02 us
  295.78 us

----------------- OP:  sinh  -----------------
  Median: 1.15 ms
  295.70 us

----------------- OP:  sqrt  -----------------
  Median: 1.17 ms
  297.75 us

----------------- OP:  tan  -----------------
  Median: 978.20 us
  297.99 us

----------------- OP:  tanh  -----------------
  Median: 967.84 us
  297.29 us

----------------- OP:  trunc  -----------------
  Median: 1.14 ms
  298.72 us

----------------- OP:  lgamma  -----------------
  Median: 1.14 ms
  317.53 us
```

### Script

```

import torch
import torch.optim as optim
import torch.nn as nn
import torchvision
import torch.utils.benchmark as benchmark_utils

inputs = [torch.rand(3, 200, 200, device="cuda") for _ in range(100)]

def main():
    for op in [
            "sin", "abs", "acos", "asin", "atan", "ceil",
            "cos", "cosh", "erf", "erfc",
            "exp", "expm1", "floor", "log",
            "log10", "log1p", "log2", "frac",
            "neg", "reciprocal", "round",
            "sigmoid", "sin", "sinh", "sqrt",
            "tan", "tanh", "trunc", "lgamma"
        ]:
        print("\n\n----------------- OP: ", op, " -----------------")
        stmt = "[torch.{op}(t) for t in inputs]"
        timer = benchmark_utils.Timer(
            stmt=stmt.format(op = op),
            globals=globals(),
            label="str(optimizer)",
        )
        print(f"autorange:\n{timer.blocked_autorange()}\n\n")

        stmt = "torch._foreach_{op}(inputs)"
        timer_mta = benchmark_utils.Timer(
            stmt=stmt.format(op = op),
            globals=globals(),
            label="str(optimizer_mta)",
        )
        print(f"autorange:\n{timer_mta.blocked_autorange()}\n\n")

if __name__ == "__main__":
    main()

```

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D24948801

Pulled By: izdeby

fbshipit-source-id: defec3c0394d6816d9a8b05a42a057348f1b4d96
2020-11-17 16:51:54 -08:00
0adace3706 fix calculate_extra_mem_bytes_needed_for (#48102)
Summary:
This PR fixes a bug in `calculate_extra_mem_bytes_needed_for` within `get_device_to_partitions_mapping`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48102

Reviewed By: gcatron

Differential Revision: D25029059

Pulled By: scottxu0730

fbshipit-source-id: 7447b70e8da96b3dc2c5922cf9b62eb306877317
2020-11-17 16:46:11 -08:00
9392137dbe [PyTorch Mobile] Fix for messenger: avoid error with [-Werror,-Wglobal-constructors]
Summary:
The Messenger build may set `[-Werror,-Wglobal-constructors]`, which triggers the compilation error `declaration requires a global destructor`. See https://fb.workplace.com/groups/2148543255442743/permalink/2531994563764275/ for details.

Solution: https://stackoverflow.com/questions/15708411/how-to-deal-with-global-constructor-warning-in-clang

Test Plan:
CI
based on D24842445, `buck test //xplat/messenger/ml/ranking_service:MessengerRankingServiceApple`

Reviewed By: abiczo

Differential Revision: D24862372

fbshipit-source-id: d07548645d5af480c4e53e167b30b7cd7398ccb2
2020-11-17 16:15:44 -08:00
194ea076b2 Update VMA. (#47727)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47727

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D25032739

Pulled By: AshkanAliabadi

fbshipit-source-id: 223df5f18dbfee02ed41eb5e116cc15437e28e8e
2020-11-17 15:40:18 -08:00
568a72bacc Fix Vulkan empty (and family) breakage as a result of API update. (#47937)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47937

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D25032738

Pulled By: AshkanAliabadi

fbshipit-source-id: 8ee573033f7c9c7abcb9c08e4c80ca91da9f422f
2020-11-17 15:35:45 -08:00
cdc2d2843b Structured kernel definitions (#45277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45277

Implements structured kernels as per https://github.com/pytorch/rfcs/pull/9 and ports upsample_nearest1d to use the framework.

The general structure of this diff:

- Define a new syntax for specifying structured kernels in `native_functions.yaml`. You put `structured: True` on the `out` function (that's what you implement) and `structured_delegate: foo.out` on the functional/inplace variants to define them in terms of the `out` function. There's a bunch of new consistency checking to see if you've done this right, though the error messages are of varying quality. This is most of what's going on in tools.codegen.model.
- NativeFunctionGroup turns into StructuredNativeFunctions. Previously I thought that maybe we would use this grouping mechanism for both structured and unstructured kernels, but it turned out that Jiakai needed to make his own grouping structure. So now I've specialized it for structured kernels, which also means I get to add a bunch of invariants, like requiring structured kernels to have both a functional and an out variant. This is the lower bundle of changes in tools.codegen.model.
- When you make an out kernel structured, this induces us to generate a new meta function signature for you to write shape checking and output allocation code. The signatures of these are defined by `tools.codegen.api.meta` and generated into `MetaFunctions.h`. Coverage here is very bare bones and will be driven by the actual operators we port as we go.
- The meaty part of code generation is what we do when we have some grouped StructuredNativeFunctions. We continue to generate a wrapper per function type, but they are a bit different, as they call your meta functions and reference the actual implementations in `out`.
- Then there's a port of `upsample_nearest1d`; easiest to review by just looking at what the final code looks like.

Missing pieces:

- Stride calculation in TensorMeta
- Sufficient sanity checking for inplace/out variants
- Enough rope to make TensorIterator work

This PR improves instruction counts on `upsample_nearest1d` because it eliminates an extra redispatch. Testing `at::upsample_nearest1d(x, {10});`

* Functional: before 1314105, after 1150705
* Out: before 915705, after 838405

These numbers may be jittered by up to +-16400 (which is the difference when I tested against an unaffected operator, `at::upsample_linear1d`), though that may also be because unrelated changes affected all operators globally.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D24253555

Test Plan: Imported from OSS

Reviewed By: smessmer

Pulled By: ezyang

fbshipit-source-id: 4ef58dd911991060f13576864c8171f9cc614456
2020-11-17 15:24:43 -08:00
d7e838467a [quant][graphmode][fx] Embedding/EmbeddingBag works in static quant qconfig (#48062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48062

When Embedding/EmbeddingBag are configured with static quant, we'll skip inserting observers for
them in the graph, keep the op unchanged, and print a warning.
This also aligns with eager mode behavior.

We'll enforce this behavior for other ops that only support dynamic/weight_only quant but not static quant as well.

We used a global variable `DEFAULT_NOT_OBSERVED_QUANTIZE_HANDLER`; this is not exposed to users right now,
but we can add that later if needed.

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D25007537

fbshipit-source-id: 6ab9e025269b44bbfd0d6dd5bb9f95fe3ca9dead
2020-11-17 15:02:04 -08:00
3846e35a55 [GPU] Enable Metal on macosx (#47635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47635

Add macOS support for Metal. The supported OS version is 10.13 and above.
ghstack-source-id: 116845318

Test Plan:
1. Sandcastle Tests
2. CircleCI Jobs
3. In the next diff, we'll run the person segmentation model inside a macos app

Reviewed By: dreiss

Differential Revision: D24825088

fbshipit-source-id: 10d7976c953e765599002dc42d7f8d248d7c9846
2020-11-17 14:44:34 -08:00
05dc9821be .circleci: Add python 3.9 builds for macOS (#47689)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47689

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D25029226

Pulled By: seemethere

fbshipit-source-id: 1db2b021d3adf243453f4405219d5ce03d03a9c1
2020-11-17 14:21:50 -08:00
04545f4b46 [quant] out-variant for the reflection pad (#48037)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48037

Test Plan: Imported from OSS

Reviewed By: ayush29feb

Differential Revision: D25000345

Pulled By: z-a-f

fbshipit-source-id: 8404239a70136dd8ba1ede9695af0cf848b933a2
2020-11-17 14:10:49 -08:00
e1a101676b [quant] ReflectionPad2d (#48036)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48036

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25000347

Pulled By: z-a-f

fbshipit-source-id: f42bf3c6f7069385bc62609cf59d24c15734a058
2020-11-17 14:06:37 -08:00
cb046f7bd2 [static runtime] Initial memonger (#47759)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47759

Parity reached :)

*/0 -> no memonger
*/1 -> memonger on
We can see that the impact is large when activations don't all fit in cache (6x speed up on this micro bench)
```
BM_long_static_memory_optimization/2/0         8563 ns       8559 ns      86370
BM_long_static_memory_optimization/8/0         8326 ns       8322 ns      84099
BM_long_static_memory_optimization/32/0       11446 ns      11440 ns      56107
BM_long_static_memory_optimization/512/0    6116629 ns    6113108 ns        128
BM_long_static_memory_optimization/2/1         8151 ns       8149 ns      87000
BM_long_static_memory_optimization/8/1         7905 ns       7902 ns      85124
BM_long_static_memory_optimization/32/1       10652 ns      10639 ns      66055
BM_long_static_memory_optimization/512/1    1101415 ns    1100673 ns        641
```

TODO:
[x] implementation
[x] enable/disable flag
[x] statistics about memory saved
[x] additional models

Test Plan:
```
buck test //caffe2/test:static_runtime
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```

Reviewed By: yinghai

Differential Revision: D24824445

fbshipit-source-id: db1f5239f72cbd1a9444017e20d5a107c3b3f043
2020-11-17 13:55:49 -08:00
06707a7ef8 Fix flake8 failure (#48124)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48124

Reviewed By: walterddr

Differential Revision: D25032696

Pulled By: malfet

fbshipit-source-id: 2519d18de7417721d53f6404dc291fd8f7cc94fe
2020-11-17 13:48:08 -08:00
b1c5f06f9e Revert D24925955: Fix "pointless comparison" warning
Test Plan: revert-hammer

Differential Revision:
D24925955 (a03f05f2a2)

Original commit changeset: 56bcf32aeb16

fbshipit-source-id: f7bea36e5b23f254381a3cc655cb199a106cc62c
2020-11-17 13:35:37 -08:00
d522cd15a3 fix BC test, after removing __caffe2 ops (#48099)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48099

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25023321

Pulled By: bdhirsh

fbshipit-source-id: c9567b3dcfc2bea3587a17e4b6400fc490349365
2020-11-17 12:51:02 -08:00
b10d6c6089 [caffe2] cache NextName indexes for faster name generation (#47768)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47768

This caches the next ID for a given NextName(prefix, output_id) so that repeated calls to NextName are significantly faster; name generation accounts for ~65% of the time spent for large models.
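
A generic sketch of the caching idea (my own Python illustration, not the C++ implementation): remember the next free index per (prefix, output_id) so each name is generated in O(1) instead of rescanning existing names.

```
from collections import defaultdict

class NameGenerator:
    def __init__(self):
        # (prefix, output_id) -> next free index
        self._next = defaultdict(int)

    def next_name(self, prefix, output_id):
        idx = self._next[(prefix, output_id)]
        self._next[(prefix, output_id)] = idx + 1
        return "%s_%d_%d" % (prefix, output_id, idx)

gen = NameGenerator()
print(gen.next_name("fc", 0))   # fc_0_0
print(gen.next_name("fc", 0))   # fc_0_1, no rescan needed
```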

Test Plan:
buck test //caffe2/caffe2/python/...

will launch canary job before landing to ensure no regressions + confirm speedup

Reviewed By: dzhulgakov

Differential Revision: D24876961

fbshipit-source-id: 668d73060d800513bc72d7cd405a47d15c4acc34
2020-11-17 12:24:00 -08:00
736deefc1f [torch][te] aten::type_as is unary, not binary (#48085)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48085

We were treating it as a binary operator, which implies shape
broadcasting, even though the second arg is thrown away aside from the type.
Treating it as a unary is the proper approach.
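
A quick eager-mode illustration of the point (my own example): `type_as` takes only the dtype from its second argument, so the result keeps the first argument's shape and no broadcasting is involved.

```
import torch

a = torch.rand(3, 1)
b = torch.zeros(4, dtype=torch.float64)
print(a.type_as(b).shape)   # torch.Size([3, 1]): a's shape, b's dtype
print(a.type_as(b).dtype)   # torch.float64
```
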
ghstack-source-id: 116873680

Test Plan: new unit test

Reviewed By: ZolotukhinM

Differential Revision: D25017585

fbshipit-source-id: 0cfa89683c9bfd4fbb132617c74b47b268d7f368
2020-11-17 12:17:19 -08:00
bbee0ecbd1 [pytorch][te] Handle negative axis in chunk (#48084)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48084

as title
ghstack-source-id: 116870328

Test Plan: new unit test

Reviewed By: Krovatkin

Differential Revision: D25017489

fbshipit-source-id: 0d1998fccad6f509db04b6c67a4e4e4093d96751
2020-11-17 12:12:49 -08:00
aabc87cd04 [NNC] Fix HalfChecker when half present but unused (#48068)
Summary:
Fixes an internally reported issue in the tensorexpr fuser when using FP16 on CUDA. The HalfChecker analysis, which determines whether we need to define the Half type, searches the IR for expressions that use Half. If one of the parameters is of type Half but it (or any other Half expr) is not used in the IR, we'd return a false negative. Fix this by adding the parameter list to the HalfChecker.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48068

Reviewed By: ZolotukhinM

Differential Revision: D25009680

Pulled By: nickgg

fbshipit-source-id: 24fddef06821f130db3d3f45d6d041c7f34a6ab0
2020-11-17 12:07:57 -08:00
0d6c900bdb docker: Fix PYTHON_VERSION not propagating (#47877)
Summary:
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47877

Reviewed By: samestep

Differential Revision: D24929116

Pulled By: seemethere

fbshipit-source-id: 442f8eb13318c44735200dfbb2f88e4ca1d9a127
2020-11-17 11:49:30 -08:00
315122ce15 Bump up the CUDA OOM test memory size (#48029)
Summary:
80GB is no longer unusually large: https://nvidianews.nvidia.com/news/nvidia-doubles-down-announces-a100-80gb-gpu-supercharging-worlds-most-powerful-gpu-for-ai-supercomputing

Hopefully, the new size could be OK until the end of Moore's Law :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48029

Reviewed By: linbinyu

Differential Revision: D25003603

Pulled By: zou3519

fbshipit-source-id: 626b9c031daee950df8453be4d7643dd67647213
2020-11-17 11:16:31 -08:00
9443150549 Update Graph docstring to match __init__.py (#48100)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48100

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D25023407

Pulled By: ansley

fbshipit-source-id: e00706059b4c684451d2e1e48ca634b42693c1e1
2020-11-17 10:52:28 -08:00
8aaca4b46a [reland][quant] Remove nn.quantized.ReLU module and nn.quantized.functional.relu (#47415) (#48038)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48038

nn.ReLU works for both float and quantized input, so we don't want to define an nn.quantized.ReLU
that does the same thing as nn.ReLU; similarly for nn.quantized.functional.relu.

This also removes the numerical inconsistency for models that quantize nn.ReLU independently in QAT mode. A small example of the premise follows.
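
A small eager-mode illustration (my own example, not from the PR):

```
import torch

relu = torch.nn.ReLU()
fx = torch.rand(3)
qx = torch.quantize_per_tensor(fx, scale=0.1, zero_point=0, dtype=torch.quint8)
relu(fx)   # float path
relu(qx)   # quantized path: the same module handles both
```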

Test Plan:
Imported from OSS

Reviewed By: vkuzo

Differential Revision: D25000462

fbshipit-source-id: e3609a3ae4a3476a42f61276619033054194a0d2
2020-11-17 09:52:21 -08:00
a03f05f2a2 Fix "pointless comparison" warning (#47876)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47876

Fixes a "pointless comparison against zero" warning that arises for some scalar types.

Test Plan:
Arises with
```
buck test mode/dev-nosan //caffe2/torch/fb/sparsenn:gpu_test -- test_prior_correction_calibration_prediction_binary
```

Reviewed By: ngimel

Differential Revision: D24925955

fbshipit-source-id: 56bcf32aeb164b078d537dd5d7c28a52bd7b66de
2020-11-17 09:05:04 -08:00
49f0e5dfeb Fix typing errors in torch.distributed.*, close issue #42967. (#47534)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47534

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D24952497

Pulled By: xuzhao9

fbshipit-source-id: 063bfd0707198436fcfd9431f72f9a392bc0017e
2020-11-16 23:27:59 -08:00
7f66fa62ca Fix typing errors in torch.distributed.nn.* directory. (#47533)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47533

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D24952500

Pulled By: xuzhao9

fbshipit-source-id: 8e66784fd8f9f111b6329e0bb48d6cd61c690a4a
2020-11-16 23:27:55 -08:00
915050ed66 Fix typing errors in torch.distributed.distributed_c10d.* (#47532)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47532

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D24952501

Pulled By: xuzhao9

fbshipit-source-id: 9b2dd1069eb1729c24be00f46da60d6a0439a8da
2020-11-16 23:27:51 -08:00
49eb82a7b2 Fix type annotation errors in torch.distributed.* directory (#47531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47531

This is part of a stack of PRs that fixes mypy typing errors in the torch.distributed.* directory.

Test Plan:
python test_type_hints.py -v TestTypeHints.test_run_mypy

Imported from OSS

Reviewed By: walterddr

Differential Revision: D24952499

fbshipit-source-id: b193171e28c2211a71d28a544fa44770bf938a1e
2020-11-16 23:23:13 -08:00
af37f8f810 [pytorch][te] Do not merge Tensor[] variant of aten::where into fusion group (#48063)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48063

The TE fuser does not know how to construct a list of Tensors.

Test Plan: new unit test

Reviewed By: eellison

Differential Revision: D25007234

fbshipit-source-id: 1a8ffdf5ffecb39a727357799ed32df8f53150d6
2020-11-16 22:41:10 -08:00
43a9d6fb6e [TorchScript] Support user defined classes as constants (#5062)
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/5062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45556

User-defined classes can be used as constants. This is useful when freezing and removing the module from the graph.

Test Plan: waitforsadcastle

Reviewed By: eellison

Differential Revision: D23994974

fbshipit-source-id: 5b4a5c91158aa7f22df39d71f2658afce1d29317
2020-11-16 20:52:02 -08:00
3611d26a25 [JIT] Optimize FunctionSchema::checkArg for the Tensor case. (#48034)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48034

The Tensor case is one of the most common, and the existing check can be
made faster. This results in a ~21% improvement on the DeepAndWide model and
would improve other models as well.
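
A generic sketch of the fast-path idea (my own Python illustration, not the C++ implementation): handle the overwhelmingly common Tensor case with a cheap direct test before falling back to the general subtype machinery.

```
import torch

def slow_subtype_check(value, expected):
    # stand-in for the full (expensive) schema type check
    return type(value).__name__ == expected

def check_arg(value, expected):
    if expected == "Tensor":                     # fast path: most arguments
        return isinstance(value, torch.Tensor)
    return slow_subtype_check(value, expected)   # generic machinery

print(check_arg(torch.rand(2), "Tensor"))        # True, via the fast path
```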

Before the change:
```
505[ms]
491[ms]
514[ms]
538[ms]
514[ms]
554[ms]
556[ms]
512[ms]
516[ms]
527[ms]
```

After the change:
```
406[ms]
394[ms]
414[ms]
423[ms]
449[ms]
397[ms]
410[ms]
389[ms]
395[ms]
414[ms]
```

Differential Revision: D24999486

Test Plan: Imported from OSS

Reviewed By: zdevito

Pulled By: ZolotukhinM

fbshipit-source-id: 7139a3a38f9c44e8ea793afe2fc662ff51cc0460
2020-11-16 20:50:24 -08:00
7b2c78f120 Revert D24714803: make duplicate def() calls an error in the dispatcher. Updating all fb operators to use the new dispatcher registration API
Test Plan: revert-hammer

Differential Revision:
D24714803 (824f710694)

Original commit changeset: c809aad8a698

fbshipit-source-id: fb2ada65f9fc00d965708d202bd9d050f13ef467
2020-11-16 20:14:26 -08:00
549ef1d668 [caffe][memonger] Extend operator schema check to dag memonger (#48021)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48021

Extending the operator schema check for the simple memonger to the DAG memonger as well. As part of this, a fix is made to handle in-place ops (ops with at least one output name that matches an input blob). Earlier, all output blobs from ops were treated as shareable, but this failed the assertion that external input blobs with the same name are not allowed to be shared.
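
A tiny sketch of the in-place test implied above (my own illustration, not the memonger code): an op is in-place when at least one output blob name equals an input blob name.

```
def is_inplace(op_inputs, op_outputs):
    return bool(set(op_inputs) & set(op_outputs))

print(is_inplace(["x", "w"], ["x"]))  # True: output "x" aliases input "x"
print(is_inplace(["x", "w"], ["y"]))  # False
```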

Test Plan: Added corresponding unit tests

Reviewed By: hlu1

Differential Revision: D24968862

fbshipit-source-id: b6679a388a82b0d68f65ade64b85560354aaa3ef
2020-11-16 19:17:55 -08:00
fa0acb73bd fix node manipulation in partition class (#48016)
Summary:
This PR fixes add_node and remove_node in the Partition class and also adds a unit test for node manipulation in a partition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48016

Reviewed By: gcatron

Differential Revision: D24996368

Pulled By: scottxu0730

fbshipit-source-id: 0ddffd5ed3f95e5285fffcaee8c4b671929b4df3
2020-11-16 15:33:11 -08:00
824f710694 make duplicate def() calls an error in the dispatcher. Updating all fb operators to use the new dispatcher registration API (#47322)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47322

Updating all call-sites of the legacy dispatcher registration API in fbcode to the new API.

I migrated all call sites that used the legacy dispatcher registration API (RegisterOperators()) to use the new API (TORCH_LIBRARY...). I found all call sites by running `fbgs RegisterOperators()`. This touched several places, including other OSS code (nestedtensor, torchtext, torchvision). A few things to call out:

For simple ops that only had one registered kernel without a dispatch key, I replaced them with:
```
TORCH_LIBRARY_FRAGMENT(ns, m) {
   m.def("opName", fn_name);
}
```

For ops that registered to a specific dispatch key / had multiple kernels registered, I registered the common kernel (math/cpu) directly inside a `TORCH_LIBRARY_FRAGMENT` block, and registered any additional kernels from other files (e.g. cuda) in a separate `TORCH_LIBRARY_IMPL` block.

```
// cpu file
TORCH_LIBRARY_FRAGMENT(ns, m) {
  m.def("opName(schema_inputs) -> schema_outputs");
  m.impl("opName", torch::dispatch(c10::DispatchKey::CPU, TORCH_FN(cpu_kernel)));
}

// cuda file
TORCH_LIBRARY_IMPL(ns, CUDA, m) {
  m.impl("opName", torch::dispatch(c10::DispatchKey::CUDA, TORCH_FN(cuda_kernel)));
}
```
Special cases:

I found a few ops that used a (legacy) `CPUTensorId`/`CUDATensorId` dispatch key. Updated those to use CPU/CUDA; this seems safe because the keys are aliased to one another in `DispatchKey.h`.

There were a handful of ops that registered a functor (function class) to the legacy API. As far as I could tell we don't allow this case in the new API, mainly because you can accomplish the same thing more cleanly with lambdas. Rather than delete the class I wrote a wrapper function on top of the class, which I passed to the new API.

There were a handful of ops that were registered only to a CUDA dispatch key. I put them inside a TORCH_LIBRARY_FRAGMENT block and used `def()` and `impl()` calls as in case two above.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24714803

Pulled By: bdhirsh

fbshipit-source-id: c809aad8a698db3fd0d832f117f833e997b159e1
2020-11-16 15:33:08 -08:00
cba26e40cf migrate export_caffe2_op_to_c10.h macros to the new dispatcher registration API (#47321)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47321

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24714805

Pulled By: bdhirsh

fbshipit-source-id: cd695c9c203a7fa4d5217c2466d7f274ce2cd096
2020-11-16 15:33:05 -08:00
93d9837375 rename macro. TORCH_LIBRARY_FRAGMENT_THIS_API_IS_FOR_PER_OP_REGISTRATION_ONLY to TORCH_LIBRARY_FRAGMENT (#47320)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47320

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24714806

Pulled By: bdhirsh

fbshipit-source-id: 7007c9c54b785015577ebafd8e591aa534fe0640
2020-11-16 15:33:02 -08:00
95b9c2061b update legacy dispatcher registration API tests to avoid duplicate def() calls (#47319)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47319

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24714804

Pulled By: bdhirsh

fbshipit-source-id: 4827fbb9a568a44599bb84e45cbe63b02181f21e
2020-11-16 15:32:59 -08:00
6ec2a89e01 remove ops in the __caffe2 namespace (#47318)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47318

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24714807

Pulled By: bdhirsh

fbshipit-source-id: 7f040c12c0b3a0f322498386f849f693a64d1dcf
2020-11-16 15:30:16 -08:00
233192be73 Make sure valid ParameterList/Dict don't warn on creation (#47772)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46983

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47772

Reviewed By: zou3519

Differential Revision: D24991341

Pulled By: albanD

fbshipit-source-id: 0fa21192f529a016048e3eef88c5a8f3cbb3c235
2020-11-16 13:16:59 -08:00
b12d645c2f Test TORCH_LIBRARY in CUDA extension (#47524)
Summary:
In the [official documentation](https://pytorch.org/tutorials/advanced/torch_script_custom_ops.html), it is recommended to use `TORCH_LIBRARY` to register ops for TorchScript. However, that code is never tested with a CUDA extension and is actually broken (https://github.com/pytorch/pytorch/issues/47493). This PR adds a test for it. It will not pass CI now, but it will pass when issue https://github.com/pytorch/pytorch/issues/47493 is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47524

Reviewed By: zou3519

Differential Revision: D24991839

Pulled By: ezyang

fbshipit-source-id: 037196621c7ff9a6e7905efc1097ff97906a0b1c
2020-11-16 13:12:22 -08:00
cf92b0f3a0 add type annotations to multiprocessing module (#47756)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47757

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47756

Reviewed By: malfet

Differential Revision: D24970773

Pulled By: ezyang

fbshipit-source-id: b0b9edb9cc1057829c6320e78174c6d5f7a77477
2020-11-16 13:05:49 -08:00
1e0ace7fdc Fix docstring typo (#47545)
Summary:
It's its.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47545

Reviewed By: ezyang

Differential Revision: D24921308

Pulled By: heitorschueroff

fbshipit-source-id: 3bd53b0303afa3b75cce23d0804096f3d7f67c7e
2020-11-16 13:03:36 -08:00
825ee7e7f8 [caffe2] plan_executor_test: add test case for should_stop loops (#47613)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47613

This is to test some more cancellation edge cases that were missing before. It passes under the current code.

Test Plan: buck test mode/opt caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 10

Reviewed By: dahsh

Differential Revision: D24836956

fbshipit-source-id: 3b00dc081cbf4f26e7756d597099636edb49d256
2020-11-16 12:59:13 -08:00
550f26c6d5 Port math kernel for layer_norm from pytorch/xla. (#47882)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47882

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24958691

Pulled By: ailzhang

fbshipit-source-id: 694e22c20a365730fbacf94efa1bdf7fdd7aec20
2020-11-16 12:49:58 -08:00
95ea778ac6 Set proper output differentiability for unique function (#47930)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47851

Since the definitions of these functions in `native_functions.yaml` have special dispatch, we were already generating the proper `NotImplemented` behavior for these functions, but we were wrongfully setting the gradient of all of the outputs.

Added entries in `derivatives.yaml` to allow us to specify which outputs are differentiable or not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47930

Reviewed By: smessmer

Differential Revision: D24960667

Pulled By: albanD

fbshipit-source-id: 19e5bb3029cf0d020b31e2fa264b3a03dd86ec10
2020-11-16 12:26:10 -08:00
dea2337825 torch.Assert: make it torch.jit.script'able (#47399) (#47973)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47973

Currently torch.Assert is not scriptable, which makes it not very useful for production code. According to jamesr66a, moving this to C++ op land will help with scriptability. This PR implements the change.

Note: with the current code the Assert is scriptable, but it is a no-op after being scripted. Would love suggestions on how to address that (can be in a future PR).

Test Plan:
```
python test/test_utils.py TestAssert.test_assert_scriptable
python test/test_utils.py TestAssert.test_assert_true
python test/test_fx.py TestFX.test_symbolic_trace_assert
```

Reviewed By: supriyar

Differential Revision: D24974299

Pulled By: vkuzo

fbshipit-source-id: 20d4f4d8ac20d76eee122f2cdcdcdcaf1cda3afe
2020-11-16 11:46:12 -08:00
ee995d33bd rename torch.Assert to torch._assert (#47763) (#47972)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47972

Changing the name due to the discussion in
https://github.com/pytorch/pytorch/pull/47399.

Test Plan:
```
python test/test_utils.py TestAssert.test_assert_true
python test/test_fx.py TestFX.test_symbolic_trace_assert
python test/test_fx_experimental.py
```

Reviewed By: supriyar

Differential Revision: D24974298

Pulled By: vkuzo

fbshipit-source-id: 24ded93a7243ec79a0375f4eae8a3db9b787f857
2020-11-16 11:43:27 -08:00
d20483a999 Skip dummy node creation for autograd engine when there is a single input and place on correct queue (#47592)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42890
 - Removes the dummy node
 - Places the graph root on the correct queue based on the input buffer's device, instead of on the CPU queue by default

cpu - no significant change in speed (too noisy to measure), but we see up to a 7% reduction in instruction count for small graphs
cuda - a small reduction in speed (still very noisy) and up to a ~20% reduction in instruction count for small graphs

**CPU**
Code:
```
import torch
from torch.utils.benchmark import Timer

setup="""
a = torch.rand((2, 2), requires_grad=True)
b = torch.rand((2, 2), requires_grad=True)
gradient = torch.ones(2, 2)
"""

stmt="""
torch.autograd.grad(a*b, [a, b], gradient)
"""

timer = Timer(stmt, setup)

print(timer.timeit(10000))
print(timer.collect_callgrind(100))
```

Before (when dummy node is not skipped):
```
torch.autograd.grad(a*b, [a, b], gradient)
setup:
  a = torch.rand((2, 2), requires_grad=True)
  b = torch.rand((2, 2), requires_grad=True)
  gradient = torch.ones(2, 2)

  26.62 us
  1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7efee44ad8e0>
torch.autograd.grad(a*b, [a, b], gradient)
setup:
  a = torch.rand((2, 2), requires_grad=True)
  b = torch.rand((2, 2), requires_grad=True)
  gradient = torch.ones(2, 2)

                           All          Noisy symbols removed
    Instructions:      9755488                    9659378
    Baseline:             4300                       3784
100 runs per measurement, 1 thread
```

After
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f56961a7730>
torch.autograd.grad(a*b, [a, b], gradient)
setup:
  a = torch.rand((2, 2), requires_grad=True)
  b = torch.rand((2, 2), requires_grad=True)
  gradient = torch.ones(2, 2)

  26.78 us
  1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f56961a78e0>
torch.autograd.grad(a*b, [a, b], gradient)
setup:
  a = torch.rand((2, 2), requires_grad=True)
  b = torch.rand((2, 2), requires_grad=True)
  gradient = torch.ones(2, 2)

                           All          Noisy symbols removed
    Instructions:      9045508                    8939872
    Baseline:             4280                       3784
100 runs per measurement, 1 thread
```
**Cuda**

Before
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f84cbaa1ee0>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()

  70.49 us
  1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f84cbaa1e50>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()

                           All          Noisy symbols removed
    Instructions:      5054581                    4951911
    Baseline:             4105                       3735
100 runs per measurement, 1 thread
```

Remove dummy node only
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7fbf29c67eb0>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()

  55.65 us
  1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7fbf29c67e20>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()

                           All          Noisy symbols removed
    Instructions:      5002105                    4900841
    Baseline:             4177                       3731
100 runs per measurement, 1 thread
```

Remove dummy node and put in correct queue
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7fb64438ce80>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()

  27.56 us
  1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7fb64438cdf0>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()

                           All          Noisy symbols removed
    Instructions:      4104433                    4007555
    Baseline:             4159                       3735
100 runs per measurement, 1 thread
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47592

Reviewed By: ailzhang

Differential Revision: D24890761

Pulled By: soulitzer

fbshipit-source-id: f457376e4a882f8a59476e8c1e708391b1a031a2
2020-11-16 11:33:35 -08:00
957e45a97c [NNC] Support vectorization of reductions (#47924)
Summary:
Add support for ReduceOp in the Vectorizer, which allows vectorization of reductions. Only non-reduce axes can be vectorized currently; to make vectorizing reduce axes work, we'd need either automatically pulling out the RHS of reductions (better as a separate transform, I think) or special handling of vector reduce in the LLVM codegen (tricky, maybe not useful?).

There was a disabled LLVM test for this case which I reenabled with a bit of massaging, and added a few more.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47924

Reviewed By: bertmaher

Differential Revision: D24963464

Pulled By: nickgg

fbshipit-source-id: 91d91e9e2696555ab5690b154984b1ce48359d51
2020-11-16 10:43:53 -08:00
9aaf7fb398 [CI] Fix additional CI jobs not launched when PR is created from fork repo (#47969)
Summary:
`CIRCLE_PR_NUMBER` is not always set during CI.
This change extracts the PR number from the branch info in order to launch additional CI jobs.

Should allow the `ci/binaries` and `ci/all` tags to work on forked PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47969

Reviewed By: janeyx99

Differential Revision: D24991790

Pulled By: walterddr

fbshipit-source-id: 3ca30752135d54236a9abf0610eb89946852d45a
2020-11-16 08:38:54 -08:00
3a2aad9314 Fix documentation to point to torch.overrides instead of _overrides. (#47842)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47697

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47842

Reviewed By: smessmer

Differential Revision: D24951750

Pulled By: ezyang

fbshipit-source-id: df62ec2e52f1c561c864a50bac4abf4a55e4f8e6
2020-11-16 08:28:53 -08:00
f9552e6da4 update windows build guide (#47840)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47483

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47840

Reviewed By: malfet

Differential Revision: D24951466

Pulled By: walterddr

fbshipit-source-id: 7530ec5a3aff7095978c330d9b78e58b10349373
2020-11-16 08:15:42 -08:00
147a48fb27 [cmake] clean up cmake/Utils.cmake (#47923)
Summary:
Consolidate into cmake/public/utils.cmake

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47923

Reviewed By: samestep

Differential Revision: D24955961

Pulled By: walterddr

fbshipit-source-id: 9d5f6af2b353a8c6f6d521c841fd0989393755cd
2020-11-16 08:12:32 -08:00
cd4aa9c95c Fix inplace check logic to be triggered when written to Tensor does not require gradients (#46296)
Summary:
Fix https://github.com/pytorch/pytorch/issues/46242

This ensures that `check_inplace()` runs the proper checks even if the Tensor being modified in place does not require gradients, since the Tensor written into it might require gradients and thus make the in-place modification actually differentiable (a minimal example follows the list below).
This contains:
- Codegen changes to tell `check_inplace()` if the inplace will be differentiable
- Changes in `handle_view_on_rebase` to work properly even when called for an input that does not require gradients (which was assumed to be true before)
- Corresponding tests (both warnings and the error raise internal assert errors without this fix)
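
A minimal example of the case being made to work (my own sketch based on the description above):

```
import torch

base = torch.zeros(3)                  # does not require grad
w = torch.ones(3, requires_grad=True)
base.add_(w)                           # in-place write of a grad-requiring tensor
print(base.requires_grad)              # True: the in-place op is differentiable
base.sum().backward()
print(w.grad)                          # tensor([1., 1., 1.])
```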

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46296

Reviewed By: ezyang

Differential Revision: D24903770

Pulled By: albanD

fbshipit-source-id: 74e65dad3d2e3b9f762cbb7b39f92f19d9a0b094
2020-11-16 08:06:06 -08:00
d032d22141 Replacing CUDA11.0 config with CUDA11.1 in CI (#47942)
Summary:
Relands https://github.com/pytorch/pytorch/issues/46616

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47942

Reviewed By: walterddr

Differential Revision: D24963006

Pulled By: janeyx99

fbshipit-source-id: 71a61c56dec88a32a1c5d194db5a2730100f60a1
2020-11-16 07:32:35 -08:00
013e6a3d9d Revert D24698027: Fix auto exponent issue for torch.pow
Test Plan: revert-hammer

Differential Revision:
D24698027 (8ef7ccd669)

Original commit changeset: f23fdb65c925

fbshipit-source-id: 9a67a2c6310c9e4fdefbb421a8cd4fa41595bc9a
2020-11-15 03:58:44 -08:00
8ef7ccd669 Fix auto exponent issue for torch.pow (#47024)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47024

Fixes https://github.com/pytorch/pytorch/issues/46936

Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#47024 Fix auto exponent issue for torch.pow**

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D24698027

Pulled By: anjali411

fbshipit-source-id: f23fdb65c925166243593036e08214c4f041a63d
2020-11-14 22:50:12 -08:00
d293413b3e Batched matmul dtypes (#47873)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47873

Reviewed By: navahgar

Differential Revision: D24928256

Pulled By: anjali411

fbshipit-source-id: a26aef7a15a13fc0b5716e905971265d8b1cea61
2020-11-14 22:45:48 -08:00
db1f217d8d Add complex support for torch.addcmul and torch.addcdiv (#46639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46639

Resolves: https://github.com/pytorch/pytorch/issues/46546#issuecomment-713122245

Test Plan: Imported from OSS

Reviewed By: izdeby, ansley

Differential Revision: D24879099

Pulled By: anjali411

fbshipit-source-id: 76131dc68ac964e67a633f62e07f7c799df4463e
2020-11-14 21:27:34 -08:00
5adf840259 [pytorch][te][easy] Remove KernelScope from fusion pass tests (#47952)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47952

We don't actually generate a TE kernel, so there's no need to use the
arena-allocation guard.

Test Plan:
```
buck test //caffe2/test/cpp/tensorexpr -- FuserPass
```

Reviewed By: ZolotukhinM

Differential Revision: D24967107

fbshipit-source-id: 302f65b2fcff704079e8b51b942b7b3baff95585
2020-11-14 20:25:01 -08:00
0e98fdd389 [ATen/CPU] Parallelize HalfToFloat + FloatToHalf operators in PT (#47777)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47777

Parallelize FP32 <-> FP16 op.
- Use at::Parallelize in ATen instead of parallelizing inside FBGEMM;
- Provide more flexibility (at::Parallelize can be configured with different parallel backends).
ghstack-source-id: 116499687

Test Plan:
```
OMP_NUM_THREADS=10 buck test //caffe2/test:torch -- .test_half_tensor.
```
https://our.intern.facebook.com/intern/testinfra/testrun/7036874441928985

```
OMP_NUM_THREADS=10 buck run mode/opt -c pytorch.parallel_backend=tbb //caffe2/benchmarks/operator_benchmark/pt:tensor_to_test -- --iterations 1 --omp_num_threads 10 --warmup_iterations 0
```

Benchmark results for 512 x 512 Tensor copy:

- With 1 thread:
```
(base) [jianyuhuang@devbig281.ftw3.facebook.com: ~/fbsource/fbcode/caffe2/caffe2/operators] $ buck run mode/opt -c pytorch.parallel_backend=tbb //caffe2/benchmarks/operator_benchmark/pt:tensor_to_test -- --iterations 1 --omp_num_threads 1 --warmup_iterations 10
Parsing buck files: finished in 1.3 sec
Building: finished in 5.7 sec (100%) 6087/6087 jobs, 0 updated
  Total time: 7.0 sec
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark
# Mode: Eager
# Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 99.279

# Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark
# Mode: Eager
# Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 81.707
```

- With 2 threads:
```
(base) [jianyuhuang@devbig281.ftw3.facebook.com: ~/fbsource/fbcode/caffe2/caffe2/operators] $ buck run mode/opt -c pytorch.parallel_backend=tbb //caffe2/benchmarks/operator_benchmark/pt:tensor_to_test -- --iterations 1 --omp_num_threads 2 --warmup_iterations 10
Parsing buck files: finished in 1.3 sec
Building: finished in 4.4 sec (100%) 6087/6087 jobs, 0 updated
  Total time: 5.7 sec
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark
# Mode: Eager
# Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 68.162

# Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark
# Mode: Eager
# Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 49.245
```

Reviewed By: ngimel

Differential Revision: D24676355

fbshipit-source-id: 02bfb893a7b5a60f97c0559d8974c53837755ac2
2020-11-14 18:45:23 -08:00
f8248543a1 Pass in smaller timeout into init_process_group for distributed_test (#47896)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47896

Per title
ghstack-source-id: 116710141

Test Plan: CI

Reviewed By: osalpekar

Differential Revision: D24943323

fbshipit-source-id: 7bf33ce3a021b9750b65e0c08f602c465cd81d28
2020-11-14 13:38:20 -08:00
07e98d28cf [pytorch][codegen] migrate gen_variable_factories.py to the new data model (#47818)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47818

This is another relatively small codegen.

Ideally we should use CppSignature.decl() to generate the C++ function declaration.
We didn't, because it needs to add 'at::' to the types defined in the ATen namespace.

E.g.:
- standard declaration:
```
Tensor eye(int64_t n, int64_t m, const TensorOptions & options={})
```

- expected:
```
at::Tensor eye(int64_t n, int64_t m, const at::TensorOptions & options = {})
```

Kept the hacky fully_qualified_type() method to keep compatibility with the old codegen.

We could clean up by:
- Using these types in the torch namespace, though this is a user-facing header file and it's not clear whether that would cause problems;
- Updating the cpp.argument_type() method to take an optional namespace argument.

Confirmed byte-for-byte compatible with the old codegen:
```
Run it before and after this PR:
  .jenkins/pytorch/codegen-test.sh <baseline_output_dir>
  .jenkins/pytorch/codegen-test.sh <test_output_dir>

Then run diff to compare the generated files:
  diff -Naur <baseline_output_dir> <test_output_dir>
```

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D24909478

Pulled By: ljk53

fbshipit-source-id: a0ceaa60cc765c526908fee39f151cd7ed5ec923
2020-11-14 13:05:23 -08:00
4779553921 Revert "[quant] Remove nn.quantized.ReLU module and nn.quantized.functional.relu (#47415)" (#47949)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47949

This reverts commit 1478e5ec2aa42b2a9742257642c7c1d3203d7309.

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D24966363

Pulled By: vkuzo

fbshipit-source-id: ca1126f699eef84027a15df35962728296c8a790
2020-11-14 08:40:30 -08:00
c936b43f14 [pytorch][codegen] add fully migrated scripts to mypy strict config (#47747)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47747

Moved MANUAL_AUTOGRAD etc. to gen_trace_type.py to prevent mypy from
scanning the not-yet-migrated gen_variable_type.py.

Differential Revision: D24885066

Test Plan: Imported from OSS

Reviewed By: ezyang

Pulled By: ljk53

fbshipit-source-id: bf420e21c26f45fe2b94977bc6df840ffd8a3128
2020-11-14 02:28:00 -08:00
4ff8cd8f3a [pytorch][codegen] gen_python_functions.py loading native_functions.yaml / deprecated.yaml directly (#47746)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47746

- Removed the integration hack in gen_python_functions.py. It now directly
  loads native_functions.yaml. All dependencies on Declarations.yaml
  have been removed / moved elsewhere.
- Rewrote the deprecated.yaml parsing logic to work with new data model directly.

Confirmed byte-for-byte compatible with the old codegen:
```
Run it before and after this PR:
  .jenkins/pytorch/codegen-test.sh <baseline_output_dir>
  .jenkins/pytorch/codegen-test.sh <test_output_dir>

Then run diff to compare the generated files:
  diff -Naur <baseline_output_dir> <test_output_dir>
```

Differential Revision: D24885067

Test Plan: Imported from OSS

Reviewed By: bhosmer

Pulled By: ljk53

fbshipit-source-id: 8e906b7dd36a64395087bd290f6f54596485ceb4
2020-11-14 02:27:57 -08:00
d91cefb0d8 [pytorch][codegen] migrate gen_annotated_fn_args.py to new codegen model (#47745)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47745

This is a relatively small codegen. Reintroduced 'simple_type' to preserve
old codegen output.

It depends on some methods defined in gen_python_functions.py - next PR will
clean up the remaining Declarations.yaml methods in gen_python_functions.py.

Confirmed byte-for-byte compatible with the old codegen:
```
Run it before and after this PR:
  .jenkins/pytorch/codegen-test.sh <baseline_output_dir>
  .jenkins/pytorch/codegen-test.sh <test_output_dir>

Then run diff to compare the generated files:
  diff -Naur <baseline_output_dir> <test_output_dir>
```

Differential Revision: D24885068

Test Plan: Imported from OSS

Reviewed By: ezyang

Pulled By: ljk53

fbshipit-source-id: c0fbd726bcc450c3c7fe232c23e5b31779d0b65f
2020-11-14 02:24:39 -08:00
0dbff184e9 change file name to snake style (#47914)
Summary:
Change Partitioner.py file name to partitioner.py
Change GraphManipulation.py file name to graph_manipulation.py
Move test_replace_target_nodes_with() to test_fx_experimental.py
Remove the unnecessary argument in size_based_partition() in Partitioner class

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47914

Reviewed By: gcatron

Differential Revision: D24956653

Pulled By: scottxu0730

fbshipit-source-id: 25b65be7dc7d64e90ffdc59cf394446fee83c3e6
2020-11-14 01:29:25 -08:00
1606899dbe distributed_test: Map rank to GPU accordingly (#47898)
Summary:
If world_size is less than or equal to the number of GPUs available,
then each rank can be mapped directly to the corresponding GPU.
This fixes the issue referenced in https://github.com/pytorch/pytorch/issues/45435 and https://github.com/pytorch/pytorch/issues/47629

For world_size = 3 and number of GPUs = 8, the rank-to-GPU mapping
will be 0,2,4. This is due to the introduction of barrier
(refer PR https://github.com/pytorch/pytorch/issues/45181):
the tensors in barrier are mapped to cuda0,1,2 while the tensors in the
actual test cases are mapped to cuda0,2,4, resulting in different streams and
leading to timeout. This issue is specific to the default process group.
The issue is not observed in a new process group since the streams are created again
after the initial barrier call.

This patch maps each rank to the corresponding GPU when the world_size is
less than or equal to the number of GPUs, in this case 0,1,2.
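
A minimal sketch of the mapping this patch describes (the function name is hypothetical; the real change lives in the distributed test helpers):
```
def rank_to_gpu(rank: int, world_size: int, num_gpus: int) -> int:
    # When every rank can get its own GPU, map ranks directly
    # (0, 1, 2) so the barrier tensors and the test tensors end up
    # on the same devices and share streams.
    if world_size <= num_gpus:
        return rank
    # Otherwise wrap ranks around the available GPUs.
    return rank % num_gpus
```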

Note: The barrier function in distributed_c10d.py should include a new parameter
to specify the tensor or rank-to-GPU mapping. In that case, this patch will be
redundant but harmless, since the tests can specify the tensors with appropriate
GPU rankings.

Fixes https://github.com/pytorch/pytorch/issues/47629

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47898

Reviewed By: smessmer

Differential Revision: D24956021

Pulled By: rohan-varma

fbshipit-source-id: a88257f22a7991ba36566329766c106d3360bb4e
2020-11-13 23:59:42 -08:00
982ae987d3 Revert D24941350: [pytorch][PR] Reopen PR for 0 dim batch size for AvgPool2d.
Test Plan: revert-hammer

Differential Revision:
D24941350 (ceeab70da1)

Original commit changeset: b7e50346d86e

fbshipit-source-id: 2e42e4418476658dc1afb905184841bf61688cfd
2020-11-13 22:33:37 -08:00
c543b3b582 Fix a downcast (#47919)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47919

Suppresses a downcast warning.

Test Plan:
Reproduces with
```
buck test mode/dev-nosan //caffe2/torch/fb/sparsenn:gpu_test
```

Reviewed By: suphoff

Differential Revision: D24866987

fbshipit-source-id: 44f19ab37a7d95abe08f570abfebc702827a2510
2020-11-13 22:26:29 -08:00
fe7d1d7d0e Add LeakyReLU operator to static runtime (#47798)
Summary:
- Add LeakyReLU operator to static runtime
- Add LeakyReLU benchmark
- Add LeakyReLU correctness test case

Static Runtime
```
------------------------------------------------------------------------------
Benchmark                                       Time           CPU Iterations
------------------------------------------------------------------------------
BM_leaky_relu/1                              4092 ns       4092 ns     172331
BM_leaky_relu/8                              4425 ns       4425 ns     158434
BM_leaky_relu/20                             4830 ns       4830 ns     145335
BM_leaky_relu_const/1                        3545 ns       3545 ns     198054
BM_leaky_relu_const/8                        3825 ns       3825 ns     183074
BM_leaky_relu_const/20                       4222 ns       4222 ns     165999
```

Interpreter
```
------------------------------------------------------------------------------
Benchmark                                       Time           CPU Iterations
------------------------------------------------------------------------------
BM_leaky_relu/1                              7183 ns       7182 ns      96377
BM_leaky_relu/8                              7580 ns       7580 ns      91588
BM_leaky_relu/20                             8066 ns       8066 ns      87183
BM_leaky_relu_const/1                        6466 ns       6466 ns     107925
BM_leaky_relu_const/8                        7063 ns       7063 ns      98768
BM_leaky_relu_const/20                       7380 ns       7380 ns      94564
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47798

Reviewed By: ezyang

Differential Revision: D24927043

Pulled By: kavoor

fbshipit-source-id: 69b12cc57f725f1dc8d68635788813710a74dc2b
2020-11-13 22:05:52 -08:00
17a6bc7c1b Cleanup unused code for Python < 3.6 (#47822)
Summary:
I think these can be safely removed since the min version of supported Python is now 3.6

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47822

Reviewed By: smessmer

Differential Revision: D24954936

Pulled By: ezyang

fbshipit-source-id: 5d4b2aeb78fc97d7ee4abaf5fb2aae21bf765e8b
2020-11-13 21:37:01 -08:00
4f9d0757f3 Add type informations to torch.cuda (#47134)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47133

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47134

Reviewed By: smessmer

Differential Revision: D24955031

Pulled By: ezyang

fbshipit-source-id: 87f4623643715baa6ac0627383f009956f80cd46
2020-11-13 21:34:35 -08:00
2eb1e866e8 Update links in DDP note (#47663)
Summary:
Update the links in https://pytorch.org/docs/stable/notes/ddp.html#.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47663

Reviewed By: smessmer

Differential Revision: D24951684

Pulled By: ezyang

fbshipit-source-id: c1c104d76cf0292a7fc75a627bf76bb56fea72d0
2020-11-13 21:26:28 -08:00
550973b675 Missing curly bracket. (#47855)
Summary:
Typo fix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47855

Reviewed By: smessmer

Differential Revision: D24951767

Pulled By: ezyang

fbshipit-source-id: 8884390370d4d71efd6cee10c3e0b8f55d7e5739
2020-11-13 21:17:24 -08:00
1bdd3687b9 Back out "[JIT] Fix function schema subtype checking"
Summary: Original commit changeset: bd07e7b47d2a

Test Plan: T79664004

Reviewed By: qizzzh

Differential Revision: D24969339

fbshipit-source-id: 8ecc4d52b86c5440c673e42b0e2cb78d94937a6f
2020-11-13 20:33:54 -08:00
11710598db Preserve module parameters in freezing (#47094)
Summary:
Added preserveParameters to the freezing API, which allows preserving module
parameters.

Fixes #39613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47094

Reviewed By: eellison

Differential Revision: D24792867

Pulled By: bzinodev

fbshipit-source-id: f0cd980f5aed617b778afe2f231067c7c30a1527
2020-11-13 20:18:32 -08:00
f8c559db8e [resubmit] Providing more information while crashing process in async error handling (#47246)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47246

We crash the process in NCCL Async Error Handling if the collective
has been running for longer than some set timeout. This PR introduces more
information about the rank and the duration for which the collective ran.
ghstack-source-id: 116676182

Test Plan: Run desync tests and flow.

Reviewed By: pritamdamania87

Differential Revision: D24695126

fbshipit-source-id: 61ae46477065a1a451dc46fb29c3ac0073ca531b
2020-11-13 20:11:06 -08:00
a9b6fa9e46 Fix multinomial when input has 0 prob (#47386)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47386

Fix multinomial when input has 0 prob
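
A minimal check of the expected behavior after the fix (a hedged sketch, not the exact failing input from the report):
```
import torch

# Categories with zero probability must never be sampled,
# even with replacement.
probs = torch.tensor([0.0, 0.3, 0.7])
samples = torch.multinomial(probs, 1000, replacement=True)
assert (samples != 0).all(), "sampled an index with zero probability"
```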

Test Plan: buck test mode/dev-nosan //caffe2/test:torch -- "multinomial"

Reviewed By: ngimel

Differential Revision: D24699691

fbshipit-source-id: d88bb5be8cfed9da2ce6f6a8abd18e834fbde580
2020-11-13 19:07:49 -08:00
f86ec08160 [pytorch][quantization] adding jit state for QuantizedLeakyReLU (#47660)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47660

Currently, `QuantizedLeakyReLU` doesn't have any items in the `state_dict`. However, this operator needs to store the `scale` and `zero_point` in its state dictionary; otherwise, loading the state dict for a quantized model with LeakyReLUs that have non-default quantization params would break.
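
A hedged sketch of the usual shape of such a fix for quantized modules (the class and method body are illustrative, not the exact diff):
```
import torch
import torch.nn.quantized as nnq

class PatchedQuantizedLeakyReLU(nnq.LeakyReLU):
    # Persist scale/zero_point so load_state_dict restores
    # non-default quantization params.
    def _save_to_state_dict(self, destination, prefix, keep_vars):
        super()._save_to_state_dict(destination, prefix, keep_vars)
        destination[prefix + 'scale'] = torch.tensor(self.scale)
        destination[prefix + 'zero_point'] = torch.tensor(self.zero_point)
```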

Test Plan:
Originally the issue was found here: https://www.internalfb.com/intern/anp/view/?id=390362&revision_id=2510709822565735

In the latest version, I fixed this issue: https://www.internalfb.com/intern/anp/view/?id=390362

Reviewed By: jerryzh168

Differential Revision: D24757522

fbshipit-source-id: 57e1dea072b5862e65e228e52a86f2062073aead
2020-11-13 18:59:46 -08:00
4380934b9b [JIT] Dont use specialized tensor type (#46130)
Summary:
Fix for https://github.com/pytorch/pytorch/issues/46122

For `Any`, we infer the type of the ivalue to set the ivalue's type tag. When we saw a Tensor, we would use a specialized Tensor type, so when `Dict[str, Tensor]` was passed in as an `Any` arg it would be inferred as `Dict[str, Float(2, 2, 2, 2)]`, which breaks runtime `isinstance` checking.
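
A hedged repro sketch of the failure mode described in the linked issue:
```
import torch
from typing import Any, Dict

@torch.jit.script
def check(x: Any) -> bool:
    # Before the fix, a Dict[str, Tensor] passed as Any got a
    # specialized tag like Dict[str, Float(2, 2)], so this was False.
    return isinstance(x, Dict[str, torch.Tensor])

print(check({"a": torch.rand(2, 2)}))  # expected: True
```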

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46130

Reviewed By: glaringlee

Differential Revision: D24261447

Pulled By: eellison

fbshipit-source-id: 8a2bb26ce5b6c56c8dcd8db79e420f4b5ed83ed5
2020-11-13 18:34:40 -08:00
5c0dff836a Improve dimensionality mismatch warning (#47874)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47874

Test Plan: N/A

Reviewed By: ngimel

Differential Revision: D24926123

fbshipit-source-id: ace5543ae5122906164e13ae9463fe4dfa74d8d6
2020-11-13 18:26:34 -08:00
ceeab70da1 Reopen PR for 0 dim batch size for AvgPool2d. (#47426)
Summary:
Resubmitting https://github.com/pytorch/pytorch/pull/40694 since it could not be landed for some reason.

CC ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47426

Reviewed By: mruberry

Differential Revision: D24941350

Pulled By: ngimel

fbshipit-source-id: b7e50346d86eb63aaaf4fdd5ee71fafee2d0b476
2020-11-13 17:57:35 -08:00
260daf088d Added linalg.cholesky (#46083)
Summary:
This PR adds a `torch.linalg.cholesky` function that matches `numpy.linalg.cholesky`.
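
A hedged usage sketch of what the new tests exercise (building the Hermitian positive-definite input inline rather than via the helper):
```
import torch

a = torch.randn(3, 3, dtype=torch.complex128)
a = a @ a.conj().transpose(-2, -1) + torch.eye(3)  # Hermitian PD
l = torch.linalg.cholesky(a)
print(torch.allclose(l @ l.conj().transpose(-2, -1), a))  # True
```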

Fixed `lda` argument to `lapackCholesky` calls.
Added `random_hermitian_pd_matrix` helper function for tests.

Ref https://github.com/pytorch/pytorch/issues/42666.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46083

Reviewed By: ailzhang

Differential Revision: D24861752

Pulled By: mruberry

fbshipit-source-id: 214dbceb4e8a2c589df209493efd843962d25593
2020-11-13 16:50:40 -08:00
e8fecd5caf Add constructor for ArgumentDef (#47492)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47493

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47492

Reviewed By: bdhirsh

Differential Revision: D24791564

Pulled By: dzhulgakov

fbshipit-source-id: 43e4bbda754c61f40855675c1d5d0ddc9f351ebe
2020-11-13 16:39:45 -08:00
0685773d8d Automated submodule update: FBGEMM (#47929)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 9b0131179f

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47929

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: smessmer

Differential Revision: D24957361

fbshipit-source-id: 72fe80a784f10ddca52ee99fcf67cf6448a93012
2020-11-13 16:06:49 -08:00
0125e14c9a [OpBench] change relu entry point after D24747035
Summary: D24747035 (1478e5ec2a) removes the entry point of `nnq.functional.relu`. Adjust op benchmark to `torch.nn.ReLU` accordingly.

Test Plan: buck run caffe2/benchmarks/operator_benchmark/pt:qactivation_test -- --use_jit  --iterations 1 --warmup_iterations 1

Reviewed By: mingzhe09088

Differential Revision: D24961625

fbshipit-source-id: 5ed0ec7fa6d8cfefc8e7fc8324cf9a2a3e59de90
2020-11-13 15:38:27 -08:00
6e42b77be1 Add '--allow-run-as-root' to mpiexec to allow running distributed test inside a container (#43794)
Summary:
Inside a container, the user is often root. We should allow this use case so that people can easily run `run_test.py` inside a container.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43794

Reviewed By: ezyang

Differential Revision: D24904469

Pulled By: malfet

fbshipit-source-id: f96cb9dda3e7bd18b29801cde4c5b0616c750016
2020-11-13 15:31:06 -08:00
7b8bd91632 fp16 -> fp32 EmbeddingBag moved into CPU impl (#47076)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47076

Pull Request resolved: https://github.com/pytorch/glow/pull/5038

Eliminate double casting in glow when submitting fp16 per-sample weights

Test Plan:
buck test glow/glow/torch_glow/tests:embedding_bag_test

Due to dependency conflicts between glow and caffe2, the test has been reverted from this diff, and landed separately

Reviewed By: allwu

Differential Revision: D24421367

fbshipit-source-id: eb3615144a2cad3d593543428dfdec165ad301df
2020-11-13 15:17:04 -08:00
6a4d55f23c [ONNX] Enable onnx shape inference in export by default (#46629)
Summary:
* Enable ONNX shape inference by default.
* ONNX could potentially set the inferred shape in the output instead of value_infos; check both to be sure.
* Small fix in symbol_map to avoid overlooking dup symbols.
* Fix scalar_type_analysis to be consistent with PyTorch scalar type promotion logic.
* Correctly handle None dim_param from ONNX inferred shape.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46629

Reviewed By: ailzhang

Differential Revision: D24900171

Pulled By: bzinodev

fbshipit-source-id: 83d37fb9daf83a2c5969d8383e4c8aac986c35fb
2020-11-13 15:09:46 -08:00
c0aa863c56 [quant][graphmode][fx][refactor] insert_quantize_node (#47880)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47880

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24928797

fbshipit-source-id: 9a8b359cabfb800da86da114bf26bb5bd99d3fff
2020-11-13 14:50:42 -08:00
5d51b63984 Use Blocking Wait if both Blocking Wait and Async Error Handling Are Set (#47926)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47926

Given that we're soon enabling async error handling in PET, we should make the behavior explicit when users have set NCCL_BLOCKING_WAIT in their own code while also using PET. This PR essentially gives blocking wait precedence (for now). This way the blast radius of the PET change is smaller, while we continue working with blocking wait users and discussing whether moving to async error handling may be a good fit.
ghstack-source-id: 116553583

Test Plan: Simple FBL run/CI

Reviewed By: jiayisuse

Differential Revision: D24928149

fbshipit-source-id: d42c038ad44607feb3d46dd65925237c564ff7a3
2020-11-13 14:43:00 -08:00
f743b5639a [caffe2][memonger] Add support for distributed inference predict nets in DAG memonger (#47718)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47718

Distributed Inference splits a predict net into multiple parts, part0 being the main part which contains ops to make remote calls to other parts. part0 predict net may contain AsyncIf ops to optimize rpc call usage. AsyncIf ops have internal nets which may refer to memongered blobs. This change handles AsyncIf ops to update internal nets to refer to memongered blobs.

As part of this change, I am also updating the DAG memonger traversal to always start from root ops, i.e. ops with 0 in-degree. The earlier logic would start traversing ops based on input head blobs, and if one of the head inputs is used in a non-root op which gets visited before its parent, the traversal will throw an assertion error here: https://fburl.com/diffusion/ob110s9z . For almost all the distributed inference part0 nets, it was throwing this assertion error.

Test Plan: Added corresponding tests in memonger_test.py .  Could not find unit tests in c++ version of memonger.

Reviewed By: hlu1

Differential Revision: D24872010

fbshipit-source-id: 1dc99b2fb52b2bc692fa4fc0aff6b7e4c5e4f5b0
2020-11-13 14:12:07 -08:00
a3e08e5344 Support ReduceSum in c2_pt_converter (#47889)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47889

Adds support for converting the [caffe2 ReduceSum](https://caffe2.ai/docs/operators-catalogue#reducesum) operator to torch.
ghstack-source-id: 116580127

Test Plan:
buck test //caffe2/torch/fb/model_transform/c2_convert:c2_pt_converter_test : [results](https://our.intern.facebook.com/intern/testinfra/testrun/6755399466095119)

    ✓ ListingSuccess: caffe2/torch/fb/model_transform/c2_convert:c2_pt_converter_test - main (60.273)
    ✓ Pass: caffe2/torch/fb/model_transform/c2_convert:c2_pt_converter_test - test_sub_op (caffe2.torch.fb.model_transform.c2_convert.c2_pt_converter_test.C2PTConverterTest) (101.119)
    ✓ Pass: caffe2/torch/fb/model_transform/c2_convert:c2_pt_converter_test - test_layer_norm_conversion (caffe2.torch.fb.model_transform.c2_convert.c2_pt_converter_test.C2PTConverterTest) (101.404)
    ✓ Pass: caffe2/torch/fb/model_transform/c2_convert:c2_pt_converter_test - test_local_model_conversion (caffe2.torch.fb.model_transform.c2_convert.c2_pt_converter_test.C2PTConverterTest) (101.966)
    ✓ Pass: caffe2/torch/fb/model_transform/c2_convert:c2_pt_converter_test - test_reduce_sum (caffe2.torch.fb.model_transform.c2_convert.c2_pt_converter_test.C2PTConverterTest) (114.896)

Reviewed By: bugra

Differential Revision: D24925318

fbshipit-source-id: 3f3b791eff1b03e8f5adee744560fe8bc811c659
2020-11-13 12:02:58 -08:00
eccbd4df1c Remove fbcode/caffe2/mode (#46454)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46454

We stopped syncing this folder to fbcode, and it has not been used. AIBench will use the ones in xplat.

Test Plan: zbgs fbcode/caffe2/mode/ find nothing

Reviewed By: xta0

Differential Revision: D24356743

fbshipit-source-id: 7e70a2181a49b8ff3f87e5be3b8c808135f4c527
2020-11-13 11:54:47 -08:00
03d1978a1a [JIT] Resolve string literal type annotations using Resolver::resolveType (#47731)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47731

**Summary**
This commit modifies `ScriptTypeParser::parseTypeFromExpr` so that
string literal type annotations are resolved using
`Resolver::resolveType`. At present, they are parsed in
`parseBaseTypeName`, which inadvertently allows any key from
`string_to_type_lut` to be used as a string literal type annotation.

**Test Plan**
Existing unit tests (most notably
`TestClassType.test_self_referential_method` which tests the main
feature, self-referential class type annotations, that make use of
string literal type annotations).

**Fixes**
This commit fixes #47570.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D24934717

Pulled By: SplitInfinity

fbshipit-source-id: b915b2c08272566b63b3cf5ff4a07ad43bdc381a
2020-11-13 11:46:08 -08:00
1915ae9510 [quant][graphmode][fx][refactor] is_output_quantized (#47879)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47879

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24928796

fbshipit-source-id: 55c49243b6a0b4811953cf72af57e5f56be8c419
2020-11-13 11:15:55 -08:00
6b8d20c023 [pytorch][te] Don't start TE fusion groups with an unknown-typed result (#47884)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47884

We need to know output types of everything in a fusion group to ensure
that we generate correctly-typed tensors.  We were incorrectly starting a
fusion group with an unknown-typed output.

Test Plan:
New unit tests:
```
buck test //caffe2/test:jit //caffe2/test/cpp/tensorexpr:tensorexpr
```

Reviewed By: eellison

Differential Revision: D24932786

fbshipit-source-id: 83978a951f32c1207bbc3555a7d3bd94fe4e70fb
2020-11-13 10:52:53 -08:00
d54497fca7 Try again to give hash in doc push scripts (#47922)
Summary:
This is a second attempt at 8304c25c67, since the first attempt did not work as shown by b05f3571fe and c59015f21d. This time the idea is to directly embed the commit hash itself into the generated command that is fed to `docker exec`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47922

Reviewed By: zou3519

Differential Revision: D24953734

Pulled By: samestep

fbshipit-source-id: 35b14d1266ef039e8c1bdf3648275af812a2e57b
2020-11-13 10:17:37 -08:00
f1babb00f0 [caffe2] Fix ListWithEvicted _pprint_impl wrongly printing _evicted_values (#47881)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47881

ListWithEvicted's _pprint_impl was accidentally printing _items instead of _evicted_values before this change.

Reviewed By: dzhulgakov

Differential Revision: D24928521

fbshipit-source-id: 0d7940719b4a27defbaae3b99af104d7fe7b5144
2020-11-13 09:23:10 -08:00
d4db4718fa Revert D24873991: Profiler benchmark fix
Test Plan: revert-hammer

Differential Revision:
D24873991 (a97c7e2ef0)

Original commit changeset: 1c3950d7d289

fbshipit-source-id: 6f3b8a49caf90aaa3e16707005b6b7cf6e61d89f
2020-11-13 08:37:14 -08:00
e5da3b6097 Revert D24891767: rename torch.Assert to torch._assert
Test Plan: revert-hammer

Differential Revision:
D24891767 (a8ca042ec0)

Original commit changeset: 01c7a5acd83b

fbshipit-source-id: cd2271467151b578185758723fcd23f69051d3a3
2020-11-13 08:35:05 -08:00
4cec19b56a Revert D24740727: torch.Assert: make it torch.jit.script'able
Test Plan: revert-hammer

Differential Revision:
D24740727 (b787e748f0)

Original commit changeset: c7888e769c92

fbshipit-source-id: 1e097bd9c0f8b04bea0e0346317a126b42a3dc4f
2020-11-13 08:31:40 -08:00
1c7c612af0 Revert D24543682: [pytorch][PR] Added support for complex input for torch.lu_solve
Test Plan: revert-hammer

Differential Revision:
D24543682 (ffd0003022)

Original commit changeset: 165bde39ef95

fbshipit-source-id: 790b4157fdbc7149aaf0748555efe6daed7e1a23
2020-11-13 08:24:53 -08:00
8855c4e12f [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Differential Revision: D24946660

fbshipit-source-id: e47d04cac21314acb7f9ac3bdfa0d09289e399b4
2020-11-13 06:59:04 -08:00
759a548d6e add dependency check in cost_aware_partition (#47856)
Summary:
In cost_aware_partition, check for circular dependencies in try_combining_partitions. Also fix the calculation of communication time between partitions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47856

Reviewed By: gcatron

Differential Revision: D24926591

Pulled By: scottxu0730

fbshipit-source-id: c634608675ac14b13b2370a727e4fb05e1bb94f0
2020-11-13 02:49:39 -08:00
ffd0003022 Added support for complex input for torch.lu_solve (#46862)
Summary:
`torch.lu_solve` now works for complex inputs both on CPU and GPU.
I moved the existing tests to `test_linalg.py` and modified them to test complex dtypes, but I didn't modify/improve the body of the tests.
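
A hedged usage sketch with complex inputs:
```
import torch

a = torch.randn(3, 3, dtype=torch.complex128)
b = torch.randn(3, 2, dtype=torch.complex128)
lu, pivots = torch.lu(a)             # LU factorization of a
x = torch.lu_solve(b, lu, pivots)    # solves a @ x = b
print(torch.allclose(a @ x, b))      # expected: True
```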

Ref. https://github.com/pytorch/pytorch/issues/33152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46862

Reviewed By: nikithamalgifb

Differential Revision: D24543682

Pulled By: anjali411

fbshipit-source-id: 165bde39ef95cafebf976c5ba4b487297efe8433
2020-11-13 02:35:31 -08:00
2ed3430877 [GPU] Make permuteWeights inline (#47634)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47634

Follow up on d16r's diff - D24710102. Make the function inline in order to get rid of the compiler checking `-Werror,-Wunused-function`.
ghstack-source-id: 116607200

Test Plan:
1. Sandcastle Tests
2. CircleCI jobs

Reviewed By: d16r

Differential Revision: D24824637

fbshipit-source-id: c17e219b384b91ac4620aa23112a6cda1200a605
2020-11-13 02:00:29 -08:00
692726812b [JIT] Fix function schema subtype checking (#47706)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47706

**Summary**
This commit fixes `FunctionSchema::isSubtypeOf` so that the subtyping rule it
implements for `FunctionSchema` instances is contravariant in argument
types and covariant in return type. At present, the rule is covariant in
argument types and contravariant in return type, which is not correct.

A brief but not rigorous explanation follows. Suppose there are two
`FunctionSchema`s, `M = (x: T) -> R` and `N = (x: U) -> S`. For `M <= N`
to be true (i.e. that `M` is a subtype of `N`), it must be true that
`U <= T` and `R <= S`. This generalizes to functions with multiple
arguments.
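
A short typed-Python illustration of the rule (the classes are hypothetical; a type checker such as mypy accepts the final assignment):
```
from typing import Callable

class Animal: ...
class Cat(Animal): ...   # Cat <= Animal

def m(x: Animal) -> Cat:
    return Cat()

# M = (x: Animal) -> Cat is a subtype of N = (x: Cat) -> Animal:
# m accepts every Cat (arguments are contravariant), and every Cat
# it returns is an Animal (returns are covariant).
n: Callable[[Cat], Animal] = m
```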

**Test Plan**
This commit extends `TestModuleInterface.test_module_interface_subtype`
with two new tests cases that test the contravariance of argument types
and covariance of return types in determining whether a `Module`
implements an interface type.

**Fixes**
This commit closes #47631.

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D24934099

Pulled By: SplitInfinity

fbshipit-source-id: bd07e7b47d2a3a56d676f2f572de09fb18ececd8
2020-11-13 00:43:53 -08:00
1aeac97712 [PyTorch] Remove unnecessary shared_ptr copies in ThreadLocalDebugInfo::get (#47791)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47791

`debug_info` is `thread_local` and this function is a leaf, so nobody else could free it out from under us. Regular pointer should be fine.
ghstack-source-id: 116456975

Test Plan: Run framework overhead benchmarks

Reviewed By: bhosmer

Differential Revision: D24901749

fbshipit-source-id: c01a60b609fd08e5200264d8e98d356e2c78cf28
2020-11-13 00:04:37 -08:00
b787e748f0 torch.Assert: make it torch.jit.script'able (#47399)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47399

Currently torch.Assert is not scriptable, which makes it not very useful for production code. According to jamesr66a, moving this to C++ op land will help with scriptability. This PR implements the change.

Note: with the current code, the Assert is scriptable, but it is a no-op after being scripted. Would love suggestions on how to address that (can be in a future PR).

Test Plan:
```
python test/test_utils.py TestAssert.test_assert_scriptable
python test/test_utils.py TestAssert.test_assert_true
python test/test_fx.py TestFX.test_symbolic_trace_assert
```

Imported from OSS

Reviewed By: eellison

Differential Revision: D24740727

fbshipit-source-id: c7888e769c921408a3020ca8332f4dae33f2bc0e
2020-11-13 00:02:19 -08:00
a8ca042ec0 rename torch.Assert to torch._assert (#47763)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47763

Changing the name due to the discussion in
https://github.com/pytorch/pytorch/pull/47399.

Test Plan:
```
python test/test_utils.py TestAssert.test_assert_true
python test/test_fx.py TestFX.test_symbolic_trace_assert
python test/test_fx_experimental.py
```

Imported from OSS

Reviewed By: ezyang

Differential Revision: D24891767

fbshipit-source-id: 01c7a5acd83bf9c962751552780930c242134dd2
2020-11-12 23:59:34 -08:00
16d6af74e6 [PyTorch] Optimize ~intrusive_ptr for the case of zero weak references (#47834)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47834

We can determine if (as is likely) there are no outstanding
weak references without bothering to decrement the
count. `std::shared_ptr` does this same optimization in libc++:
229db36474/libcxx/src/memory.cpp (L69-L107)
ghstack-source-id: 116576326

Test Plan:
Saw time spent in TensorImpl::release_resources drop in
local profiling of empty benchmark
Run framework overhead benchmarks. 9-10% savings on OutOfPlace, small single digit savings on empty, essentially none on InPlace.

Reviewed By: bhosmer

Differential Revision: D24914763

fbshipit-source-id: 19b03f960e32123bc72f7edce63fa1d18c3c143f
2020-11-12 23:50:48 -08:00
ed20e327d7 [quant] skip tests without fbgemm support (#47800)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47800

Fixes #47748

Test Plan:
python test/test_quantization.py

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24904885

fbshipit-source-id: 76d27659e73c7f60b3fcc25606657ee9305117be
2020-11-12 23:35:10 -08:00
9ee4f499f0 [OpBench] add _consume_op.list for processing input with type of List[Tensor] (#47890)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47890

As titled. This is needed to fix issues when running `chunk_test`, `split_test`, `qobserver`, and `sort` in `qunary` in JIT mode, because the output of `chunk_op` is a list of tensors which cannot be handled by the current `_consume_op`.

Test Plan:
OSS:
python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1 --use_jit

Reviewed By: mingzhe09088

Differential Revision: D24774105

fbshipit-source-id: 210a0345b8526ebf3c24f4d0794e20b2ff6cef3d
2020-11-12 23:29:40 -08:00
0652d755d3 Fix some flaky tests in test_torch.py and test_nn.py (#46941)
Summary:
Fixed test:
- `test_is_nonzero`: this was asserting an exact match, which is flaky when `TORCH_SHOW_CPP_STACKTRACES=1`; I changed it to a non-exact assert
- `test_pinverse` TF32
- `test_symeig` TF32
- `test_triangular_solve_batched_many_batches_cpu_float64` precision on CPU BLAS
- `test_qr` TF32, as well as a tensor factory that forgot a `dtype=dtype`
- `test_lu` TF32
- `ConvTranspose2d` TF32
- `Conv3d_1x1x1_no_bias` TF32
- `Transformer*` TF32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46941

Reviewed By: heitorschueroff

Differential Revision: D24852725

Pulled By: mruberry

fbshipit-source-id: ccd4740cc643476178d81059d1c78da34e5082ed
2020-11-12 22:35:42 -08:00
2712acbd53 CUDA BFloat16 Dropout (#45005)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45005

Reviewed By: mruberry

Differential Revision: D24934761

Pulled By: ngimel

fbshipit-source-id: 8f615b97fb93dcd04a46e1d8eeb817ade5082990
2020-11-12 22:28:11 -08:00
1589ede8dd [quant][graphmode][fx] insert_observer_for_input_arg_of_observed_node (#47785)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47785

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24900302

fbshipit-source-id: 61d6287c462898837aed85d5c3a48b6e47b4a41b
2020-11-12 22:19:51 -08:00
dfd946871a Move eq.device to lite interpreter
Reviewed By: iseeyuan

Differential Revision: D24866273

fbshipit-source-id: 113dc50c7f083fa50fd431ffbac224101f8d3c4e
2020-11-12 22:03:57 -08:00
a97c7e2ef0 Profiler benchmark fix (#47713)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47713

Fix the import and also always use internal Timer

Test Plan: python benchmarks/profiler_benchmark/profiler_bench.py

Reviewed By: dzhulgakov

Differential Revision: D24873991

Pulled By: ilia-cher

fbshipit-source-id: 1c3950d7d289a4fb5bd7043ba2d842a35c263eaa
2020-11-12 21:47:30 -08:00
1afdcbfbb3 [quant][graphmode][fx][refactor] insert_observer_for_output_of_the_node (#47784)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47784

Test Plan:
python test/test_quantization.py TestQuantizeFx

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24900301

fbshipit-source-id: abaeae1b5747e517adeb0d50cec5998a8a3fc24d
2020-11-12 21:39:29 -08:00
59e96c55f7 Support MatMul in c2_pt_converter
Summary: Added the MatMul operator for caffe2

Test Plan: buck test //caffe2/torch/fb/model_transform/c2_convert:c2_pt_converter_test

Reviewed By: bugra

Differential Revision: D24920937

fbshipit-source-id: 7ba09ba0439cb9bd15d6a41fd8ff1a86d8d11437
2020-11-12 20:56:58 -08:00
c4ecbcdcb3 [quant][graphmode][fx][refactor] insert_observer_for_special_module (#47783)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47783

Test Plan:
python test/test_quantization.py TestQuantizeFx

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24900304

fbshipit-source-id: 11cc3dd4ea5e272209db9f3c419deadd40db5f42
2020-11-12 20:48:34 -08:00
9fa681c5e0 [ONNX] Add export of prim::dtype, prim::tolist (#46019)
Summary:
Add export of prim::dtype, prim::tolist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46019

Reviewed By: malfet

Differential Revision: D24870870

Pulled By: bzinodev

fbshipit-source-id: 7f59e2c8f5ac2dbf83c889c73bd61f96587a296e
2020-11-12 20:34:40 -08:00
85c43c3da1 [ONNX] Convert _len based on the first dimension length (#47538)
Summary:
This PR is a bug fix.
As the UT shows, for multi-dimensional tensors the current conversion for _len returns the total number of elements, but it should return the length of the first dimension, as pytorch's _len defines.
A `Squeeze` op is needed at the end to ensure the output is a scalar value.
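
The eager-mode behavior the exported graph must reproduce (a hedged illustration):
```
import torch

t = torch.zeros(3, 4, 5)
print(len(t))      # 3: the length of the first dimension
print(t.numel())   # 60: what the buggy conversion effectively returned
```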

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47538

Reviewed By: malfet

Differential Revision: D24870717

Pulled By: bzinodev

fbshipit-source-id: c53c745baa6d2fb7cc1de55a19bd2eedb2ad5272
2020-11-12 20:25:39 -08:00
eab809377d [NNC] Remove all deferred expansion from Reductions (#47709)
Summary:
Refactors the ReduceOp node to remove the last remaining deferred functionality: completing the interaction between the accumulator buffer and the body. This fixes two issues with reductions:
1. Nodes inside the interaction could not be visited or modified, meaning we could generate bad code when the interaction was complex.
2. The accumulator load was created at expansion time and so could not be modified in some ways (ie. vectorization couldn't act on these loads).

This simplifies reduction logic quite a bit, but there's a bit more involved in the rfactor transform.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47709

Reviewed By: ZolotukhinM

Differential Revision: D24904220

Pulled By: nickgg

fbshipit-source-id: 159e5fd967d2d1f8697cfa96ce1bb5fc44920a40
2020-11-12 20:17:52 -08:00
eb8331e759 Revert D24524219: Remove balance and devices parameter from Pipe.
Test Plan: revert-hammer

Differential Revision:
D24524219 (8da7576303)

Original commit changeset: 9973172c2bb7

fbshipit-source-id: b187c80270adb2a412e3882863a2d7de2a52ed56
2020-11-12 19:31:19 -08:00
4f538a2ba4 [pytorch][bot] update mobile op deps (#47825)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47825

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D24913587

Pulled By: ljk53

fbshipit-source-id: b6219573c3238fb453d88019197a00c9f9dbabb8
2020-11-12 19:19:25 -08:00
a376d3dd5d [pytorch] strip out warning message ifdef STRIP_ERROR_MESSAGES (#47827)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47827

Similar to TORCH_CHECK_WITH_MSG, strip messages for TORCH_WARN/TORCH_WARN_ONCE.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24913586

Pulled By: ljk53

fbshipit-source-id: 00f0f2bf33a48d5d7008b70ff5820623586dfd4e
2020-11-12 19:16:42 -08:00
8ff0b6fef8 [OpBenchMobile] Enable operator_benchmark to run the benchmark on mobile through AiBench (#47767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47767

This diff implements the functionality of running benchmarks on mobile on top of the operator_benchmark framework. It does so through a few steps:

1. create a scripted module from existing benchmark case.
2. run mobile specific optimization pass on the scripted module
3. run the scripted module on AiBench by calling its Python API

A small change in the way of writing a benchmark case is introduced so that both local and mobile runs can share the same interface. The change is to have inputs as arguments of the `forward` function, so that the mobile optimization pass can run successfully (otherwise everything would be optimized away by constant propagation).

Test Plan:
## local op_bench run

buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test --  --iterations 1 --warmup_iterations 1

buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test --  --iterations 1 --warmup_iterations 1 --use_jit

Exceptions: `py_module` op in `FakeQuantizePerTensorBaseOpBenchmark` and `FakeQuantizePerChannelBaseOpBenchmark` under JIT mode. These tests also failed in the base version

```
RuntimeError:
Module 'FakeQuantizePerChannelOpBenchmark' has no attribute 'op_func' (This function exists as an attribute on the Python module, but we failed to compile it to a TorchScript function.
The error stack is reproduced here:

Python builtin <built-in method apply of FunctionMeta object at 0x619000c652a0> is currently not supported in Torchscript:
  File "/data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/quantization_test#link-tree/quantization_test.py", line 260
    quant_min: int, quant_max: int
):
    return _LearnableFakeQuantizePerChannelOp.apply(input, scale, zero_point, axis, quant_min, quant_max, 1.0)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
:
  File "/data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/quantization_test#link-tree/quantization_test.py", line 313
        axis: int, quant_min: int, quant_max: int
    ):
        return self.op_func(input, scale, zero_point, axis, quant_min, quant_max)
               ~~~~~~~~~~~~ <--- HERE
```

`_consume_op` typing mismatch: chunk, split, qobserver, sort in qunary. These will be fixed in D24774105

## OSS test

python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1 --use_jit
python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1

## saved module graph
```
module __torch__.mobile_benchmark_utils.OpBenchmarkMobile {
  parameters {
  }
  attributes {
    training = True
    num_iters = 1
    benchmark = <__torch__.pt.add_test.___torch_mangle_4.AddBenchmark object at 0x6070001b8b50>
  }
  methods {
    method forward {
      graph(%self : __torch__.mobile_benchmark_utils.OpBenchmarkMobile):
        %12 : None = prim::Constant() # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:9:4
        %4 : bool = prim::Constant[value=1]() # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:10:8
        %1 : int = prim::GetAttr[name="num_iters"](%self)
         = prim::Loop(%1, %4) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:10:8
          block0(%i : int):
            %6 : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark = prim::GetAttr[name="benchmark"](%self)
            %7 : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark = prim::GetAttr[name="benchmark"](%self)
            %self.inputs_tuple : (Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu), Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu)) = prim::Constant[value=({0.48884}, {0.809042})]()
            %9 : Tensor, %10 : Tensor = prim::TupleUnpack(%self.inputs_tuple)
            %23 : int = prim::Constant[value=1]()
            %24 : Tensor = aten::add(%9, %10, %23) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/pt/add_test.py:39:15
            -> (%4)
        return (%12)

    }
  }
  submodules {
    module __torch__.pt.add_test.___torch_mangle_4.AddBenchmark {
      parameters {
      }
      attributes {
        mobile_optimized = True
      }
      methods {
        method forward {
          graph(%self : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark,
                %input_one.1 : Tensor,
                %input_two.1 : Tensor):
            %3 : int = prim::Constant[value=1]()
            %4 : Tensor = aten::add(%input_one.1, %input_two.1, %3) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/pt/add_test.py:39:15
            return (%4)

        }
        method get_inputs {
          graph(%self : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark):
            %self.inputs_tuple : (Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu), Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu)) = prim::Constant[value=({0.48884}, {0.809042})]()
            return (%self.inputs_tuple)

        }
      }
      submodules {
      }
    }
  }
}

```

Reviewed By: kimishpatel

Differential Revision: D24322214

fbshipit-source-id: 335317eca4f40c4083883eb41dc47caf25cbdfd1
2020-11-12 17:15:05 -08:00
edf751ca2f Make empty c10-full (#46092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46092

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.
ghstack-source-id: 116544203

(Note: this ignores all push blocking failures!)

Test Plan:
vs prev diff (outdated, before c10::optional fix): https://www.internalfb.com/intern/fblearner/details/224735103/

after c10::optional fix:
https://www.internalfb.com/intern/fblearner/details/231391773/

Also, after the c10::optional fix, the instruction counting benchmark shows a 2% regression for calling empty from Python. We decided this is acceptable and opted against landing D24425836, which would fix the regression.

Reviewed By: ezyang

Differential Revision: D24219944

fbshipit-source-id: e554096e90ce438c75b679131c3151ff8e5c5d50
2020-11-12 17:08:21 -08:00
3649a2c170 [numpy] torch.sqrt : promote integer inputs to float (#47293)
Summary:
Reference https://github.com/pytorch/pytorch/issues/42515
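
A hedged sketch of the post-change behavior:
```
import torch

x = torch.tensor([4, 9])            # int64 input
print(torch.sqrt(x))                # tensor([2., 3.])
print(torch.sqrt(x).dtype)          # torch.float32 (default float dtype)
```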

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47293

Reviewed By: malfet

Differential Revision: D24855994

Pulled By: mruberry

fbshipit-source-id: 1e6752f2eeba6d638dea0bdea0c650cf722718c9
2020-11-12 16:16:09 -08:00
7391edb591 [hotfix] fix misleadingly summary BLAS=MKL when there's no BLAS install (#47803)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47803

Reviewed By: samestep

Differential Revision: D24907453

Pulled By: walterddr

fbshipit-source-id: a3e41041f6aa506b054eb0ffc61f8525ba02cbf1
2020-11-12 16:05:14 -08:00
9734c042b8 [FX] Fix submodule naming for subgraph split (#47869)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47869

Test Plan: Imported from OSS

Reviewed By: scottxu0730

Differential Revision: D24925283

Pulled By: jamesr66a

fbshipit-source-id: a33bff20667405a3bbfc81e1e640c2649c0db03b
2020-11-12 15:58:45 -08:00
21f447ee2c Added serialization of parameters for leaf modules (#47729)
Summary:
This adds the serialization of parameters of leaf modules to the JSON serialization.
Specifically, `__constants__` of the leaf module is serialized as "parameters" in the JSON.
It also adds type/shape information for leaf modules.
```
{
            "shape": "[3, 3, 1, 1]",
            "dtype": "torch.float32",
            "parameters": {
                "name": "Conv2d",
                "stride": [
                    1,
                    1
                ],
                "padding": [
                    0,
                    0
                ],
                "dilation": [
                    1,
                    1
                ],
                "groups": 1,
                "padding_mode": "zeros",
                "output_padding": [
                    0,
                    0
                ],
                "in_channels": 3,
                "out_channels": 3,
                "kernel_size": [
                    2,
                    2
                ]
            },
            "target": "conv",
            "op_code": "call_module",
            "name": "conv",
            "args": [
                {
                    "is_node": true,
                    "name": "c"
                }
            ],
            "kwargs": {}
        },
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47729

Reviewed By: ailzhang

Differential Revision: D24901632

Pulled By: gcatron

fbshipit-source-id: 7f2d923937042b60819c58fd180b426a3733ff5f
2020-11-12 14:28:31 -08:00
8da7576303 Remove balance and devices parameter from Pipe. (#46804)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46804

As per our design in https://github.com/pytorch/pytorch/issues/44827,
changing the API such that the user places modules on appropriate devices
instead of having a `balance` and `devices` parameter that decides this.

This design allows us to use RemoteModule in the future.
ghstack-source-id: 116479842

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D24524219

fbshipit-source-id: 9973172c2bb7636572cdc37ce06bf8368638a463
2020-11-12 14:20:23 -08:00
65d5004b09 Update, appease, and enable fail-on for shellcheck (#47786)
Summary:
Currently ([example](https://github.com/pytorch/pytorch/runs/1381883195)), ShellCheck is run on `*.sh` files in `.jenkins/pytorch`, but it uses a three-and-a-half-year-old version, and doesn't fail the lint job despite yielding many warnings. This PR does the following:

- update ShellCheck to v0.7.1 (and generally make it always use the latest `"stable"` release), to get more warnings and also enable the directory-wide directives that were introduced in v0.7.0 (see the next bullet)
- move the rule exclusions list from a variable in `.jenkins/run-shellcheck.sh` to a [declarative file](https://github.com/koalaman/shellcheck/issues/725#issuecomment-469102071) `.jenkins/pytorch/.shellcheckrc`, so now editor integrations such as [vscode-shellcheck](https://github.com/timonwong/vscode-shellcheck) give the same warnings as the CLI script
- fix all ShellCheck warnings in `.jenkins/pytorch`
- remove the suppression of ShellCheck's return value, so now it will fail the lint job if new warnings are introduced

 ---

While working on this, I was confused because I was getting fairly different results from running ShellCheck locally versus what I saw in the CI logs, and also different results among the laptop and devservers I was using. Part of this was due to different versions of ShellCheck, but there were even differences within the same version. For instance, this command should reproduce the results in CI by using (almost) exactly the same environment:
```bash
act -P ubuntu-latest=nektos/act-environments-ubuntu:18.04 -j quick-checks \
| sed '1,/Run Shellcheck Jenkins scripts/d;/Success - Shellcheck Jenkins scripts/,$d' \
| cut -c25-
```
But the various warnings were being displayed in different orders, so it was hard to tell at a glance whether I was getting the same result set or not. However, piping the results into this ShellCheck-output-sorting Python script showed that they were in fact the same:
```python
import fileinput
items = ''.join(fileinput.input()).split('\n\n')
print(''.join(sorted(f'\n{item.strip()}\n\n' for item in items)), end='')
```
Note that while the above little script worked for the old version (v0.4.6) that was previously being used in CI, it is a bit brittle, and will not give great results in more recent ShellCheck versions (since they give more different kinds of output besides just a list of warnings).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47786

Reviewed By: seemethere

Differential Revision: D24900522

Pulled By: samestep

fbshipit-source-id: 92d66e1d5d28a77de5a4274411598cdd28b7d436
2020-11-12 14:00:16 -08:00
8304c25c67 Give hash in commit messages in doc push scripts (#47694)
Summary:
This PR replaces the current auto-generated commit messages like pytorch/pytorch.github.io@fb217ab34a (currently includes no information) and pytorch/cppdocs@7efd67e8f1 (currently includes only a timestamp, which is redundant since it's a Git commit) with more descriptive ones that specify the pytorch/pytorch commit they originated from. This information would be useful for debugging issues such as https://github.com/pytorch/pytorch/issues/47462.

GitHub will also [autolink](https://docs.github.com/en/free-pro-team@latest/github/writing-on-github/autolinked-references-and-urls#commit-shas) these new messages (similar to ezyang/pytorch-ci-hud@bc25ae770d), and so they will now also mostly follow Git commit message conventions by starting with a capital letter, using the imperative voice, and (at least in the autolink-rendered form on GitHub, although not in the raw text) staying under 50 characters.

**Question for reviewers:** Will my `export CIRCLE_SHA1="$CIRCLE_SHA1"` work here? Is it necessary?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47694

Reviewed By: walterddr

Differential Revision: D24868240

Pulled By: samestep

fbshipit-source-id: 4907341e7b57ed6818ab550dc1ec423f2c2450c1
2020-11-12 13:36:01 -08:00
b1a4170ab3 [NNC] Fix lowering of aten::pow (#47795)
Summary:
NNC lowering of aten::pow assumes that the type of the exponent is either float or int cast to float, which doesn't work well with double (or half, for that matter).

Fixes https://github.com/pytorch/pytorch/issues/47304

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47795

Reviewed By: ZolotukhinM

Differential Revision: D24904201

Pulled By: nickgg

fbshipit-source-id: 43c3ea704399ebb36c33cd222db16c60e5b7ada5
2020-11-12 12:33:07 -08:00
149190c014 Added CUDA support for complex input for torch.solve (#47045)
Summary:
`torch.solve` now works for complex inputs on GPU.
I moved the existing tests to `test_linalg.py` and modified them to test complex and float32 dtypes.
Differentiation also works correctly with complex inputs.
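
A hedged usage sketch (requires a CUDA device; `torch.solve` was the pre-`torch.linalg` API):
```
import torch

a = torch.randn(3, 3, dtype=torch.complex64, device='cuda')
b = torch.randn(3, 1, dtype=torch.complex64, device='cuda')
x, lu = torch.solve(b, a)                    # solves a @ x = b
print(torch.allclose(a @ x, b, atol=1e-4))   # expected: True
```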

Fixes https://github.com/pytorch/pytorch/issues/41084
Ref. https://github.com/pytorch/pytorch/issues/33152

anjali411 I hope you don't mind that I took over https://github.com/pytorch/pytorch/pull/42737

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47045

Reviewed By: nikithamalgifb

Differential Revision: D24921503

Pulled By: anjali411

fbshipit-source-id: 4c3fc4f193a84b6e28c43c08672d480715000923
2020-11-12 12:22:59 -08:00
275a89a7ee [Docs] Store Docs fixes about HashStore API (#47643)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47643

Updating the docs to indicate the `num_keys` and `delete_key` APIs are now supported by the HashStore (not just TCPStore).
ghstack-source-id: 116459958

Test Plan: CI

Reviewed By: jiayisuse, mrshenli

Differential Revision: D24633570

fbshipit-source-id: 549479dd99f9ec6decbfffcb74b9792403d05ba2
2020-11-12 12:14:52 -08:00
6aaf04616b [Metal] Remove undefined tests
Summary: As title

Test Plan:
- Circle CI
- Sandcastle

Reviewed By: husthyc

Differential Revision: D24915370

fbshipit-source-id: fe05ac37a25c804695a13fb5a7eabbc60442a102
2020-11-12 11:54:43 -08:00
f51be328ae [FX] Fix __tensor_constants not scriptable (#47817)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47817

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D24908959

Pulled By: jamesr66a

fbshipit-source-id: c0cadae2091e917b72684262b8655f8813ac9d91
2020-11-12 11:39:07 -08:00
76ff557de7 [NNC] add hazard analysis to Bounds Inference (#47684)
Summary:
Adds a helper function to the Bounds Inference / Memory Analysis infrastructure which returns the kind of hazard found between two Stmts (e.g. Blocks or Loops). E.g.
```
for (int i = 0; i < 10; ++i) {
  A[x] = i * 2;
}
for (int j = 0; j < 10; ++j) {
 B[x] = A[x] / 2;
}
```
The two loops have a `ReadAfterWrite` hazard, while in this example:
```
for (int i = 0; i < 10; ++i) {
  A[x] = i * 2;
}
for (int j = 0; j < 10; ++j) {
 A[x] = B[x] / 2;
}
```
The loops have a `WriteAfterWrite` hazard.

This isn't 100% of what we need for loop fusion; for example, we don't check the strides of the loops to see if they match.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47684

Reviewed By: malfet

Differential Revision: D24873587

Pulled By: nickgg

fbshipit-source-id: 991149e5942e769612298ada855687469a219d62
2020-11-12 11:34:31 -08:00
664d2f48cf [NNC] Enable unary op cpu testing (#47374)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47374

A few small fixes needed to enable unary op CPU testing. If reviewers would prefer I split them up, let me know.

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24805248

Pulled By: eellison

fbshipit-source-id: c2cfe2e3319a633e64da3366e68f5bf21d390cb7
2020-11-12 11:14:03 -08:00
dcca712d3c [NNC] refactor cuda half support to more general file (#47373)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47373

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24805246

Pulled By: eellison

fbshipit-source-id: 33b5c84c9212d51bac3968e02aae2434dde40cd8
2020-11-12 11:14:00 -08:00
346a71d29c [NNC] More cpu tests (#47372)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47372

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24805254

Pulled By: eellison

fbshipit-source-id: b7e5ee044ef816e024b6fc5c4041fff5f2049bb3
2020-11-12 11:13:57 -08:00
450738441b [NNC] Add more CPU Tests (#47371)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47371

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24805252

Pulled By: eellison

fbshipit-source-id: 16472960d09f6c981adca2a45b2a4efb75a09d4f
2020-11-12 11:13:54 -08:00
e618bd858e [NNC] Fix llvm min lowering for int inputs (#47370)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47370

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24805249

Pulled By: eellison

fbshipit-source-id: e13d956899e8651600fab94dab04aa39ca427769
2020-11-12 11:13:50 -08:00
fe81faee5f Add more CPU tests (#47369)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47369

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24805251

Pulled By: eellison

fbshipit-source-id: f1a8210ffdc3cc88354cb4896652151d83a0345a
2020-11-12 11:13:47 -08:00
b8a1070ec0 [TensorExpr][CPU] Fix bool -> int casting (#46951)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46951

If e.g. we're casting from torch.int -> torch.bool, previously we would just truncate from int32 -> i8. Since torch.bool has 8 bits but only uses one of them, we need to make sure that one bit is set.
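
A hedged repro sketch: a value whose low 8 bits are all zero must still cast to True:
```
import torch

# Naive truncation to i8 would map 256 -> 0 (False); a correct cast
# treats any nonzero value as True.
x = torch.tensor([0, 1, 256], dtype=torch.int32)
print(x.to(torch.bool))  # tensor([False,  True,  True])
```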

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24805253

Pulled By: eellison

fbshipit-source-id: af3aa323f10820d189827eb51037adfa7d80fed9
2020-11-12 11:13:44 -08:00
ad5be26b2f Small changes/cleanup (#46950)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46950

Make sure that we're fusing in a fuse tests, and refactor to more concise API to check if fusions have happened.

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24805250

Pulled By: eellison

fbshipit-source-id: f898008a64b74e761bb5fe85f91b3cdf2dbdf878
2020-11-12 11:13:38 -08:00
f221a19a7f Force LLVM Compilation for CPU Tests (#46949)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46949

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24805247

Pulled By: eellison

fbshipit-source-id: 4fcaf02d8a78cc5cbcbde36940d0a2c85fba3fc5
2020-11-12 11:12:08 -08:00
f42cdc2e43 [NNC] Fix printing of integral doubles (#47799)
Summary:
When printing doubles, we don't do anything to distinguish integral doubles (i.e., 1 or 2) from ints. Added decoration of these doubles with `.0` if they are integral (i.e., DoubleImm(1) will print as `1.0`).

This is an issue specifically on CUDA, where some intrinsics do not have type coercion. Added a test which covers this case (without the fix, it tries to look up `pow(double, int)`, which doesn't exist).

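A minimal Python sketch of the decoration rule (the actual change lives in the NNC IR printer, which is C++):

```python
import math

def print_double(v: float) -> str:
    # decorate integral doubles with ".0" so they don't print like ints
    if math.isfinite(v) and v == int(v):
        return f"{int(v)}.0"  # DoubleImm(1) prints as "1.0", not "1"
    return repr(v)

assert print_double(1.0) == "1.0"
assert print_double(1.5) == "1.5"
```
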
Fixes https://github.com/pytorch/pytorch/issues/47304

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47799

Reviewed By: ZolotukhinM

Differential Revision: D24904185

Pulled By: nickgg

fbshipit-source-id: baa38726966c94ee50473cc046b9ded5c4e748f7
2020-11-12 11:02:34 -08:00
1478e5ec2a [quant] Remove nn.quantized.ReLU module and nn.quantized.functional.relu (#47415)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47415

nn.ReLU works for both float and quantized input, so we don't want to define an nn.quantized.ReLU
that does the same thing as nn.ReLU; similarly for nn.quantized.functional.relu.

This also removes the numerical inconsistency for models that quantize nn.ReLU independently in QAT mode.

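For illustration, the same nn.ReLU module already handles both paths:

```python
import torch

relu = torch.nn.ReLU()
x = torch.randn(4)
print(relu(x))  # float path
q = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)
print(relu(q))  # quantized path, same module
```
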
Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D24747035

fbshipit-source-id: b8fdf13e513a0d5f0c4c6c9835635bdf9fdc2769
2020-11-12 10:56:30 -08:00
66f9b1de1b [NCCL] enable p2p tests (#47797)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47797

NCCL p2p tests had hang issues before; the reason is that there were some unexpected context switches. For example, process 1, which is supposed to only use GPU1, could end up using GPU0 as a result of not explicitly setting the device.
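
A minimal sketch of the fix pattern, assuming one process per GPU on a single node with the usual env:// variables set:

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
# pin this process to its own GPU before any p2p/collective call;
# without this, work can silently land on GPU 0
torch.cuda.set_device(dist.get_rank())
```
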
ghstack-source-id: 116461969

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D24863808

fbshipit-source-id: 92bd3a4874be8334210c7c8ee6363648893c963e
2020-11-12 10:44:50 -08:00
9ea7a6c7c5 [ONNX] Update ONNX doc for writing pytorch model (#46961)
Summary:
To trace successfully, we need to write the PyTorch model in a trace-friendly way, so we add instructions with examples here.

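A minimal sketch of the kind of guidance added (the class names here are illustrative):

```python
import torch

class NotTraceable(torch.nn.Module):
    def forward(self, x):
        if x.sum() > 0:  # Python branch is frozen to one side at trace time
            return x * 2
        return x

class Traceable(torch.nn.Module):
    def forward(self, x):
        return torch.where(x.sum() > 0, x * 2, x)  # branch stays in the graph

traced = torch.jit.trace(Traceable(), torch.randn(3))
```
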
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46961

Reviewed By: ailzhang

Differential Revision: D24900040

Pulled By: bzinodev

fbshipit-source-id: b375b533396b11dbc9656fa61e84a3f92f352e4b
2020-11-12 10:16:45 -08:00
d7c8d3cccb Remove references to typing module from setup.py (#47677)
Summary:
It is part of core Python 3.6.2+.

Fixes https://github.com/pytorch/pytorch/issues/47596

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47677

Reviewed By: walterddr

Differential Revision: D24860188

Pulled By: malfet

fbshipit-source-id: ad72b433a4493ebe5caca97c2e8a9d4b3c8172d4
2020-11-12 10:04:38 -08:00
809660ffa4 ATen DerivedType is dead, long live ATen RegisterDispatchKey (#47011)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47011

smessmer has complained about how it is difficult to find generated
code.  Well, hopefully this diff helps a bit with that.

There are three components to this refactor:

- Rename TypeDerived (CPUType) to RegisterDispatchKey (RegisterCPU).
  The 'Type' nomenclature is vestigial and I think Register says
  what these files do a lot more clearly.  I also got rid of
  the CPUType namespace; everything just goes in anonymous
  namespace now, less moving parts this way.
- Give Math and DefaultBackend their own files (RegisterMath and
  RegisterDefaultBackend)
- Restructure code generation so that schema definition is done
  completely separately from RegisterDispatchKey

I decided to name the files RegisterCPU rather than the old convention
BackendSelectRegister, because it seems better to me if these
files clump together in an alphabetical listing rather than being
spread out everywhere.  There are a few manual registration files
which should probably get similar renaming.

I also did a little garden cleaning about how we identify if a
dispatch key is a cuda key or a generic key (previously called
KEYWORD_ALL_BACKENDS but I like my naming better).

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D24600806

Test Plan: Imported from OSS

Reviewed By: smessmer

Pulled By: ezyang

fbshipit-source-id: c1b510dd7515bd95e3ad25b8edf961b2fb30a25a
2020-11-12 09:53:48 -08:00
00a3add425 [TorchBind] Support using lambda function as TorchBind constructor (#47819)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47819

Reviewed By: wanchaol

Differential Revision: D24910065

Pulled By: gmagogsfm

fbshipit-source-id: ad5b4f67b0367e44fe486d31a060d9ad1e0cf568
2020-11-12 09:29:34 -08:00
b6cb2caa68 Revert "Fixed einsum compatibility/performance issues (#46398)" (#47821)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47821

This reverts commit a5c65b86ce249f5f2d365169e6315593fbd47b61.

 Conflicts:
	test/test_linalg.py

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24909923

Pulled By: gchanan

fbshipit-source-id: 9dcf98e7c4a3c7e5aaffe475867fa086f3bb6ff2
2020-11-12 08:11:40 -08:00
cfe3defd88 [vulkan] Enable prepacked addmm/mm for linear layers (#47815)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47815

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D24908605

Pulled By: SS-JIA

fbshipit-source-id: e658bc2dbf23d5d911b979d3b8f467508f2fdf0c
2020-11-12 08:04:01 -08:00
e1ee3bfc0e Port bmm and baddbmm from TH to ATen (#42553)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42553

Ports `torch.bmm` and `torch.baddbmm` from TH to ATen, as well as adds support for complex dtypes. Also removes dead TH code for Level 2 functions.

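A short usage sketch of the newly supported complex dtypes:

```python
import torch

a = torch.randn(2, 3, 4, dtype=torch.complex64)
b = torch.randn(2, 4, 5, dtype=torch.complex64)
out = torch.bmm(a, b)          # batched matmul, shape (2, 3, 5)
c = torch.randn(2, 3, 5, dtype=torch.complex64)
out2 = torch.baddbmm(c, a, b)  # c + a @ b, batch-wise
```
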
Closes #24539

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24893511

Pulled By: anjali411

fbshipit-source-id: 0eba3f2aec99c48b3018a5264ee7789279cfab58
2020-11-12 07:57:42 -08:00
553ccccc54 [c10d] switch ProcessGroup to be managed by intrusive_ptr (#47343)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47343

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24723418

Pulled By: wanchaol

fbshipit-source-id: 0463819b96c53b12bdbb3905431110d7b21beb77
2020-11-12 07:36:23 -08:00
859e054314 skip test_all_reduce_sum_cuda_async test case for ROCM (#47630)
Summary:
Skip the following test case for rocm (When PYTORCH_TEST_WITH_ROCM=1):
- test_all_reduce_sum_cuda_async (__main__.TestDistBackendWithFork)

jeffdaily
pruthvistony

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47630

Reviewed By: seemethere, heitorschueroff

Differential Revision: D24849755

Pulled By: walterddr

fbshipit-source-id: b952c81677df2dfd35d459b94ce0f7a5b12c0d5c
2020-11-12 07:19:32 -08:00
2df5600155 [ROCm] add skipCUDAIfRocm to test_lingalg test_norm_fro_2_equivalence_old (#47809)
Summary:
This test started failing when ROCm CI moved to 3.9.  Skip until triage is complete.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47809

Reviewed By: seemethere

Differential Revision: D24906319

Pulled By: walterddr

fbshipit-source-id: 0c425f3b21190cfbc5e0d1c3f477d834af40f0ca
2020-11-12 07:12:43 -08:00
2907447c97 Spurious numpy writable warning (#47271)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47160

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47271

Reviewed By: ailzhang

Differential Revision: D24855889

Pulled By: mruberry

fbshipit-source-id: beaf232b115872f20fb0292e995a876cdc429868
2020-11-12 00:14:56 -08:00
4b25d83e9b torch.dropout: fix non-contiguous layout input (#47552)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47176

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47552

Reviewed By: ailzhang

Differential Revision: D24903435

Pulled By: ngimel

fbshipit-source-id: ef5398931dddf452f5f734b4aa40c11f4ee61664
2020-11-11 22:56:31 -08:00
a02baa0c7a [reland][c10d] switch ProcessGroupNCCL:Options to be managed by intrusive_ptr (#47807)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47807

reland https://github.com/pytorch/pytorch/pull/47075

Test Plan: wait for ci

Reviewed By: gmagogsfm

Differential Revision: D24905247

fbshipit-source-id: abd9731d86b3bd48d60bbc90d534823e0c037b93
2020-11-11 22:53:22 -08:00
665ac2f7b0 [reland] [c10d] switch Store to be managed by intrusive_ptr (#47808)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47808

reland https://github.com/pytorch/pytorch/pull/47074

Test Plan: wait for ci

Reviewed By: gmagogsfm

Differential Revision: D24905246

fbshipit-source-id: edeb7e6e486570ce889f12512e9dc02061d6cc03
2020-11-11 22:53:20 -08:00
70ae5685f9 [reland][c10d] switch ProcessGroup::Work to be managed by intrusive_ptr (#47806)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47806

reland https://github.com/pytorch/pytorch/pull/44046

Test Plan: wait for ci

Reviewed By: gmagogsfm

Differential Revision: D24905245

fbshipit-source-id: ad75ace5432fcfd22d513878f5a73c4bb017324e
2020-11-11 22:51:03 -08:00
89b371bc28 [quant] Add support for 2D indices for quantized embedding operators (#47766)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47766

The operator now supports accepting 2D indices as inputs.
For embedding operators, we set the default offsets in the op, since the FBGEMM kernel expects them to be set.
The output shape depends on the shape of the indices.

For the embedding_bag operator, if indices is 2D (B, N) then offsets should be set to None by the user. In this case
the input is interpreted as B bags, each of fixed length N. The output shape is still 2-D in this case.

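A small sketch of the 2-D convention:

```python
import torch

# (B, N) indices are treated as B bags of fixed length N, i.e. the same as
# flattened 1-D indices plus evenly spaced offsets
indices = torch.tensor([[1, 2, 3], [4, 5, 6]])               # B=2, N=3
flat = indices.reshape(-1)                                    # tensor([1, 2, 3, 4, 5, 6])
offsets = torch.arange(0, indices.numel(), indices.size(1))   # tensor([0, 3])
```
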
Test Plan:
python test/test_quantization.py TestQuantizedEmbeddingOps.test_embedding_bag_2d_indices
python test/test_quantization.py TestQuantizedEmbeddingOps.test_embedding_2d_indices

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D24895048

fbshipit-source-id: 2020910e1d85ed8673eedee2e504611ba260d801
2020-11-11 22:44:07 -08:00
47386722da [quant][graphmode][fx][refactor] insert_observer (#47782)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47782

Test Plan:
python test/test_quantization.py TestQuantizeFx

Imported from OSS

Reviewed By: supriyar

Differential Revision: D24900305

fbshipit-source-id: b00a90ab85badea7d18ae007cc68d0bcd58ab15c
2020-11-11 21:31:24 -08:00
dd77d5a1d4 [quant][refactor] factor out get_combined_dict function (#47781)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47781

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D24900303

fbshipit-source-id: 1a2cb0ec536384abcd140e0d073f0965ed2800cd
2020-11-11 21:01:31 -08:00
b46787d6d7 add cost_aware_partition (#47673)
Summary:
[WIP] This PR adds a cost_aware_partition method to the Partitioner class. The method partitions the FX graph module based on the latency of the whole graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47673

Reviewed By: gcatron

Differential Revision: D24896685

Pulled By: scottxu0730

fbshipit-source-id: 1b1651fe82ce56554f99d68da116e585c74099ed
2020-11-11 19:31:37 -08:00
c5834b6a23 Look in named-buffers of module for tensors (#47641)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47641

ghstack-source-id: 116450114

Test Plan: Presubmit tests

Reviewed By: jamesr66a

Differential Revision: D24848318

fbshipit-source-id: f6ede3def9d6f1357c4fd3406f97721dea06b9f1
2020-11-11 19:08:16 -08:00
c9f6e70c09 Refactor DDP uneven inputs control flags (#47394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47394

This is a preliminary refactor for the next diff that will add an
additional flag to control whether we throw a StopIteration or not. We
basically move the flags for ddp uneven inputs to a simple class.
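
A hypothetical sketch of the shape of that class (these names are illustrative, not the real ones):

```python
from dataclasses import dataclass

@dataclass
class _JoinConfig:
    enable: bool = True
    # the follow-up diff adds a flag like this one
    throw_on_early_termination: bool = False
```
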
ghstack-source-id: 116428177

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D24739509

fbshipit-source-id: 96bf41bd1c02dd27e68f6f37d08e22f33129b319
2020-11-11 16:51:56 -08:00
e8a73fbf34 Workaround PyTorch debug build crash using old GCC (#47805)
Summary:
gcc-7.4.x or older fails to compile XNNPACK in debug mode with an internal compiler error.
Work around this in the build script by passing the -O1 optimization flag to XNNPACK when compiling on older compilers.

Fixes https://github.com/pytorch/pytorch/issues/47292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47805

Reviewed By: seemethere

Differential Revision: D24905758

Pulled By: malfet

fbshipit-source-id: 93f4e3b3b5c10b69734627c50e36b2eb544699c8
2020-11-11 16:33:47 -08:00
52ec8b9340 Added CUDA support for complex input for torch.triangular_solve (#46916)
Summary:
`torch.triangular_solve` now works for complex inputs on GPU.
I moved the existing tests to `test_linalg.py` and modified them to test complex and float32 dtypes.

Ref. https://github.com/pytorch/pytorch/issues/33152

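A short usage sketch, assuming a CUDA device is available:

```python
import torch

A = torch.randn(3, 3, dtype=torch.complex64, device='cuda').triu()
b = torch.randn(3, 2, dtype=torch.complex64, device='cuda')
x, _ = torch.triangular_solve(b, A, upper=True)  # now works for complex on GPU
print(torch.allclose(A @ x, b, atol=1e-4))
```
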
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46916

Reviewed By: navahgar, agolynski

Differential Revision: D24706647

Pulled By: anjali411

fbshipit-source-id: fe780eac93d2ae1b2549539bb385e5fac25213b3
2020-11-11 16:08:11 -08:00
a0c4aae3d5 Free original weight after prepacking in XNNPACK based op (#46541)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46541

When weights are prepacked, XNNPACK packs them into separate memory. After that, the original weights are not needed for inference. Having those weights lying around increases the memory footprint, so we would like to remove the original weights once prepacking is done.

Test Plan: buck test //caffe2/aten:mobile_memory_cleanup

Reviewed By: kimishpatel

Differential Revision: D24280928

fbshipit-source-id: 90ffc53b1eabdc545a3ccffcd17fa3137d500cbb
2020-11-11 15:58:35 -08:00
545f624a4a Mark overriden Tensor method override (#47198)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47198

Fixes:

```
xplat/caffe2/aten/src/ATen/native/xnnpack/OpContext.h:77:10: error: 'run' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
  Tensor run(const Tensor& input);
```

Test Plan: CI tests

Reviewed By: kimishpatel

Differential Revision: D24678573

fbshipit-source-id: 244769cc36d3c1126973a67441aa2d06d2b83b9c
2020-11-11 15:55:52 -08:00
d4fa84bf5f Properly serialize types that only appear at function input (#47775)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47775

When serializing graphs, we check every node for named types referenced,
so that we can register them as dependencies. We were skipping this
check for the graph inputs themselves. Since types used at input are
almost always used somewhere in the graph, we never noticed this gap
until a user reported an issue with NamedTuples.

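A minimal sketch of the failure mode, with a hypothetical NamedTuple used only in the signature:

```python
import torch
from typing import NamedTuple

class Point(NamedTuple):
    x: torch.Tensor
    y: torch.Tensor

@torch.jit.script
def f(p: Point) -> torch.Tensor:
    return p.x + p.y  # Point appears only at the input

torch.jit.save(f, "f.pt")  # serialization must register Point as a dependency
```
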
Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D24896289

Pulled By: suo

fbshipit-source-id: 4ce76816cb7997a7b65e7cea152ea52ed8f27276
2020-11-11 15:27:00 -08:00
32b4b51254 [Docs] Minor doc fixes for init_process_group (#47644)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47644

Minor Update to the init_process_group docs.
ghstack-source-id: 116441798

Test Plan: CI

Reviewed By: jiayisuse, mrshenli

Differential Revision: D24633432

fbshipit-source-id: fbd38dab464ee156d119f9f0b22ffd0e416c4fd7
2020-11-11 15:21:30 -08:00
0c54ea50bd [PyTorch] Avoid atomic refcounting in intrusive_ptr::make (#47100)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47100

Profiling with Linux `perf` shows that we spend at least 1% of our time doing this increment in our framework overhead benchmark. Here's the inline function breakdown for empty_cpu, which takes 6.91% of the total time:

```
   - at::native::empty_cpu
      - 1.91% at::detail::make_tensor<c10::TensorImpl, c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >, c10::DispatchKey, caffe2::TypeMeta&> (inlined)
         - 0.98% c10::make_intrusive<c10::TensorImpl, c10::detail::intrusive_target_default_null_type<c10::TensorImpl>, c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >, c10::DispatchKey, caffe2::TypeMeta&> (inlined
              0.97% c10::intrusive_ptr<c10::TensorImpl, c10::detail::intrusive_target_default_null_type<c10::TensorImpl> >::make<c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >, c10::DispatchKey, caffe2::TypeMeta&>
           0.84% intrusive_ptr<c10::TensorImpl, c10::detail::intrusive_target_default_null_type<c10::TensorImpl> > (inlined)
      - 1.44% c10::make_intrusive<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl>, c10::StorageImpl::use_byte_size_t, long&, c10::DataPtr, c10::Allocator*&, bool> (inlined)
         - 1.44% c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >::make<c10::StorageImpl::use_byte_size_t, long&, c10::DataPtr, c10::Allocator*&, bool> (inlined)
              1.02% std::__atomic_base<unsigned long>::operator++ (inlined)
      - 0.80% ~DataPtr (inlined)
           ~UniqueVoidPtr (inlined)
           ~unique_ptr (inlined)
      - 0.78% c10::TensorOptions::memory_format (inlined)
         - c10::TensorOptions::set_memory_format (inlined)
            - c10::optional<c10::MemoryFormat>::operator bool (inlined)
              c10::optional<c10::MemoryFormat>::initialized (inlined)
```

This change comes with a caveat: if we have constructors where `this` escapes to another thread before returning, we cannot make this assumption, because that other thread may have called `intrusive_ptr::make` already. I chose to just mandate that `intrusive_ptr_target`s' ctors hand back exclusive ownership of `this`, which seems like a reasonable requirement for a ctor anyway. If that turns out to be unacceptable, we could provide an opt-out from this optimization via a traits struct or similar template metaprogramming shenanigans.
ghstack-source-id: 116368592

Test Plan: Run framework overhead benchmark. Results look promising, ranging from a tiny regression (? presumably noise) on the InPlace benchmark, 2.5% - 4% on OutOfPlace, to 9% on the empty benchmarks and 10-12% on the view benchmarks.

Reviewed By: ezyang

Differential Revision: D24606531

fbshipit-source-id: 1cf022063dab71cd1538535c72c4844d8dd7bb25
2020-11-11 15:09:56 -08:00
f2b7c38735 Automated submodule update: FBGEMM (#47605)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: eb55572e55

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47605

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jianyuh

Differential Revision: D24833658

Pulled By: heitorschueroff

fbshipit-source-id: 7a577c75d244a58d94c249c0e50992078a3b62cb
2020-11-11 14:50:45 -08:00
fcd44ce698 Add instruction on how to handle the potential linker error on Linux (#47593)
Summary:
The original issue is https://github.com/pytorch/pytorch/issues/16683, which contains a comment (https://github.com/pytorch/pytorch/issues/16683#issuecomment-459982988) that suggests manually un-shadowing the `ld`.

A better approach can be found at https://github.com/ContinuumIO/anaconda-issues/issues/11152#issuecomment-573120962, which suggests that using a newer version can effectively fix this.

It took me quite some time to realize that this is in fact an issue caused by Anaconda. I think we should add it to the README.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47593

Reviewed By: ailzhang

Differential Revision: D24866092

Pulled By: heitorschueroff

fbshipit-source-id: c1f51864d23fd6f4f63a117496d8619053e35196
2020-11-11 14:24:33 -08:00
7864ae9f98 Improve error messages for operator registration API (#47636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47636

Previously:
```
terminate called after throwing an instance of 'c10::Error'
  what():  *cpp_signature == cpp_signature_->signature INTERNAL ASSERT FAILED at "caffe2/aten/src/ATen/core/dispatch/OperatorEntry.cpp":92, please report a bug to PyTorch. Tried to register a kernel (registered at buck-out/dev/gen/caffe2/generate-code/autograd/generated/TraceType_2.cpp:9847) for operator aten::div.out (registered at buck-out/dev/gen/caffe2/aten/gen_aten=TypeDefault.cpp/TypeDefault.cpp:3541) for dispatch key Tracer, but the C++ function signature at::Tensor& (at::Tensor const&, at::Tensor const&, at::Tensor&) mismatched with a previous kernel (registered at buck-out/dev/gen/caffe2/aten/gen_aten=CPUType.cpp/CPUType.cpp:2166) that had the signature at::Tensor& (at::Tensor&, at::Tensor const&, at::Tensor const&)
```
Now:
```
terminate called after throwing an instance of 'c10::Error'
  what():  *cpp_signature == cpp_signature_->signature INTERNAL ASSERT FAILED at "caffe2/aten/src/ATen/core/dispatch/OperatorEntry.cpp":96, please report a bug to PyTorch.
Mismatch in kernel C++ signatures
  operator: aten::div.out(Tensor self, Tensor other, *, Tensor(a!) out) -> (Tensor(a!))
    registered at buck-out/dev/gen/caffe2/aten/gen_aten=TypeDefault.cpp/TypeDefault.cpp:3541
  kernel 1: at::Tensor& (at::Tensor&, at::Tensor const&, at::Tensor const&)
    dispatch key: CPU
    registered at buck-out/dev/gen/caffe2/aten/gen_aten=CPUType.cpp/CPUType.cpp:2166
  kernel 2: at::Tensor& (at::Tensor const&, at::Tensor const&, at::Tensor&)
    dispatch key: Tracer
    registered at buck-out/dev/gen/caffe2/generate-code/autograd/generated/TraceType_2.cpp:9847
```

Previously:
```
W1109 13:38:52.464170 1644302 OperatorEntry.cpp:117] Warning: Registering a kernel (registered at caffe2/torch/csrc/autograd/VariableTypeManual.cpp:310) for operator aten::_backward (registered at buck-out/dev/gen/caffe2/aten/gen_aten=TypeDefault.cpp/TypeDefault.cpp:3549) for dispatch key Autograd that overwrote a previously registered kernel (registered at caffe2/torch/csrc/autograd/VariableTypeManual.cpp:310) with the same dispatch key for the same operator. (function registerKernel)
```
Now:
```
W1109 13:49:40.501817 1698959 OperatorEntry.cpp:118] Warning: Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_backward(Tensor self, Tensor[] inputs, Tensor? gradient=None, bool? retain_graph=None, bool create_graph=False) -> ()
    registered at buck-out/dev/gen/caffe2/aten/gen_aten=TypeDefault.cpp/TypeDefault.cpp:3549
  dispatch key: Autograd
  previous kernel: registered at caffe2/torch/csrc/autograd/VariableTypeManual.cpp:310
       new kernel: registered at caffe2/torch/csrc/autograd/VariableTypeManual.cpp:310 (function registerKernel)
```

Previously:
```
terminate called after throwing an instance of 'c10::Error'
  what():  In registration for dummy_library::dummy_op: expected schema of operator to be "dummy_library::dummy_op(Tensor a) -> (Tensor)" (registered at caffe2/torch/csrc/autograd/VariableTypeManual.cpp:298), but got inferred schema "(Tensor _0) -> ()" (registered at caffe2/torch/csrc/autograd/VariableTypeManual.cpp:298). The number of returns is different. 1 vs 0
```
Now:
```
terminate called after throwing an instance of 'c10::Error'
  what():  Inferred operator schema for a C++ kernel function doesn't match the expected function schema.
  operator: dummy_library::dummy_op
  expected schema: dummy_library::dummy_op(Tensor a) -> (Tensor)
    registered at caffe2/torch/csrc/autograd/VariableTypeManual.cpp:298
  inferred schema: (Tensor _0) -> ()
    registered at caffe2/torch/csrc/autograd/VariableTypeManual.cpp:298
  reason: The number of returns is different. 1 vs 0
````

Previously:
```
terminate called after throwing an instance of 'c10::Error'
  what():  !cpp_signature_.has_value() || (CppSignature::make<FuncType>() == cpp_signature_->signature) INTERNAL ASSERT FAILED at "caffe2/aten/src/ATen/core/dispatch/OperatorEntry.h":170, please report a bug to PyTorch. Tried to access operator _test::dummy with a wrong signature. Accessed with void (at::Tensor, long) but the operator was registered with void (at::Tensor) (schema: registered by RegisterOperators, kernel: registered by RegisterOperators) This likely happened in a call to OperatorHandle::typed<Return (Args...)>(). Please make sure that the function signature matches the signature in the operator registration call.
```
Now:
```
terminate called after throwing an instance of 'c10::Error'
  what():  !cpp_signature_.has_value() || (CppSignature::make<FuncType>() == cpp_signature_->signature) INTERNAL ASSERT FAILED at "caffe2/aten/src/ATen/core/dispatch/OperatorEntry.h":169, please report a bug to PyTorch.
Tried to access or call an operator with a wrong signature.
  operator: _test::dummy(Tensor dummy) -> ()
    registered by RegisterOperators
  correct signature:  void (at::Tensor)
    registered by RegisterOperators
  accessed/called as: void (at::Tensor, long)
This likely happened in a call to OperatorHandle::typed<Return (Args...)>(). Please make sure that the function signature matches the signature in the operator registration call.
```
ghstack-source-id: 116359052

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D24846523

fbshipit-source-id: 0ce7d487b725bfbdf2261e36027cb34ef50c1fea
2020-11-11 14:19:38 -08:00
05a76ed705 Batching rule for torch.squeeze(tensor) (#47632)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47632

This one is fun because we have to be careful not to squeeze out any of
the batch dims (it is the dims of the per-example tensor that are being squeezed).

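A minimal Python sketch of the rule (the real batching rule is implemented in C++, and this sketch keeps the batch dim at the front):

```python
import torch

def squeeze_batching_rule(batched: torch.Tensor, bdim: int) -> torch.Tensor:
    out = batched.movedim(bdim, 0)      # put the batch dim in front
    # walk per-example dims from the back so indices stay valid while squeezing
    for d in range(out.dim() - 1, 0, -1):
        if out.size(d) == 1:
            out = out.squeeze(d)
    return out                          # batch dim 0 is never squeezed

x = torch.randn(8, 1, 3, 1)             # batch of 8 per-example (1, 3, 1) tensors
assert squeeze_batching_rule(x, 0).shape == (8, 3)
```
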
Test Plan: - new tests

Reviewed By: anjali411

Differential Revision: D24859022

Pulled By: zou3519

fbshipit-source-id: 8adbd80963081efb683f62ea074a286a10da288f
2020-11-11 14:08:39 -08:00
df887936a4 Fix transpose batching rule (#47628)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47628

PyTorch has a special case where scalar_tensor.transpose(0, 0) works and
returns the scalar tensor. If the following happens:
```py
>>> x = torch.randn(B0)  # the per-examples are all scalars
>>> vmap(lambda x: x.transpose(0, 0))(x)
```
then we replicate this behavior.

Test Plan: - new tests

Reviewed By: anjali411

Differential Revision: D24843658

Pulled By: zou3519

fbshipit-source-id: e33834122652473e34a18ca1cecf98e8a3b84bc1
2020-11-11 14:08:37 -08:00
f6ff6478cf Make kwargs argument optional in _batched_grad_test (#47625)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47625

kwargs is {} most of the time, so this PR makes it optional. Note that it
is bad practice for {} to be a default argument; we work around this by
using None as the default and handling it accordingly.

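A minimal sketch of the pattern (the real helper has more logic):

```python
def _batched_grad_test(fn, args, kwargs=None):
    # None as the default avoids the classic shared-mutable-{} pitfall
    if kwargs is None:
        kwargs = {}
    return fn(*args, **kwargs)
```
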
Test Plan
- `pytest test/test_vmap.py -v`

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D24842571

Pulled By: zou3519

fbshipit-source-id: a46b0c6d5240addbe3b231b8268cdc67708fa9e0
2020-11-11 14:08:35 -08:00
fc24d0656a Tensor.contiguous, Tensor.is_contiguous batch rule (#47621)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47621

Followup to #47365.

is_contiguous on BatchedTensorImpl is implemented as:
- Whenever one creates a BatchedTensorImpl, we cache the strides of the
per-examples, just like how we cache the sizes of the per-examples.
- With the cached strides, we use TensorImpl::refresh_contiguous() to
compute if the tensor is contiguous or not.
- is_contiguous checks the `is_contiguous_` flag that
refresh_contiguous() populates.

Both contiguous and is_contiguous only support torch.contiguous_format.
I'm not sure what the semantics should be for other memory formats; they
are also rank dependent (e.g., channels_last tensor must have 4
dimensions) which makes this a bit tricky.

Test Plan: - new tests

Reviewed By: Chillee, anjali411

Differential Revision: D24840975

Pulled By: zou3519

fbshipit-source-id: 4d86dbf11e2eec45f3f08300ae3f2d79615bb99d
2020-11-11 14:06:05 -08:00
6c815c71b3 Revert to use NCCL 2.7.8-1 (#47638)
Summary:
Only depend on stable NCCL releases

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47638

Reviewed By: mingzhe09088

Differential Revision: D24847765

Pulled By: mrshenli

fbshipit-source-id: 2c5f29602aa7403c110797cb07f8fb6151a1b60d
2020-11-11 13:05:09 -08:00
1abe6e5ad4 [ONNX] Bool inputs to index_put updated symbolic (#46866)
Summary:
Cases with bool inputs to index_put nodes were handled for tracing purposes. This PR adds support for similar situations in scripting.

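A minimal sketch of the pattern this enables, assuming masked assignment lowers to index_put with a bool index:

```python
import torch

@torch.jit.script
def zero_masked(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    x[mask] = 0.0  # index_put with a bool index, now handled in scripting export
    return x
```
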
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46866

Reviewed By: malfet

Differential Revision: D24870818

Pulled By: bzinodev

fbshipit-source-id: 2d75ca6f5f4b79d8c5ace337633c5aed3bdc4be7
2020-11-11 12:45:31 -08:00
da2e2336b6 [ONNX] Export and shape inference for prim uninitialized in If subblock (#46094)
Summary:
Enable export of prim::Uninitialized in If subblock outputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46094

Reviewed By: houseroad

Differential Revision: D24838537

Pulled By: bzinodev

fbshipit-source-id: d0719b140393595e6df114ef5cc1bb845e919c14
2020-11-11 12:10:49 -08:00
4078f44668 [TB][embedding supporting] Modify histogram to accept multiple types to skip CastOp and avoid OOMing in CastOp
Summary: To support min/max/mean/std, SummarizeOp needs to skip size checking (similar to the LpNorm error mentioned above) and accept multiple types.

Test Plan:
unit test:
`buck test //caffe2/caffe2/fb/tensorboard/tests:tensorboard_accumulate_histogram_op_test`

https://our.intern.facebook.com/intern/testinfra/testrun/1407375057859572

`buck test //caffe2/caffe2/fb/tensorboard/tests:tensorboard_accumulate_histogram_op_test --stress-runs 1000`

https://our.intern.facebook.com/intern/testinfra/testrun/2533274832166362

Reviewed By: cryptopic

Differential Revision: D24605507

fbshipit-source-id: fa08372d7c9970083c38abd432d4c86e84fb10e0
2020-11-11 12:03:54 -08:00
513f62b45b [hotfix] fix collect_env not working when torch compile/install fails (#47752)
Summary:
Fix collect_env not working when a PyTorch compile from source fails mid-way.
```
Traceback (most recent call last):
OSError: /home/rongr/local/pytorch/torch/lib/libtorch_global_deps.so: cannot open shared object file: No such file or directory
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47752

Reviewed By: janeyx99

Differential Revision: D24888576

Pulled By: walterddr

fbshipit-source-id: 3b20daeddbb4118491fb0cca9fb59d861f683da7
2020-11-11 11:47:49 -08:00
a1db5b0f2b Added CUDA support for complex input for torch.inverse #2 (#47595)
Summary:
`torch.inverse` now works for complex inputs on GPU.
Opening a new PR here. The previous PR was merged and reverted due to a bug in tests marked with `slowTest`.
Previous PR https://github.com/pytorch/pytorch/pull/45034

Ref. https://github.com/pytorch/pytorch/issues/33152

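A short usage sketch, assuming a CUDA device is available:

```python
import torch

a = torch.randn(4, 4, dtype=torch.complex64, device='cuda')
print(torch.inverse(a))  # now supported for complex inputs on GPU
```
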
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47595

Reviewed By: navahgar

Differential Revision: D24840955

Pulled By: anjali411

fbshipit-source-id: ec49fffdc4b3cb4ae7507270fa24e127be14f59b
2020-11-11 11:06:08 -08:00
dbfee42a7d [FX] Fix uses not updating when erasing a node (#47720)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47720

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D24875880

Pulled By: jamesr66a

fbshipit-source-id: aae9ffd10f8085b599e7923152287c6e6950ff49
2020-11-11 11:02:15 -08:00
d1351c66a8 [FX] Add a bunch of docstrings (#47719)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47719

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D24875400

Pulled By: jamesr66a

fbshipit-source-id: a1dd43d2eee914a441eff43c4f2efe61a399e8a5
2020-11-11 10:59:57 -08:00
dac0192148 Revert D23632280: [c10d] switch ProcessGroup::Work to be managed by intrusive_ptr
Test Plan: revert-hammer

Differential Revision:
D23632280 (0650a6166f)

Original commit changeset: 0a4642a8ffab

fbshipit-source-id: 2aa8ddb874fab11f773f4c08d740afcd865482e9
2020-11-11 10:54:08 -08:00
1f946e942d Revert D24667128: [c10d] switch Store to be managed by intrusive_ptr
Test Plan: revert-hammer

Differential Revision:
D24667128 (0cfe3451d4)

Original commit changeset: 9b6024c31c85

fbshipit-source-id: d8ddf9eb2fccef5023e05698e0c4662708fe4945
2020-11-11 10:49:58 -08:00
2204374fd4 Revert D24667127: [c10d] switch ProcessGroupNCCL:Options to be managed by intrusive_ptr
Test Plan: revert-hammer

Differential Revision:
D24667127 (ae5c2febb9)

Original commit changeset: 54986193ba1b

fbshipit-source-id: 12e1ebea1981c0b1b6dff4c8a2e2045878d44537
2020-11-11 10:42:33 -08:00
0c64f9f526 Convert from higher order functions to classes in tools.codegen.gen (#47008)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47008

bhosmer has been complaining about how it is difficult to distinguish
between local variables and closed over variables in the higher order
functions.  Well, closures and objects do basically the same thing, so
just convert all these HOFs into objects.

The decoder ring:
- Higher order function => Constructor for object
- Access to closed over variable => Access to member variable on object
- with_native_function => method_with_native_function (because it's
  hard writing decorators that work for both functions and methods)

I didn't even have to change indentation (much).

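A toy illustration of the decoder ring (not the actual tools.codegen code):

```python
from dataclasses import dataclass

# before: a higher-order function closing over `dispatch_key`
def make_gen(dispatch_key):
    def gen(f):
        return f"register {f} for {dispatch_key}"
    return gen

# after: an object; the closed-over variable becomes a member variable
@dataclass
class Gen:
    dispatch_key: str
    def __call__(self, f):
        return f"register {f} for {self.dispatch_key}"

assert make_gen("CPU")("add") == Gen("CPU")("add")
```
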
When there is no need for closed over variables (a few functions), I
kept them as plain old functions, no need for an object with no
members.

While I was at it, I also deleted the kwargs, since the types are
enough to prevent mistakes.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D24600805

Pulled By: ezyang

fbshipit-source-id: 7e3ce8cb2446e3788f934ddcc17f7da6e9299511
2020-11-11 10:30:50 -08:00
d478605dec Fix classmethod override argument passing. (#47114)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47069.
Fixes https://github.com/pytorch/pytorch/issues/46824.
Fixes https://github.com/pytorch/pytorch/issues/47186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47114

Reviewed By: ngimel

Differential Revision: D24649598

Pulled By: ezyang

fbshipit-source-id: af077affece7eceb1e4faf9c94d15484796b0f0e
2020-11-11 09:25:48 -08:00
1239d067ae [quant][graphmode][fx] Support standalone_module_class (#47705)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47705

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D24872380

fbshipit-source-id: db2ec7ba03da27203033fbebc11666be572622bb
2020-11-11 09:15:14 -08:00
4cb73f5a4c Allow for string literal return during symbolic tracing (#47618)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47618

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D24870422

Pulled By: ansley

fbshipit-source-id: 41c56c2f4f1f7bb360cea0fb346f6e4d495f5c2b
2020-11-11 08:54:39 -08:00
48ed577fbd Stop including TypeDefault.h from MPSCNNTests.mm (#46998)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46998

It's not using any TypeDefault symbols directly; running
CI to see if it was being included for other headers.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D24621920

Pulled By: ezyang

fbshipit-source-id: f868e5412ff3e5a616c3fc38110f203ca545eed5
2020-11-11 08:46:56 -08:00
88ec72e1c2 [fbcode][pytorch mobile] Create model reader utilities.
Summary:
For some of the end-to-end flow projects, we will need the capability to read module information during model validation or model publishing.
This diff creates model_reader.py with utilities for model content reading; it includes the following functionality:
1. read the model bytecode version;
2. check if a model is a lite PyTorch script module;
3. check if a model is a PyTorch script module.

This diff is recreated from the reverted diff: D24655999 (7f056e99dd).

Test Plan:
```
[xcheng16@devvm1099]/data/users/xcheng16/fbsource/fbcode% buck test //caffe2/torch/fb/mobile/tests:mobile_model_reader_tests
Action graph will be rebuilt because files have been added or removed.
Parsing buck files: finished in 10.4 sec
Creating action graph: finished in 22.2 sec
Building: finished in 01:29.1 min (100%) 10619/10619 jobs, 1145 updated
  Total time: 02:01.8 min
More details at https://www.internalfb.com/intern/buck/build/f962dfad-76f9-457a-aca3-768ce20f0c31
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 172633f6-6b5b-49e9-a632-b4efa083a001
Trace available for this run at /tmp/tpx-20201109-165156.109798/trace.log
Started reporting to test run: https://our.intern.facebook.com/intern/testinfra/testrun/3940649712677511
    ✓ ListingSuccess: caffe2/torch/fb/mobile/tests:mobile_model_reader_tests - main (18.229)
    ✓ Pass: caffe2/torch/fb/mobile/tests:mobile_model_reader_tests - test_is_pytorch_lite_module (caffe2.torch.fb.mobile.tests.test_model_reader.TestModelLoader) (8.975)
    ✓ Pass: caffe2/torch/fb/mobile/tests:mobile_model_reader_tests - test_is_pytorch_script_module (caffe2.torch.fb.mobile.tests.test_model_reader.TestModelLoader) (9.136)
    ✓ Pass: caffe2/torch/fb/mobile/tests:mobile_model_reader_tests - test_read_module_bytecode_version (caffe2.torch.fb.mobile.tests.test_model_reader.TestModelLoader) (9.152)
Summary
  Pass: 3
  ListingSuccess: 1
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/3940649712677511
```

Reviewed By: husthyc

Differential Revision: D24848563

fbshipit-source-id: ab3371e111206a4bb4d07715c3314596cdc38d2c
2020-11-11 08:11:28 -08:00
5647f0ca7c Revert D24859919: [pytorch][PR] Grammatically updated the tech docs
Test Plan: revert-hammer

Differential Revision:
D24859919 (a843d48ead)

Original commit changeset: 5c6a8bc8e785

fbshipit-source-id: f757995fb64cfd4212c978618d572367e7296758
2020-11-11 07:43:17 -08:00
ae5c2febb9 [c10d] switch ProcessGroupNCCL:Options to be managed by intrusive_ptr (#47075)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47075

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D24667127

Pulled By: wanchaol

fbshipit-source-id: 54986193ba1b22480622a2e9d6d41d9472d201f3
2020-11-10 23:36:47 -08:00
0cfe3451d4 [c10d] switch Store to be managed by intrusive_ptr (#47074)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47074

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24667128

Pulled By: wanchaol

fbshipit-source-id: 9b6024c31c851b7c3243540f460ae57323da523b
2020-11-10 23:36:44 -08:00
0650a6166f [c10d] switch ProcessGroup::Work to be managed by intrusive_ptr (#44046)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44046

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23632280

Pulled By: wanchaol

fbshipit-source-id: 0a4642a8ffabdd26c52c1baabfa30c0f446c3c85
2020-11-10 23:30:22 -08:00
cbf439caf1 Unbreak backward compatibility tests (#47726)
Summary:
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47726

Reviewed By: Chillee

Differential Revision: D24880651

Pulled By: ngimel

fbshipit-source-id: 1e70f42d98c7a14265aed743669592b4fc08c8d4
2020-11-10 21:37:39 -08:00
bfec376e9f [vulkan] Apply new changes to vulkan api v1 (#47721)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47721

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D24877403

Pulled By: IvanKobzarev

fbshipit-source-id: acfa8217c10d14bf38472abfc1e6f6216557c359
2020-11-10 20:10:29 -08:00
d73a8db2d2 Use local env for building CUDA extensions on Windows (#47150)
Summary:
Fixes https://github.com/pytorch/vision/pull/2818#issuecomment-719167504
After activating the VC env multiple times, the following error will be raised when building a CUDA extension.
```
FAILED: C:/tools/MINICO~1/CONDA-~2/TORCHV~1/work/build/temp.win-amd64-3.8/Release/tools/MINICO~1/CONDA-~2/TORCHV~1/work/torchvision/csrc/cuda/PSROIAlign_cuda.obj
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc -Xcompiler /MD -Xcompiler /wd4819 -Xcompiler /wd4251 -Xcompiler /wd4244 -Xcompiler /wd4267 -Xcompiler /wd4275 -Xcompiler /wd4018 -Xcompiler /wd4190 -Xcompiler /EHsc -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -DWITH_CUDA -Dtorchvision_EXPORTS -IC:\tools\MINICO~1\CONDA-~2\TORCHV~1\work\torchvision\csrc -I%PREFIX%\lib\site-packages\torch\include -I%PREFIX%\lib\site-packages\torch\include\torch\csrc\api\include -I%PREFIX%\lib\site-packages\torch\include\TH -I%PREFIX%\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -I%PREFIX%\include -I%PREFIX%\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" -I%PREFIX%\Library\include -c C:\tools\MINICO~1\CONDA-~2\TORCHV~1\work\torchvision\csrc\cuda\PSROIAlign_cuda.cu -o C:\tools\MINICO~1\CONDA-~2\TORCHV~1\work\build\temp.win-amd64-3.8\Release\tools\MINICO~1\CONDA-~2\TORCHV~1\work\torchvision\csrc\cuda\PSROIAlign_cuda.obj -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_35,code=sm_35 -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_50,code=compute_50 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0
'cl.exe' is not recognized as an internal or external command,
operable program or batch file.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47150

Reviewed By: agolynski

Differential Revision: D24706019

Pulled By: ezyang

fbshipit-source-id: c13dc29f62d2d12d6a56f33dd450b467a1bf193b
2020-11-10 20:02:06 -08:00
7908bf27d5 Fix output type of torch.max for Tensor subclasses. (#47110)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47090

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47110

Reviewed By: ngimel

Differential Revision: D24649568

Pulled By: ezyang

fbshipit-source-id: 9374cf0c562de78e520bcb03415db273c1dd76a3
2020-11-10 19:45:36 -08:00
a5c65b86ce Fixed einsum compatibility/performance issues (#46398)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46398

This PR makes torch.einsum compatible with numpy.einsum, except for the sublist input option, as requested here: https://github.com/pytorch/pytorch/issues/21412. It also fixes 2 performance issues linked below and adds a check for reducing to torch.dot instead of torch.bmm, which is faster in some cases.

fixes #45854, #37628, #30194, #15671

fixes #41467 with benchmark below
```python
import torch
from torch.utils.benchmark import Timer

a = torch.randn(10000, 100, 101, device='cuda')
b = torch.randn(10000, 101, 3, device='cuda')

c = torch.randn(10000, 100, 1, device='cuda')
d = torch.randn(10000, 100, 1, 3, device='cuda')

print(Timer(
    stmt='torch.einsum("bij,bjf->bif", a, b)',
    globals={'a': a, 'b': b}
).blocked_autorange())

print()

print(Timer(
    stmt='torch.einsum("bic,bicf->bif", c, d)',
    globals={'c': c, 'd': d}
).blocked_autorange())
```
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7fa37c413850>
torch.einsum("bij,bjf->bif", a, b)
  Median: 4.53 ms
  IQR:    0.00 ms (4.53 to 4.53)
  45 measurements, 1 runs per measurement, 1 thread

<torch.utils.benchmark.utils.common.Measurement object at 0x7fa37c413700>
torch.einsum("bic,bicf->bif", c, d)
  Median: 63.86 us
  IQR:    1.52 us (63.22 to 64.73)
  4 measurements, 1000 runs per measurement, 1 thread
```

fixes #32591 with benchmark below
```python
import torch
from torch.utils.benchmark import Timer

a = torch.rand(1, 1, 16, 2, 16, 2, 16, 2, 2, 2, 2, device="cuda")
b = torch.rand(729, 1, 1, 2, 1, 2, 1, 2, 2, 2, 2, device="cuda")

print(Timer(
    stmt='(a * b).sum(dim = (-3, -2, -1))',
    globals={'a': a, 'b': b}
).blocked_autorange())

print()

print(Timer(
    stmt='torch.einsum("...ijk, ...ijk -> ...", a, b)',
    globals={'a': a, 'b': b}
).blocked_autorange())
```
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7efe0de28850>
(a * b).sum(dim = (-3, -2, -1))
  Median: 17.86 ms
  2 measurements, 10 runs per measurement, 1 thread

<torch.utils.benchmark.utils.common.Measurement object at 0x7efe0de286a0>
torch.einsum("...ijk, ...ijk -> ...", a, b)
  Median: 296.11 us
  IQR:    1.38 us (295.42 to 296.81)
  662 measurements, 1 runs per measurement, 1 thread
```

TODO

- [x] add support for ellipsis broadcasting
- [x] fix corner case issues with sumproduct_pair
- [x] update docs and add more comments
- [x] add tests for error cases

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D24860367

Pulled By: heitorschueroff

fbshipit-source-id: 31110ee598fd598a43acccf07929b67daee160f9
2020-11-10 19:38:43 -08:00
51a661c027 [vulkan] tentative fix for conv2d_pw, and fix checks for addmm (#47723)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47723

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D24878293

Pulled By: SS-JIA

fbshipit-source-id: 04abb544b87bd047ffe8af7ed52ec2569c61add4
2020-11-10 19:24:33 -08:00
e914a1b976 Support default args in symbolic tracing (#47615)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47615

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D24865060

Pulled By: ansley

fbshipit-source-id: 32ff105a1fa9c4a8f00adc20e8d40d1b6bd7157f
2020-11-10 18:57:00 -08:00
a5e9fa1b0d Add max_src_column_width to autograd profiler (#46257)
Summary:
Currently the max `src_column_width` is hardcoded to 75, which might not be sufficient for modules with long file names. This PR exposes `max_src_column_width` as a changeable parameter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46257

Reviewed By: malfet

Differential Revision: D24280834

Pulled By: yf225

fbshipit-source-id: 8a90a433c6257ff2d2d79f67a944450fdf5dd494
2020-11-10 18:51:39 -08:00
1b954749d0 Disable test_distributed_for for multigpu test env (#47703)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47703

Differential Revision: D24871454

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Pulled By: mrshenli

fbshipit-source-id: 2112867c2aa551392fab16b984c59bcb59ae16ad
2020-11-10 18:46:31 -08:00
4de40dad5d [ONNX] Improve stability of gemm export (#46570)
Summary:
Export as `onnx::MatMul` if possible, since it has fewer constraints. Resolves an issue with exporting `weight_norm` in scripting that fails ONNX shape inference with `onnx::Gemm` in an unreachable `if` subgraph.

Updates the skipped tests list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46570

Reviewed By: ngimel

Differential Revision: D24657480

Pulled By: bzinodev

fbshipit-source-id: 08d47cc9fc01c4a73a9d78c964fef102d12cc21c
2020-11-10 18:32:33 -08:00
69532c4227 Vulkan MobileNetv2 unit test. (#47616)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47616

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Differential Revision: D24848639

Pulled By: AshkanAliabadi

fbshipit-source-id: 81a432a14cdca444ec0f70a4f8692a3abf4d2ea9
2020-11-10 17:28:39 -08:00
bf6a156f64 Fix kthvalue error for scalar input (#47600)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47600

fixes https://github.com/pytorch/pytorch/issues/30818

Note that the median case was already fixed by https://github.com/pytorch/pytorch/pull/45847

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D24860337

Pulled By: heitorschueroff

fbshipit-source-id: 69ccbbb6c7c86671e5712b1c2056c012d898b4f2
2020-11-10 17:21:52 -08:00
6575e674ce [numpy] torch.{all, any} : Extend Dtype Support (#44790)
Summary:
Reference https://github.com/pytorch/pytorch/issues/44779

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44790

Reviewed By: bdhirsh

Differential Revision: D24393119

Pulled By: heitorschueroff

fbshipit-source-id: a9b88e9d06b3c282f2e5360b6eaea4ae8ef77c1d
2020-11-10 17:11:39 -08:00
c9d37675b2 Back out "[pytorch][PR] The dimension being reduced should not be coalesced by TensorIterator" (#47642)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47642

Original commit changeset: 02bb2b15694c

Test Plan: Covered by CI tests

Reviewed By: anjali411

Differential Revision: D24849072

fbshipit-source-id: a8790cbf46936aee7a6f504dac8595997175fc65
2020-11-10 16:31:33 -08:00
f692af209d add unittest for operator benchmark (#47678)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47678

Add unit tests for operator benchmark.
They cover the cases below:
```
generate_c2_test
generate_c2_gradient_test
generate_pt_test
generate_pt_gradient_test
generate_pt_tests_from_op_list
```
Also fixed two issues (incorrect fn signatures) found by the unit test in `benchmark_caffe2.py`.

Test Plan:
arc lint
buck run caffe2/benchmarks/operator_benchmark:operator_benchmark_unittest
```
test_c2_single_op (operator_benchmark_unittest.BenchmarkTest) ... # ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: add
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1109 23:08:39.932207 639464 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: add_M8
# Input: M: 8
Forward Execution Time (us) : 36.474

# Benchmarking Caffe2: add
# Name: add_M8
# Input: M: 8
Backward Execution Time (us) : 42.281

ok
test_pt_list_of_ops (operator_benchmark_unittest.BenchmarkTest) ... # ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: add
# Name: add_M8
# Input: M: 8
Forward Execution Time (us) : 36.579

# Benchmarking Caffe2: add
# Name: add_M8
# Input: M: 8
Backward Execution Time (us) : 42.734

# Benchmarking PyTorch: abs
# Mode: Eager
# Name: abs_M8
# Input: M: 8
Forward Execution Time (us) : 148.929

# Benchmarking PyTorch: abs_
# Mode: Eager
# Name: abs__M8
# Input: M: 8
Forward Execution Time (us) : 71.909

ok
test_pt_single_op (operator_benchmark_unittest.BenchmarkTest) ... # ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: add
# Name: add_M8
# Input: M: 8
Forward Execution Time (us) : 36.860

# Benchmarking Caffe2: add
# Name: add_M8
# Input: M: 8
Backward Execution Time (us) : 42.293

# Benchmarking PyTorch: abs
# Mode: Eager
# Name: abs_M8
# Input: M: 8
Forward Execution Time (us) : 148.999

# Benchmarking PyTorch: abs_
# Mode: Eager
# Name: abs__M8
# Input: M: 8
Forward Execution Time (us) : 71.941

# Benchmarking PyTorch: add
# Mode: Eager
# Name: add_M8
# Input: M: 8
Forward Execution Time (us) : 179.108

# Benchmarking PyTorch: add
# Mode: Eager
# Name: add_M8
# Input: M: 8
Backward Execution Time (us) : 1205.902

ok
```
buck run caffe2/benchmarks/operator_benchmark/c2:add_test
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: add
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1109 23:20:11.551795 654290 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: add_M8_N16_K32_dtypeint
# Input: M: 8, N: 16, K: 32, dtype: int
Forward Execution Time (us) : 984.510

# Benchmarking Caffe2: add
# Name: add_M16_N16_K64_dtypefloat
# Input: M: 16, N: 16, K: 64, dtype: float
Forward Execution Time (us) : 68.526

# Benchmarking Caffe2: add
# Name: add_M64_N64_K128_dtypeint
# Input: M: 64, N: 64, K: 128, dtype: int
Forward Execution Time (us) : 101617.076
```

Reviewed By: mingzhe09088

Differential Revision: D24854414

fbshipit-source-id: 6676549909da6700b42f322c4ad6e8e2ef5b86b5
2020-11-10 15:45:36 -08:00
a843d48ead Grammatically updated the tech docs (#47345)
Summary:
<img width="1440" alt="Screenshot 2020-11-04 at 1 07 21 PM" src="https://user-images.githubusercontent.com/72745540/98082455-c5f89200-1e9e-11eb-97e3-ae0eb62355f6.png">

Small grammatical update to the torch tech docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47345

Reviewed By: malfet

Differential Revision: D24859919

Pulled By: ejguan

fbshipit-source-id: 5c6a8bc8e785c5295bf6f2f5b583dd6054b96fec
2020-11-10 15:33:26 -08:00
febc76a5c6 fix assert_allclose doesn't check shape (#47580)
Summary:
Fix assert_allclose not checking shape.

Should fix https://github.com/pytorch/pytorch/issues/47449.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47580

Reviewed By: samestep

Differential Revision: D24836399

Pulled By: walterddr

fbshipit-source-id: 943f8c83864bc01e1a782048c234e9592d2f1a25
2020-11-10 15:03:25 -08:00
8e3af9faa8 [pytorch] fix debug symbol flag for android clang (#46331)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46331

Fix the android build size issue #46246.

Test Plan: Imported from OSS

Reviewed By: dhruvbird

Differential Revision: D24390061

Pulled By: ljk53

fbshipit-source-id: b4a6f297e89b9c08dff4297c6a41aabd41d9fff5
2020-11-10 14:55:43 -08:00
baa2f777c8 [complex] torch.sqrt: fix edge values (#47424)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47358

Replace the optimized path with a slower but correct `map(std::sqrt)`

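Illustrative inputs on the negative real axis (the exact failing edge values are in the linked issue):

```python
import torch

x = torch.tensor([-1 + 0j, -4 + 0j], dtype=torch.complex64)
print(torch.sqrt(x))  # tensor([0.+1.j, 0.+2.j])
```
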
Benchmark posted below in comments.

cc: dylanbespalko (original author of fast-path)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47424

Reviewed By: walterddr

Differential Revision: D24855914

Pulled By: mruberry

fbshipit-source-id: c21a38f365d996645db70be96ff1216776bedd3a
2020-11-10 14:51:04 -08:00
7691cf175c [ROCm] set ROCM_ARCH to gfx900 and gfx906 for CI builds (#47683)
Summary:
This change adds the arch settings for caffe2 builds, fixes some typos,
and clarifies that this setting applies to both CircleCI and Jenkins.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47683

Reviewed By: zou3519

Differential Revision: D24864034

Pulled By: malfet

fbshipit-source-id: 304b8a8e5c929ddaeb9c399f6219783a1369d842
2020-11-10 14:44:48 -08:00
ef5f54b2c6 added rocm 3.9 docker image (#47473)
Summary:
Added a bionic ROCm 3.9 Docker image.

jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47473

Reviewed By: zou3519

Differential Revision: D24860549

Pulled By: malfet

fbshipit-source-id: d12c39970432ed5fc5051cac10a068fd7bb8f7f9
2020-11-10 14:42:10 -08:00
14f0675903 [ONNX] Fix dtype for log_softmax export (#46627)
Summary:
Previously, the dtype was not properly converted from a constant node to a Python number.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46627

Reviewed By: houseroad

Differential Revision: D24657535

Pulled By: bzinodev

fbshipit-source-id: 33b0b9087d969f2cb0a2fa608fcf6e10956c06bf
2020-11-10 14:34:46 -08:00
0fb1356a98 [ONNX] Fix eye export (#47016)
Summary:
Previously the export did not consider the case where the optional `m` is not provided in `torch.eye(n, m)`.
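
For context, the two call forms in question (illustrative, not taken from the PR's tests):

```python
import torch

torch.eye(3)     # optional `m` omitted, defaults to `n`; this case previously broke the export
torch.eye(3, 4)  # explicit `m`; already handled
```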

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47016

Reviewed By: ejguan

Differential Revision: D24735916

Pulled By: bzinodev

fbshipit-source-id: ec9b410fc59f27d77d4ae40cb38a67537abb3cd8
2020-11-10 14:24:33 -08:00
5ce9c70631 Revert D24735802: [pytorch][PR] [ONNX] Update batch_norm symbolic to handle track_running_stats=False
Test Plan: revert-hammer

Differential Revision:
D24735802 (1a55f5b3ea)

Original commit changeset: bbb29d92d46a

fbshipit-source-id: dcd7af6d50e2776e63ee4bfcb9e4baf08a4771b4
2020-11-10 14:04:06 -08:00
6b94830cdc faithful signature support in BoxedKernelWrapper (#47267)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47267

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24701488

Pulled By: bhosmer

fbshipit-source-id: dbce246319670f9590c5762ad20c26cb24575fe8
2020-11-10 13:58:36 -08:00
0a7ebf00f8 [Reland] Add tests for DDP control flow models. (#47470)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47470

Reland of https://github.com/pytorch/pytorch/pull/47206, which was reverted due to failing multigpu tests.

The fix to make multigpu tests work is to compare against `torch.tensor([world_size, 0])`, not hardcode `torch.tensor([2, 0])`, which assumes a world size of 2.

Original commit description:

As discussed offline with pritamdamania87, add testing to ensure per-iteration and rank-dependent control flow works as expected in DDP with find_unused_parameters=True.
ghstack-source-id: 115993934

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D24767893

fbshipit-source-id: 7d7a2449270eb3e72b5061694e897166e16f9bbc
2020-11-10 12:22:59 -08:00
17c58720fe Revert D24346771: [caffe2][memonger] Add support for distributed inference predict nets in DAG memonger
Test Plan: revert-hammer

Differential Revision:
D24346771 (5882f2e540)

Original commit changeset: ad2dd2e63f3e

fbshipit-source-id: 90346f08c890eebe71f068748a8e24e4db88c250
2020-11-10 12:11:22 -08:00
163adb9fa7 Add HalfToFloat + FloatToHalf operators to PyTorch (#45092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45092

Adding two operators
1. at::float_to_half -> Converts FP32 tensor to FP16 tensor
2. at::half_to_float -> Converts FP16 tensor to FP32 tensor.

These operators internally use the kernel provided by FBGeMM. Both C2 and PT will use the same FBGeMM kernel underneath.
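
At the Python level the conversions look like the following sketch; whether `.half()`/`.float()` reach these exact kernels on a given build is an assumption:

```python
import torch

x = torch.randn(512, 512)  # FP32
h = x.half()               # FP32 -> FP16
y = h.float()              # FP16 -> FP32
assert h.dtype == torch.float16 and y.dtype == torch.float32
```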

Test Plan:
buck test //caffe2/test:torch -- .*test_half_tensor.*

Run benchmark locally using

```
buck run //caffe2/benchmarks/operator_benchmark/pt:tensor_to_test
```

AI Bench results are pending; I don't expect them to finish soon, as we have a large queue with jobs pending for 2+ days.

Benchmark for a 512x512 tensor with the FBGeMM implementation:

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark
# Mode: Eager
# Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 1246.332

# Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark
# Mode: Eager
# Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 1734.304
```

Benchmark for a 512x512 tensor on trunk, with no FBGeMM integration:

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark
# Mode: Eager
# Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 169045.724

# Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark
# Mode: Eager
# Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 152382.494
```

Reviewed By: ngimel

Differential Revision: D23824869

fbshipit-source-id: ef044459b6c8c6e5ddded72080204c6a0ab4582c
2020-11-10 12:00:53 -08:00
497cd2506f Add serialize GraphModule to JSON support (#47612)
Summary:
Re-opening this PR; the previously missed mypy issues are now addressed.
Example:

```
class TestModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)
        self.e = torch.rand(4)

    def forward(self, a, b):
        add_1 = a + b
        linear = self.linear(add_1)
        add_2 = linear + self.e
        return add_2
```
JSON:

```
{
    "modules": {},
    "weights": {
        "linear.weight": {
            "dtype": "torch.float32",
            "is_quantized": false,
            "shape": "[4, 4]"
        },
        "linear.bias": {
            "dtype": "torch.float32",
            "is_quantized": false,
            "shape": "[4]"
        },
        "e": {
            "dtype": "torch.float32",
            "is_quantized": false,
            "shape": "[4]"
        }
    },
    "nodes": [
        {
            "shape": "[4]",
            "dtype": "torch.float32",
            "target": "a",
            "op_code": "placeholder",
            "name": "a",
            "args": [],
            "kwargs": {}
        },
        {
            "shape": "[4]",
            "dtype": "torch.float32",
            "target": "b",
            "op_code": "placeholder",
            "name": "b",
            "args": [],
            "kwargs": {}
        },
        {
            "shape": "[4]",
            "dtype": "torch.float32",
            "target": "_operator.add",
            "op_code": "call_function",
            "name": "add_1",
            "args": [
                {
                    "is_node": true,
                    "name": "a"
                },
                {
                    "is_node": true,
                    "name": "b"
                }
            ],
            "kwargs": {}
        },
        {
            "target": "linear",
            "op_code": "call_module",
            "name": "linear_1",
            "args": [
                {
                    "is_node": true,
                    "name": "add_1"
                }
            ],
            "kwargs": {}
        },
        {
            "shape": "[4]",
            "dtype": "torch.float32",
            "target": "e",
            "op_code": "get_attr",
            "name": "e",
            "args": [],
            "kwargs": {}
        },
        {
            "shape": "[4]",
            "dtype": "torch.float32",
            "target": "_operator.add",
            "op_code": "call_function",
            "name": "add_2",
            "args": [
                {
                    "is_node": true,
                    "name": "linear_1"
                },
                {
                    "is_node": true,
                    "name": "e"
                }
            ],
            "kwargs": {}
        },
        {
            "shape": "[4]",
            "dtype": "torch.float32",
            "target": "output",
            "op_code": "output",
            "name": "output",
            "args": [
                {
                    "is_node": true,
                    "name": "add_2"
                }
            ],
            "kwargs": {}
        }
    ]
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47612

Reviewed By: scottxu0730

Differential Revision: D24836223

Pulled By: gcatron

fbshipit-source-id: d3da2b5f90d143beba3b7f1f67462fb7430df906
2020-11-10 11:54:02 -08:00
5cba3cec5a fix extensions build flags on newer GPUs (#47585)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47352

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47585

Reviewed By: heitorschueroff

Differential Revision: D24833654

Pulled By: ezyang

fbshipit-source-id: eaec5b8db5f35cac0a74d2858cb054a3853b0990
2020-11-10 11:38:18 -08:00
1a55f5b3ea [ONNX] Update batch_norm symbolic to handle track_running_stats=False (#47135)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47135

Reviewed By: ejguan

Differential Revision: D24735802

Pulled By: bzinodev

fbshipit-source-id: bbb29d92d46a8a74dac0cb01639ddd4ec121a54c
2020-11-10 11:31:33 -08:00
ccc53901bd Update CONTRIBUTING and gitignore for docs build (#47539)
Summary:
This PR tries to make building the docs less confusing for new contributors:

- `npm` is discouraged on devservers for Facebook employees, so I added another way to install `katex`
- the path to `check-doxygen.sh` was wrong, so I fixed it
- while generating the CPP docs, it created two new folders that weren't ignored by Git, so I added those to `.gitignore`
- I wasn't able to get the SSH tunnel to work, so I added instructions to use `scp` as an alternative

I'm not entirely sure how the `docs/cpp/source/{html,latex}/` directories were created since I haven't been able to reproduce them.

I also think that it would be better to use the SSH tunnel since `scp` is so much slower, but I just wasn't able to figure it out; I followed the instructions from `CONTRIBUTING.md` and then ran a [Python `http.server`](https://docs.python.org/3/library/http.server.html) on my devserver:
```bash
python -m http.server 8000 --bind 127.0.0.1 --directory build/html
```
but my browser failed to connect and my (local) terminal printed error messages (presumably from the SSH command).

If anyone knows how to properly set up the SSH tunnel and HTTP server, I can add those more detailed instructions to `CONTRIBUTING.md` and remove the `scp` instructions from this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47539

Reviewed By: malfet

Differential Revision: D24806833

Pulled By: samestep

fbshipit-source-id: 456691018a76efadde28fa5eb783b0895582e72d
2020-11-10 11:04:34 -08:00
cc337069e0 .circleci: Add python 3.9 to linux binary build matrix (#47235)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47235

Depends on https://github.com/pytorch/builder/pull/565

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D24863739

Pulled By: seemethere

fbshipit-source-id: ed78087bb7aae118af7a808d7b5620d6c9b8cb26
2020-11-10 10:56:50 -08:00
22d56319ee Moving hypothesis and other installations to Docker (#47451)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/31136

This PR:
1. moves several installations to Docker from `test.sh` for both PyTorch and Caffe2
2. removes version fixing for numba and llvmlite as the issue linked has been resolved

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47451

Reviewed By: walterddr

Differential Revision: D24791350

Pulled By: janeyx99

fbshipit-source-id: bf36cd419e30d9e02622ad7c7049fbc724c89579
2020-11-10 10:42:16 -08:00
fa560ceb9c [reland] make intrusive_ptr as a pybind holder type (#47586)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47586

relanding PR of https://github.com/pytorch/pytorch/pull/44492, and add
additional Capsule related wrapping to ensure we still have the correct
type in pybind11 to resolve Capsule as torch._C.CapsuleType

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24822519

Pulled By: wanchaol

fbshipit-source-id: eaaea446fb54b56ed3b0d04c31481c64096e9459
2020-11-10 10:09:08 -08:00
780f854135 Clear Shape info in frozen modules (#47511)
Summary:
To ensure that frozen models produced from traced models are not over-optimized, clear the shape info in the frozen model.

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47511

Reviewed By: eellison

Differential Revision: D24792849

Pulled By: bzinodev

fbshipit-source-id: 5dc7c4d713a113c23d59cabf5541b3c58b075b43
2020-11-10 09:49:58 -08:00
1c45631f10 Revert D24737050: [WIP] Adding bunch of unary foreach APIs
Test Plan: revert-hammer

Differential Revision:
D24737050 (b6a2444eff)

Original commit changeset: deb59b41ad1c

fbshipit-source-id: 76cd85028114cfc8fc5b7bb49cd27efc2e315aa5
2020-11-10 09:41:41 -08:00
5882f2e540 [caffe2][memonger] Add support for distributed inference predict nets in DAG memonger
Summary:
Distributed Inference splits a predict net into multiple parts, part0 being the main part which contains ops to make remote calls to other parts. part0 predict net may contain AsyncIf ops to optimize rpc call usage. AsyncIf ops have internal nets which may refer to memongered blobs. This change handles AsyncIf ops to update internal nets to refer to memongered blobs. Here is one reference part0 predict net with AsyncIf ops: https://www.internalfb.com/intern/paste/P145812115/

As part of this change, I am also updating dag memonger traversal to always start from root op, i.e. ops with 0 in degree. Earlier logic will start traversing ops based on input head blobs and if one of the head inputs is getting used in a non-root op which gets visited before its parent, the traversal will throwing assertion error here: https://fburl.com/diffusion/ob110s9z . Almost for all the distributed inference part0 nets, it was throwing this assertion error.

Reviewed By: hlu1

Differential Revision: D24346771

fbshipit-source-id: ad2dd2e63f3e822ad172682f6d63f8474492255d
2020-11-10 09:35:28 -08:00
1bf3dc51ae [JIT] Add __prepare_scriptable__ duck typing to allow replacing nn.modules with scriptable preparations (#45645)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45072

As discussed with zdevito gchanan cpuhrsch and suo, this change allows developers to create custom preparations for their modules before scripting. This is done by adding a `__prepare_scriptable__` method to a module which returns the prepared scriptable module out-of-place. It does not expand the API surface for end users.

Prior art by jamesr66a: https://github.com/pytorch/pytorch/pull/42244

cc: zhangguanheng66

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45645

Reviewed By: dongreenberg, ngimel

Differential Revision: D24039990

Pulled By: zhangguanheng66

fbshipit-source-id: 4ddff2d353124af9c2ef22db037df7e3d26efe65
2020-11-10 08:59:45 -08:00
6bb18b24fb [quant][qat] Ensure observer respects device affinity (#47514)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47514

Previously, the scale and zero_point were returned on the CPU even if the input tensor was on the GPU. This is because `copy_()` doesn't respect the device when copying over the tensor.

Also fixed a bug where we were always setting the device to 'cuda' (irrespective of the device id) in the calculate_qparams function.
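
A minimal repro sketch (`MinMaxObserver` is used for illustration and a second GPU is assumed; the test's actual observer may differ):

```python
import torch
from torch.quantization import MinMaxObserver

obs = MinMaxObserver().to("cuda:1")    # observer on a non-default GPU
obs(torch.randn(16, device="cuda:1"))  # record min/max on that device
scale, zero_point = obs.calculate_qparams()
# After the fix, scale/zero_point live on cuda:1 rather than CPU or cuda:0.
```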

Test Plan:
python test/test_quantization.py TestObserver.test_observer_qparams_respects_device_affinity

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24800495

fbshipit-source-id: d7a76c59569842ed69029d0eb4fa9df63f87e28c
2020-11-10 08:43:52 -08:00
abae12ba41 only set ccbin flag if not provided by user (#47404)
Summary:
Avoid an nvcc error if the user specifies a C compiler (as pointed out in https://github.com/pytorch/pytorch/issues/47377)

Fixes https://github.com/pytorch/pytorch/issues/47377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47404

Reviewed By: ejguan

Differential Revision: D24748833

Pulled By: malfet

fbshipit-source-id: 1a4ad1f851c8854795f7f98e28f479a0ff458a00
2020-11-10 07:55:57 -08:00
65a72cae2c Fix type promotion for trace on CPU. (#47305)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47305

Fixes https://github.com/pytorch/pytorch/issues/47127.

Ideally this would just use diag and sum (as the CUDA implementation does), but that seems to have performance problems, which I'll link in the github PR.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D24729627

Pulled By: gchanan

fbshipit-source-id: 151b786b53e7b958f0929c803dbf8e95981c6884
2020-11-10 07:46:03 -08:00
57dcb04239 Batched gradient support for view+inplace operations (#47227)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47227

Motivation
----------
We would like to compute batched gradients for view+inplace operations.
This most notably shows up in internal implementation of operations.
For example, many view backward functions (SelectBackward, DiagonalBackward)
are implemented with view+inplace, so to support vectorized hessian
computation for e.g. torch.select and torch.diagonal we would need a
way to handle or workaround view+inplace.

Approach
--------
view+inplace creates a CopySlices node and transmute view backward nodes
into an AsStrided node. For example,

```
leaf = torch.randn(4, 5, requires_grad=True)
base = leaf * leaf
view = base[0]
view.cos_()
```

base.grad_fn is CopySlices and view.grad_fn is AsStridedBackward.

To support vmap over CopySlices and AsStridedBackward:
- We use `new_empty_strided` instead of `empty_strided` in CopySlices
so that the batch dims get propagated
- We use `new_zeros` inside AsStridedBackward so that the batch dims get
propagated.

Test Plan
---------
- New tests. When we get closer to having most operations support batched
grad computation via vmap, I'd like to add it as an option to gradcheck
and turn it on for our tests.

Test Plan: Imported from OSS

Reviewed By: kwanmacher, glaringlee

Differential Revision: D24741687

Pulled By: zou3519

fbshipit-source-id: 8210064f782a0a7a193752029a4340e505ffb5d8
2020-11-10 07:38:02 -08:00
22d21414d7 Revert D24574649: [pytorch][PR] Utility that loads a DP/DDP model state dict into a non-DDP model with the same architecture.
Test Plan: revert-hammer

Differential Revision:
D24574649 (b631c872c9)

Original commit changeset: 17d29ab16ae2

fbshipit-source-id: 6766c6b21b82c9463143da0370192d9c68dbce6c
2020-11-10 06:55:45 -08:00
f2eac5df18 [NNC] Fix lowering of aten::remainder (#47611)
Summary:
Fix an issue with the TensorExpr lowering of aten::remainder with integral inputs. We were always lowering to fmod and never to Mod.
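
The semantic difference being restored, for illustration:

```python
import torch

a = torch.tensor([-7, 7])
b = torch.tensor([3, -3])
print(torch.remainder(a, b))  # tensor([ 2, -2]): sign follows the divisor (Mod)
print(torch.fmod(a, b))       # tensor([-1,  1]): sign follows the dividend (fmod)
```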

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47611

Reviewed By: bertmaher, heitorschueroff

Differential Revision: D24846929

Pulled By: nickgg

fbshipit-source-id: adac4322ced5761a11a8e914debc9abe09cf5637
2020-11-09 21:45:42 -08:00
0b30a8d007 [NNC] Simplify and fix some bugs in Bounds Inference (#47450)
Summary:
Refactors NNC bounds inference to use the dependency analysis added in https://github.com/pytorch/pytorch/issues/46952. This ends up being a pretty good simplification because we no longer need the complicated bound merging code that we used to determine contiguous ranges. There were no usages of that code and the memory dependency analyzer is closer to what we want for those use cases anyway.

Added tests for a few cases uncovered by the existing bounds inference test - much of the coverage for this feature is in tests of its uses: rfactor, computeAt and cacheAccesses.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47450

Reviewed By: heitorschueroff

Differential Revision: D24834458

Pulled By: nickgg

fbshipit-source-id: f93e40b09c0745dcc46c7e34359db594436d04f0
2020-11-09 21:37:04 -08:00
c8a42c32a1 Allow large inputs to svd_lowrank. Fix inaccuracy in torch.svd docs. (#47440)
Summary:
As in title.

Fixes https://github.com/pytorch/pytorch/issues/42062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47440

Reviewed By: bdhirsh

Differential Revision: D24790628

Pulled By: mruberry

fbshipit-source-id: 1442eb884fbe4ffe6d9c78a4d0186dd0b1482c9c
2020-11-09 21:04:48 -08:00
52fe73a39e Enable Python code coverage for onnx runs (#47387)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44120

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47387

Reviewed By: heitorschueroff

Differential Revision: D24737378

Pulled By: janeyx99

fbshipit-source-id: 79e3d0b62f7da0617330f312fb1ed548c6be2a3b
2020-11-09 20:52:14 -08:00
b631c872c9 Utility that loads a DP/DDP model state dict into a non-DDP model with the same architecture. (#45643)
Summary:
Added a convenience function that allows users to load models without DP/DDP from a DP/DDP state dict.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45643

Reviewed By: rohan-varma

Differential Revision: D24574649

fbshipit-source-id: 17d29ab16ae24a30890168fa84da6c63650e61e9
2020-11-09 20:49:29 -08:00
49d5b4d1e1 move helper functions out of Partitioner class (#47515)
Summary:
This PR moves some helper functions out of the Partitioner class. This makes the Partitioner class cleaner and the helper functions easier to reuse in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47515

Reviewed By: gcatron, heitorschueroff

Differential Revision: D24844751

Pulled By: scottxu0730

fbshipit-source-id: 04397d0ce995cf96943df0a2b9265a521177b4de
2020-11-09 20:42:10 -08:00
4841e9ef33 Add Vulkan op Conv2D. (#46900)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46900

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D24568211

Pulled By: AshkanAliabadi

fbshipit-source-id: 2819c8308292055aa4e8130109d8764d885c1340
2020-11-09 20:39:20 -08:00
ce11dbbb48 Vulkan tweaks (#47261)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47261

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Differential Revision: D24837714

Pulled By: AshkanAliabadi

fbshipit-source-id: 221258c03a7f2304a3b34ad550c458c49a108cd0
2020-11-09 20:34:20 -08:00
8aca85dbcd Add diagflat complex support (#47564)
Summary:
Adds complex number support for `torch.diagflat`:
``` python
>>> import torch
>>> a = torch.ones(2, dtype=torch.complex128)
>>> torch.diagflat(a)
tensor([[1.+0.j, 0.+0.j],
        [0.+0.j, 1.+0.j]], dtype=torch.complex128)
>>> b = a.cuda()
>>> torch.diagflat(b)
tensor([[1.+0.j, 0.+0.j],
        [0.+0.j, 1.+0.j]], device='cuda:0', dtype=torch.complex128)
```

Note that automatic differentiation isn't implemented:
``` python
>>> d = torch.ones(1, dtype=torch.complex128, requires_grad=True)
>>> torch.diagflat(d)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: diag does not support automatic differentiation for outputs with complex dtype.
```

Fixes https://github.com/pytorch/pytorch/issues/47499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47564

Reviewed By: heitorschueroff

Differential Revision: D24844467

Pulled By: anjali411

fbshipit-source-id: 9c8cb795d52880b7dcffab0c059b0f6c2e5ef151
2020-11-09 20:28:23 -08:00
79f8582289 [ONNX] Add export of aten::is_floating point (#46442)
Summary:
Add export of aten::is_floating point

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46442

Reviewed By: mrshenli

Differential Revision: D24566156

Pulled By: bzinodev

fbshipit-source-id: 91ea95e2c4d4866e2ef51bffe07461de2e31c110
2020-11-09 18:02:47 -08:00
3dd266304c Fix inaccurate note in DistributedDataParallel (#47156)
Summary:
Sorry for my previous inaccurate [PR](https://github.com/pytorch/pytorch/pull/42471#issue-462329192).

Here is some toy code to illustrate my point:

* non-DistributedDataParallel version

```python
import torch

if __name__ == "__main__":
    torch.manual_seed(0)
    inp = torch.randn(1,16)
    inp = torch.cat([inp, inp], dim=0)
    model = torch.nn.Linear(16, 2)
    loss_func = torch.nn.CrossEntropyLoss()
    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()
    loss = loss_func(model(inp), torch.tensor([0, 0]))
    loss.backward()
    opti.step()

    print("grad:", model.weight.grad)
    print("updated weight:\n", model.weight)
```

* DistributedDataParallel version

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size):
    torch.manual_seed(0)
    x = torch.randn(1,16)

    model = torch.nn.Linear(16, 2)
    model = torch.nn.parallel.DistributedDataParallel(model)
    loss_func = torch.nn.CrossEntropyLoss()
    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()

    y = model(x)

    label = torch.tensor([0])
    loss = loss_func(y, label)

    loss.backward()
    opti.step()

    if rank == 0:
        print("grad:", model.module.weight.grad)
        print("updated weight:\n", model.module.weight)

def init_process(rank, size, fn, backend="gloo"):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 2
    process = []
    for rank in range(size):
        p = Process(target=init_process, args=(rank, size, run))
        p.start()
        process.append(p)

    for p in process:
        p.join()
```

Both pieces of code produce the same output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47156

Reviewed By: mruberry

Differential Revision: D24675199

Pulled By: mrshenli

fbshipit-source-id: 1238a63350a32a824b4b8c0018dc80454ea502bb
2020-11-09 17:42:57 -08:00
8b3f1d1288 [caffe2] Add __slots__ to all classes in schema.py (#47541)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47541

The profiler has guided us to `schema.py`. Since these `Field`s are used everywhere and in huge quantities, we can easily make some optimizations system wide by adding `__slots__`.

From StackOverflow, benefits include:

* faster attribute access.
* space savings in memory.

Read more: https://stackoverflow.com/a/28059785/
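
A generic illustration of the pattern (not the actual `schema.py` code):

```python
class SlottedField:
    # A fixed attribute set removes the per-instance __dict__,
    # which saves memory and speeds up attribute access.
    __slots__ = ("name", "metadata")

    def __init__(self, name, metadata):
        self.name = name
        self.metadata = metadata
```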

Reviewed By: dzhulgakov

Differential Revision: D24771078

fbshipit-source-id: 13f6064d367440069767131a433c820eabfe931b
2020-11-09 16:16:28 -08:00
2f617c5104 skip GPU test on sandcastle if sanitizer is enabled (#47626)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47626

`caffe2/test:cuda` was safeguarded by a GPU availability check; however, most of the mixed CPU/GPU tests weren't.

Use `TEST_WITH_*SAN` flags to safeguard test discovery for CUDA tests.

Test Plan: sandcastle

Reviewed By: janeyx99

Differential Revision: D24842333

fbshipit-source-id: 5e264344a0b7b98cd229e5bf73c17433751598ad
2020-11-09 16:06:58 -08:00
86bb413600 Optimize backward for torch.repeat (#46726)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46726

Fixes #43192

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D24739840

Pulled By: ejguan

fbshipit-source-id: ddf21fc52c4676de25ad7bfb0b5c1c23daa77ee6
2020-11-09 15:12:40 -08:00
4c52a56c40 [caffe2] Properly call super init in schema.py (#47542)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47542

The previous way of doing `Field.__init__(self, [])` is just wrong. Switching to the Python 2 compatible way: `super(ObjectName, self).__init__(...)`.
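
Schematically (`Field`/`Scalar` here are illustrative stand-ins):

```python
class Field(object):
    def __init__(self, children):
        self.children = children

class Scalar(Field):
    def __init__(self):
        # Before: Field.__init__(self, [])
        # After, Python 2 and 3 compatible:
        super(Scalar, self).__init__([])
```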

Reviewed By: dzhulgakov

Differential Revision: D24771077

fbshipit-source-id: d6798c72090c0264b6c583602cae441a1b14587c
2020-11-09 15:02:22 -08:00
b6a2444eff [WIP] Adding bunch of unary foreach APIs (#47383)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47383

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D24737050

Pulled By: izdeby

fbshipit-source-id: deb59b41ad1c79b66cafbd9a9d3d6b069794e743
2020-11-09 14:14:28 -08:00
5686d2428c [ONNX] Slightly improve indexing with ellipsis under scripting (#46571)
Summary:
This still depends on the rank of the original tensor being known at export time, i.e.
```python
x[i, j, k] = y  # rank of x must be known at export time
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46571

Reviewed By: mrshenli

Differential Revision: D24657502

Pulled By: bzinodev

fbshipit-source-id: 6ec87edb67be06e34526225e701954fcfc5606c8
2020-11-09 14:05:56 -08:00
a49367e9c9 Update the docs of torch.eig about derivative (#47598)
Summary:
Related: https://github.com/pytorch/pytorch/issues/33090
I just realized that I haven't updated the docs of `torch.eig` when implementing the backward.
Here's the PR updating the docs about the grad of `torch.eig`.

cc albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47598

Reviewed By: heitorschueroff

Differential Revision: D24829373

Pulled By: albanD

fbshipit-source-id: 89963ce66b2933e6c34e2efc93ad0f2c3dd28c68
2020-11-09 13:28:27 -08:00
4159191f0e [pytorch] split out trace type generator and migrate to new codegen model (#47438)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47438

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D24808211

Pulled By: ljk53

fbshipit-source-id: 44dfadf550a255c05aa201e54b48101aaf722885
2020-11-09 12:39:39 -08:00
499d2fad98 [pytorch] factor out return_names api (#47437)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47437

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D24808213

Pulled By: ljk53

fbshipit-source-id: 8ec6d58952fd677ab2d97e63b060cafda052411a
2020-11-09 12:39:37 -08:00
8d1a6ae51d [pytorch] TraceType codegen tweak - newline before redispatch call (#47436)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47436

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D24808212

Pulled By: ljk53

fbshipit-source-id: a78c27ff76e1f6324eb2ae25467dec72b6b09b87
2020-11-09 12:39:34 -08:00
e26c1726cf [ONNX] Fix scripting rand/randn/where (#45793)
Summary:
- rand/randn: the type signature of int[] is different in scripting, thus failing the check.
- where: scripting produces dynamic cases which are supported by the `unbind` export in higher opsets.
- test_list_pass: this test fails when using the new scripting API; should be fixed by https://github.com/pytorch/pytorch/issues/45369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45793

Reviewed By: mrshenli

Differential Revision: D24566096

Pulled By: bzinodev

fbshipit-source-id: 6fe0925c66dee342106d71c9cbc3c95cabe639f7
2020-11-09 12:39:31 -08:00
a08e8dd70c Fix python 3.9 builds on Windows (#47602)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47460.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47602

Reviewed By: heitorschueroff

Differential Revision: D24832487

Pulled By: malfet

fbshipit-source-id: 8846caeac5e767e8066470d5c981218f147c88dc
2020-11-09 12:39:28 -08:00
6214d0ad88 Update nccl commit tag to head of v2.8 branch (#47603)
Summary:
Previous head of v2.8 branch was force-updated from `cd5a9b73c3028d2496666201588111a8c8d84878` to `31b5bb6f6447da98b9110c605465f9c09621074e`

Fixes https://github.com/pytorch/pytorch/issues/47529

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47603

Reviewed By: seemethere, janeyx99

Differential Revision: D24832450

Pulled By: malfet

fbshipit-source-id: ea141b207d7d8e92300ba286cde3cda3773adf51
2020-11-09 12:36:27 -08:00
ead86b2419 Add batching rule for torch.clone(tensor, torch.contiguous_format) (#47365)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47365

I wanted to avoid defining vmap behavior over contiguous_format for as
long as possible. This is potentially ambiguous; consider the following:
```
>>> x = torch.randn(3, B0, 5)
>>> y = vmap(lambda x: x.clone(torch.contiguous_format), in_dims=1,
out_dims=1)(x)
>>> y[:,0].is_contiguous()  # ??
```
There are two possible ways to interpret this operation (if we choose to
allow it to succeed):
1. Each per-sample becomes contiguous, so y[:,0] is contiguous.
2. The output of vmap is contiguous (so y is contiguous, but y[:,0] is
not)

(1) makes more sense because vmap operates on a per-sample level.
This makes sense when combined with the vmap fallback:
- there are places in the codebase where we perform .contiguous() and
then pass the result to an operator `op` that only accepts contiguous
inputs.
- If we vmap over such code and don't have a batching rule implemented for
`op`, then we want the per-samples to be contiguous so that
when `op` goes through the vmap fallback, it receives contiguous
per-samples.

(1) is the approach we've selected for this PR.

Motivation
----------
To vmap over CopySlices, we have to vmap over a clone(contiguous_format)
call:
e4bc785dd5/torch/csrc/autograd/functions/tensor.cpp (L93)

Alternatives
------------
- Implementing (2) is difficult in the current design because vmap is
allowed to move batch dimensions to the front of the tensor. We would
need some global information about the in_dims and out_dims passed to
vmap.
- We could also error out if someone calls clone(contiguous_format) and
the batch dims are not at the front. This would resolve the ambiguity at
the cost of limiting what vmap can do.

Future Work
-----------
- Add to a "vmap gotchas" page the behavior of contiguous_format.
- Implement is_contiguous, Tensor.contiguous() with the same semantics.
Those currently error out.

Test Plan
---------
- new tests

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D24741683

Pulled By: zou3519

fbshipit-source-id: 3ef5ded1b646855f41d39dcefe81129176de8a70
2020-11-09 11:36:48 -08:00
7bc8fdb6d7 as_strided batching rule (#47364)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47364

This PR adds a batching rule for as_strided. `as_strided` is a really weird
operation and I hope that users don't use it very much.

Motivation
----------
The motivation for adding a batching rule for as_strided is for
batched gradient computation.

AsStridedBackward appears in PyTorch when handling view+in-place
operations and calls `as_strided`. AsStridedBackward calls as_strided on
a fresh tensor with storage_offset equal to 0. We would like to be able
to vmap through the backward graph of view+in-place operations to
for batched gradient computation, especially because internally we have
a number of functions that are implemented as a view+in-place.

Alternatives
------------
If we think that as_strided is too crazy to have a batching rule, we
could either:
- have a flag that controls the autograd view+in-place
behavior
- require that the input tensor's storage offset must be equal to 0
to make it easier to reason about.

I think the batching rule makes sense, so I didn't pursue the
alternatives.

The batching rule
-----------------
```
y = vmap(lambda x: x.as_strided(sizes, strides, offset))(xs)
```
The result of the above should be "equivalent" to:
- Assume that each x has storage offset equal to xs.storage_offset()
(call that S).
- Calling as_strided with (sizes, sizes, offset + x[i].storage_offset() - S) on each x.

More concretely,
this returns a view on `xs`, such that each y[i] has:
- sizes: `sizes`
- strides: `strides`
- storage_offset: offset + i * x.stride(batch_dim)

Why the behavior can be weird
-----------------------------
The behavior of the batching rule may be different from actually running
as_strided in a for-loop because `as_strided` takes in `offset` as a
"absolute offset". As an example, consider

```
>>> x = torch.tensor([0., 1., 2., 3., 4.])
>>> z = [x[i].as_strided([1], [1], 0) for i in range(5)]
```
Each z[i] is actually the same view on x (z[i] == torch.tensor([0.]))!
However, we consider the above for-loop comprehension to be a user error:
a user should have written the following if they wanted to use as_strided
in a per-sample way:
```
>>> z = [x[i].as_strided([1], [1], 0 + x[i].storage_offset()) for i in range(5)]
```

Test Plan
---------
- Added some tests that compare vmap+as_strided to vmap+(the equivalent operator)

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D24741685

Pulled By: zou3519

fbshipit-source-id: c1429caff43bfa33661a80bffc0daf2c0eea5564
2020-11-09 11:36:44 -08:00
77c49e65d5 [tensorexpr] Fix registration of intrinsics on llvm-fb (#47540)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47540

In FB's hybrid llvm 7/8 flavor, we (read: I) forgot to register
intrinsics.  It was... a bit annoying to figure out how to do this, and I'm
sure it could be done more efficiently by someone who isn't just cargo-culting
the API from KaleidoscopeJIT.  Anyways.

There are kind of 3 independent changes here but they're a bit annoying to separate out, so:

0. (trivial) Add the correct #defines to the internal build to run test_llvm.
1. (easy) add an assertSuccess function to convert llvm::Errors into
   `TORCH_INTERNAL_ASSERT`s, for better/easier debugging.
2. (medium) Factor out the gigantic register-all-the-things function into a
   helper so we can call it from both the LLVM and LLVM-FB constructors.
3. (hard) Fix the symbol resolver in llvm-fb to do a lookup using the
   ExecutionSession.  This is the bit I don't really understand; it feels like the
   CompileLayer lookup should find these symbols but it doesn't.  Whatever.

Test Plan: `buck test //caffe2/test/cpp/tensorexpr:tensorexpr`

Reviewed By: asuhan

Differential Revision: D24807361

fbshipit-source-id: 8bb0d632dff6a065963ed14a600614cd21fbb095
2020-11-09 11:36:40 -08:00
70d34718b8 [fx] add missing modules for type annoations (#47537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47537

When a module only appears in a type constructor List[torch.Tensor],
it previously didn't get added to the list of used modules. This fixes it
by introspecting on the type constructor.
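
An illustrative shape of the bug (assumed, not taken from the PR's tests): `torch` appears only inside the annotation's type constructor, so code generation previously failed to record it as a used module.

```python
from typing import List

import torch
from torch.fx import symbolic_trace

class M(torch.nn.Module):
    def forward(self, xs: List[torch.Tensor]) -> torch.Tensor:
        return xs[0] + 1

gm = symbolic_trace(M())  # generated code must import torch for the annotation
```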

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D24806317

Pulled By: zdevito

fbshipit-source-id: 263391af71e1f2156cbefaab95b9818c6b9aaae1
2020-11-09 11:36:36 -08:00
fbffd959ca Fix compiler warning variable "num_ivalue_args" was declared but never referenced detected during: (#47494)
Summary:
```
/home/gaoxiang/.local/lib/python3.8/site-packages/torch/include/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h(326): warning: variable "num_ivalue_args" was declared but never referenced
          detected during:
            instantiation of "std::decay_t<c10::guts::infer_function_traits<Functor>::type::return_type> c10::impl::call_functor_with_args_from_stack_<Functor,AllowDeprecatedTypes,ivalue_arg_indices...>(Functor *, c10::Stack *, std::index_sequence<ivalue_arg_indices...>) [with Functor=c10::impl::WrapFunctionIntoRuntimeFunctor<std::decay_t<__nv_bool ()>>, AllowDeprecatedTypes=false, ivalue_arg_indices=<>]"
(346): here
            instantiation of "std::decay_t<c10::guts::infer_function_traits<Functor>::type::return_type> c10::impl::call_functor_with_args_from_stack<Functor,AllowDeprecatedTypes>(Functor *, c10::Stack *) [with Functor=c10::impl::WrapFunctionIntoRuntimeFunctor<std::decay_t<__nv_bool ()>>, AllowDeprecatedTypes=false]"
(396): here
            instantiation of "void c10::impl::make_boxed_from_unboxed_functor<KernelFunctor, AllowDeprecatedTypes>::call(c10::OperatorKernel *, const c10::OperatorHandle &, c10::Stack *) [with KernelFunctor=c10::impl::WrapFunctionIntoRuntimeFunctor<std::decay_t<__nv_bool ()>>, AllowDeprecatedTypes=false]"
/home/gaoxiang/.local/lib/python3.8/site-packages/torch/include/ATen/core/boxing/KernelFunction_impl.h(109): here
            instantiation of "c10::KernelFunction c10::KernelFunction::makeFromUnboxedFunctor<AllowLegacyTypes,KernelFunctor>(std::unique_ptr<c10::OperatorKernel, std::default_delete<c10::OperatorKernel>>) [with AllowLegacyTypes=false, KernelFunctor=c10::impl::WrapFunctionIntoRuntimeFunctor<std::decay_t<__nv_bool ()>>]"
/home/gaoxiang/.local/lib/python3.8/site-packages/torch/include/ATen/core/boxing/KernelFunction_impl.h(175): here
            instantiation of "c10::KernelFunction c10::KernelFunction::makeFromUnboxedRuntimeFunction(FuncType *) [with AllowLegacyTypes=false, FuncType=__nv_bool ()]"
/home/gaoxiang/.local/lib/python3.8/site-packages/torch/include/torch/library.h(92): here
            instantiation of "torch::CppFunction::CppFunction(Func *, std::enable_if_t<c10::guts::is_function_type<Func>::value, std::nullptr_t>) [with Func=__nv_bool ()]"
/home/gaoxiang/.local/lib/python3.8/site-packages/torch/include/torch/library.h(457): here
            instantiation of "torch::Library &torch::Library::def(NameOrSchema &&, Func &&) & [with NameOrSchema=const char (&)[23], Func=__nv_bool (*)()]"
/home/gaoxiang/extension-jit/test.cu(6): here
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47494

Reviewed By: bdhirsh

Differential Revision: D24796223

Pulled By: ezyang

fbshipit-source-id: 598b94b4012beaa74c6bde0b96a9136a8a6bc4f2
2020-11-09 11:32:07 -08:00
4a2fb34042 check sparse sizes (#47148)
Summary:
Checks the sizes of sparse tensors when comparing them in assertEqual.
Removes the additional checks from safeCoalesce; safeCoalesce should not be a test of the `.coalesce()` function.
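
For illustration, two sparse tensors that agree on indices and values but differ in declared size:

```python
import torch

i = torch.tensor([[0, 1]])
v = torch.tensor([1.0, 2.0])
a = torch.sparse_coo_tensor(i, v, (2,))
b = torch.sparse_coo_tensor(i, v, (3,))
# Same indices and values, different sizes: assertEqual should now flag these.
```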

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47148

Reviewed By: mruberry

Differential Revision: D24823127

Pulled By: ngimel

fbshipit-source-id: 9303a6ff74aa3c9d9207803d05c0be2325fe392a
2020-11-09 10:33:24 -08:00
65e5bd23d8 [quant] Add _FusedModule type to capture all fused modules for quantization (#47484)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47484

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D24774703

fbshipit-source-id: f0efc5d77035b9854ec3e31a1d34f05d5680bc22
2020-11-09 10:28:45 -08:00
8339f88353 Add complex autograd support for torch.mean (#47566)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47566

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D24817013

Pulled By: anjali411

fbshipit-source-id: f2b8411fb9abdc3e2d07c8e4fef3071b76605b12
2020-11-09 08:31:10 -08:00
3d962430a9 Make gen_op_registration flake8 compliant (#47604)
Summary:
Fixes regression introduced by D24686838 (8182558c22)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47604

Reviewed By: walterddr

Differential Revision: D24832687

Pulled By: malfet

fbshipit-source-id: e9f7a35561c2b1705e11fd11abe402e3c83cf5cc
2020-11-09 08:31:07 -08:00
b80da89891 Batching rule for Tensor.new_empty_strided (#47226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47226

The batching rule is a little weird because it's not immediately obvious
what the strides of the result should be. If
tensor.new_empty_strided(size, stride) is called inside vmap and
`tensor` is being vmapped over, the result is a physical tensor with:
- size `[batch_shape] + size`
- strides `[S0, S1, ..., Sn] + stride` such that the
S0...Sn are part of a contiguous subspace and Sn is equal to the size of
the storage of `torch.empty_strided(size, stride)`.

I refactored some of the logic that computes the storage size for
`torch.empty_strided(size, stride)` into a helper function
`native::storage_size_for` and use it in the batching rule.
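
A Python sketch of that storage-size computation, assuming all sizes are positive (the real helper is the C++ `native::storage_size_for`):

```python
def storage_size_for(size, stride):
    # Largest reachable element offset plus one, i.e. the storage needed
    # by torch.empty_strided(size, stride).
    return 1 + sum((s - 1) * st for s, st in zip(size, stride))

print(storage_size_for([2, 3], [3, 1]))  # 6, contiguous layout
print(storage_size_for([2, 3], [1, 2]))  # 6, transposed layout
```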

Test Plan: - New tests in test/test_vmap.py

Reviewed By: ejguan

Differential Revision: D24741690

Pulled By: zou3519

fbshipit-source-id: f09b5578e923470d456d50348d86687a03b598d2
2020-11-09 08:31:04 -08:00
59aca02224 Implement Tensor.new_empty_strided(sizes, strides, *, dtype, device, requires_grad) (#47225)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47225

Summary
-------
This PR implements Tensor.new_empty_strided. Many of our torch.* factory
functions have a corresponding new_* method (e.g., torch.empty and
Tensor.new_empty), but there is no corresponding method for
torch.empty_strided. This PR adds one.
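
A quick usage sketch:

```python
import torch

x = torch.randn(2, 3, dtype=torch.float64)
# Like torch.empty_strided, but inherits dtype and device from `x`:
y = x.new_empty_strided((2, 2), (2, 1))
assert y.dtype == x.dtype and y.device == x.device
```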

Motivation
----------
The real motivation behind this is for vmap to be able to work through
CopySlices. CopySlices shows up a lot in double backwards because a lot
of view functions have backward formulas that perform view+inplace.

e0fd590ec9/torch/csrc/autograd/functions/tensor.cpp (L78-L106)

To support vmap through CopySlices, the approach in this stack is to:
- add `Tensor.new_empty_strided` and replace `empty_strided` in
CopySlices with that so that we can propagate batch information.
- Make some slight modifications to AsStridedBackward (and add
as_strided batching rule)

Please let me know if it would be better if I squashed everything related to
supporting vmap over CopySlices together into a single big PR.

Test Plan
---------
- New tests.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D24741688

Pulled By: zou3519

fbshipit-source-id: b688047d2eb3f92998896373b2e9d87caf2c4c39
2020-11-09 08:31:01 -08:00
4a58f35bef [caffe2] Fix duplicate name bug in Net.AddExternalInput (#47530)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47530

`Net.AddExternalInput` should raise if there are duplicate names. The previous code would only raise if the addition of duplicates was in separate calls, but not if it was in the same call.

Test Plan:
Added two new regression tests

```
    ✓ Pass: caffe2/caffe2/python:core_test - testSetInputRecordWithBlobs (caffe2.caffe2.python.core_test.TestExternalInputs) (9.622)
    ✓ Pass: caffe2/caffe2/python:core_test - testAddExternalInputShouldRaiseIfDuplicate (caffe2.caffe2.python.core_test.TestExternalInputs) (9.639)
    ✓ Pass: caffe2/caffe2/python:core_test - testSetInputRecordWithoutBlobs (caffe2.caffe2.python.core_test.TestExternalInputs) (9.883)
    ✓ Pass: caffe2/caffe2/python:core_test - testAddExternalInputShouldRaiseIfDuplicateInSameCall (caffe2.caffe2.python.core_test.TestExternalInputs) (10.153)
```

Test trained 2 models. No issues

f230755456
f230754926

Reviewed By: dzhulgakov

Differential Revision: D24763586

fbshipit-source-id: c87088441d76f7198f8b07508b2607aec13521ed
2020-11-09 08:30:58 -08:00
6248e0621c Revert D24801481: [pytorch][PR] Add AcceleratedGraphModule and serialzie GraphModule to JSON
Test Plan: revert-hammer

Differential Revision:
D24801481 (9e0102c10f)

Original commit changeset: 6b3fe69b51f7

fbshipit-source-id: f8287ef88b302e0f08d58090dc61603a4ef5cb3c
2020-11-09 08:28:22 -08:00
9e0102c10f Add AcceleratedGraphModule and serialzie GraphModule to JSON (#47233)
Summary:
Example:
```
class TestModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)
        self.e = torch.rand(4)

    def forward(self, a, b):
        add_1 = a + b
        linear = self.linear(add_1)
        add_2 = linear + self.e
        return add_2
```
JSON:
```
{
    "modules": {},
    "weights": {
        "linear.weight": {
            "dtype": "torch.float32",
            "is_quantized": false,
            "shape": "[4, 4]"
        },
        "linear.bias": {
            "dtype": "torch.float32",
            "is_quantized": false,
            "shape": "[4]"
        },
        "e": {
            "dtype": "torch.float32",
            "is_quantized": false,
            "shape": "[4]"
        }
    },
    "nodes": [
        {
            "shape": "[4]",
            "dtype": "torch.float32",
            "target": "a",
            "op_code": "placeholder",
            "name": "a",
            "args": [],
            "kwargs": {}
        },
        {
            "shape": "[4]",
            "dtype": "torch.float32",
            "target": "b",
            "op_code": "placeholder",
            "name": "b",
            "args": [],
            "kwargs": {}
        },
        {
            "shape": "[4]",
            "dtype": "torch.float32",
            "target": "_operator.add",
            "op_code": "call_function",
            "name": "add_1",
            "args": [
                {
                    "is_node": true,
                    "name": "a"
                },
                {
                    "is_node": true,
                    "name": "b"
                }
            ],
            "kwargs": {}
        },
        {
            "target": "linear",
            "op_code": "call_module",
            "name": "linear_1",
            "args": [
                {
                    "is_node": true,
                    "name": "add_1"
                }
            ],
            "kwargs": {}
        },
        {
            "shape": "[4]",
            "dtype": "torch.float32",
            "target": "e",
            "op_code": "get_attr",
            "name": "e",
            "args": [],
            "kwargs": {}
        },
        {
            "shape": "[4]",
            "dtype": "torch.float32",
            "target": "_operator.add",
            "op_code": "call_function",
            "name": "add_2",
            "args": [
                {
                    "is_node": true,
                    "name": "linear_1"
                },
                {
                    "is_node": true,
                    "name": "e"
                }
            ],
            "kwargs": {}
        },
        {
            "shape": "[4]",
            "dtype": "torch.float32",
            "target": "output",
            "op_code": "output",
            "name": "output",
            "args": [
                {
                    "is_node": true,
                    "name": "add_2"
                }
            ],
            "kwargs": {}
        }
    ]
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47233

Reviewed By: jackm321, yinghai

Differential Revision: D24801481

Pulled By: gcatron

fbshipit-source-id: 6b3fe69b51f7ac57f445675acdac36b0e563f73d
2020-11-08 19:26:02 -08:00
8182558c22 [PyTorch Mobile] Don't use __ROOT__ for inference only ops
Summary:
`__ROOT__` ops are only used in full-jit. To keep the binary size compact, disable them for inference. Since FL is still on full-jit, keep them for training only.

This saves ~17 KB for fbios.

TODO: when FL is migrated to lite_trainer, remove `__ROOT__` to save size in training too.

Test Plan: CI

Reviewed By: dhruvbird

Differential Revision: D24686838

fbshipit-source-id: 15214cebb9d8defa3fdac3aa0d73884b352aa753
2020-11-08 15:27:47 -08:00
16c72a5a6b [pytorch] continue to rewrite gen_python_functions.py with typed models (#46978)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46978

Refactored and added type annotations to the most part of the file.

Some top-level codegen functions are called by other codegen scripts.
Will migrate them in subsequent PRs.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24589210

Pulled By: ljk53

fbshipit-source-id: e0c7e5b3672b41983f321400c2e2330d1462e76e
2020-11-08 01:34:12 -08:00
4a7de2746f Add docs on how to toggle TF32 flags on C++ (#47331)
Summary:
I have been asked several times how to toggle this flag on libtorch. I think it would be good to mention it in the docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47331

Reviewed By: glaringlee

Differential Revision: D24777576

Pulled By: mruberry

fbshipit-source-id: cc2a338c477bb57e0bb74b8960c47fde99665e41
2020-11-08 01:29:24 -08:00
781e0ed835 Support RRef.backward() for Owner RRefs. (#46641)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46641

Second part of https://github.com/pytorch/pytorch/pull/46568, allows
RRef.backward() to work for owner RRefs.
ghstack-source-id: 115440252

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D24441300

fbshipit-source-id: 64af28e6b6ae47ea27e611a148f217bc344a4c5b
2020-11-07 21:25:32 -08:00
5a5258cb0d Support the strided tensor on input for torch.cat (#46859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46859

In the current implementation, non-contiguous inputs take the slow path. This change enables the fast path for non-contiguous inputs (up to 4 dimensions).
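
An illustrative input that previously took the slow path (a sketch; sizes are made up):

```python
import torch

x = torch.randn(4, 8, 2, device="cuda").transpose(0, 1)  # non-contiguous, 3-dim
y = torch.randn(4, 8, 2, device="cuda").transpose(0, 1)
out = torch.cat([x, y], dim=0)  # now eligible for the fast path (up to 4 dims)
```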

Test Plan:
#benchmark

before
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 17.126

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 20.652

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 20.412

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 48.265

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 52.964

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 71.111

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f8a3cdc2440>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f8a3cdc2440>, 111, 65], N: 5, dim: 0, device: cuda
Forward Execution Time (us) : 39.492

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7f8a3cdc2b90>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7f8a3cdc2b90>, 64], N: 5, dim: 1, device: cuda
Forward Execution Time (us) : 31.596

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7f880e7db3b0>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7f880e7db3b0>], N: 5, dim: 2, device: cuda
Forward Execution Time (us) : 66.668

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f880e7db5f0>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f880e7db5f0>, 32, 64], N: 50, dim: 0, device: cuda
Forward Execution Time (us) : 54.562

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7f880e7db680>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7f880e7db680>, 64], N: 50, dim: 1, device: cuda
Forward Execution Time (us) : 53.255

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7f880e7db710>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7f880e7db710>], N: 50, dim: 2, device: cuda
Forward Execution Time (us) : 69.771

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 98.438

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
Forward Execution Time (us) : 115.045

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
Forward Execution Time (us) : 476.497

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f880e7db7a0>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f880e7db7a0>], N: 100, dim: 0, device: cuda
Forward Execution Time (us) : 86.307

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f880e7db830>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f880e7db830>], N: 1000, dim: 0, device: cuda
Forward Execution Time (us) : 453.269

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f880e7db8c0>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f880e7db8c0>], N: 2000, dim: 0, device: cuda
Forward Execution Time (us) : 935.365

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f880e7db950>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f880e7db950>], N: 3000, dim: 0, device: cuda
Forward Execution Time (us) : 1355.937
```
after
```
WARNING:2020-11-01 21:14:23 3332963:3336757 EventProfilerController.cpp:143] (x1) Lost sample due to delays (ms): 488, 11, 4121, 0
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 17.174

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 20.399

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 23.349

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 47.847

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 53.463

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 72.789

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7fd5b5567710>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7fd5b5567710>, 111, 65], N: 5, dim: 0, device: cuda
Forward Execution Time (us) : 39.747

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7fd5b56b1320>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7fd5b56b1320>, 64], N: 5, dim: 1, device: cuda
Forward Execution Time (us) : 31.814

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7fd3a2289680>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7fd3a2289680>], N: 5, dim: 2, device: cuda
Forward Execution Time (us) : 67.202

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7fd3a2289710>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7fd3a2289710>, 32, 64], N: 50, dim: 0, device: cuda
Forward Execution Time (us) : 65.229

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7fd3a22897a0>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7fd3a22897a0>, 64], N: 50, dim: 1, device: cuda
Forward Execution Time (us) : 60.843

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7fd3a2289830>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7fd3a2289830>], N: 50, dim: 2, device: cuda
Forward Execution Time (us) : 69.756

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 98.222

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
Forward Execution Time (us) : 112.521

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
Forward Execution Time (us) : 477.736

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7fd3a22898c0>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7fd3a22898c0>], N: 100, dim: 0, device: cuda
Forward Execution Time (us) : 50.617

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7fd3a2289950>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7fd3a2289950>], N: 1000, dim: 0, device: cuda
Forward Execution Time (us) : 461.631

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7fd3a22899e0>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7fd3a22899e0>], N: 2000, dim: 0, device: cuda
Forward Execution Time (us) : 840.469

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7fd3a2289a70>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7fd3a2289a70>], N: 3000, dim: 0, device: cuda
Forward Execution Time (us) : 1317.866
```

Reviewed By: ngimel

Differential Revision: D24527676

fbshipit-source-id: 83d6431e59fa7e1748292b37f5d1fa4ab6242299
2020-11-07 17:24:44 -08:00
6e69a24a1d [ONNX] Reimplement _var_mean to ensure non-negative (#47240)
Summary:
The current `_var_mean` implementation cannot guarantee a non-negative variance, because it computes `E(X^2) - (E(X))^2`: numerically, when the number of elements is large and X is close to 0, the difference can come out negative (as our UT shows). The new implementation computes `E((X - E(X))^2)`, which is guaranteed non-negative because the expectation of a square is always non-negative.

The UT passes for the new implementation (but fails for the existing one). So it is good to go.
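
For intuition, a minimal numerical sketch (not part of this PR; the values are illustrative):
```python
import torch

# The one-pass formula E(X^2) - (E(X))^2 subtracts two nearly equal numbers,
# so rounding error can push the result below zero; the two-pass formula
# averages squares and therefore can never be negative.
x = torch.full((10_000_000,), 1e-4) + torch.randn(10_000_000) * 1e-8
one_pass = (x * x).mean() - x.mean() ** 2
two_pass = ((x - x.mean()) ** 2).mean()
print(one_pass.item(), two_pass.item())  # one_pass may come out slightly negative
```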

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47240

Reviewed By: ejguan

Differential Revision: D24735729

Pulled By: bzinodev

fbshipit-source-id: 136f448dd16622b2b46f40cdf6cb2fccf357c48d
2020-11-07 12:27:09 -08:00
f23a2a1115 The dimension being reduced should not be coalesced by TensorIterator (#47237)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37583#issuecomment-720172838

Also add an overload of `<<` for convenience when debugging.

This PR is tested by `test_reduction_split_cuda` which was added in https://github.com/pytorch/pytorch/pull/37788.

Reproduce
```python
import torch

a = torch.zeros(8, 1, 128, 1024, 1024)
a.cuda().sum(1)
```

Before

```
TensorIterator @ 0x7ffd05b10ba0 {
  ntensors() = 2
  noutputs() = 1
  shape() = [1073741824]
  strides(*) = {
    (0) = [4]
    (1) = [4]
  }
  dtype(*) = {
    (0) = Float
    (1) = Float
  }
  is_reduction_ = 1
}
```

After

```
TensorIterator @ 0x7fffc9051010 {
  ntensors() = 2
  noutputs() = 1
  shape() = [1, 1073741824]
  strides(*) = {
    (0) = [0, 4]
    (1) = [536870912, 4]
  }
  dtype(*) = {
    (0) = Float
    (1) = Float
  }
  is_reduction_ = 1
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47237

Reviewed By: ejguan

Differential Revision: D24734763

Pulled By: ngimel

fbshipit-source-id: 02bb2b15694c68f96434f55033b63b6e5ff7085b
2020-11-07 01:30:24 -08:00
29184f86b0 Correctly print out sign of near-zero double values (#47081)
Summary:
Inside IValue.h, we previously printed -0.0 as 0.0, which caused inconsistencies when using -0.0.
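
A small sketch (not from this PR) of why the sign of negative zero is observable:
```python
import torch

# Division by negative zero yields -inf, so printing -0.0 as "0.0" silently
# flips the sign of results computed from the round-tripped value.
print(torch.tensor([1.0]) / -0.0)  # tensor([-inf])
print(torch.tensor([1.0]) / 0.0)   # tensor([inf])
```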

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47081

Test Plan:
A new test case inside test_jit that divides a tensor by -0. and checks if it outputs -inf for all modes.

Fixes https://github.com/pytorch/pytorch/issues/46848

Reviewed By: mrshenli

Differential Revision: D24688572

Pulled By: gmagogsfm

fbshipit-source-id: 01a9d3f782e0711dd10bf24e6f3aa62eee72c895
2020-11-07 01:25:47 -08:00
c19eb4ad73 BoxWithNMSLimit support int batch_splits input (#47504)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47504

Allow int-type input for `batch_splits`.

Test Plan:
```
buck test caffe2/caffe2/python/operator_test:torch_integration_test -- test_box_with_nms_limits
```

Reviewed By: jackm321

Differential Revision: D24629522

fbshipit-source-id: 61cb132e792bddd8f9f1bca5b808f1a9131808f0
2020-11-07 00:27:51 -08:00
9d0c6e9469 Implement Complex tensor support in all reduce and all gather (#47523)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47523

Reviewed By: bdhirsh

Differential Revision: D24806743

Pulled By: gmagogsfm

fbshipit-source-id: 627a5a0654c603bc82b90e4cb3d924b4ca416fbe
2020-11-06 22:26:48 -08:00
f90da88d8f Add complex support for torch.mean [CUDA] (#47048)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47048

Reviewed By: heitorschueroff

Differential Revision: D24729895

Pulled By: anjali411

fbshipit-source-id: 8e948480eb87c37de810207edf909375c0380772
2020-11-06 21:29:19 -08:00
451e7d3db4 Enable diag for bool Tensors (#47455)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47455

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D24772483

Pulled By: H-Huang

fbshipit-source-id: 08ea4af4352972617db3c6475943b326f36b3049
2020-11-06 21:29:17 -08:00
3253ccbd9f Add bool tensor support for where (#47454)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47454

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D24772482

Pulled By: H-Huang

fbshipit-source-id: ea488aae5bf64ac20f7a5d001e8edf55eed16eaf
2020-11-06 21:26:24 -08:00
a1fef453b6 Support extra files in _load_for_mobile (#47425)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47425

Extra files can be exported in a lite interpreter model, but they could not be loaded back. This PR adds the capability to load extra files from a lite interpreter model. Because extra_files is a default argument, it should not affect existing usage of _load_for_mobile. The payload is simply assembled into a generic unordered_map, so no additional dependency is introduced and the size overhead should be small (to be tested).
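
For context, the eager TorchScript loader already exposes the analogous `_extra_files` hook; a usage sketch of that existing Python API (the file name and contents are illustrative):
```python
import torch

m = torch.jit.script(torch.nn.Linear(2, 2))
torch.jit.save(m, "model.pt", _extra_files={"metadata.json": '{"v": 1}'})

# Keys name the files to fetch; their contents are filled in during load.
extra = {"metadata.json": ""}
torch.jit.load("model.pt", _extra_files=extra)
print(extra["metadata.json"])
```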

Test Plan: Imported from OSS

Reviewed By: kwanmacher

Differential Revision: D24770266

Pulled By: iseeyuan

fbshipit-source-id: 7e8bd301ce734dbbf36ae56c9decb045aeb801ce
2020-11-06 20:26:54 -08:00
3f9697b10e Correctly compare Stream IValues (#47303)
Summary:
Stream IValue equality comparison was comparing the wrong object.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47303

Test Plan:
Added a new C++ test

Reviewed By: bdhirsh

Differential Revision: D24752434

Pulled By: gmagogsfm

fbshipit-source-id: 78bc7a812740485ebbc7cf0c06c2e671a7ccd26f
2020-11-06 17:29:09 -08:00
25d1fb519d Build nightly binaries only for the latest ROCM (#47503)
Summary:
ROCM 3.7 is no longer supported, so build nightly binaries only for the latest ROCM.
Also, delete the ROCM 3.5.1 docker image generation job.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47503

Reviewed By: seemethere

Differential Revision: D24789230

Pulled By: malfet

fbshipit-source-id: 36964f8e1096964f0ee2112e6ee67f29bcbd4373
2020-11-06 16:34:03 -08:00
e09ec8eefa Update the error message for retain_grad (#47084)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46588

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47084

Reviewed By: albanD

Differential Revision: D24632403

Pulled By: iramazanli

fbshipit-source-id: 8dfd50fcbb6ef585ea4f903e3755b5a807312235
2020-11-06 16:34:00 -08:00
7af9752fdc Fix rounding error flakiness in quantized_test (#47468)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47468

**Summary:** QuantizePerChannel4d and QuantizePerChannel4dChannelsLast have issues with flakiness on both ARM and x86 builds.

The flakiness stems from two sources:
1. The rounding strategy used by quantization for half values is to round the number to the nearest even integer (e.g. `4.5->4`, `5.5->6`, `6.5->6`); however, the above tests incorrectly expect the values to be rounded away from zero (see the sketch after item 2).

2. On ARM devices, `quantize_val_arm` calculates `zero_point + round(val / scale)`, which behaves differently from `quantize_val`, which calculates `zero_point + round(val * (1.0f/scale))`. This small distinction leaves enough room for floating-point arithmetic error to change the rounding behavior (e.g. `3 / .24 = 12.5` whereas `3 * (1.0f / .24) = 12.500001`).
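
A quick demonstration of the round-half-to-even behavior (illustrative; not part of the fix):
```python
import torch

# torch.round, like the quantization path, rounds ties to the nearest even
# integer rather than away from zero.
print(torch.round(torch.tensor([4.5, 5.5, 6.5])))  # tensor([4., 6., 6.])
```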

**Test Plan:**
For local builds:
```
python setup.py develop
./build/bin/quantized_test --gtest_filter='TestQTensor.QuantizePerChannel4d*' --gtest_repeat=10000 | grep FAILURE
```

For ARM Neon:
```
BUILD_MOBILE_BENCHMARK=1 BUILD_MOBILE_TEST=1 ANDROID_DEBUG_SYMBOLS=1 BUILD_PYTORCH_MOBILE=1 ANDROID_ABI="armeabi-v7a with NEON" ./scripts/build_android.sh  -DANDROID_CCACHE=$(which ccache) -DBUILD_BINARY=ON
adb push ./build/bin/quantized_test /data/local/tmp
adb shell "/data/local/tmp/quantized_test --gtest_filter='TestQTensor.QuantizePerChannel4d*' --gtest_repeat=1000 | grep FAILURE"
```

For ARM64:
```
BUILD_MOBILE_BENCHMARK=1 BUILD_MOBILE_TEST=1 ANDROID_DEBUG_SYMBOLS=1 BUILD_PYTORCH_MOBILE=1 ANDROID_ABI=arm64-v8a ./scripts/build_android.sh  -DANDROID_CCACHE=$(which ccache) -DBUILD_BINARY=ON
adb push ./build/bin/quantized_test /data/local/tmp
adb shell "/data/local/tmp/quantized_test --gtest_filter='TestQTensor.QuantizePerChannel4d*' --gtest_repeat=1000 | grep FAILURE"
```

**Tasks:** T79019469

Test Plan: Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D24769889

Pulled By: AJLiu

fbshipit-source-id: 417e7339bac70df5b9f630a1e286fad435e49240
2020-11-06 16:31:04 -08:00
637787797b [JIT] add support for torch.jit.Final in python 3.6 (#47393)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47393

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D24739402

Pulled By: Lilyjjo

fbshipit-source-id: 46f003f0a4b1a36894050b72b8f2334c30268e54
2020-11-06 14:30:44 -08:00
31d041c946 Back out "[c10] make intrusive_ptr available as a pybind holder type"
Summary:
Original commit changeset: b9796e15074d

We are seeing a weird issue with custom class + recursive scripting; unland this first to figure out more details.

Test Plan: wait for sandcastle

Reviewed By: zhangguanheng66

Differential Revision: D24780498

fbshipit-source-id: 99a937a26908897556d3bd9f1b2b39f494836fe6
2020-11-06 14:27:48 -08:00
8eb228a7f3 Add support for log_softmax (#47409)
Summary:
This diff adds support for `log_softmax` op in NNC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47409

Reviewed By: ejguan

Differential Revision: D24750203

Pulled By: navahgar

fbshipit-source-id: c4dacc7f62f9df65ae467f0d578ea03d3698273d
2020-11-06 13:29:27 -08:00
582e852fba [caffe2] Add unittests for schema.Field init (#47512)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47512

I deleted the last line of `__init__` -- `self._field_offsets.append(offset)` -- and the unittests didn't fail.

So this diff is to improve test coverage.

Test Plan:
```
    ✓ Pass: caffe2/caffe2/python:schema_test - testInitShouldSetEmptyParent (caffe2.caffe2.python.schema_test.TestField) (8.225)
    ✓ Pass: caffe2/caffe2/python:schema_test - testInitShouldSetFieldOffsetsIfNoChildren (caffe2.caffe2.python.schema_test.TestField) (8.339)
    ✓ Pass: caffe2/caffe2/python:schema_test - testInitShouldSetFieldOffsets (caffe2.caffe2.python.schema_test.TestField) (8.381)
```

Reviewed By: dzhulgakov

Differential Revision: D24767188

fbshipit-source-id: b6ce8cc96ecc61768b55360e0238f7317a2f18ea
2020-11-06 13:27:58 -08:00
2572d7a671 [quant][eagermode][qat][test] Add numerical test for qat convert (#47376)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47376

For sigmoid, hardsimoid, tanh, leaky_relu

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24734754

fbshipit-source-id: f42ff9410629fa344be97494ffdbe453a7943f65
2020-11-06 12:36:16 -08:00
24b549ba84 [jit] better message for bad type annotation (#47464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47464

```
ValueError: Unknown type annotation: 'typing.Sequence[torch.Tensor]' at  File "xxx.py", line 223
        images = [x["image"].to(self.device) for x in batched_inputs]
        images = [(x - self.pixel_mean) / self.pixel_std for x in images]
        images = ImageList.from_tensors(images, self.backbone.size_divisibility)
                 ~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return images
```

Otherwise we have no clue where the error is.

Test Plan: sandcastle

Reviewed By: glaringlee

Differential Revision: D24764886

fbshipit-source-id: abd5734394e53b20baa6473134896e3a2b178662
2020-11-06 12:36:14 -08:00
c26c4690fe Add sub operator
Summary: Add sub operator for caffe2

Test Plan:
```
buck test //caffe2/torch/fb/model_transform/c2_convert:c2_pt_converter_test
```

Reviewed By: houseroad

Differential Revision: D24685090

fbshipit-source-id: 60d745065d01b634ebd3087e533d8b9ddab77a1f
2020-11-06 12:31:17 -08:00
47198e3208 [caffe2] improve core.Net cloning/init performance (24x for large models!) (#47475)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47475

This improves core.Net cloning/init performance by quite a bit. It makes set_input_record run in constant time instead of O(n) by checking the external_input map instead of regenerating the external inputs on every call and iterating over them.
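
The general shape of the fix, as an illustrative Python sketch (not the actual caffe2 code):
```python
# Membership checks against a prebuilt map are O(1), versus re-deriving the
# external-input list and scanning it on every call (the old behavior).
external_input_map = {"data": 0, "label": 1}  # built once; contents hypothetical

def has_external_input(name):
    return name in external_input_map  # O(1) lookup

print(has_external_input("data"))  # True
```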

Test Plan: unit tests + canary runs

Reviewed By: dzhulgakov

Differential Revision: D24765346

fbshipit-source-id: 92d9f6dec158512bd50513b78675174686f0f411
2020-11-06 11:34:12 -08:00
90a90ab1d6 Add type informations to torch/storage.py (#46876)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46876

Reviewed By: glaringlee

Differential Revision: D24758448

Pulled By: ezyang

fbshipit-source-id: afbc19637fbfaa1b0276cdd707043111aee3abc3
2020-11-06 11:34:10 -08:00
d0d673b043 Improve reciprocal() and rsqrt() accuracy on arm64 (#47478)
Summary:
Neither `vrecpeq_f32` nor `vrsqrteq_f32` yields an accurate result; each performs only the first of the two steps in an iteration of the Newton-Raphson method, as documented at
https://developer.arm.com/documentation/dui0472/j/using-neon-support/neon-intrinsics-for-reciprocal-and-sqrt

Use the appropriate NEON instructions to run two more steps of Newton's method and improve the results; a sketch of the iteration follows.
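
A minimal sketch of the refinement in plain Python, standing in for the NEON intrinsics:
```python
# Newton-Raphson refinement for 1/x; on NEON, vrecpsq_f32 computes the
# (2 - x*e) correction factor used below.
def refine_reciprocal(x, e, steps=2):
    for _ in range(steps):
        e = e * (2.0 - x * e)  # each step roughly doubles the correct bits
    return e

print(refine_reciprocal(3.0, 0.333))  # converges toward 1/3 = 0.3333...
```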

Before:
```
$ python -c "import torch;print(torch.arange(1.0, 17.0, 1.0, dtype=torch.float32).reciprocal())"
tensor([0.9980, 0.4990, 0.3330, 0.2495, 0.1997, 0.1665, 0.1426, 0.1248, 0.1108,
        0.0999, 0.0908, 0.0833, 0.0769, 0.0713, 0.0667, 0.0624])
$ python -c "import torch;print(torch.arange(1.0, 17.0, 1.0, dtype=torch.float32).rsqrt())"
tensor([0.9980, 0.7051, 0.5762, 0.4990, 0.4463, 0.4082, 0.3779, 0.3525, 0.3330,
        0.3154, 0.3008, 0.2881, 0.2773, 0.2666, 0.2578, 0.2495])
```
After:
```
$ python -c "import torch;print(torch.arange(1.0, 17.0, 1.0, dtype=torch.float32).reciprocal())"
tensor([1.0000, 0.5000, 0.3333, 0.2500, 0.2000, 0.1667, 0.1429, 0.1250, 0.1111,
        0.1000, 0.0909, 0.0833, 0.0769, 0.0714, 0.0667, 0.0625])
$ python -c "import torch;print(torch.arange(1.0, 17.0, 1.0, dtype=torch.float32).rsqrt())"
tensor([1.0000, 0.7071, 0.5774, 0.5000, 0.4472, 0.4082, 0.3780, 0.3536, 0.3333,
        0.3162, 0.3015, 0.2887, 0.2774, 0.2673, 0.2582, 0.2500])
```

Partially addresses https://github.com/pytorch/pytorch/issues/47476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47478

Reviewed By: walterddr

Differential Revision: D24773443

Pulled By: malfet

fbshipit-source-id: 224dca9725601d29fb229f8d71d968a30f25c829
2020-11-06 11:31:05 -08:00
5614f72534 Suppress test issues in test_torch running in sandcastle (#47474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47474

After enabling GPU runs on RE (remote execution), some test issues turned out to be specific to those runs; this suppresses them.

Test Plan:
```
buck test -c test.external_runner=tpx mode/opt //caffe2/test:torch_cuda -- --use-remote-execution --force-tpx --run-disabled
```

Reviewed By: malfet, janeyx99

Differential Revision: D24771578

fbshipit-source-id: 1ada79dae12c8cb6f795a0d261c60f038eee2dfb
2020-11-06 10:34:28 -08:00
611080a118 [hot fix] cuda 11.0.x doesn't support sm86. (#47408)
Summary:
Bump condition check from >11.0 to >11.0.3

CMake 3.5 doesn't support VERSION_GREATER_EQUAL (see [here](https://github.com/Dav1dde/glad/issues/134)), so we might need to bump this again if 11.0.4+ releases.

should fix https://github.com/pytorch/pytorch/issues/47352

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47408

Reviewed By: glaringlee

Differential Revision: D24759949

Pulled By: walterddr

fbshipit-source-id: de384c7b150babaf799cce53ed198e5e931899da
2020-11-06 10:34:25 -08:00
160db3db4f Adding profiling capability to c++ ddp collective functions (#46471)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46471

ghstack-source-id: 116018837

Test Plan:
Added unit tests:

 buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork
 buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork

Reviewed By: rohan-varma

Differential Revision: D23948397

fbshipit-source-id: 6d93a370aff26bf96c39e5d78a2492c5142a9156
2020-11-06 10:29:58 -08:00
1aeefcdaa6 Revert D24730264: [pytorch][PR] Added CUDA support for complex input for torch.inverse
Test Plan: revert-hammer

Differential Revision:
D24730264 (33acbedace)

Original commit changeset: b9c94ec46301

fbshipit-source-id: beb9263700e9bc92685f74c37c46aa33f3b595b9
2020-11-06 07:28:14 -08:00
f3ad7b2919 [JIT][Reland] add list() support (#42382)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40869

Resubmit of https://github.com/pytorch/pytorch/pull/33818.

Adds support for `list()` by desugaring  it to a list comprehension.

Last time I landed this it made one of the tests slow, and it got unlanded. I think that's because the previous PR changed the emission of `list()` on a list or str input to a list comprehension, which is the more general way of emitting `list()` but also a little slower. I updated this version to emit the builtin operators for these two cases (see the sketch below). Hopefully it can land without being reverted this time...
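
In plain Python terms, the desugaring amounts to the following sketch (per the note above, list and str inputs now go through builtin operators instead):
```python
def desugared_list(x):
    return [elem for elem in x]  # what list(x) lowers to in the general case

print(desugared_list((1, 2, 3)))  # [1, 2, 3]
print(desugared_list(range(3)))   # [0, 1, 2]
```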

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42382

Reviewed By: navahgar

Differential Revision: D24767674

Pulled By: eellison

fbshipit-source-id: a1aa3d104499226b28f47c3698386d365809c23c
2020-11-06 01:28:54 -08:00
eaa993a2e0 Add type annotations to torch._C._distributed_rpc module. (#46624)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46624

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24761656

Pulled By: xuzhao9

fbshipit-source-id: b55aee5dd2b97f573a50e5bbfddde7d984943fec
2020-11-06 01:28:51 -08:00
73a3e70b24 Add type annotations for torch._C._distributed_c10d module. (#46623)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46623

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D24761606

Pulled By: xuzhao9

fbshipit-source-id: 827eaf2502e381ee24d36741c1613b4c08208569
2020-11-06 01:28:48 -08:00
fe77ded48a Add Python declaration of torch._C and torch._C._autograd modules. (#46622)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46622

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D24761503

Pulled By: xuzhao9

fbshipit-source-id: c7ff9a9e46480a83bf6961e09972b5d20bdeb67b
2020-11-06 01:25:47 -08:00
fccfe7bd1a [Gradient Compression] Add unit tests that test default Python comm hook implementations (#47158)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47158

1. Test the default Python comm hook implementations ALLREDUCE and FP16_COMPRESS, besides an ad-hoc all-reduce implementation.
2. Typo fix.
3. Reformat default_hooks.py.
4. Publish register_comm_hook API for DDP module (This should be done in a separate diff, but got merged unintentionally.)

The new style can be used for testing any new comm hook like PowerSGD easily.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

ghstack-source-id: 116012600

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D24669639

fbshipit-source-id: 048c87084234edc2398f0ea6f01f2f083a707939
2020-11-06 00:28:09 -08:00
873652d9ac [TensorExpr] Fix LLVM 12 build after LLVM API changes (#47480)
Summary:
PolySize was removed: https://reviews.llvm.org/D88982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47480

Test Plan: Build against LLVM 12.

Reviewed By: glaringlee

Differential Revision: D24773973

Pulled By: asuhan

fbshipit-source-id: d09566675c043d8b63032c52bdadd09e09ccfc39
2020-11-05 22:30:37 -08:00
fd72ec53d4 [JIT] Optimize hot path in ProfilingGraphExecutorImpl::getPlanFor. (#47465)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47465

This results in a ~10% improvement on a DeepAndWide model:

10 runs of a benchmark before the change:
```
1.480785621330142
1.430812582373619
1.3845220785588026
1.4510653037577868
1.4827174227684736
1.3679781593382359
1.4239587392657995
1.5069784726947546
1.3988622818142176
1.4533461946994066
```

10 runs of the same benchmark after the change:
```
1.3221493270248175
1.3624659553170204
1.3415213637053967
1.3560577500611544
1.3064174111932516
1.2934542261064053
1.379274770617485
1.3850531745702028
1.26725466363132
1.3738237638026476
```

Link to benchmark: https://gist.github.com/ZolotukhinM/2308732eabb47685c6f7786e5a13b3d1

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D24767247

Pulled By: ZolotukhinM

fbshipit-source-id: a77e89fdfb54286e6463533c86b3a4ba606ca1c7
2020-11-05 22:27:24 -08:00
9a9383ef2e PyTorch NNAPI integration prototype (#46780)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46780

This is in prototype status, but pretty functional.  There are two major
parts.

- Model converter.  This is a pure Python component that consumes a
  model in TorchScript format, converts the operations into NNAPI
  semantics, and serializes the model in a custom format.  It then wraps
  the result in a new TorchScript model that can invoke NNAPI under the
  hood.
- Runtime.  This is a TorchBind object that deserializes the model and
  sends the result to NNAPI.  This is fairly simple since the serialized
  format is basically just a list of NNAPI calls to make, so most of the
  code is spent on bounds checking.

A few notes on the design.
- Currently, all tensor sizes need to be fixed, and those fixed sizes
  are burned directly into the serialized model.  This will probably
  need to change.  NNAPI supports variable-sized tensors, but the
  important hardware backends do not.  However, we're seeing use cases
  crop up where the input size is not known until around the time that
  the model is loaded (for example, it might depend on the camera aspect
  ratio).  I think the proper fix here is to remove the code in the
  converter that eagerly calculates the sizes of the intermediate
  tensors and replace it with a code generator that will generate some
  TorchScript code that will perform those calculations at model load
  time.  This way, we will be able to support models that have
  variable-sized inputs while still only showing fixed-sized operands to
  NNAPI.
- The important hardware backends want operands to be in NHWC order, but
  PyTorch natively represents all tensors as NCHW.  The strategy for
  this is to keep NCHW during most of the conversion process, but track
  an additional value per operand representing the "dimension order".
  The dimension order gets propagated through convolutions and pointwise
  ops.  When we're ready to serialize the model, we reorder the
  dimensions for "channels last" operands to NHWC (see the sketch below).
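
The reordering described in the last point, in plain PyTorch terms (illustrative only):
```python
import torch

t = torch.randn(1, 3, 224, 224)            # NCHW, PyTorch's native layout
nhwc = t.permute(0, 2, 3, 1).contiguous()  # NHWC for the serialized operands
print(nhwc.shape)                          # torch.Size([1, 224, 224, 3])
```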

Test Plan:
Some local testing with FB prod models.  I'll need to add some examples
and automated tests.

Reviewed By: iseeyuan

Differential Revision: D24574040

Pulled By: dreiss

fbshipit-source-id: 6adc8571b234877ee3666ec0c0de24da35c38a1f
2020-11-05 21:31:01 -08:00
ad8c0e57ef Add a command-line flag for overriding pthreadpool size (#46781)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46781

Test Plan: Passed it to speed_benchmark_torch and saw perf change.

Reviewed By: iseeyuan

Differential Revision: D24752889

Pulled By: dreiss

fbshipit-source-id: 762981510f271d20f76e33b6e6f361c4a6f48e6c
2020-11-05 21:30:54 -08:00
a63f391c6f [JIT] fix documentation typo (#46926)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46926

Reviewed By: glaringlee

Differential Revision: D24762897

Pulled By: eellison

fbshipit-source-id: f58c4db5f4dd037141c18ec1121816eba33f87b7
2020-11-05 21:26:27 -08:00
ceb16d8836 [Bootcamp] add CUDA kernel checks to ATen/native/cuda (#47466)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47466

- Add kernel launch check `TORCH_CUDA_KERNEL_LAUNCH_CHECK()` (D24309971 (353e7f940f)) to several files in aten/src/ATen/native/cuda
- Get rid of old check `AT_CUDA_CHECK(cudaGetLastError())` in these same files

Test Plan:
Test build:
```
buck build //caffe2/aten:ATen-cu
```
To check for launches without checks:
```
python3 caffe2/torch/testing/check_kernel_launches.py
```
Make sure none of the updated files are in the returned list.

Reviewed By: r-barnes

Differential Revision: D24724947

fbshipit-source-id: a7c7d3c70ed8fb5dfd69997b50f9c838f8651791
2020-11-05 20:27:56 -08:00
e985503d80 [NNC] Fix an issue with half-scalar vars coerced to float (Take 2) (#47448)
Summary:
Take 2 of this fix: I removed the repro from the issue, which is a bit flaky due to parallelism. It broke on Windows, but that isn't specific to Windows or to this fix, I think. I'll make sure all the tests pass this time (cc zou3519).

Fixes an issue where fp16 scalars created by the registerizer could be referenced as floats, causing invalid conversions which would crash the NVRTC compile. I also noticed that we were inserting patterns like float(half(float(X))), and added a pass to collapse those down inside the CudaHalfScalarRewriter.

Fixes https://github.com/pytorch/pytorch/issues/47138

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47448

Reviewed By: glaringlee

Differential Revision: D24765070

Pulled By: nickgg

fbshipit-source-id: 5297e647534d53657bef81f4798e8aa6a93d1fbd
2020-11-05 19:31:52 -08:00
9c8f40516f Batched grad for advanced indexing (index) (#47223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47223

This PR enables batched gradient computation for advanced indexing.
Previously, the backward formula was writing parts of the grad tensor
in-place to zeros_like(self). Since grad is a BatchedTensor and self is
not a BatchedTensor, this is not possible.

To solve the problem, we instead create a new tensor with
`grad.new_zeros` and then write to that in-place. This new tensor will
have the same batchedness as the `grad` tensor.

To prevent regressions (the autograd codegen special cases zeros_like
to avoid saving the `self` tensor for backward), we teach the autograd
codegen how to save `self.options()`.
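
A runnable sketch of the new-zeros-from-grad pattern (simplified; `index_put_` stands in for the real backward formula):
```python
import torch

x = torch.randn(3, requires_grad=True)
y = x[[0, 2]]                      # advanced indexing
g = torch.ones_like(y)             # incoming grad (could be a BatchedTensor)

gx = g.new_zeros(x.shape)          # inherits g's properties, incl. batchedness
gx.index_put_((torch.tensor([0, 2]),), g, accumulate=True)
print(gx)                          # tensor([1., 0., 1.])
```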

Test Plan:
- new tests
- run old indexing tests

Reviewed By: ejguan

Differential Revision: D24741684

Pulled By: zou3519

fbshipit-source-id: e267999dc079f4fe58c3f0bdf5c263f1879dca92
2020-11-05 18:25:33 -08:00
65241e3681 add remove_node in Partition class (#47452)
Summary:
Add a remove_node method to the Partition class for future use.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47452

Reviewed By: glaringlee, gcatron

Differential Revision: D24762770

Pulled By: scottxu0730

fbshipit-source-id: 35473ab7322d8e6ecab1c2624b668342bfec4cca
2020-11-05 17:27:18 -08:00
b4b0fa6371 add get_device_to_partitions_mapping (#47361)
Summary:
Add a get_device_to_partitions_mapping function to the Partitioner class to make size_based_partition more modular and organized. This function will also be used by the future cost_aware_partition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47361

Reviewed By: gcatron

Differential Revision: D24760911

Pulled By: scottxu0730

fbshipit-source-id: 8cdda51b9a1145f9d13ebabbb98b4d9df5ebb6cd
2020-11-05 16:33:02 -08:00
33acbedace Added CUDA support for complex input for torch.inverse (#45034)
Summary:
`torch.inverse` now works for complex inputs on GPU.
Test cases with complex matrices are xfailed for now. For example, batched matmul does not work with complex yet.

Ref. https://github.com/pytorch/pytorch/issues/33152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45034

Reviewed By: zou3519

Differential Revision: D24730264

Pulled By: anjali411

fbshipit-source-id: b9c94ec463012913c117278a884adeee96ea02aa
2020-11-05 16:30:11 -08:00
c2d4a5b137 Disable unused docker-pytorch-linux-xenial-py3.6-gcc4.8 job (#47446)
Summary:
The `docker-pytorch-linux-xenial-py3.6-gcc4.8` job is not used for any builds anymore. This PR removes it from CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47446

Reviewed By: seemethere, samestep

Differential Revision: D24759876

Pulled By: janeyx99

fbshipit-source-id: e7d420fc2c6c7ffa43001d83b449e9ef3070e902
2020-11-05 12:13:37 -08:00
373246733d [FX] get the correct error message (#47108)
Summary:
Currently, code like
```
class Test(nn.Module):
    def __init__(self):
        super(Test, self).__init__()
        self.W = torch.nn.Parameter(torch.randn(5))

    def forward(self, x):
        return torch.dot(self.W, x)

mod = Test()
print(fx.symbolic_trace(Test())(5))
```
gives an error like the below, which does not show the actual code that throws the error.
```
Traceback (most recent call last):
  File "t.py", line 20, in <module>
    print(fx.symbolic_trace(Test())(5))
  File "/home/chilli/fb/pytorch/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/chilli/fb/pytorch/torch/fx/graph_module.py", line 191, in debug_forward
    return src_forward(self, *args, **kwargs)
  File "<eval_with_key_0>", line 5, in forward
TypeError: dot(): argument 'tensor' (position 2) must be Tensor, not int
```

This is particularly annoying when your function has already been transformed several times.

So, the really annoying thing is that the error clearly has the requisite information in `exception.__traceback__` - it just isn't printing it.

I think the right way of doing this is simply replacing `sys.excepthook`. This appears to be the standard way to modify exception messages.
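
A bare-bones illustration of that mechanism (hypothetical; not the code this PR lands):
```python
import sys

def annotated_excepthook(exc_type, exc, tb):
    # Add context, then delegate to the default printer.
    print("error while running FX-generated forward:", file=sys.stderr)
    sys.__excepthook__(exc_type, exc, tb)

sys.excepthook = annotated_excepthook
```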

**Scratch the below**

The 2 methods in the PR right now are:
1. Just prepend the final part of the traceback to the beginning of your error message. Looks like
```
Traceback (most recent call last):
  File "t.py", line 20, in <module>
    print(fx.symbolic_trace(Test())(5))
  File "/home/chilli/fb/pytorch/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/chilli/fb/pytorch/torch/fx/graph_module.py", line 197, in debug_forward
    raise e
  File "/home/chilli/fb/pytorch/torch/fx/graph_module.py", line 192, in debug_forward
    return src_forward(self, *args, **kwargs)
  File "<eval_with_key_0>", line 5, in forward
TypeError:   File "<eval_with_key_0>", line 5, in forward
    dot_1 = torch.dot(w, x)
dot(): argument 'tensor' (position 2) must be Tensor, not int
```

2. Use the `from exception` feature in Python. Looks like
```
Traceback (most recent call last):
  File "/home/chilli/fb/pytorch/torch/fx/graph_module.py", line 192, in debug_forward
    return src_forward(self, *args, **kwargs)
  File "<eval_with_key_0>", line 5, in forward
TypeError:   File "<eval_with_key_0>", line 5, in forward
    dot_1 = torch.dot(w, x)
dot(): argument 'tensor' (position 2) must be Tensor, not int

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "t.py", line 20, in <module>
    print(fx.symbolic_trace(Test())(5))
  File "/home/chilli/fb/pytorch/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/chilli/fb/pytorch/torch/fx/graph_module.py", line 197, in debug_forward
    raise Exception(last_tb) from e
Exception:   File "<eval_with_key_0>", line 5, in forward
    dot_1 = torch.dot(w, x)
```

I think the first one looks better, but it's pretty hacky since we're shoving the traceback in the message.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47108

Reviewed By: jamesr66a

Differential Revision: D24751019

Pulled By: Chillee

fbshipit-source-id: 83e6ed0165f98632a77c73de75504fd6263fff40
2020-11-05 10:59:01 -08:00
eed4a57d54 Speedup copysign for half and bfloat16 types (#47413)
Summary:
This also avoids internal compiler error exceptions on aarch64 platforms and transitively fixes https://github.com/pytorch/pytorch/issues/47395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47413

Reviewed By: walterddr

Differential Revision: D24745921

Pulled By: malfet

fbshipit-source-id: 790e5b91d9116670c882d838b3862d5b47178d68
2020-11-05 10:31:32 -08:00
35491412d1 Revert D24649817: [pytorch][PR] Fix pickling for Tensor subclasses.
Test Plan: revert-hammer

Differential Revision:
D24649817 (c4209f1115)

Original commit changeset: 1872faa36030

fbshipit-source-id: b9832cea45552bd8776909118c4324fbd61fd414
2020-11-05 10:25:48 -08:00
7a599870b0 [ONNX] Update peephole pass for prim::ListUnpack (#46264)
Summary:
Update pass that handles prim::ListUnpack in peephole file, so that it also covers the case when input to the node is of ListType.

Fixes https://github.com/pytorch/pytorch/issues/45816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46264

Reviewed By: mrshenli

Differential Revision: D24566070

Pulled By: bzinodev

fbshipit-source-id: 32555487054f6a7fe02cc17c66bcbe81ddf9623e
2020-11-05 09:42:24 -08:00
5977d1d864 FixedQParamsFakeQuantize: adjust default quant_min and quant_max (#47423)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47423

Since the dtype of this fake_quant is `quint8`, the output range should be
from 0 to 255.  Fixing.  This should address the numerical inaccuracies with
sigmoid and hardsigmoid with `FixedQParamsFakeQuantize` attached compared
to their quantized counterparts.

In a future PR, might be safer to also make the activation functions
using `FixedQParamsFakeQuantize` to explicitly specify their expected
output range and zero_point.  Leaving that for later, as this bugfix
should be landed urgently.

Test Plan:
Manual script which gives low SQNR before this PR and high SQNR after
this PR: https://gist.github.com/vkuzo/9906bae29223da72b10d6b6aafadba42

https://github.com/pytorch/pytorch/pull/47376, which can be landed after
this, adds a proper test.

Imported from OSS

Reviewed By: ayush29feb, jerryzh168

Differential Revision: D24751497

fbshipit-source-id: 4c32e22a30116caaceeedb4cd47146d066054a89
2020-11-05 09:06:55 -08:00
745899f926 Revert D24706475: [pytorch][PR] [NNC] Fix an issue in Cuda fusion with fp16 scalar vars coerced to float
Test Plan: revert-hammer

Differential Revision:
D24706475 (33cf7fddd2)

Original commit changeset: 9df72bbbf203

fbshipit-source-id: f16ff04818de4294713d5b97eab5b298c1a75a6b
2020-11-05 08:25:48 -08:00
9c8078cdfb Revert D24659901: Add tests for DDP control flow models.
Test Plan: revert-hammer

Differential Revision:
D24659901 (31c9d2efcd)

Original commit changeset: 17fc2b3ebba9

fbshipit-source-id: 26b0bdbe83cba54da4f363cfa7fc85c503aa05ab
2020-11-05 08:08:59 -08:00
1519c7145c __noinline__ the top level igamma cuda kernel. (#47414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47414

This improves the build time of this file by 10x on my machine (~12 minutes to ~1 minute).

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24746458

Pulled By: gchanan

fbshipit-source-id: cdef801199d4fdc2bbd740fe1b771285b1d71319
2020-11-05 07:50:59 -08:00
e40a563050 Fix sum batching rule, add simple clone batching rule (#47189)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47189

PyTorch has a special case where sum(scalar_tensor, dim=0) does not fail
and instead returns a new copy of the original scalar_tensor. If we
end up vmapping over per-example scalar tensors, e.g.,
```
>>> x = torch.randn(B0)  # the per-examples are all scalars
>>> vmap(partial(torch.sum, dim=0))(x)
```
then we should replicate the behavior of sum(scalar_tensor, dim=0) by
returning a clone of the input tensor.
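
The special case being replicated, shown on its own (runnable):
```python
import torch

s = torch.randn(())          # 0-dim (scalar) tensor
out = torch.sum(s, dim=0)    # special-cased: returns a copy rather than erroring
assert out.item() == s.item()
```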

This PR also adds a batching rule for clone(Tensor, MemoryFormat). The
batching rule:
- unwraps the BatchedTensor, calls clone(), and rewraps the
BatchedTensor if MemoryFormat is torch.preserve_format (which is the
default).
- errors out with an NYI for all other memory formats, including
torch.contiguous_format. There are some weird semantics for memory
layouts with vmap that I need to go and figure out. Those are noted in
the comments for `clone_batching_rule`

Test Plan: - new tests

Reviewed By: ejguan

Differential Revision: D24741689

Pulled By: zou3519

fbshipit-source-id: e640344b4e4aa8c0d2dbacc5c49901f4c33c6613
2020-11-05 07:38:43 -08:00
9a9529aa84 Batching rules for complex view functions (#47188)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47188

Includes batching rules for:
- torch.real, torch.imag, torch.view_as_real, and torch.view_as_complex

Test Plan: - new tests

Reviewed By: ejguan

Differential Revision: D24741686

Pulled By: zou3519

fbshipit-source-id: c143bab9bb5ebbcd8529e12af7c117cbebd4447e
2020-11-05 07:37:15 -08:00
ae374dc690 Move igamma cuda specific code to kernel file. (#47410)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47410

This is a copy-paste except for:
1) The code is put in an anonymous namespace
1) The static declarations on functions (in the now-anonymous namespace) are removed

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24745597

Pulled By: gchanan

fbshipit-source-id: 049b6bb10845cd8d7961b533782f582b3db25248
2020-11-05 07:21:39 -08:00
220b3bd667 Add op benchmark for batch box cox as baseline (#47275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47275

```
# Benchmarking Caffe2: batch_box_cox
# Name: batch_box_cox_M64_N64_dtypedouble
# Input: M: 64, N: 64, dtype: double
Forward Execution Time (us) : 49.005
```

Test Plan: `buck run mode/opt caffe2/benchmarks/operator_benchmark/c2:batch_box_cox_test -- --iterations=1000  --warmup 100`

Reviewed By: houseroad

Differential Revision: D24675426

fbshipit-source-id: 8bb1f3076dc6b01e7b63468136ddf3d9b6d7e5d2
2020-11-05 07:16:32 -08:00
68954fe897 Add release note scripts (#47360)
Summary:
First commit contains the initial code from Richard's branch.
Second commit are the changes that I made during the writing process
Third commit is the update to support category/topic pair for each commit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47360

Reviewed By: ejguan

Differential Revision: D24741003

Pulled By: albanD

fbshipit-source-id: d0fcc6765968dc1732d8a515688d11372c7e653d
2020-11-05 06:43:24 -08:00
a4ba018e57 Updated docs/test for dot and vdot (#47242)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47242

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D24733771

Pulled By: heitorschueroff

fbshipit-source-id: 92e3b0e28e0565918335fa85d52abe5db9eeff57
2020-11-05 06:27:50 -08:00
d8c3b2b10c [quant][pyper] Add support for pruned weights in embedding_bag_byte lookup (#47329)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47329

Supports pruned weights along with mapping for the compressed indices

Test Plan:
python test/test_quantization.py TestQuantizedEmbeddingOps

Imported from OSS

Reviewed By: qizzzh

Differential Revision: D24719909

fbshipit-source-id: f998f4039e84bbe1886e492a3bff6aa5f56b6b0f
2020-11-04 22:33:33 -08:00
433b55bc7c [quant] Add testing coverage for 4-bit embedding_bag sparse lookup op (#47328)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47328

Extend tests to cover case for pruned weights with mapping table.
Support for 8-bits sparse lookup to follow

Test Plan:
python test/test_quantization.py TestQuantizedEmbeddingOps

Imported from OSS

Reviewed By: qizzzh

Differential Revision: D24719910

fbshipit-source-id: d31db6304f446104ee8c7b10b902accd2919a513
2020-11-04 22:29:12 -08:00
f19637e6ee Expand the test of torch.addbmm and torch.baddbmm (#47079)
Summary:
This is to satisfy the request at https://github.com/pytorch/pytorch/pull/42553#issuecomment-673673914. See also https://github.com/pytorch/pytorch/pull/47124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47079

Reviewed By: ejguan

Differential Revision: D24735356

Pulled By: ngimel

fbshipit-source-id: 122fceb4902658f350c2fd6f92455adadd0ec2a4
2020-11-04 21:11:26 -08:00
df5b4696cf [Pytorch] Specialize guts of c10::optional for 32-bit scalars (#47015)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47015

c10::optional always has non-trivial copy and move operations. This change specializes it for 32-bit scalars so that it has trivial copy and move operations in that case. Ideally, we would instead rely on P0602 ("variant and optional should propagate copy/move triviality") and use `std::optional` (or implement that functionality ourselves). We can't use `std::optional` because we are stuck with C++14, and implementing the full P0602 ourselves would add even more complexity. We could do it, but this should be a helpful first step.
ghstack-source-id: 115886743

Test Plan:
Collect Callgrind instruction counts for `torch.empty(())`. Data:

Make empty c10-ful (https://github.com/pytorch/pytorch/pull/46092):

```
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7ffaed1128e0>
torch.empty(())
                           All          Noisy symbols removed
    Instructions:       648005                     632899
    Baseline:             4144                       3736
100 runs per measurement, 1 thread
```

This diff atop #46092:

```
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f943f1dc8e0>
torch.empty(())
                           All          Noisy symbols removed
    Instructions:       602347                     591005
    Baseline:             4106                       3736
100 runs per measurement, 1 thread
```

(6.6% improvement vs #46092)

Pass optionals by const reference (https://github.com/pytorch/pytorch/pull/46598)

```
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f1abb3988e0>
torch.empty(())
                           All          Noisy symbols removed
    Instructions:       601349                     590005
    Baseline:             4162                       3736
100 runs per measurement, 1 thread
```
(6.8% improvement vs #46092)

This diff atop #46598 (i.e., both together)

```
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f9577c22850>
torch.empty(())
                           All          Noisy symbols removed
    Instructions:       596095                     582451
    Baseline:             4162                       3736
100 runs per measurement, 1 thread
Warning: PyTorch was not built with debug symbols.
         Source information may be limited. Rebuild with
         REL_WITH_DEB_INFO=1 for more detailed results.
```

(another 1.3% savings!)

#46598 outperformed this change slightly, and combining the two leads to further benefits. I guess we should do both! (Though I still don't understand why passing optionals that should fit in a register by const reference would help...)

Reviewed By: smessmer

Differential Revision: D24552280

fbshipit-source-id: 4d93bfcffafebd8c01559398513fa6b9db959d11
2020-11-04 21:08:50 -08:00
0edc6a39c8 [NNC] Read/Write Dependency analysis (#46952)
Summary:
Adds a new piece of infrastructure to the NNC fused-kernel generation compiler, which builds a dependency graph of the reads and writes to memory regions in a kernel.

It can be used to generate graphs like this from the GEMM benchmark (not this only represents memory hierarchy not compute hierarchy):

![image](https://user-images.githubusercontent.com/701287/97368797-e99d5600-1868-11eb-9a7e-ceeb91ce72b8.png)

Or to answer questions like this:
```
Tensor* c = Compute(...);
Tensor* d = Compute(...);
LoopNest loop({d});
MemDependencyChecker analyzer;
loop.root_stmt()->accept(analyzer);
if (analyzer.dependsDirectly(loop.getLoopStmtsFor(d)[0], loop.getLoopStmtsFor(c)[0]) {
  // do something, maybe computeInline
}
```

Or this:
```
Tensor* d = Compute(...);
LoopNest loop({d});
MemDependencyChecker analyzer(loop.getInputs(), loop.getOutputs());
const Buf* output = d->buf();
for (const Buf* input : inputs) {
  if (!analyzer.dependsIndirectly(output, input)) {
    // signal that this input is unused
  }
}
```

This is a monster of a diff, and I apologize. I've tested it as well as possible for now, but it's not hooked up to anything yet so should not affect any current usages of the NNC fuser.

**How it works:**

Similar to the registerizer, the MemDependencyChecker walks the IR aggregating memory accesses into scopes, then merges those scopes into their parent scope and tracks which writes are responsible for the last write to a particular region of memory, adding dependency links where that region is used.

This relies on a bunch of math on symbolic contiguous regions, which I've pulled out into its own file (bounds_overlap.h/cpp). Sometimes this won't be able to infer dependence with 100% accuracy, but it should always be conservative: it may occasionally add false positives, but I'm aware of no false negatives.

The hardest part of the analysis is determining when a Load inside a For loop depends on a Store that is lower in the IR from a previous iteration of the loop. This depends on a whole bunch of factors, including whether or not we should consider loop iteration order. The analyzer comes with configuration of this setting. For example this loop:
```
for (int x = 0; x < 10; ++x) {
 A[x] = B[x] + 1;
}
```

has no inter loop dependence, since each iteration uses a distinct slice of both A and B. But this one:

```
for (int x = 0; x < 10; ++x) {
 A[0] = A[0] + B[x];
}
```

Has a self loop dependence between the Load and the Store of A. This applies to many cases that are not reductions as well. In this example:

```
for (int x = 0; x < 10; ++x) {
  A[x] = A[x+1] + x;
}
```

Whether or not it has self-loop dependence depends on whether we assume the execution order is fixed (or whether this loop could later be parallelized). If the read from `A[x+1]` always comes before the write to that same region, then it has no dependence.

The analyzer can correctly handle dynamic shapes, but we may need more test coverage of real world usages of dynamic shapes. I unit test some simple and pathological cases, but coverage could be better.

**Next Steps:**

Since the PR was already so big I didn't actually hook it up anywhere, but I had planned on rewriting bounds inference based on the dependency graph. Will do that next.

There are a few gaps in this code which could be filled in later if needed:
* Upgrading the bound math to work with write strides, which will reduce false positive dependencies.
* Better handling of Conditions, reducing false positive dependencies when a range is written in both branches of a Cond.
* Support for AtomicAdd node added in Cuda codegen.

**Testing:**

See new unit tests, I've tried to be verbose about what is being tested. I ran the python tests but there shouldn't be any way for this work to affect them yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46952

Reviewed By: ejguan

Differential Revision: D24730346

Pulled By: nickgg

fbshipit-source-id: 654c67c71e9880495afd3ae0efc142e95d5190df
2020-11-04 19:52:20 -08:00
c4209f1115 Fix pickling for Tensor subclasses. (#47115)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47051

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47115

Reviewed By: ejguan

Differential Revision: D24649817

Pulled By: ezyang

fbshipit-source-id: 1872faa3603085f07c0a8a026404161d0715720d
2020-11-04 19:25:32 -08:00
60ae84754e Add torch.overrides checks for submodules. (#47285)
Summary:
Partially addresses the override component of https://github.com/pytorch/pytorch/issues/42666 and https://github.com/pytorch/pytorch/issues/42175.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47285

Reviewed By: agolynski

Differential Revision: D24706493

Pulled By: ezyang

fbshipit-source-id: bf5a742ac7002dce5a9a454a945f1994b4c8b93e
2020-11-04 19:14:04 -08:00
6c5a1c50bf Benchmark combining Distributed Data Parallel and Distributed RPC (#46993)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46993

Introducing a benchmark that combines Distributed Data Parallelism with Distributed Model Parallelism. The benchmark measures distributed training iteration time. The numbers of trainer nodes and parameter servers are configurable. The default setup has 8 trainers, 1 master node, and 8 parameter servers.

The training process is executed as follows:

1) The master creates embedding tables on each of the 8 Parameter Servers and holds an RRef to it.
2) The master then kicks off the training loop on the 8 trainers and passes the embedding table RRef to the trainers.
3) The trainers create a `HybridModel` which performs embedding lookups in all 8 Parameter Servers using the embedding table RRef provided by the master and then executes the FC layer which is wrapped and replicated via DDP (DistributedDataParallel).
4) The trainer executes the forward pass of the model and uses the loss to
   execute the backward pass using Distributed Autograd.
5) As part of the backward pass, the gradients for the FC layer are computed
   first and synced to all trainers via allreduce in DDP.
6) Next, Distributed Autograd propagates the gradients to the parameter servers,
   where the gradients for the embedding table are updated.
7) Finally, the Distributed Optimizer is used to update all parameters.

Test Plan:
waitforbuildbot

Benchmark output:

---------- Info ---------

* PyTorch version: 1.7.0
* CUDA version: 9.2.0

---------- nvidia-smi topo -m ---------

    GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU     Affinity
    GPU0     X      NV2     NV1     NV2     NV1     NODE    NODE    NODE    0-19,40-59
    GPU1    NV2      X      NV2     NV1     NODE    NV1     NODE    NODE    0-19,40-59
    GPU2    NV1     NV2      X      NV1     NODE    NODE    NV2     NODE    0-19,40-59
    GPU3    NV2     NV1     NV1      X      NODE    NODE    NODE    NV2     0-19,40-59
    GPU4    NV1     NODE    NODE    NODE     X      NV2     NV1     NV2     0-19,40-59
    GPU5    NODE    NV1     NODE    NODE    NV2      X      NV2     NV1     0-19,40-59
    GPU6    NODE    NODE    NV2     NODE    NV1     NV2      X      NV1     0-19,40-59
    GPU7    NODE    NODE    NODE    NV2     NV2     NV1     NV1      X      0-19,40-59

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

------------------  PyTorch Distributed Benchmark (DDP and RPC) ---------------------

                    sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec
    Trainer0:  p50:  0.376s     185/s  p75:  0.384s     182/s  p90:  0.390s     179/s  p95:  0.396s     176/s
    Trainer1:  p50:  0.377s     204/s  p75:  0.384s     200/s  p90:  0.389s     197/s  p95:  0.393s     195/s
    Trainer2:  p50:  0.377s     175/s  p75:  0.384s     172/s  p90:  0.390s     169/s  p95:  0.395s     166/s
    Trainer3:  p50:  0.377s     161/s  p75:  0.384s     158/s  p90:  0.390s     156/s  p95:  0.393s     155/s
    Trainer4:  p50:  0.377s     172/s  p75:  0.383s     169/s  p90:  0.389s     166/s  p95:  0.395s     164/s
    Trainer5:  p50:  0.377s     180/s  p75:  0.383s     177/s  p90:  0.389s     174/s  p95:  0.395s     172/s
    Trainer6:  p50:  0.377s     204/s  p75:  0.384s     200/s  p90:  0.390s     197/s  p95:  0.394s     195/s
    Trainer7:  p50:  0.377s     185/s  p75:  0.384s     182/s  p90:  0.389s     179/s  p95:  0.394s     177/s
         All:  p50:  0.377s    1470/s  p75:  0.384s    1443/s  p90:  0.390s    1421/s  p95:  0.396s    1398/s

Reviewed By: pritamdamania87

Differential Revision: D24409230

fbshipit-source-id: 61de31dd4b69914198cb4becc2e616b17d47ef1a
2020-11-04 18:53:19 -08:00
ca293ec4e7 [TensorExpr] Run constant pooling in fusion groups to dedupe constants. (#47402)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47402

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D24740957

Pulled By: ZolotukhinM

fbshipit-source-id: 741cbddc4bf2decd95d444235c424a4ae003d0de
2020-11-04 18:44:12 -08:00
5107a411cd add partition_by_partition_cost (#47280)
Summary:
This PR adds support for calculating the cost of a partitioned graph, partition by partition, based on the node cost. In a partitioned graph, the top partitions (partitions without parents) are collected as starting points; DFS is then used to find the critical path among all partitions in the graph (an illustrative sketch follows).
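
An illustrative sketch of that critical-path computation; the `cost` and `children` fields on partitions are assumptions for illustration, not the PR's actual API:

```
def critical_path_cost(top_partitions):
    memo = {}

    def longest_from(p):
        # Latency of the most expensive path that starts at partition p.
        if p not in memo:
            memo[p] = p.cost + max(
                (longest_from(c) for c in p.children), default=0.0
            )
        return memo[p]

    return max(longest_from(p) for p in top_partitions)
```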

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47280

Reviewed By: gcatron

Differential Revision: D24735932

Pulled By: scottxu0730

fbshipit-source-id: 96653a8208554d2c3624e6c8718628f7c13e320b
2020-11-04 18:21:18 -08:00
878032d387 [ONNX] Add export of prim::data (#45747)
Summary:
Add export of prim::data

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45747

Reviewed By: bdhirsh

Differential Revision: D24280334

Pulled By: bzinodev

fbshipit-source-id: d21eda84eaba9e690852a72c0e63cbb40eae89bc
2020-11-04 18:15:28 -08:00
192b2967a5 [quant][graphmode][fx][test] Add test for nn.Sequential (#47411)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47411

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D24745678

fbshipit-source-id: f8a4858748402db6e72a21bf051f5542b9215ffa
2020-11-04 18:04:19 -08:00
c8872051e6 Validate number of GPUs in distributed_test. (#47259)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47259

As described in https://github.com/pytorch/pytorch/issues/47257, not
using enough GPUs results in an error.

As a result, before we call `init_process_group` in distributed_test, we
validate that we have enough GPUs.
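
A minimal sketch of such a guard, with illustrative names (this is not the PR's exact code):

```
import torch
import torch.distributed as dist

def init_process_group_checked(backend, rank, world_size):
    # Fail fast with a clear message instead of a confusing runtime error.
    if backend == "nccl" and torch.cuda.device_count() < world_size:
        raise RuntimeError(
            f"Test requires {world_size} GPUs, but only "
            f"{torch.cuda.device_count()} are visible"
        )
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
```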

#Closes: https://github.com/pytorch/pytorch/issues/47257
ghstack-source-id: 115790475

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D24699122

fbshipit-source-id: 59c78d191881d1e063c43623dcf4d7eb75a2e94e
2020-11-04 17:55:34 -08:00
8a3728c819 Make torch.det() support complex input. (#45980)
Summary:
As per title. A minor fix is required to make it available on the CPU (`fmod` does not support complex).
CUDA support requires [#45898](https://github.com/pytorch/pytorch/pull/45898).
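
A small usage sketch of the new behavior (assuming this change and the linked CUDA PR have landed):

```
import torch

a = torch.randn(3, 3, dtype=torch.complex64)
d = torch.det(a)   # now works for complex inputs
print(d.dtype)     # torch.complex64
```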

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45980

Reviewed By: izdeby

Differential Revision: D24539097

Pulled By: anjali411

fbshipit-source-id: 508830dbfd7794ab73e19320d07c69a051c91819
2020-11-04 17:47:03 -08:00
030caa190f Expand the test of torch.bmm on CUDA (#47124)
Summary:
basically https://github.com/pytorch/pytorch/pull/47070, enabled on all CI with `ci-all`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47124

Reviewed By: ejguan

Differential Revision: D24735130

Pulled By: ngimel

fbshipit-source-id: c2124562a9f9d1caf24686e5d8a1106c79366233
2020-11-04 17:29:34 -08:00
32c76dbecc Split IGamma cuda kernel into it's own file to speed up compilation times. (#47401)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47401

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24740657

Pulled By: gchanan

fbshipit-source-id: 78244dba8624ca7be8761a8f4bf1aa078602e5cc
2020-11-04 17:23:25 -08:00
735f8cc6c2 [DI] Allow explicit taskLauncher for torchscript interpreter (#46865)
Summary:
By default, TorchScript execution is single-threaded and uses the caller's thread pool. For the distributed inference use case, we want to be able to customize where the TorchScript interpreter executes. This diff allows specifying an explicit taskLauncher for the TorchScript interpreter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46865

Test Plan:
unit test is passed.

fbshipit-source-id: 1d7b003926c0d1f8facc53206efb960cff8897ac

Reviewed By: houseroad

Differential Revision: D24616102

Pulled By: garroud

fbshipit-source-id: 79202b62f92d0b0baf72e4bf7aa3f05e0da91d59
2020-11-04 17:07:55 -08:00
b704cbeffe [FX] Speed up non-parameter tensor lookup (#47325)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47325

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D24715484

Pulled By: jamesr66a

fbshipit-source-id: 983eef6212ae95f5ddd3255adc8a585fb336074c
2020-11-04 16:59:02 -08:00
ff3e1de6d7 Clean up some imports in cuda kernel code. (#47400)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47400

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24740655

Pulled By: gchanan

fbshipit-source-id: b56a602637c375575444c074c4be0a698441a4ab
2020-11-04 16:56:48 -08:00
848901f276 Fix collect_env when pytorch is not installed (#47398)
Summary:
Moved all torch-specific checks under the `if TORCH_AVAILABLE` block.

Embedded the gpu_info dict back into the SystemEnv constructor and deduplicated some code between the HIP and CUDA cases.

Fixes https://github.com/pytorch/pytorch/issues/47397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47398

Reviewed By: walterddr

Differential Revision: D24740421

Pulled By: malfet

fbshipit-source-id: d0a1fe5b428617cb1a9d027324d24d7371c68d64
2020-11-04 16:54:08 -08:00
da491d7535 Split up BinaryMiscOpKernels.cu because it's slow to compile. (#47362)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47362

Test Plan: Imported from OSS

Reviewed By: bwasti

Differential Revision: D24730228

Pulled By: gchanan

fbshipit-source-id: 17edc203fcc06aa5f64174305b184868c7f3e67b
2020-11-04 15:56:11 -08:00
33cf7fddd2 [NNC] Fix an issue in Cuda fusion with fp16 scalar vars coerced to float (#47229)
Summary:
Fixes an issue where fp16 scalars created by the registerizer could be referenced as floats, causing invalid conversions that would crash during NVRTC compilation. I also noticed that we were inserting patterns like `float(half(float(X)))` and added a pass to collapse those down inside the CudaHalfScalarRewriter.

Fixes https://github.com/pytorch/pytorch/issues/47138

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47229

Reviewed By: agolynski

Differential Revision: D24706475

Pulled By: nickgg

fbshipit-source-id: 9df72bbbf203353009e98b9cce7ab735efff8b21
2020-11-04 15:48:12 -08:00
31c9d2efcd Add tests for DDP control flow models. (#47206)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47206

As discussed offline with pritamdamania87, add testing to ensure per-iteration and rank-dependent control flow works as expected in DDP with `find_unused_parameters=True`.
ghstack-source-id: 115854944

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D24659901

fbshipit-source-id: 17fc2b3ebba9cef2dd01d2877bad5702174b9767
2020-11-04 15:40:57 -08:00
2e5bfa9824 Add input argument to autograd.backward() cpp api (#47214)
Summary:
Helps fix https://github.com/pytorch/pytorch/issues/46373 for the cpp api.

Follow up to https://github.com/pytorch/pytorch/pull/46855/ which only changed the api for python only

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47214

Reviewed By: agolynski

Differential Revision: D24716139

Pulled By: soulitzer

fbshipit-source-id: 3e1f35968e8dee132985b883481cfd0d1872ccdd
2020-11-04 14:43:59 -08:00
6f6025183f Skip iomp5 embedding if torch_cpu could not be found (#47390)
Summary:
This would be the case when the package is built for local development rather than for installation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47390

Reviewed By: janeyx99

Differential Revision: D24738416

Pulled By: malfet

fbshipit-source-id: 22bd676bc46e5d50a09539c969ce56d37cfe5952
2020-11-04 14:22:53 -08:00
ae7063788c [Pytorch] Add basic c10::optional tests (#47014)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47014

Some tests are better than zero tests.
ghstack-source-id: 115769678

Test Plan: Run new tests, passes

Reviewed By: smessmer

Differential Revision: D24558649

fbshipit-source-id: 50b8872f4f15c9a6e1f39b945124a31b57dd61d9
2020-11-04 14:19:46 -08:00
17be8ae11a [pytorch] Remove c10::nullopt_t::init (#47013)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47013

It was getting used in client code, and it's not part of `std::optional`.
ghstack-source-id: 115769682

Test Plan: Existing tests

Reviewed By: smessmer

Differential Revision: D24547710

fbshipit-source-id: a24e0fd03aba1cd996c85b12bb5dcdb3e7af46b5
2020-11-04 14:14:55 -08:00
7ab843e78b [JIT] add freeze to docs (#47120)
Summary:
`freeze` was temporarily renamed to `_freeze` in a reorg and then removed from the docs [here](https://github.com/pytorch/pytorch/pull/43473). This adds it back to the docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47120

Reviewed By: suo

Differential Revision: D24650712

Pulled By: eellison

fbshipit-source-id: 399e31586b8093de66937ba1266007ee291f509e
2020-11-04 13:50:36 -08:00
a11bc04997 Expand GRADIENT_IMPLEMENTED_FOR_COMPLEX to allow named tensors (#47289)
Summary:
Complex-valued named tensors do not support backpropagation currently. This is due to `tools/autograd/gen_variable_type.py` not containing `alias` in `GRADIENT_IMPLEMENTED_FOR_COMPLEX`, which is required to construct named tensors.

This fixes https://github.com/pytorch/pytorch/issues/47157. Also removed a duplicate `cholesky` in the list and added a test in `test_autograd.py`.

Apologies, this is a duplicate of https://github.com/pytorch/pytorch/issues/47181 as I accidentally removed my pytorch fork.

cc: zou3519 anjali411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47289

Reviewed By: agolynski

Differential Revision: D24706571

Pulled By: zou3519

fbshipit-source-id: 2cc48ce38eb180183c5b4ce2f8f4eef8bcac0316
2020-11-04 13:30:44 -08:00
5d82311f0d Add vulkan reshape op (hack) (#47252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47252

For now, just use the CPU to reshape.

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D24733906

Pulled By: SS-JIA

fbshipit-source-id: df0e4c0f21379cb2533a1717300b2f7275936e55
2020-11-04 13:14:26 -08:00
6b3802a711 [Gradient Compression] Export sizes, along with length and offset of each variable to GradBucket for PowerSGD (#47203)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47203

1. Create a new field in BucketReplica to store sizes info for each variable.
2. Export the sizes list, along with lengths and offsets, to GradBucket.

These fields are needed for PowerSGD.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 115875194

Test Plan: Checked the field values from log.

Reviewed By: rohan-varma

Differential Revision: D24644137

fbshipit-source-id: bcec0daf0d02cbf25389bfd9be90df1e6fd8fc56
2020-11-04 12:34:53 -08:00
2c55426610 Renamed a TensorListMetaData property. Cleaned up a test (#46662)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46662

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24453346

Pulled By: izdeby

fbshipit-source-id: f88ac21708befa2e8f3edeffe5805b69a4634d12
2020-11-04 12:01:28 -08:00
f588ad6a35 [quant][graphmode][fx] Test to make sure dequantize node are placed properly (#47332)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47332

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D24719736

fbshipit-source-id: 51b1f14b479edbc5d7f28d85920faf5fee8dd5ea
2020-11-04 11:13:01 -08:00
bba5a31176 Revert D24481801: Optimize backward for torch.repeat
Test Plan: revert-hammer

Differential Revision:
D24481801 (4e6f2440d8)

Original commit changeset: 95c155e0de83

fbshipit-source-id: 0fb0afde760b0f5e17bd75df950a5d76aee5370b
2020-11-04 10:44:40 -08:00
4189c3ca76 Fix onnx test-reports path in CI (#47315)
Summary:
Currently, no test reports are uploaded to CI because the paths for the `onnx` runs are incorrect. This PR attempts to change that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47315

Reviewed By: malfet

Differential Revision: D24727607

Pulled By: janeyx99

fbshipit-source-id: f6d91698fdb15a39e01ef812032d4cd30621f864
2020-11-04 10:30:52 -08:00
01da0fe5ff Including generator param in randperm documentation (#47231)
Summary:
The `randperm` documentation is outdated and did not include the optional `generator` parameter. This PR adds that, along with the `pin_memory` parameter.

This PR was brought up in [PR 47022](https://github.com/pytorch/pytorch/pull/47022), but is now rebased onto master.

New docs look like:
![image](https://user-images.githubusercontent.com/31798555/97923963-e6084400-1d2c-11eb-9d46-573ba3189ad6.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47231

Reviewed By: mruberry

Differential Revision: D24711960

Pulled By: janeyx99

fbshipit-source-id: 3ff8be62ec33e34ef87d017ea97bb950621a3064
2020-11-04 09:37:41 -08:00
fe17269e75 Revert "Revert D24335982: explicitly error out in comparison ops when the types don't match" (#47288)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47288

This reverts commit b3eb0c86cf21d8dad5744a917c70d846a8715e69.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24706531

Pulled By: bdhirsh

fbshipit-source-id: f3bf34ddba7882932155819251b6c7dcb5c6b56c
2020-11-04 09:27:47 -08:00
e4bc785dd5 randperm: add torch check to ensure generator device = tensor device (#47022)
Summary:
**BC-breaking Note:**

This PR disallows passing in a generator of a different device than the tensor being created during `randperm` execution. For example, the following code which used to work no longer works.
```
> torch.randperm(3, device='cuda', generator=torch.Generator(device='cpu'))
tensor([0, 1, 2], device='cuda:0')
```
It now errors:
```
> torch.randperm(3, device='cuda', generator=torch.Generator(device='cpu'))
RuntimeError: Expected a 'cuda:0' generator device but found 'cpu'
```

**PR Summary:**

Fixes https://github.com/pytorch/pytorch/issues/44714

Also added + ran tests to ensure this functionality.

Disclaimer: More work needs to be done with regard to small cuda tensors when a generator is specified; see the issue thread for more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47022

Reviewed By: samestep

Differential Revision: D24608237

Pulled By: janeyx99

fbshipit-source-id: b83c47219c7816d93f938f7ce86dc8857513961b
2020-11-04 08:29:31 -08:00
07e8f48e6b Removing caffe2 and third_party from our code coverage (#47310)
Summary:
Our tests do not test these folders (as they shouldn't), and their inclusion in codecov obfuscates our coverage metrics.

We ask codecov to ignore these folders when calculating our coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47310

Reviewed By: walterddr

Differential Revision: D24711775

Pulled By: janeyx99

fbshipit-source-id: 6095bb5e8d52202c7930114d2f357163d2271022
2020-11-04 08:18:13 -08:00
f1ac63d324 Implement copysign (#46396)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46396

Related #38349

[numpy](https://numpy.org/doc/stable/reference/generated/numpy.copysign.html?highlight=copysign#numpy.copysign)
- No in-place function
- No method
- Optional output
- Available: byte, char, bool, int, short, long, float, double, half
- Integral promoted to float
- Not available: float/double complex

`c = np.copysign(a, b)`
|  a |  b |  c | a.grad |
|----|----|----|--------|
| -1 | -1 | -1 |  1 |
| -0 | -1 | -0 |  0 |
|  0 | -1 | -0 |  0 |
|  1 | -1 | -1 | -1 |
| -1 | -0 | -1 |  1 |
| -0 | -0 |  0 |  0 |
|  0 | -0 |  0 |  0 |
|  1 | -0 | -1 | -1 |
| -1 |  0 |  1 | -1 |
| -0 |  0 |  0 |  0 |
|  0 |  0 |  0 |  0 |
|  1 |  0 |  1 |  1 |
| -1 |  1 |  1 | -1 |
| -0 |  1 |  0 |  0 |
|  0 |  1 |  0 |  0 |
|  1 |  1 |  1 |  1 |

This function becomes **non-differentiable** at `a=0` for any `b`. So, in my opinion, we may set the gradient for `a=0` to 0.
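
A small usage sketch matching the table above (a hedged illustration of the post-merge API):

```
import torch

a = torch.tensor([-1.0, -0.0, 0.0, 1.0], requires_grad=True)
b = torch.tensor([-1.0, -1.0, -1.0, -1.0])
c = torch.copysign(a, b)   # tensor([-1., -0., -0., -1.])
c.sum().backward()
print(a.grad)              # per the table: tensor([ 1.,  0.,  0., -1.])
```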

TODO:
- [x] test (cpu/gpu)
- [x] doc
- [x] ~kernel_vec~

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24401366

Pulled By: ejguan

fbshipit-source-id: 3621c5ff74b185376a3705589983bb5197ab896d
2020-11-04 08:08:57 -08:00
996f444c00 [pt][static_runtime] Memory model (#46896)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46896

The idea of the memory model is quite similar to that of BlackBoxPredictor; however, it's more complicated in PyTorch due to 1) tensor views that share storage (with storage refcount bumps) but have different TensorImpls, 2) tensors sharing the same TensorImpl and the same storage, but with no refcount bump of the StorageImpl, 3) data types such as TensorList and Tuple that contain Tensors, and 4) the need to support a mix of out and non-out variants while we move the aten ops to out variants.

As a result, I have to make the following adjustments:
1) remove tensors in output Tuples from internal blob list;
2) for memory allocation/deallocation, get candidate Tensors from the outputs of ops with out variant, extract StorageImpls from the Tensors, dedup, and remove output tensor StorageImpls, and get the final list of blobs for memory planning;
3) during the clean_up_memory pass, clean up memory held by the StorageImpls as well as Tensors/Lists/Tuples in IValues that don't participate in memory planning to reduce overall memory usage

Risk:
The PyTorch team is planning to deprecate the current resize_output api, which we do rely on. This is a pretty big risk.

https://www.internalfb.com/intern/diffusion/FBS/browsefile/master/fbcode/caffe2/aten/src/ATen/native/Resize.cpp?commit=6457b329847607553d34e788a3a7092f41f38895&lines=9-23

Test Plan:
```
buck test //caffe2/test:static_runtime
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```
Benchmarks:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 13 \
buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \
--scripted_model=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/traced_precomputation.pt \
--pt_inputs=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/container_precomputation_bs1.pt \
--iters=1000 --warmup_iters=10000 --num_threads=1 --pt_enable_static_runtime=true \
--pt_cleanup_activations=true --pt_enable_out_variant=false
```

|pt_cleanup_activations	|pt_enable_out_variant	|old ms/iter	|new ms/iter	|
|---	|---	|---	|---	|
|0	|0	|0.31873	|0.30228	|
|0	|1	|0.30018	|0.29184	|
|1	|0	|0.35246	|0.31895	|
|1	|1	|0.35742	|0.30417	|

Reviewed By: bwasti, raziel

Differential Revision: D24471854

fbshipit-source-id: 4ac37dca7d2a0c362120a7f02fd3995460c9a55c
2020-11-03 23:47:59 -08:00
5c4bd9a38f Move python-independent c10d implementations to torch/lib (#47309)
Summary:
* This is a pre-step to build c10d into libtorch
* Includes a minor cleanup in c10d/CMakeLists.txt

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47309

Reviewed By: wanchaol

Differential Revision: D24711768

Pulled By: gmagogsfm

fbshipit-source-id: 6f9e0a6a73c30f5ac7dafde9082efcc4b725dde1
2020-11-03 23:39:54 -08:00
0ec717c830 Support int32 indices and offsets in nn.EmbeddingBag (#46758)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46758

In general, it's helpful to support int32 indices and offsets, especially when such tensors are large and need to be transferred to accelerator backends. Since it may not be very useful to support the combination of int32 indices and int64 offsets, here we enforce that these two must have the same type.
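
A short usage sketch of what this enables (illustrative; both tensors must share the same integer dtype):

```
import torch
import torch.nn as nn

bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode="sum")
indices = torch.tensor([1, 2, 4, 5, 4, 3], dtype=torch.int32)
offsets = torch.tensor([0, 3], dtype=torch.int32)  # bags: [1, 2, 4] and [5, 4, 3]
out = bag(indices, offsets)                        # shape (2, 4)
```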

Test Plan: unit tests

Reviewed By: ngimel

Differential Revision: D24470808

fbshipit-source-id: 94b8a1d0b7fc9fe3d128247aa042c04d7c227f0b
2020-11-03 23:33:50 -08:00
a2f9c7d4e3 Expose SparseLengthsSum8BitRowwiseSparse to C10 (#47306)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47306

Expose SparseLengthsSum8BitRowwiseSparse to PyTorch, since pt's 8bit embedding doesn't support pruning yet.

It's a temporary solution to unblock the QRT test, and it is not optimal for performance.

Test Plan: ci

Reviewed By: ashishenoyp

Differential Revision: D24709524

fbshipit-source-id: 725dfc9d803e4a555dd71fa5ab75dc175e671563
2020-11-03 22:51:12 -08:00
0cba3e3704 [quant][graphmode][fx] Add support for qat convbn{relu}1d (#47248)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47248

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24696524

fbshipit-source-id: 684db12be201307acbdc89a44192cf2270491dba
2020-11-03 22:43:33 -08:00
3a0024574d Do not delete rpath from torch.dylib on Darwin (#47337)
Summary:
Fixes CI regressions introduced by https://github.com/pytorch/pytorch/issues/47262

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47337

Reviewed By: ngimel

Differential Revision: D24721954

Pulled By: malfet

fbshipit-source-id: 395b037b29c0fc3b62ca50bba9be940ad72e0c5b
2020-11-03 22:36:35 -08:00
53a5f08e0c [quant][eagermode] Avoid inserting fakequant for sigmoid/hardsigmoid/tanh in eval mode (#47297)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47297

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D24708270

fbshipit-source-id: a19b6dbe07d5c80f3cc78a987742d345d86e1cd1
2020-11-03 21:33:35 -08:00
c6fe65bf90 [quant][graphmode][fx][fix] Fix error that DefaultQuantizer is not inserted after a module configured with None qconfig (#47316)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47316

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D24713727

fbshipit-source-id: e604ef2274ff4bb4e8b6ebbb6ba681018e9ae248
2020-11-03 20:08:41 -08:00
dec1c36487 Create prototype for AST rewriter (#47216)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47216

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24687539

Pulled By: ansley

fbshipit-source-id: 421108d066ff93ee18f4312ee67c287ca1cef881
2020-11-03 19:21:58 -08:00
f91fcefc81 [Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks (#47270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47270

This is almost the same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType should be imported conditionally, only when dist.is_available(). Otherwise, this Python enum type defined in caffe2/torch/csrc/distributed/c10d/init.cpp cannot be imported. See https://github.com/pytorch/pytorch/issues/47153

I tried to follow another enum type, ReduceOp, defined in the same file, but that did not work, because the C++ enum class is defined in the torch/lib/c10d library, while BuiltinCommHookType is defined in the torch/csrc/distributed library. These two libraries are compiled in two different ways.

To avoid adding typing to the distributed package, which could be a project of its own, I simply removed the arg type annotation for BuiltinCommHookType in this file.

To review the diff on top of #46959, compare V1 vs Latest:
https://www.internalfb.com/diff/D24700959?src_version_fbid=270445741055617

Main Changes in V1 (#46959):
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115783237

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl

//arvr/projects/eye_tracking/Masquerade:python_test

USE_DISTRIBUTED=0 USE_GLOO=0 BUILD_TEST=0 USE_CUDA=1 USE_MKLDNN=0 DEBUG=0 python setup.py install

Reviewed By: mrshenli

Differential Revision: D24700959

fbshipit-source-id: 69f303a48ae275aa856e6e9b50e12ad8602e1c7a
2020-11-03 18:33:50 -08:00
2652f2e334 Optimize arguments checks (#46661)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46661

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D24453342

Pulled By: izdeby

fbshipit-source-id: 26866fdbc9dc2b5410b3b728b175a171cc6a4521
2020-11-03 17:43:10 -08:00
2caa3bd453 Inlining all non-output buffers, including intermediate buffers. (#47258)
Summary:
This diff enables inlining for all non-output buffers, including the intermediate buffers that are created as part of an op. However, the buffers that correspond to reductions will not be inlined.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47258

Reviewed By: anjali411

Differential Revision: D24707015

Pulled By: navahgar

fbshipit-source-id: ad8b03e38497600cd69980424db6d586bf93db74
2020-11-03 17:00:32 -08:00
464c569dbf [vulkan] Add mean.dim op for vulkan (#47312)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47312

Test Plan:
```
cd ~/pytorch
BUILD_CUSTOM_PROTOBUF=OFF \
  BUILD_TEST=ON \
  USE_EIGEN_FOR_BLAS=OFF \
  USE_FBGEMM=OFF \
  USE_MKLDNN=OFF \
  USE_NNPACK=OFF \
  USE_NUMPY=OFF \
  USE_OBSERVERS=OFF \
  USE_PYTORCH_QNNPACK=OFF \
  USE_QNNPACK=OFF \
  USE_VULKAN=ON \
  USE_VULKAN_API=ON \
  USE_VULKAN_SHADERC_RUNTIME=ON \
  USE_VULKAN_WRAPPER=OFF \
  MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python3 setup.py develop --cmake && ./build/bin/vulkan_api_test
```

Reviewed By: IvanKobzarev

Differential Revision: D24713617

Pulled By: SS-JIA

fbshipit-source-id: 20c0f411fb390ad2114c7deff27cc6fc77448089
2020-11-03 16:45:21 -08:00
9b168a1fed [TensorExpr] Pick meaningful names for functions in TE codegen. (#47255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47255

As a result of this change, the generated CUDA code for the following fusion group:
```
graph(%0 : Float(32, 32, 1, 1, strides=[32, 1, 1, 1], requires_grad=0, device=cuda:0),
      %1 : Float(32, 32, strides=[32, 1], requires_grad=0, device=cuda:0),
      %2 : Float(32, 32, 1, strides=[32, 1, 1], requires_grad=0, device=cuda:0)):
  %3 : int = prim::Constant[value=1]()
  %v1.1 : Float(32, 32, 32, strides=[1024, 32, 1], requires_grad=0, device=cuda:0) = aten::add(%1, %2, %3) # test/test_tensorexpr.py:155:0
  %5 : int = prim::Constant[value=1]()
  %6 : Float(32, 32, 32, 32, strides=[32768, 1024, 32, 1], requires_grad=0, device=cuda:0) = aten::add(%v1.1, %0, %5) # test/test_tensorexpr.py:156:0
  return (%6)
```

Would look like the following:
```
extern "C" __global__
void fused_add_add(float* t0, float* t1, float* t2, float* aten_add) {
{
  float v = __ldg(t1 + 32 * (((512 * blockIdx.x + threadIdx.x) / 32) % 32) + (512 * blockIdx.x + threadIdx.x) % 32);
  float v_1 = __ldg(t2 + ((512 * blockIdx.x + threadIdx.x) / 32) % 32 + 32 * (((512 * blockIdx.x + threadIdx.x) / 1024) % 32));
  float v_2 = __ldg(t0 + ((512 * blockIdx.x + threadIdx.x) / 1024) % 32 + 32 * ((512 * blockIdx.x + threadIdx.x) / 32768));
  aten_add[((((512 * blockIdx.x + threadIdx.x) / 32768) * 32768 + 32 * (((512 * blockIdx.x + threadIdx.x) / 32) % 32)) + 1024 * (((512 * blockIdx.x + threadIdx.x) / 1024) % 32)) + (512 * blockIdx.x + threadIdx.x) % 32] = (v + v_1) + v_2;
}
}
```

Previously we generated:
```
extern "C" __global__
void func(float* t0, float* t1, float* t2, float* aten_add) {
{
  float v = __ldg(t1 + 32 * (((512 * blockIdx.x + threadIdx.x) / 32) % 32) + (512 * blockIdx.x + threadIdx.x) % 32);
  float v_1 = __ldg(t2 + ((512 * blockIdx.x + threadIdx.x) / 32) % 32 + 32 * (((512 * blockIdx.x + threadIdx.x) / 1024) % 32));
  float v_2 = __ldg(t0 + ((512 * blockIdx.x + threadIdx.x) / 1024) % 32 + 32 * ((512 * blockIdx.x + threadIdx.x) / 32768));
  aten_add[((((512 * blockIdx.x + threadIdx.x) / 32768) * 32768 + 32 * (((512 * blockIdx.x + threadIdx.x) / 32) % 32)) + 1024 * (((512 * blockIdx.x + threadIdx.x) / 1024) % 32)) + (512 * blockIdx.x + threadIdx.x) % 32] = (v + v_1) + v_2;
}
}
```

Differential Revision: D24698273

Test Plan: Imported from OSS

Reviewed By: bertmaher

Pulled By: ZolotukhinM

fbshipit-source-id: 6da95c6ac3d5155ebfaaab4f84f55a24deb6d10d
2020-11-03 16:41:22 -08:00
a65e757057 [TensorExpr] CudaCodegen: restart counter for function names unique ID inside each codegen instantiation. (#47254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47254

CUDA codegen used a static global counter for picking names for
functions, but the functions only need to be unique in the scope of the
given codegen. This PR fixes that.

Differential Revision: D24698271

Test Plan: Imported from OSS

Reviewed By: bertmaher

Pulled By: ZolotukhinM

fbshipit-source-id: 516c0087b86b35bbb6ea7c71bb0ed9c3daaca2b8
2020-11-03 16:41:20 -08:00
3161fe6d5a [JIT] SubgraphUtils: add a function for generating a string name for a given graph. (#47253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47253

The function simply goes over all aten nodes in the graph and
concatenates their names, truncating the final name to a given length.

Differential Revision: D24698272

Test Plan: Imported from OSS

Reviewed By: bertmaher

Pulled By: ZolotukhinM

fbshipit-source-id: d6e50194ca5faf0cb61f25af83247b5e40f202e4
2020-11-03 16:36:41 -08:00
7a0f0d24d0 Codegen - error when an argument that looks like an out argument isn't a kwarg (fix #43273) (#47284)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47284

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D24706763

Pulled By: bdhirsh

fbshipit-source-id: 60fbe81a0dff7e07aa8c169235d15b84151d3ed7
2020-11-03 16:30:01 -08:00
a8ef4d3f0b Provide 'out' parameter for 'tensordot' (#47278)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42102

Added an optional `out` parameter to the tensordot operation to allow writing the result into a preallocated buffer.
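
A brief usage sketch of the new `out` argument (illustrative):

```
import torch

a = torch.randn(3, 4, 5)
b = torch.randn(4, 5, 6)
out = torch.empty(3, 6)
torch.tensordot(a, b, dims=([1, 2], [0, 1]), out=out)  # result written into `out`
```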

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47278

Test Plan: pytest test/test_torch.py -k tensordot -v

Reviewed By: agolynski

Differential Revision: D24706258

Pulled By: H-Huang

fbshipit-source-id: eb4bcd114795f67de3a670291034107d2826ea69
2020-11-03 15:56:00 -08:00
31ebac3eb7 [quant] Quantized flip dispatch (#46235)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46235

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24689161

Pulled By: z-a-f

fbshipit-source-id: 6833c2639b29ea5f6c81c880b8928c5a1951c7b8
2020-11-03 15:36:22 -08:00
f41f3e3cd1 Implement bicubic grid sampler (#44780)
Summary:
Fix https://github.com/pytorch/pytorch/issues/44601

I added the bicubic grid sampler on both the CPU and CUDA sides, but not yet for AVX2.

There is a [colab notebook](https://colab.research.google.com/drive/1mIh6TLLj5WWM_NcmKDRvY5Gltbb781oU?usp=sharing) showing some test results. The notebook uses bilinear mode for testing, since I could only use the distributed version of pytorch in it. You can download it and change `mode_torch` to `bicubic` to reproduce the results.

There is some duplicate code for getting and setting values, since the helper functions used in bilinear first clip coordinates beyond the boundary and then get or set the value. In bicubic, however, more points need to be considered. I can refactor that part after making sure the overall calculation is correct.
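
A short usage sketch of the new mode (an identity sampling grid, so the output should closely match the input):

```
import torch
import torch.nn.functional as F

inp = torch.randn(1, 3, 8, 8)
theta = torch.eye(2, 3).unsqueeze(0)  # identity affine transform
grid = F.affine_grid(theta, inp.shape, align_corners=False)
out = F.grid_sample(inp, grid, mode="bicubic", align_corners=False)
```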

Thanks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44780

Reviewed By: mrshenli

Differential Revision: D24681114

Pulled By: mruberry

fbshipit-source-id: d39c8715e2093a5a5906cb0ef040d62bde578567
2020-11-03 15:34:59 -08:00
63978556fd [numpy] torch.a{cosh, sinh} : promote integer inputs to float (#47152)
Summary:
Reference https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47152

Reviewed By: mrshenli

Differential Revision: D24681083

Pulled By: mruberry

fbshipit-source-id: 246e2272536cf912a2575bfaaa831c3eceec034c
2020-11-03 15:26:13 -08:00
2b5433dee6 [Pytorch][Annotation] Update inlined callstack with module instance info (#46729)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46729

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D24493220

Pulled By: cccclai

fbshipit-source-id: f37834157e6f69bbe87f73a7d3d38a94ece6017d
2020-11-03 15:19:02 -08:00
f730f2597e [NNC] Implement Cond in LLVM codegen (#47256)
Summary:
Generate LLVM IR for statements such as
```
if (...) {
   ....
} else {
   ....
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47256

Test Plan: added unit tests to test_llvm.cpp

Reviewed By: nickgg

Differential Revision: D24699080

Pulled By: cheng-chang

fbshipit-source-id: 83b0cebcd242828263eb6052483f0924b5f091ce
2020-11-03 14:46:30 -08:00
8b13ab9370 Event Logging for NCCL Async Error Handling Process Crash (#47244)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47244

This is an event-logging based update that should allow us to collect high-quality data about how many times the NCCL Async Error Handling mechanism is triggered. This logs an event called `ProcessGroupNCCL.WorkNCCL.handleNCCLGuard`, which is recorded as an entry in the `scuba_caffe2_pytorch_usage_stats` Scuba table. This Scuba entry will also contain metadata like workflow status, entitlement, hostnames, and workflow names, which will give us insight into what workloads/domains and machines are benefiting from async error handling. It also contains the Flow Run ID, which can be used as a join key with the `fblearner_workflow_run_status` scuba table for additional information like final error message, etc. We can easily quantify the number of times the async handling code was triggered by querying the `scuba_caffe2_pytorch_usage_stats` table.

As a demonstration, I ran the following workflow with this diff patched: f229675892
Since the workflow above causes a desync, the `handleNCCLGuard` event is logged in scuba soon. See here for the filtered table: https://www.fburl.com/scuba/scuba_caffe2_pytorch_usage_stats/tmp1uvio

As you can see, there are 4 entries. The workflow above uses 3 GPUs, 2 of which run into the desync scenario and are crashed using async error handling. We make this fail twice before succeeding the 3rd time, hence 4 entries.
ghstack-source-id: 115708632

Test Plan: Did a quick demo as described above. Scuba entries with the logs can be found here: https://www.fburl.com/scuba/scuba_caffe2_pytorch_usage_stats/tmp1uvio

Reviewed By: jiayisuse

Differential Revision: D24688739

fbshipit-source-id: 7532dfeebc53e291fbe10d28a6e50df6324455b1
2020-11-03 13:42:42 -08:00
ca61b061f3 Update minimum supported Python version to 3.6.2 (#47314)
Summary:
As typing.NoReturn is used in the codebase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47314

Reviewed By: seemethere

Differential Revision: D24712847

Pulled By: malfet

fbshipit-source-id: f0692d408316d630bc11f1ee881b695437fb47d4
2020-11-03 13:32:07 -08:00
ea93bdc212 Add comment explaining purpose of the accumulate_grad argument (#47266)
Summary:
Addressing a comment from a PR that has already been merged https://github.com/pytorch/pytorch/issues/46855

https://github.com/pytorch/pytorch/pull/46855#discussion_r515161953

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47266

Reviewed By: agolynski

Differential Revision: D24709017

Pulled By: soulitzer

fbshipit-source-id: 3c104c2fef90ffd75951ecef4ae9e938d4b12d8c
2020-11-03 13:18:23 -08:00
dc0d68a1ee [JIT] Print out interface mismatch for prim::ModuleDictIndex (#47300)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47300

**Summary**
This commit augments the module interface subtyping check that is done
before the emission of the `prim::ModuleDictIndex` operator so that the
error message that is printed if the subtyping check fails provides more
information on which methods do not match.

**Test Plan**
Existing unit tests for `prim::ModuleDictIndex`. Compilation of `ModWithWrongAnnotation` now produces this error:
```
Attribute module is not of annotated type __torch__.jit.test_module_containers.ModuleInterface: Method on class '__torch__.jit.test_module_containers.DoesNotImplementInterface' (1) is not compatible with interface '__torch__.jit.test_module_containers.ModuleInterface' (2)
  (1) forward(__torch__.jit.test_module_containers.DoesNotImplementInterface self, Tensor inp) -> ((Tensor, Tensor))
  (2) forward(InterfaceType<ModuleInterface> self, Any inp) -> (Any)
:
```

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D24709538

Pulled By: SplitInfinity

fbshipit-source-id: 6b6cb75e4b2b12b08576a5530b4b90cbcad9b6e5
2020-11-03 13:07:21 -08:00
14194e4f23 Embed libiomp5.dylib into wheel package (#47262)
Summary:
The libiomp runtime is the only external dependency the OS X package has if compiled with MKL.
Copy it to the stage directory from one of the available rpaths,
and remove all absolute rpaths, since the project should have none.

Fixes https://github.com/pytorch/pytorch/issues/38607

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47262

Reviewed By: walterddr

Differential Revision: D24705094

Pulled By: malfet

fbshipit-source-id: 9f588a3ec3c6c836c8986d858fb53df815a506c8
2020-11-03 13:00:30 -08:00
c424d9389e [numpy] torch.a{cos, tan} : promote integer inputs to float (#47005)
Summary:
Reference https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47005

Reviewed By: mrshenli

Differential Revision: D24681097

Pulled By: mruberry

fbshipit-source-id: 2f29655a5f3871ee96c2bfd35c93f4d721730e37
2020-11-03 13:00:24 -08:00
0d00724e36 [numpy] torch.{a}tanh : promote integer inputs to float (#47064)
Summary:
Reference https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47064

Reviewed By: mrshenli

Differential Revision: D24681107

Pulled By: mruberry

fbshipit-source-id: 1818206c854dbce7074363bf6f1949daa7bf6052
2020-11-03 12:56:58 -08:00
c68c3d0a02 [fix] nn.Embedding.from_pretrained : honour padding_idx argument (#47184)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46585 (first snippet)

Now the behaviour of `padding_idx` agrees with the documentation.
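
A brief usage sketch (illustrative):

```
import torch
import torch.nn as nn

weights = torch.tensor([[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]])
emb = nn.Embedding.from_pretrained(weights, padding_idx=0)
print(emb.padding_idx)  # 0 -- gradients for the padding row are never updated
```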

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47184

Reviewed By: mruberry

Differential Revision: D24682567

Pulled By: albanD

fbshipit-source-id: 864bd34eb9099d367a3fcbb8f4f4ba2e2b270724
2020-11-03 12:47:19 -08:00
f276ab55cd Added Kronecker product of tensors (torch.kron) (#45358)
Summary:
This PR adds a function for calculating the Kronecker product of tensors.
The implementation is based on `at::tensordot` with permutations and reshape.
Tests pass.
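
A small worked example (illustrative):

```
import torch

a = torch.tensor([[1, 2], [3, 4]])
b = torch.tensor([[0, 1], [1, 0]])
print(torch.kron(a, b))
# tensor([[0, 1, 0, 2],
#         [1, 0, 2, 0],
#         [0, 3, 0, 4],
#         [3, 0, 4, 0]])
```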

TODO:

- [x] Add more test cases
- [x] Write documentation
- [x] Add entry `common_methods_invokations.py`

Ref. https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45358

Reviewed By: mrshenli

Differential Revision: D24680755

Pulled By: mruberry

fbshipit-source-id: b1f8694589349986c3abfda3dc1971584932b3fa
2020-11-03 12:41:41 -08:00
32b66b0851 reorganize sparse_nn_partition (#47283)
Summary:
This PR moves combine_partitions_based_on_size and find_partition_to_combine_based_on_size into sparse_nn_partition, since they are both only used by sparse_nn_partition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47283

Reviewed By: gcatron

Differential Revision: D24707864

Pulled By: scottxu0730

fbshipit-source-id: 183fe945e477e16301d7f489103287eb9d8a30af
2020-11-03 12:36:36 -08:00
774b638eb6 Change largeCUDATensorTest to largeTensorTest+onlyCUDA; add a buffer to large cuda tensor test (#45332)
Summary:
Effectively, `largeCUDATensorTest` = `largeTensorTest` + `onlyCUDA`.

There was a problem where a user got an OOM for a `largeCUDATensorTest('16GB')` on a 16GB V100. This decorator was checking the total memory of a GPU device; however, in most cases we can't allocate all of the memory that a GPU has. So it is beneficial to have a buffer on the `largeTensorTest` check for CUDA; I added a 10% buffer.

Definition of `largeTensorTest`

d22dd80128/torch/testing/_internal/common_device_type.py (L560-L578)

`_has_sufficient_memory`

d22dd80128/torch/testing/_internal/common_device_type.py (L535-L557)

`largeCUDATensorTest`

d22dd80128/torch/testing/_internal/common_device_type.py (L526-L532)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45332

Reviewed By: ngimel

Differential Revision: D24698690

Pulled By: mruberry

fbshipit-source-id: a77544478e45ce271f6639ea04e87700574ae307
2020-11-03 11:43:49 -08:00
4e6f2440d8 Optimize backward for torch.repeat (#46726)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46726

Fixes #43192

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D24481801

Pulled By: ejguan

fbshipit-source-id: 95c155e0de83b71f173c9135732ea84ba6399d69
2020-11-03 11:16:55 -08:00
9c3a75527b Update doc to reflect current behavior (#46937)
Summary:
This behavior was changed as a side effect of https://github.com/pytorch/pytorch/pull/41984.
Update the doc to reflect the actual behavior of the function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46937

Reviewed By: mruberry

Differential Revision: D24682750

Pulled By: albanD

fbshipit-source-id: 89b94b61f54dbcfc6a6988d7e7d361bd24ee4964
2020-11-03 11:02:19 -08:00
782f92b569 fix windows CI passed incorrectly (#47105)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47103 and fixes https://github.com/pytorch/pytorch/issues/45864
1. Make the shell script fail safe.
2. Disable the failing test on Windows.
3. Verified via [link](https://app.circleci.com/pipelines/github/pytorch/pytorch/233616/workflows/e33286c1-f5e2-4cf2-82ca-ef4f54dfa495/jobs/8608415/tests).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47105

Reviewed By: samestep

Differential Revision: D24648414

Pulled By: walterddr

fbshipit-source-id: 1977007c2c7e8043efc590eb7261956a44e8f9ab
2020-11-03 10:29:26 -08:00
8c865493c6 Automated submodule update: FBGEMM (#47263)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 8eb6dcb23e

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47263

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: malfet

Differential Revision: D24701113

fbshipit-source-id: 92ab4ae93c4d0753ee3d6590e5616fc8cd6082a0
2020-11-03 10:24:43 -08:00
9e58c85d08 [ROCm] remove use of HIP_PLATFORM (#47241)
Summary:
Fixes deprecated use of the HIP_PLATFORM env var. This env var no longer needs to be set explicitly; instead, HIP_PLATFORM is automatically detected by hipcc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47241

Reviewed By: mruberry

Differential Revision: D24699982

Pulled By: ngimel

fbshipit-source-id: 9cd2f32e7c0c8d662832b0cbbc2988835a45961a
2020-11-03 09:54:44 -08:00
579cfc6641 Moving test order to rebalance test1 and test2 times (#47290)
Summary:
The ASAN time split between test1 and test2 is very unbalanced right now; this moves some heftier tests (test_nn and test_quantization) into shard 2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47290

Reviewed By: malfet

Differential Revision: D24706877

Pulled By: janeyx99

fbshipit-source-id: 35069d1e425857f85775f9be76501d6a158e0376
2020-11-03 09:39:29 -08:00
5c8896f8ad Delete CUDA build rules from MacOS build (#47277)
Summary:
Also remove the MAX_JOBS constraint, since the OOM warning was about nvcc rather than clang.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47277

Reviewed By: walterddr

Differential Revision: D24705180

Pulled By: malfet

fbshipit-source-id: 25fd0161de3f7e14a2a4db86cbea8357cdc69e06
2020-11-03 09:01:12 -08:00
c05ee86edd Fix return-type-is-always-copy warning (#47279)
Summary:
`std::vector<bool>` cannot return values by reference, since its elements are stored as packed bits.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47279

Reviewed By: glaringlee

Differential Revision: D24705188

Pulled By: malfet

fbshipit-source-id: 96e71cc4b9881f92af3b4a508d397deab6d68174
2020-11-03 08:53:24 -08:00
a341a4329a Format error message for unmatched signature between _out and base functions (#47087)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47087

Fixes #33547

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D24633077

Pulled By: ejguan

fbshipit-source-id: d1baca84cb3bc415cced9b696103f17131e1e4c7
2020-11-03 07:36:37 -08:00
73e121de1c [GPU] Enable optimize_for_metal in fbcode (#47102)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47102

Since the current mobile end-to-end workflow involves using `optimize_for_mobile` in Python, the goal here is to be able to use `optimize_for_mobile(m, backend="metal")` in fbcode.
ghstack-source-id: 115749752

Test Plan:
1. Be able to export models for metal (see the next diff)
2. Make sure the change won't break the OSS workflow
3. Make sure the change won't break on the mobile bulild.

Reviewed By: xcheng16

Differential Revision: D24644422

fbshipit-source-id: bd77e22f0799533a96d048207932055fd051a67e
2020-11-03 00:58:55 -08:00
ad3a3bd0d6 [GPU] Add an attribute to the torchscript model exported by metal (#47174)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47174

As title
ghstack-source-id: 115747991

Test Plan: Sandcastle

Reviewed By: kimishpatel

Differential Revision: D24616430

fbshipit-source-id: 2ccd264688471788f0dfea8bdc234fa69d39817f
2020-11-03 00:54:19 -08:00
0ead9d545a [quant][graphmode][fx] Add test for non quantized embedding and embeddingbag (#47092)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47092

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D24637423

fbshipit-source-id: baaa431931242072edd9519a3393efba7469da6f
2020-11-02 23:56:43 -08:00
4df7eefa06 [TensorExpr] Support LLVM versions 8 through 12 (#47033)
Summary:
Adjust llvm_{codegen, jit}.cpp to support LLVM versions 8 through 12.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47033

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.LLVM*

Reviewed By: bertmaher

Differential Revision: D24689903

Pulled By: asuhan

fbshipit-source-id: 2654bb7eb2ab6a95a5527c079b07ed8552c51bde
2020-11-02 22:32:11 -08:00
ac8a8185eb expose Timer docs to PyTorch website. (#46880)
Summary:
CC: gchanan jspisak seemethere

I previewed the docs and they look reasonable. Let me know if I missed anything.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46880

Reviewed By: seemethere, izdeby

Differential Revision: D24551503

Pulled By: robieta

fbshipit-source-id: 627f73d3dd4d8f089777bca8653702735632b9fc
2020-11-02 21:59:29 -08:00
09a52676ad Add NestedTensor specific dispatch key to PyTorch (#44668)
Summary:
This adds a dedicated dispatch key for the [nestedtensor project](https://github.com/pytorch/nestedtensor).

- [ ] Since this isn't a device or a backend, does this need further updates in other places other than DispatchKey.h?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44668

Reviewed By: zhangguanheng66, ailzhang

Differential Revision: D23998801

Pulled By: cpuhrsch

fbshipit-source-id: 133b5a9a04c4f61c27c0728832da09e4b38a5939
2020-11-02 21:35:54 -08:00
1fe273d798 add node by node cost function (#47009)
Summary:
This PR adds a node-by-node cost function. Given a partition of nodes, the get_latency_of_one_partition function finds the critical path in the partition and returns its latency. A unit test is also provided: a graph module is partitioned into two partitions, and the latency of each partition is checked.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47009

Reviewed By: gcatron

Differential Revision: D24692542

Pulled By: scottxu0730

fbshipit-source-id: 64c20954d842507be0d1afa2516d88f705e11224
2020-11-02 21:15:43 -08:00
084b71125f Fix bug in toComplexWithDefault (#43841)
Summary:
I don't think this method is used anywhere, so I don't know how to test it. But the diff should justify itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43841

Reviewed By: mruberry

Differential Revision: D24696505

Pulled By: anjali411

fbshipit-source-id: f2a249ae2e078b16fa11941a048b7d093e60241b
2020-11-02 21:07:08 -08:00
b1b77148ac Back out "[Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks" (#47234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47234

Revert the diff because of https://github.com/pytorch/pytorch/issues/47153

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115720415

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D24691866

fbshipit-source-id: 58fe0c45943a2ae2a09fe5d5eac4a4d947586539
2020-11-02 20:51:18 -08:00
2cff3bba58 [vulkan_api][ops] Mm, Pool, Upsample (#47063)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47063

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D24624610

Pulled By: IvanKobzarev

fbshipit-source-id: b6cf555506ea0e2426fa77c53b9d25ffb95d5bbc
2020-11-02 19:02:30 -08:00
b0e954fff5 quantize_tensor_per_channel ARM implementation (#46018)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46018

Currently on mobile devices quantize_tensor has a vectorized implementation
using ARM intrinsics; however quantize_tensor_per_channel does not.

Test Plan:
Build for ARM Neon
```
BUILD_MOBILE_BENCHMARK=1 BUILD_MOBILE_TEST=1 ANDROID_DEBUG_SYMBOLS=1 BUILD_PYTORCH_MOBILE=1 ANDROID_ABI="armeabi-v7a with NEON" ./scripts/build_android.sh  -DANDROID_CCACHE=$(which ccache) -DBUILD_BINARY=ON
```
Build for ARM64
```
BUILD_MOBILE_BENCHMARK=1 BUILD_MOBILE_TEST=1 ANDROID_DEBUG_SYMBOLS=1 BUILD_PYTORCH_MOBILE=1 ANDROID_ABI=arm64-v8a ./scripts/build_android.sh  -DANDROID_CCACHE=$(which ccache) -DBUILD_BINARY=ON
```
Then run the benchmark binary over adb shell. Note that the Android CPU is not frequency-locked by default, which can lead to noisy benchmark results; this can be changed by running the following for every CPU.
```
adb shell "echo userspace > /sys/devices/system/cpu/${cpu}/cpufreq/scaling_governor"
adb shell "echo '2000000' > /sys/devices/system/cpu/${cpu}/cpufreq/scaling_setspeed"
adb push build_android/bin/quantize_per_channel /data/local/tmp/
adb shell "/data/local/tmp/quantize_per_channel"
```

Resulting benchmarks are located [here](https://gist.github.com/AJLiu/d1711bb6a5e93b3338eca2c14c8aec9f)
Google spreadsheet comparing results [here](https://docs.google.com/spreadsheets/d/1Ky-rEu2CqOqex2a84b67hB1VLAlfEDgAN2ZXe8IlGF8/edit?usp=sharing)

Reviewed By: kimishpatel

Differential Revision: D24286528

fbshipit-source-id: 5481dcbbff8345a2c0d6cc9b7d7f8075fbff03b3
2020-11-02 18:31:19 -08:00
ecfa7a27b8 [jit] fix traced training attribute (#47211)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47211

The attribute is getting shadowed by the default one set on all modules,
and the __setattr__ on the TracedModule object prevents setting it correctly.

    import torch

    inp = torch.zeros(1, 3, 224, 224)
    model = torch.hub.load('pytorch/vision:v0.6.0', 'mobilenet_v2', pretrained=True)
    model.eval()
    print(model.training)
    with torch.no_grad():
        traced = torch.jit.trace(model, inp)
    print(traced.training)
    traced.eval()
    print(traced.training)
    traced.training = False
    print(traced.training)
    torch.jit.freeze(traced)

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D24686690

Pulled By: zdevito

fbshipit-source-id: 9c1678dc68e9bf83176e9f5a20fa8f6bff5d69a0
2020-11-02 17:28:49 -08:00
27f4a78bb8 Add benchmark for per channel tensor quantization (#46017)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46017

Currently on mobile, only per-tensor quantization is optimized using ARM intrinsics. This benchmark is
added to help gauge the performance improvement on mobile after performing the same optimizations for per-channel quantization.

Test Plan:
Build for ARM Neon
```
BUILD_MOBILE_BENCHMARK=1 BUILD_MOBILE_TEST=1 ANDROID_DEBUG_SYMBOLS=1 BUILD_PYTORCH_MOBILE=1 ANDROID_ABI="armeabi-v7a with NEON" ./scripts/build_android.sh  -DANDROID_CCACHE=$(which ccache) -DBUILD_BINARY=ON
```
Build for ARM64
```
BUILD_MOBILE_BENCHMARK=1 BUILD_MOBILE_TEST=1 ANDROID_DEBUG_SYMBOLS=1 BUILD_PYTORCH_MOBILE=1 ANDROID_ABI=arm64-v8a ./scripts/build_android.sh  -DANDROID_CCACHE=$(which ccache) -DBUILD_BINARY=ON
```
Then run the benchmark binary over adb shell. Note that the Android CPU is not frequency-locked by default, which can lead to noisy benchmark results; this can be changed by running the following for every CPU.
```
adb shell "echo userspace > /sys/devices/system/cpu/${cpu}/cpufreq/scaling_governor"
adb shell "echo '2000000' > /sys/devices/system/cpu/${cpu}/cpufreq/scaling_setspeed"
adb push build_android/bin/quantize_per_channel /data/local/tmp/
adb shell "/data/local/tmp/quantize_per_channel"
```

Reviewed By: kimishpatel

Differential Revision: D24286488

fbshipit-source-id: 1e7942f0bb3d9d1fe172409d522be9f351a485bd
2020-11-02 17:11:16 -08:00
82b74bd929 For torch::jit::module's attr method to moble::module (#47059)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47059

This diff adds an attr getter to mobile::module, similar to the TorchScript module's at https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/api/object.h#L75-L83.

Test Plan: LiteInterpreterTest::CheckAttrAccess

Reviewed By: xta0

Differential Revision: D24604950

fbshipit-source-id: cfac187f47f5115807dc119fe6c203f60dbd5dff
2020-11-02 16:38:12 -08:00
b6685d3863 [PT] optional -> c10::optional (#47144)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47144

Change `optional` to `c10::optional` to avoid conflicting with `std::optional` in files that happen to include both.

Test Plan: contbuild

Reviewed By: yinghai

Differential Revision: D24662515

fbshipit-source-id: 1e72fbc791d585e797a7239305ab5e3f82ddfec9
2020-11-02 16:33:36 -08:00
be2e3dd2a1 [quant][graphmode][fx][fix] Linear work with float_qparam_dynamic_qconfig (#47068)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47068

Filter the dtype config before performing the quantization in linear

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D24627907

fbshipit-source-id: 162fa47b3fcf6648049f8bc0438e41ee97ac19e9
2020-11-02 16:28:33 -08:00
cedeee2cd4 Add scalar.conj() and update backward formulas for add and sub (#46596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46596

1. Added `conj` method for scalar similar to numpy.
2. Updates backward formulas for add and sub to work correctly for R -> C cases and for the case when alpha is complex.
3. Enabled complex backward for nonzero (no formula update needed).

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24529227

Pulled By: anjali411

fbshipit-source-id: da871309a6decf5a4ab5c561d5ab35fc66b5273d
2020-11-02 16:17:00 -08:00
86151da19e Port CPU Trace from TH to ATen (#47126)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47126

Context
-------
This PR is a rebase of shihongzhi's https://github.com/pytorch/pytorch/pull/35360.
I forgot to merge it back when it was submitted so I rebased it and ran new benchmarks on it.

Benchmarks
----------

TL;DR: The op has more overhead than the TH version but for larger shapes the overhead disappears.

```
import torch

shapes = [
    [1, 1],
    [100, 100],
    [1000, 1000],
    [10000, 10000],
    [100000, 100000],
]

for shape in shapes:
    x = torch.ones(shape)
    %timeit x.trace()

Before:
1.83 µs ± 42.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.98 µs ± 48.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.19 µs ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
85.2 µs ± 700 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.23 ms ± 4.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

After:
2.16 µs ± 325 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
2.08 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.45 µs ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
81.8 µs ± 766 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.27 ms ± 6.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

Future work
-----------
Things that can be done after this PR:
- add complex tensor support
- Fix the type promotion discrepancy between CPU and CUDA

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24683259

Pulled By: zou3519

fbshipit-source-id: f92b566ad0d58b72663ab64899d209c96edb78eb
2020-11-02 16:03:22 -08:00
8054ae3e77 Add test for trace (#47125)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47125

We didn't actually have any tests for torch.trace. The tests expose a
discrepancy between the behavior of torch.trace on CPU and CUDA that
I'll file an issue for.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24683260

Pulled By: zou3519

fbshipit-source-id: 71dd3af62bc98c6b9b0ba2bf2923cb6d44daa640
2020-11-02 16:00:33 -08:00
f58842c214 Enable inlining into reductions (#47020)
Summary:
This diff enables inlining producers into reductions. It also guards against inlining reductions themselves.

Prior to this diff, if there was a reduction in the loopnest, no inlining was happening. After this change, we will inline all non-output buffers that do not correspond to a reduction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47020

Reviewed By: albanD

Differential Revision: D24644346

Pulled By: navahgar

fbshipit-source-id: ad234a6877b65be2457b734cbb7f3a1800baa6a5
2020-11-02 15:33:38 -08:00
b5a1be02a0 Add RAII DetectAnomalyGuard (#47164)
Summary:
This is a followup to the C++ anomaly detection mode, implementing the guard.
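
For reference, a minimal sketch of the Python-side analogue that the C++ guard mirrors (hedged; not code from this PR):

```python
import torch

# In Python, anomaly detection is scoped with a context manager; the new
# C++ RAII guard provides the same scoping on the C++ side.
with torch.autograd.detect_anomaly():
    x = torch.randn(2, requires_grad=True)
    (x * x).sum().backward()
```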

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47164

Reviewed By: mruberry

Differential Revision: D24682574

Pulled By: albanD

fbshipit-source-id: b2224a56bf6eca0b90b8e10ec049cbcd5af9d108
2020-11-02 15:07:59 -08:00
ebf36ad3da Remove travis-python references as well as some unnecessary dependencies (#47209)
Summary:
This PR attempts to remove unneeded installations of `pip` among other packages in `install_base.sh` since these very same packages are already installed elsewhere (for example in `install_conda.sh`).

In the process, I found some old `TRAVIS_PYTHON_VERSION` references that are no longer needed, so I removed all references that need `install_travis_python.sh`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47209

Reviewed By: mruberry

Differential Revision: D24690079

Pulled By: janeyx99

fbshipit-source-id: f8fef4cda9832c868595d4745d811fc7d42df34d
2020-11-02 15:01:05 -08:00
42b6f96764 Make "Run flake8" step always succeed again (#47236)
Summary:
In https://github.com/pytorch/pytorch/issues/46990 I asked whether the "Run flake8" step was supposed to always succeed (so that the "Add annotations" step would be sure to run). The reviewers and I weren't sure of the answer to that question, so we merged it anyway, but that turned out to be wrong: https://github.com/pytorch/pytorch/runs/1327599980 So this PR fixes that issue introduced by https://github.com/pytorch/pytorch/issues/46990.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47236

Reviewed By: janeyx99

Differential Revision: D24692359

Pulled By: samestep

fbshipit-source-id: c12382de6945245d6251ce792896e5e688f480af
2020-11-02 14:53:38 -08:00
f5073b0c5a Add inputs argument to autograd.backward() (#46855)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46373

As noted in https://github.com/pytorch/pytorch/issues/46373, there needs to be a flag passed into the engine that indicates whether it was executed through the backward api or grad api. Tentatively named the flag `accumulate_grad` since functionally, backward api accumulates grad into .grad while grad api captures the grad and returns it.

Moving changes not necessary to the python api (cpp, torchscript) to a new PR.
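
For illustration, a minimal sketch of the new `inputs` argument (tensor names are illustrative):

```python
import torch

x = torch.randn(3, requires_grad=True)
y = torch.randn(3, requires_grad=True)
loss = (x * y).sum()

# Accumulate only into x.grad, even though y also requires grad.
torch.autograd.backward(loss, inputs=[x])
print(x.grad)  # populated
print(y.grad)  # None
```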

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46855

Reviewed By: ngimel

Differential Revision: D24649054

Pulled By: soulitzer

fbshipit-source-id: 6925d5a67d583eeb781fc7cfaec807c410e1fc65
2020-11-02 14:32:38 -08:00
18470f68bc Fix max_pool1d on discontiguous tensor (#47065)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47065

Fixes https://github.com/pytorch/pytorch/issues/47054
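
A minimal repro sketch of the class of inputs affected (hedged; the shapes are illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 4, 10)[:, :, ::2]  # step slicing makes x non-contiguous
out = F.max_pool1d(x, kernel_size=2)
# After the fix this matches pooling over a contiguous copy.
assert torch.equal(out, F.max_pool1d(x.contiguous(), kernel_size=2))
```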

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D24633342

Pulled By: heitorschueroff

fbshipit-source-id: b318f3a4fe68e538c71b147a82b62367f23146fa
2020-11-02 14:21:31 -08:00
b3eb0c86cf Revert D24335982: explicitly error out in comparison ops when the types don't match
Test Plan: revert-hammer

Differential Revision:
D24335982 (60fea510a1)

Original commit changeset: 3dfb02bcb403

fbshipit-source-id: 00072f1b00e228bbbe295053091cf4a7a46f4668
2020-11-02 14:08:01 -08:00
7f125bca1c [Metal] Add pin_memory check in empty_strided (#47228)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47228

Add a check that `pin_memory`, if specified, is set to `False`
ghstack-source-id: 115715087

Test Plan:
- CircleCI
- Sandcastle

Reviewed By: IvanKobzarev

Differential Revision: D24690472

fbshipit-source-id: c65fc494fcd7b0b409a80c86e108a029ca7fd71e
2020-11-02 14:00:12 -08:00
e03820651a Make conversions explicit (#46835)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46835

We make explicit a couple of previously implicit down/narrowing conversions. This fixes a couple of compiler warnings.

Test Plan: Standard pre-commit test rig.

Reviewed By: ngimel

Differential Revision: D24481427

fbshipit-source-id: 8c9b0215a662ccdef8e2ba3df5f78ef110071f7b
2020-11-02 13:54:00 -08:00
22b3d414de Enhance the torch.pow testcase for the complex scalar base (#47101)
Summary:
Related https://github.com/pytorch/pytorch/issues/45259

This PR is to address the https://github.com/pytorch/pytorch/pull/45259#discussion_r514390664

- leverage the `make_tensor` function to generate a random tensor as the exponent, preventing an all-zero integer exponent.
- add some special cases for zero exponents and the `1 + 0j` base (see the sketch below).
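
A hedged sketch of the special cases named above (assumed semantics; the values are illustrative):

```python
import torch

# Zero exponents: any base (including a complex one) to the 0th power is 1.
print(torch.pow(2 - 3j, torch.zeros(3)))            # [1+0j, 1+0j, 1+0j]

# The 1 + 0j base stays 1 + 0j for any exponent.
print(torch.pow(1 + 0j, torch.tensor([0.5, 2.0])))  # [1+0j, 1+0j]
```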

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47101

Reviewed By: mruberry

Differential Revision: D24682430

Pulled By: zou3519

fbshipit-source-id: f559dc0ba08f37ae070036fb25a52ede17a24149
2020-11-02 13:13:15 -08:00
9b52654620 annotate a few torch.nn.modules.* modules (#45772)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45771

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45772

Reviewed By: mruberry

Differential Revision: D24682013

Pulled By: albanD

fbshipit-source-id: e32bc4fe9c586c079f7070924a874c70f3d127fa
2020-11-02 13:04:59 -08:00
7178790381 Add vulkan clamp op (#47196)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47196

Added vulkan clamp ops

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D24688470

Pulled By: SS-JIA

fbshipit-source-id: b74d6718811972904816441e93515a982a518fd9
2020-11-02 12:48:04 -08:00
96b23f7db1 add sandcastle device type test base discovery (#47119)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47119

Test Plan:
tests
1. test_cuda still works:
`buck test --no-cache -c test.external_runner=tpx mode/dev-nosan //caffe2/test:cuda -- --use-remote-execution --force-tpx`
2. test_torch is blocked on D24623962
`buck test --no-cache -c test.external_runner=tpx mode/dev-nosan //caffe2/test:torch -- --use-remote-execution --force-tpx`

Reviewed By: mruberry

Differential Revision: D24649868

fbshipit-source-id: 97cb41996ea0c37a66a4bf2154e254d2d2912a17
2020-11-02 12:22:30 -08:00
70d58031d7 [c10] make intrusive_ptr available as a pybind holder type (#44492)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44492

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D23632278

Pulled By: wanchaol

fbshipit-source-id: b9796e15074d68a347de443983abf7f052a3cdfe
2020-11-02 12:11:45 -08:00
6852cbb952 [RFC] Better error message in case operator could not be run (#46885)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46885

There seem to be a lot of situations where users are running into problems with missing operators (these users are mostly within FB for now since the mobile focus is currently on internal use cases). To avoid wasting their time, we would very much like to point them in a more actionable direction. This diff attempts to do just that.

Please find additional context at: https://fb.workplace.com/groups/894363187646754/permalink/1081939438889127/

Previous message:

```
Could not run 'aten::less_equal.Scalar' with arguments from the 'CPU' backend.
```

New message:

```
Could not run 'aten::less_equal.Scalar' with arguments from the 'CPU' backend.
This could be because the operator doesn't exist for this backend, or was
omitted during the selective/custom build process (if using custom build).
If you are a Facebook employee using PyTorch on mobile, please visit
https://fburl.com/ptmfixes for possible resolutions.
```
ghstack-source-id: 115691682

Test Plan: Sandcastle

Reviewed By: ezyang

Differential Revision: D24552243

fbshipit-source-id: fb78b1ab2c1fa0e1faf5537cbf0575256391f081
2020-11-02 11:55:34 -08:00
c5ae875179 Add bfloat support for torch.randn and torch.norm (#47143)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47143

Reviewed By: pbelevich

Differential Revision: D24664407

Pulled By: malfet

fbshipit-source-id: c63ff1cbb812751aba4c56e64e6ee1008cfc2d7f
2020-11-02 11:49:21 -08:00
60fea510a1 explicitly error out in comparison ops when the types don't match (#46399)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46399

Explicitly error out in comparison/logical ops when the dtypes of the various input/output tensors don't match. See [this comment](https://github.com/pytorch/pytorch/pull/46399#discussion_r505686406) for more details.

fixes #42660

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24335982

Pulled By: bdhirsh

fbshipit-source-id: 3dfb02bcb403dda5bcbf5ed3eae543354ad698b2
2020-11-02 11:42:32 -08:00
6e22b6008d [MLF] Allow for computing prune quantile thresholds on absolute value of indicators in distributed-inference-compatible embedding LUT pruning (#46789)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46789

1. Now `SelfBinningHistogram` can calculate the binning histogram using the absolute values from the given an array of values.
2. Update the invocation of `SelfBinningHistogram` in `post_training_prune`.

Test Plan:
1. [buck test caffe2/caffe2/python/operator_test:self_binning_histogram_test](https://www.internalfb.com/intern/testinfra/testconsole/testrun/6473924488326108/)
2. [buck test dper3/dper3_backend/delivery/tests:post_training_prune_test](https://www.internalfb.com/intern/testinfra/testconsole/testrun/2251799854023163/)

Reviewed By: hwangjeff

Differential Revision: D24494097

fbshipit-source-id: 95e47137b25746e686ef9baa9409560af5d58fc1
2020-11-02 11:31:31 -08:00
6906701bde [ROCm] enable stream priorities (#47136)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47136

Reviewed By: mruberry

Differential Revision: D24672457

Pulled By: ngimel

fbshipit-source-id: 54f60c32df87cbd40fccd7fb1ecf0437905f01a3
2020-11-02 11:25:44 -08:00
c2e123331a Check CUDA kernel launches (fbcode/caffe2/aten/src/ATen/native/cuda/) (#47207)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47207

Add a safety check TORCH_CUDA_KERNEL_LAUNCH_CHECK() after each kernel launch. This diff only changes the files inside the directory fbcode/caffe2/aten/src/ATen/native/cuda/. Will create similar diffs per directory.

Test Plan:
Run test with:
```
buck build //caffe2/aten:ATen-cu
```

Results:
```
[fjponce@30644.od ~/fbcode (c32bce1c)]$ buck build //caffe2/aten:ATen-cu
Building: finished in 0.8 sec (100%) 1/1 jobs, 0 updated
  Total time: 1.0 sec
More details at https://www.internalfb.com/intern/buck/build/c8d463e5-2d8b-4566-97f0-2d355eda8f2d
[fjponce@30644.od ~/fbcode (b78b1f2d)]$
```

The files no longer appear in the list when executing the Python script
https://www.internalfb.com/intern/paste/P147803236/

Reviewed By: r-barnes

Differential Revision: D24685062

fbshipit-source-id: 6ef7989d28b6629752d98dc36dd4a92c2507204c
2020-11-02 11:09:36 -08:00
0d6bf8864b add rocm 3.9 to nightly builds (#47121)
Summary:
Corresponding pytorch builder repo update: https://github.com/pytorch/builder/pull/561.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47121

Reviewed By: samestep

Differential Revision: D24660850

Pulled By: walterddr

fbshipit-source-id: 68b22e0a2d341396eb1cdcfaa0a413ce7ad93ca3
2020-11-02 10:18:45 -08:00
da26858c9c Add complex backward support for torch.exp (#47194)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47194

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D24683201

Pulled By: anjali411

fbshipit-source-id: c447dec51cbfe7c09d6943fbaafa94f48130d582
2020-11-02 09:39:44 -08:00
c10aa44e33 Back out "Providing more information while crashing process in async error handling" (#47185)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47185

Original commit changeset: 02d48f13352a

Test Plan: CI

Reviewed By: mruberry

Differential Revision: D24682055

fbshipit-source-id: 060efa29eb2f322971848ead447021f6972cb3f3
2020-11-02 08:34:30 -08:00
85e5b76f17 Automated submodule update: FBGEMM (#47190)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 5b7566f412

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47190

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D24682770

fbshipit-source-id: da11d6039c3e158444253d3c6237e3ee71d5afb5
2020-11-02 07:51:50 -08:00
1cc1da5411 LayerNormInt8QuantizeFakeNNPI fix to match ICEREF. (#47140)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47140

LayerNorm + Int8Quantize fix to match ICEREF.

(Note: this ignores all push blocking failures!)

Test Plan:
buck test --debug //caffe2/caffe2/contrib/fakelowp/test:test_layernorm_nnpi_fp16nnpi -- test_fused_ln_quantize --print-passing-details

https://internalfb.com/intern/testinfra/testrun/7881299371969005

Reviewed By: hyuen

Differential Revision: D24659904

fbshipit-source-id: 026d1a1f69a68eca662a39752af5ab0756bace2d
2020-11-01 14:31:38 -08:00
19ede75eb9 [JIT] Enable ModuleDict non-literal indexing (#45716)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45716

**Summary**
This commit enables indexing into `ModuleDict` using a non-literal
index if the `ModuleDict` is annotated with `Dict[str, X]`, where `X` is
a module interface type. These annotations must be expressed using a
class attribute named `__annotations__`, which is a `Dict[str, Type]`
where the keys are the names of module attributes and the values are
their types.

The approach taken by this commit is that these annotations are stored
as "hints" along with the corresponding module attributes in the
`ConcreteSubmoduleTypeBuilder` instance for each module (which might be
a `ModuleDict`). These hints are passed into the `ModuleValue` that is
created for desugaring operations on submodules so that indexing into a
`ModuleDict` can be emitted as a getitem op into a dict emitted into the
graph that represents the `ModuleDict`.
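
A hedged sketch of the usage this enables, based on the description above (the interface and module names here are illustrative, not from this commit):

```python
import torch
from typing import Dict

@torch.jit.interface
class OpInterface(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pass

class Double(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * 2

class Halve(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x / 2

class Container(torch.nn.Module):
    # The hint: every value in `ops` implements OpInterface.
    __annotations__ = {"ops": Dict[str, OpInterface]}

    def __init__(self):
        super().__init__()
        self.ops = torch.nn.ModuleDict({"double": Double(), "halve": Halve()})

    def forward(self, key: str, x: torch.Tensor) -> torch.Tensor:
        return self.ops[key](x)  # non-literal index, enabled by the annotation

scripted = torch.jit.script(Container())
print(scripted("double", torch.ones(2)))
```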

**Test Plan**
This commit adds unit tests to `TestModuleContainers` to test this
feature (`test_typed_module_dict`).

Differential Revision: D24070606

Test Plan: Imported from OSS

Reviewed By: ansley

Pulled By: SplitInfinity

fbshipit-source-id: 6019a7242d53d68fbfc1aa5a49df6cfc0507b992
2020-10-31 21:36:23 -07:00
317b78d56e Revert D24665950: Create prototype for AST rewriter
Test Plan: revert-hammer

Differential Revision:
D24665950 (54feb00bbd)

Original commit changeset: b72110436126

fbshipit-source-id: 961412df006acd33c91a745c809832d5c6494c76
2020-10-31 18:07:10 -07:00
54feb00bbd Create prototype for AST rewriter (#46410)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46410

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D24665950

Pulled By: ansley

fbshipit-source-id: b72110436126a24ddc294b8ee7b3f691281c1f1b
2020-10-31 10:51:17 -07:00
ee0033af9b [Gradient Compression] Surface C++ comm hooks to Python API as built-in comm hooks (#46959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46959

1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set,  a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the builit-in comm hooks.

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115629230

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl

Reviewed By: pritamdamania87

Differential Revision: D24471910

fbshipit-source-id: f96b752298549ea2067e2568189f1b394abcd99a
2020-10-30 23:19:42 -07:00
e3f912e8b7 Revert D24655999: [fbcode] Make model reader utilities.
Test Plan: revert-hammer

Differential Revision:
D24655999 (7f056e99dd)

Original commit changeset: 5095ca158d89

fbshipit-source-id: c43f672def7331667421e01b90f979940366e3c9
2020-10-30 21:00:42 -07:00
7f056e99dd [fbcode] Make model reader utilities.
Summary:
For some of the end-to-end flow projects, we need the ability to read module information during model validation or model publishing.
This diff creates model_reader.py with utilities for reading model content, including the following functionality:
1. read the model bytecode version;
2. check if a model is lite PyTorch script module;
3. check if a model is PyTorch script module.

Test Plan:
```
[xcheng16@devvm1099]/data/users/xcheng16/fbsource/fbcode% buck test pytorch_mobile/utils/tests:mobile_model_reader_tests
Processing filesystem changes: finished in 1.5 sec
Parsing buck files: finished in 1.6 sec
Building: finished in 4.9 sec (100%) 9249/43504 jobs, 2 updated
  Total time: 6.5 sec
More details at https://www.internalfb.com/intern/buck/build/6d0e2c23-d86d-4248-811f-31cb1aa7eab3
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 2ffccd62-ece5-44b5-8350-3a292243fad9
Trace available for this run at /tmp/tpx-20201030-122220.664763/trace.log
Started reporting to test run: https://our.intern.facebook.com/intern/testinfra/testrun/3940649711969390
    ✓ ListingSuccess: pytorch_mobile/utils/tests:mobile_model_reader_tests - main (10.234)
    ✓ Pass: pytorch_mobile/utils/tests:mobile_model_reader_tests - test_is_pytorch_lite_module (pytorch_mobile.utils.tests.test_model_reader.TestModelLoader) (7.039)
    ✓ Pass: pytorch_mobile/utils/tests:mobile_model_reader_tests - test_is_pytorch_script_module (pytorch_mobile.utils.tests.test_model_reader.TestModelLoader) (7.205)
    ✓ Pass: pytorch_mobile/utils/tests:mobile_model_reader_tests - test_read_module_bytecode_version (pytorch_mobile.utils.tests.test_model_reader.TestModelLoader) (7.223)
Summary
  Pass: 3
  ListingSuccess: 1
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/3940649711969390
```

Reviewed By: husthyc

Differential Revision: D24655999

fbshipit-source-id: 5095ca158d89231fb17285d445548f91ddb89bab
2020-10-30 19:04:14 -07:00
1aa57bb761 Moving coverage, xunit, pytest installation to Docker (#47082)
Summary:
Fixes a TODO. This PR moves `pip_install unittest-xml-reporting coverage pytest` to the base Ubuntu docker instead of running that installation during every test.

This should save some time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47082

Reviewed By: samestep

Differential Revision: D24634393

Pulled By: janeyx99

fbshipit-source-id: 3b980890409eafef9b006b9e03ad7f3e9017529e
2020-10-30 18:34:44 -07:00
cb4b6336ba [FX] Fix handling of attributes (#47030)
Summary:
Probably works :)

Fixes https://github.com/pytorch/pytorch/issues/46872

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47030

Reviewed By: ngimel

Differential Revision: D24652600

Pulled By: Chillee

fbshipit-source-id: 3fe7099ad02d1b5c23a7335b855d36d373603d18
2020-10-30 17:08:58 -07:00
7eb427e931 Providing more information while crashing process in async error handling (#46274)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46274

We crash the process in NCCL Async Error Handling if the collective
has been running for greater than some set timeout. This PR logs more
information about the rank and duration the collective ran before throwing an exception.
ghstack-source-id: 115614622

Test Plan:
Run desync tests and flow. Here are the Flow runs showing the right messages: f225031389
f225032004

Reviewed By: jiayisuse

Differential Revision: D24200144

fbshipit-source-id: 02d48f13352aed40a4476768c123d5cebbedc8e0
2020-10-30 16:22:51 -07:00
d1d6dc2e3c Add more specific error message (#46905)
Summary:
While using `torch.utils.data.TensorDataset`, if sizes of tensors mismatch, there's now a proper error message.
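
For illustration, a minimal sketch of the failure mode that now gets the clearer message:

```python
import torch
from torch.utils.data import TensorDataset

# The first dimensions disagree (4 vs. 3), which now raises a descriptive
# size-mismatch error.
TensorDataset(torch.zeros(4, 2), torch.zeros(3))
```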

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46905

Reviewed By: ngimel

Differential Revision: D24565712

Pulled By: mrshenli

fbshipit-source-id: 98cdf189591c2a7a1b693627cc8464e8f553d9ee
2020-10-30 16:03:44 -07:00
a81572cdc5 Add anomaly mode for C++ (#46981)
Summary:
This adds anomaly mode for C++.

The backtrace isn't perfect yet, but it's a start.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46981

Reviewed By: IvanKobzarev

Differential Revision: D24631957

Pulled By: albanD

fbshipit-source-id: 4b91e205e7e51f4cf0fbc651da5013a00a3b2497
2020-10-30 15:18:07 -07:00
c86af4aa55 Disable NEON acceleration on older compilers (#47099)
Summary:
An optimized build compiled with gcc-7.5.0 generates numerically incorrect code

Works around https://github.com/pytorch/pytorch/issues/47098

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47099

Reviewed By: walterddr

Differential Revision: D24642272

Pulled By: malfet

fbshipit-source-id: 2cfb43e950c0d1c92cfcee13749f1ad13248c39b
2020-10-30 13:33:42 -07:00
085193c291 [quant][graphmode][fx][fusion] Add test for fuse_fx (#47085)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47085

Both in train and eval mode

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24632457

fbshipit-source-id: 486aee4e073fb87e9da46a344e8dc77e848a60cf
2020-10-30 12:25:54 -07:00
1dd220bd84 Add C++ coverage for Ubuntu cpu tests (#46656)
Summary:
In order to enable C++ code coverage for tests, we need to build pytorch with the correct coverage flags. This PR should introduce a build that allows coverage tests to stem from a specific coverage build.

This PR does the following:
1. Adds a new build to `*-coverage_*` build with the correct `--coverage` flag for C++ coverage
2. Calls `lcov` at the end of testing to capture C++ coverage results
3. Pushes C++ results along with Python results
4. Shards the coverage test to not take ~4hrs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46656

Reviewed By: walterddr

Differential Revision: D24636213

Pulled By: janeyx99

fbshipit-source-id: 362a1a2a20c069ba0a7931669194dac53ac81133
2020-10-30 11:11:14 -07:00
edac4060d7 Fix mul cuda for bool (#47031)
Summary:
Also, add tests for tensor by scalar multiplication / division

Fixes https://github.com/pytorch/pytorch/issues/47007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47031

Reviewed By: walterddr

Differential Revision: D24608874

Pulled By: malfet

fbshipit-source-id: 4e15179904814d6e67228276d3d11ff1b5d15d0d
2020-10-30 10:38:32 -07:00
69fe10c127 use bitfield to shrink TensorImpl (#45263)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45263

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23900587

Pulled By: bhosmer

fbshipit-source-id: 9214b887fde010bd7c8be848ee7846329c35752f
2020-10-30 10:18:44 -07:00
99fed7bd87 faster TensorOptions merging (#45046)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45046

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23806787

Pulled By: bhosmer

fbshipit-source-id: 3c8304f9a4503658081f8805ec06da78a467e125
2020-10-30 10:18:40 -07:00
c7fc8cab3b track Half/ComplexHalf default dtype (#45043)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45043

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23806097

Pulled By: bhosmer

fbshipit-source-id: 1c816b09c1e6b3c7ba85ed43d8e6c2518a768da4
2020-10-30 10:18:38 -07:00
f05b66b70d pass TypeMeta by value (#45026)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45026

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23802943

Pulled By: bhosmer

fbshipit-source-id: 81b06ef00bf8eb4375c0e0ff2032e03bd1d1188a
2020-10-30 10:14:17 -07:00
2643800881 Fix max_pool2d with ceil_mode bug (#46558)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46558

This PR fixes a bug with how pooling output shape was computed.

## BC Breaking Notes
Previously, a bug in the pooling code allowed a sliding window to be entirely off bounds. Now, sliding windows must start inside the input or left padding (not right padding, see https://github.com/pytorch/pytorch/issues/46929) and may only go off-bounds if ceil_mode=True.
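
A sketch of the intended semantics (hedged; the shapes are illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 4, 4)
# With ceil_mode=True the final window may run off the right/bottom edge,
# but every window must still *start* inside the input or left padding.
out = F.max_pool2d(x, kernel_size=3, stride=2, ceil_mode=True)
print(out.shape)  # torch.Size([1, 1, 2, 2])
```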

fixes #45357

TODO

- [x] Ensure existing tests are checking for the correct output size

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D24633372

Pulled By: heitorschueroff

fbshipit-source-id: 55925243a53df5d6131a1983076f11cab7516d6b
2020-10-30 09:36:04 -07:00
7df0224cba Automated submodule update: FBGEMM (#47071)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 39d5addbff

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47071

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: albanD

Differential Revision: D24628407

fbshipit-source-id: 9b636f66b92e5853cafd521e704996ebc2faa954
2020-10-30 08:15:44 -07:00
67b7e751e6 add warning if DataLoader is going to create excessive number of thread (#46867)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46867

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D24545540

Pulled By: glaringlee

fbshipit-source-id: a3bef0d417e535b8ec0bb33f39cfa2308aadfff0
2020-10-30 07:54:23 -07:00
eec201c138 Add last_n_window_collector
Summary:
Add `last_n_window_collector`, which C2 supports but PyTorch currently does not: https://www.internalfb.com/intern/diffusion/FBS/browsefile/master/fbcode/caffe2/caffe2/operators/last_n_window_collector.cc?lines=139

## Problem that we are solving

This operator works on multiple pieces of data and collects the last `n` elements that have been seen.

If you have the following pieces of data that have been passed around:

```
  [1, 2, 3, 4]
  [5, 6, 7]
  [8, 9, 10, 11]
```

3 times, and the collector size is given to be 6, the expected result is:

```
  [6, 7, 8, 9, 10, 11]
```
What this means is that we essentially need a FIFO (First In, First Out) mechanism: as we pass data through the collector, new data is pushed in at the end and the oldest entries fall out.

In this particular example, in the first pass (the data is `[1, 2, 3, 4]`), we hold `[1, 2, 3, 4]` in the queue, as our queue size is 6.

In the second pass (the data is `[5, 6, 7]`), we hold `[2, 3, 4, 5, 6, 7]` in the queue; since 1 was inserted earliest, it is dropped due to the size limit of the queue.

In the third pass (the data is `[8, 9, 10, 11]`), we hold `[6, 7, 8, 9, 10, 11]` in the queue, and `2, 3, 4, 5` are dropped due to the size of the queue.

For the multi-dimensional case, when we have the following data:

```
  [[1, 2], [2, 3], [3, 4], [4, 5]]
  [[5, 6], [6, 7], [7, 8]]
  [[8, 9], [9, 10], [10, 11], [11, 12]]
```

and our queue size is 6.

In the first pass, we will have `[[1, 2], [2, 3], [3, 4], [4, 5]]`.
In the second pass, we will have `[[2, 3], [3, 4], [4, 5], [5, 6], [6, 7], [7, 8]]`.
In the third pass, we will have `[[6, 7], [7, 8], [8, 9], [9, 10], [10, 11], [11, 12]]`.

### The implementation

I am using the FIFO queue (`deque`) from Python's `collections` library, which accepts a `maxlen` argument that sets the size of the queue.

I take the last n indices of the tensor through list indexing, and the operator does not copy the data.

In the test plan, I cover both single-dimensional and multi-dimensional tensors.
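
A minimal sketch of the collector semantics described above, using the same `collections` FIFO (illustrative only; this is not the operator's actual API):

```python
from collections import deque

queue = deque(maxlen=6)  # drops the oldest entries once full
for chunk in ([1, 2, 3, 4], [5, 6, 7], [8, 9, 10, 11]):
    queue.extend(chunk)
    print(list(queue))
# Final state: [6, 7, 8, 9, 10, 11]
```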

### Benchmark
I used various configurations and added a benchmark test. The PyTorch implementation is much faster than the Caffe2 implementation:

#### CPU Benchmark
```
torch_response.median
0.00019254473969340324

caffe_response.median
0.00030233583599794657
```

#### GPU Benchmark

```
torch_response.mean
0.000081007429903838786

caffe_response.mean
0.00010279081099724863
```

Test Plan:
### For CPU:
```
buck test //caffe2/torch/fb/sparsenn:test
```

### For GPU:
- Used an on-demand machine and did the following commands:
```
jf get D24435544
buck test mode/opt  //caffe2/torch/fb/sparsenn:test
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/4222124688138052/

Reviewed By: dzhulgakov, radkris-git

Differential Revision: D24435544

fbshipit-source-id: 8193b4746b20f2a4920fd4d41271341045cdcee1
2020-10-30 02:35:54 -07:00
6c34aa720c add add_node function for partition to fix partition mem size calculation (#47083)
Summary:
Placeholders and constants in the partition are counted twice when combining two partitions. This PR fixes it by adding an add_node function to the Partition class. A unit test is also updated to check that the partition size is correct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47083

Reviewed By: gcatron

Differential Revision: D24634368

Pulled By: scottxu0730

fbshipit-source-id: ab408f29da4fbf729fd9741dcb3bdb3076dc30c4
2020-10-30 01:59:42 -07:00
f9d32c4fa8 [JIT] Add selective backend lowering API (#43613)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43613

**Summary**
This commit adds a helper/utility to facilitate the selective lowering of
specific submodules within a module hierarchy to a JIT backend. The reason
that this is needed is that lowering a submodule of a scripted
module to a backend after the module has been scripted requires
adjusting its JIT type.

**Test Plan**
This commit refactors `NestedModuleTest` in `jit/test_backends.py` to
use this new selective lowering API.

**Fixes**
This commit fixes #41432.

Test Plan: Imported from OSS

Reviewed By: mortzur

Differential Revision: D23339855

Pulled By: SplitInfinity

fbshipit-source-id: d9e69aa502febbe04fd41558c70d219729252be9
2020-10-30 00:37:33 -07:00
0dbd72935a Split comm hooks into python-dependent hooks and others (#47019)
Summary:
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **https://github.com/pytorch/pytorch/issues/47019 Split comm hooks into python-dependent hooks and others**

This is needed because we plan to move most of c10d C++ implementation into `libtorch_*.so`, which can not have Python dependencies.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47019

Reviewed By: albanD

Differential Revision: D24614129

Pulled By: gmagogsfm

fbshipit-source-id: 3f32586b932a2fe6a7b01a3800f000e66e9786bb
2020-10-30 00:30:45 -07:00
d95e1afad3 [pytorch] add script to run all codegen (#46243)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46243

Add util script to test whether any codegen output changes.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24388873

Pulled By: ljk53

fbshipit-source-id: ef9ef7fe6067df1e0c53aba725fc13b0dfd7f4c2
2020-10-29 22:55:12 -07:00
707d271493 Fix links in tools/build_variables.bzl (#47066)
Summary:
I know the `aten/src/ATen/core/CMakeLists.txt` link is now correct because that file was deleted in 061ed739c17028fe907737b52f50a495ab5b4617:
```
$ git log --full-history -- aten/src/ATen/core/CMakeLists.txt | head -n 1
commit 061ed739c17028fe907737b52f50a495ab5b4617
$ git show 061ed739c17028fe907737b52f50a495ab5b4617^ | head -n 1
commit f99a693cd9ff7a9b5fdc71357dac66b8192786d3
```
But I can't tell what the `tools/cpp_build/torch/CMakeLists.txt` link is supposed to be, because that file (indeed, its entire parent directory) doesn't seem to have ever existed:
```
$ git log --full-history -- tools/cpp_build/torch
```
(The output of the above command is empty.) So I saw that the grandparent directory was deleted in 130881f0e37cdedc0e90f6c9ed84957aee6c80ef:
```
$ git log --full-history -- tools/cpp_build | head -n 1
commit 130881f0e37cdedc0e90f6c9ed84957aee6c80ef
$ git show 130881f0e37cdedc0e90f6c9ed84957aee6c80ef^ | head -n 1
commit c6facc2aaa5a568756e971a9d2b7f2af282dff39
```
And looking at [the history of that directory](c6facc2aaa/tools/cpp_build), I see that some of the last commits touch `torch/CMakeLists.txt`, so I'm just using that here and hoping it's correct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47066

Reviewed By: albanD, seemethere

Differential Revision: D24628980

Pulled By: samestep

fbshipit-source-id: 0fe8887e323593ef1676c34d4b920aeeaebd8550
2020-10-29 18:47:46 -07:00
366888a5e2 [quant][graphmode][fx] Remove logging for standalone module api calls (#47032)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47032

These are not top-level APIs and are not supposed to be called directly by users.

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24610602

fbshipit-source-id: c5510f06b05499387d70f23508470b676aea582c
2020-10-29 18:39:43 -07:00
e3b55a8a65 [pytorch/ops] Concat fast path w/ zero tensor (#46805)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46805

The current implementation takes the slow path if there is an empty tensor in the list, which is inefficient. Use the fast path for torch.cat even if there are empty tensors. This wastes one thread block per empty tensor, but is still much better than the slow path.

Test Plan: CI + sandcastle

Reviewed By: ngimel

Differential Revision: D24524441

fbshipit-source-id: 529c8af51ecf8374621deee3a9d16cacbd214741
2020-10-29 18:14:40 -07:00
2e2dc5874b Fix lint (#47095)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47095

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D24639056

Pulled By: jamesr66a

fbshipit-source-id: e4f7842eb0438675723d1cac78e20d13b96e802c
2020-10-29 18:09:23 -07:00
f5477e3703 Enable python code coverage on windows (#44548)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43897

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44548

Reviewed By: walterddr

Differential Revision: D24582777

Pulled By: malfet

fbshipit-source-id: 9b68b72ba356fef61461fc2446c73360f67ce0b4
2020-10-29 17:30:53 -07:00
ddeacf1565 Fix median bug on discontiguous tensors (#46917)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46917

fixes https://github.com/pytorch/pytorch/issues/46814

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D24633412

Pulled By: heitorschueroff

fbshipit-source-id: 54732671b298bdc2b04b13ab3a373892ee0933c3
2020-10-29 17:12:22 -07:00
9bc8f071a3 [WIP] Move torch.fx into its own target (#46658)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46658

ghstack-source-id: 115213192

Test Plan: waitforsadcastle

Reviewed By: zdevito, vkuzo

Differential Revision: D24374723

fbshipit-source-id: 2b5708001f5df2ffb21ea5e586e26030653ccdcf
2020-10-29 17:03:08 -07:00
7190155408 [Transposed Conv]add ConvTranspose3d with FBGEMM as backend (#46608)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46608

introduce frontend API for quantized transposed convolution with only FBGEMM as backend.
ghstack-source-id: 115289210

Test Plan: https://www.internalfb.com/intern/testinfra/testconsole/testrun/7599824394184104/

Reviewed By: z-a-f

Differential Revision: D24369831

fbshipit-source-id: b8babd3ddbe0df8f4c8bc652bb745f85e0813797
2020-10-29 16:18:43 -07:00
3c643d112e Pin destination memory for cuda_tensor.to("cpu", non_blocking=True) (#46878)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39694.

[`torch.cuda._sleep(int(100 * get_cycles_per_ms()))`](https://github.com/pytorch/pytorch/pull/46878/files#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R511-R513) in the test helps avoid flakiness noted by ngimel (https://github.com/pytorch/pytorch/pull/35144#issuecomment-602103631).
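
A short sketch of the pattern this change targets (assuming a CUDA device is available):

```python
import torch

x = torch.randn(4, device="cuda")
# The CPU destination is now allocated in pinned memory, so the copy can be
# truly asynchronous with respect to the host.
y = x.to("cpu", non_blocking=True)
torch.cuda.synchronize()  # ensure the copy has finished before reading y
print(y)
```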

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46878

Reviewed By: izdeby

Differential Revision: D24550403

Pulled By: xw285cornell

fbshipit-source-id: 1ecc35ef75f9a38ab332aacdf4835955105edafc
2020-10-29 15:42:55 -07:00
e17b8dea1d fix calculation of number of elements to not overflow (#46997)
Summary:
Possibly fixes https://github.com/pytorch/pytorch/issues/46764.
In many places, computing the number of tensor elements is written as
```
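// Note (added for clarity, not part of the diff): std::accumulate deduces its
// accumulator type from the init argument, so the int literal 1 makes the
// running product an int, which overflows past INT_MAX; the fix is to pass an
// int64_t init, e.g. static_cast<int64_t>(1).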
int64_t numel = std::accumulate(oldshape.begin(), oldshape.end(), 1,
                                  std::multiplies<int64_t>());
```
This computes the product with the type of the `1` literal, which is `int`. When there are more than INT_MAX elements, the result overflows. In https://github.com/pytorch/pytorch/issues/46746, the tensor that was sent to reshape had 256^4 elements, which was computed as `0`, so the reshape was not done correctly.
I've audited the usages of std::accumulate and changed them to use int64_t as the `init` type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46997

Reviewed By: albanD

Differential Revision: D24624654

Pulled By: ngimel

fbshipit-source-id: 3d9c5e6355531a9df6b10500eec140e020aac77e
2020-10-29 15:37:16 -07:00
78de12f588 Replace -f with -x for pytest tests. (#46967)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46967

Tests under `tests/distributed/_pipeline/sync` use pytest and
specifying the `-f` option for such tests as follows: `python test/run_test.py
-i distributed/_pipeline/sync/skip/test_api -- -f` doesn't work.

The equivalent option for pytest is `-x`. To resolve this issue, I've updated
`run_test.py` to replace `-f` with `-x` for pytest tests.

More details in https://github.com/pytorch/pytorch/issues/46782

#Closes: https://github.com/pytorch/pytorch/issues/46782
ghstack-source-id: 115440558

Test Plan:
1) waitforbuildbot
2) `python test/run_test.py -i distributed/_pipeline/sync/skip/test_api -- -f`

Reviewed By: malfet

Differential Revision: D24584556

fbshipit-source-id: bd87f5b4953504e5659fe72fc8615e126e5490ff
2020-10-29 15:28:06 -07:00
a4caa3f596 [ONNX] bump CI ort to 1.5.2 rel for stability (#46595)
Summary:
Recently the ort-nightly has become unstable and causing issues with CI tests. Switching to release package for now for stability, until the situation is improved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46595

Reviewed By: houseroad

Differential Revision: D24566175

Pulled By: bzinodev

fbshipit-source-id: dcf36e976daeeb17465df88f28bc9673eebbb7b7
2020-10-29 14:51:38 -07:00
843cab3f2e Delete TypeDefault.h and TypeDerived.h codegen entirely. (#47002)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47002

There was no good reason for TypeDerived.h (CPUType.h) codegen
to exist after static dispatch was deleted, and now that we
have Math alias key TypeDefault.h header is not needed either.
Sorry to anyone who was using these out of tree.

I didn't entirely delete TypeDefault.h as it has a use in
a file that I can't conveniently compile test locally.  Will
kill it entirely in a follow up.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D24596583

Pulled By: ezyang

fbshipit-source-id: b5095d3509098ff74f836c5d0c272db0b2d226aa
2020-10-29 14:43:53 -07:00
c689b4d491 Delete TypeDefault call code generation logic in VariableType (#47000)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47000

There is a new invariant that emit_body is only ever called when
strategy is 'use_derived', which means we can delete a bunch of code.
This removes the last use of TypeXXX.h headers.

Note that this change makes sense, as the TypeDefault entries are
registered as Math entries, which means they automatically populate
Autograd (and we no longer have to register them ourselves).  Ailing
did all the hard work, this is just the payoff.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D24596584

Pulled By: ezyang

fbshipit-source-id: 6fa754b5f16e75cf2dcbf437887c0fdfda5e44b1
2020-10-29 14:43:50 -07:00
41f8641f1e Delete SchemaRegister.cpp, make flag operate on TypeDefault.cpp (#46991)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46991

This change is motivated by a problem bdhirsh observed which is
that in internal builds that include both SchemaRegister.cpp
and TypeDefault.cpp, some operators have their schemas defined
multiple times.  Instead of dumping schema registrations in
multiple files, it seems better to just toggle how many schemas
we write into TypeDefault.cpp.

ljk53 observes that technically SchemaRegister.cpp is only needed by
full-JIT frontend, and not by light interpreter (to resolve schema
lookups).  However, in practice, the registration file seems to be
unconditionally loaded.  This change will make it harder to do the
optimization where we drop schemas in the light interpreter, but you
probably want to architect this differently (similar to per-op
registrations, DON'T do any registrations in ATen, and then write out
the schema registrations in a separate library.)

I took this opportunity to also simplify the TypeDefault generation
logic by reworking things so that we only ever call with None argument
when registering.  Soon, we should be able to just split these
files up entirely.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D24593704

Pulled By: ezyang

fbshipit-source-id: f01ea22a3999493da77b6e254d188da0ce9adf2f
2020-10-29 14:43:47 -07:00
54d83296a9 Desugar missing dispatch field into singleton Math entry (#46970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46970

Now that catchall declarations are reinterpreted as registrations to
dispatch key Math, we can now simplify code generation logic by directly
generating to Math, and bypassing the logic for catchall.  This also helps
avoid bugs where we incorrectly classify some kernels as Math and others
as not, even though they get registered in the same way.

Bill of changes:
- Give Math its own unique TORCH_LIBRARY_IMPL
- Make it so NativeFunction.dispatch is always non-None.  Simplify
  downstream conditionals accordingly
- When parsing NativeFunction, fill in missing dispatch with a
  singleton Math entry (pointing to the cpp.name!)

One thing that is a little big about this change is that a lot of kernels
which previously didn't report as "math" now report as math.  I picked
a setting for these booleans that made sense to me, but I'm not sure
if e.g. XLA will handle it 100% correctly.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D24592391

Pulled By: ezyang

fbshipit-source-id: 2e3355f19f9525698864312418df08411f30a85d
2020-10-29 14:43:44 -07:00
87e86fa84c Some miscellaneous cleanup in codegen (#46940)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46940

- Remove inaccurate generated comments
- Delete some dead code
- Delete some unused headers
- Delete unnecessary SparseTypeDerived.cpp template

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D24573971

Pulled By: ezyang

fbshipit-source-id: 3de05d9cd9bada4c73f01d6cfaf51f16ada66013
2020-10-29 14:43:41 -07:00
dc6f723cb4 Delete Vulkan from code generator. (#46938)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46938

It turns out that after https://github.com/pytorch/pytorch/pull/42194
landed we no longer actually generate any registrations into this
file.  That means it's completely unnecessary.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D24573518

Pulled By: ezyang

fbshipit-source-id: b41ada9e394b780f037f5977596a36b896b5648c
2020-10-29 14:40:54 -07:00
156c08b0d9 view_as_real doesn't work for all backends since it relies on strides. (#47018)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47018

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D24607340

Pulled By: ailzhang

fbshipit-source-id: c7fd85cd636ae9aebb22321f8f1a255af81a473f
2020-10-29 14:33:19 -07:00
71c0133e23 enable PE everywhere but mobile (#47001)
Summary:
enable PE everywhere but mobile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47001

Reviewed By: eellison

Differential Revision: D24596252

Pulled By: Krovatkin

fbshipit-source-id: 3e3093a43287e1ff838cb03ec0e53c11c82c8dd2
2020-10-29 14:22:56 -07:00
377a09c8e8 reland fast TypeMeta/ScalarType conversion (#45544)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45544

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24006482

Pulled By: bhosmer

fbshipit-source-id: 5da2401ab40bbf58da27a5d969e00bcee7562ed6
2020-10-29 14:07:39 -07:00
1ea14e30f5 [ONNX] Enable NoneType inputs to export API (#45792)
Summary:
Enables the use of NoneType arguments to inputs tuple in the export API

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45792

Reviewed By: heitorschueroff

Differential Revision: D24312784

Pulled By: bzinodev

fbshipit-source-id: 1717e856b56062add371af7dc09cdd9c7b5646da
2020-10-29 13:56:52 -07:00
c556d4550c fix_combine_two_partition_size (#47053)
Summary:
Fix combine_two_partitions in Partitioner.py so that it recalculates the combined partition's used memory size after merging two partitions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47053

Reviewed By: gcatron

Differential Revision: D24624270

Pulled By: scottxu0730

fbshipit-source-id: a0e2a8486e012d02ea797d6ba36ab304d27cc93f
2020-10-29 13:40:44 -07:00
129b41226e [ONNX] Support nd mask index in opset >= 11 (#45252)
Summary:
Fixes below pattern for opset >= 11

`return tensor[tensor > 0]`

where rank of `tensor` > 1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45252

Reviewed By: VitalyFedyunin

Differential Revision: D24116945

Pulled By: bzinodev

fbshipit-source-id: 384026cded1eb831bb5469e31ece4fcfb6ae8f2a
2020-10-29 13:32:59 -07:00
1d233d7d1f [fix] torch.nn.functional.embedding -> padding_idx behavior (#46714)
Summary:
Reference https://github.com/pytorch/pytorch/issues/46585

Fix for second snippet in the mentioned issue.
```python
import torch

predefined_weights = torch.rand(10, 3)
result = torch.nn.functional.embedding(torch.LongTensor([1,2,0]), predefined_weights, padding_idx=0)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46714

Reviewed By: VitalyFedyunin

Differential Revision: D24593352

Pulled By: albanD

fbshipit-source-id: 655b69d9ec57891871e26feeda2aa0dcff73beba
2020-10-29 13:29:00 -07:00
3e499e490a Bump up NCCL to v2.8 (#46742)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46742

Use NCCL v2.8

Test Plan: waitforsandcastle

Reviewed By: mrshenli

Differential Revision: D24488800

fbshipit-source-id: d39897da1499e63ca783a81aec1ce707606423a3
2020-10-29 13:17:58 -07:00
d850b5c98c Fix DDP issue where parameters share same grad_accumulator (#46755)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46755

As reported in https://github.com/pytorch/pytorch/issues/41324, there is a bug in DDP when `find_unused_parameters=True` and 2 or more parameters share the same gradient accumulator.

In the reducer, we currently keep a mapping of grad accumulator to index and populate it with map[accumulator] = index, but this overwrites indices when the accumulator is the same. To fix this, switch the mapping values to a vector of indices to hold all such indices that share the same accumulator.
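
An illustrative Python sketch of the mapping change (the reducer itself is C++; the names here are hypothetical):

```python
from collections import defaultdict

# Params 0 and 1 share one grad accumulator, e.g. tied weights.
accumulator_of = {0: "acc_a", 1: "acc_a", 2: "acc_b"}

index_map = defaultdict(list)       # before the fix: map[acc] = index (overwrites)
for index, acc in accumulator_of.items():
    index_map[acc].append(index)    # after: keep every index per accumulator

print(dict(index_map))  # {'acc_a': [0, 1], 'acc_b': [2]}
```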
ghstack-source-id: 115453567

Test Plan: Added UT

Reviewed By: pritamdamania87

Differential Revision: D24497388

fbshipit-source-id: d32dfa9c5cd0b7a8df13c7873d5d28917b766640
2020-10-29 12:23:06 -07:00
680571533b [RFC] Decouple fast pass functions (#46469)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46469

There are some "fast_pass" function calls where the symbols in `ATen/native` are directly referenced from outside of native at the linking stage. This PR decouples one of these fast passes from native while keeping the same functionality. `scalar_to_tensor` is included through `ATen/ATen.h`, which could be referenced by any cpp file including this header.

ghstack-source-id: 114485740

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D24361863

fbshipit-source-id: 28d658688687b6cde286a6e6933ab33a4b3cf9ec
2020-10-29 12:18:50 -07:00
74d730c0b5 implement NumPy-like functionality column_stack, row_stack (#46313)
Summary:
Related https://github.com/pytorch/pytorch/issues/38349

This PR implements `column_stack` as a composite of `torch.reshape` and `torch.hstack`, and makes `row_stack` an alias of `torch.vstack`.
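
For reference, a quick sketch of the NumPy-matching behavior:

```python
import torch

a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])

print(torch.column_stack((a, b)))  # 1-D inputs become columns:
# tensor([[1, 4],
#         [2, 5],
#         [3, 6]])

print(torch.row_stack((a, b)))     # alias of torch.vstack
# tensor([[1, 2, 3],
#         [4, 5, 6]])
```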

Todo

- [x] docs
- [x] alias pattern for `row_stack`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46313

Reviewed By: ngimel

Differential Revision: D24585471

Pulled By: mruberry

fbshipit-source-id: 62fc0ffd43d051dc3ecf386a3e9c0b89086c1d1c
2020-10-29 12:14:39 -07:00
fee585b5a3 Correctly mark unannotated NamedTuple field to be inferred TensorType (#46969)
Summary:
If there is no annotation given, we want to show users that the type is inferred

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46969

Test Plan:
Added a new test case that throws an error with the expected error message

Fixes https://github.com/pytorch/pytorch/issues/46326

Reviewed By: ZolotukhinM

Differential Revision: D24614450

Pulled By: gmagogsfm

fbshipit-source-id: dec555a53bfaa9cdefd3b21b5142f5e522847504
2020-10-29 12:07:40 -07:00
1e275bc1a6 Show Flake8 errors in GitHub CI again (#46990)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46985.

Can someone comment on whether the "Run flake8" step should fail if `flake8` produces errors? This PR makes sure the errors are still shown, but [the job linked from the issue](https://github.com/pytorch/pytorch/runs/1320258832) also shows that the failure of that step seems to have caused the "Add annotations" step not to run.

Is this what we want, or should I instead revert back to the `--exit-zero` behavior (in this case by just removing the `-o pipefail` from this PR) that we had before https://github.com/pytorch/pytorch/issues/46740? And if the latter, then (how) should I modify this `flake8-py3` job to make sure it fails when `flake8` fails (assuming it didn't already do that?)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46990

Reviewed By: VitalyFedyunin

Differential Revision: D24593573

Pulled By: samestep

fbshipit-source-id: 361392846de9fadda1c87d2046cf8d26861524ca
2020-10-29 11:59:30 -07:00
6eaa324c9f Implement torch.igamma (#46183)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41637
This is the regularized lower incomplete gamma function, equivalent to SciPy's `gammainc` and TensorFlow's `igamma`.
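
A small sketch of the correspondence (assuming SciPy is available for comparison):

```python
import torch
from scipy import special

a = torch.tensor([1.0, 2.0, 4.0])
x = torch.tensor([0.5, 1.5, 3.0])

print(torch.igamma(a, x))                      # regularized lower incomplete gamma
print(special.gammainc(a.numpy(), x.numpy()))  # matching values from SciPy
```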

cc fritzo mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46183

Reviewed By: gchanan

Differential Revision: D24479126

Pulled By: mruberry

fbshipit-source-id: fdf8ea289fe4ca1b408810732192411e948fcdfe
2020-10-29 11:40:18 -07:00
dd95bf65b6 [caffe2/FC DNNLOWP] Shrink Y_int32_ vector capacity when appropriate
Summary:
The FullyConnectedDNNLowPOp::Y_int32_ vectors consume between 1GB and 2GB on one of FB's larger applications. By adding tracing I noticed that the number of elements in each instance oscillates wildly over time. As the buffer backing a vector can only be extended in a resize operation, this means there is wasted memory space. So as a simple optimization, I added code to right-size the buffer backing the vector when the number of elements is less than half the vector capacity at that point; this doesn't affect the existing elements.

There is of course a memory/cpu tradeoff here - with the change we are doing more mallocs and frees. I added tracing to measure how many times we grow or shrink per second: it's about 100 per second on average, which is not a great deal.

Test Plan:
Memory growth impact: over 24 hours and after the startup period, the memory consumed by this code grows from 0.85GB to 1.20GB vs 0.95GB to 1.75GB in the baseline. [ source: https://fburl.com/scuba/heap_profiles/wm47kpfe ]
https://pxl.cl/1pHlJ

Reviewed By: jspark1105

Differential Revision: D24592098

fbshipit-source-id: 7892b35f24e42403653a74a1a9d06cbc7ee866b9
2020-10-29 11:19:45 -07:00
38265acfbe Add Mul op for Vulkan (#47021)
Summary:
Updates mul_scalar shader to support the new Vulkan API, and adds a new op for it using the new API.

Also adds an in-place version for the op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47021

Test Plan:
Unit test included. To build & run:
```
BUILD_CUSTOM_PROTOBUF=OFF \
  BUILD_TEST=ON \
  USE_EIGEN_FOR_BLAS=OFF \
  USE_FBGEMM=OFF \
  USE_MKLDNN=OFF \
  USE_NNPACK=OFF \
  USE_NUMPY=OFF \
  USE_OBSERVERS=OFF \
  USE_PYTORCH_QNNPACK=OFF \
  USE_QNNPACK=OFF \
  USE_VULKAN=ON \
  USE_VULKAN_API=ON \
  USE_VULKAN_SHADERC_RUNTIME=ON \
  USE_VULKAN_WRAPPER=OFF \
  MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python3 setup.py develop --cmake && ./build/bin/vulkan_api_test
```

Reviewed By: AshkanAliabadi

Differential Revision: D24624729

Pulled By: SS-JIA

fbshipit-source-id: 97e76e4060307a9a24311ac51dca8812e4471249
2020-10-29 11:14:25 -07:00
2b6a720eb1 Update pybind to 2.6.0 (#46415)
Summary:
Preserve PYBIND11 configuration options in `torch._C._PYBIND11_COMPILER_TYPE` and use them when building extensions

Also, use f-strings in `torch.utils.cpp_extension`

"Fixes" https://github.com/pytorch/pytorch/issues/46367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46415

Reviewed By: VitalyFedyunin

Differential Revision: D24605949

Pulled By: malfet

fbshipit-source-id: 87340f2ed5308266a46ef8f0317316227dab9d4d
2020-10-29 10:53:47 -07:00
2249a293b7 Fix segfault with torch.orgqr. (#46700)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41768

The fault was that a NULL `tau` would get passed to LAPACK function. This PR fixes that by checking whether the `tau` contains 0 elements at the beginning of the function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46700

Reviewed By: albanD

Differential Revision: D24616427

Pulled By: mruberry

fbshipit-source-id: 92e8f1489b113c0ceeca6e54dea8b810a51a63c3
2020-10-29 10:34:39 -07:00
f629fbe235 Added torch.linalg.tensorsolve (#46142)
Summary:
This PR adds `torch.linalg.tensorsolve` function that matches `numpy.linalg.tensorsolve`.

Ref https://github.com/pytorch/pytorch/issues/42666.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46142

Reviewed By: izdeby

Differential Revision: D24539400

Pulled By: mruberry

fbshipit-source-id: 6e38364fe0bc511e739036deb274d9307df119b2
2020-10-29 10:29:28 -07:00
13b4127c95 Fix implicit conversion (#46833)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46833

Implicit integer conversions are causing compiler warnings. Since in this case the logs make it pretty clear that the `unsigned` types won't overflow despite 64-bit inputs, we fix the issue by making the downconversion explicit.

Test Plan: Standard test rig.

Reviewed By: malfet

Differential Revision: D24481377

fbshipit-source-id: 4422538286d8ed2beb65065544016fd430394ff8
2020-10-29 10:22:37 -07:00
ecdbea77bc Fix DDP documentation (#46861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46861

Noticed that in the DDP documentation:
https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html?highlight=distributeddataparallel
there were some examples with `torch.nn.DistributedDataParallel`, fix this to
read `torch.nn.parallel.DistributedDataParallel`.
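
For clarity, a minimal single-process sketch using the corrected path (gloo backend; the env values are illustrative):
```
import os
import torch.distributed as dist
import torch.nn as nn

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# torch.nn.parallel.DistributedDataParallel, not torch.nn.DistributedDataParallel
model = nn.parallel.DistributedDataParallel(nn.Linear(10, 10))
```
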
ghstack-source-id: 115453703

Test Plan: ci

Reviewed By: pritamdamania87, SciPioneer

Differential Revision: D24534486

fbshipit-source-id: 64b92dc8a55136c23313f7926251fe825a2cb7d5
2020-10-29 09:13:47 -07:00
262bd6437a Show old kernel location when there are mismatches (#46850)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46850

So far, in the error messages when kernel signatures mismatched, we showed the location where the second kernel came from,
but we didn't show the location of the first kernel. This PR now shows the location of both.
ghstack-source-id: 115468616

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D24540368

fbshipit-source-id: 3b4474062879d17f9bb7870ad3814343edc1b755
2020-10-29 08:30:49 -07:00
dfdc1dbee4 Disable softmax tests on ROCm (#46793)
Summary:
This PR disables test_softmax and test_softmax_results in test_nn.py, which were enabled in https://github.com/pytorch/pytorch/issues/46363. The softmax tests are causing failures on gfx906 machines. Disabling them until we root-cause and fix them on gfx906.

cc: jeffdaily ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46793

Reviewed By: izdeby

Differential Revision: D24539211

Pulled By: ezyang

fbshipit-source-id: 633cb9dc497ad6359af85b85a711c4549d772b2a
2020-10-29 08:05:36 -07:00
4a581ba6c2 Implement LengthsToOffsets operator in Caffe2 (#46590)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46590

This operator is very similar to LengthsToRanges but doesn't pack the offsets next to the original lengths.

Reviewed By: yf225

Differential Revision: D24419746

fbshipit-source-id: aa8b014588bb22eced324853c545f8684086c4e4
2020-10-29 07:03:34 -07:00
18d273dc0e [RFC][LocalSession] Fix workspace type
Summary: I was reading/looking into how LocalSession works and realized that the workspace type being passed around was the bound function on TaskGroup instead of the actual type. This meant that all workspaces for LocalSession would always be global, because they'd never match the private workspace type.

Test Plan: <not sure, could use some suggestions>

Reviewed By: cryptopic

Differential Revision: D24458428

fbshipit-source-id: 0f87874babe9c1ddff25b5363b443f9ca37e03c1
2020-10-29 04:12:17 -07:00
d0df29ac22 [FX] Put inf and nan in globals instead of with an import string (#47035)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47035

Chillee thought the `from math import inf, nan` string at the top of `.code` was annoying, so here's an alternative: put those values in `globals` before we `exec`.
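
A minimal sketch of the general technique (illustrative code, not FX's actual codegen):
```
import math

code = "y = x * inf"  # generated source that references `inf` with no import

g = {"inf": math.inf, "nan": math.nan, "x": 2.0}  # seed globals before exec
exec(code, g)
assert g["y"] == math.inf
```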

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Differential Revision: D24611278

Pulled By: jamesr66a

fbshipit-source-id: c25ef89e649bdd3e79fe91aea945a30fa7106961
2020-10-29 00:35:41 -07:00
cab32d9cdf [RPC Framework] Support remote device format "<workername>/<device>" (#46773)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46773

Changed the constructor of RemoteModule to accept a `remote_device` arg in the following format:
"<workername>/<device>" (e.g., "trainer0/cpu", "ps0/cuda:0")

This arg merges the original `on` and `device` args.
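
A hedged usage sketch; the import path and worker name are assumptions, and RPC must already be initialized:
```
import torch.nn as nn
from torch.distributed.nn import RemoteModule  # import path assumed

# Requires an initialized RPC framework with a worker named "trainer0".
# The single remote_device string replaces the former `on` and `device` args.
remote_linear = RemoteModule(
    remote_device="trainer0/cuda:0",  # "<workername>/<device>"
    module_cls=nn.Linear,
    args=(20, 30),
)
```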

Original PR issue: RemoteDevice Format #46554
ghstack-source-id: 115448051

Test Plan: buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule

Reviewed By: pritamdamania87

Differential Revision: D24482562

fbshipit-source-id: 5acfc73772576a4b674df27625bf560b8f8e67c1
2020-10-29 00:14:56 -07:00
b553c06abb Throw an exception in the constructor of torchscript serialization to avoid double-exception (#44266)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44266

If PyTorchStreamWriter is writing to a file in a non-existent path, it throws an exception. During unwinding, the destructor calls writeEndOfFile() and throws again. To avoid this double exception, a check-and-throw is added in the constructor; in that case the destructor will not be called and the exception can propagate through the unwinding.

Test Plan: python test/test_jit.py TestSaveLoad.test_save_nonexit_file

Reviewed By: dreiss

Differential Revision: D23560770

Pulled By: iseeyuan

fbshipit-source-id: 51b24403500bdab3578c7fd5e017780467a5d06a
2020-10-28 22:41:19 -07:00
9c1a41b724 [RFC] Add OperatorHandle overload to the RecordFunction::before() method (#46401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46401

Broader context about selective/custom build available at https://fb.quip.com/2oEzAR5MKqbD and https://fb.workplace.com/groups/pytorch.mobile.team/permalink/735794523641956/

Basically, we want to be able to trace full operator names (with overload name). The current observer infra picks up the operator name from the schema, which doesn't seem to include the overload name. To ensure consistency with the existing uses and to accommodate the new use-case, this diff adds a new overload to accept an `OperatorHandle` object, and the code in `before()` eagerly resolves it to an `OperatorName` object (which can be cached in a member variable) as well as a string (view) operator-name which has the same semantics as before.

Why do we pass in an `OperatorHandle` but then resolve it to an `OperatorName`? This might come across as a strange design choice (and it is), but it is grounded in practicality.

It is not reasonable to cache an `OperatorHandle` object but caching an `OperatorName` object is reasonable since it holds all the data itself.

An initial version of this change was trying to test this change in the `xplat` repo, which didn't work. Thanks to ilia-cher for pointing out that the dispatcher observing mechanism is disabled under a compile time flag (macro) for xplat.
ghstack-source-id: 114360747

Test Plan:
`buck test fbcode/caffe2/fb/test:record_function_test` succeeds. Also replicated this test in OSS in the file `test_misc.cpp` where the rest of the `RecordFunction` subsystem is being tested.

Ran the benchmark as requested by ilia-cher

{P146511280}

Reviewed By: ilia-cher

Differential Revision: D24315241

fbshipit-source-id: 239f3081e6aa2e26c3021a7dd61f328b723b03d9
2020-10-28 22:38:26 -07:00
604e1b301a Fix negative column numbers for the torch.eye (#46841)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46757

Error out on negative column numbers and add corresponding tests in `test/test_tensor_creation_ops.py`.
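
A quick illustration of the new behavior (error type assumed to be RuntimeError):
```
import torch

torch.eye(3, 4)       # ok: a 3 x 4 matrix with ones on the diagonal
try:
    torch.eye(3, -1)  # a negative column count is now rejected up front
except RuntimeError as e:
    print("rejected:", e)
```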

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46841

Reviewed By: VitalyFedyunin

Differential Revision: D24593839

Pulled By: ngimel

fbshipit-source-id: b8988207911453de7811cf3ceb43747192cd689d
2020-10-28 22:29:25 -07:00
5c8aad1141 [numpy] torch.cos, torch.tan : promote integer inputs to float (#46706)
Summary:
References https://github.com/pytorch/pytorch/issues/42515
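
With this change, integer inputs now promote instead of erroring, matching NumPy. A quick illustration (assuming the default float dtype is float32):
```
import torch

t = torch.arange(3)  # int64 input
assert torch.cos(t).dtype == torch.float32
assert torch.tan(t).dtype == torch.float32
```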

cc: mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46706

Reviewed By: izdeby

Differential Revision: D24537262

Pulled By: mruberry

fbshipit-source-id: e57377a625814a3f34a765ce6bfd63a33c02a5d9
2020-10-28 22:02:52 -07:00
42a51148c1 Use f-strings in torch.utils.cpp_extension (#47025)
Summary:
Plus two minor fixes to `torch/csrc/Module.cpp`:
 - Use iterator of type `Py_ssize_t` for array indexing in `THPModule_initNames`
 - Fix clang-tidy warning of unneeded defaultGenerator copy by capturing it as `const auto&`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47025

Reviewed By: samestep

Differential Revision: D24605907

Pulled By: malfet

fbshipit-source-id: c276567d320758fa8b6f4bd64ff46d2ea5d40eff
2020-10-28 21:32:33 -07:00
9d23fd5c00 [pytorch] get rid of cpp_type_str from pybind codegen (#46977)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46977

Clean up a few TODOs in the new python binding codegen.
Get rid of the _simple_type() hack and the uses of cpp_type_str.
Now python argument type strings and PythonArgParser unpacking methods
are directly generated from the original Type model.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24589209

Pulled By: ljk53

fbshipit-source-id: b2a6c3911d58eae49c031d319c8ea6f804e2cfde
2020-10-28 21:25:55 -07:00
79474a1928 [pytorch] simplify tensor options logic in pybinding codegen (#46976)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46976

Technically, it's not semantics-preserving, e.g. emission of
'requires_grad' is no longer gated by 'has_tensor_return' - there is no
guarantee that is_like_or_new_function always has a tensor return.
But the output is identical, so there might be some invariant - we could
also add an assertion to fail loudly when it's broken.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24589211

Pulled By: ljk53

fbshipit-source-id: 47c7e43b080e4e67a526fde1a8a53aae99df4432
2020-10-28 21:22:59 -07:00
a86b3438eb add support for different memory sizes on size_based_partition (#46919)
Summary:
WIP: add support for different memory sizes in size_based_partition, so that it can support different logical devices with different memory sizes. Compared to the original size_based_partition, the new one also supports partition-to-logical-device mapping. Multiple partitions can be mapped onto one device if its memory size allows. A unit test, test_different_size_partition, is also added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46919

Reviewed By: gcatron, VitalyFedyunin

Differential Revision: D24603511

Pulled By: scottxu0730

fbshipit-source-id: 1ba37338ae054ad846b425fbb7e631d3b6c500b6
2020-10-28 21:11:41 -07:00
c2a3951352 [quant][graphmode][fx] Remove inplace option for convert_fx (#46955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46955

Initially we were thinking of adding an `invalidate_quantized_float_parameters` option to free the memory
of the quantized modules' floating-point parameters, but it turns out we will do module swap just like in
eager mode for the modules that are quantized, so the old floating-point module will not be referenced after
quantization. Therefore this feature is only needed for functionals, and since most people use quantization
with modules, we may not need it.

We'll revisit if we find there is a need for this.

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D24579400

fbshipit-source-id: fbb0e567405dc0604a2089fc001573affdade986
2020-10-28 21:07:19 -07:00
ad260ae7fd Disable test_joing_running_workers for TSAN. (#46966)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46966

These tests had false positives in TSAN for modifying thread local
variables:

```
WARNING: ThreadSanitizer: data race (pid=5364)
  Write of size 8 at 0x7b2c0004ff70 by thread T2:
    #0 free <null> (libtools_build_sanitizers_tsan-py.so+0xde6ad)
    #1 __GI__dl_deallocate_tls

  Previous write of size 1 at 0x7b2c0004ff71 by thread T3:
    #0 at::GradMode::set_enabled(bool) caffe2/aten/src/ATen/core/grad_mode.cpp:20 (libcaffe2_ATen-core.so+0x40e013)
    #1 torch::autograd::set_grad_enabled(_object*, _object*) caffe2/torch/csrc/autograd/init.cpp:143 (libcaffe2__C_impl_cuda.so+0x115ef0e)
    #2 _PyMethodDef_RawFastCallKeywords

  Thread T3 (tid=5385, finished) created by main thread at:
    #0 pthread_create <null> (libtools_build_sanitizers_tsan-py.so+0xc5a86)
    #1 PyThread_start_new_thread
```
ghstack-source-id: 115330433

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D24584411

fbshipit-source-id: e35f704dfcb7b161a13a4902beaf8b1e41ccd596
2020-10-28 19:28:04 -07:00
9fefb40628 Fix signed-to-unsigned conversion warning (#46834)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46834

`next` is set to `-1`, presumably to avoid an "undefined variable" warning. However, setting `next=-1` gives a signed-to-unsigned warning; in practice, the `-1` wraps around to `size_t::max`. So we set it to `size_t::max` from the get-go to avoid all warnings.

Test Plan: Standard pre-commit test rig.

Reviewed By: xw285cornell

Differential Revision: D24481068

fbshipit-source-id: 58b8a1b027a129fc4994c8593838a82b3991be22
2020-10-28 18:17:23 -07:00
c7183c9878 Fix object-based collectives API to use torch.cuda.current_device instead of (#46897)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46897

These APIs implicitly assumed that gpu for rank == rank index, but
that is not necessarily true. For example, the first GPU could be used for a
different purpose and rank 0 could use GPU 1, rank 1 uses GPU 2, etc. Thus, we
mandate that the user specify the device to use via `torch.cuda.set_device()`
before making calls to this API. This expectation should be okay since we
clearly document it, and we expect the user to set this for
DistributedDataParallel as well.
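
A hedged per-rank sketch of the expected call pattern; assumes the process group is already initialized, and uses `all_gather_object` as a stand-in for the object-based collectives:
```
import torch
import torch.distributed as dist

# Each rank pins its own device explicitly; it need not equal the rank index.
my_device_index = 1  # illustrative: rank 0 might deliberately use GPU 1
torch.cuda.set_device(my_device_index)  # must be set before the call below

outputs = [None] * dist.get_world_size()
dist.all_gather_object(outputs, {"rank": dist.get_rank()})
```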

Also adds/tidies up some documentation.
ghstack-source-id: 115359633

Test Plan: Modified unittests

Reviewed By: divchenko

Differential Revision: D24556177

fbshipit-source-id: 7e826007241eba0fde3019180066ed56faf3c0ca
2020-10-28 18:12:50 -07:00
dc8176356e Various cleanups to ir_emitter and friends (#46686)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46686

I was trying to page this code back in after a while and some things
stuck out as unnecessarily confusing.

1. Improve documentation of closures and fork stuff to be more accurate
to how we use them today.
2. Change `prim::LocalVariableScope` to `prim::ListComprehension`. It is
only ever used for a list comprehensions, and in general the nodes
emitted by `ir_emitter` should correspond to concrete operations or
language features rather than semantic constraints.
3. Change the somewhat mysterious "inputs" and "attributes" argument
names throughout the codebase to be the more obvious "args" and "kwargs"
that they generally represent (I think "inputs" and "attributes" come
from the AST naming).

Test Plan: Imported from OSS

Reviewed By: navahgar, jamesr66a

Differential Revision: D24464197

Pulled By: suo

fbshipit-source-id: 1f4b1475b58b5690a0b204e705caceff969533b4
2020-10-28 16:28:05 -07:00
fc2bd991cc [quant] Fix flaky test test_histogram_observer_against_reference (#46957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46957

Possibly due to the use of a large tensor in hypothesis. Reducing the size to see if it helps.

Test Plan:
python test/test_quantization.py TestRecordHistogramObserver.test_histogram_observer_against_reference

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D24580137

fbshipit-source-id: f44ab059796fba97cccb12353c13803bf49214a1
2020-10-28 15:48:49 -07:00
cd26d027b3 [doc] Fix info on the shape of pivots in torch.lu + more info on what and how they encode permutations. (#46844)
Summary:
As per title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46844

Reviewed By: VitalyFedyunin

Differential Revision: D24595538

Pulled By: ezyang

fbshipit-source-id: 1bb9c0310170124c3b6e33bd26ce38c22b36e926
2020-10-28 14:56:31 -07:00
058f43fc51 Fix torch.version.debug generation (#47006)
Summary:
argparse's `type=bool` returns True for any non-empty string passed as input.

Use `distutils.util.strtobool`, which returns 0 for input values like "0", "no", "n", "f", "false" and 1 for "1", "yes", "y", "t", "true".
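
For illustration, the difference on string inputs:
```
from distutils.util import strtobool

assert bool("0") is True   # any non-empty string is truthy
assert strtobool("0") == 0
assert strtobool("false") == 0
assert strtobool("y") == 1
```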

Fixes https://github.com/pytorch/pytorch/issues/46973 and https://github.com/pytorch/pytorch/issues/47003

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47006

Reviewed By: samestep

Differential Revision: D24598193

Pulled By: malfet

fbshipit-source-id: e8f6688d6883011f301b49a0f03c452c611f7001
2020-10-28 12:48:30 -07:00
14d87ec5a3 Add Vulkan op Add. (#44017)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44017

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23820826

Pulled By: AshkanAliabadi

fbshipit-source-id: 47db435894696f4eb4277370d4d317d2df9e3b98
2020-10-28 12:12:56 -07:00
ec600bc391 Add Vulkan tensor copy. (#46481)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46481

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D24379143

Pulled By: AshkanAliabadi

fbshipit-source-id: cf492c5c632bf193c8aff0169d17bbf962e019e1
2020-10-28 12:09:53 -07:00
bf08814b73 [FX] Kill functional transforms name (#47004)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47004

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D24597581

Pulled By: jamesr66a

fbshipit-source-id: 9213d58f4a53ea55e97e6ca0572fdcf5e271bdc3
2020-10-28 11:59:28 -07:00
23bce17baa Add inputsSize to Python IR, like outputsSize (#46779)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46779

Test Plan: Used it in some notebooks.

Reviewed By: suo

Differential Revision: D24574005

Pulled By: dreiss

fbshipit-source-id: 78ba7a2bdb859fef5633212b73c7a3eb2cfbc380
2020-10-28 11:35:39 -07:00
179d2b288c Fix interval midpoint calculation in vulkan (#46839)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46839

Interval midpoint calculations can overflow (integers); the classic example is `(lo + hi) / 2`, whose overflow-safe form is `lo + (hi - lo) / 2`. This fixes such an instance.

Test Plan: Standard test rig.

Reviewed By: drdarshan

Differential Revision: D24392545

fbshipit-source-id: 84c81802165bb8084e2d54c9f3755f39143a5b00
2020-10-28 11:22:42 -07:00
98b3da8b13 Revert D24452660: [pytorch][PR] Add CUDA 11.1 CI
Test Plan: revert-hammer

Differential Revision:
D24452660 (1479ed91be)

Original commit changeset: 3480a2533214

fbshipit-source-id: 1e720c5d6fe1a377f6decd3ecc4f412c53fb293c
2020-10-28 10:53:53 -07:00
61ee0242c0 Fix backcompat in master following revert (#46984)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46984

Reviewed By: mrshenli

Differential Revision: D24592404

Pulled By: albanD

fbshipit-source-id: d317d934b650f1ac0f91e51ef5cbc14e886aa3fe
2020-10-28 10:32:14 -07:00
069232a574 [FX] Fix corner case in name sanitization (#46958)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46958

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Differential Revision: D24580474

Pulled By: jamesr66a

fbshipit-source-id: 2f8d252998c72e1e79d6a5f7766c2d51a271cc83
2020-10-28 10:22:33 -07:00
cbf90dafe1 Fix CPUCaching allocator guard bug (#46922)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46922

An earlier bug caused the wrong previous value to be captured and saved.

Test Plan: cpu_caching_allocator_test

Reviewed By: dreiss

Differential Revision: D24566514

fbshipit-source-id: 734a4c1f810bbec16fe007f31fffa360898955ac
2020-10-28 10:06:22 -07:00
c3fc17b48e Fix bit math (#46837)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46837

Formerly, `static_cast<StreamId>(bits)` and `static_cast<DeviceIndex>(bits)` were and-ed against `ull` values, resulting in an integer promotion that later raised a warning when downcasting to `Stream` and `Device`.

Moving the `&` operation inside the cast results in two `uint64_t` values being operated on and then cast to the correct type, eliminating the warning.

Test Plan: Standard pre-commit test rig.

Reviewed By: malfet

Differential Revision: D24481292

fbshipit-source-id: a8bcbde631054c26ca8c98fbed275254dd359dd0
2020-10-28 09:55:52 -07:00
c9222b7471 Implement clip_ranges operator for PyTorch
Test Plan:
unit test for correctness
```
buck test caffe2/torch/fb/sparsenn:test -- test_clip_ranges
Parsing buck files: finished in 1.6 sec
Creating action graph: finished in 18.9 sec
Building: finished in 15.0 sec (100%) 9442/9442 jobs, 1 updated
  Total time: 35.6 sec
More details at https://www.internalfb.com/intern/buck/build/66fb17de-859e-4d01-89bf-5c5de2950693
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 80f5e0c2-7db2-48a4-b148-25dd34651682
Trace available for this run at /tmp/tpx-20201026-123217.050766/trace.log
Started reporting to test run: https://our.intern.facebook.com/intern/testinfra/testrun/4503599665041422
    ✓ ListingSuccess: caffe2/torch/fb/sparsenn:test - main (14.912)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_clip_ranges (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (14.098)
Summary
  Pass: 1
  ListingSuccess: 1
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/4503599665041422
```

new  benchmark perf test
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH6_M1_N2_MAX_LENGTH1_dtypetorch.int32_cpu
# Input: LENGTH: 6, M: 1, N: 2, MAX_LENGTH: 1, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 155.765

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH7_M1_N2_MAX_LENGTH2_dtypetorch.int32_cpu
# Input: LENGTH: 7, M: 1, N: 2, MAX_LENGTH: 2, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 156.248

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH8_M1_N2_MAX_LENGTH3_dtypetorch.int32_cpu
# Input: LENGTH: 8, M: 1, N: 2, MAX_LENGTH: 3, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 156.634

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH9_M1_N2_MAX_LENGTH4_dtypetorch.int32_cpu
# Input: LENGTH: 9, M: 1, N: 2, MAX_LENGTH: 4, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 155.408

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH10_M1_N2_MAX_LENGTH5_dtypetorch.int32_cpu
# Input: LENGTH: 10, M: 1, N: 2, MAX_LENGTH: 5, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 165.168
```

Compared with the old implementation, there is a gain of **around 300us**
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH6_M1_N2_MAX_LENGTH1_dtypetorch.int32_cpu
# Input: LENGTH: 6, M: 1, N: 2, MAX_LENGTH: 1, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 443.012

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH7_M1_N2_MAX_LENGTH2_dtypetorch.int32_cpu
# Input: LENGTH: 7, M: 1, N: 2, MAX_LENGTH: 2, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 446.480

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH8_M1_N2_MAX_LENGTH3_dtypetorch.int32_cpu
# Input: LENGTH: 8, M: 1, N: 2, MAX_LENGTH: 3, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 444.064

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH9_M1_N2_MAX_LENGTH4_dtypetorch.int32_cpu
# Input: LENGTH: 9, M: 1, N: 2, MAX_LENGTH: 4, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 445.511

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH10_M1_N2_MAX_LENGTH5_dtypetorch.int32_cpu
# Input: LENGTH: 10, M: 1, N: 2, MAX_LENGTH: 5, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 450.468
```

Reviewed By: MarcioPorto

Differential Revision: D24546110

fbshipit-source-id: e6c9b38e911f177f97961ede5bf375107f240363
2020-10-28 09:46:37 -07:00
c6858fd71a Set up benchmarks for ClipRanges operator for Caffe2 and PyTorch
Summary: As title, adding the benchmark tests for ClipRanges operators.

Test Plan:
benchmark test for Caffe2
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: clip_ranges
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1026 12:30:33.938997 2658759 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: clip_ranges_LENGTH6_M1_N2_MAX_LENGTH1_dtypeint32
# Input: LENGTH: 6, M: 1, N: 2, MAX_LENGTH: 1, dtype: int32
Forward Execution Time (us) : 5.805

# Benchmarking Caffe2: clip_ranges
# Name: clip_ranges_LENGTH7_M1_N2_MAX_LENGTH2_dtypeint32
# Input: LENGTH: 7, M: 1, N: 2, MAX_LENGTH: 2, dtype: int32
Forward Execution Time (us) : 5.913

# Benchmarking Caffe2: clip_ranges
# Name: clip_ranges_LENGTH8_M1_N2_MAX_LENGTH3_dtypeint32
# Input: LENGTH: 8, M: 1, N: 2, MAX_LENGTH: 3, dtype: int32
Forward Execution Time (us) : 5.941

# Benchmarking Caffe2: clip_ranges
# Name: clip_ranges_LENGTH9_M1_N2_MAX_LENGTH4_dtypeint32
# Input: LENGTH: 9, M: 1, N: 2, MAX_LENGTH: 4, dtype: int32
Forward Execution Time (us) : 5.868

# Benchmarking Caffe2: clip_ranges
# Name: clip_ranges_LENGTH10_M1_N2_MAX_LENGTH5_dtypeint32
# Input: LENGTH: 10, M: 1, N: 2, MAX_LENGTH: 5, dtype: int32
Forward Execution Time (us) : 6.408
```

benchmark test for PyTorch
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH6_M1_N2_MAX_LENGTH1_dtypetorch.int32_cpu
# Input: LENGTH: 6, M: 1, N: 2, MAX_LENGTH: 1, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 443.012

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH7_M1_N2_MAX_LENGTH2_dtypetorch.int32_cpu
# Input: LENGTH: 7, M: 1, N: 2, MAX_LENGTH: 2, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 446.480

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH8_M1_N2_MAX_LENGTH3_dtypetorch.int32_cpu
# Input: LENGTH: 8, M: 1, N: 2, MAX_LENGTH: 3, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 444.064

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH9_M1_N2_MAX_LENGTH4_dtypetorch.int32_cpu
# Input: LENGTH: 9, M: 1, N: 2, MAX_LENGTH: 4, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 445.511

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH10_M1_N2_MAX_LENGTH5_dtypetorch.int32_cpu
# Input: LENGTH: 10, M: 1, N: 2, MAX_LENGTH: 5, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 450.468
```

Reviewed By: MarcioPorto

Differential Revision: D24500468

fbshipit-source-id: a582090a3982005af272cb10cdd257b2b2e787c4
2020-10-28 09:42:10 -07:00
b75b961934 Fix requires_grad arg for new_full, new_empty, new_zeros (#46486)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/36455

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46486

Reviewed By: gchanan

Differential Revision: D24497034

Pulled By: ezyang

fbshipit-source-id: 769a7f00f9a8f7cb77273a1193173a837ae7e32f
2020-10-28 09:34:53 -07:00
353e7f940f Ensure kernel launches are checked (#46474)
Summary:
Caffe2 and Torch currently do not have a consistent mechanism for determining whether a kernel has launched successfully. The result is difficult-to-detect or silent errors. This diff provides functionality to fix that. Subsequent diffs on the stack fix the identified issues.

Kernel launch errors may arise if launch parameters (number of blocks, number of threads, shared memory, or stream id) are invalid for the hardware, or for other reasons. Interestingly, unless these launch errors are specifically checked for, CUDA will silently fail and return garbage answers, which can affect downstream computation. Therefore, catching launch errors is important.

Launches are currently checked by placing
```
AT_CUDA_CHECK(cudaGetLastError());
```
somewhere below the kernel launch. This is bad for two reasons.
1. The check may be performed at a site distant to the kernel launch, making debugging difficult.
2. The separation of the launch from the check means that it is difficult for humans and static analyzers to determine whether the check has taken place.

This diff defines a macro:
```
#define TORCH_CUDA_KERNEL_LAUNCH_CHECK() AT_CUDA_CHECK(cudaGetLastError())
```
which clearly indicates the check.

This diff also introduces a new test which analyzes code to identify kernel launches and determines whether the line immediately following the launch contains `TORCH_CUDA_KERNEL_LAUNCH_CHECK();`.

A search of the Caffe2 codebase identifies 104 instances of `AT_CUDA_CHECK(cudaGetLastError());` while the foregoing test identifies 1,467 launches which are not paired with a check. Visual inspection indicates that few of these are false positives, highlighting the need for some sort of static analysis system.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46474

Test Plan:
The new test is run with:
```
buck test //caffe2/test:kernel_launch_checks -- --print-passing-details
```
And should be launched automatically with the other land tests. (TODO: Is it?)

The test is currently set up only to provide warnings but can later be adjusted to require checks.

Otherwise, I rely on the existing test frameworks to ensure that changes resulting from reorganizing existing launch checks don't cause regressions.

Reviewed By: ngimel

Differential Revision: D24309971

Pulled By: r-barnes

fbshipit-source-id: 0dc97984a408138ad06ff2bca86ad17ef2fdf0b6
2020-10-28 09:27:48 -07:00
50c9581de1 AT_ERROR if mmap allocation has failed (#46934)
Summary:
All other system-call failures in the `THMapAllocator` constructor are treated as errors, but this one, for some reason, was not.

Fixes https://github.com/pytorch/pytorch/issues/46651

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46934

Reviewed By: walterddr, seemethere

Differential Revision: D24572657

Pulled By: malfet

fbshipit-source-id: 0a2b6ce78d5484190536bc4949fc4697d6387ab8
2020-10-28 09:06:48 -07:00
c886c7f6dd fix: Fixed typing of bool in _ConvNd (#46828)
Summary:
Hello there 👋

I do believe there is a typo in the typing of the `bool` argument of the `_ConvNd` constructor.
The annotation on the attribute is correct, but the constructor argument, while annotated the same way, is not the value that will be assigned to `self.bias`.

This PR simply corrects that.

Any feedback is welcome!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46828

Reviewed By: izdeby

Differential Revision: D24550435

Pulled By: ezyang

fbshipit-source-id: ab10f1a5b29a912cb23fc321a51e78b04a8391e3
2020-10-28 08:08:53 -07:00
cd8ed93287 [quant][graphmode][fx][api] Remove inplace option from prepare_fx (#46954)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46954

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D24579401

fbshipit-source-id: adce623ce819fa220f7bb08d1ff3beaa69850621
2020-10-28 08:00:12 -07:00
46b252b83a Revert D24262885: [pytorch][PR] Added foreach_zero_ API
Test Plan: revert-hammer

Differential Revision:
D24262885 (8e37dcb1f3)

Original commit changeset: 144c283dd009

fbshipit-source-id: 451b202e23bc1fcb11b20d26c11d9a1329789d22
2020-10-28 06:48:59 -07:00
ddbdbce623 [jit] Prevent caching of graph attribute. (#46960)
Summary:
`graph` is automatically cached even when the underlying graph changes -- this PR hardcodes a fix to that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46960

Reviewed By: mrshenli

Differential Revision: D24582185

Pulled By: bwasti

fbshipit-source-id: 16aeeba251830886c92751dd5c9bda8699d62803
2020-10-27 23:56:52 -07:00
d92bf921db [quant][graphmode][fx] Remove inplace option for fuse_fx (#46953)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46953

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D24579402

fbshipit-source-id: 5e0b8abf682287ab3c7dd54c2fc2cf309295e147
2020-10-27 22:34:11 -07:00
e299393fd5 [Gradient Compression] Provide 2 default C++ comm hooks (#46701)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46701

Provide 2 built-in implementations of C++ comm hook.

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115319061

Test Plan: waitforbuildbot

Reviewed By: pritamdamania87

Differential Revision: D24382504

fbshipit-source-id: 1c1ef56620f91ab37a1707c5589f1d0eb4455bb3
2020-10-27 21:43:15 -07:00
e077a2a238 [Gradient Compression] Add CppCommHook subclass for supporting the C++ API of communication hook. (#46566)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46566

Only provides an interface. Some built-in implementations will be provided in a follow-up commit.

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115319038

Test Plan: waitforbuildbot

Reviewed By: pritamdamania87

Differential Revision: D24379460

fbshipit-source-id: 8382dc4185c7c01d0ac5b3498e1bead785bccec5
2020-10-27 21:43:12 -07:00
998b9b9e68 [quant][graphmode][fx] custom_module support static/dynamic/weight_only quant (#46786)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46786

Previously we only supported static quant; this PR adds support for the other types of quantization.

Note that QAT is actually orthogonal to these quant types; this refers to the convert step where we
convert the observed module to a quantized module.

For QAT, the user will provide a CustomModule -> FakeQuantizedCustomModule mapping in prepare_custom_config_dict
and a FakeQuantizedCustomModule -> static/dynamic/weight_only quantized CustomModule mapping in convert_custom_config_dict.

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D24514701

fbshipit-source-id: 2918be422dd76093d67a6df560aaaf949b7f338c
2020-10-27 21:41:33 -07:00
5a8198eb3c [quant][graphmode][fx][fix] scalar as first input for add/mul (#46751)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46751

Currently we assume the first input to add/mul is a Node (Tensor), but that might not be the case.

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_quantized_add
python test/test_quantization.py TestQuantizeFxOps.test_quantized_mul
python test/test_quantization.py TestQuantizeFxOps.test_quantized_add_relu
python test/test_quantization.py TestQuantizeFxOps.test_quantized_mul_relu

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D24494456

fbshipit-source-id: ef5e23ba60eb22a57771791f4934306b25c27c01
2020-10-27 19:59:28 -07:00
810c68fb1d [OpBench] fix jit tracing with quantized op/tensor by enabling _compare_tensors_internal to compare quantized tensors (#46772)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46772

When running `buck run caffe2/benchmarks/operator_benchmark/pt:qactivation_test -- --use_jit`, I encountered the following error P146518683. The error was traced down to the fact that `torch.allclose` does not work with quantized tensors (the error was triggered by this particular multiplication https://fburl.com/diffusion/8vw647o6, since native mul cannot work with a float scalar and a quantized tensor).

Minimal example to reproduce:
```
(Pdb) input = torch.ones(5)
(Pdb) aa = torch.quantize_per_tensor(input, scale=1.0, zero_point=0, dtype=torch.quint8)
(Pdb) bb = torch.quantize_per_tensor(input, scale=1.0, zero_point=0, dtype=torch.quint8)
(Pdb) torch.allclose(aa, bb)
Comparison exception: 	promoteTypes with quantized numbers is not handled yet; figure out what the correct rules should be, offending types: QUInt8 Float
```

Here the proposed fix is to compare quantized tensors strictly within `_compare_tensors_internal`.

The other two possible fixes are:
1. convert quantized tensors to float tensors first before sending them to `torch.allclose` (sketched after this list)
2. change `torch.allclose` to handle quantized tensor.
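
For completeness, a minimal sketch of alternative fix 1 (dequantize, then compare on the float path):
```
import torch

x = torch.ones(5)
aa = torch.quantize_per_tensor(x, scale=1.0, zero_point=0, dtype=torch.quint8)
bb = torch.quantize_per_tensor(x, scale=1.0, zero_point=0, dtype=torch.quint8)
# Dequantize first, then compare along the regular float path.
assert torch.allclose(aa.dequantize(), bb.dequantize())
```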

Test Plan: buck run caffe2/benchmarks/operator_benchmark/pt:qactivation_test -- --use_jit

Reviewed By: kimishpatel

Differential Revision: D24506723

fbshipit-source-id: 6426ea2a88854b4fb89abef0edd2b49921283796
2020-10-27 18:53:13 -07:00
8e37dcb1f3 Added foreach_zero_ API (#46215)
Summary:
Added a foreach_zero_(TensorList) API.

Tested via unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46215

Reviewed By: zhangguanheng66

Differential Revision: D24262885

Pulled By: izdeby

fbshipit-source-id: 144c283dd00924083096d6d92eb9085cbd6097d3
2020-10-27 18:03:34 -07:00
67c1dc65a3 [FX] Fix handling of inf and nan literals (#46894)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46894

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D24555136

Pulled By: jamesr66a

fbshipit-source-id: 22765a4d9d373711e9e6d7b1d3898080ecbcf2f5
2020-10-27 17:55:35 -07:00
53839ac9d7 Fix internal assert for torch.heaviside with cuda tensor and cpu scalar tensor (#46831)
Summary:
Fixed https://github.com/pytorch/pytorch/issues/46681

```
>>> x = torch.randn(10, device='cuda')
>>> y = torch.tensor(1.)
>>> torch.heaviside(x, y)
tensor([0., 1., 0., 1., 1., 0., 1., 1., 1., 0.], device='cuda:0')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46831

Reviewed By: navahgar

Differential Revision: D24567953

Pulled By: izdeby

fbshipit-source-id: e5fcf4355b27ce0bdf434963d01863d3b24d0bea
2020-10-27 16:47:33 -07:00
115bbf9945 [caffe2] Disable running full grad check in tests by default
Summary:
We've been seeing a lot of Hypothesis timeouts and from profiling a few of the failing tests one of the contributing factors is really slow grad checker. In short, it launches the whole op for each of the input elements so the overall complexity is O(numel^2) at least.

This applies a very unscientific hack to just run grad check on the first and last few elements. It's not ideal, but it's better than flaky tests. One can still explicitly opt in with the env var.

Reviewed By: malfet

Differential Revision: D23336220

fbshipit-source-id: f04d8d43c6aa1590c2f3e72fc7ccc6aa674e49d2
2020-10-27 16:10:03 -07:00
8066e89f64 quant: fix bug with copy.deepcopy of FX prepared quantization models (#46895)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46895

Bug: models after the FX graph mode quant prepare step lost information,
such as the extra attributes defined in `Quantizer.save_state`,
if the user performed `copy.deepcopy` on them.  The information was lost
because `GraphModule` does not copy attributes which are not present on
`nn.Module` by default.

Fix: define a custom `__deepcopy__` method on observed models and
whitelist the attributes we care about.

This is needed because users sometimes run `copy.deepcopy` on their
models during non-quantization related preparations, and we should make
sure that quantization related state survives these calls.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx.test_deepcopy
python test/test_quantization.py TestQuantizeFx.test_standalone_module
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D24556035

fbshipit-source-id: f7a6b28b6d2225fa6189016f967f175f6733b124
2020-10-27 16:05:35 -07:00
1479ed91be Add CUDA 11.1 CI (#46616)
Summary:
libtorch XImportant now runs on CUDA 11.1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46616

Reviewed By: gchanan

Differential Revision: D24452660

Pulled By: malfet

fbshipit-source-id: 3480a2533214f2d986444ff912f619503a75940d
2020-10-27 15:58:13 -07:00
c20c840c1b Install sccache from source (#46672)
Summary:
Build `sccache` from https://github.com/pytorch/sccache

Also, update sccache wrappers not to call sccache from sccache

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46672

Reviewed By: janeyx99

Differential Revision: D24455767

Pulled By: malfet

fbshipit-source-id: b475a65e6ad03b9a192ab29a6d9a14280cd76a92
2020-10-27 15:23:23 -07:00
64d4b24a12 Adding link to gcov depending on GCC_VERSION (#46928)
Summary:
We already link g++ and gcc to the correct version, but we do not do that for gcov, which is needed for coverage.

This PR adds a link for gcov as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46928

Reviewed By: malfet

Differential Revision: D24569240

Pulled By: janeyx99

fbshipit-source-id: 4be012bff21ddae0c81339665b58324777b9304f
2020-10-27 15:09:35 -07:00
dc53eefd25 Conditional requirement for py3.6 only (#46932)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46930

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46932

Reviewed By: mrshenli

Differential Revision: D24574196

Pulled By: seemethere

fbshipit-source-id: 11daf8abe226670277f1b5682fd9890d23576271
2020-10-27 14:59:55 -07:00
79a1d2bd78 [iOS] Bump up the cocoapods version (#46935)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46935

Bump up the cocoapods version
ghstack-source-id: 115283786

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: xta0

Differential Revision: D24572715

fbshipit-source-id: 41ffcd43512dc7d4e94af887fb5dfeab703d7602
2020-10-27 14:51:33 -07:00
717e6d8081 add type annotations to comm.py (#46736)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46736

Reviewed By: albanD

Differential Revision: D24565554

Pulled By: mrshenli

fbshipit-source-id: 4e40e4232ebf256af228f9c742ea4d28c626c616
2020-10-27 14:27:06 -07:00
151f31ba27 remove event not ready assertion from TestCuda.test_copy_non_blocking (#46857)
Summary:
It is incorrect to assume that a newly recorded event will immediately query as False.
This test is flaky on ROCm due to this incorrect assumption.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46857

Reviewed By: albanD

Differential Revision: D24565581

Pulled By: mrshenli

fbshipit-source-id: 0e9ba02cf52554957b29dbeaa5093696dc914b67
2020-10-27 14:21:40 -07:00
8c39f198b4 Fix typo in setup.py (#46921)
Summary:
Also, be a bit more future-proof in the supported-versions list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46921

Reviewed By: seemethere

Differential Revision: D24568733

Pulled By: malfet

fbshipit-source-id: ae34f8da1ed39b80dc34db0b06e4ef142104a3ff
2020-10-27 13:14:41 -07:00
21e60643c0 [numpy] torch.log{2,10} : promote integer inputs to float (#46810)
Summary:
References https://github.com/pytorch/pytorch/issues/42515

cc: mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46810

Reviewed By: izdeby

Differential Revision: D24536187

Pulled By: mruberry

fbshipit-source-id: b7dd7678d4e996f3dea0245c65055654e02be459
2020-10-27 13:07:44 -07:00
bbe5bfaa4f Add GradMode::enabled check to max_pool1d (#46767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46767

## Benchmark
```
------------------------------------------------------------------------------------------ benchmark: 2 tests ------------------------------------------------------------------------------------------
Name (time in us)             Min                   Max                  Mean              StdDev                Median                IQR            Outliers         OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_grad_disabled       390.0155 (1.0)        533.4131 (1.0)        392.6161 (1.0)        8.5603 (1.0)        390.7457 (1.0)       0.3912 (1.0)        98;319  2,547.0171 (1.0)        2416           1
test_grad_enabled      3,116.7269 (7.99)     4,073.2883 (7.64)     3,178.0827 (8.09)     122.7487 (14.34)    3,142.2675 (8.04)     33.0228 (84.42)       10;22    314.6551 (0.12)        225           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```

*snippet (using pytest benchmark module)*
```
import torch

torch.set_num_threads(1)
x = torch.randn(1000, 10, 36, requires_grad=True)

def test_grad_enabled(benchmark):
    benchmark(torch.max_pool1d, x, 2)

def test_grad_disabled(benchmark):
    torch.set_grad_enabled(False)
    benchmark(torch.max_pool1d, x, 2)
```

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24565126

Pulled By: heitorschueroff

fbshipit-source-id: 91a93be9921f597db21e9dc277f6e36eae85b37a
2020-10-27 10:23:42 -07:00
daf2a6a29d Increase no-output-timeout for OSX builds (#46891)
Summary:
Because conda-build native library relocation scripts can take a while.
See https://app.circleci.com/pipelines/github/pytorch/pytorch/227245/workflows/e287613d-5e48-4bca-b3d8-b75df2be9f65/jobs/8235584 :
```
Oct 15 08:53:31
Oct 15 09:49:27    INFO:
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46891

Reviewed By: seemethere

Differential Revision: D24553645

Pulled By: malfet

fbshipit-source-id: 62b2251f174aec7ff573a8c4f8cb7a920fa3eaca
2020-10-27 08:04:36 -07:00
d5cd781cd3 Update dper3 to use torch.nan_to_num and nan_to_num_ (#46873)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46873

OSS:
Add op benchmark for torch.nan_to_num and torch.nan_to_num_
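
For reference, what the benchmarked op does (default replacement values shown in the comments):
```
import torch

x = torch.tensor([float("nan"), float("inf"), -float("inf"), 1.0])
y = torch.nan_to_num(x)  # nan -> 0.0, +inf/-inf -> dtype max/min by default
x.nan_to_num_()          # in-place variant
assert torch.equal(x, y)
```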

Test Plan:
OSS:
`buck run mode/opt caffe2/benchmarks/operator_benchmark/pt:nan_to_num_test`

Reviewed By: qizzzh, houseroad

Differential Revision: D24521835

fbshipit-source-id: 1fd50a99e5329ffec2d470525ce6976d39424958
2020-10-27 06:41:48 -07:00
8640905088 add sparse_nn_partition (#46390)
Summary:
WIP: This PR adds sparse_nn_partition to the Partitioner class. It includes logical device assignment for all DAG nodes. The basic idea is to do size_based_partition separately for embedding nodes and non-embedding nodes. A unit test, test_different_size_partition, is also added in test_fx_experimental.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46390

Reviewed By: gcatron

Differential Revision: D24555415

Pulled By: scottxu0730

fbshipit-source-id: 8772af946d5226883759a02a1c827cfdfce66097
2020-10-27 00:11:58 -07:00
4b6e307191 Replace flatten tensors with flatten loops. (#46737)
Summary:
This is the second attempt at replacing flatten tensors with flatten loops in `TensorExprKernel::generateStmt`. The first attempt (https://github.com/pytorch/pytorch/pull/46539) resulted in a build failure due to an exception that gets thrown during inline.

The reason for the build failure was that there was an inline step which was supposed to happen on the unflattened tensors. This was necessary earlier because for every flattened tensor there was an unflattened tensor that had to be inlined. That is no longer necessary since we no longer have two tensors (flattened and unflattened). Removed this inline step.

Checked python and cpp tests on CPU as well as CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46737

Reviewed By: anjali411, izdeby

Differential Revision: D24534529

Pulled By: navahgar

fbshipit-source-id: 8b131a6be076fe94ed369550d9f54d3879fdfefd
2020-10-27 00:01:20 -07:00
6b50ccc41c [quant][graphmode][fx] Support sigmoid/hardsigmoid/tanh in qat (#46738) (#46871)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46871

Test Plan:
Imported from OSS

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24547180

fbshipit-source-id: d2eb9aa74c6e5436204376b1a2ebcc6188d3562f
2020-10-26 23:52:07 -07:00
60eded6c0f Add single element tuple output from to_backend/to_glow (#5029)
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/5029

Support single element tuples in to_backend

Test Plan: new unit test for to_glow

Reviewed By: andrewmillspaugh

Differential Revision: D24539869

fbshipit-source-id: fb385a7448167b2b948e70f6af081bcf78f338dc
2020-10-26 22:29:04 -07:00
bcbb6baccf Add a warning message that torch.sign would not support complex numbers (#43280)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43280

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24538769

Pulled By: anjali411

fbshipit-source-id: ab2d5283501e4c1d7d401d508e32f685add7ebb1
2020-10-26 21:13:12 -07:00
37da6d26ff add fburl link to error message (#46795)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46795

Add an fburl link to the error message for missing ops so users can debug it themselves.

Test Plan: fburl.com/missing_ops

Reviewed By: iseeyuan

Differential Revision: D24519992

fbshipit-source-id: d2d16db7e9d9c84ce2c4600532eb253c30b31971
2020-10-26 21:05:49 -07:00
9858b012ec Fix TripletMarginWithDistanceLoss example code (#46853)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45210

Removes `requires_grad=True` from all the `randint`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46853

Reviewed By: izdeby

Differential Revision: D24549483

Pulled By: soulitzer

fbshipit-source-id: c03576571ed0b2dbb281870f29a28eb6f6209c65
2020-10-26 21:02:54 -07:00
4a35280ec2 [c10] fix weak_intrusive_ptr lock() (#46007)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46007

When the owner has released the object, target becomes null and it is illegal to
access refcount_ again. This PR fixes this and returns null in that case.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24374846

Pulled By: wanchaol

fbshipit-source-id: 741074f59c0904a4d60b7bde956cad2d0925be4e
2020-10-26 20:54:12 -07:00
b3e64c86e0 Remove loop_test mode (#46618)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46618

Using D19631971 (b4b1b100bd) and https://github.com/pytorch/pytorch/pull/32935/files as a reference.

Test Plan:
```
$ buck run glow/fb/test:sparsenn_test -- --gtest_filter='SparseNNTest.vanillaC2' --onnxifi_debug_mode --nocaffe2_predictor_use_memonger
```

Generated dot file https://www.internalfb.com/intern/graphviz/?paste=P146216905

Reviewed By: yinghai

Differential Revision: D24427800

fbshipit-source-id: 7d1d8768352a52af104e0a75ce982c1eb861aa73
2020-10-26 20:38:41 -07:00
af27da93de Add Vulkan Tensor factory. (#44016)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44016

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23820823

Pulled By: AshkanAliabadi

fbshipit-source-id: d007650f255fd79b4a2f4bba0bf8ea00f9a2e6cf
2020-10-26 18:38:13 -07:00
c9bf03a6c4 Add Vulkan Tensor. (#44015)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44015

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23820827

Pulled By: AshkanAliabadi

fbshipit-source-id: 7691da56a5d0073d078b901d8951437757ab1085
2020-10-26 18:35:16 -07:00
2397c8d1f7 [pytorch] Improve/fix heuristics for using mkldnn vs native conv (#46675)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46675

We've found a few heuristics for using/not using mkldnn that seem to generally
improve performance on 2d and 3d conv.

- 1x1 convolutions are basically batch matmuls, and mkldnn's implementation
  appears to usually be slower than using the native conv (which lowers to
  aten::mm, which in turn calls mkl gemm); see the sketch after this list.

- 3d conv was often not using mkldnn even when it's beneficial, because the
  heuristic was checking the kernel depth rather than height/width.  mkldnn
  seems to be faster for (1, 7, 7) and (3, 7, 7) kernel sizes, which are
  allowed by the new heuristic.
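
A quick numerical check of the first bullet (a 1x1 conv is a batched matmul), using one of the shapes from the table below:
```
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 56, 56)
w = torch.randn(256, 64, 1, 1)

out_conv = F.conv2d(x, w)
# Same computation as a matmul over flattened spatial positions.
out_mm = (w.view(256, 64) @ x.view(1, 64, -1)).view(1, 256, 56, 56)
assert torch.allclose(out_conv, out_mm, atol=1e-4)
```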

Test Plan:
Bento notebooks showing before/after:
before: https://www.internalfb.com/intern/anp/view/?id=38089
after: https://www.internalfb.com/intern/anp/view/?id=380893

Also, I've run a conv fuzzer, and it generally supports these heuristics.  I'm
not sure how to best share the data since there's a lot of it (I tried about
50k parameter combinations).

For the 1x1 case, about 70% were faster with "native".  I played with
constructing a decision tree (using scikit-learn) and found that switching back
to MKL for batch size > 16 might be slightly better still, but I'm not sure
it's worth complicating the heuristic.

Results for some popular shapes in tabular format:
```
[------------------------- conv2d_1x1 ------------------------]
                                           |   base   |   diff
1 threads: ----------------------------------------------------
      [1, 128, 56, 56] [256, 128, 1, 1]    |  3665.3  |  2838.4
      [1, 512, 14, 14] [1024, 512, 1, 1]   |  3174.7  |  3164.0
      [1, 64, 56, 56] [256, 64, 1, 1]      |  2249.1  |  1468.8
      [1, 1024, 14, 14] [512, 1024, 1, 1]  |  3158.2  |  3147.7
      [1, 1024, 7, 7] [2048, 1024, 1, 1]   |  8191.8  |  3973.9
      [1, 2048, 7, 7] [1024, 2048, 1, 1]   |  7901.2  |  3861.6
      [1, 256, 28, 28] [512, 256, 1, 1]    |  3103.9  |  2775.9
2 threads: ----------------------------------------------------
      [1, 128, 56, 56] [256, 128, 1, 1]    |  1973.7  |  1475.8
      [1, 512, 14, 14] [1024, 512, 1, 1]   |  2265.0  |  1603.0
      [1, 64, 56, 56] [256, 64, 1, 1]      |  1445.4  |   789.8
      [1, 1024, 14, 14] [512, 1024, 1, 1]  |  2298.8  |  1620.0
      [1, 1024, 7, 7] [2048, 1024, 1, 1]   |  6350.7  |  1995.0
      [1, 2048, 7, 7] [1024, 2048, 1, 1]   |  6471.2  |  1903.7
      [1, 256, 28, 28] [512, 256, 1, 1]    |  1932.3  |  1524.2
4 threads: ----------------------------------------------------
      [1, 128, 56, 56] [256, 128, 1, 1]    |  1198.8  |   785.6
      [1, 512, 14, 14] [1024, 512, 1, 1]   |  1305.0  |   901.6
      [1, 64, 56, 56] [256, 64, 1, 1]      |   791.0  |   472.9
      [1, 1024, 14, 14] [512, 1024, 1, 1]  |  1311.2  |   908.5
      [1, 1024, 7, 7] [2048, 1024, 1, 1]   |  3958.6  |   997.7
      [1, 2048, 7, 7] [1024, 2048, 1, 1]   |  4099.6  |  1023.1
      [1, 256, 28, 28] [512, 256, 1, 1]    |  1120.3  |   740.8

Times are in microseconds (us).

[--------------------- conv2d_7x7 ---------------------]
                                      |   base  |   diff
1 threads: ---------------------------------------------
      [25, 3, 48, 320] [64, 3, 7, 7]  |  209.3  |  229.3
      [1, 3, 384, 288] [64, 3, 7, 7]  |   68.9  |   72.3
2 threads: ---------------------------------------------
      [25, 3, 48, 320] [64, 3, 7, 7]  |  116.0  |  117.6
      [1, 3, 384, 288] [64, 3, 7, 7]  |   40.4  |   38.7
4 threads: ---------------------------------------------
      [25, 3, 48, 320] [64, 3, 7, 7]  |   64.2  |   66.5
      [1, 3, 384, 288] [64, 3, 7, 7]  |   21.4  |   21.9

Times are in milliseconds (ms).

[---------------------------- conv3d ---------------------------]
                                               |   base  |   diff
1 threads: ------------------------------------------------------
      [1, 3, 16, 224, 224] [32, 3, 1, 7, 7]    |  602.8  |  296.2
      [1, 3, 4, 112, 112] [64, 3, 3, 7, 7]     |   52.5  |   26.5
      [1, 256, 8, 14, 14] [256, 256, 3, 3, 3]  |   50.0  |   50.3
2 threads: ------------------------------------------------------
      [1, 3, 16, 224, 224] [32, 3, 1, 7, 7]    |  351.0  |  168.1
      [1, 3, 4, 112, 112] [64, 3, 3, 7, 7]     |   38.5  |   14.9
      [1, 256, 8, 14, 14] [256, 256, 3, 3, 3]  |   24.8  |   26.2
4 threads: ------------------------------------------------------
      [1, 3, 16, 224, 224] [32, 3, 1, 7, 7]    |  212.6  |   96.0
      [1, 3, 4, 112, 112] [64, 3, 3, 7, 7]     |   21.5  |    7.6
      [1, 256, 8, 14, 14] [256, 256, 3, 3, 3]  |   12.7  |   13.3

Times are in milliseconds (ms).
```

Reviewed By: jansel

Differential Revision: D24452071

fbshipit-source-id: 12687971be531831530dc29bf2fc079a917d0c8d
2020-10-26 18:27:12 -07:00
a602811da7 [ROCm] fix bug in miopen findAlgorithm. (#46852)
Summary:
findAlgorithm should return if and only if a suitable algorithm is found.
The default algorithm is not guaranteed to have been cached.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46852

Reviewed By: izdeby

Differential Revision: D24546748

Pulled By: bhosmer

fbshipit-source-id: 171137b377193e0825769b61d42a05016f02c34c
2020-10-26 18:20:04 -07:00
a4adc1b6d7 Fix unused variable warning (#46838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46838

`SCALAR_TYPE` may be unused in some contexts where the macro is used. We use the standard `(void)var` trick to suppress the compiler warning in these instances.

Test Plan: Standard pre-commit tests.

Reviewed By: jerryzh168

Differential Revision: D24481142

fbshipit-source-id: 4fcde669cc279b8863443d49c51edaee69f4d7bd
2020-10-26 18:14:27 -07:00
a6cd294c9b [Gradient Compression] Refactor CommHookInterface and PythonCommHook. (#46512)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46512

1. Merge 1-line PythonCommHook constructor into the header for simplicity.
2. Move the implementation of PythonCommHook destructor from the header file to cpp file.
3. Rename processFuture method as parseHookResult for readability.
4. Simplify some comments.

Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115161086

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_sparse_gradients

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_with_then_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_future_passing_gpu_gloo

Reviewed By: jiayisuse

Differential Revision: D24374282

fbshipit-source-id: c8dbdd764bca5b3fa247708f1218cb5ff3e321bb
2020-10-26 18:07:58 -07:00
adafd3d4b2 Support RRef.backward() for local RRefs. (#46568)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46568

This PR adds support for an RRef.backward() API. This would be useful
in applications like pipeline parallelism as described here:
https://github.com/pytorch/pytorch/issues/44827

This PR only adds support for local RRefs; remote RRef support will be added in
a follow-up PR.
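
For illustration, a minimal sketch of the new API on a local RRef (assuming an RPC agent has already been initialized on this worker via `rpc.init_rpc`):

```python
import torch
import torch.distributed.rpc as rpc

# Assumes rpc.init_rpc(...) has already been called on this worker.
t = torch.rand(3, requires_grad=True)
rref = rpc.RRef(t.sum())  # a local RRef holding a scalar loss
rref.backward()           # runs autograd through the RRef's value
print(t.grad)
```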
ghstack-source-id: 115100729

Test Plan:
1) unit tests.
2) waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D24406311

fbshipit-source-id: fb0b4e185d9721bf57f4dea9847e0aaa66b3e513
2020-10-26 17:31:17 -07:00
7731370e71 CUDA BFloat16 gelu, hardswish, hardsigmoid (#44997)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44997

Reviewed By: izdeby

Differential Revision: D24547748

Pulled By: ngimel

fbshipit-source-id: 34639dfe6ca41c3f59fd2af861e5e3b1bb86757a
2020-10-26 16:01:22 -07:00
99cf3b1ce4 CUDA BFloat16 signal windows (#45155)
Summary:
Looks like this op was never tested for support of different dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45155

Reviewed By: zou3519

Differential Revision: D24438839

Pulled By: ngimel

fbshipit-source-id: 103ff609e11811a0705d04520c2b97c456b623ef
2020-10-26 15:53:30 -07:00
13a5be571b Enable complex backward for torch.take() and tensor.fill_() (#46860)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46860

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D24544601

Pulled By: anjali411

fbshipit-source-id: 4e29d48da30da3630cb558ccee464d89780b1ab7
2020-10-26 15:46:08 -07:00
02dc52f25b vmap fallback: gracefully error out when vmap over dim of size 0 (#46846)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46846

Previously, this would crash with a floating point error. If the user vmaps
over a dimension of size 0, ideally we would return a tensor with a
batch dim of size 0 and the correct output shape. However, this isn't
possible without a shape-checking API. This PR changes the vmap fallback
to error out gracefully if it sees vmap occurring over a dimension of
size 0.

If we want to support vmapping over dimension of size 0 for a specific
op, then the guidance is to implement a batching rule for that op that
handles 0-sized dims.
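
A minimal repro sketch of the failure mode described above (using the prototype `torch.vmap` entry point; the exact error text is an assumption):

```python
import torch

x = torch.randn(0, 3)  # the vmapped dimension has size 0
# Previously this could crash with a floating point error; with this
# change the fallback raises a RuntimeError instead.
torch.vmap(torch.sum)(x)
```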

Test Plan: - new test

Reviewed By: ezyang

Differential Revision: D24539315

Pulled By: zou3519

fbshipit-source-id: a19c049b46512d77c084cfee145720de8971f658
2020-10-26 15:32:22 -07:00
5e2f17d77a Add NCCL_ASYNC_ERROR_HANDLING to docs (#46856)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46856

Add reference to NCCL_ASYNC_ERROR_HANDLING in the pytorch docs,
similar to how NCCL_BLOCKING_WAIT is currently described.
ghstack-source-id: 115186877

Test Plan: CI, verifying docs change

Reviewed By: jiayisuse

Differential Revision: D24541822

fbshipit-source-id: a0b3e843bc6392d2787a4bb270118f2dfda5f4ec
2020-10-26 14:41:32 -07:00
57bf0b596a [docs] Changing the wording on quantization versioning and support (#46858)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46858

Test Plan: Imported from OSS

Reviewed By: dskhudia

Differential Revision: D24542598

Pulled By: z-a-f

fbshipit-source-id: 0eb7a2dcc8f8ad52954f2555cf41d5f7524cbc2c
2020-10-26 14:30:50 -07:00
58ed60c259 Added context manager enabling all futures returned by rpc_async and custom build rpc functions to be automatically waited on (#41807)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41807

Test Plan: Make sure ci tests pass, including newly written test

Reviewed By: mrshenli

Differential Revision: D22640839

Pulled By: osandoval-fb

fbshipit-source-id: 3ff98d8e8c6e6d08575e307f05b5e159442d7216
2020-10-26 12:53:35 -07:00
25db74bf5e Revert D24486972: [quant][graphmode][fx] Support sigmoid/hardsigmoid/tanh in qat
Test Plan: revert-hammer

Differential Revision:
D24486972 (e927b62e73)

Original commit changeset: c9f139bfdd54

fbshipit-source-id: 2a75f5ec93d55a62b40d1cdd49adcf65436058f7
2020-10-26 12:47:05 -07:00
0c74b43a3f Update TensorPipe submodule (#46842)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46842

Reviewed By: mrshenli

Differential Revision: D24539899

Pulled By: lw

fbshipit-source-id: 8731165c6ecd3c97433b4dfa469989f5b9019e36
2020-10-26 12:40:09 -07:00
56a3831bc6 [NVFuser]Benchmark minor update (#46778)
Summary:
This is a tiny PR for two minor fixes:

1. Added `torch._C._jit_set_texpr_fuser_enabled(False)` to enable shape inference on nv fuser runs.
2. Renamed dynamic benchmark modules to avoid multiple pattern matches, i.e. `simple_element` vs. `dynamic_simple_element`. I guess it'd be much easier if the pattern matching were based on `startswith`; I'd be happy to update that if agreed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46778

Reviewed By: zhangguanheng66

Differential Revision: D24516911

Pulled By: bertmaher

fbshipit-source-id: 839f9a3e058f9d7aca17b2e6eb8b558e0e48e8f4
2020-10-26 12:22:36 -07:00
e927b62e73 [quant][graphmode][fx] Support sigmoid/hardsigmoid/tanh in qat (#46738)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46738

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D24486972

fbshipit-source-id: c9f139bfdd54973da1a93a45e32937595dbe67fc
2020-10-26 12:04:42 -07:00
b5662ba0f0 [uhm][0/n] add cuda Mod Op (#46732)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46732

as titled

Test Plan:
unittest

buck test mode/dev-nosan //caffe2/caffe2/python/operator_test:mod_op_test

Reviewed By: xianjiec

Differential Revision: D24368100

fbshipit-source-id: 1232d22a67ac268986043911d548fa9d657470ec
2020-10-26 11:07:51 -07:00
5a2b537b54 Add error messages and workaround for RET failure of containers with a torch class type (#46543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46543

Add error messages and workaround for RET failure of containers with a torch class type.
 - Error case condition
  1) ins.op == RET
  2) input_type == TypeKind::ListType or TypeKind::DictType
  3) Any(input_type's element type) == TypeKind::ClassType
ghstack-source-id: 114618426

Test Plan:
buck test mode/dev caffe2/test:mobile -- 'test'

    Summary
       Pass: 13
       ListingSuccess: 1
    Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/7318349417617713

Reviewed By: iseeyuan

Differential Revision: D24388483

fbshipit-source-id: 7d30f6684a999054d0163e691422797cb818bb6a
2020-10-26 10:46:07 -07:00
3e606da0af Upgrading lcov install to install v1.15 to be compatible with GCC9 (#46847)
Summary:
According to [this issue](https://github.com/linux-test-project/lcov/issues/58), LCOV 1.14 and below are not compatible with GCC9 when gathering coverage.

Instead of installing `lcov` with `apt-get`, which installs version 1.13, this PR would install v1.15 from source onto the ubuntu Docker images.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46847

Reviewed By: seemethere

Differential Revision: D24540444

Pulled By: janeyx99

fbshipit-source-id: 0ac2a37241d94cdd8fea2fded7984c495a64cedc
2020-10-26 10:11:46 -07:00
83d358da7c Fix LAPACK functionality detection from static OpenBLAS (#46710)
Summary:
BLAS `sgemm_` only depends on pthreads, but LAPACK `cheev_` also depends on libm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46710

Reviewed By: walterddr

Differential Revision: D24476082

Pulled By: malfet

fbshipit-source-id: e0b91116f18bbcdabb1f99c2ec9d98283df4393f
2020-10-26 08:34:28 -07:00
b61671ccd2 Enable dtype arg for torch.linalg.norm with order 'fro' and 'nuc' (#46637)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46255
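
A minimal sketch of the newly accepted combination:

```python
import torch

A = torch.randn(4, 4)
# dtype can now be passed together with the 'fro' and 'nuc' matrix orders
print(torch.linalg.norm(A, ord='fro', dtype=torch.float64))
print(torch.linalg.norm(A, ord='nuc', dtype=torch.float64))
```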

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46637

Reviewed By: gchanan

Differential Revision: D24459097

Pulled By: mruberry

fbshipit-source-id: 7f207a23de902c27f8313ee80f452687a97e8f6f
2020-10-26 02:59:00 -07:00
d94bd998ec Update backward formulas (Re #44444) (#46275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46275

Re #44444

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D24285785

Pulled By: anjali411

fbshipit-source-id: c60ecd4fe4f144132085f2c91d3b950e92b2a491
2020-10-25 19:40:59 -07:00
edbc84aa4a Fix hash type (#46769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46769

The value to be hashed is an `int64_t`, but the hash is an `int`. This results in a down-conversion that throws out bits which would otherwise be hashed.

Test Plan: Standard pre-commit test rig

Reviewed By: malfet

Differential Revision: D24480962

fbshipit-source-id: 497b1d8bc3f6d2119a6ba16e6ae92911bd34b916
2020-10-24 16:14:41 -07:00
fa8cd06a5c Perform explicit cast (#46771)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46771

`std::ceil` returns a `float` which is cast to `size_t` by the `max` operation.

We convert to `int64_t` to suppress the warning while matching the type of `newDims[0]`.

Since the types match, we don't need an explicit template type for `max`. This allows `max` to take `int64_t` as its values, matching the type of `newCapacity`.

Test Plan: Standard pre-commit test rig.

Reviewed By: malfet

Differential Revision: D24481684

fbshipit-source-id: aed7cabc1e9d395b2662cb633f3ace19c279ab4c
2020-10-24 16:10:02 -07:00
9cbdd84e15 Fix compiler warning
Summary: `sizeof` returns an unsigned, so comparison against `-1` is a warning. This fixes that.

Test Plan: Standard pre-commit test rig.

Reviewed By: bhosmer

Differential Revision: D24506390

fbshipit-source-id: cdb2887d319c6730a90b9f8d74a248527dd6c2ab
2020-10-24 14:23:43 -07:00
f9b9430152 Support doc_string for TorchBind custom classes (#46576)
Summary:
With this PR, users can optionally provide a "doc_string" to describe a class or its method. doc_strings for TorchBind classes and methods are stored as `doc_string` properties on `Function` and `ScriptClass`. These `doc_string` properties are then exposed in the Python layer via PyBind for doc generation.

Fixes https://github.com/pytorch/pytorch/issues/46047

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46576

Reviewed By: wanchaol

Differential Revision: D24440636

Pulled By: gmagogsfm

fbshipit-source-id: bfa9b270a6c2d8bc769a88fad6be939cc6310412
2020-10-24 12:51:35 -07:00
7d4c1a5ab0 Fix type warning (#46770)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46770

Test Plan: Standard pre-commit test rig.

Reviewed By: malfet

Differential Revision: D24480898

fbshipit-source-id: a5031f1e20f4b1ea5954e7cabd54502300d5a916
2020-10-24 12:37:24 -07:00
37dbc6117f [quant][eagermode] Add additional_fuser_method_mapping to config (#46355)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46355

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24319562

fbshipit-source-id: be9800723c0b3e36f26e73c25c0c6ae1d4344f45
2020-10-24 02:18:04 -07:00
13b7855f33 Support hashing of various data types by implementing generic hashing for IValues (#46441)
Summary:
It used to be that TorchScript only supported hashing of `int`, `float` and `str`. This PR adds hashing for many other types including `Tuple`, `bool`, `device` by implementing generic hashing on IValue.

* Tensor hashing follows eager behavior, which is identity-based (hash according to pointer address rather than tensor content).
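
A minimal sketch of the newly supported usage in TorchScript:

```python
from typing import Tuple

import torch

@torch.jit.script
def tuple_key(item: Tuple[int, bool]) -> int:
    # hash() over tuples and bools now compiles in TorchScript
    return hash(item)

print(tuple_key((3, True)))
```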

Fixes https://github.com/pytorch/pytorch/issues/44038

This is based on suo's https://github.com/pytorch/pytorch/issues/44047, with some cleanup, more tests, and a fix for the BC check issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46441

Reviewed By: robieta

Differential Revision: D24440713

Pulled By: gmagogsfm

fbshipit-source-id: 851f413f99b6f65084b551383ad21e558e7cabeb
2020-10-23 21:26:01 -07:00
789e935304 Annotate torch.nn.cpp (#46490)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46489

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46490

Reviewed By: zhangguanheng66

Differential Revision: D24509519

Pulled By: ezyang

fbshipit-source-id: edffd32ab2ac17ae4bbd44826b71f5cb9f1da1c5
2020-10-23 17:40:32 -07:00
c4892c8efe [pytorch][tensorexpr] Promote integer arguments to sin/cos/tan to float (#46776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46776

Following numpy and (now) eager mode

Fixes #46458

Test Plan: test_jit_fuser_te

Reviewed By: navahgar

Differential Revision: D24509884

fbshipit-source-id: c063030fc609ba4aefcd9abd25b50f082fef1548
2020-10-23 17:32:54 -07:00
343260a1cc [quant][graphmode][fx] Add support for additional_{fusion/quant}_pattern (#46346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46346

Allow user to provide additional fusion/quant patterns for fx graph mode

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24317437

fbshipit-source-id: 719927cce50c74dffa4f848bd5c98995c944a26a
2020-10-23 15:03:42 -07:00
74d81080a0 Use new_zeros in evenly_distribute_backward (#46674)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46674

Summary
-------

This adds batched gradient support (i.e., vmap through the gradient
formulas) for Tensor.max(), Tensor.min(), Tensor.median()
that have evenly_distribute_backward as their backward formula.

Previously, the plan was to register incompatible gradient formulas as
backward operators (see #44052). However, it turns out that we can just use
`new_zeros` to get around some incompatible gradient formulas (see next
section for discussion).

Context: the vmap+inplace problem
---------------------------------

A lot of backwards functions are incompatible with BatchedTensor due to
using in-place operations. Sometimes we can allow the in-place
operations, but other times we can't. For example, consider select_backward:

```
Tensor select_backward(const Tensor& grad, IntArrayRef input_sizes,
                       int64_t dim, int64_t index) {
  auto grad_input = at::zeros(input_sizes, grad.options());
  grad_input.select(dim, index).copy_(grad);
  return grad_input;
}
```
and consider the following code:

```
x = torch.randn(5, requires_grad=True)
def select_grad(v):
  torch.autograd.grad(x[0], x, v)

vs = torch.randn(B0)
batched_grads = vmap(select_grad)(vs)
```

For the batched gradient use case, grad is a BatchedTensor.
The physical version of grad has size (B0,).
However, select_backward creates a grad_input of shape (5), and
tries to copy grad to a slice of it.

Up until now, the proposal to handle this has been to register these
backward formulas as operators so that vmap doesn’t actually see the
`copy_` calls (see #44052). However, it turns out we can actually just
use `new_zeros` to construct a new Tensor that has the same
"batched-ness" as grad:
```
auto grad_input = grad.new_zeros(input_sizes);
grad_input.select(dim, index).copy_(grad);
```
We should use this for simple backward functions. For more complicated
backward functions where this solution doesn't work, we should register
those as operators.

Alternatives
------------
Option 2: Register `evenly_distribute_backward` as an operator and have the
vmap fallback run it in a loop.
- This requires more LOC changes.
- Furthermore, we'd have to write an efficient batching rule for
`evenly_distribute_backward` in the future.
- If we use `new_zeros` instead, we don't need to write an efficient
batching rule for `evenly_distribute_backward` as long as the
constituents of `evenly_distribute_backward` have efficient batching rules.

Option 3: Have factory functions perform differently if they are called
inside vmap.
- For example, `at::zeros(3, 5)` could return a Tensor of shape
`(B0, B1, 3, 5)` if we are vmapping over two dimensions with size B0 and B1.
This requires maintaining some global and/or thread-local state about
the size of the dims being vmapped over which can be tricky.

And more...

Future
------
- I will undo some of the work I’ve done in the past to move backward
functions to being operators (#44052, #44408). The simpler backward
functions (like select backward) can just use Tensor.new_zeros.
I apologize for the thrashing.
- Include a NOTE about the vmap+inplace problem somewhere in the
codebase. I don't have a good idea of where to put it at the moment.

Test Plan
---------
- New tests

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D24456781

Pulled By: zou3519

fbshipit-source-id: 9c6c8ee2cb1a4e25afd779bdf0bdf5ab76b9bc20
2020-10-23 14:29:40 -07:00
aa828bf084 Support undefined grads in vmap fallback (#46671)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46671

Previously, the vmap fallback would choke whenever it saw an undefined
tensor. For each sample in a batch, the fallback runs an operator
and then stacks together outputs to get the actual output.
Undefined tensors can occur as outputs while computing batched gradients
with vmap.

This PR updates the vmap fallback to handle undefined tensors which can
appear in backward formulas:
- if the output is undefined for every sample in the batch, the vmap
fallback returns an undefined tensor
- if the output is defined for every sample in the batch, the vmap
fallback stacks the defined tensors together
- if the output is defined for some samples but undefined for others,
we error out.

Test Plan: - new tests

Reviewed By: ezyang

Differential Revision: D24454909

Pulled By: zou3519

fbshipit-source-id: d225382fd17881f23c9833323b68834cfef351f3
2020-10-23 14:26:50 -07:00
85954164a4 fix minor bug, message variable does not exist (#46777)
Summary:
When run with `--continue-through-error`, the script ends with the following error:

```
Traceback (most recent call last):
  File "run_test.py", line 745, in <module>
    main()
  File "run_test.py", line 741, in main
    print_to_stderr(message)
NameError: name 'message' is not defined
make: *** [macos-compat] Error 1
```

This PR just changes `message` to `err`, which is the intended variable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46777

Reviewed By: seemethere

Differential Revision: D24510460

Pulled By: janeyx99

fbshipit-source-id: be1124b6fc72b178d62acc168d0cbc74962de52b
2020-10-23 14:20:23 -07:00
89f368bef8 Enable XNNPACK on Windows & Update XNNPACK (#45830)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44283.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45830

Reviewed By: zhangguanheng66

Differential Revision: D24504302

Pulled By: ezyang

fbshipit-source-id: ab28088a4fbb553a27ed7c8da87ec7b40c73c2f1
2020-10-23 14:17:45 -07:00
999f7ed3a1 Refactored ForeachFunctors.cuh (#46660)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46660

Test Plan: Imported from OSS

Reviewed By: cpuhrsch

Differential Revision: D24453345

Pulled By: izdeby

fbshipit-source-id: 307839d40a358d9dda3eee6f62990b38b8274642
2020-10-23 13:58:45 -07:00
822efb7275 add workflow ID to report tags (#46725)
Summary:
Currently, CircleCI doesn't report the workflow ID as one of the dimensions. This causes statistics for failed/rerun CircleCI jobs to report overlapping results.

Fix this by adding a workflow ID tag.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46725

Reviewed By: seemethere, zhangguanheng66

Differential Revision: D24505006

Pulled By: walterddr

fbshipit-source-id: cc65bb8ebc0787e443a42584dfb0d2224e824e7d
2020-10-23 12:10:54 -07:00
ccb79f3ac7 Add option to log subprocess output to files in DDP launcher. (#33193)
Summary:
Closes https://github.com/pytorch/pytorch/issues/7134. This request is to add an option to log the subprocess output (each subprocess is training a network with DDP) to a file instead of the default stdout.

The reason for this is that if we have N processes all writing to stdout, it'll be hard to decipher the output, and it would be cleaner to log these to separate files.

To support this, we add an optional argument `--logdir` that sets each subprocess's stdout to a file of the format "node_rank_{}_local_rank_{}" in the logging directory. With this enabled, none of the training processes output to the parent process's stdout; they instead write to the aforementioned files. If a user accidentally passes in something that's not a directory, we fall back to ignoring this argument.

Tested by taking a training script at https://gist.github.com/rohan-varma/2ff1d6051440d2c18e96fe57904b55d9 and running `python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port="29500" --logdir test_logdir train.py`. This results in a directory `test_logdir` with files "node_0_local_rank_0" and "node_0_local_rank_1" being created with the training process stdout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/33193

Reviewed By: gchanan

Differential Revision: D24496013

Pulled By: rohan-varma

fbshipit-source-id: 1d3264cba242290d43db736073e841bbb5cb9e68
2020-10-23 11:22:57 -07:00
e519fcd1aa Remap net name inside arg.n for AsyncIf operator
Summary: Similar to If operator, AsyncIf also contains nets in args. It needs the same handling.

Test Plan:
New unit test test_control_op_remap
`buck test caffe2/caffe2/python:core_test`

Also it worked end to end in prototype of dist bulk eval workflow f226680903

Reviewed By: yyetim

Differential Revision: D24451775

fbshipit-source-id: 50594e2ab9bb457329ed8da7b035f7409461b5f6
2020-10-23 10:41:06 -07:00
3ea26b1424 [WIP] Push rocm to slow path for foreach APIs (#46733)
Summary:
Move ROCM to a slow path for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46733

Reviewed By: ngimel

Differential Revision: D24485012

Pulled By: izdeby

fbshipit-source-id: f0f4227cc594d8a87d44008cd5e27ebe100b6b22
2020-10-23 10:33:41 -07:00
c31ced4246 make torch.lu differentiable. (#46284)
Summary:
As per title. Limitations: only for batches of squared full-rank matrices.

CC albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46284

Reviewed By: zou3519

Differential Revision: D24448266

Pulled By: albanD

fbshipit-source-id: d98215166268553a648af6bdec5a32ad601b7814
2020-10-23 10:13:46 -07:00
52f8d320b3 [ONNX] Update ONNX doc for indexing export (#46349)
Summary:
Adding example code for supported cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46349

Reviewed By: gchanan

Differential Revision: D24459449

Pulled By: malfet

fbshipit-source-id: 65021a96cd12225615aa40af5d916e0cda56d107
2020-10-23 09:49:43 -07:00
f230245c06 Revert D24422354: [pytorch][PR] fix-process-group-counter
Test Plan: revert-hammer

Differential Revision:
D24422354 (caed29a069)

Original commit changeset: 32493cc2001d

fbshipit-source-id: 9b633f738ea555f45031056689f780dde8eda859
2020-10-23 08:04:37 -07:00
e0fd590ec9 Fix incorrect usage of CUDACachingAllocator (#46605)
Summary:
We need an object to hold the ownership of allocated memory in the scope, instead of directly using the raw pointer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46605

Reviewed By: zou3519

Differential Revision: D24453548

Pulled By: ezyang

fbshipit-source-id: d29e5a69afa6c0d9e519849910e04524667d0a26
2020-10-23 07:36:39 -07:00
6c5f634657 Fix grammar and spelling errors (#46713)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46713

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Differential Revision: D24477771

Pulled By: ansley

fbshipit-source-id: bc39b63ab2158a5233e48b89bfaa97a4cfb1f7a1
2020-10-23 01:31:17 -07:00
4fd2cce9fa Check support_as_strided before using empty_strided. (#46746)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46746

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D24492468

Pulled By: ailzhang

fbshipit-source-id: 25f869e64cf8628e41661edca9823e95170ae1ed
2020-10-22 21:56:12 -07:00
129279a374 [FBGEMM][Transposed Conv] add transposed conv support for fbgemm backend for 1d, 2d, 3d (#46607)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46607

Wire the fbgemm backend of transposed conv for the 1d, 2d, and 3d cases in qconv, although there is no official frontend API for the 3d case.
ghstack-source-id: 114896586

Test Plan: https://www.internalfb.com/intern/testinfra/testconsole/testrun/6755399464206048/

Reviewed By: z-a-f

Differential Revision: D24323802

fbshipit-source-id: 1c7d2fbb703018fd15f5c85edcfa6c9deac9662e
2020-10-22 20:55:52 -07:00
8558c0e612 Eliminate narrowing conversion (#46730)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46730

A narrowing conversion on `last_idx` raises a compiler warning. This fixes that.

Test Plan: Standard pre-commit test rig.

Reviewed By: EscapeZero

Differential Revision: D24481497

fbshipit-source-id: f3e913b586738add59c422c3cf65035d87fc9e34
2020-10-22 20:08:59 -07:00
511f89eaa9 Add nvtx.range() context manager (#42925)
Summary:
Small quality-of-life improvement to NVTX Python bindings, that we're using internally and that would be useful to other folks using NVTX annotations via PyTorch. (And my first potential PyTorch contribution.)

Instead of needing to be careful with try/finally to make sure all your range_push'es are range_pop'ed:

```
nvtx.range_push("Some event")
try:
    # Code here...
finally:
    nvtx.range_pop()
```

you can simply do:

```
with nvtx.range("Some event"):
    # Code here...
```

or even use it as a decorator:

```
class MyModel(nn.Module):

    # Other methods here...

    nvtx.range("MyModel.forward()")
    def forward(self, *input):
        # Forward pass code here...
```

A couple small open questions:

1. I also added the ability to call `msg.format()` inside `range()`, with the intention that, if there is nothing listening to NVTX events, we should skip the string formatting, to lower the overhead in that case. If you like that idea, I could add the actual "skip string formatting if nobody is listening to events" parts. We can also just leave it as is. Or I can remove that if you folks don't like it. (In the first two cases, should we add that to `range_push()` and `mark()` too?) Just let me know which one it is, and I'll update the pull request.

2. I don't think there are many places for bugs to hide in that function, but I can certainly add a quick test, if you folks want.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42925

Reviewed By: gchanan

Differential Revision: D24476977

Pulled By: ezyang

fbshipit-source-id: 874882818d958e167e624052e42d52fae3c4abf1
2020-10-22 19:46:16 -07:00
88e94da580 Enable softmax and tiny norm FP16 tests on ROCm (#46363)
Summary:
This pull request enables the following tests on ROCm:
* TestCuda.test_tiny_half_norm_
* TestNNDeviceTypeCUDA.test_softmax_cuda_float16
* TestNNDeviceTypeCUDA.test_softmax_cuda_float32
* TestNNDeviceTypeCUDA.test_softmax_results_cuda_float16
* TestNNDeviceTypeCUDA.test_softmax_results_cuda_float32

The earlier failures, because of which the tests were skipped, were because of a precision issue for FP16 compute on MI25 hardware with ROCm 3.7 and older. The fix was delivered in the compiler in ROCm 3.8.

The pull request fixes https://github.com/pytorch/pytorch/issues/37493

cc: jeffdaily ezyang malfet mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46363

Reviewed By: heitorschueroff

Differential Revision: D24325639

Pulled By: ezyang

fbshipit-source-id: a7dbb238cf38c04b6592baad40b4d71725a358c9
2020-10-22 19:40:00 -07:00
6ae0a7c919 Add ReplaceNaN benchmark as baseline (#46685)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46685

as title

Test Plan:
caffe2

```
./buck-out/gen/caffe2/benchmarks/operator_benchmark/c2/replace_nan_test.par

# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: replace_nan
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1022 10:09:48.508246 1887813 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: replace_nan_M16_N16_dtypefloat
# Input: M: 16, N: 16, dtype: float
Forward Execution Time (us) : 30.742

# Benchmarking Caffe2: replace_nan
# Name: replace_nan_M16_N16_dtypedouble
# Input: M: 16, N: 16, dtype: double
Forward Execution Time (us) : 29.135

# Benchmarking Caffe2: replace_nan
# Name: replace_nan_M64_N64_dtypefloat
# Input: M: 64, N: 64, dtype: float
Forward Execution Time (us) : 94.059

# Benchmarking Caffe2: replace_nan
# Name: replace_nan_M64_N64_dtypedouble
# Input: M: 64, N: 64, dtype: double
Forward Execution Time (us) : 93.569
```

Reviewed By: qizzzh, houseroad

Differential Revision: D24448483

fbshipit-source-id: 51574ca0eca6dba5828dfdc754193dba5a62954f
2020-10-22 19:12:14 -07:00
27e2ea4cea Make add_relu an internal function (#46676)
Summary:
Cleanup for 1.7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46676

Reviewed By: gchanan

Differential Revision: D24458565

Pulled By: albanD

fbshipit-source-id: b1e4b4630233d3f1a4bac20e3077411d1ae17f7b
2020-10-22 18:08:15 -07:00
870a5a0d6d Enable DataParallel to run zero input Module (#46565)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46565

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24405275

Pulled By: glaringlee

fbshipit-source-id: a8baaf4cf227f7f21fc3b080a446f92f0effe18e
2020-10-22 18:04:33 -07:00
842494af77 [quant][fx] EmbeddingBag quantization support (#46678)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46678

Test Plan:
python test/test_quantization.py TestQuantzeFxOps.test_qembedding_bag_module

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D24463306

fbshipit-source-id: 175e77f4450344fbf63409be35338b0c29afd585
2020-10-22 18:04:31 -07:00
e34c825b77 [quant][fx] Embedding quantization support (#46677)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46677

Add support for weight only embedding quantization

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_qembedding_module

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D24463305

fbshipit-source-id: 2dba49d8a77cf237a8e6da2efdd83b1ebdc432d6
2020-10-22 17:59:52 -07:00
fe6fb7753e Clean up use of Flake8 in GitHub CI (#46740)
Summary:
[Previously](https://github.com/pytorch/pytorch/runs/1293724033) Flake8 was run using `flake8-mypy`, which didn't change the actual lint output, and undesirably resulted in this noisy message being printed many times:
```
/opt/hostedtoolcache/Python/3.9.0/x64/lib/python3.9/site-packages is in the MYPYPATH. Please remove it.
See https://mypy.readthedocs.io/en/latest/running_mypy.html#how-mypy-handles-imports for more info
```
Since `mypy` is already run in other test scripts, this PR simply removes it from the Flake8 setup. This PR also removes the `--exit-zero` flag from Flake8, because currently Flake8 gives no error output, so it would be valuable to know if it ever does happen to return error output.

(This doesn't strike me as a perfect solution since now it's a bit harder to reproduce the Flake8 behavior when running locally with `flake8-mypy` installed, but it's the easiest way to fix it in CI specifically.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46740

Reviewed By: janeyx99

Differential Revision: D24487904

Pulled By: samestep

fbshipit-source-id: d534fdeb18e32d3bc61406462c1cf955080a688f
2020-10-22 17:08:16 -07:00
bf1ea14fbc [CI][IOS] Add a arm64 ios job for Metal (#46646)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46646

Test Plan: Imported from OSS

Reviewed By: seemethere, linbinyu

Differential Revision: D24459597

Pulled By: xta0

fbshipit-source-id: e93a3a26897614c66768804c71658928cd26ede7
2020-10-22 16:54:46 -07:00
344abd56f9 [CI][IOS] Rename the IOS_VERSION (#46645)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46645

### Summary

The IOS_VERSION should be renamed to XCODE_VERSION

### Test

- CircleCI

Test Plan: Imported from OSS

Reviewed By: seemethere, linbinyu

Differential Revision: D24459598

Pulled By: xta0

fbshipit-source-id: 9dcba973cc57aa44f8fd4151daf5d89c8da61c67
2020-10-22 16:49:22 -07:00
ce5bca5502 ProcessGroupNCCL::alltoall_base needs to call recordStream (#46603)
Summary:
For similar reasons as documented in the `[Sync Streams]` note.  For a current example, `ProcessGroupNCCL::allgather` must also call `recordStream` and does so already.

The output tensor is created on the default stream (by the application).  NCCL/RCCL internally uses another stream (i.e., ncclStream).  If we do not record the output tensor on the ncclStream, there is a chance that the output tensor might be deallocated while NCCL/RCCL is using it.

The application is not aware of the ncclStream since it's internal to ProcessGroupNCCL.  So, the application cannot record the output tensor on the ncclStream.
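
For illustration, the same pattern expressed with the public Python API (a sketch only; the actual fix lives inside `ProcessGroupNCCL::alltoall_base` in C++):

```python
import torch

side = torch.cuda.Stream()              # stands in for the internal ncclStream
out = torch.empty(1024, device="cuda")  # allocated on the default stream

with torch.cuda.stream(side):
    out.fill_(1.0)                      # the tensor is used on the side stream
# Tell the caching allocator that `out` is in use on `side`, so its memory
# is not reclaimed/reused until the queued work completes.
out.record_stream(side)
```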

Patch originally developed by sarunyap.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46603

Reviewed By: srinivas212

Differential Revision: D24458530

fbshipit-source-id: b02e74d1c3a176ea1b9bbdd7dc671b221fcadaef
2020-10-22 15:53:19 -07:00
bd90379df5 [quant][graphmode][fx] Add support for additional_fuse_method_mapping (#46345)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46345

Allow user to add more fusion mappings

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24317439

fbshipit-source-id: 3b144bbc305e41efbdf3e9fb25dbbeaad9e86c6a
2020-10-22 15:15:31 -07:00
d6519d4e9f [pt][static_runtime] Add option enable_out_variant (#46690)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46690

- Add option enable_out_variant to Static Runtime
- Add gflags --pt_cleanup_activations and --pt_enable_out_variant to the benchmark script

Reviewed By: yinghai, houseroad

Differential Revision: D24438107

fbshipit-source-id: c1185c0fee93edc0118542b2faa8bc4ffdd19075
2020-10-22 15:00:23 -07:00
f326f6a8a0 Remove dilation restriction on cuDNN ConvTranspose2d (#46290)
Summary:
Close https://github.com/pytorch/pytorch/issues/31690

I have verified the functionality of ConvTranspose2d (with this PR) on roughly 32,000 random shapes on V100, A100, using cuDNN 8.0.4 and CUDA 11.1. The 32,000 shapes contain 4x8,000 of (fp16, fp32) x (nchw, nhwc) each.

The random shapes are sampled from
```jsonc
{
    "batch_size": {"low": 1, "high": 8},
    "in_channels": {"low": 16, "high": 128},
    "out_channels": {"low": 16, "high": 128},
    "height": {"low": 16, "high": 224},
    "stride": {"set": [[1, 1], [2, 2]]},
    "padding": {"set": [[0, 0]]},
    "output_padding": {"set": [[0, 0], [1, 1], [0, 1], [1, 0]]},
    "kernel_size": {"set": [[3, 3], [1, 1], [1, 3], [3, 1], [2, 2]]},
    "dilation": {"set": [[1, 1]]},
    "deterministic": {"set": [true, false]},
    "benchmark": {"set": [true, false]},
    "allow_tf32": {"set": [true, false]},
    "groups": {"set": [1, IN_CHANNELS]}
}
```
- Input `width` is the same as `height`.
- `groups` can be either 1, or the same as `in_channels` (grouped convolution). When `groups` is 1, `out_channels` is random; when `groups` is the same as `in_channels`, `out_channels` is also the same as `in_channels`

All of the checked shapes can be found in csv files here https://github.com/xwang233/code-snippet/tree/master/convtranspose2d-dilation/functionality-check-cudnn8.0.4.
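
A minimal sketch of a dilated case that is now allowed to take the cuDNN path:

```python
import torch

conv = torch.nn.ConvTranspose2d(16, 32, kernel_size=3, dilation=2).cuda()
x = torch.randn(2, 16, 24, 24, device="cuda")
y = conv(x)  # dilation > 1 no longer triggers the cuDNN restriction
print(y.shape)
```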

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46290

Reviewed By: mruberry

Differential Revision: D24422091

Pulled By: ngimel

fbshipit-source-id: 9f0120f2995ae1575c0502f1b2742390d7937b24
2020-10-22 13:42:03 -07:00
53dff784e2 [caffe2] Fix inplace ops in onnx::SsaRewrite (#46134)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46134

Make sure in-place ops stay in-place after SsaRewrite. This seems to break the premise of SSA, but it's necessary to ensure correctness. Note that we only preserve the in-place ops that enforce in-place; ops like `Relu` don't enforce in-place, they merely allow it.

(Note: this ignores all push blocking failures!)

Reviewed By: yinghai

Differential Revision: D24234957

fbshipit-source-id: 274bd3ad6227fce6a98e615aad7e57cd2696aec3
2020-10-22 13:26:31 -07:00
51bf7bed84 [caffe2] Allow memonger to optimize nets with inplace(enforced) ops (#46560)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46560

Follow-up for D24236604 (16c52d918b).

For nets that pass the schema check, memonger actually makes sure to preserve the inplaceness of operators if they are already inplace. So we can safely enable it for correct input nets.

(Note: this ignores all push blocking failures!)

Differential Revision: D24402482

fbshipit-source-id: a7e95cb0e3eb87adeac79b9b69eef207957b0bd5
2020-10-22 13:23:33 -07:00
23fad9111e [quant][graphmode][fx] Add additional_qat_module_mapping (#46344)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46344

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24317438

fbshipit-source-id: f9e73aeb4c7a107c8df0bae8319464e7d5d7275b
2020-10-22 13:11:26 -07:00
982fa07ccb torch.nn.Unfold accepts 0-dim for batch size (#40689)
Summary:
In partial completion of https://github.com/pytorch/pytorch/issues/12013

Allows specifying a tensor with 0-dim batch size for `torch.nn.Unfold()`.
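
A minimal sketch of the newly supported case:

```python
import torch

unfold = torch.nn.Unfold(kernel_size=(2, 2))
x = torch.empty(0, 3, 8, 8)  # zero-sized batch dimension
out = unfold(x)
print(out.shape)             # torch.Size([0, 12, 49])
```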

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40689

Reviewed By: zou3519

Differential Revision: D24441164

Pulled By: ngimel

fbshipit-source-id: 49cd53b9b23f2e221aecdb4b5fed19a234038063
2020-10-22 13:05:24 -07:00
c57c560744 Revert "Push rocm to slow path (#46216)" (#46728)
Summary:
This reverts commit bc1ce584512a860c15cb991460d8c98debd62b26.

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46728

Reviewed By: cpuhrsch

Differential Revision: D24482783

Pulled By: izdeby

fbshipit-source-id: 619b710a8e790b9878e7317f672b4947e7b88145
2020-10-22 12:04:29 -07:00
9ccf85b7b4 [FX] Make wrapped functions traceable (#46692)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46692

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Differential Revision: D24465958

Pulled By: jamesr66a

fbshipit-source-id: 8c04aa3f59d1371d730ded7abd8f0c6c047e76b6
2020-10-22 12:00:02 -07:00
2700932ef2 [FX] Fix recursion depth issue on Graph deepcopy (#46669)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46669

Make `Graph`'s deepcopy behavior iterative rather than recursive. This prevents stack overflow issues with very large `Graph`s.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D24455120

Pulled By: jamesr66a

fbshipit-source-id: 5c37db5acabe313b9a7a464bebe2a82c59e4e2e9
2020-10-22 11:55:23 -07:00
18d80501a6 Batching rules for: new_zeros, new_empty (#46606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46606

Note that new_empty uses `m.impl_UNBOXED` because the operator doesn't
go through the c10 dispatcher due to #43572.

Test Plan: - new tests

Reviewed By: ezyang

Differential Revision: D24428106

Pulled By: zou3519

fbshipit-source-id: 5e10f87a967fb27c9c3065f3d5b577db61aeb20e
2020-10-22 11:40:51 -07:00
c44300884e Clarify timing of GetDeviceProperty() (#46715)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46715

Test Plan: N/A

Reviewed By: ezyang

Differential Revision: D24455538

fbshipit-source-id: 1770807d178f618ef6338e28f669f09e4cbd2009
2020-10-22 11:29:31 -07:00
920ec6651f [OpBench] fix jit mode run of operator benchmark for ops with parameters (#46694)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46694

For the op with parameters (e.g. conv), the jit mode run currently will raise an error of
`RuntimeError: Cannot insert a Tensor that requires grad as a constant. Consider making it a parameter or input, or detaching the gradient`. After consulting https://www.fburl.com/vtkys6ug, we decided to turn off gradients for the parameters in the forward run. If we want ops with parameters to work in backward with jit mode, we probably need to turn `TorchBenchmarkBase` into a subclass of `nn.Module`.

Test Plan: ./buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/conv_test.par  --use_jit

Reviewed By: mingzhe09088

Differential Revision: D24451206

fbshipit-source-id: 784eb60ca155b0152d745c92f6d0ce6b2c9014c6
2020-10-22 11:10:28 -07:00
06d50b5eb0 Pull in fairscale.nn.Pipe into PyTorch. (#44090)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44090

This is an initial commit pulling in the torchgpipe fork at
https://github.com/facebookresearch/fairscale.

The purpose of this commit is to just pull in the code and ensure all tests and
builds work fine. We will slowly modify this to match our intended API
mentioned in https://fb.quip.com/txurAV3zIFox#RPZACAfAKMq. Follow up PRs would
address further changes needed on top of the initial commit..

We're pulling the code into the `torch.distributed._pipeline.sync` package. The
package is private on purpose since there is a lot of work (ex: docs, API
changes etc.) that needs to go in before we can actually officially support
this.
ghstack-source-id: 114864254

Test Plan:
1) waitforbuildbot
2) Ran all tests on my devgpu

Reviewed By: mrshenli

Differential Revision: D23493316

fbshipit-source-id: fe3c8b7dadeeb86abdc00e8a8652491b0b16743a
2020-10-22 10:59:02 -07:00
b63ddd6f57 [OSS][Metal] Support Resnet models
Summary:
This diff adds the missing ops to run the Resnet models from Torchvision. Moving the tensors to the GPU significantly improves perf, as shown below (iPhone 11).

Time running on CPU (ms):

```
forward took: 166.115
forward took: 150.722
forward took: 150.383
forward took: 150.345
forward took: 150.761
forward took: 150.533
forward took: 150.588
forward took: 150.812
forward took: 150.925
forward took: 150.25
```

Time running on GPU (ms):

```
forward took: 39.9355
forward took: 41.3531
forward took: 41.798
forward took: 40.4744
forward took: 39.5181
forward took: 42.6464
forward took: 41.2658
forward took: 40.0862
forward took: 42.3533
forward took: 41.9348
```

Discrepancy in results:

```
GPU:
    "(623, 4.6211)",
    "(111, 3.8809)",
    "(499, 3.8555)",
    "(596, 3.8047)",
    "(473, 3.7422)",
    "(846, 3.5762)",
    "(892, 3.5449)",
    "(813, 3.5098)",
    "(446, 3.5020)",
    "(902, 3.4980)"
CPU:
    "(623, 4.4229)",
    "(499, 3.8321)",
    "(596, 3.6192)",
    "(111, 3.5295)",
    "(813, 3.4848)",
    "(584, 3.3979)",
    "(418, 3.3357)",
    "(473, 3.2760)",
    "(846, 3.2745)",
    "(902, 3.2376)"
```

Test Plan: {F340824316}

Reviewed By: IvanKobzarev

Differential Revision: D24416294

fbshipit-source-id: 12c9199ade0b76a7aa8a3838eddc4c19c79b6f37
2020-10-22 10:49:51 -07:00
93719440b8 Replace map(lambda constructs (#46462)
Summary:
Follow-up of https://github.com/pytorch/pytorch/issues/46461 with a similar goal

Makes them more readable and possibly faster. Care has to be taken when substituting: a list comprehension `[f(x) for x in xs]` builds the whole list immediately, whereas a generator expression `(f(x) for x in xs)` (like `map` itself in Python 3) is evaluated lazily. The lazy forms are a benefit where the list of values never needs to exist in memory (e.g. when the result is passed straight to `tuple`, `extend`, or `join`). See the example below.
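
For example (illustrative only):

```python
xs = [1, 2, 3]

# Before: an opaque lambda inside map()
labels = list(map(lambda x: "item_" + str(x), xs))

# After: an equivalent, more readable list comprehension (eager)
labels = ["item_" + str(x) for x in xs]

# Or a generator expression (lazy): nothing is materialized until
# join() consumes it
joined = ", ".join("item_" + str(x) for x in xs)
```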

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46462

Reviewed By: zou3519

Differential Revision: D24422343

Pulled By: ezyang

fbshipit-source-id: 252e33499c92ac0b15238f2df32681dbbda2b237
2020-10-22 09:50:22 -07:00
25dc0056f2 [RPC] print exception message on workers that run python functions (#46372)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46372

Currently, in `_run_function`, we catch an exception from the python
function which is run, and report it back to the master. However in some large
scale training jobs, it would be valuable to also log the error on the trainer
itself for faster debugging.

Test Plan: Added unittest.

Reviewed By: pritamdamania87

Differential Revision: D24324578

fbshipit-source-id: 88460d7599ea69d2c38fd9c10eb6471f7edd4100
2020-10-22 09:44:15 -07:00
3112e23428 [py][vulkan][reland] Add is_vulkan to py api, add vulkan to device type parsing (#46655)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46655

Test Plan: Imported from OSS

Pulled By: IvanKobzarev

Reviewed By: mrshenli

Differential Revision: D24448984

fbshipit-source-id: 5000846a06077f7a5a06dd51da422d2a42f70820
2020-10-22 09:35:50 -07:00
bc1ce58451 Push rocm to slow path (#46216)
Summary:
Push rocm to slow path

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46216

Reviewed By: bwasti

Differential Revision: D24263731

Pulled By: izdeby

fbshipit-source-id: 98ede2478b8f075ceed44a9e4f2aa292f523b8e2
2020-10-22 09:31:01 -07:00
3526b604b1 Add comment about running C++ executable lint locally (#46698)
Summary:
I got confused while locally running some of the `quick-checks` lints (still confused by `.jenkins/run-shellcheck.sh` but that's a separate matter) so I'm adding a comment to the "Ensure C++ source files are not executable" step in case someone in the future tries it and gets confused like I did.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46698

Reviewed By: walterddr

Differential Revision: D24470718

Pulled By: samestep

fbshipit-source-id: baacd8f414aa41b9b7b7aac765d938f21085eac5
2020-10-22 09:24:43 -07:00
52a970bac9 Minor cleaning of test_cuda.py (#46617)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46617

Sort includes, fix deprecated test warning

Test Plan:
```
buck run mode/dev-nosan //caffe2/test:cuda
```

Reviewed By: drdarshan

Differential Revision: D24429247

fbshipit-source-id: 65f53d7c904032e5c8f8ca45d1d2bb437358ffdd
2020-10-22 09:03:30 -07:00
aa9ca85bd0 Fix interval midpoint calculation (#46666)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46666

Interval midpoint calculations can overflow (for integer types); this diff fixes such an instance.

Test Plan: Standard test rig

Reviewed By: xw285cornell

Differential Revision: D23997893

fbshipit-source-id: 788c1181031e0b71d3efb6f7090fbd4ba2aa3f86
2020-10-22 08:53:38 -07:00
7245d2c939 Avoid scatter for single-device case in DDP (#46304)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46304

In the case that a single process operates only on one GPU, we can
avoid this scatter and instead replace it with a recursive version of `to`
which transfers the input tensors to the correct device.

The implementation of `_recursive_to` is modeled after `scatter` in https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/scatter_gather.py, in order to keep parity with the previous conventions (i.e. custom types not having their tensors moved).
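
A rough sketch of the idea (a hypothetical helper for illustration, not the actual `_recursive_to` implementation):

```python
import torch

def recursive_to(obj, device):
    # Move tensors to `device`, recursing through common containers.
    # Mirrors scatter's convention: tensors inside unknown custom types
    # are left untouched.
    if isinstance(obj, torch.Tensor):
        return obj.to(device)
    if isinstance(obj, (list, tuple)):
        return type(obj)(recursive_to(o, device) for o in obj)
    if isinstance(obj, dict):
        return {k: recursive_to(v, device) for k, v in obj.items()}
    return obj
```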
ghstack-source-id: 114896677

Test Plan: Added unittest, and CI

Reviewed By: pritamdamania87

Differential Revision: D24296377

fbshipit-source-id: 536242da05ecabfcd36dffe14168b1f2cf58ca1d
2020-10-22 08:29:37 -07:00
e5a2ba2ea1 Fix benchmark_caffe2
Summary: benchmark_caffe2 is broken due to refactoring that changed from eager test generation to registration only.

Test Plan:
`buck run caffe2/benchmarks/operator_benchmark/c2:add_test`

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: add
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1021 08:07:06.350742 390665 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: add_M8_N16_K32_dtypeint
# Input: M: 8, N: 16, K: 32, dtype: int
Forward Execution Time (us) : 652.748

# Benchmarking Caffe2: add
# Name: add_M16_N16_K64_dtypefloat
# Input: M: 16, N: 16, K: 64, dtype: float
Forward Execution Time (us) : 63.570

# Benchmarking Caffe2: add
# Name: add_M64_N64_K128_dtypeint
# Input: M: 64, N: 64, K: 128, dtype: in
```

Reviewed By: qizzzh

Differential Revision: D24448374

fbshipit-source-id: 850fd375d194c20c385ea4433aea13066c7476e6
2020-10-22 08:09:06 -07:00
143d1fd9f5 Namespace cleanup for 1.7 Part 2 (#46673)
Summary:
make valgrind_toggle and valgrind_supported_platform private functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46673

Reviewed By: gchanan

Differential Revision: D24458133

Pulled By: albanD

fbshipit-source-id: 6f3fad9931d73223085edbd3cd3b7830c569570c
2020-10-22 07:57:51 -07:00
16c5b7b3f2 Avoid leaking has_torch_function and handle_torch_function in torch namespace (#46680)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46680

Reviewed By: zou3519

Differential Revision: D24459823

Pulled By: albanD

fbshipit-source-id: 4ff6925afcf14214dc45921bca0d2f33ca1944a1
2020-10-22 07:48:36 -07:00
905ed3c840 Revised sparse tensor documentation. (#45400)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44635.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45400

Reviewed By: ezyang

Differential Revision: D24359410

Pulled By: mruberry

fbshipit-source-id: 37c691a49a7b0042c7a298e0ed1226702b097c8b
2020-10-22 02:07:54 -07:00
8e13fe6c44 [numpy] torch.sin : support and promote integer inputs to float (#45733)
Summary:
References https://github.com/pytorch/pytorch/issues/42515

> Enable integer -> float unary type promotion for ops like sin

Will follow up on other such ops once this PR is merged.
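
A minimal sketch of the new behavior:

```python
import torch

x = torch.arange(4)  # int64 tensor
y = torch.sin(x)     # integer input is now promoted instead of erroring
print(y.dtype)       # torch.float32 (the default dtype)
```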

cc: mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45733

Reviewed By: zou3519

Differential Revision: D24431194

Pulled By: mruberry

fbshipit-source-id: db600bc5de0e535b538d2aa301c3526b7c75ed17
2020-10-22 01:58:57 -07:00
98aad933b6 [pytorch][PR] Record FutureNCCL callback stream on CUDA caching allocator (#45318)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45318

When calling `then()` from WorkNCCL, record the input data pointers in futureNCCLCallbackStream_ before the execution of the input callback.

Note that the recording cannot be directly added to the lambda used by addCallback in ProcessGroupNCCL.hpp. This is because the type of the future value in that context is a pyobject rather than a TensorList, and type casting would require pybind and introduce a Python dependency, which should not be allowed in the c10d library.

I have considered creating a util function in a separate file to support this type casting, and then placing it under torch/csrc directory where python dependency is allowed. However, torch/csrc has a dependency on c10d, so this will create a circular dependency.

Finally, a `record_stream_cb_` member is added to FutureNCCL, and the default value is nullptr. A default `record_stream_cb_` implementation is added to `PythonFutureWrapper,` where Python dependency is allowed.

In addition, a few lines are reformatted by lint.
caffe2/torch/csrc/distributed/c10d/init.cpp is only reformatted.

#Closes: https://github.com/pytorch/pytorch/issues/44203

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- ProcessGroupNCCLTest
buck test mode/dev-nosan caffe2/test/distributed:c10d  -- test_accumulate_gradients_no_sync_allreduce_with_then_hook
buck test mode/dev-nosan caffe2/test/distributed:c10d  -- test_ddp_comm_hook_allreduce_with_then_hook_nccl

Reviewed By: pritamdamania87

Differential Revision: D23910257

fbshipit-source-id: 66920746c41f3a27a3689f22e2a2d9709d0faa15
2020-10-22 01:49:47 -07:00
ab28bd528d [quant][graphmode][fx] Support quantizing FloatFunctional (#46634)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46634

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24438227

fbshipit-source-id: f33439d51112e13f59ee4292e804495d38fa3899
2020-10-22 01:21:17 -07:00
9b5197b763 [mlf][efficiency] add tensor inference function to last-n collector op (#46693)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46693

title

Test Plan: unit tests

Reviewed By: hx89

Differential Revision: D23946770

fbshipit-source-id: f7c3d4a1b4ef3b0e5f56e5a9a30f5003ce9f40b0
2020-10-22 01:15:00 -07:00
fe4f90c40b Cusolver inverse check info (#46625)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46625

Reviewed By: zou3519

Differential Revision: D24438577

Pulled By: ngimel

fbshipit-source-id: d00e6eb2eae4aa39ca6ecf5914fe9cf37c24b906
2020-10-21 21:46:33 -07:00
adffd8eb6b Add const to the first arg 'grad' of Reducer::copy_grad_to_bucket (#46501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46501

Gradients in this method will not be modified.
ghstack-source-id: 114851646

Test Plan: waitforbuildbot

Reviewed By: pritamdamania87

Differential Revision: D24374300

fbshipit-source-id: a2941891008f9f197a5234b50260218932d2d37d
2020-10-21 21:34:31 -07:00
db83ddcb86 small doc fix (#46599)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46599

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24426181

Pulled By: bdhirsh

fbshipit-source-id: d0900d5c43574c80f1bf614824eafd21ba6a9caf
2020-10-21 20:17:31 -07:00
adbb50ea67 Enabling alias annotation checks for all operations during autograd tests (#46601)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46601

* except excluded tests and magic methods.

https://github.com/pytorch/pytorch/issues/38731

Previously, we'd only run these tests for inplace operations. Since this covers a lot more tests, the following issues that came up when running them were fixed -
- Updated schema of conj() to reflect existing behaviour.
- Updated deepEquals method in check_alias_annotation.cpp to re-use the overloaded == operator. Previous implementation did not cover all types of IValues.
- Corrected the order inputs are passed in during autograd testing of 'view' & 'reshape'.
- Subbed out aten::ger with the func it's aliased to, aten::outer, for testing. The alias annotation checking code doesn't handle aliased operators properly.
ghstack-source-id: 114830903

Test Plan: Ran all tests in test:jit and verified they pass.

Reviewed By: eellison

Differential Revision: D24424955

fbshipit-source-id: 382d7e2585911b81b1573f21fff1d54a5e9a2054
2020-10-21 20:01:57 -07:00
33e82c0269 Update error message to include link to readme. (#46613)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46613

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D24430852

fbshipit-source-id: 811e4d10508d47ef830d2b8445f11592f342461f
2020-10-21 19:38:19 -07:00
13decddae2 [reland][quant] Add FixedQParamsFakeQuantize module (#45538) (#46657)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46657

This is used to simulate fake quantize operation for ops with fixed quantization parameters
e.g. hardsigmoid

Test Plan:
Imported from OSS

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24451406

fbshipit-source-id: 26cc140c00f12bdec9a8f9dc880f4c425f4d4074
2020-10-21 16:47:11 -07:00
746febdeac [quant][graphmode][fx] Add additional_object_mapping argument to convert (#46338)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46338

Should we merge quantized module and quantized operator configurations?

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24317435

fbshipit-source-id: 3575251fe9d80a6628b8c3243c2ed92ea5e921e3
2020-10-21 16:39:07 -07:00
8908f6ad8e [op-bench] modify import path of configs (#46679)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46679

The current way of importing configs hits a runtime error when a single benchmark is launched directly with buck (e.g. `/buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/conv_test.par`). This diff fixes that issue.
ghstack-source-id: 114857978

Test Plan: waitforsandcastle

Reviewed By: vkuzo

Differential Revision: D24459631

fbshipit-source-id: 29df17e66962a8604dbb7b8b9106713c3c19bed5
2020-10-21 16:15:11 -07:00
6011b36080 Fix type qualifiers ignored on return type warning (#46668)
Summary:
This fixes the following warning:
```
../aten/src/ATen/cpu/vec256/vec256_float_neon.h:262:3: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
  262 |   const float operator[](int idx) const {
      |   ^~~~~
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46668

Reviewed By: seemethere, janeyx99

Differential Revision: D24454206

Pulled By: malfet

fbshipit-source-id: 8ba86a6d6c144f236a76bcef7ce794def7ea131f
2020-10-21 15:49:28 -07:00
e02a3e190e DOC: Building libtorch using CMake (#44196)
Summary:
I am adding documentation for building the C++-only libtorch.so without invoking Python in the build and install process.  This works on my Ubuntu 20.04 system and is designed to be operating system agnostic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44196

Reviewed By: zou3519

Differential Revision: D24421066

Pulled By: malfet

fbshipit-source-id: e77c222703353ff7f7383fb88f7bce705f88b7bf
2020-10-21 14:29:36 -07:00
ff0e20b384 Config inheritance was added for pytorch project (#46584)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46584

The diff enables clang-tidy config inheritance for pytorch project.

Reviewed By: suo

Differential Revision: D24418191

fbshipit-source-id: 5cc0cf2d564236cedc4333af9324387d6d7a55cc
2020-10-21 14:06:35 -07:00
475b4e30e6 Allow for source code comments at any level of indentation (#46548)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46548

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D24434778

Pulled By: ansley

fbshipit-source-id: e24ed73d497381e02ef1155622641027ae34770a
2020-10-21 13:49:42 -07:00
e3b2bfa2a3 [pytorch] Early return in nn.EmbeddingBag when weight is empty (#46572)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46572

When `num_samples == 0`, the grid size becomes zero. Although CUDA just silently proceeds, `cudaGetLastError()` will complain with `Error: invalid configuration argument`, so the failure actually surfaces at some later point, which makes it really hard to debug.

Reviewed By: jianyuh

Differential Revision: D24409874

fbshipit-source-id: ca54de13b1ab48204bbad265e3f55b56b94a1a2f
2020-10-21 13:44:56 -07:00
caed29a069 fix-process-group-counter (#46563)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46561

A minimal fix to issue https://github.com/pytorch/pytorch/issues/46561. Increment the global variable `_group_count` at the same time as the others so the global state remains consistent in case of a failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46563

Reviewed By: zou3519

Differential Revision: D24422354

Pulled By: mrshenli

fbshipit-source-id: 32493cc2001d21ad366c396d16c303936959434e
2020-10-21 13:03:53 -07:00
ce04e527b4 Bump up windows cudnn version (#46436)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46436

Reviewed By: zou3519

Differential Revision: D24421785

Pulled By: ezyang

fbshipit-source-id: 5aab2ae673e9ae07344a5f3bf0dc374a91dd12b2
2020-10-21 12:30:12 -07:00
c3c249aa0b Workaround to pay attention for CUDA version (#46535)
Summary:
Added a workaround for cases where NVCC tries to compile an object for the sm_30 GPU compute capability, to avoid the error message saying that the `__ldg` intrinsic is not defined.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46535

Reviewed By: zou3519

Differential Revision: D24422445

Pulled By: ezyang

fbshipit-source-id: 66e8eb1cbe42d848cfff46d78720d72100e628f8
2020-10-21 12:00:47 -07:00
09896eda14 Fix version comparisons for Python 3.6, 3.10 and 4 (#32389)
Summary:
There's some code which uses `six.PY3`, similar to:

```python
if six.PY3:
    print("Python 3+ code")
else:
    print "Python 2 code"
```

Where:

```python
PY3 = sys.version_info[0] == 3
```

When run on Python 4, this will run the Python 2 code! Instead, use `six.PY2` and avoid `six.PY3`.

 ---

Similarly, there's some `sys.version_info[0] == 3` checks, better done as `sys.version_info[0] >= 3`.

 ---

Also, it's better to avoid comparing the `sys.version` string, as it makes assumptions that each version component is exactly one character long, which will break in Python 3.10:

```pycon
>>> sys.version
'3.8.1 (v3.8.1:1b293b6006, Dec 18 2019, 14:08:53) \n[Clang 6.0 (clang-600.0.57)]'
>>> sys.version < "3.3"
False
>>> fake_v3_10 = '3.10.1 (v3.8.1:1b293b6006, Dec 18 2019, 14:08:53) \n[Clang 6.0 (clang-600.0.57)]'
>>> fake_v3_10 < "3.3"
True
```

 ---

Finally, I think the intention here is to skip when the Python version is < 3.6:

```python
unittest.skipIf(sys.version_info[0] < 3 and sys.version_info[1] < 6, "dict not ordered")
```

However, it will really skip for Python 0.0-0.5, 1.0-1.5 and 2.0-2.5. It's best to compare to the `sys.version_info` tuple and not `sys.version_info[1]`:

```python
    unittest.skipIf(sys.version_info < (3, 6), "dict not ordered")
```

 ---

Found using https://github.com/asottile/flake8-2020:
```console
$ pip install -U flake8-2020
$ flake8 --select YTT
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/32389

Reviewed By: zou3519

Differential Revision: D24424662

Pulled By: ezyang

fbshipit-source-id: 1266c4dbcc8ae4d2e2e9b1d7357cba854562177c
2020-10-21 11:52:50 -07:00
65da50c099 Apply hip vs hipcc compilation flags correctly for building extensions (#46273)
Summary:
Fixes issues when building certain PyTorch extensions where the cpp files do NOT compile if flags such as `__HIP_NO_HALF_CONVERSIONS__` are defined.
cc jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46273

Reviewed By: zou3519

Differential Revision: D24422463

Pulled By: ezyang

fbshipit-source-id: 7a43d1f7d59c95589963532ef3bd3c68cb8262be
2020-10-21 11:40:40 -07:00
ac4ee0ef5d Fix typo in docs for interpolate (#46589)
Summary:
Removes a spurious backtick in [the docs for `torch.nn.functional.interpolate`](https://pytorch.org/docs/stable/nn.functional.html?highlight=grid_sample#torch.nn.functional.interpolate)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46589

Reviewed By: zou3519

Differential Revision: D24422550

Pulled By: ezyang

fbshipit-source-id: c1e6b7de4584b2a3f68b458801a33b3fc71c1944
2020-10-21 11:31:53 -07:00
96bc7faa50 [ONNX] Export var, var_mean and std_mean ops (#45678)
Summary:
Adding export for var, var_mean and std_mean ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45678

Reviewed By: houseroad

Differential Revision: D24398811

Pulled By: bzinodev

fbshipit-source-id: bf51422a9e035d521156c0fa6e77898aac83a380
2020-10-21 11:23:54 -07:00
6de619e4a4 Allow converting parameters of nn.Module to complex dtypes (#44788)
Summary:
This PR makes it possible to cast the parameters of nn.Module to complex dtypes.
The following code works with the proposed changes.
```python
In [1]: import torch
In [2]: lin = torch.nn.Linear(5, 1).to(torch.complex64)
In [3]: lin(torch.zeros(3, 5, dtype=torch.complex64))
Out[3]:
tensor([[-0.1739+0.j],
        [-0.1739+0.j],
        [-0.1739+0.j]], grad_fn=<AddmmBackward>)
```
Fixes https://github.com/pytorch/pytorch/issues/43477.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44788

Reviewed By: zou3519

Differential Revision: D24307225

Pulled By: anjali411

fbshipit-source-id: dacc4f5c8c9a99303f74d1f5d807cd657b3b69b5
2020-10-21 08:54:59 -07:00
611f028168 Add Batch-Updating Parameter Server Example to CI Tests (#46510)
Summary:
Resolves one item in https://github.com/pytorch/pytorch/issues/46321

This PR sets up DistExamplesTest, which will be used as the class to implement future tests for examples. This class is run as part of CI tests. It also creates a dist_examples folder and includes the [batch parameter server example](https://github.com/pytorch/examples/blob/master/distributed/rpc/batch/parameter_server.py), which is slightly modified so that it can be tested.

Run test:
pytest test/distributed/rpc/test_tensorpipe_agent.py -k test_batch_updating_parameter_server -vs
pytest test/distributed/rpc/test_process_group_agent.py -k test_batch_updating_parameter_server -vs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46510

Reviewed By: mrshenli

Differential Revision: D24379296

Pulled By: H-Huang

fbshipit-source-id: 1c102041e338b022b7a659a51894422addc0e06f
2020-10-21 08:46:46 -07:00
cf3d7a2660 first cut of adding a dangling impl test. fix #45165 (#46484)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46484

Test Plan: Imported from OSS

Reviewed By: ezyang, izdeby

Differential Revision: D24392625

Pulled By: bdhirsh

fbshipit-source-id: a6ab9c53e3e580e5713e08b20682ee6f8ed3bd84
2020-10-21 08:39:40 -07:00
62e714c9d9 Delete CUDAUnaryOps.cpp (#46280)
Summary:
This file is no longer used

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46280

Reviewed By: ezyang

Differential Revision: D24392749

Pulled By: heitorschueroff

fbshipit-source-id: 677e1ba8664e3c53448a962f8a5d05e806961c2d
2020-10-21 08:31:34 -07:00
cebe87fe3a Revert D24379422: [py][vulkan] Add is_vulkan to py api, add vulkan to device type parsing
Test Plan: revert-hammer

Differential Revision:
D24379422 (e8fbe54cf5)

Original commit changeset: afab89bb9e17

fbshipit-source-id: 743c77e453239f10c155c67490cba5a42ab42f58
2020-10-21 08:23:05 -07:00
8328630315 avoid inlining kernel lambdas on mobile (#46249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46249

This saves 15kb of binary size on iOS and increases binary size on Android x86 by 30kb. It also reduces size a bit for Android ARM. I've talked to Martin and we should land this, since Android binary size is much less important because of Voltron.

ghstack-source-id: 114177627

Test Plan: bsb

Reviewed By: ezyang

Differential Revision: D23057150

fbshipit-source-id: 43bd62901b81daf08ed96de561d711357689178f
2020-10-21 03:27:21 -07:00
8357e2edc3 Back out "Revert D24269034: [fx] Refactor Tracer so that find_module and root args creation could be overridden by implementations" (#46573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46573

Original commit changeset: 7dd709b585f8
ghstack-source-id: 114730143

Test Plan: Verified on circleci that previously broken test is fixed.

Reviewed By: zdevito

Differential Revision: D24413096

fbshipit-source-id: 439568c631c4556b8ed6af20fcaa4b1375e554cf
2020-10-20 22:17:36 -07:00
e8fbe54cf5 [py][vulkan] Add is_vulkan to py api, add vulkan to device type parsing (#46511)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46511

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D24379422

Pulled By: IvanKobzarev

fbshipit-source-id: afab89bb9e17c50934083598262bbe14ea82e893
2020-10-20 20:04:24 -07:00
a651b876a7 preserve non-dense or overlapping tensor's layout in *_like functions (#46046)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46046

*_like functions are used in PyTorch to create a new tensor with the same shape as the input tensor, but we don't always preserve the layout permutation of the tensor. The current behavior is that, for a dense and non-overlapping tensor, its layout permutation is preserved. For example, passing a channels-last contiguous tensor t with shape/stride (2, 4, 3, 2)/(24, 1, 8, 4) to empty_like(t) will create a new tensor with exactly the same shape/stride as the input tensor t. However, if the input tensor is non-dense or overlapping, we simply create a contiguous tensor based on the input tensor's shape, so the tensor layout permutation is lost.

This PR preserves the layout permutation for non-dense or overlapping tensors. The strides propagation rule used in this PR is exactly the same as the one used in TensorIterator. The behavior changes are listed below:

| code                                                                                                                                                                                           | old                                                   | new                                                  |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------|------------------------------------------------------|
| #strided tensors<br>a=torch.randn(2,3,8)[:,:,::2].permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride()) | (2, 24, 8) <br>(6, 3, 1) <br>(1, 12, 4) <br>(6, 3, 1) | (2, 24, 8)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) |
| #memory dense tensors<br>a=torch.randn(3,1,1).as_strided((3,1,1), (1,3,3))<br>print(a.stride(), (a+torch.randn(1)).stride())<br>a=torch.randn(2,3,4).permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride())                                                                                                                                                                                               |  (1, 3, 3) (1, 1, 1)<br>(1, 12, 4)<br>(6, 3, 1)<br>(1, 12, 4)<br>(6, 3, 1)                                                       |  (1, 3, 3) (1, 3, 3)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) |

This is to solve the non-dense tensor layout problem in #45505

TODO:
- [x] Fix all the BC broken test cases in pytorch
- [ ] Investigate if any fb internal tests are broken

This change will cover all kinds of non-dense tensors.
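
A minimal sketch of the preserved-stride behavior, reproducing the channels-last case and the first row of the table above (the non-dense output strides assume this change is in place):

```python
import torch

# Dense channels-last tensor: the stride permutation was already preserved
t = torch.randn(2, 4, 3, 2).to(memory_format=torch.channels_last)
print(t.stride())                    # (24, 1, 8, 4)
print(torch.empty_like(t).stride())  # (24, 1, 8, 4)

# Non-dense (strided) tensor: with this change its permutation is
# preserved too, instead of falling back to a contiguous layout
a = torch.randn(2, 3, 8)[:, :, ::2].permute(2, 0, 1)
print(a.stride())        # (2, 24, 8)
print(a.exp().stride())  # (1, 12, 4)
```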

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24288970

Pulled By: glaringlee

fbshipit-source-id: 320fd4e0d1a810a12abfb1441472298c983a368d
2020-10-20 19:49:49 -07:00
2181449068 Revert D24004795: [quant] Add FixedQParamsFakeQuantize module
Test Plan: revert-hammer

Differential Revision:
D24004795 (253918ec55)

Original commit changeset: fc4797f80842

fbshipit-source-id: 663169e90a2f58e5a89e4d382291ae41c24d0fee
2020-10-20 19:40:21 -07:00
f47231bf0e [caffe2][dnnlowp] Remove openmp usage in quantize dnnlowp op
Summary: It creates CPU overload issues when OpenMP is enabled and OMP_NUM_THREADS=1 is not set.

Test Plan: buck test //caffe2/caffe2/quantization/server:quantize_dnnlowp_op_test

Reviewed By: jspark1105

Differential Revision: D24437305

fbshipit-source-id: 426209fc33ce0d4680c478f584716837ee62cb5e
2020-10-20 19:33:56 -07:00
6cd8b5e9a7 Provide CMake option to enable Vulkan API. (#46503)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46503

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D24379144

Pulled By: AshkanAliabadi

fbshipit-source-id: 8d8c57f96bbac2a44615828a3474c912704f3a85
2020-10-20 18:45:52 -07:00
3e041b503f Add Vulkan job dispatch and flush. (#46008)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46008

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D24291507

Pulled By: AshkanAliabadi

fbshipit-source-id: a3d02e76708a38e49398bb71e31bb2ad676d01af
2020-10-20 18:41:29 -07:00
cb3c1d17e4 Promote -Wcast-function-type to an error in builds. (#46356)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46356

Adding the flag `-Werror=cast-function-type` to ensure we don't allow
any invalid casts (ex: PyCFunction casts).

For more details see: https://github.com/pytorch/pytorch/issues/45419
ghstack-source-id: 114632980

Test Plan: waitforbuildbot

Reviewed By: albanD

Differential Revision: D24319759

fbshipit-source-id: 26ce4650c220e8e9dd3550245f214c7e6c21a5dc
2020-10-20 18:09:06 -07:00
42a70dc5a8 Implement all communication APIs in DistributedC10d new frontend (#46053)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46053

Reviewed By: wanchaol

Differential Revision: D24300487

Pulled By: gmagogsfm

fbshipit-source-id: 0d0b01c4f9d9e1d59dd17d7606ce47d54d61951d
2020-10-20 17:52:07 -07:00
253918ec55 [quant] Add FixedQParamsFakeQuantize module (#45538)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45538

This is used to simulate fake quantize operation for ops with fixed quantization parameters
e.g. hardsigmoid

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24004795

fbshipit-source-id: fc4797f80842daacd3b3584c5b72035774634edd
2020-10-20 17:43:25 -07:00
f83cf2dab3 [JIT] adding torch.jit.isinstance support (#46062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46062

Adds support for torch.jit.isinstance in both eager and script mode

Example use:

```
import torch
from typing import Any, List

class TestModule(torch.nn.Module):
    def __init__(self):
        super(TestModule, self).__init__()

    def call(self, input1: str, input2: str) -> str:
        return input1

    def forward(self, input: Any) -> None:
        if torch.jit.isinstance(input, List[str]):
            for el in input:
                print(el)

TestModule().forward(["1","2"])
scripted_module = torch.jit.script(TestModule())
scripted_module(["1", "2"])
```

Test Plan: Imported from OSS

Reviewed By: bertmaher, zou3519

Differential Revision: D24264415

Pulled By: Lilyjjo

fbshipit-source-id: 039c95bddd854c414027ac8332832e6bc830b5b9
2020-10-20 16:47:49 -07:00
fdc5261a20 Support %-based string formatting (#45976)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45976
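
A minimal sketch of what this enables, assuming `%`-based formatting is now accepted by the TorchScript compiler as the title says:

```python
import torch

@torch.jit.script
def greet(name: str) -> str:
    # %-based string formatting now compiles in TorchScript
    return "hello, %s!" % name

print(greet("world"))  # hello, world!
```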

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D24374215

Pulled By: ansley

fbshipit-source-id: 2005fe7f09dc8d3c44c4bfdccab6b4dc46a5e517
2020-10-20 16:13:36 -07:00
f9446cb15a [quant][refactor] Remove register api and rename get_*_mapping to get_default_*_mapping (#46337)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46337

We plan to pass around the mappings instead of using a global registration API, to keep
the mappings local to the transformations the user is performing.

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24317436

fbshipit-source-id: 81569b88f05eeeaa9595447e482a12827aeb961f
2020-10-20 15:53:47 -07:00
4f5b55f722 Revert D24395956: [pytorch][PR] Replace flatten tensors with flatten loops.
Test Plan: revert-hammer

Differential Revision:
D24395956 (2f51ddb81f)

Original commit changeset: f3792903f206

fbshipit-source-id: ef70713f0f67f577b09674219631d22440ceec31
2020-10-20 15:42:23 -07:00
2b221a9599 Remove PyCFunction casts as much as possible. (#46227)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46227

Follow up from https://github.com/pytorch/pytorch/issues/45419, in
this PR I've removed as many PyCFunction casts as I could from the codebase.

The only ones I didn't remove were the ones with `METH_VARARGS | METH_KEYWORDS`
which have 3 parameters instead of 2 and had to be cast. Example: `
{"copy_", (PyCFunction)(void(*)(void))THPStorage_(copy_), METH_VARARGS |
METH_KEYWORDS, nullptr},`
ghstack-source-id: 114632704

Test Plan: waitforbuildbot

Reviewed By: albanD

Differential Revision: D24269435

fbshipit-source-id: 025cfd43a9a2a3e59f6b2951c1a78749193d77cf
2020-10-20 15:01:51 -07:00
1a3ea46dbf [StaticRuntime] Threading model (#46219)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46219

- Refactor StaticRuntime and group common data structures, the jit graph, and the script module into a separate struct `InferenceModule`:
```
struct InferenceModule {
  explicit InferenceModule(const torch::jit::Module& m);
  explicit InferenceModule(std::shared_ptr<torch::jit::Graph> g);
  torch::jit::Module module;
  std::shared_ptr<torch::jit::Graph> graph;
  std::unique_ptr<c10::FunctionSchema> schema;

  std::unordered_map<Value*, size_t> value_to_reg;
  std::vector<size_t> input_regs; // inputs to the graph
  std::vector<size_t> output_regs; // outputs of the graph
  std::vector<size_t> internals;
};
```
which is stored in the PyTorchPredictor, as well as the static runtime, and shared across threads. Then this is what's left inside the Static Runtime:
```
  mutable std::vector<IValue> reg_;
  // The nodes we need to run
  std::vector<ProcessedNode> nodes_;
```
`reg_` holds all the weights and activations, which differ across threads at runtime. `nodes_` holds the op nodes and input/output registers, and is the same across threads for now. We could potentially put other stateful data structures in it, so I kept it inside the static runtime. It could be easily moved into the `InferenceModule` if we decide not to put anything else into `ProcessedNode`.

- Added StaticRuntimeOptions so we can toggle certain optimizations on/off, for testing and benchmarking. `cleanup_activations` is an example.

- Integration with PyTorchPredictor. Added a lockfree stack in the PyTorchPredictor to hold all the static runtime instances. Benchmark shows that the `push` and `pop` combo takes about 80 ns, which is quite acceptable.

This diff focuses on threading model only. Benchmarks will be separate.

Reviewed By: bwasti

Differential Revision: D24237078

fbshipit-source-id: fd0d6347f02b4526ac17dec1f731db48424bade1
2020-10-20 14:37:30 -07:00
e18a8aba95 Add CUDA 11.1 docker build (#46283)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46283

Reviewed By: ezyang

Differential Revision: D24346026

Pulled By: malfet

fbshipit-source-id: f69558f35527833b867a7352c78b4e8ebc370db3
2020-10-20 13:35:31 -07:00
187e23397c Remove non-existent trusty image references (#46594)
Summary:
Simplifies some parts of build.sh and removes old references in the code to non-existent trusty images.

There are other parts of the code where trusty is referenced for travis (most of them in third party directories) and I did not touch those. https://github.com/pytorch/pytorch/search?q=trusty

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46594

Reviewed By: seemethere

Differential Revision: D24426796

Pulled By: janeyx99

fbshipit-source-id: 428c52893d2d35c1ddd1fd2e65a4b6575f260492
2020-10-20 12:54:45 -07:00
2f51ddb81f Replace flatten tensors with flatten loops. (#46539)
Summary:
This diff changes `TensorExprKernel::generateStmt` to use flatten loops instead of flatten tensors.

Checked all tests on CPU as well as CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46539

Reviewed By: nickgg

Differential Revision: D24395956

Pulled By: navahgar

fbshipit-source-id: f3792903f2069bda37b571c9f0a840e6fb02f189
2020-10-20 12:16:18 -07:00
9c02e2112e Automated submodule update: FBGEMM (#46578)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 23cb1db72b

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46578

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: YazhiGao

Differential Revision: D24415308

fbshipit-source-id: c353dcf86cfd833a571a509930a17d09277a73e4
2020-10-20 11:43:01 -07:00
e6ed887908 Add view test for tensor_split (#46427)
Summary:
Fulfills Mike's suggestion here: https://github.com/pytorch/pytorch/pull/44868#discussion_r505095018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46427

Reviewed By: ezyang

Differential Revision: D24355107

Pulled By: mruberry

fbshipit-source-id: bddef2f9c2c41b5c5ac47a17d5ecdda580072e99
2020-10-20 09:56:37 -07:00
5003fd189c Add an option to getWriteableTensorData to avoid copy CUDA tensor to CPU (#46524)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46524

Test Plan: Imported from OSS

Reviewed By: wanchaol

Differential Revision: D24392794

Pulled By: mrshenli

fbshipit-source-id: 21bf81dfc6c1d81689f8278d81f4c8776bc76ec1
2020-10-20 08:54:58 -07:00
5e0bfd7455 [Build] [CMake] [ROCm] find hsa-runtime64 properly (#45550)
Summary:
Properly Fixes https://github.com/pytorch/pytorch/issues/44384
similar in vein to https://github.com/pytorch/pytorch/issues/42064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45550

Reviewed By: ezyang

Differential Revision: D24412674

Pulled By: malfet

fbshipit-source-id: f3d056c7069cb9d8a7d4174b604b9e3fbb14180b
2020-10-20 08:38:32 -07:00
35a35c3498 Move Open MPI installation to Ubuntu CUDA Docker images (#46569)
Summary:
Instead of installing Open MPI for build and test jobs with environment *-xenial-cuda*, install Open MPI into the relevant Docker images. This would save time and remove duplication in our scripts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46569

Reviewed By: walterddr

Differential Revision: D24409534

Pulled By: janeyx99

fbshipit-source-id: 6152f2f5daf63744d907dd234bc12d2a5ec58f3d
2020-10-20 08:31:35 -07:00
0d4590c279 renaming env var IN_CIRCLECI to a broader name of IN_CI (#46567)
Summary:
The `IN_CIRCLECI` variable is a misnomer since the flag really indicates when we enable XML reporting because we want to run the test in CI. Since this doesn't necessarily mean CircleCI in particular, IN_CI is more accurate and general.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46567

Reviewed By: walterddr

Differential Revision: D24407642

Pulled By: janeyx99

fbshipit-source-id: 5e141a0571b914310a174a58ac0fde58e9521c6b
2020-10-20 08:25:39 -07:00
1c8d0d8cc9 Allow vmap to accept nested python data structures as inputs (#46289)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46289

Previously, vmap had the restriction that any Tensors in the inputs must
not be a part of a nested python collection. This PR relaxes that
restriction. We can also do the same thing for vmap outputs, but I'll
leave that for future work.

The mechanism behind vmap is to convert any Tensor inputs (that have
been specified via in_dims) into BatchedTensor. Using a pytree
implementation, that logic becomes:
- flatten inputs
- broadcast in_dims to inputs and unflatten it
- use the flat inputs and flat in_dims to construct BatchedTensors
- unflatten the BatchedTensors into the same structure as the original
inputs.
- Send the unflattened BatchedTensors into the desired function.
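
A minimal sketch of the new capability (assuming the prototype `torch.vmap` API of this period):

```python
import torch

def fn(pair, z):
    x, y = pair
    return x + y + z

x, y, z = torch.randn(5, 3), torch.randn(5, 3), torch.randn(5, 3)
# in_dims=0 is broadcast over the nested input structure ((0, 0), 0),
# so every tensor leaf is mapped over dimension 0
out = torch.vmap(fn, in_dims=0)((x, y), z)
print(out.shape)  # torch.Size([5, 3])
```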

Performance
-----------
Some benchmarking using
```
import torch
def foo(a, b, c, d):
    return a, b, c, d

x = torch.randn(2, 3)
foo_vmap = torch.vmap(foo)
%timeit foo_vmap(x, x, x, x)
```
shows a slowdown from 15us to 25us on my machine. The 10us overhead is
not a lot, especially since our vmap implementation is a "prototype". We
can work around the performance in the future by either moving part of
the pytree implementation into C++ or depending on a library that has a
performant pytree implementation.

Test Plan
---------
- New tests, also updated old tests.

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D24392892

Pulled By: zou3519

fbshipit-source-id: 072b21dcc6065ab43cfd341e84a01a5cc8ec3daf
2020-10-20 07:52:17 -07:00
6025f8148a Implement _broadcast_to_and_flatten(pytree, spec) (#46288)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46288

This "broadcasts" `pytree` to have the same structure as `spec`
and then flattens it.
I find it hard to describe what that does in words, so here's an example:

- Broadcasting 1 to have the same structure as [0, [0, 0]] would
return [1, [1, 1]]. Further flattening it gives us [1, 1, 1].
- Broadcasting [1, 2] to have the same structure as [0, [0, 0]] would
return [1, [2, 2]]. Further flattening it gives us [1, 2, 2].
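
In code, the two examples above look roughly like this (the helper is private; the `torch.utils._pytree` module path is an assumption based on this stack):

```python
from torch.utils._pytree import tree_flatten, _broadcast_to_and_flatten

# Build the spec of the target structure [0, [0, 0]]
_, spec = tree_flatten([0, [0, 0]])
print(_broadcast_to_and_flatten(1, spec))       # [1, 1, 1]
print(_broadcast_to_and_flatten([1, 2], spec))  # [1, 2, 2]
```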

What is this used for?
----------------------
The next PR up in the stack uses this helper function to allow vmap to
accept nested data structures. `vmap(fn, in_dims)(*inputs)` allows the
user to specify in_dims with a tree structure that is a sub-graph of
that of `inputs` (where both contain the root of the tree).

For example, one can do `vmap(fn, in_dims=0)(x, y, z)`. `in_dims` is 0
and inputs is (x, y, z). We would like to broadcast in_dims up to the
structure of inputs to get (0, 0, 0).

Another example, is `vmap(fn, in_dims=(0, 1))(x, [y, z])`. `in_dims` is
(0, 1) and inputs is (x, [y, z]). We would like to broadcast in_dims up
to the structure of inputs to get (0, [1, 1]); this value of in_dims is
used to say "let's vmap over dim 0 for x and dim 1 for y and z".

Test Plan
---------
New tests.

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D24392891

Pulled By: zou3519

fbshipit-source-id: 6f494d8b6359582f1b4ab6b8dd6a956d8bfe8ed4
2020-10-20 07:52:14 -07:00
0285618a11 Add utilities to support handling of nested python data structures (#46287)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46287

This adds a lightweight `pytree` implementation that is similar to and
inspired by JAX pytrees, tensorflow.nest, deepmind/tree,
TorchBeast's TensorNest, etc.

A *pytree* is Python nested data structure. It is a tree in the sense
that nodes are Python collections (e.g., list, tuple, dict) and the leaves
are Python values. Furthermore, a pytree should not contain reference
cycles.

This PR:
- adds support for flattening and unflattening nested Python list/dict/tuples
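
A small sketch of the flatten/unflatten round trip (the `torch.utils._pytree` module path is an assumption based on this stack):

```python
from torch.utils._pytree import tree_flatten, tree_unflatten

leaves, spec = tree_flatten({"a": 1, "b": [2, 3]})
print(leaves)  # [1, 2, 3]

# Map over the leaves, then rebuild the original structure from the spec
print(tree_unflatten([x * 10 for x in leaves], spec))  # {'a': 10, 'b': [20, 30]}
```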

Context: nested Tensor inputs for vmap
--------------------------------------
Right now, vmap is restricted to taking in flat lists of tensors. This
is because vmap needs to be able to convert every tensor in the input
that is being vmapped over into a BatchedTensor.

With a pytree library, we can simply flatten the input data structure
(returning the leaves), map all of the Tensors in the flat input to
BatchedTensors, and unflatten the flat list of BatchedTensors into a new
input. Or equivalently, with a `tree_map` function, we can map a nested
python data structure containing Tensors into one containing
BatchedTensors.

Future work
-----------
In some future PRs, we'll add nested input support for vmap. The
prerequisites for that are:
- a `broadcast_to(small, big)` that broadcasts `small` up to `big`.
  This is for handling the in_dims to vmap: the in_dims structure must
  be compatible with the structure of the inputs.

Test Plan
---------
- New tests in test/test_pytree.py

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D24392890

Pulled By: zou3519

fbshipit-source-id: 7daf7430c5a38354e7d203a72882bd7a9b24cfb1
2020-10-20 07:45:45 -07:00
75322dbeb4 [PyTorch] [BUCK] Replace pt_deps.bzl with a YAML operator dependency file which is generated by the code analyser (#46057)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46057

The code analyser (that uses LLVM and runs in the OSS PyTorch git repo) already produces a YAML file which contains base operator names and the operators that they depend on. Currently, this operator dependency graph is converted into a python dictionary to be imported in BUCK and used there. However, it is mostly fed into other executables by serializing it to JSON, and the consumers piece this JSON back together by concatenating each argument. This seems unnecessary. Instead, this diff retains the original YAML file and makes all consumers consume that same YAML file.
ghstack-source-id: 114641582

Test Plan: Build Lite Predictor + sandcastle.

Reviewed By: iseeyuan

Differential Revision: D24186303

fbshipit-source-id: eecf41bf673d90b960c3efe7a1271249f0a4867f
2020-10-20 02:00:36 -07:00
e5ed037529 [StaticRuntime] Add a 'speed of light' benchmark. (#46308)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46308

This PR adds a hand-optimized version of the DeepAndWide model with the goal
of estimating the overheads of the static runtime. While the static runtime is
currently much faster than the existing JIT interpreter, it would be
useful to understand how close we are to an absolutely 0-overhead
system. Currently, this "ideal" implementation is 2x faster than the
static runtime on batchsize=1.

Full benchmark results:
```
Running build/bin/static_runtime_bench
Run on (24 X 2394.71 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 4096K (x24)
  L3 Unified 16384K (x24)
------------------------------------------------------------------------------
Benchmark                                       Time           CPU Iterations
------------------------------------------------------------------------------
BM_deep_wide_base/1                         59518 ns      59500 ns      10909
BM_deep_wide_base/8                         74635 ns      74632 ns       9317
BM_deep_wide_base/20                        82186 ns      82147 ns       9119
BM_deep_wide_fast/1                         13851 ns      13851 ns      49825 << new
BM_deep_wide_fast/8                         22497 ns      22497 ns      32089 << new
BM_deep_wide_fast/20                        23868 ns      23841 ns      31184 << new
BM_deep_wide_jit_graph_executor/1           62786 ns      62786 ns      10835
BM_deep_wide_jit_graph_executor/8           76730 ns      76718 ns       7529
BM_deep_wide_jit_graph_executor/20          78886 ns      78883 ns       8769
BM_deep_wide_jit_profiling_executor/1       69504 ns      69490 ns      10309
BM_deep_wide_jit_profiling_executor/8       75718 ns      75715 ns       9199
BM_deep_wide_jit_profiling_executor/20      75364 ns      75364 ns       9010
BM_deep_wide_static/1                       40324 ns      40318 ns      17232
BM_deep_wide_static/8                       50327 ns      50319 ns      13335
BM_deep_wide_static/20                      53075 ns      53071 ns      12855
BM_deep_wide_static_threaded/threads:8       6258 ns      49873 ns      14008
```

PS: The implementation could probably be optimized even more.

Differential Revision: D24300702

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Pulled By: ZolotukhinM

fbshipit-source-id: 7870bdef127c39d11bcaa4f03a60eb80a46be58e
2020-10-19 23:35:55 -07:00
17f8c329df [NNC] IRSimplifier rules for Compare and Mod (#46412)
Summary:
Adds new rules to the NNC IRSimplifier to take care of the following cases:

* Comparisons which are symbolic but have a constant difference. E.g. this is most useful in cases like `if (x > x + 4) ...` which we can now eliminate.

* Simplification of `Mod` nodes, including simple rules such as `0 % x` and `x % 1`, but also factorization of both sides to find common symbolic multiples. E.g. `(x * y) % x` can be cancelled out to `0`.

See tests for many more examples!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46412

Reviewed By: navahgar

Differential Revision: D24396151

Pulled By: nickgg

fbshipit-source-id: abb954dc930867d62010dcbcd8a4701430733715
2020-10-19 19:37:09 -07:00
a06b95b2ba [quant][graphmode][fx] Support non_traceable_module/module_class (#46298)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46298

Allow the user to specify a list of qualified names for non-traceable submodules,
or the types of the non-traceable submodules.
See quantize_fx.py for the API

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24294210

fbshipit-source-id: eb1e309065e3dfbf31e63507aaed73587f0dae29
2020-10-19 18:50:08 -07:00
5b0f400488 Replace list(map(...)) constructs by list comprehensions (#46461)
Summary:
As discussed in https://github.com/pytorch/pytorch/issues/46392 this makes the code more readable and possibly more performant.
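
For illustration, the kind of rewrite applied throughout:

```python
# Before: an extra lambda and an intermediate map object to read through
squares = list(map(lambda x: x * x, range(10)))

# After: the equivalent list comprehension
squares = [x * x for x in range(10)]
```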

It also fixes a bug, detected in the process, where the argument order of `map` was confused: 030a24906e (diff-5bb26bd3a23ee3bb540aeadcc0385df2a4e48de39f87ed9ea76b21990738fe98L1537-R1537)

Fixes https://github.com/pytorch/pytorch/issues/46392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46461

Reviewed By: ailzhang

Differential Revision: D24367015

Pulled By: ezyang

fbshipit-source-id: d55a67933cc22346b00544c9671f09982ad920e7
2020-10-19 18:42:49 -07:00
3d421b3137 [pytorch] rewrite of the python binding codegen with the v2 API (#46244)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46244

- What does the generated binding code do?

The Python binding codegen produces code that takes the input list of
PyObjects, finds the matching ATen C++ function using PythonArgParser,
converts the PyObjects into C++ types and calls the ATen C++ function:

```
+--------+  parsing   +------------------------+  binding   +-----------------------+
| PyObjs | ---------> | PythonArgParser Output | ---------> | Cpp Function Dispatch |
+--------+            +------------------------+            +-----------------------+
```

- Are Python arguments 1-1 mapped to C++ arguments?

Python arguments might be reordered, packed, unpacked when binding to
C++ arguments, as illustrated below:

```
// Binding - Reorder & Packing
// aten::empty.names(int[] size, *, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None,
                     Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor

            Python Args               Cpp Args
-----------------------------------------------------------
         0: size                      size
         1: names                     names
         2: memory_format -------+
         3: dtype         -----+-|--> options
         4: layout            /  |
         5: device           /   +--> memory_format
         6: pin_memory      /
         7: requires_grad -+

// Binding - Unpacking
// aten::max.names_dim(Tensor self, Dimname dim, bool keepdim=False) -> (Tensor values, Tensor indices)

            Python Args               Cpp Args
-----------------------------------------------------------
                               +----> max
                              /-----> max_values
         0: input            /        self
         1: dim             /         dim
         2: keepdim        /          keepdim
         3: out      -----+
```

- Why do we want to rewrite the python binding codegen?

The old codegen takes Declarations.yaml as input. It doesn't distinguish
between Python arguments and C++ arguments - they are all mixed together
as a bag of non-typed dict objects. Different methods process these arg
objects and add new attributes for various different purposes. It's not
obvious how to figure out the semantics of these attributes. The complicated
binding logic happens implicitly, scattered across the code.

```
+--------------------+
|  Native Functions  |
+--------------------+
  |
  |
  v
+--------------------+
|   Cpp Signatures   |
+--------------------+
  |
  |
  v
+--------------------+
| Declarations.yaml  |
+--------------------+
  |                        +-------------------------------------+
  |              +-------> |       PythonArgParser Schema        |
  |              |         +-------------------------------------+
  |              |                            .
  |              |                            .
  v              |                            .
+--------------------+     +-------------------------------------+
| NonTyped Args Objs | --> | PythonArgParser -> Cpp Args Binding |
+--------------------+     +-------------------------------------+
                 |                            .
                 |                            .
                 |                            .
                 |         +-------------------------------------+
                 +-------> |        Cpp Function Dispatch        |
                           +-------------------------------------+
```

This PR leverages the new immutable data models introduced in the new
aten codegen. It introduces dedicated data models for python schema.
This way, we can not only avoid subtle Declaration.yaml conversions but
also decouple the generation of python schema, python to c++ binding and
c++ function call.

The ultimate state will be like the following diagram:

```
            +-------------------+     +-------------------------------------+
  +-------> | Python Signatures | --> |       PythonArgParser Schema        |
  |         +-------------------+     +-------------------------------------+
  |                         |                            .
  |                         |                            .
  |                         |                            .
+------------------+        |         +-------------------------------------+
| Native Functions |        +-------> | PythonArgParser -> Cpp Args Binding |
+------------------+        |         +-------------------------------------+
  |                         |                            .
  |                         |                            .
  |                         |                            .
  |         +-------------------+     +-------------------------------------+
  +-------> |  Cpp Signatures   | --> |        Cpp Function Dispatch        |
            +-------------------+     +-------------------------------------+
```

This PR has migrated the core binding logic from
tools/autograd/gen_python_functions.py to tools/codegen/api/python.py.

It produces the byte-for-byte same results (tested with #46243).

Will migrate the rest of gen_python_functions.py in subsequent PRs.

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D24388874

Pulled By: ljk53

fbshipit-source-id: f88b6df4e917cf90d868a2bbae2d5ffb680d1841
2020-10-19 17:36:45 -07:00
8f12c0e786 Revert D24269034: [fx] Refactor Tracer so that find_module and root args creation could be overridden by implementations
Test Plan: revert-hammer

Differential Revision:
D24269034 (7b2e8bec85)

Original commit changeset: d7b67f2349dd

fbshipit-source-id: 7dd709b585f82d52d9b9973508137e36d5b5871e
2020-10-19 17:29:18 -07:00
cda88e8e4b Fix interval midpoint calculation in register_op_utils
Summary: Interval midpoint calculations can overflow (integers). This fixes such an instance.
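
The actual fix is in C++; below is a sketch of the classic overflow pattern, using numpy's fixed-width integers to emulate C integer arithmetic:

```python
import numpy as np

lo, hi = np.int32(2_000_000_000), np.int32(2_100_000_000)
mid_bad = (lo + hi) // 2        # lo + hi wraps around in 32-bit arithmetic
mid_good = lo + (hi - lo) // 2  # the difference always fits, so no overflow
print(mid_bad, mid_good)        # a negative value vs. 2050000000
```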

Test Plan: Standard test rigs.

Reviewed By: iseeyuan

Differential Revision: D24392608

fbshipit-source-id: 0face1133d99cea342abbf8884b14262d50b0826
2020-10-19 16:11:22 -07:00
ac146c4820 [nvFuser] Switching to CudaFusionGuard from BailOut for nvfuser - update 2 (#46452)
Summary:
1. Added CudaFusionGuard as the custom TypeCheck for nvfuser; enabled dynamic shape support with profiling executor;
2. dropped support for legacy fuser;
3. re-enabled nvfuser tests;
4. added registration for profiling record to allow profiling on user specified nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46452

Reviewed By: zou3519, anjali411

Differential Revision: D24364642

Pulled By: ngimel

fbshipit-source-id: daf53a9a6b6636e1ede420a3a6d0397d4a8b450b
2020-10-19 15:44:31 -07:00
30d687522d [reland][quant][eagermode] Move custom_module registration to prepare/convert_custom_config_dict (#46293) (#46364)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46364

Test Plan:
Imported from OSS

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24322747

fbshipit-source-id: 4801ba1835fc805bf767fe9810b9edfa2ceeefb4
2020-10-19 15:21:00 -07:00
f0f10f82f4 Automated submodule update: FBGEMM (#46443)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: f20d61e119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46443

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: dskhudia

Differential Revision: D24355446

fbshipit-source-id: dc2900367d5de37e67efa963cb2c417e29fe7a88
2020-10-19 14:23:50 -07:00
7b2e8bec85 [fx] Refactor Tracer so that find_module and root args creation could be overridden by implementations (#46493)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46493

This will allow us to override the two following methods of Tracer:
-- get_module_qualified_name: finds the qualified name of a module. In the default implementation, it looks for the module in the registered modules and from there gets to the name, but in some scenarios the module being called might not be the exact same module that was registered.
-- create_args_for_root: allows creating and passing custom structured input (like a dictionary with specific keys) to the main module, rather than pure proxy objects. This also allows us to let proxy objects represent only tensors when they are passed to modules.
ghstack-source-id: 114609258

Test Plan: Unit tests passed

Reviewed By: zdevito, bradleyhd

Differential Revision: D24269034

fbshipit-source-id: d7b67f2349dd516b6f7678e41601d6899403d9de
2020-10-19 13:55:31 -07:00
6dc763df30 PyTorch: add API usage logging to numeric suite (#46504)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46504

As titled, so we can start seeing who is using this.

Test Plan: CI

Reviewed By: hx89

Differential Revision: D24375254

fbshipit-source-id: ff7b5560d0a6a175cecbf546eefc910759296dbb
2020-10-19 13:17:02 -07:00
d38a71d579 torch.nn.modules.LazyModuleMixin and torch.nn.LazyLinear (Shape Inference II) (#44538)
Summary:
Retake on https://github.com/pytorch/pytorch/issues/40493 after all the feedback from albanD

This PR implements the generic Lazy mechanism and a sample `LazyLinear` layer with the `UninitializedParameter`.

There are two main differences from the previous PR:
Now `torch.nn.Module` remains untouched.
We don't require an explicit initialization or a dummy forward pass before starting training or inference of the actual module, making this much simpler to use from the user side.

As we discussed offline, there was the suggestion of not using a mixin, but changing the `__class__` attribute of `LazyLinear` to become `Linear` once it's completely initialized. While this can be useful, for the time being we need `LazyLinear` to be a `torch.nn.Module` subclass, since there are many checks that rely on the modules being instances of `torch.nn.Module`.
This can cause problems when we create complex modules such as
```
class MyNetwork(torch.nn.Module):
    def __init__(self):
        super(MyNetwork, self).__init__()
        self.conv = torch.nn.Conv2d(20, 4, 2)
        self.linear = torch.nn.LazyLinear(10)
    def forward(self, x):
        y = self.conv(x).clamp(min=0)
        return self.linear(y)
```
Here, when the __setattr__ function is called at the time LazyLinear is registered, it won't be added to the child modules of `MyNetwork`, so we have to do it manually later, but currently there is no way to do such a thing, as we can't access the parent module from LazyLinear once it becomes the Linear module. (We can add a workaround to this if needed.)
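
A minimal usage sketch of the proposed module (assuming the API shape described above, where the first forward pass infers the input features):

```python
import torch

lazy = torch.nn.LazyLinear(out_features=10)  # in_features is not known yet
out = lazy(torch.randn(4, 25))               # first forward materializes the parameters
print(lazy.weight.shape)                     # torch.Size([10, 25])
```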

TODO:

Add convolutions once the design is OK
Fix docstrings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44538

Reviewed By: ngimel

Differential Revision: D24162854

Pulled By: albanD

fbshipit-source-id: 6d58dfe5d43bfb05b6ee506e266db3cf4b885f0c
2020-10-19 13:13:54 -07:00
7f8b02f5b7 [ONNX] Add test for Batchnorm (#45633)
Summary:
Add test for Batchnorm in training mode

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45633

Reviewed By: VitalyFedyunin

Differential Revision: D24117026

Pulled By: bzinodev

fbshipit-source-id: 2728d8732e856390a2a00c3e8425b9c312c00650
2020-10-19 13:07:40 -07:00
172ed51a17 Mark parts of spectral tests as slow (#46509)
Summary:
According to https://app.circleci.com/pipelines/github/pytorch/pytorch/228154/workflows/31951076-b633-4391-bd0d-b2953c940876/jobs/8290059
TestFFTCUDA.test_fftn_backward_cuda_complex128 takes 242 seconds to finish, with most of the time spent checking the 2nd gradient

- Refactor the common part of test_fft_backward and test_fftn_backward into _fft_grad_check_helper
- Introduce the `slowAwareTest` decorator
- Split the test into fast and slow parts by checking the 2nd-degree gradient only during the slow part

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46509

Reviewed By: walterddr

Differential Revision: D24378901

Pulled By: malfet

fbshipit-source-id: 606670c2078480219905f63b9b278b835e760a66
2020-10-19 10:11:46 -07:00
e7564b076c Refactor scalar list APIs to use overloads (#45673)
Summary:
Refactor foreach APIs to use overloads in case of scalar list inputs.
Tested via unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45673

Reviewed By: heitorschueroff

Differential Revision: D24053424

Pulled By: izdeby

fbshipit-source-id: 35976cc50b4acfe228a32ed26cede579d5621cde
2020-10-19 09:28:49 -07:00
f06ea4bcac Add myself as owner of C++ APIs related folder (#46487)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46487

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D24370164

Pulled By: glaringlee

fbshipit-source-id: 7b25b15eb906917376d2e5290782572a3cd48d3c
2020-10-19 09:16:02 -07:00
eadc59df55 Enable TP_USE_CUDA and TP_ENABLE_CUDA_IPC (#46523)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46523

Test Plan: Imported from OSS

Reviewed By: beauby

Differential Revision: D24385830

Pulled By: mrshenli

fbshipit-source-id: 59a40843b4dc1585e176062476da9ab74c84179b
2020-10-19 09:05:00 -07:00
00c779a92b detect inplace modifications of views earlier (fix #21875) (#46204)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46204

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D24259500

Pulled By: bdhirsh

fbshipit-source-id: 223f8a07da4e4121009fc0a8b6760d90eef089b3
2020-10-19 08:58:33 -07:00
0c5cd8c2b9 [RFC] Switch PyTorch Selective Build (Custom Build) to use the SelectiveBuilder abstraction (#45722)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45722

This diff does a bunch of things:

1. Introduces some abstractions as detailed in https://fb.quip.com/2oEzAR5MKqbD to help with selective build related codegen in multiple files.
2. Adds helper methods to combine operators, debug info, operator lists, etc...
3. Currently, the selective build machinery queries `op_registration_whitelist` directly at various places in the code. `op_registration_whitelist` is a list of allowed operator names (without overload name). We want to move to a world where the overload names are also included so that we can be more selective about which operators we include. To that effect, it makes sense to hide the checking logic in a separate abstraction and have the build use that abstraction instead of putting all this selective-build-specific logic in the code generator itself. This change is attempting to do just that.
4. Updates generate_code, unboxing-wrapper codegen, and autograd codegen to accept the operator selector paradigm as opposed to a selected operator list.
5. Update `tools/code_analyzer/gen_op_registration_allowlist.py` to expose providing an actual structured operator dependency graph in addition to a serialized string.

There are a bunch of structural changes as well:

1. `root_op_list.yaml` and `combined_op_list.yaml` are now actual YAML files (not a space separated list of operator names)
2. `generate_code.py` accepts only paths to operator list YAML files (both old style as well as new style) and not list of operator names on the command line as arguments
3. `gen.py` optionally also accepts a custom build related operators YAML path (this file has information about which operators to register in the generated library).

ghstack-source-id: 114578753

(Note: this ignores all push blocking failures!)

Test Plan:
`buck test caffe2/test:selective_build`

Generated YAML files after the change:

{P143981979}

{P143982025}

{P143982056}

Ensure that the generated files are same before and after the change:

```
[dhruvbird@devvm2490 /tmp/TypeDefault.cpp] find -name "*.cpp" | xargs md5sum
d72c3d125baa7b77e4c5581bbc7110d2  ./after_change/gen_aten/TypeDefault.cpp
42353036c83ebc7620a7159235b9647f  ./after_change/lite_predictor_lib_aten/TypeDefault.cpp
d72c3d125baa7b77e4c5581bbc7110d2  ./before_change/gen_aten/TypeDefault.cpp
42353036c83ebc7620a7159235b9647f  ./before_change/lite_predictor_lib_aten/TypeDefault.cpp
```

`VariableTypes_N.cpp` are generated the same both before and after the change:

```
[dhruvbird@devvm2490 /tmp/VariableType] find -name "*.cpp" | xargs -n 1 md5sum | sort
3be89f63fd098291f01935077a60b677  ./after/VariableType_2.cpp
3be89f63fd098291f01935077a60b677  ./before/VariableType_2.cpp
40a3e59d64e9dbe86024cf314f127fd6  ./after/VariableType_4.cpp
40a3e59d64e9dbe86024cf314f127fd6  ./before/VariableType_4.cpp
a4911699ceda3c3a430f08c64e8243fd  ./after/VariableType_1.cpp
a4911699ceda3c3a430f08c64e8243fd  ./before/VariableType_1.cpp
ca9aa611fcb2a573a8cba4e269468c99  ./after/VariableType_0.cpp
ca9aa611fcb2a573a8cba4e269468c99  ./before/VariableType_0.cpp
e18f639ed23d802dc4a31cdba40df570  ./after/VariableType_3.cpp
e18f639ed23d802dc4a31cdba40df570  ./before/VariableType_3.cpp
```

Reviewed By: ljk53

Differential Revision: D23837010

fbshipit-source-id: ad06b1756af5be25baa39fd801dfdf09bc565442
2020-10-18 15:10:42 -07:00
bcd68dfa5f Change codecov comment style to be less verbose (#46499)
Summary:
Do not post comments if there are no changes in coverage and post only diff stats

Partially addresses an issue raised in https://github.com/pytorch/pytorch/issues/44187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46499

Reviewed By: walterddr

Differential Revision: D24373480

Pulled By: malfet

fbshipit-source-id: a49d7e661507ad98d5222c119b2f3f3550c1a949
2020-10-18 14:10:41 -07:00
351670f004 Delete libtorch test jobs (#46508)
Summary:
As they have been no-ops for 90+ days but still take 10 minutes from start to finish:
58f14d3a28/.jenkins/pytorch/test.sh (L376-L374)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46508

Reviewed By: walterddr

Differential Revision: D24378877

Pulled By: malfet

fbshipit-source-id: 2e9990a8e1524ef39e891bb5ea874447eef34b31
2020-10-18 14:05:44 -07:00
c3466dabaa Disable profiling when getGraphExecutorOptimize is unset (#46479)
Summary:
`getGraphExecutorOptimize` mandates we don't do any optimizations beyond what's required to run graphs. In this scenario, we don't want to do any profiling as profiling information will not be used.
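
A minimal sketch of the toggle involved (`_set_graph_executor_optimize` is an internal, unstable binding for the flag discussed above):

```python
import torch

# Tell the executor to skip optimizations; with this change, profiling
# runs are skipped as well, since their output would never be used.
torch._C._set_graph_executor_optimize(False)

@torch.jit.script
def f(x):
    return x * 2

f(torch.randn(3))  # executes without collecting profiling information
```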

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46479

Reviewed By: ZolotukhinM

Differential Revision: D24368292

Pulled By: Krovatkin

fbshipit-source-id: a2c7618d459efca9cb0700c4d64d829b352792a8
2020-10-17 22:30:05 -07:00
6a2f40dc66 Expose script_if_tracing as public API (#46494)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45921

`torch.jit._script_if_tracing` is still kept for BC
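
A minimal usage sketch of the now-public API:

```python
import torch

@torch.jit.script_if_tracing
def helper(x: torch.Tensor) -> torch.Tensor:
    # Compiled only when reached from inside torch.jit.trace;
    # runs eagerly otherwise.
    return x.relu() + 1

traced = torch.jit.trace(lambda x: helper(x) * 2, torch.randn(3))
```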

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46494

Reviewed By: ZolotukhinM

Differential Revision: D24381621

Pulled By: gmagogsfm

fbshipit-source-id: 35d9f2da38c591039ba95cd95ef186e6c7e47586
2020-10-17 17:31:57 -07:00
da95eec613 torch.fft: Two dimensional FFT functions (#45164)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45164

This PR implements `fft2`, `ifft2`, `rfft2` and `irfft2`. These are the last functions required for `torch.fft` to match `numpy.fft`. If you look at either NumPy or SciPy you'll see that the 2-dimensional variants are identical to `*fftn` in every way, except for the default value of `axes`. In fact you can even use `fft2` to do general n-dimensional transforms.
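
A quick sketch of the equivalence described above:

```python
import torch

x = torch.randn(4, 5)
# fft2 is fftn restricted to the last two dimensions by default
assert torch.allclose(torch.fft.fft2(x), torch.fft.fftn(x, dim=(-2, -1)))
```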

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D24363639

Pulled By: mruberry

fbshipit-source-id: 95191b51a0f0b8e8e301b2c20672ed4304d02a57
2020-10-17 16:23:06 -07:00
495070b388 [Metal] Add the Python binding for optimize_for_mobile (#46456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46456

Add the python binding in CMake. The general workflow is

- Build pytorch -  `USE_PYTORCH_METAL=ON python setup.py install --cmake`
- Run optimize_for_mobile

```
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

scripted_model = torch.jit.load('./mobilenetv2.pt')
optimized_model = optimize_for_mobile(scripted_model, backend='metal')
torch.jit.export_opnames(optimized_model)
torch.jit.save(optimized_model, './mobilenetv2_metal.bc')
```
The exported ops are

```
['aten::adaptive_avg_pool2d', 'aten::add.Tensor', 'aten::addmm', 'aten::reshape', 'aten::size.int', 'metal::copy_to_host', 'metal_prepack::conv2d_run']
```
ghstack-source-id: 114559878

Test Plan:
- Sandcastle CI
- Circle CI

Reviewed By: kimishpatel

Differential Revision: D24356768

fbshipit-source-id: fb5c4c4b6316347b67edb4132da044a81470ddfd
2020-10-17 10:26:25 -07:00
e8ff0f6c5c [c10] add operator= of intrusive_ptr to weak_intrusive_ptr (#44045)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44045

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23632281

Pulled By: wanchaol

fbshipit-source-id: ea50427fc261f0c77ddaac2e73032827320d7077
2020-10-17 03:35:44 -07:00
cc471c6daf [Metal] Enable optimize_for_mobile on Linux (#46384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46384

Currently, the optimize_for_mobile binary only works on macOS, which is not very convenient to use. This diff introduces a new buck target that separates out the Objective-C code. The goal here is to be able to export models for Metal on Linux machines.
ghstack-source-id: 114499418

Test Plan:
- set `enable_mpscnn` to 1 in pt_defs.bzl
- buck build //xplat/caffe2:optimize_for_mobile --show-full-output
- ./optimize_for_mobile --model="./model.pt" --backend="metal"
- CI

```
[taox@devvm2780.vll0 ~/mobilenetv2] ./optimize_for_mobile --model="./model.pt" --backend="metal"

pt_operator_library(
	name = "old_op_library",
	ops = [
		"aten::Int.Tensor",
		"aten::_convolution.deprecated",
		"aten::adaptive_avg_pool2d",
		"aten::add.Tensor",
		"aten::addmm",
		"aten::batch_norm",
		"aten::dropout",
		"aten::hardtanh_",
		"aten::reshape",
		"aten::size.int",
		"aten::t",
		"prim::NumToTensor.Scalar",
	],
)

find output:
%411 : Tensor = aten::addmm(%self.classifier.1.bias, %input0.1, %694, %26, %26) # /Users/taox/anaconda/lib/python3.7/site-packages/torch/nn/functional.py:1674:0
insert: %817 : Tensor = metal::copy_to_host(%411)

pt_operator_library(
	name = "new_op_library",
	ops = [
		"aten::adaptive_avg_pool2d",
		"aten::add.Tensor",
		"aten::addmm",
		"aten::reshape",
		"aten::size.int",
		"metal::copy_to_host",
		"metal_prepack::conv2d_run",
	],
)
```

Reviewed By: linbinyu

Differential Revision: D24322017

fbshipit-source-id: 2b8062b069659bfec78ca4f9f9a5bb9dfbd693d2
2020-10-16 23:46:02 -07:00
5233ff9f15 [TensorExpr] Re-enable test for torch.cat, add a test for torch.cat being a single node in a fusion group. (#46447)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46447

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D24356017

Pulled By: ZolotukhinM

fbshipit-source-id: 847c1d9c4f0f77f53ea3412a5663d486e78bccad
2020-10-16 20:26:48 -07:00
d6de9d573a [TensorExpr] Properly handle input types promotion and special case of empty inputs for aten::cat. (#46500)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46500

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D24373671

Pulled By: ZolotukhinM

fbshipit-source-id: b3be73a89a9ab6654212cb7094f32bf1c445e876
2020-10-16 20:26:46 -07:00
0f668d95b6 [TensorExpr] Fix shape inference logic for aten::cat. (#46482)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46482

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D24366778

Pulled By: ZolotukhinM

fbshipit-source-id: 000ff363b11599ba3827cdf2db3d4793878b84ab
2020-10-16 20:24:30 -07:00
58f14d3a28 Remove catchAllKernel_. (#46354)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46354

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D24319418

Pulled By: ailzhang

fbshipit-source-id: 295f0439087722b5cb60e43f2bca1ba8bd56a817
2020-10-16 19:11:17 -07:00
04e5fcc0ed [GPU] Introduce USE_PYTORCH_METAL (#46383)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46383

The old `USE_METAL` is actually being used by Caffe2. Here we introduce a new macro to enable Metal in PyTorch.
ghstack-source-id: 114499392

Test Plan:
- Circle CI
- The Person Segmentation model works

Reviewed By: linbinyu

Differential Revision: D24322018

fbshipit-source-id: 4e5548afba426b49f314366d89b18ba0c7e745ca
2020-10-16 18:19:32 -07:00
fa108bd264 Add flatten loops transformation (#46365)
Summary:
This diff removes the dependency of flattening on tensors by performing flattening on loops instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46365

Reviewed By: ailzhang

Differential Revision: D24366347

Pulled By: navahgar

fbshipit-source-id: 4ba182f37212b6e4033cae13f8e75bc5144389f4
2020-10-16 17:05:26 -07:00
5da4a08675 [GPU] Add metal to DispatchKeySet (#46455)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46455

After this PR(https://github.com/pytorch/pytorch/pull/46236 ) landed, the `aten::copy_` can no longer be dispatched to Metal kernels.
ghstack-source-id: 114499399

Test Plan:
- Sandcastle CI
- Circle CI

Reviewed By: IvanKobzarev, ailzhang

Differential Revision: D24356769

fbshipit-source-id: 8660ca5be663fdc8985d9eb710ddaadbb43b0ddd
2020-10-16 16:33:26 -07:00
8c629ecc9a [WIP] Move catchAll to Math (#45939)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45939

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D24165890

Pulled By: ailzhang

fbshipit-source-id: 72fe71ea95a738251b2fafc9eea4ab3831cf426b
2020-10-16 16:17:16 -07:00
d1ca7ef33e [Gradient Compression] Rename the first arg of pybinding of _register_comm_hook: ddp_model -> reducer. (#46498)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46498

The name of the first arg "ddp_model" is misleading, because it is actually the reducer of the DDP model rather than the entire model.

This method is called in the file caffe2/torch/nn/parallel/distributed.py:
`dist._register_comm_hook(self.reducer, state, hook)`
ghstack-source-id: 114531188

(Note: this ignores all push blocking failures!)

Test Plan: waitforbuildbot

Reviewed By: pritamdamania87

Differential Revision: D24372827

fbshipit-source-id: dacb5a59e87400d93a2f35da43560a591ebc5499
2020-10-16 16:12:42 -07:00
0c9787c758 caffe2: use at::mt19937 instead of std::mt19937 (10x speedup) (#43987)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43987

This replaces the caffe2 CPU random number generator (std::mt19937) with at::mt19937, which is the one currently used in PyTorch. The ATen RNG is 10x faster than the std one and appears to be more robust, given bugs in the std implementation (https://fburl.com/diffusion/uhro7lqb)

For large embedding tables (10GB+) we see UniformFillOp taking upwards of 10 minutes, as we're bottlenecked on the single-threaded RNG. Swapping to at::mt19937 cuts that time to 10% of the current time.

Test Plan: Ran all relevant tests + CI. This doesn't introduce new features (+ is a core change) so existing tests+CI should be sufficient to catch regressions.

Reviewed By: dzhulgakov

Differential Revision: D23219710

fbshipit-source-id: bd16ed6415b2933e047bcb283a013d47fb395814
2020-10-16 16:08:35 -07:00
e6e83bf802 [hotfix] remove test.pypi.org (#46492)
Summary:
fix CI: https://app.circleci.com/pipelines/github/pytorch/pytorch/227894/workflows/67d2ded3-82eb-4a5d-be2c-dfccb8ed9133/jobs/8275321

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46492

Reviewed By: janeyx99

Differential Revision: D24371755

Pulled By: walterddr

fbshipit-source-id: ae7e96f22920f85f04acdccc999774510a60cfa9
2020-10-16 16:01:20 -07:00
11cc7f143d Run __setstate__ when cloning modules (#45858)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45858

When cloning a module that has `__setstate__` and `__getstate__` methods, we need to run these methods to initialize the cloned module.
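
For reference, a hedged sketch of the pattern this affects (the module and its state layout are made up):

```
from typing import Tuple

import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = 2.0

    @torch.jit.export
    def __getstate__(self):
        return (self.scale, self.training)

    @torch.jit.export
    def __setstate__(self, state: Tuple[float, bool]):
        # Rebuilds the attributes; cloning now runs this on the clone.
        self.scale = state[0]
        self.training = state[1]

    def forward(self, x):
        return x * self.scale

m = torch.jit.script(M())
```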

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D24116524

Pulled By: bzinodev

fbshipit-source-id: a5111638e2dc903781f6468838c000850d1f9a74
2020-10-16 15:55:31 -07:00
478fa180ee Removing hack so that NCCL is not removed in base Docker commands (#46495)
Summary:
This hack was introduced in 2018 and should be able to be removed now. Previously, all Docker images removed the NCCL installation to allow some distributed tests to pass: https://github.com/pytorch/pytorch/issues/5877

This should no longer be the case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46495

Reviewed By: malfet

Differential Revision: D24372198

Pulled By: janeyx99

fbshipit-source-id: 6285db77734367f0b8dd363bfd6e9f61a0b58084
2020-10-16 15:42:23 -07:00
89108ba6ea type check for torch.quantization.stubs (#46475)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42973

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46475

Reviewed By: malfet

Differential Revision: D24368088

Pulled By: walterddr

fbshipit-source-id: 7a0ccb4fa66b28d4ac59923d727e632351a02b3f
2020-10-16 15:34:23 -07:00
997e672a27 Move NCCL installation for xenial-cuda10.1 to Docker image instead of for every job (#46476)
Summary:
Instead of installing NCCL for every build and test job with environment `*-xenial-cuda10.1-*`, install NCCL into the Docker image. This would save time and remove duplication in our scripts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46476

Reviewed By: ailzhang

Differential Revision: D24369397

Pulled By: janeyx99

fbshipit-source-id: c107aac9845024d2621eb967ca83c5fc8127a950
2020-10-16 14:17:40 -07:00
28f8372bf4 Avoid mat1 references in mm_mat1_backward (#45777)
Summary:
Avoiding references to `mat1` in `mm_mat1_backward` is a first step to solving issue https://github.com/pytorch/pytorch/issues/42371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45777

Reviewed By: malfet

Differential Revision: D24347967

Pulled By: albanD

fbshipit-source-id: f09a8149d9795481b5ed5b48fdd0e598ba027d0b
2020-10-16 13:52:44 -07:00
dd169ca17c caffe2/plan_executor: propagate exceptions from reporter substeps (#46424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46424

Currently if an exception occurs in a reporter thread the process is killed via std::terminate. This adds support for handling the reporter exception if FLAGS_caffe2_handle_executor_threads_exceptions is set to true.

Test Plan: buck test mode/opt -c python.package_style=inplace //caffe2/caffe2/python:hypothesis_test //caffe2/caffe2:caffe2_test_cpu -- --stress-runs 100

Reviewed By: dahsh

Differential Revision: D24345027

fbshipit-source-id: 0659495c9e27680ebae41fe5a3cf26ce2f455cb3
2020-10-16 12:28:57 -07:00
24ca2763e1 [fx] allow custom behavior for args, kwargs, and bool (#45193)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45193

This change makes it possible to subclass the tracer to add additional
behavior when you know something about the shape of the Proxy objects,
by overriding the defaults for how the tracer tries to make something iterable,
looks for keys for **kwargs, or tries to convert to a boolean.

An example test shows how this can be used to tag inputs with shapes.
It can also be used, combined with create_node, to do type propagation during
tracing to fulfill requests like `iter`.
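
A hedged sketch of the kind of override this enables (the "conditions are always true" policy is made up purely for illustration):

```
import torch
import torch.fx as fx

class BoolFriendlyTracer(fx.Tracer):
    def to_bool(self, obj: fx.Proxy) -> bool:
        # The default raises during symbolic tracing; assume True instead.
        return True

def f(x):
    if x.sum() > 0:      # bool() on a Proxy is routed through to_bool
        return x + 1
    return x - 1

graph = BoolFriendlyTracer().trace(f)
print(graph)             # records only the x + 1 branch
```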

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D24258993

Pulled By: zdevito

fbshipit-source-id: 6ece686bec292e51707bbc7860a1003d0c1321e8
2020-10-16 11:19:12 -07:00
2e2fe8cf3b [NCCL] Modularize ncclCommWatchdog (#46051)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46051

Creates a subroutine for aborting timed out collectives. This should help modularize the ncclCommWatchdog a bit, since it is growing too large.
ghstack-source-id: 114398496

Test Plan:
Successful Flow Run:
f225037915
f217609101

Reviewed By: jiayisuse

Differential Revision: D23607535

fbshipit-source-id: 0b1c9483bcd3a41847fc8c0bf6b22cdba01fb1e6
2020-10-16 11:06:40 -07:00
be0c431874 Fix implicit cast in custom_function (#46445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46445

Fix an instance in which a truncated integer prevents downstream type safety checks.

Test Plan: I'm not sure what's appropriate here.

Reviewed By: albanD

Differential Revision: D24339292

fbshipit-source-id: 15748ec64446344ff1a8344005385906d3484d7c
2020-10-16 10:58:02 -07:00
9300a27702 Make torch.lu support complex input on CUDA. (#45898)
Summary:
As per title. LU decomposition is used for computing determinants, and I need this functionality to implement the matrix square root. Next PR on my list is to enable `torch.det` on CUDA with complex input.
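
A minimal usage sketch (assumes a CUDA build with this change):

```
import torch

A = torch.randn(4, 4, dtype=torch.complex64, device="cuda")
LU, pivots = torch.lu(A)   # packed LU factorization plus pivot indices
# torch.det for complex CUDA input is the follow-up PR mentioned above.
```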

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45898

Reviewed By: heitorschueroff

Differential Revision: D24306951

Pulled By: anjali411

fbshipit-source-id: 168f578fe65ae1b978617a66741aa27e72b2172b
2020-10-16 10:29:39 -07:00
5c5484c889 [Flaky tests] Fix test_all_gather_timeout test (#45989)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45989

This test was failing internally for the Thrift-based RPC agent, since
it has a different error regex. Use `self.get_timeout_error_regex` which gets
the timeout error string for each backend to fix this.
ghstack-source-id: 114463458

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D24170394

fbshipit-source-id: 9b30945e3e30f36472268d042173f8175ad88098
2020-10-16 09:02:46 -07:00
c37baa9177 [caffe2] add concat benchmark (#46457)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46457

Wanted to see if using a CopyMatrix specialized for float that uses mkl_somatcopy could be faster, but it wasn't. Still want to check in the benchmark so it can be used later.

Test Plan: .

Reviewed By: dskhudia

Differential Revision: D24345901

fbshipit-source-id: d3e68dbb560e3138fda11c55789cd41bc0715c6d
2020-10-16 08:48:42 -07:00
b5702e2350 Fix out-of-bounds access for caching allocator calls (#46439)
Summary:
In assertValidDevice() compare device index to `caching_allocator.device_allocator` rather than to `device_no`

Fixes potential crashes when caching allocator is accessed before being initialized, for example by calling something like:
`python -c "import torch;print(torch.cuda.memory_stats(0))"`

Fixes https://github.com/pytorch/pytorch/issues/46437

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46439

Reviewed By: ngimel

Differential Revision: D24350717

Pulled By: malfet

fbshipit-source-id: 714e6e74f7c2367a9830b0292478270192f07a7f
2020-10-16 08:24:46 -07:00
d6e6073016 install lcov in Docker image if coverage is specified (#46404)
Summary:
As a step to enable C++ code coverage in PyTorch, we want to install `lcov` in the Docker image so that lcov is available to use later on in the process. `lcov` is now installed in all non-rocm non-cuda ubuntu images.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46404

Reviewed By: seemethere

Differential Revision: D24343450

Pulled By: janeyx99

fbshipit-source-id: 3fdab0c0f78c004488e115505740620417f76646
2020-10-16 08:15:57 -07:00
7b788d113e Fix deprecated warnings for nan_to_num (#46309)
Summary:
Related to https://github.com/pytorch/pytorch/issues/44592

This PR is to fix the deprecated warnings for the nan_to_num function.

Below is the warning message when building the latest code.
```
../aten/src/ATen/native/UnaryOps.cpp: In function ‘at::Tensor& at::native::nan_to_num_out(at::Tensor&,
const at::Tensor&, c10::optional<double>, c10::optional<double>, c10::optional<double>)’:
../aten/src/ATen/native/UnaryOps.cpp:397:45: warning: ‘bool c10::isIntegralType(c10::ScalarType)’
is deprecated: isIntegralType is deprecated.
Please use the overload with 'includeBool' parameter instead. [-Wdeprecated-declarations]
   if (c10::isIntegralType(self.scalar_type())) {
```

The deprecated warning is defined in `ScalarType.h`.
d790ec6de0/c10/core/ScalarType.h (L255-L260)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46309

Reviewed By: mrshenli

Differential Revision: D24310248

Pulled By: heitorschueroff

fbshipit-source-id: 0f9f2ad304eb5a2da9d2b415343f2fc9029037af
2020-10-16 06:07:14 -07:00
ecf63351bc [mlf][efficiency] modify equalization scale operator to return single output (#46449)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46449

modifies `ComputeEqualizationScale` to have a single output `S`

Test Plan:
```
buck test caffe2/caffe2/quantization/server:compute_equalization_scale_test
```

plus e2e tests

Reviewed By: hx89

Differential Revision: D23946768

fbshipit-source-id: 137c2d7a58bb858db411248606a5784b8066ab23
2020-10-16 01:22:37 -07:00
4359c5e036 [TensorExpr] Correctly handle negative dimensions in aten::cat when lowering to tensor expressions. (#46446)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46446

Fixes #46440.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D24356016

Pulled By: ZolotukhinM

fbshipit-source-id: b759760bb8c765aeb128eb94d18af20cddd888a2
2020-10-16 01:13:14 -07:00
ec5f81f9d3 Remove variable_excluded_from_dispatch() check for factory functions. (#46371)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46371

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D24324545

Pulled By: ailzhang

fbshipit-source-id: 78038054690dff14883df711073be4c2da4e1f8b
2020-10-15 21:15:41 -07:00
d1745c36dc fix type check for torch.quantization._numeric_suite (#46330)
Summary:
fix https://github.com/pytorch/pytorch/issues/42977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46330

Reviewed By: malfet

Differential Revision: D24320449

Pulled By: walterddr

fbshipit-source-id: f892b5c83cb932aee53245d6c825568b3e05f3c6
2020-10-15 20:45:07 -07:00
92921c82bb Add named tuple's error message and workaround for RET failure (#46347)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46347

Added the named tuple's error messages & workarounds for when it is returned from a function of a class in PyTorch Mobile.

To identify the error cases (returning a NamedTuple type), I used the following conditions:
1) ins.op == RET  (for returning)
2) type->kind() == TypeKind::TupleType  (for pruning non-tuple types)
3) type->cast<TupleType>()->name()  (for pruning the plain Tuple type)
  - I could use the type's str (str() or repr_str()) directly, but I used whether it has the "name" attribute instead. Please comment on this choice.

[Information of Tuple and NamedTuple types]
1. Tuple
type->str(): (int, int)
type->repr_str(): Tuple[int, int]
type->kind():  TypeKind::TupleType         # different from other types
type->cast<NamedType>(): True
type->cast<NamedType>()->name(): False     # different from NamedTuple

2. NamedTuple
type->str():  __torch__.myNamedTuple
type->repr_str(): __torch__.myNamedTuple
type->kind():  TypeKind::TupleType         # different from other types
type->cast<NamedType>(): True
type->cast<TupleType>()->name(): True      # different from Tuple

(In the next diff, I will handle the other error cases: 1) returning List<module class> and Dict<module class>, and 2) accessing Module class member functions.)
ghstack-source-id: 114361762

Test Plan:
[Added test results]
  buck test mode/dev caffe2/test:mobile -- 'test_unsupported_return'

  Summary
    Pass: 2
    ListingSuccess: 1
    Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/7036874440497926

[Whole test results]
  buck test mode/dev caffe2/test:mobile -- 'test'

  Summary
    Pass: 11
    ListingSuccess: 1
    Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/4503599664074084

Reviewed By: iseeyuan

Differential Revision: D24291962

fbshipit-source-id: a1a9e24e41a5f1e067738f59f1eae34d07cba31a
2020-10-15 17:41:06 -07:00
d278e83e69 Update VariableTypeManual.cpp to not use catchAllKernel. (#46353)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46353

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24319416

Pulled By: ailzhang

fbshipit-source-id: e6ca74919949f757112a35e8fce74bded45dcfde
2020-10-15 17:10:28 -07:00
dc7cd97402 Fixes bug in sspaddmm (#45113) (#45963)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45113

Description:
- Fixed bug in sspaddmm by calling contiguous on indices.
- Added tests

We have to make indices contiguous as we use `indices.data_ptr` in `_to_csr` which assumes row-contiguous storage:
be45c3401a/aten/src/ATen/native/sparse/SparseTensorMath.cpp (L1087-L1090)

> Part 1 of fixing this is probably to document sspaddmm. Part 2 may be to rewrite it using other ops. (https://github.com/pytorch/pytorch/issues/45113#issuecomment-700166809)

- Docs will be written here: https://github.com/pytorch/pytorch/pull/45400

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45963

Reviewed By: malfet

Differential Revision: D24335599

Pulled By: ngimel

fbshipit-source-id: 8278c73a1b4cccc5e22c6f3818dd222588c46b45
2020-10-15 16:50:16 -07:00
dda95e6914 More Timer refinement (#46023)
Summary:
This PR just adds more polish to the benchmark utils:

1) `common.py`, `timer.py`, and `valgrind_wrapper/timer_interface.py` are now MyPy strict compliant. (except for three violations due to external deps.) Compare and Fuzzer will be covered in a future PR.
2) `CallgrindStats` now uses `TaskSpec` rather than accepting the individual fields which brings it closer to `Measurement`.
3) Some `__repr__` logic has been moved into `TaskSpec` (which `Measurement` and `CallgrindStats` use in their own `__repr__`s) for a more unified feel and less horrible f-string hacking, and the repr's have been given a cleanup pass.
4) `Tuple[FunctionCount, ...]` has been formalized as the `FunctionCounts` class, which has a much nicer `__repr__` than just the raw tuple, as well as some convenience methods (`__add__`, `__sub__`, `filter`, `transform`) for easier DIY stat exploration. (I find myself using the latter two a lot now.) My personal experience is that manipulating `FunctionCounts` is massively more pleasant than the raw tuples of `FunctionCount`. (Though it's still possible to get at the raw data if you want; see the sketch after this list.)
5) Better support for multi-line `stmt` and `setup`.
6) Compare now also supports rowwise coloring, which is often the more natural layout for A/B testing.
7) Limited support for `globals` in `collect_callgrind`. This should make it easier to benchmark JIT models. (CC ZolotukhinM)
8) More unit tests, including extensive tests for the Callgrind stats manipulation APIs.
9) Mitigate issue with `MKL_THREADING_LAYER` when run in Jupyter. (https://github.com/pytorch/pytorch/issues/37377)
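
A hedged sketch of the item-4 workflow (the stmt/setup and the filter predicate are made up; `collect_callgrind` requires valgrind to be installed):

```
from torch.utils.benchmark import Timer

t = Timer(
    stmt="y = x + x",
    setup="x = torch.ones(128)",  # torch is importable in stmt/setup by default
)
print(t.timeit(100))              # wall-clock Measurement

stats = t.collect_callgrind()     # CallgrindStats
counts = stats.stats(inclusive=False)               # FunctionCounts
aten_only = counts.filter(lambda name: "at::" in name)
print(aten_only)                  # much nicer repr than a raw tuple
```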

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46023

Test Plan: changes should be covered by existing and new unit tests.

Reviewed By: navahgar, malfet

Differential Revision: D24313911

Pulled By: robieta

fbshipit-source-id: 835d4b5cde336fb7ff0adef3c0fd614d64df0f77
2020-10-15 16:32:53 -07:00
757173a4da Add Sigmoid operator from Caffe2 (#46286)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46286

commonize fp16 unary operators

Reviewed By: hyuen

Differential Revision: D24199660

fbshipit-source-id: 99bffa24dc3fa459561a7a2743b1a4dce4be5d58
2020-10-15 16:13:37 -07:00
16c52d918b [caffe2] Bypass memonger for in-place ops (#46378)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46378

Reviewed By: dzhulgakov

Differential Revision: D24236604

fbshipit-source-id: 9f599687467ea969e89243482f8e2a41f7db0a23
2020-10-15 16:03:52 -07:00
faf03bd226 Update default output extension in optimize_for_mobile.cc (#45598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45598

.bc is causing issues on Android.  Let's switch to .ptl.

Test Plan: CI

Reviewed By: kimishpatel

Differential Revision: D24026180

fbshipit-source-id: 9f252f3652d748bccb19dc61a783d693e171b2c6
2020-10-15 15:34:34 -07:00
f96cb9de79 vmap: added fallback for in-place operators (#46191)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46191

This PR adds a fallback for in-place operators to vmap. We define an
in-place operator to be an operator that operates in-place on its first
argument and returns that argument.

The "iteration over batch" logic is mostly copied from the out-of-place
vmap fallback. I wanted to try to not copy this but the iteration logic
is pretty entangled with the rest of the logic; one alternative was to
use if/else statements inside batchedTensorForLoopFallback but then
there are ~3-4 different sites where we would need that.

When in-place operations are not possible
=========================================
Sometimes, an in-place operation inside of vmap is not possible. For
example, `vmap(Tensor.add_, (None, 0))(torch.rand(3), torch.rand(B0, 3))`
is not possible because the tensor being written to in-place has size
[3] and the other tensor has size [B0, 3].

We detect if this is the case and error out inside the in-place
fallback.

Test Plan
=========
Added some new tests to `test_vmap.py`.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D24335240

Pulled By: zou3519

fbshipit-source-id: 1f60346059040dc226f0aeb80a64d9458208fd3e
2020-10-15 15:21:25 -07:00
401c85b4d3 Rename createLevelsBitset -> createVmapLevelsBitset; move it (#46190)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46190

Moved `createLevelsBitset` to BatchedTensorImpl.h and renamed it to
`createVmapLevelsBitset`. I moved it there because I want to use it in
another file in a future diff.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D24335241

Pulled By: zou3519

fbshipit-source-id: 1c225a00b6da9c69dc7fbc61516bf7257298355c
2020-10-15 15:19:53 -07:00
849bc77ee4 Add quick fix for view/inplace issue with DDP (#46406)
Summary:
As per title, temporary mitigation for https://github.com/pytorch/pytorch/issues/46242 for which https://github.com/pytorch/pytorch/pull/46296 will be a proper fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46406

Reviewed By: malfet

Differential Revision: D24339689

Pulled By: albanD

fbshipit-source-id: 0726e5abe4608d8ffcd7846cbaaffbb8564b04ab
2020-10-15 15:13:11 -07:00
ba92920a28 Remove codegen for old RegistrationDeclarations.h (#46370)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46370

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24324546

Pulled By: ailzhang

fbshipit-source-id: 6dea28c0c7ab5d00f8887735c32304f1b68bf923
2020-10-15 14:02:15 -07:00
03c7d5be6b Add operator benchmark for 4bit/8bit embedding lookups
Summary: Add operator benchmark for 4bit/8bit embedding lookups in `aibench`.

Test Plan:
```
buck build //caffe2/benchmarks/operator_benchmark/pt:qembedding_bag_lookups_test
aibench-cli adhoc -c 'buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_bag_lookups_test'
````

The run was successful in aibench: https://www.internalfb.com/intern/aibench/details/738300474
https://www.internalfb.com/intern/aibench/details/346463246

Reviewed By: radkris-git

Differential Revision: D24268413

fbshipit-source-id: 7fb4ff75da47f8f327edab562c5d29bb69e00b8d
2020-10-15 13:51:32 -07:00
c99378af1b Fixing pow for special case between cuda tensors and cpu tensors and reframed test cases a tiny bit (#46320)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46037

I have now isolated the special case to apply only between CUDA tensor bases and CPU tensor exponents. My previous fix was not a complete one: it fixed some cases but broke others. The current fix is more complete:
```
In [1]: import torch
In [2]: a=torch.randn(3)
In [3]: b=torch.tensor(2, device="cuda")
In [4]: torch.pow(a,b) #should not work and throws exception now!

In [5]: a=torch.tensor(3, device="cuda")
In [6]: b=torch.tensor(2)
In [7]: torch.pow(a,b) #should work, and now does

In [8]: a=torch.randn(3, device="cuda")
In [9]: torch.pow(a,b) # yeah, that one is fixed and still works
```

To add a test case to reflect the change, I had to modify the existing setup a little bit. I think it is an improvement but would appreciate any tips on how to make it better!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46320

Reviewed By: malfet

Differential Revision: D24306610

Pulled By: janeyx99

fbshipit-source-id: cc74c61373d1adc2892a7a31226f38895b83066a
2020-10-15 13:43:47 -07:00
7d6d5f4be0 Migrate CPU_tensor_apply to TensorIterator (#44242)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24487
Closes https://github.com/pytorch/pytorch/issues/24488
Closes https://github.com/pytorch/pytorch/issues/24489
Closes https://github.com/pytorch/pytorch/issues/24490
Closes https://github.com/pytorch/pytorch/issues/24491
Closes https://github.com/pytorch/pytorch/issues/24492
Closes https://github.com/pytorch/pytorch/issues/24493
Closes https://github.com/pytorch/pytorch/issues/24494
Closes https://github.com/pytorch/pytorch/issues/24495
Closes https://github.com/pytorch/pytorch/issues/24496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44242

Reviewed By: mruberry

Differential Revision: D24212123

Pulled By: VitalyFedyunin

fbshipit-source-id: feca9a0bf3b25be76409e8c83f90e7c5c91dce1a
2020-10-15 13:22:41 -07:00
8f61fa653f use @mode/ndk_libcxx instead of mode/gnustl
Summary: as title

Test Plan:
run ai bench

buck run aibench:run_bench -- -b aibench/specifications/models/caffe2/squeezenet/squeezenet.json --remote --devices s9f

INFO 2020-10-15 12:24:21 everstore.py: 177: Everstore: Try internal upload
INFO 2020-10-15 12:24:21 subprocess_with_logger.py:  56: Running: clowder put --fbtype=24665 /tmp/aibenchj5w4craj/build.sh
INFO 2020-10-15 12:24:25 subprocess_with_logger.py:  39: Process Succeeded: clowder put --fbtype=24665 /tmp/aibenchj5w4craj/build.sh
INFO 2020-10-15 12:24:25 everstore.py: 185: Uploading /tmp/aibenchj5w4craj/build.sh
everstore handle: GIxbhAQoAOdQNqwCAEam42Z-pfQrbllgAAAP
url:
INFO 2020-10-15 12:24:25 run_remote.py: 182: program: //everstore/GIxbhAQoAOdQNqwCAEam42Z-pfQrbllgAAAP/build.sh
INFO 2020-10-15 12:24:25 everstore.py: 177: Everstore: Try internal upload
INFO 2020-10-15 12:24:25 subprocess_with_logger.py:  56: Running: clowder put --fbtype=24665 /tmp/aibenchj5w4craj/program
INFO 2020-10-15 12:24:32 subprocess_with_logger.py:  39: Process Succeeded: clowder put --fbtype=24665 /tmp/aibenchj5w4craj/program
INFO 2020-10-15 12:24:32 everstore.py: 185: Uploading /tmp/aibenchj5w4craj/program
everstore handle: GICWmAA66cD0qNMCAJh4vKJzTU8-bllgAAAP
url:
INFO 2020-10-15 12:24:32 run_remote.py: 182: program: //everstore/GICWmAA66cD0qNMCAJh4vKJzTU8-bllgAAAP/program
Scuba => https://fburl.com/scuba/caffe2_benchmarking/w6tvinl0
Job status for SM-G960F-8.0.0-26 (sm-g960f-26,galaxy-s9f,s9f) is changed to CLAIMED
Job status for SM-G960F-8.0.0-26 (sm-g960f-26,galaxy-s9f,s9f) is changed to RUNNING
Job status for SM-G960F-8.0.0-26 (sm-g960f-26,galaxy-s9f,s9f) is changed to DONE
ID:0	NET latency: 82192.9
You can find more info via https://our.intern.facebook.com/intern/aibench/details/915762192256210

Reviewed By: smeenai

Differential Revision: D24340103

fbshipit-source-id: cea15bc866361b397b4e1b001e609eb7e9f9aa47
2020-10-15 13:09:26 -07:00
e69f2f82ab Automated submodule update: FBGEMM (#46395)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: abc56f644d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46395

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: malfet

Differential Revision: D24333846

fbshipit-source-id: 307bbb7e857cd9e472d03374d3d3941128d807b5
2020-10-15 13:06:26 -07:00
c1141b6f68 Added support for complex torch.pinverse (#45819)
Summary:
This PR adds support for complex-valued input for `torch.pinverse`.
Fixed cuda SVD implementation to return singular values with real dtype.
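
A small usage sketch (the shape is arbitrary):

```
import torch

A = torch.randn(3, 5, dtype=torch.complex128)
A_pinv = torch.pinverse(A)   # (5, 3) complex pseudo-inverse

# Moore-Penrose identity A @ A+ @ A == A, up to numerical tolerance.
assert torch.allclose(A @ A_pinv @ A, A)
```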

Fixes https://github.com/pytorch/pytorch/issues/45385.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45819

Reviewed By: heitorschueroff

Differential Revision: D24306539

Pulled By: anjali411

fbshipit-source-id: 2fe19bc630de528e0643132689e1bc5ffeaa162a
2020-10-15 12:28:22 -07:00
5ce46fbbca BFloat16 support for torch.sign (#45244)
Summary:
Added BF16 support for torch.sign on CUDA
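
A one-line usage sketch (assumes a CUDA device):

```
import torch

x = torch.tensor([-2.0, 0.0, 3.0], dtype=torch.bfloat16, device="cuda")
print(torch.sign(x))  # tensor([-1., 0., 1.], device='cuda:0', dtype=torch.bfloat16)
```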

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45244

Reviewed By: zou3519

Differential Revision: D23932304

Pulled By: izdeby

fbshipit-source-id: e50b9510ecf2337ec0288392d6950046116b2599
2020-10-15 12:23:14 -07:00
8c26111adb Add fence. (#45148)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45148

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23906774

Pulled By: AshkanAliabadi

fbshipit-source-id: 93fe923bbd59d6e8bf3f13372217bd998856e8d7
2020-10-15 12:15:03 -07:00
e3eef0cd7a Add image sampler. (#45037)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45037

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23820824

Pulled By: AshkanAliabadi

fbshipit-source-id: 2b71f24fb590ad87a963d00a4e380b4d990a11ef
2020-10-15 12:14:59 -07:00
50f833248d Redo Vulkan command and descriptor pools. (#44496)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44496

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23820829

Pulled By: AshkanAliabadi

fbshipit-source-id: 3e114a3adcb2df01fb151c0536ce1a2e3f9dfbc1
2020-10-15 12:10:35 -07:00
1e654a4b7f Fix error message for scatter reduction (#46397)
Summary:
Follow up to https://github.com/pytorch/pytorch/pull/41377 to update the error message to match the removed arguments

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46397

Reviewed By: malfet

Differential Revision: D24336009

Pulled By: albanD

fbshipit-source-id: b9bf2f9ef7fd2ae622c4079384afc93e9c473f47
2020-10-15 11:34:59 -07:00
528158af47 Updated derivatives for complex mm, mv, ger, bmm, triangular_solve (#45737)
Summary:
This PR updates derivatives for a few functions so that `gradgradcheck` for `torch.cholesky` is passed ([ref](https://github.com/pytorch/pytorch/pull/45267#discussion_r494439967)).

Some tests (that call `bmm_cuda`) fail with `RuntimeError: _th_bmm_out not supported on CUDAType for ComplexDouble`
until PR https://github.com/pytorch/pytorch/issues/42553 is merged.

Ref. https://github.com/pytorch/pytorch/issues/33152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45737

Reviewed By: bdhirsh

Differential Revision: D24279917

Pulled By: anjali411

fbshipit-source-id: 7b696d2cfc2ef714332c2e3e5d207e257be67744
2020-10-15 11:27:30 -07:00
7f458e16ba Allow Undefined to get kernel from Math/DefaultBackend. (#46352)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46352

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24319417

Pulled By: ailzhang

fbshipit-source-id: de2d7db2cb931b0dcf2fbabd7d292e22cfc5e7b7
2020-10-15 11:17:08 -07:00
908c23579d [JIT] Revert Freezing shared type PR (#46285)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45902 by reverting https://github.com/pytorch/pytorch/pull/42457

The test case introduced by https://github.com/pytorch/pytorch/pull/42457 was fixed by https://github.com/pytorch/pytorch/pull/46250, which I'm assuming is the real source of the bug.

In the future it would be good to provide repros for freezing issues without including a quantization dependency; there was another issue in freezing (see: https://github.com/pytorch/pytorch/pull/46054) whose root cause was the same quantization issue https://github.com/pytorch/pytorch/pull/46250.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46285

Reviewed By: bdhirsh

Differential Revision: D24288739

Pulled By: eellison

fbshipit-source-id: b69ee8c713f749cd93d5eba370c3eafed86568bb
2020-10-15 10:57:30 -07:00
b5479737d7 Add windows JNI support (#44257)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32516

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44257

Reviewed By: malfet

Differential Revision: D24332820

Pulled By: ezyang

fbshipit-source-id: 1dd97e9c8140129a02a9078623b190b33f30d5b0
2020-10-15 10:48:45 -07:00
bd449334b8 Fix formatting issues in torch.tensor_split documentation (#46328)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46328

Reviewed By: heitorschueroff

Differential Revision: D24318003

Pulled By: mruberry

fbshipit-source-id: 140d391dd927ff3374dd6c4c6e2da7cb67417b31
2020-10-15 10:08:38 -07:00
75809626fb Stop running clang-tidy on torch/csrc/generic/*.cpp. (#46335)
Summary:
Those files are never directly built, only included in other files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46335

Reviewed By: albanD

Differential Revision: D24316737

Pulled By: gchanan

fbshipit-source-id: 67bb95e7f4450e3bbd0cd54f15fde9b6ff177479
2020-10-15 08:28:28 -07:00
e366591dc8 Fix incorrect signatures in get_testing_overrides, and add test for incorrect signatures (#45983)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45983

Reviewed By: agolynski

Differential Revision: D24220048

Pulled By: ezyang

fbshipit-source-id: 67826efdb203d849e028467829f7b5ad4559ec67
2020-10-15 07:48:20 -07:00
2d6fd22e24 Rationalize inlining of kernels into the unboxing wrapper (#42845)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42845

- In server builds, always allow the compiler to inline the kernel into the unboxing wrapper, i.e. optimize for perf.
- In mobile builds, never inline the kernel into the unboxing wrapper, i.e. optimize for binary size.

Note that this only applies for registration API calls where we can actually inline it, i.e. calls with `TORCH_FN` or some of the old API calls.
Registrations that give the registration API a runtime function pointer can't inline and won't do so on server either.

Note also that in server builds, all we do is **allow** the compiler to inline. We don't force inlining.

ghstack-source-id: 114177591

Test Plan:
waitforsandcastle

https://www.internalfb.com/intern/fblearner/details/225217260/

Reviewed By: ezyang

Differential Revision: D23045772

fbshipit-source-id: f74fd600eaa3f5cfdf0da47ea080801a03db7917
2020-10-15 04:02:51 -07:00
053c252c66 Update COMPILE_TIME_MAX_DEVICE_TYPES to 12 (#46327)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46327

### Summary

Update the COMPILE_TIME_MAX_DEVICE_TYPES to 12 as we landed a new Metal backend.

### Test Plan

- Circle CI

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D24309189

Pulled By: xta0

fbshipit-source-id: eec076b7e4fc94bab11840318821aa554447e541
2020-10-15 02:09:17 -07:00
38c97fb6f0 [shape inference] add shape inference support
Summary:
* To make the pruning op compatible with shape inference, we introduced a new quantile argument (as in D23463390) to differentiate dynamic/fixed pruning.

* The fixed pruning op has well-defined output shapes. However, the input shapes are not determined, so we want to bypass input shape checking for the two pruning ops, as implemented in this diff.

Test Plan:
buck test caffe2/caffe2/opt:bound_shape_inference_test

```
Started reporting to test run: https://our.intern.facebook.com/intern/testinfra/testrun/844425102187909
    ✓ ListingSuccess: caffe2/caffe2/opt:bound_shape_inference_test - main (1.973)
    ✓ Pass: caffe2/caffe2/opt:bound_shape_inference_test - BoundShapeInference.FC3D (2.604)
    ✓ Pass: caffe2/caffe2/opt:bound_shape_inference_test - BoundShapeInference.SparseLengthsSumFused4BitRowwise (2.635)
    ✓ Pass: caffe2/caffe2/opt:bound_shape_inference_test - BoundShapeInference.FC (2.690)
    ✓ Pass: caffe2/caffe2/opt:bound_shape_inference_test - BoundShapeInference.Int8QuantizeInferInputBackwards (2.705)
    ✓ Pass: caffe2/caffe2/opt:bound_shape_inference_test - BoundShapeInference.SparseLengthsSum (2.729)
    ✓ Pass: caffe2/caffe2/opt:bound_shape_inference_test - BoundShapeInference.Reshape (2.754)
    ✓ Pass: caffe2/caffe2/opt:bound_shape_inference_test - BoundShapeInference.ConcatMissingInput (2.770)
    ✓ Pass: caffe2/caffe2/opt:bound_shape_inference_test - BoundShapeInference.ElementwiseOp (2.770)
    ✓ Pass: caffe2/caffe2/opt:bound_shape_inference_test - BoundShapeInference.Tile (2.785)
    ✓ Pass: caffe2/caffe2/opt:bound_shape_inference_test - BoundShapeInference.Bucketize (2.789)
    ✓ Pass: caffe2/caffe2/opt:bound_shape_inference_test - BoundShapeInference.SparseLengthsSumFused8BitRowwise (2.807)
    ✓ Pass: caffe2/caffe2/opt:bound_shape_inference_test - BoundShapeInference.SparseLengthsSum8BitRowwiseSparse (2.841)
    ✓ Pass: caffe2/caffe2/opt:bound_shape_inference_test - BoundShapeInference.Split (2.863)
    ✓ Pass: caffe2/caffe2/opt:bound_shape_inference_test - BoundShapeInference.ConcatInferInputBackwards (2.894)
    ✓ Pass: caffe2/caffe2/opt:bound_shape_inference_test - BoundShapeInference.ElementwiseInferInputBackwards (2.898)
    ✓ Pass: caffe2/caffe2/opt:bound_shape_inference_test - BoundShapeInference.Combo0 (2.902)
    ✓ Pass: caffe2/caffe2/opt:bound_shape_inference_test - BoundShapeInference.LengthsRangeFill (2.964)
    ✓ Pass: caffe2/caffe2/opt:bound_shape_inference_test - BoundShapeInference.Quantization (2.964)
Summary
  Pass: 18
  ListingSuccess: 1
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/844425102187909
```

buck test caffe2/caffe2/fb/opt:bound_shape_inference_net_test

```
 Started reporting to test run: https://our.intern.facebook.com/intern/testinfra/testrun/3096224780078093
    ✓ ListingSuccess: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - main (14.092)
    ✓ Pass: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - BoundShapeInference.ClipLengths (15.508)
    ✓ Pass: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - BoundShapeInference.DPER3IdListFeaturePreProcessing (15.521)
    ✓ Pass: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - BoundShapeInference.ClipRanges (16.198)
    ✓ Pass: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - BoundShapeInference.RowwisePrune (16.302)
    ✓ Pass: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - FbBoundShapeInferencerTest.GatherRanges1 (16.585)
    ✓ Pass: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - BoundShapeInference.Combo3 (16.865)
    ✓ Pass: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - BoundShapeInference.DPER3IdListFeaturePreProcessingWithCast (16.907)
    ✓ Pass: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - BoundShapeInference.GatherRanges2 (16.921)
    ✓ Pass: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - FbBoundShapeInferencerTest.LengthsRangeFill (17.157)
    ✓ Pass: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - BoundShapeInference.ClipRangesAndGatherRanges (17.277)
    ✓ Pass: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - BoundShapeInference.DPER3IdScoreListFeaturePreProcessing (17.274)
    ✓ Pass: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - BoundShapeInference.ClipRangesGatherSigridHash (17.554)
    ✓ Pass: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - BoundShapeInference.Combo1 (17.645)
    ✓ Pass: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - BoundShapeInference.DPER3IdScoreListFeaturePreProcessingDEFAULT (17.887)
    ✓ Pass: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - BoundShapeInference.DPER3IdListFeaturePreProcessingDEFAULT (17.929)
    ✓ Pass: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - BoundShapeInference.f97293388_0 (19.343)
    ✓ Pass: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - FbBoundShapeInferencerTest.GatherRangesToDense1 (19.489)
    ✓ Pass: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - BoundShapeInference.DPER3IdScoreListFeaturePreProcessingWithCast (19.887)
    ✓ Pass: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - BoundShapeInference.xray_v11 (19.905)
    ✓ Pass: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - FbBoundShapeInferencerTest.SigridTransforms (20.080)
    ✓ Pass: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - BoundShapeInference.Combo2 (20.086)
    ✓ Pass: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - BoundShapeInference.vanillaSparseNN (59.847)
    ✓ Pass: caffe2/caffe2/fb/opt:bound_shape_inference_net_test - BoundShapeInference.gather (97.822)
Summary
  Pass: 23
  ListingSuccess: 1
```

## Workflow testing

===
* non-DI/fixed quantile/user side/non-self-binning
f224250571

*  non-DI/fixed quantile/user+ad side/non-self-binning
f224250610

* DI/fixed quantile/user side/self-binning
f224250637

* DI/fixed quantile/user+ad side/self-binning
f224250662

*  non-DI/dynamic quantile/user+ad side/non-self-binning
f224250705

* DI/dynamic quantile/user+ad side/self-binning
f224250760

Reviewed By: ChunliF

Differential Revision: D23647390

fbshipit-source-id: 3ec1c0eaea53bd4d5eda4a0436577216f7fa8ead
2020-10-15 00:46:06 -07:00
a87a1c1103 Fix performance issue of GroupNorm on CUDA when feature map is small. (#46170)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46170

Fix performance issue of GroupNorm on CUDA when the feature map is small.

Benchmark script:

```
import torch
import torch.nn.functional as F

from timeit import Timer

norm = torch.nn.GroupNorm(8, 512).cuda()

num = 5000

sizes = [(1024, 512, 14, 14), (1024, 512, 7, 7), (1024, 512)]

def forward(x):
    _ = norm(x)
    torch.cuda.synchronize()

def backward(y, grad):
    y.backward(grad, retain_graph=True)
    torch.cuda.synchronize()

if __name__ == "__main__":
    # warm up
    x = torch.rand(*(sizes[0]), dtype=torch.float,
                   device="cuda", requires_grad=True)
    for _ in range(100):
        forward(x)

    for size in sizes:
        x = torch.rand(*size, dtype=torch.float,
                       device="cuda", requires_grad=True)
        t = Timer("forward(x)", "from __main__ import forward, x")
        print(f"size = {size}:")
        t1 = t.timeit(num) / num * 1e6
        print(f"avg_forward_time =  {t1}us")

        y = norm(x)
        grad = torch.randn_like(y)
        t = Timer("backward(y, grad)", "from __main__ import backward, y, grad")
        t2 = t.timeit(num) / num * 1e6
        print(f"avg_backward_time = {t2}us")
```
Benchmark result before this Diff:
```
size = (1024, 512, 14, 14):
avg_forward_time =  1636.729855206795us
avg_backward_time = 5488.682465581223us
size = (1024, 512, 7, 7):
avg_forward_time =  465.88476160541177us
avg_backward_time = 3129.9425506033003us
size = (1024, 512):
avg_forward_time =  96.90486900508404us
avg_backward_time = 2319.4099438143894us
```

Benchmark result after this Diff:
```
size = (1024, 512, 14, 14):
avg_forward_time =  1635.6191572034732us
avg_backward_time = 4140.7730475999415us
size = (1024, 512, 7, 7):
avg_forward_time =  463.6513736099005us
avg_backward_time = 1641.7451039887965us
size = (1024, 512):
avg_forward_time =  66.59087920561433us
avg_backward_time = 128.6882139975205us

```

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "GroupNorm"

Reviewed By: hl475, houseroad

Differential Revision: D24242738

fbshipit-source-id: b52c82d7b6e47855c48fa8ceacd0c55d03bb92d5
2020-10-14 23:34:33 -07:00
75bf5f2b59 [JIT] Improve class type annotation inference (#45940)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45940

**Summary**
In `try_ann_to_type`, if an annotation has an attribute named
`__torch_script_class__`, it is assumed to be a TorchScript class that
has already been scripted. However, if it is a class that extends
another class, this code path causes a crash because it looks up the
JIT type for the class by name in the compilation unit. This JIT type
obviously cannot exist because inheritance is not supported.

This commit fixes this by looking up the qualified name of a class
in torch.jit._state._script_class in order to ascertain whether it has
already been scripted (instead of looking for a `__torch_script_class__`
attribute on the class object).

**Test Plan**
This commit adds a unit test consisting of the code sample from the
issue that reported this problem.

**Fixes**
This commit fixes #45860.

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D24310027

Pulled By: SplitInfinity

fbshipit-source-id: 9f8225f3316fd50738d98e3544bf5562b16425b6
2020-10-14 23:28:47 -07:00
86abc8cd48 [JIT] Make InsertInstruction overflow check a warning instead of fatal (#46369)
Summary:
This diff restores the previous behavior of silently allowing overflow when inserting instructions. The behavior was changed recently in https://github.com/pytorch/pytorch/issues/45382, but that started to break some existing use cases that have overflow problems.

Restoring the original behavior, but throwing a warning, to unblock existing use cases where overflow happens.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46369

Reviewed By: kwanmacher, wanchaol, fbhuba

Differential Revision: D24324345

Pulled By: gmagogsfm

fbshipit-source-id: 1c0fac421d4de38f070e21059bbdc1b788575bdf
2020-10-14 23:09:53 -07:00
5393588e11 Add guideline about which dispatch keyword to use in native_functions.yaml. (#46126)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46126

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D24233887

Pulled By: ailzhang

fbshipit-source-id: 640543494e0d5211f2f910a75fa2e9bdf558f7ce
2020-10-14 22:53:56 -07:00
4aaad88790 Bug fixes in profiling allocator (#45993)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45993

Some bugs were exposed via updated test and validation code.
Also enabled this test to run on CI instead of only as a mobile test.

Test Plan:
cpu_profiling_allocator_test

Imported from OSS

Reviewed By: dzhulgakov

Differential Revision: D24172599

fbshipit-source-id: da0d2e1d1dec87b476bf39a1c2a2ffa0e4b5df66
2020-10-14 22:45:04 -07:00
419dafe791 [Reland] Update native_functions.yaml to add DefaultBackend. (#46236)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46236

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D24273378

Pulled By: ailzhang

fbshipit-source-id: bed1d4c84c0bba88a7da4d9bd2ccaa58253cf91e
2020-10-14 22:37:28 -07:00
22f4a58a45 [pytorch] activation checkpointing: enable mixing tensor without requires_grad (#45934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45934

https://pytorch.org/docs/stable/checkpoint.html PyTorch checkpoint requires all inputs to the function being checkpointed to require grad, but this assumption does not necessarily hold. Consider the following two examples:

```
output = MultiheadedMaskedAtten(input, mask)

output = LSTM(input, seq_length)
```
both length and mask are tensors that won't require grad; currently, if you try to checkpoint, torch.autograd.backward will complain:

```
  File "/mnt/xarfuse/uid-124297/7d159c34-seed-nspid4026531836-ns-4026531840/torch/autograd/function.py
", line 87, in apply
    return self._forward_cls.backward(self, *args)
  File "/mnt/xarfuse/uid-124297/7d159c34-seed-nspid4026531836-ns-4026531840/torch/utils/checkpoint.py"
, line 99, in backward
    torch.autograd.backward(outputs, args)
  File "/mnt/xarfuse/uid-124297/7d159c34-seed-nspid4026531836-ns-4026531840/torch/autograd/__init__.py
", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: element 1 of tensors does not require grad and does not have a grad_fn
```

This diff allows skipping the non-grad-requiring tensors when running autograd.backward.

Added documentation for this feature as well.
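
A hedged sketch of the now-supported pattern (the function is made up; the point is that `mask` does not require grad):

```
import torch
from torch.utils.checkpoint import checkpoint

def masked_sum(x, mask):
    return (x * mask).sum()

x = torch.randn(4, requires_grad=True)
mask = torch.ones(4)                  # requires_grad=False on purpose
out = checkpoint(masked_sum, x, mask)
out.backward()                        # previously raised for the mask input
print(x.grad)
```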

Test Plan: added unit test to make sure partial tensor grads can be used in checkpoint().

Differential Revision: D24094764

fbshipit-source-id: 6557e8e74132d5a392526adc7b57b6998609ed12
2020-10-14 21:28:02 -07:00
103b100ddc Bazel build has warnings (#46233)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46233

Reviewed By: bdhirsh

Differential Revision: D24278560

Pulled By: ezyang

fbshipit-source-id: d7e1c7e97f57f6f0dcf2ff966b795a6d13b07e95
2020-10-14 20:05:34 -07:00
af8c75e211 [PyTorch] Stringize kernel tag names consistently during macro expansion and require all tag names to be a compile time character array (#46074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46074

I found 2 instances where the NAME parameter passed in to the dispatch macros is not a C++ string (constant) (i.e. double quoted compile time string).

In one instance it is a single quoted multi-character constant (I don't know what this resolves to in practice), and in the other instance, it is an unquoted identifier generated by concatenating 2 identifiers using the `##` operator.

In addition, I found 2 instances where the `NAME` of the tag passed in is not a constant character array, but an `std::string` variable instead. I am changing it to a constant character array with the same name as the earlier variable. For the purposes of any code using this data, everything remains the same, since the code was stringizing the value anyway using `#NAME`, so it would get the name of the variable and not the contents.
ghstack-source-id: 113928208

Test Plan: I have a commit (not part of this change set) that attempts to print the `NAME` argument passed in to the various dispatch macros to be able to do some analysis. These weren't expanding correctly for the use cases that are fixed in this diff.

Reviewed By: ezyang

Differential Revision: D24211393

fbshipit-source-id: 28953d9f859315b371a60ae34b19671720209c99
2020-10-14 18:13:59 -07:00
a69910868a Fix possible padding length overflow in DistributedSampler (#45329)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45324

This fix handles cases where `len(dataset) * 2 < num_replicas` in DistributedSampler (for which the previous code resulted in an error).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45329

Reviewed By: mruberry

Differential Revision: D24205035

Pulled By: rohan-varma

fbshipit-source-id: f94329d9c1e7deaee41e5af319e7c7d0c741910c
2020-10-14 17:19:44 -07:00
ff0af7242b Revert D24290811: [quant][eagermode] Move custom_module registration to prepare/convert_custom_config_dict
Test Plan: revert-hammer

Differential Revision:
D24290811 (3ad797c937)

Original commit changeset: 7d2aee98e194

fbshipit-source-id: 24013e92044f2a1b36b1a9f475bbaa6f17bdaa11
2020-10-14 16:42:55 -07:00
a38eeeff5c Make setup.py python 2 friendly (#46317)
Summary:
Import print_function to make setup.py, when invoked by Python 2, print a human-readable error:
```
% python2 setup.py
Python 2 has reached end-of-life and is no longer supported by PyTorch.
```
Also, remove `future` from the list of the PyTorch package install dependencies

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46317

Reviewed By: walterddr, bugra

Differential Revision: D24305004

Pulled By: malfet

fbshipit-source-id: 9181186170562384dd2c0e6a8ff0b1e93508f221
2020-10-14 16:37:06 -07:00
e7e919fc34 Add warning on ProcessGroup and ProcessGroup::Work APIs (#46220)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46220

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24294437

Pulled By: gmagogsfm

fbshipit-source-id: 198f8e5760beeb1d18740f971647d2537afb3dd6
2020-10-14 16:27:37 -07:00
fc1d6bf135 [fx] make sure args/kwargs are immutable (#46325)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46325

Otherwise, mutating them would make the uses/users lists inaccurate.
You can still mutate the node by assigning a new value to .args or .kwargs.
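
A hedged sketch of the reassignment pattern (the traced function and the rewrite are made up):

```
import operator

import torch
import torch.fx as fx

def f(x):
    return x + 1

gm = fx.symbolic_trace(f)
for node in gm.graph.nodes:
    if node.op == "call_function" and node.target is operator.add:
        # node.args is immutable; build a fresh tuple and reassign it.
        node.args = (node.args[0], 2)
gm.recompile()
print(gm(torch.zeros(3)))  # tensor([2., 2., 2.])
```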

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D24308672

Pulled By: zdevito

fbshipit-source-id: a5305e1d82668b36e46876c3bc517f6f1d03dd78
2020-10-14 15:51:43 -07:00
2bc6caa9e4 Add three-phase option to OneCycleLR (#42715)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40362

The new `three_phase` option provides a way of constructing schedules according to the scheme recommended in [Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates](https://arxiv.org/abs/1708.07120).

Note that this change maintains backwards compatibility, and as a result the default behaviour of OneCycleLR remains quite counter-intuitive.
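
A hedged usage sketch (the model and step counts are placeholders):

```
import torch

model = torch.nn.Linear(2, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=0.1, total_steps=100,
    three_phase=True,  # use a third phase to annihilate the learning rate
)
for _ in range(100):
    opt.step()
    sched.step()
```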

vincentqb

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42715

Reviewed By: heitorschueroff

Differential Revision: D24289744

Pulled By: vincentqb

fbshipit-source-id: e4aad87880716bb14613c0aa8631e43b04a93e5c
2020-10-14 15:05:14 -07:00
635aebdfab [quant] Refactoring the mappings files (#44847)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44847

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23747007

Pulled By: z-a-f

fbshipit-source-id: 7d8fcc84a77454cc1479e5158f5a62eda5824a87
2020-10-14 13:15:34 -07:00
b28b5d3c68 [ONNX] Update squeeze test for opset 9 (#45369)
Summary:
Only under static axes does opset 9 support no-op squeeze when dim is not 1.
Updating the test case where it was setting dynamic axes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45369

Reviewed By: anjali411

Differential Revision: D24280180

Pulled By: bzinodev

fbshipit-source-id: d7cda88ab338a1c41a68052831dcebe739a3843c
2020-10-14 12:53:13 -07:00
6ca03aeb96 [ONNX] Fix flatten operator (#45632)
Summary:
Even when dim is None, there are cases where flatten can be exported.
Also enable test_densenet in scripting mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45632

Reviewed By: VitalyFedyunin

Differential Revision: D24116994

Pulled By: bzinodev

fbshipit-source-id: 76da6c073ddf79bba64397fd56b592de850034c4
2020-10-14 12:44:25 -07:00
d655341adb [Distributed] General Function for Parsing Environment Variable Flags in PG (#46045)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46045

PG NCCL functionality differs based on certain binary environment variables such as NCCL_BLOCKING_WAIT and NCCL_ASYNC_ERROR_HANDLING. Previously we had separate helper functions to parse these env vars and set class variables accordingly. This PR introduces a single general-purpose function for parsing them.
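
The helper itself is C++ internal to ProcessGroupNCCL; the idea, sketched in Python for brevity (names are illustrative, not the actual implementation):

```python
import os

def parse_env_var_flag(name):
    # "1" enables the flag; unset or any other value leaves it disabled.
    return os.environ.get(name, "0") == "1"

blocking_wait = parse_env_var_flag("NCCL_BLOCKING_WAIT")
async_error_handling = parse_env_var_flag("NCCL_ASYNC_ERROR_HANDLING")
```
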
ghstack-source-id: 114209823

Test Plan:
Ran the following flow with NCCL_BLOCKING_WAIT set, and ensured the ProcessGroup constructor set blockingWait_ to true: f223454701

Reviewed By: jiayisuse

Differential Revision: D24173982

fbshipit-source-id: b84db2dda29fcf5d163ce8860e8499d5070f8818
2020-10-14 12:21:11 -07:00
3ad797c937 [quant][eagermode] Move custom_module registration to prepare/convert_custom_config_dict (#46293)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46293

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D24290811

fbshipit-source-id: 7d2aee98e1946c2a4268efb94443f1e5daaa793e
2020-10-14 12:10:37 -07:00
2ffb768607 [Distributed] deleteKey support for HashStore (#46049)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46049

Adding support for the deleteKey API in the c10d HashStore.
ghstack-source-id: 113874207

Test Plan:
Added C++ tests to check that the deleteKey function works and that it raises an exception when attempting to delete non-existent keys.
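
A rough illustration of the API from Python, assuming Store bindings that mirror these C++ methods (at the time of this change the tests were C++-only, so the Python spelling is an assumption):

```python
import torch.distributed as dist

store = dist.HashStore()
store.set("alpha", "1")
store.set("beta", "2")

store.delete_key("alpha")  # removes the key
print(store.num_keys())    # -> 1 (num_keys comes from the companion change below)
```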

Reviewed By: jiayisuse

Differential Revision: D24067657

fbshipit-source-id: 4c58dab407c6ffe209585ca91aa430850261b29e
2020-10-14 12:04:42 -07:00
74f13a8b8f [Distributed] Adding getNumKeys support to the HashStore (#46048)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46048

This PR adds support for the getNumKeys API for the HashStore
ghstack-source-id: 113874241

Test Plan: Added C++ tests for the HashStore::getNumKeys

Reviewed By: jiayisuse

Differential Revision: D24067658

fbshipit-source-id: 2db70a90f0ab8ddf0ff03cedda59b45ec987af07
2020-10-14 12:01:22 -07:00
5500b62f28 Enable zero batch conv tests for ROCm (#46305)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/26669

This PR enables convolution tests for zero batch size implemented in https://github.com/pytorch/pytorch/pull/26214/.

jamesr66a jeffdaily sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46305

Reviewed By: navahgar

Differential Revision: D24307981

Pulled By: heitorschueroff

fbshipit-source-id: dfc595fa855ae084b60a693e209b0fdcc714221d
2020-10-14 11:36:30 -07:00
dec61f93f2 [ROCm] update GPG key URL in circleci Dockerfile (#46256)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46256

Reviewed By: mrshenli

Differential Revision: D24308563

Pulled By: heitorschueroff

fbshipit-source-id: 33ef6e5490bdd59e14db4851c03f6df6ce227358
2020-10-14 11:29:55 -07:00
53316e8b97 [quant] Remove prehook option (#46292)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46292

since it is not needed

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D24290815

fbshipit-source-id: 5cc24a305dbdfee5de3419dc83a9c3794d949300
2020-10-14 11:08:38 -07:00
9d389b1dcc [ONNX] Preprocess index_put with bool inputs to masked_scatter/masked_fill (#45584)
Summary:
When the input to an indexing operation is a boolean, for example array[True] = value, the resulting index_put node needs to be converted to a masked_scatter or masked_fill node based on the type of the assigned value. If that value is a single scalar, we use masked_fill; if it is a tensor of the appropriate size, we use masked_scatter.
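
A small sketch of the two cases the pass distinguishes:

```python
import torch

x = torch.zeros(2, 3)
mask = torch.tensor([[True, False, True],
                     [False, True, False]])

x[mask] = 1.0                         # scalar value -> exported as masked_fill
x[mask] = torch.tensor([1., 2., 3.])  # tensor value -> exported as masked_scatter
```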

Fixes https://github.com/pytorch/pytorch/issues/34054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45584

Reviewed By: VitalyFedyunin

Differential Revision: D24116921

Pulled By: bzinodev

fbshipit-source-id: ebd66e06d62e15f0d49c8191d9997f55edfa520e
2020-10-14 10:58:55 -07:00
49903a5cd5 [quant][graphmode][fx] Move custom_module_class config to prepare/convert_custom_config_dict (#46251)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46251

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D24290810

fbshipit-source-id: 7a96f04a0f33f0315943ac18ef2d08e4f5a5d1c0
2020-10-14 10:43:48 -07:00
e7dbaa252e Update optim.rst for better understanding (#45944)
Summary:
The variable `i` in `Line 272` is ambiguous; it should be named `epoch` instead.
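
For reference, the loop reads more clearly with the rename (the scheduler choice here is arbitrary):

```python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5)

for epoch in range(20):  # `epoch` instead of `i` makes the unit explicit
    optimizer.step()
    scheduler.step()
```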

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45944

Reviewed By: agolynski

Differential Revision: D24219486

Pulled By: vincentqb

fbshipit-source-id: 2af0408594613e82a1a1b63971650cabde2b576e
2020-10-14 09:36:06 -07:00
1f791c06f0 adding BAND/BOR/BXOR reduce ops to unsupported list for complex numbers. added tests (#46270)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46270

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24284702

Pulled By: bdhirsh

fbshipit-source-id: 7e6c3fce83a4367808a638f0400999399b2c35b0
2020-10-14 08:48:14 -07:00
8a074af929 Added scalar lists APIs for addcdiv and addcmul (#45932)
Summary:
1) Added new APIs:
 _foreach_addcdiv(Tensor(a!)[] self, Tensor[] tensor1, Tensor[] tensor2, float[] scalars)
 _foreach_addcdiv_(Tensor(a!)[] self, Tensor[] tensor1, Tensor[] tensor2, float[] scalars)
 _foreach_addcmul(Tensor(a!)[] self, Tensor[] tensor1, Tensor[] tensor2, float[] scalars)
 _foreach_addcmul_(Tensor(a!)[] self, Tensor[] tensor1, Tensor[] tensor2, float[] scalars)

2) Updated optimizers to use new APIs

Tested via unit tests
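
A quick sketch of the new calls (shapes and scalar values are arbitrary):

```python
import torch

params = [torch.randn(3) for _ in range(2)]
t1s = [torch.randn(3) for _ in range(2)]
t2s = [torch.rand(3) + 0.1 for _ in range(2)]  # keep denominators nonzero

# Out-of-place: returns a new list of tensors.
out = torch._foreach_addcdiv(params, t1s, t2s, [0.1, 0.2])

# In-place: params[i] += scalars[i] * t1s[i] / t2s[i]
torch._foreach_addcdiv_(params, t1s, t2s, [0.1, 0.2])
```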

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45932

Reviewed By: navahgar

Differential Revision: D24150306

Pulled By: izdeby

fbshipit-source-id: c2e65dedc95d9d81a2fdd116e41df0accb0b6f26
2020-10-14 08:12:37 -07:00
f2e5ae4ba2 Undefine bool and vector after including altivec.h (#46179)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46179

Reviewed By: bdhirsh

Differential Revision: D24258470

Pulled By: glaringlee

fbshipit-source-id: f9d3589a30ed396cb88404d3471788aed8dea237
2020-10-14 07:52:51 -07:00
45de2ee3ac Remove Python version upper boundary check (#46315)
Summary:
This prevents setup.py from erroring out when Python-3.9 is used

Fixes https://github.com/pytorch/pytorch/issues/46314

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46315

Reviewed By: heitorschueroff

Differential Revision: D24304846

Pulled By: malfet

fbshipit-source-id: 573a88ea8c1572d7d8a9991539effb3c228bffc9
2020-10-14 07:36:55 -07:00
69e152e60b Fix device guard for c10-full ops (#46091)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46091

ghstack-source-id: 114269274

Test Plan:
vs prev diff: https://www.internalfb.com/intern/fblearner/details/224487971/

vs D23328718 (6ba6ecb048) : https://www.internalfb.com/intern/fblearner/details/224488043/

Reviewed By: ezyang

Differential Revision: D24219943

fbshipit-source-id: bbabafb5c5b76ce0e93df4fdae2f08221354d9f7
2020-10-14 06:32:43 -07:00
4534bf5799 Fix NativeFunctions.h for c10-full ops (#46090)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46090

ghstack-source-id: 114269272

Test Plan: vs base diff: https://www.internalfb.com/intern/fblearner/details/223884639/

Reviewed By: ezyang

Differential Revision: D24219942

fbshipit-source-id: 6f338c7c0dd5adfe2fba8b36ccc340032d3faef8
2020-10-14 06:32:36 -07:00
84771fc64f [caffe2] Add 10s deadline for all Caffe2 hypothesis fuzz tests
Test Plan: CI

Reviewed By: walterddr

Differential Revision: D24298118

fbshipit-source-id: 2286c1e37ed9c43f404b888386c0bd4b0b6a55c6
2020-10-14 06:30:09 -07:00
62d37b9f26 add size_based_partition final (#46282)
Summary:
Reopens the PR: https://github.com/pytorch/pytorch/pull/45837
This PR adds a new feature to the Partitioner() class called size_based_partition. Given a list of devices with the same memory size, this function distributes graph nodes across the devices. To implement this feature, several helper functions are created in Partitioner.py and GraphManipulation.py.
A unit test is also added in test/test_fx_experimental.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46282

Reviewed By: gcatron

Differential Revision: D24288470

Pulled By: scottxu0730

fbshipit-source-id: e81b1e0c56e34f61e497d868882126216eba7538
2020-10-14 03:44:05 -07:00
b64cf93f05 [jit] support tracing tensor __setitem__ with dynamic shape (#45828)
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/43548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45828

Test Plan: buck test mode/dev-nosan //caffe2/test:jit -- 'test_trace_slice' --jobs 1

Reviewed By: bdhirsh

Differential Revision: D24106641

Pulled By: ppwwyyxx

fbshipit-source-id: 8036c9819c9816e040796dac8f9c98bd33ce80a8
2020-10-14 02:52:57 -07:00
38e64cf949 Revert D24232288: [fx] make sure args/kwargs are immutable
Test Plan: revert-hammer

Differential Revision:
D24232288 (61df99b78e)

Original commit changeset: c95b1a73ae55

fbshipit-source-id: b910a6618f76ef64caead20e8207997317bc2f5e
2020-10-14 01:39:33 -07:00
d790ec6de0 [JIT] Update comment in jit_log.h. (#46301)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46301

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24295281

Pulled By: ZolotukhinM

fbshipit-source-id: a4f84c773029845065895a81f9d753a9c82a99e0
2020-10-13 23:42:28 -07:00
d22455128f [dispatcher] avoid autograd fixup step on non-backend keys (#46135)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46135

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D24235974

Pulled By: bhosmer

fbshipit-source-id: 21215b31146673caae904bb82395858419641633
2020-10-13 23:33:15 -07:00
61df99b78e [fx] make sure args/kwargs are immutable (#46121)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46121

Otherwise, mutating them would make the uses/users lists inaccurate.
You can still mutate the node by assigning a new value to .args or .kwargs.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D24232288

Pulled By: zdevito

fbshipit-source-id: c95b1a73ae55ad9bdb922ca960c8f744ff732100
2020-10-13 21:33:19 -07:00
965046c445 [NCCL] Provide additional information about NCCL error codes. (#45950)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45950

A pain point for debugging failed training jobs due to NCCL errors has
been understanding the source of the error, since NCCL does not itself report
too many details (usually just "unhandled {system, cuda, internal} error").

In this PR, we add some basic debug information about what went wrong. The information is collected by grepping the NCCL codebase for when these errors are thrown. For example, `ncclSystemError` is what is thrown when system calls such as malloc or munmap fail.

Tested by forcing `result = ncclSystemError` in the macro. The new error
message looks like:

```RuntimeError: NCCL error in:
caffe2/torch/lib/c10d/ProcessGroupNCCL.cpp:759, unhandled system error, NCCL
version 2.7.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
```

The last line is what we have added to the message.

In the future, we will also evaluate setting NCCL_DEBUG=WARN, by which NCCL provides more details about errors as well.
ghstack-source-id: 114219288

Test Plan: CI

Reviewed By: mingzhe09088

Differential Revision: D24155894

fbshipit-source-id: 10810ddf94d6f8cd4989ddb3436ddc702533e1e1
2020-10-13 21:18:20 -07:00
f7398759b4 Only populate grad accumulator to var mapping for find_unused_parameters=True in DDP (#45942)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45942

We only need to keep track of this for traversing the autograd graph
when find_unused_parameters=True. Without that, we populate and keep this
mapping in memory, which occupies sizeof(pointer) * number of grad accumulators
of extra memory.
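
The flag that triggers the extra bookkeeping (a single-process gloo setup, shown for illustration only):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(10, 10)
# The grad-accumulator-to-variable map is now built only in this case:
ddp = DDP(model, find_unused_parameters=True)
```
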
ghstack-source-id: 114219289

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D24154407

fbshipit-source-id: 220d723e262f36590a03a3fd2dab47cbfdb87d40
2020-10-13 21:12:59 -07:00
31bcd96395 Parallelize the quantization conversion operators (#45536)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45536

Quantization conversion/reverse-conversion operators will be used in the critical serving path.

The operators can make use of aten::parallel to parallelize the rowwise quantization of tensors.

Overall, I see a 20-25% improvement with the parallelization optimization added here.

The following results are from running the benchmark on my `devvm`. I have requested a dedicated machine and will post benchmark results again.

Easier view to compare results  https://our.intern.facebook.com/intern/diffing/?paste_number=143973933

Baseline results are based on D23675777 (677a59dcaa)
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 10.782

# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 17.443

# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 25.898

# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 13.903

# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 18.575

# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 30.650

# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 14.158

# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 19.818

# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 30.852

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 47.596

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 91.025

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 131.425

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 12.637

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 20.856

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 33.944

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 21.181

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 34.213

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 59.622
```

Results with the parallelization

```
# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 8.852

# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 13.594

# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 20.120

# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 12.049

# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 20.710

# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 23.320

# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 11.998

# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 15.972

# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 23.619

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 30.764

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 50.969

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 129.960

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 10.797

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 15.767

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 27.032

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 16.521

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 26.050

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 45.231
```

Test Plan:
1. buck test //caffe2/test:quantization -- 'test_embedding_bag*'  --print-passing-details

2. Ran benchmarks with ```buck build mode/opt caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test; ./buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/qembedding_pack_test.par```

Reviewed By: qizzzh

Differential Revision: D24002456

fbshipit-source-id: 23b9b071b2ce944704b2582be40d0aaaaeceb298
2020-10-13 20:46:58 -07:00
d5ca53c955 Performance fix for torch.cat operator on ROCm (#46097)
Summary:
This pull request is a partial revert of https://github.com/pytorch/pytorch/pull/44833 for ROCm to fix the performance of the concatenate operator. The changes only affect execution on ROCm and are guarded by the define `__HIP_PLATFORM_HCC__`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46097

Test Plan:
Benchmark
`python -m pt.cat_test --tag_filter all --device cuda`

Results on ROCm before the PR:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 10828.314

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 11888.028

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 11898.945

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 11787.744

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 11792.479

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 11769.718

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f989e5c2510>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f989e5c2510>, 111, 65], N: 5, dim: 0, device: cuda
Forward Execution Time (us) : 11633.882

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7f989e5c2620>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7f989e5c2620>, 64], N: 5, dim: 1, device: cuda
Forward Execution Time (us) : 11617.768

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7f96eee4df28>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7f96eee4df28>], N: 5, dim: 2, device: cuda
Forward Execution Time (us) : 11625.143

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f96ef874048>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f96ef874048>, 32, 64], N: 50, dim: 0, device: cuda
Forward Execution Time (us) : 13079.204

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7f96ef8740d0>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7f96ef8740d0>, 64], N: 50, dim: 1, device: cuda
Forward Execution Time (us) : 13095.620

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7f96ef874158>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7f96ef874158>], N: 50, dim: 2, device: cuda
Forward Execution Time (us) : 13403.086

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 118.704

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
Forward Execution Time (us) : 263.273

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
Forward Execution Time (us) : 463.024

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f96ef8741e0>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f96ef8741e0>], N: 100, dim: 0, device: cuda
Forward Execution Time (us) : 23818.032

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f96ef874268>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f96ef874268>], N: 1000, dim: 0, device: cuda
Forward Execution Time (us) : 234778.296

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f96ef8742f0>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f96ef8742f0>], N: 2000, dim: 0, device: cuda
Forward Execution Time (us) : 470288.132

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f96ef874378>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f96ef874378>], N: 3000, dim: 0, device: cuda
Forward Execution Time (us) : 704361.221
```

Results on ROCm after the PR:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 29.292

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 46.320

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 36.969

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 92.816

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 93.943

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 163.914

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1da3186510>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1da3186510>, 111, 65], N: 5, dim: 0, device: cuda
Forward Execution Time (us) : 75.475

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7f1da3186620>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7f1da3186620>, 64], N: 5, dim: 1, device: cuda
Forward Execution Time (us) : 68.880

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7f1bf3c50f28>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7f1bf3c50f28>], N: 5, dim: 2, device: cuda
Forward Execution Time (us) : 85.268

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bf4669048>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bf4669048>, 32, 64], N: 50, dim: 0, device: cuda
Forward Execution Time (us) : 111.543

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7f1bf46690d0>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7f1bf46690d0>, 64], N: 50, dim: 1, device: cuda
Forward Execution Time (us) : 110.644

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7f1bf4669158>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7f1bf4669158>], N: 50, dim: 2, device: cuda
Forward Execution Time (us) : 116.201

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 117.708

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
Forward Execution Time (us) : 264.953

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
Forward Execution Time (us) : 480.304

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bf46691e0>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bf46691e0>], N: 100, dim: 0, device: cuda
Forward Execution Time (us) : 116.385

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bf4669268>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bf4669268>], N: 1000, dim: 0, device: cuda
Forward Execution Time (us) : 913.591

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bf46692f0>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bf46692f0>], N: 2000, dim: 0, device: cuda
Forward Execution Time (us) : 2003.212

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bf4669378>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bf4669378>], N: 3000, dim: 0, device: cuda
Forward Execution Time (us) : 3004.174
```

Reviewed By: bdhirsh

Differential Revision: D24286324

Pulled By: malfet

fbshipit-source-id: 291f3f3f80f9d2f9ba52a455a942f3fb0406e7d2
2020-10-13 19:22:35 -07:00
09842a44fa [FX] Allow tracing free functions (#46268)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46268

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Differential Revision: D24283019

Pulled By: jamesr66a

fbshipit-source-id: 938322e13a16386ac931a666f4eecfc4d9c68a5a
2020-10-13 19:18:04 -07:00
ac3f23deb0 Fixed usage of std::move function (#46199)
Summary:
Removed std::move in situations where a move wasn't actually possible (so std::move didn't move anything but created a copy instead).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46199

Reviewed By: bdhirsh

Differential Revision: D24287408

Pulled By: glaringlee

fbshipit-source-id: f88b9500e7bbaa709bff62b845966e2adc7fa588
2020-10-13 19:13:30 -07:00
173363f31a Use tensor's quantized properties directly in pickler (#46267)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46267

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D24283008

Pulled By: iseeyuan

fbshipit-source-id: 76c8410d428a5fc487381e65a9f3a789a9f04eb0
2020-10-13 19:05:52 -07:00
1fcec6e72b [caffe2] Add operator schema for FP16SparseNorm (#46300)
Summary:
Fixes regression introduced by https://github.com/pytorch/pytorch/pull/45551
Also fixes signed-unsigned comparison warnings in test/cpp/tensorexpr/test_train_impl.cpp.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46300

Reviewed By: walterddr

Differential Revision: D24294821

Pulled By: malfet

fbshipit-source-id: 16bffa71ec0d2d38208855223a3c5efb18414ab5
2020-10-13 18:58:23 -07:00
f89498f3f8 Allow RPC framework to use rank in addition to WorkerInfo and name. (#46221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46221

The RPC framework only allowed sending RPCs based on provided
WorkerInfo or name. When using RPC with DDP, sometimes it might just be easier
to refer to everything in terms of ranks since DDP doesn't support names yet.

As a result, it would be helpful for the `to` parameter in the RPC APIs to also accept a rank.
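
Usage sketch (assumes `rpc.init_rpc` has already been called on at least two workers, so rank 1 exists):

```python
import torch
import torch.distributed.rpc as rpc

# `to` may now be a rank, in addition to a worker name or WorkerInfo:
ret = rpc.rpc_sync(to=1, func=torch.add, args=(torch.ones(2), 3))
```
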
ghstack-source-id: 114207172

Test Plan:
1) waitforbuildbot
2) Unit Tests

Reviewed By: mrshenli

Differential Revision: D24264989

fbshipit-source-id: 5edf5d92e2bd2f213471dfe7c74eebfa9efc9f70
2020-10-13 17:52:54 -07:00
e1c9aa918a Reformat ivalue_inl.h and ivalue.h (#46174)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46174

Want to separate the real changes in this file from noisy reformatting changes, so check in this reformatting first.

Test Plan: N/A

Reviewed By: pritamdamania87

Differential Revision: D24246841

fbshipit-source-id: 50bb671b0a2feab38acaa4fc171608e379fc92e9
2020-10-13 16:31:54 -07:00
952dc7ed87 [NCCL] Fix Hang in Async Error Handling due to Work logging (#46265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46265

tl;dr - we must remove tensor-related logging from the
WorkNCCL::operator<< function; otherwise, printing the work objects tracked in
the workMetaList_ will cause segfaults.

The Work objects we track in the workMetaList for the NCCL Async Error
Handling mechanism don't have any `outputs_`. As described in the workEnqueue
function, destructing the output tensors calls into autograd_meta, which
happens in the user thread, but our system destructs work objects in the
workCleanupThread, so this could lead to a deadlock scenario. We avoid this
problem by not tracking the tensors in the work objects in the workMetaList
(it's called work meta list because these work objects only track the metadata
and not the actual tensors), so when the WorkNCCL::operator<< function tried to
log tensor shapes for work objects from the watchdog thread, the async error
handling mechanism hung (in the desync test) or segfaulted (in the desync
flow). This PR removes the tensor-related logging from the operator<< function.
ghstack-source-id: 114192929

Test Plan: Verified that this fixes the desync test and desync flow.

Reviewed By: jiayisuse

Differential Revision: D24268204

fbshipit-source-id: 20ccb8800aa3d71a48bfa3cbb65e07ead42cd0dc
2020-10-13 16:23:56 -07:00
b1d24dded1 make a way to disable callgrind (#46116)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46116

Ideally I would just use one of the existing preprocessor flags such as `FBCODE_CAFFE2`, but this implies a whole bunch of other things elsewhere, so it is not really a solution for ovrsource.

Test Plan: CI green, we are able to disable it internally with `-DNVALGRIND`

Reviewed By: malfet

Differential Revision: D24227360

fbshipit-source-id: 24a3b393cf46d6a16acca0a9ec52610d4bb8704f
2020-10-13 16:18:04 -07:00
6ef41953e6 [RFC] Generate generated_unboxing_wrappers_everything.cpp for unboxing wrappers codegen to aid debugging (#45872)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45872

`VariableType_N.cpp` is generated in a sharded manner to speed up compilation time. Same for `generated_unboxing_wrappers_N.cpp`. However, `VariableTypeEverything.cpp` exists, but `generated_unboxing_wrappers_everything.cpp` does not. These files have all the registration/implementation code in them for easier debugging of codegen logic.

This diff adds `generated_unboxing_wrappers_everything.cpp`.

ghstack-source-id: 113606771

Test Plan: Build + CI

Reviewed By: iseeyuan

Differential Revision: D24124405

fbshipit-source-id: 1f6c938105e17cd4b14502978483a1b178c777dd
2020-10-13 15:44:09 -07:00
5c67cc7a9e [caffe2] Enable fp16 for SparseNormalize op (#45551)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45551

The FP16 version of SparseNormalize op in Caffe2 is missing. This Diff adds FP16 support to unblock MC process of adding FP16 to Dper3.

Check https://fb.quip.com/L0T2AXGwUY3n#EReACAeifk3 .

One open question is whether the pure-FP16 SparseNormalize op will affect accuracy; perhaps we should do the computation in the FP32 domain.
ghstack-source-id: 114184398

Test Plan:
```
 buck run mode/opt //caffe2/caffe2/python/operator_test:sparse_normalize_test
```

```
buck run mode/opt -c python.package_style=inplace mode/no-gpu //caffe2/caffe2/python/benchmarks:sparse_normalize_benchmark -- --fp16
```

Reviewed By: jspark1105

Differential Revision: D24005618

fbshipit-source-id: 8b918ec4063fdaafa444779b95206ba2b7b38537
2020-10-13 15:35:22 -07:00
2118d58d45 Add some more docs to expecttest. (#46263)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46263

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: robieta

Differential Revision: D24281640

Pulled By: ezyang

fbshipit-source-id: 88c5b3bf091f47b69ce58aa321669158c5afda79
2020-10-13 15:17:11 -07:00
1c3e335c4b [pytorch][glow][NNPI] Using int32 as indices for embedding_bag operators (#45878)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45878

Support int32 as indices and offsets for embedding_bag_byte|4bit_rowwise_offsets, to avoid costly casting operators such as `aten::to`.
Currently we don't make the assumption that indices and offsets should be the same type, which should not be a problem since downstream fbgemm supports either case.

Test Plan:
```
buck test mode/dev caffe2/test:quantization -- --stress-runs 100 test_embedding_bag
```

Reviewed By: radkris-git

Differential Revision: D23854367

fbshipit-source-id: 6758a4252b36a7fe2890f37d38d66f20651e850e
2020-10-13 15:08:39 -07:00
a37f2749cd Avoid computing AutogradKey if not needed. (#46252)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46252

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D24272744

fbshipit-source-id: 6cb66d13e6c910df1ad1a8badd43f990e7b55368
2020-10-13 15:01:55 -07:00
ac245f6b45 Complex autograd doc fix (#46258)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46258

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D24286512

Pulled By: anjali411

fbshipit-source-id: 60bc98d69336101c0d8fe5ab542b9757b5e7faac
2020-10-13 14:36:50 -07:00
67a0c0af27 [quant][fx][graphmode] Add prepare_custom_config_dict and convert_custom_config_dict (#46223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46223

Also move standalone module config to the prepare_custom_config_dict

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24266900

fbshipit-source-id: fe3ff5b8c657af3f377041e7881d400938e044f8
2020-10-13 14:19:49 -07:00
dac680721c Automated submodule update: FBGEMM (#46271)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: a570f94657

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46271

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D24285369

fbshipit-source-id: a5928251ec8386891d31d2f88193aa97e4ad715f
2020-10-13 13:15:40 -07:00
95ccf34fb9 [quant][graph][fix] Set type for GetAttr nodes in remapTypes (#46250)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46250

Previously the type of GetAttr nodes was getting set incorrectly and wasn't matching the module type

Test Plan:
Existing quantization tests

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D24279872

fbshipit-source-id: 2b2e3027f6e9ad8ba9e9b7937bd5cc5daaf6e17c
2020-10-13 12:59:28 -07:00
7b7f2519d9 Use storage.cpu() for moving storage to CPU in serialization. (#46028)
Summary:
As reported in https://github.com/pytorch/pytorch/issues/46020, something seems to go wrong with the storage._write_file method used with a BytesIO and a GPU buffer.
Given that we were going to create the intermediate buffer (currently via BytesIO) anyway, we might as well use storage.cpu() to move the storage to the CPU. This appears to work better.

This is a hot fix; further investigation is highly desirable. In particular, I don't have a reproducing test to show.

Fixes https://github.com/pytorch/pytorch/issues/46020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46028

Reviewed By: bdhirsh

Differential Revision: D24194370

Pulled By: gchanan

fbshipit-source-id: 99d463c4accb4f1764dfee42d7dc98e7040e9ed3
2020-10-13 12:51:10 -07:00
fc846db667 .circleci: Fix android publish snapshot job (#46266)
Summary:
The android publish snapshot job was failing since it wasn't utilizing
the new docker tagging system

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46266

Reviewed By: walterddr

Differential Revision: D24282375

Pulled By: seemethere

fbshipit-source-id: 58e6ca80bda0b81b09f8614b9ccec764a2f26b49
2020-10-13 11:35:30 -07:00
5604997b09 [quant][refactor] Alphabetize the entries in the quantized import (#46218)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46218

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24264414

Pulled By: z-a-f

fbshipit-source-id: 6d6fb8cc0e1ab28c64fa16dd343ff8f540ccf773
2020-10-13 11:24:38 -07:00
faa9c22a51 Support pytest for distribution testing (#45648)
Summary:
In response to https://github.com/pytorch/pytorch/issues/11578. This is a test run to see if CI (and other internal systems) work fine with pytest-style tests.
 - Creates a separate `distributions` directory within `test`.
 - For testing, this rewrites the `constraint` tests as parameterized tests in pytest. I don't plan to convert any other tests to pytest style, but only expose this option for adding new tests, if required.

If this is a success, we can move `EXAMPLES` in `test_distributions` into a separate file that can be imported by both pytest and unittest style tests. cc. fritzo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45648

Reviewed By: ezyang, colesbury

Differential Revision: D24080248

Pulled By: neerajprad

fbshipit-source-id: 1f2e7d169c3c291a3051d0cece17851560fe9ea9
2020-10-13 10:56:50 -07:00
ad376f1a62 trying to make pow work for tensor raised to the power of a scalar (#46185)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46037

I'm not sure this is the most performant solution, but this works:

torch.pow(cuda_tensor, 5) should work and worked before.
torch.pow(cuda_tensor, torch.tensor(5)) should work **and works now!**
torch.pow(cuda_tensor, torch.tensor((5,))) should NOT work; it should complain that the tensors are on different devices, and it indeed continues to complain.
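
The same three cases as code:

```python
import torch

t = torch.randn(4, device="cuda")

torch.pow(t, 5)                # worked before, still works
torch.pow(t, torch.tensor(5))  # 0-dim CPU exponent: works after this fix

try:
    torch.pow(t, torch.tensor((5,)))  # 1-element CPU tensor: different device
except RuntimeError as e:
    print(e)  # still complains, as intended
```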

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46185

Reviewed By: glaringlee, malfet

Differential Revision: D24257687

Pulled By: janeyx99

fbshipit-source-id: 2daf235d62ec5886d7c153da05445c2ec71dec98
2020-10-13 10:14:36 -07:00
1a57b390e8 Add torch._foreach_maximum(TensorList, TensorList) & torch._foreach_minimum(TensorList, TensorList) APIs (#45692)
Summary:
- Adding torch._foreach_maximum(TensorList, TensorList) API
- Adding torch._foreach_minimum(TensorList, TensorList) API
- Updated Adam/AdamW optimizers

Tested via unit tests
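
A quick sketch of the elementwise semantics of the new APIs:

```python
import torch

xs = [torch.tensor([1., 5.]), torch.tensor([2., 2.])]
ys = [torch.tensor([3., 3.]), torch.tensor([0., 4.])]

torch._foreach_maximum(xs, ys)  # [tensor([3., 5.]), tensor([2., 4.])]
torch._foreach_minimum(xs, ys)  # [tensor([1., 3.]), tensor([0., 2.])]
```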

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45692

Reviewed By: anjali411

Differential Revision: D24142464

Pulled By: izdeby

fbshipit-source-id: 6a4fc343a1613cb1e26c8398450ac9cea0a2eb51
2020-10-13 09:22:30 -07:00
5741de883a Define the record_stream method in native_functions.yaml (#44301)
Summary:
The record_stream method was hard-coded for the CUDA device. This defines record_stream in native_functions.yaml to enable dynamic dispatch to different backend devices.
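
For context, the (unchanged) user-facing call on CUDA looks like this; the change only affects how it dispatches:

```python
import torch

x = torch.randn(8, device="cuda")
s = torch.cuda.Stream()
with torch.cuda.stream(s):
    y = x * 2
# Tell the caching allocator that x is in use on stream s, so its
# memory is not reclaimed/reused until the work queued on s finishes.
x.record_stream(s)
```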

Fixes https://github.com/pytorch/pytorch/issues/36556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44301

Reviewed By: glaringlee

Differential Revision: D23763954

Pulled By: ezyang

fbshipit-source-id: e6d24f5e7892b56101fa858a6cad2abc5cdc4293
2020-10-13 09:15:22 -07:00
d705083c2b Refactor dispatcher and native to use Signature structure. (#45990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45990

In #45890 we introduced the concept of a CppSignature, which bundled
up all of the information necessary to declare a C++ signature for
the cpp API.  This PR introduces analogous concepts for dispatcher
and native: DispatcherSignature and NativeSignature.

The three interfaces are not particularly well coupled right now,
but they do have some duck typing coincidences:

- defn() which renders the C++ definition "bool f(int x)"
- decl() which renders the C++ declaration "bool f(int x = 2)"
- type() which renders the C++ function type "bool(int)"

Maybe at some point we'll introduce a Protocol, or a supertype.
Many other methods (like arguments()) have varying types.  These
signatures also have some helper methods that forward back to real
implementations in the api modules.  Something to think about is
whether or not we should attempt to reduce boilerplate here or
not; I'm not too sure about it yet.

The net effect is we get to reduce the number of variables we
have to explicitly write out in the codegen, since now these are all
bundled together into a signature.  Something extra special happens
in BackendSelect, where we now dynamically select between dispatcher_sig
and native_sig as "how" the backend select is implemented.

A little bit of extra cleanup:
- Some places where we previously advertised Sequence, we now advertise
  a more informative Tuple.
- defn() may take an optional positional parameter overriding the entire
  name, or a kwarg-only prefix parameter to just add a prefix to the
  name.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D24223100

Pulled By: ezyang

fbshipit-source-id: f985eced08af4a60ba9641d125d0f260f8cda9eb
2020-10-13 08:34:48 -07:00
f086032676 Remove unnecessary byte-for-byte compatibility code that is not needed. (#45975)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45975

I reordered declarations in the faithful API reimplementation to
make sure the diffs lined up nicely; they're not necessary now.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D24223102

Pulled By: ezyang

fbshipit-source-id: 77c6ae40c9a3dac36bc184dd6647d6857c63a50c
2020-10-13 08:34:46 -07:00
8d5c899b19 Rename legacy_dispatcher to native. (#45974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45974

The term "legacy dispatcher" caused a bunch of confusion between
me and Sebastian when discussing what the intended semantics of
legacy dispatcher argument is.  Legacy dispatcher argument implies
that you ought NOT to use it when you have use_c10_dispatcher: full;
but that's not really what's going on; the legacy dispatcher API describes
the API that native:: functions (NativeFunctions.h) are written against.
Renaming it here makes this more clear.

I applied these seds:

```
git grep -l 'legacy_dispatcher' | xargs sed -i 's/legacy_dispatcher/native/g'
git grep -l 'legacydispatcher' | xargs sed -i 's/legacydispatcher/native/g'
git grep -l 'LegacyDispatcher' | xargs sed -i 's/LegacyDispatcher/Native/g'
```

and also grepped for "legacy" in tools/codegen and fixed documentation.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D24223101

Pulled By: ezyang

fbshipit-source-id: d1913b8b823b3b95e4546881bc0e876acfa881eb
2020-10-13 08:34:43 -07:00
527a8bee02 Reorder dispatcher/legacy_dispatcher types (#45973)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45973

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D24163527

Pulled By: ezyang

fbshipit-source-id: 2631a2ccd7ab525fe32fa56192ded4ff7ac3723f
2020-10-13 08:34:39 -07:00
944eb0e31d Add NativeFunctionGroup (#45918)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45918

This groups together related native functions (functional, inplace, out)
into a single group.  It's not used by anything but Jiakai said this
would be useful for his stuff so I'm putting it in immediately.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D24163526

Pulled By: ezyang

fbshipit-source-id: 9979b0fe9249c78e4a64a50c5ed0e2ab99f499b9
2020-10-13 08:34:36 -07:00
9079aea1ac Rewrite implementation of faithful cpp signatures (#45890)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45890

This rewrite is as per my comments at https://github.com/pytorch/pytorch/pull/44087#issuecomment-701664506
I did the rewrite by reverting #44087 and then reimplementing it on top.
You may find it easier to review by diffing against master with only #44087
reverted.

There are two main ideas.

First, we now factor cpp argument processing into two phases operating
on three representations of data:

1. `FunctionSchema` - this is the source from native_functions.yaml
2. `Union[Argument, ThisArgument, TensorOptionsArgument]` - this is
   the arguments after doing some basic semantic analysis to group
   them (for TensorOptions) or identify the this argument (if this
   is a method).  There is only ever one of these per functions.
3. `Union[CppArgument, CppThisArgument, CppTensorOptionsArgument]` -
   this is the arguments after we've elaborated them to C++.  There
   may be multiple of these per actual C++ signature.

You can think of (2) as common processing, whereas (3) bakes in specific
assumptions about whether or not you have a faithful or non-faithful
signature.

Second, we now have CppSignature and CppSignatureGroup representing
the *total* public C++ API signature.  So those dataclasses are what
know how to render definitions/declarations, and you no longer have
to manually type it out in the Functions/TensorMethods codegen.

Here is an exhaustive accounting of the changes.

tools.codegen.api.types

- CppSignature and CppSignatureGroup got moved to tools.codegen.api.types
- Add new CppThisArgument and CppTensorOptionsArguments (modeled off
  of ThisArgument and TensorOptionsArguments) so that we can retain
  high level semantic structure even after elaborating terms with C++
  API information.  Once this is done, we can refine
  CppArgument.argument to no longer contain a ThisArgument (ThisArgument
  is always translated to CppThisArgument.  Note that this doesn't
  apply to TensorOptionsArguments, as those may be expanded or not
  expanded, and so you could get a single CppArgument for 'options')
- Add no_default() functional mutator to easily remove default arguments
  from CppArgument and friends
- Add an explicit_arguments() method to CppArgument and friends to
  extract (flat) argument list that must be explicitly written in the signature.
  This is everything except (Cpp)ThisArgument, and is also convenient
  when you don't care about the extra structure of
  CppTensorOptionsArguments

tools.codegen.api.cpp

- group_arguments is back, and it doesn't send things directly to a
  CppSignatureGroup; instead, it moves us from representation (1) to (2)
  (perhaps it should live in model).  Here I changed my mind from my
  PR comment; I discovered it was not necessary to do classification at
  grouping time, and it was simpler and easier to do it later.
- argument got split into argument_not_this/argument/argument_faithful.
  argument and argument_faithful are obvious enough what they do,
  and I needed argument_not_this as a more refined version of argument
  so that I could get the types to work out on TensorOptionsArguments

tools.codegen.api.dispatcher

- Here we start seeing the payoff.  The old version of this code had a
  "scatter" mode and a "gather" mode.  We don't need that anymore:
  cppargument_exprs is 100% type-directed via the passed in cpp
  arguments.  I am able to write the functions without any reference
  to use_c10_dispatcher

tools.codegen.gen

- Instead of having exprs_str and types_str functions, I moved these to
  live directly on CppSignature, since it seemed pretty logical.
- The actual codegen for TensorMethods/Functions is greatly simplified,
  since (1) all of the heavy lifting is now happening in
  CppSignature(Group) construction, and (2) I don't need to proxy one
  way or another, the new dispatcher translation code is able to handle
  both cases no problem.  There is a little faffing about with ordering
  to reduce the old and new diff which could be removed afterwards.

Here are codegen diffs.  For use_c10_dispatcher: full:

```
+// aten::_cudnn_init_dropout_state(float dropout, bool train, int dropout_seed, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=False) -> Tensor
 Tensor _cudnn_init_dropout_state(double dropout, bool train, int64_t dropout_seed, const TensorOptions & options) {
-    return _cudnn_init_dropout_state(dropout, train, dropout_seed, optTypeMetaToScalarType(options.dtype_opt()), options.layout_opt(), options.device_opt(), options.pinned_memory_opt());
+    static auto op = c10::Dispatcher::singleton()
+        .findSchemaOrThrow("aten::_cudnn_init_dropout_state", "")
+        .typed<Tensor (double, bool, int64_t, c10::optional<ScalarType>, c10::optional<Layout>, c10::optional<Device>, c10::optional<bool>)>();
+    return op.call(dropout, train, dropout_seed, optTypeMetaToScalarType(options.dtype_opt()), options.layout_opt(), options.device_opt(), options.pinned_memory_opt());
 }
```

Otherwise:

```
+// aten::empty_meta(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor
 Tensor empty_meta(IntArrayRef size, c10::optional<ScalarType> dtype, c10::optional<Layout> layout, c10::optional<Device> device, c10::optional<bool> pin_memory, c10::optional<MemoryFormat> memory_format) {
-    return empty_meta(size, TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory), memory_format);
+    static auto op = c10::Dispatcher::singleton()
+        .findSchemaOrThrow("aten::empty_meta", "")
+        .typed<Tensor (IntArrayRef, const TensorOptions &, c10::optional<MemoryFormat>)>();
+    return op.call(size, TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory), memory_format);
 }
```

Things that I probably did not get right:

- The Union[Argument, TensorOptionsArguments, ThisArgument] and
  the Cpp variants are starting to get a little unwieldy.  Not sure if
  this means I should add a supertype (or at the very least an
  alias); in some cases I do purposely omit one of these from the Union
- Code may not necessarily live in the most logical files.  There isn't
  very much rhyme or reason to it.
- The fields on CppSignature.  They're not very well constrained and
  it would be better if people didn't use them directly.
- Disambiguation.  We should do this properly in #44087 and we don't
  need special logic for deleting defaulting for faithful signatures;
  there is a more general story here.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D24144035

Pulled By: ezyang

fbshipit-source-id: a185f8bf9df8b44ca5718a7a44dac23cefd11c0a
2020-10-13 08:31:54 -07:00
a3caa719af fix #45552 - adding add_done_callback(fn) to torch.futures.Future (#45675)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45675
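
A minimal usage sketch of the new callback API (the values here are illustrative, not from the PR):

```
import torch

fut = torch.futures.Future()

def report(f):
    # the callback receives the completed Future itself
    print("result:", f.wait())

fut.add_done_callback(report)
fut.set_result(torch.ones(2))  # completing the future runs the callback
```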

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24055353

Pulled By: bdhirsh

fbshipit-source-id: 9233c8e17acc878f0fecbe740a4397fb55cf722f
2020-10-13 07:47:36 -07:00
282f4ab947 Workaround for bug in DistributedDataParallel (#46186)
Summary:
Fix the DistributedDataParallelSingleProcessTest to work around a limitation in DistributedDataParallel where the batch_size must be evenly divisible by the number of GPUs used.
See https://github.com/pytorch/pytorch/issues/46175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46186

Reviewed By: bdhirsh

Differential Revision: D24264664

Pulled By: mrshenli

fbshipit-source-id: 6cfd6d29e97f3e3420391d03b7f1a8ad49d75f48
2020-10-13 07:34:02 -07:00
a277c097ac [iOS][GPU] Add Metal/MPSCNN support on iOS (#46112)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46112

### Summary

This PR adds support for running torchscript models on iOS GPUs via Metal (inference only). The feature is currently in a prototype state; API changes are expected. Tutorials and documentation will be added once it reaches beta.

allow-large-files

- Users API

```
  auto module = torch::jit::load(model);
  module.eval();
  at::Tensor input = at::ones({1,3,224,224}, at::ScalarType::Float).metal();
  auto output = module.forward({input}).toTensor().cpu();
```
- Supported Models
    - Person Segmentation v106 (FB Internal)
    - Mobilenetv2

- Supported Operators
    - aten::conv2d
    - aten::addmm
    - aten::add.Tensor
    - aten::sub.Tensor
    - aten::mul.Tensor
    - aten::relu
    - aten::hardtanh
    - aten::hardtanh_
    - aten::sigmoid
    - aten::max_pool2d
    - aten::adaptive_avg_pool2d
    - aten::reshape
    - aten::t
    - aten::view
    - aten::log_softmax.int
    - aten::upsample_nearest2d.vec

- Supported Devices
    - Apple A9 and above
    - iOS 10.2 and above

- CMake scripts
    - `IOS_ARCH=arm64 ./scripts/build_ios.sh -DUSE_METAL=ON`

### Test Plan

- Circle CI

ghstack-source-id: 114155638

Test Plan:
1. Sandcastle CI
2. Circle CI

Reviewed By: dreiss

Differential Revision: D23236555

fbshipit-source-id: 98ffc48b837e308bc678c37a9a5fd8ae72d11625
2020-10-13 01:46:56 -07:00
7f6a1b2bd5 [quant][fx][graphmode][api] Change API for custom module (#45920)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45920

See docs for new way of defining custom modules

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24145856

fbshipit-source-id: 488673fba503e39e8e303ed5a776fe36899ea4e3
2020-10-12 23:42:27 -07:00
e6d30c89c1 Revert D24165889: Update native_functions.yaml to add DefaultBackend.
Test Plan: revert-hammer

Differential Revision:
D24165889 (1f9ddf64d2)

Original commit changeset: 7f3ccdb3499b

fbshipit-source-id: b5d0de57d918011f1e19c9ef6aafa89fefcb42d5
2020-10-12 23:17:06 -07:00
1f9ddf64d2 Update native_functions.yaml to add DefaultBackend. (#45938)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45938

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24165889

Pulled By: ailzhang

fbshipit-source-id: 7f3ccdb3499b40795bc34af716d0e63241ae8de3
2020-10-12 22:06:50 -07:00
ba1e0a88bb Use const-references in nodes_to_rewrite range loop
Test Plan: CI

Reviewed By: supriyar

Differential Revision: D24267389

fbshipit-source-id: c56d6bf1924b4c4c993fdf1328cfd5ab0d890869
2020-10-12 20:08:34 -07:00
4ad4715643 Fix JIT test config (#46230)
Summary:
Kill the jit_simple test config because it was no different from a regular diff config.
Fix pattern matching in test.sh for the `jit_legacy` config, as it was expecting `legacy_jit`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46230

Reviewed By: walterddr

Differential Revision: D24270144

Pulled By: malfet

fbshipit-source-id: 2e00dba288af1f1e904334b952033aa21062927a
2020-10-12 19:42:06 -07:00
66505b64a5 Fix incorrect CUDA torch.nn.Embedding result when max_norm is not None and indices are not sorted (#45248)
Summary:
Sorting indices before calling `thrust::unique` fixes the issue.
Fixes https://github.com/pytorch/pytorch/issues/44792
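
A sketch of the previously affected call pattern (unsorted, repeated indices with `max_norm` set; assumes a CUDA device is available):

```
import torch

emb = torch.nn.Embedding(10, 4, max_norm=1.0).cuda()
idx = torch.tensor([7, 1, 1, 3], device="cuda")  # unsorted, with repeats
out = emb(idx)  # rows are renormalized to max_norm before the lookup
```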

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45248

Reviewed By: mruberry

Differential Revision: D24194696

Pulled By: ngimel

fbshipit-source-id: ab59ef9d46b9917b1417bab25f80ce9780f0c930
2020-10-12 18:28:07 -07:00
88dcb95e22 [fx] use a linked list for nodes (#45708)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45708

This makes it possible to define reasonable semantics for what happens
when a node in the list is deleted. In particular, iteration over nodes
will continue at the node that was after the deleted node _when it was deleted_.
If that node has also been deleted, we skip it and continue to the node after it.
Eventually we either reach a node still in the list or we reach the end of the list.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D24089516

Pulled By: zdevito

fbshipit-source-id: d01312d11fe381c8d910a83a08582a2219f47dda
2020-10-12 18:20:14 -07:00
31ee5d8d8b Adding information how to control randomness with DataLoader (#45749)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45749
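
A sketch of the kind of seeding pattern the added documentation covers (the toy dataset is hypothetical):

```
import random
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_worker(worker_id):
    # derive a per-worker seed from the base seed DataLoader already set
    worker_seed = torch.initial_seed() % 2**32
    random.seed(worker_seed)  # seed numpy here too if it is used

g = torch.Generator()
g.manual_seed(0)

loader = DataLoader(TensorDataset(torch.arange(8.)), batch_size=2,
                    num_workers=2, worker_init_fn=seed_worker, generator=g)
```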

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D24088407

Pulled By: VitalyFedyunin

fbshipit-source-id: 398b73ec5e8c83000ebc692001da847fc0aaa48f
2020-10-12 16:57:58 -07:00
ee3d3e6dba [pytorch][PR][Gradient Compression] Reduce the peak memory of fp16 compression provided by ddp comm hook (#46078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46078

The peak memory usage of the DDP comm hook has increased due to an extra copy of the gradient tensors. To reduce memory usage, decompress the fp16 tensor in place of the tensor stored in the gradient bucket.

#Closes: https://github.com/pytorch/pytorch/issues/45968
ghstack-source-id: 113996453

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d  -- test_accumulate_gradients_no_sync_allreduce_hook

Also verified the decrease in memory consumption with some toy modeling examples.

Reviewed By: pritamdamania87

Differential Revision: D24178118

fbshipit-source-id: 453d0b52930809bd836172936b77abd69610237a
2020-10-12 16:15:38 -07:00
87a4baf616 [pt][quant] Support either min or max in qclamp (#45937)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45937

torch.clamp can now be used on quantized tensors with only the min argument or only the max argument.

Fixes https://github.com/pytorch/pytorch/issues/45928
ghstack-source-id: 114085914
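
A quick sketch of the newly allowed one-sided calls (values are illustrative):

```
import torch

x = torch.randn(4)
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)
torch.clamp(qx, min=0.2)  # previously both min and max were required
torch.clamp(qx, max=0.5)
```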

Test Plan:
buck test mode/dev caffe2/test:quantization -- 'test_qclamp'  --print-passing-details
```
Started reporting to test run: https://our.intern.facebook.com/intern/testinfra/testrun/4222124686876909
    ✓ ListingSuccess: caffe2/test:quantization - main (7.602)
    ✓ Pass: caffe2/test:quantization - test_qclamp (quantization.test_quantized_op.TestQuantizedOps) (7.233)
Summary
  Pass: 1
  ListingSuccess: 1
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/4222124686876909
```

Reviewed By: jerryzh168

Differential Revision: D24153431

fbshipit-source-id: 9735635a48bcdd88d1dd6dc2f18b59311d45ad90
2020-10-12 16:07:31 -07:00
bed3b40523 Implement ravel (#46098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46098

Doc:
![image](https://user-images.githubusercontent.com/68879799/95611323-ae5cf380-0a2f-11eb-9b8e-56bf79ce68af.png)
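
A short usage sketch of the new op:

```
import torch

t = torch.arange(6).reshape(2, 3)
torch.ravel(t)  # tensor([0, 1, 2, 3, 4, 5]); a view when the input is contiguous
```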

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24253213

Pulled By: ejguan

fbshipit-source-id: 42a866c902272cbe3743a9d0cb3afb9165d51c0b
2020-10-12 16:00:44 -07:00
b98e35948f fix test_serialization not working with Windows. (#46120)
Summary:
fixes https://github.com/pytorch/pytorch/issues/45917.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46120

Reviewed By: janeyx99

Differential Revision: D24253317

Pulled By: walterddr

fbshipit-source-id: 6caa0970b3e3eb972d314639be773a104a4e89a5
2020-10-12 15:18:46 -07:00
f3db68776c [NNC] Fix two more bugs in Cuda Half support (#46129)
Summary:
Fixes two bugs reported by https://github.com/pytorch/pytorch/issues/45953 in the NNC Cuda codegen which could break when using Half floats:

1. The Registerizer generates new scalars with the type of the load being replaced, and doesn't have Cuda-specific logic to avoid using the half type. I've added a quick mutator to coerce these to float, similar to the existing load-casting rules.

2. We weren't handling explicit casts to Half inserted by the user (in the report, the user being the JIT). This is addressed by replacing them with casts to Float, since that's the type we do Half math in.

Fixes https://github.com/pytorch/pytorch/issues/45953.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46129

Reviewed By: glaringlee

Differential Revision: D24253639

Pulled By: nickgg

fbshipit-source-id: 3fef826eab00355c81edcfabb1030332cae595ac
2020-10-12 13:31:07 -07:00
c02efdefa8 adding complex support for distributed functions, fix #45760 (#45879)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45879

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24127949

Pulled By: bdhirsh

fbshipit-source-id: 8061b14fa1c0adbe22b9397c2d7f92618556d223
2020-10-12 12:44:47 -07:00
8de9aa196a clean up dataclasses installation to only <3.7 (#46182)
Summary:
clean up docker image installation of dataclasses.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46182

Reviewed By: glaringlee

Differential Revision: D24257553

Pulled By: walterddr

fbshipit-source-id: 065a607f52c7e1dc6d0765d87e4468d1752c063b
2020-10-12 12:18:29 -07:00
ba78eb80ff including tensorexpr tests in CI for all configs (#46188)
Summary:
Removed test_tensorexpr from the JIT-EXECUTOR exclude list.

CI will now run those tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46188

Reviewed By: glaringlee

Differential Revision: D24255433

Pulled By: janeyx99

fbshipit-source-id: f18e5b41d49b439407c1c24ef6190ef68bc809bf
2020-10-12 12:03:06 -07:00
85c3ba5588 [caffe2] add PlanExecutorTest ErrorPlanWithCancellableStuckNet (#46110)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46110

## Motivation
* `Cancel` is now added to `OperatorBase` and `NetBase` (https://github.com/pytorch/pytorch/pull/44145).
* We need a test to cover and exhibit that we can cancel a stuck net and propagate errors with the plan executor.

## Summary
* Added PlanExecutorTest `ErrorPlanWithCancellableStuckNet` for the plan executor.
* Set cancelCount to zero at the beginning of the tests to avoid global state being carried over in some test environments.

Test Plan:
## Unit Test Added

```
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 1000
```

Reviewed By: d4l3k

Differential Revision: D24226577

fbshipit-source-id: c834383bfe6ab50747975c229eb42a363eed3458
2020-10-12 12:00:15 -07:00
8d5256e6dd Made exception message for torch.LongTensor() legacy constructor more readable (#46147)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46085

Made exception message for torch.LongTensor() legacy constructor more
readable

![exception_screenshot](https://user-images.githubusercontent.com/13827698/95664789-e3387b80-0aff-11eb-8e8e-bd2ee449cd7e.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46147

Reviewed By: glaringlee

Differential Revision: D24252617

Pulled By: mrshenli

fbshipit-source-id: 6c03b66fef50cf18f9d37c7047d3b98c847ae287
2020-10-12 11:26:38 -07:00
2070834b9e Improve error checking of Storage._writeFile. (#46036)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46036

Previously, this function didn't do error-bounds checking on the GetItem (GET_ITEM) calls, which led to issues like https://github.com/pytorch/pytorch/issues/46020.

A better solution would be to use pybind, but given that writing the file will dominate the cost of bounds checking, this is strictly better.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24228370

Pulled By: gchanan

fbshipit-source-id: f5d0a3d21ff12b4380beefe1e9954fa81ea2f567
2020-10-12 11:10:04 -07:00
9202c44379 Fix error in Binomial to retain lazy logit initialization (#46055)
Summary:
Some internal tests were sporadically failing for https://github.com/pytorch/pytorch/pull/45648. The cause is a bug in `Binomial.__init__` that references the lazy `logits` attribute and sets it when not needed. This also cleans up the `is_scalar` logic, which isn't needed given that `broadcast_all` will convert a `Number` to a `tensor`.

The reason for the flakiness is the mutation of the params dict by the first test, which is fixed by doing a shallow copy. It would be better to convert this into a pytest parameterized test once https://github.com/pytorch/pytorch/pull/45648 is merged.

cc. fritzo, ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46055

Reviewed By: agolynski

Differential Revision: D24221151

Pulled By: neerajprad

fbshipit-source-id: 15aae90a692ee6aed729c9f1d2d1b1388170a3c0
2020-10-12 10:56:06 -07:00
146721f1df Fix typing errors in the torch.distributions module (#45689)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42979.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45689

Reviewed By: agolynski

Differential Revision: D24229870

Pulled By: xuzhao9

fbshipit-source-id: 5fc87cc428170139962ab65b71cacba494d46130
2020-10-12 10:29:45 -07:00
6a001decf2 [quant][test] Add mul_scalar test (#46106)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46106

make sure quantized::mul_scalar matches dequantize - mul - quantize

Test Plan: Imported from OSS

Reviewed By: dskhudia

Differential Revision: D24230790

fbshipit-source-id: 1adcc82b9c41f1b53c9a761477f7c5c08aba1001
2020-10-12 10:24:08 -07:00
6ba6ecb048 Only use hacky_wrapper_for_legacy_signatures if an op needs it (#45742)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45742

Add a new flag to native_functions.yaml: `use_c10_dispatcher: hacky_wrapper_for_legacy_signatures`
and the codegen only wraps kernels in the aforementioned wrapper if that flag is set.
Apart from that, `use_c10_dispatcher: hacky_wrapper_for_legacy_signatures` is equivalent to `full`,
i.e. it has full boxing and unboxing support.

This greatly reduces the number of ops we apply the hacky_wrapper to, i.e. all ops marked as `use_c10_dispatcher: full` don't have it anymore.
ghstack-source-id: 113982139

Test Plan:
waitforsandcastle

vs fbcode:
https://www.internalfb.com/intern/fblearner/details/214511705/

vs base diff:
https://www.internalfb.com/intern/fblearner/details/214693207/

Reviewed By: ezyang

Differential Revision: D23328718

fbshipit-source-id: be120579477b3a05f26ca5f75025bfac37617620
2020-10-12 09:39:18 -07:00
e1f74b1813 Fix mkldnn build on legacy x64 arch (#46082)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45838

`ARCH_OPT_FLAGS` was the old name of `MKLDNN_ARCH_OPT_FLAGS`; it was renamed in [this commit](2a011ff02e (diff-a0abcbf647ed740b80615fb5b1614a44L97)) but not updated in pytorch.

Because its default value is set to SSE4.1, some kernels fail on legacy architectures that do not support SSE4.1. This patch makes the flag effective.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46082

Reviewed By: glaringlee

Differential Revision: D24252149

Pulled By: agolynski

fbshipit-source-id: 7079deed373d664763c5888feb28795e5235caa8
2020-10-12 08:45:06 -07:00
a814231616 [fix] torch.kthvalue : handle non-contiguous CUDA tensor (#45802)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45721

TODO
* [x] Test
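
A sketch of the previously affected pattern (assumes a CUDA device is available):

```
import torch

x = torch.randn(5, 6, device="cuda").t()         # a non-contiguous view
values, indices = torch.kthvalue(x, k=2, dim=1)  # previously mishandled such inputs
```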

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45802

Reviewed By: ngimel

Differential Revision: D24236706

Pulled By: mruberry

fbshipit-source-id: 5a51049233efa710f9500a6f7d099c90d43062c9
2020-10-11 20:13:08 -07:00
3883cdb87e TensorInferenceFunction checks
Summary: Added an OpSchema::NeedsAllInputShapes wrapper around the TensorInferenceFunction to fix an exception raised when referencing the dim array while the input shape was unknown. There may be other operators that could use a similar change; these are just the ones that were causing InferShapesAndTypes to throw an exception in my examples.

Test Plan: Tested with notebook n352716

Differential Revision: D23745442

fbshipit-source-id: d63eddea47d7ba595e73c4693d34c790f3a329cc
2020-10-11 16:08:58 -07:00
1a99689d71 [caffe2] Fix preprocessor checks for FMA
Summary: I think this preprocessor check is incorrect.  The fused multiply-add (FMA) instructions are not part of AVX2.

Test Plan: CI

Reviewed By: jspark1105

Differential Revision: D24237836

fbshipit-source-id: 44f9b9179918332eb85ac087827726300f56224e
2020-10-11 11:48:32 -07:00
bbb3f09377 Automated submodule update: FBGEMM (#46151)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: da05c8db75

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46151

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D24239896

fbshipit-source-id: 78ff9c100e39ef9a429eafd11a4c158dabd5cb15
2020-10-10 21:26:37 -07:00
a0a8bc8870 Fix mistakes and increase clarity of norm documentation (#42696)
Summary:
* Removes the incorrect statement that "the vector norm will be applied to the last dimension".
* Describes each different combination of `p`, `ord`, and input size more clearly.
* Moves norm tests from `test/test_torch.py` to `test/test_linalg.py`.
* Adds a test ensuring that `p='fro'` and `p=2` give the same results for mutually valid inputs (see the sketch below).
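
A sketch of the property the new test checks (flattened vector norms, not the spectral norm):

```
import torch

m = torch.randn(3, 4)
# without dim, a numeric p flattens the input, so p=2 agrees with 'fro'
assert torch.allclose(torch.norm(m, p='fro'), torch.norm(m, p=2))
```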

Fixes https://github.com/pytorch/pytorch/issues/41388

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42696

Reviewed By: bwasti

Differential Revision: D23876862

Pulled By: mruberry

fbshipit-source-id: 36f33ccb6706d5fe13f6acf3de8ae14d7fbdff85
2020-10-10 14:12:43 -07:00
496d72d700 [TensorExpr] Disable and/or fix some failing tests. (#46146)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46146

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D24238545

Pulled By: ZolotukhinM

fbshipit-source-id: 0d8242da9d1c6960f7b5e9065c3e8defd3d32494
2020-10-10 13:54:25 -07:00
4c87d337af [Caffe2] use the real new fbgemm sparse adagrad interface (#46132)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46132

As title

Test Plan: .

Reviewed By: dskhudia

Differential Revision: D24197694

fbshipit-source-id: 2bfe8f52409fa500d2ea359dec7f521cffb20efb
2020-10-10 08:57:54 -07:00
9f743015bf a few more comments on dispatch key computation methods (#46128)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46128

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D24233868

Pulled By: bhosmer

fbshipit-source-id: efb80fb25d4e3ece3ef9190ee1ed834dff505d7c
2020-10-10 01:17:40 -07:00
b7261de0df [pytorch][te] Add compilation time benchmark (#46124)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46124

We want to make sure we can actually fuse kernels within a fairly
tight time budget.  So here's a quick benchmark of codegen for a simple
pointwise activation function (swish).  I kept all the intermediate tensors
separate to force TE to actually do inlining.

Test Plan:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
```

I've only run in debug mode so the results aren't super meaningful, but even in
that mode it's 18 ms for compilation, 15 ms of which are in LLVM.

Update, opt build mode:
```
----------------------------------------------------------------------------
Benchmark                                     Time           CPU Iterations
----------------------------------------------------------------------------
BM_CompileSwish                         5123276 ns    5119846 ns        148
BM_CompileSwishLLVMOnly                 4754361 ns    4753701 ns        160
```

Reviewed By: asuhan

Differential Revision: D24232801

fbshipit-source-id: d58a8b7f79bcd9244c49366af7a693e09f24bf76
2020-10-09 23:11:37 -07:00
43fe45ab0f [JIT] Add dynamic shape benchmark for NV Fuser (#46107)
Summary:
This PR modifies `benchmarks/tensorexpr`. It follows up [#44101](https://github.com/pytorch/pytorch/pull/44101) and further supports characterizing fusers with dynamic-shape benchmarks. The dynamic-shape condition models the use case where the input tensor shape changes on each call to the graph.

Changes include:

Added an auxiliary class `DynamicShape` that provides a simple API for enabling dynamic shapes in existing test cases; an example can be found in `DynamicSimpleElementBench`.

Created new bench_cls: `DynamicSimpleElementBench`, `DynamicReduce2DInnerBench`, `DynamicReduce2DOuterBench`, and `DynamicLSTM`. They are all dynamic-shaped versions of existing benchmarks and examples of enabling dynamic shapes with `DynamicShape`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46107

Reviewed By: glaringlee

Differential Revision: D24229400

Pulled By: bertmaher

fbshipit-source-id: 889fece5ea87d0f6f6374d31dbe11b1cd1380683
2020-10-09 22:09:21 -07:00
689499ffa8 remove duplicate autograd srcs (#46059)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46059

These files are not generated sources, and they also already exist in another variable in the same file (`libtorch_extra_sources`).

Test Plan: CI green

Reviewed By: malfet

Differential Revision: D24203450

fbshipit-source-id: 0c9e12cd1a292c5484961876d4fa7f2341a3165b
2020-10-09 20:17:21 -07:00
5e4b3dd25a Automated submodule update: FBGEMM (#46125)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 75ea7ce6f8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46125

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D24233275

fbshipit-source-id: 526c44aba92e622c6b46c17b467e146303a77b57
2020-10-09 19:59:52 -07:00
34951e9adc [shape inference] adding a new flag to the struct
Summary: Adding a new flag, shape_is_set, to the structs for shape inference on in-place ops to prevent duplicate inference.

Test Plan:
buck test mode/opt-clang caffe2/caffe2/opt:bound_shape_inference_test

buck test mode/opt-clang caffe2/caffe2/fb/opt:shape_info_utils_test

Reviewed By: ChunliF

Differential Revision: D24134767

fbshipit-source-id: 5142e749fd6d1b1092a45425ff7b417a8086f215
2020-10-09 19:29:08 -07:00
138c22f8e3 qnnpack quantized activations: fix memory format issues (#46077)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46077

Some of the QNNPACK quantized kernels were not handling NHWC correctly:
the data written respected the input format, but the memory-format flag
was always set to contiguous.  This PR
1. adds NHWC testing for qnnpack activations
2. fixes those activations that did not set the memory format on the output

Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_qhardsigmoid
python test/test_quantization.py TestQuantizedOps.test_leaky_relu
python test/test_quantization.py TestQuantizedOps.test_hardswish
python test/test_quantization.py TestQNNPackOps.test_qnnpack_tanh
python test/test_quantization.py TestQNNPackOps.test_qnnpack_sigmoid
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D24213257

fbshipit-source-id: 764fb588a8d8a0a6e6e4d86285904cdbab26d487
2020-10-09 19:18:15 -07:00
172036a565 [NCCL] Add Error log when ProcessGroupNCCL takes down process upon (#44988)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44988

The new NCCL async error handling feature throws an exception from the
workCleanup thread if one of the NCCL operations encounters an error or times
out. This PR adds an error log to make it clearer to the user why the
training process crashed.
ghstack-source-id: 114002493

Test Plan:
Verified that we see this error message when running with the desync
test.

Reviewed By: pritamdamania87

Differential Revision: D23794801

fbshipit-source-id: 16a44ce51f01531062167fb762a8553221363698
2020-10-09 16:58:50 -07:00
7094c09ff7 quantizaton: add API usage logging (#46095)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46095

Adds logging on usage of public quantization APIs. This only works in FB codebase
and is a no-op in OSS.

Test Plan: The test plan is fb-only

Reviewed By: raghuramank100

Differential Revision: D24220817

fbshipit-source-id: a2cc957b5a077a70c318242f4a245426e48f75e5
2020-10-09 16:51:27 -07:00
c73af6040e [FX] Make graph_copy examine existing values in val_map (#46104)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46104

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D24224505

Pulled By: jamesr66a

fbshipit-source-id: ffdf8ea8cb92439f3aacf08b0c0db63ce3a15b8f
2020-10-09 16:37:55 -07:00
d811d4d7ba Support DefaultBackend keyword in native_functions.yaml. (#45719)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45719

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D24165888

Pulled By: ailzhang

fbshipit-source-id: 9b3c5e71f5b6a985e1a43157813e7d77dbe13b07
2020-10-09 16:28:26 -07:00
e33d455ef7 [Distributed] Set smaller Store timeouts to make c10d tests run faster (#46067)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46067

In our store tests, we expect an exception when we call
get on a recently deleted key. Unfortunately, the store waits for the timeout
period for the key to be set before throwing, which causes the tests to sit
idle for 5+ minutes. This PR decreases the timeouts before this set call so
these tests run faster.
ghstack-source-id: 113917315

Test Plan: Ran both the Python and C++ tests.

Reviewed By: pritamdamania87

Differential Revision: D24208617

fbshipit-source-id: c536e59ee305e0c01c44198a3b1a2247b8672af2
2020-10-09 15:45:42 -07:00
2fa91fa305 [NNC] Fix crash when simplifying certain subtractions (#46108)
Summary:
Fixes a crash bug in the IRSimplifier when the LHS is a Term (e.g. 2x) and the RHS is a Polynomial (e.g. 2x+1).

This case crashes 100% of the time so I guess it's not very common in models we've been benchmarking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46108

Reviewed By: agolynski

Differential Revision: D24226593

Pulled By: nickgg

fbshipit-source-id: ef454c855ff472febaeba16ec34891df932723c0
2020-10-09 15:15:55 -07:00
281463ba0b [NCCL] Enable send/recv tests (#45994)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45994

Send/Recv tests were disabled because of the https://github.com/pytorch/pytorch/issues/42517. With that issue fixed, this diff enables those tests.
ghstack-source-id: 113970569

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D24172484

fbshipit-source-id: 7492ee2e9bf88840c0d0086003ce8e99995aeb91
2020-10-09 15:00:39 -07:00
3ffd2af8cd Add exception classification to torch.multiprocessing.spawn (#45174)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45174

Introduce distinct exception types that map to the different failure modes
of torch.multiprocessing.spawn:
ProcessRaisedException - raised when a process started by spawn raises an exception
ProcessExitedException - raised when a process started by spawn exits
This classification allows frameworks that use mp.spawn to categorize failures,
which can be helpful for tracking metrics and enhancing logs.
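
A usage sketch of the classification (assumes the new exception types are exported from torch.multiprocessing):

```
import torch.multiprocessing as mp

def worker(rank):
    if rank == 1:
        raise ValueError("boom")

if __name__ == "__main__":
    try:
        mp.spawn(worker, nprocs=2)
    except mp.ProcessRaisedException as exc:
        # a spawned process raised; classify, track metrics, enrich logs
        print("child failure:", exc)
```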

Test Plan: Imported from OSS

Reviewed By: taohe

Differential Revision: D23889400

Pulled By: tierex

fbshipit-source-id: 8849624c616230a6a81158c52ce0c18beb437330
2020-10-09 12:59:41 -07:00
da033e0b2d [Caffe2] use new fbgemm sparse adagrad interface with temp name (#46089)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46089

Follow-up of D24195799

Test Plan: .

Reviewed By: dskhudia

Differential Revision: D24196753

fbshipit-source-id: 216512822cfb752984bb97bd229af9746e866eaa
2020-10-09 12:51:43 -07:00
0ddcc0ce35 Add alias dispatch key DefaultBackend. (#45718)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45718

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D24165892

Pulled By: ailzhang

fbshipit-source-id: ed28bf62b7c6320d966fd10b7a44b14efffe2f62
2020-10-09 12:02:44 -07:00
f8b3af21f2 Allow Tensor-likes in torch.autograd.gradcheck (#45732)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42942

Re-do of https://github.com/pytorch/pytorch/issues/43877.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45732

Reviewed By: mruberry

Differential Revision: D24195820

Pulled By: albanD

fbshipit-source-id: 8f43353077f341e34371affd76be553c0ef7d98a
2020-10-09 11:51:27 -07:00
59414b359d Document fix for logspace and linspace (#46056)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46056

The result:
* logspace
![image](https://user-images.githubusercontent.com/68879799/95513793-e6f5c200-0988-11eb-8279-b093612743ca.png)
* linspace
![image](https://user-images.githubusercontent.com/68879799/95513824-f543de00-0988-11eb-9910-72d28d7b6277.png)

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D24204441

Pulled By: ejguan

fbshipit-source-id: fe1179fdbebb326d33e9c474b1efc8282a391901
2020-10-09 10:20:57 -07:00
c83314e982 [ci-all tests] Improve logging in ProcessGroupNCCL for debugging purposes. (#46010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46010

When training jobs running with NCCL fail, it is sometimes hard to
debug the reason for the failure, and our logging doesn't always provide
enough information to narrow down the issue.

To improve the debugging experience, I've enhanced our logging to add a lot
more information about what the ProcessGroup is doing under the hood.

#Closes: https://github.com/pytorch/pytorch/issues/45310

Sample output:
```
> I1002 15:18:48.539551 1822062 ProcessGroupNCCL.cpp:528] [Rank 2] NCCL watchdog thread started!
> I1002 15:18:48.539533 1821946 ProcessGroupNCCL.cpp:492] [Rank 2] ProcessGroupNCCL initialized with following options:
> NCCL_ASYNC_ERROR_HANDLING: 0
> NCCL_BLOCKING_WAIT: 1
> TIMEOUT(ms): 1000
> USE_HIGH_PRIORITY_STREAM: 0
> I1002 15:18:51.080338 1822035 ProcessGroupNCCL.cpp:530] [Rank 1] NCCL watchdog thread terminated normally
> I1002 15:18:52.161218 1821930 ProcessGroupNCCL.cpp:385] [Rank 0] Wrote aborted communicator id to store: NCCLABORTEDCOMM:a0e17500002836080c8384c50000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> I1002 15:18:52.161238 1821930 ProcessGroupNCCL.cpp:388] [Rank 0] Caught collective operation timeout for work: WorkNCCL(OpType=ALLREDUCE, TensorShape=[10], Timeout(ms)=1000)
> I1002 15:18:52.162120 1821957 ProcessGroupNCCL.cpp:530] [Rank 0] NCCL watchdog thread terminated normally
> I1002 15:18:58.539937 1822062 ProcessGroupNCCL.cpp:649] [Rank 2] Found key in store: NCCLABORTEDCOMM:a0e17500002836080c8384c50000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, from rank: 0, aborting appropriate communicators
> I1002 15:19:34.740937 1822062 ProcessGroupNCCL.cpp:662] [Rank 2] Aborted communicators for key in store: NCCLABORTEDCOMM:a0e17500002836080c8384c50000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> I1002 15:19:34.741678 1822062 ProcessGroupNCCL.cpp:530] [Rank 2] NCCL watchdog thread terminated normally
```
ghstack-source-id: 113961408

Test Plan: waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D24183463

fbshipit-source-id: cb09c1fb3739972294e7edde4aae331477621c67
2020-10-09 09:46:58 -07:00
362d9a932e Remove object-based collective APIs from public docs (#46075)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46075

Removes these from public docs for now as we are still
iterating/formalizing these APIs. Will add them back once they are part of a
PyTorch release.
ghstack-source-id: 113928700

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D24211510

fbshipit-source-id: 3e36ff6990cf8e6ef72b6e524322ae06f9097aa2
2020-10-09 09:24:51 -07:00
62554a3bd2 Prioritize raising error message about unused parameters when rebuild_buckets fails (#45933)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45933

Occasionally users run DDP with models that have unused parameters; in this
case we would like to surface an error message telling them to run with
find_unused_parameters=True. However, a recent change to the rebuild_buckets logic (https://github.com/pytorch/pytorch/pull/44798) made
it so that we raise a size-mismatch error when this happens, but the
information about unused parameters is likely to be more useful, and unused
parameters are likely the most common cause of this failure. Prefer raising
that error over the subsequent size-mismatch errors.
ghstack-source-id: 113914759

Test Plan: Added unittest

Reviewed By: mrshenli

Differential Revision: D24151256

fbshipit-source-id: 5d349a988b4aac7d3e0ef7b3cd84dfdcbe9db675
2020-10-09 09:16:45 -07:00
9fb8e33a5b [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D24215555

fbshipit-source-id: 21d10bd60ab302c7cf7e245979b2d2ef0a142a1c
2020-10-09 08:37:54 -07:00
9443033e71 Automated submodule update: FBGEMM (#46079)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 974d2b41e7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46079

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D24213375

fbshipit-source-id: b80786490079f9f56a90e10fbb476d0963cf2abc
2020-10-09 07:40:18 -07:00
c734961e26 [cpp-extensions] Ensure default extra_compile_args (#45956)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45835

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45956

Reviewed By: ngimel

Differential Revision: D24162289

Pulled By: albanD

fbshipit-source-id: 9ba2ad51e818864f6743270212ed94d86457f4e6
2020-10-09 07:33:28 -07:00
a5c0dbc519 Add support for Softmax. (#45286)
Summary:
This PR adds support for Softmax in NNC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45286

Reviewed By: mrshenli

Differential Revision: D24042901

Pulled By: navahgar

fbshipit-source-id: 120bafe17586d3ecf0918f9aee852a7c3a8f4990
2020-10-08 23:57:02 -07:00
87226f72d2 [caffe2] temp remove ErrorPlanWithCancellableStuckNet (#46080)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46080

Temporary removal of ErrorPlanWithCancellableStuckNet; will follow up with more detail.

Test Plan:
```
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest
```
remove a test

Reviewed By: fegin

Differential Revision: D24213971

fbshipit-source-id: e6e600bad00b45c726311193b4b3238f1700526e
2020-10-08 23:35:45 -07:00
0983ddbfd2 add sharding option to test framework (#45988)
Summary:
Adding a sharding node to our python CONFIG_TREE

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45988

Reviewed By: mruberry

Differential Revision: D24200636

Pulled By: janeyx99

fbshipit-source-id: 08c8c4cf98bbd4980fe6082ae6caa64fbc2ca792
2020-10-08 21:22:51 -07:00
f363a2e106 Mark top 3 slowest tests as slow (#46068)
Summary:
`TCPStoreTest.test_numkeys_delkeys` takes 5+ min (mostly in idle wait for socket timeout)
`TestDataLoader.test_proper_exit` and `TestDataLoaderPersistentWorkers.test_proper_exit` take 2.5 min each
`TestXNNPACKConv1dTransformPass.test_conv1d_with_relu_fc` takes 2 min to finish

Add an option to `print_test_stats.py` to skip reporting test classes that run for less than a second, and speed up `TestTorchDeviceTypeCUDA.test_matmul_45724_cuda`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46068

Reviewed By: mruberry

Differential Revision: D24208660

Pulled By: malfet

fbshipit-source-id: 780e0d8be4f0cf69ea28de79e423291a1f3349b7
2020-10-08 21:10:03 -07:00
487624e369 [caffe2] plan executor error propagation test with blocking cancellable op (#45319)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45319

## Motivation
* `Cancel` is now added to `OperatorBase` and `NetBase` (https://github.com/pytorch/pytorch/pull/44145)
* We need a test to cover and exhibit that we can cancel a stuck net and propagate errors with the plan executor.

## Summary
* Added `ErrorPlanWithCancellableStuckNet` for the plan executor.
* We set up a plan with two nets: a stuck net with a blocking operator that never returns, and an error
  net with an op that throws, and tested that the plan throws and cancels.

Test Plan:
## Unit Test added
```
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest
buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100
```
```
Summary
  Pass: 400
  ListingSuccess: 2
```

Reviewed By: d4l3k

Differential Revision: D23920548

fbshipit-source-id: feff41f73698bd6ea9b744f920e0fece4ee44438
2020-10-08 19:54:49 -07:00
8cd3857bc7 [NCCL] Add torch::cuda::nccl::send/recv (#45926)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45926

torch/csrc/cuda/nccl.cpp is compiled as part of the torch_cuda library, so calling this function from ProcessGroupNCCL.cpp avoids linking a second instance of libnccl.a into torch_python.
Fixes a similar issue to https://github.com/pytorch/pytorch/issues/42517

ghstack-source-id: 113910530

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D24147802

fbshipit-source-id: d8901fdb31bdc22ddca2364f8050844639a1beb3
2020-10-08 19:20:40 -07:00
b7f7378b2d [NCCL] support send/recv to/from self when communicator is created on demand (#45873)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45873

This diff adds support for sending/receiving to/from self. It also fixes a bug that occurred when p2p operations were not used by all processes.
ghstack-source-id: 113910526

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D24124413

fbshipit-source-id: edccb830757ac64f569e7908fec8cb2b43cd098d
2020-10-08 19:19:15 -07:00
96d48178c8 Make pipeWrite and pipeRead noexcept (#45783)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45783

After the previous device-maps commits, `pipeWrite` might throw. In
this case, if we increment active calls before `pipeWrite` on the
caller, that active call won't be decremented properly when `pipeWrite`
throws. As a result, `shutdown` can silently time out. I noticed this
because some tests took more than 60s to finish.

This commit extracts the tensor device-checking logic out of pipeWrite
and makes sure the error is thrown before the active-call count is
incremented.

Differential Revision: D24094803

Test Plan: Imported from OSS

Reviewed By: mruberry

Pulled By: mrshenli

fbshipit-source-id: d30316bb23d2afd3ba4f5540c3bd94a2ac10969b
2020-10-08 18:53:51 -07:00
c86ee082a2 torch.fft: Add helper functions section to docs (#46032)
Summary:
Fixes https://github.com/pytorch/pytorch/pull/44877#issuecomment-705411068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46032

Reviewed By: ngimel

Differential Revision: D24191580

Pulled By: mruberry

fbshipit-source-id: 58a32de886b40f85653ddc3b65bf8d551395f023
2020-10-08 17:57:12 -07:00
2b204e6db3 [quant][fx][graphmode] Run symbolic_trace in quantization (#45919)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45919

As discussed with the JIT team, we'll run symbolic_trace inside the quantization functions.
prepare_fx now takes the original PyTorch model (torch.nn.Module) instead of a `GraphModule` as input.
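
A sketch against the FX graph-mode API as of this change (import paths and signatures have moved in later releases):

```
import torch
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx

model = torch.nn.Sequential(torch.nn.Linear(4, 4)).eval()
# pass the eager model directly; symbolic_trace now runs inside prepare_fx
prepared = prepare_fx(model, {"": get_default_qconfig("fbgemm")})
```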

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D24145857

fbshipit-source-id: 2b7a4ca525a7a8c23a26af54ef594c6a951e4024
2020-10-08 17:26:03 -07:00
c6672a608b caffe2 missing cctype header (#46052)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46052

`<cctype>` is what provides `isuppper`, etc.
https://en.cppreference.com/w/cpp/header/cctype

clang on windows complaining about the missing header.

Test Plan: CI green

Reviewed By: yinghai

Differential Revision: D24201925

fbshipit-source-id: 7b242200f09c30bf78dde226e14ee4be71758b87
2020-10-08 16:48:49 -07:00
31888b2e77 [quant][pyper] Rename the sparse argument for embedding_bag ops (#46003)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46003

sparse is confusing because it is used in training for sparse gradients.

Test Plan: Imported from OSS

Reviewed By: radkris-git, qizzzh

Differential Revision: D24178248

fbshipit-source-id: 0a2b595f3873d33b2ce25839b6eee31d2bfd3b0d
2020-10-08 16:15:28 -07:00
8c80ee8ba5 [quant] Set sparse to False for embedding_bag ops in graph mode (#45997)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45997

The current sparse field used in the float module is for sparse gradients, which is not applicable
to inference. The sparse field in the quantized ops denotes pruned weights.

Test Plan:
python test/test_quantization.py TestQuantizeDynamicJitOps.test_embedding_bag

Imported from OSS

Reviewed By: qizzzh

Differential Revision: D24176543

fbshipit-source-id: a05b4ff949e0375462ae411947f68076e1b460d2
2020-10-08 16:13:12 -07:00
0cf0b5f2e8 Minor refactor to normalize assignments (#45671)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45671

This is a follow up on D23977080 (2596113a79) and https://github.com/pytorch/pytorch/pull/45474.

Test Plan: See D23977080 (2596113a79).

Reviewed By: z-a-f

Differential Revision: D24043125

fbshipit-source-id: 0c05930668533bfd7145fa605f3785484391130b
2020-10-08 16:06:48 -07:00
64b0686986 Expose ChannelShuffle (#46000)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45999
Also includes a small fix for the caffe2 counterpart.
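
A short usage sketch of the newly exposed module:

```
import torch

shuffle = torch.nn.ChannelShuffle(2)
x = torch.arange(8.).reshape(1, 4, 1, 2)  # (N, C, H, W) with C=4
shuffle(x)  # channel order becomes [0, 2, 1, 3]
```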

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46000

Reviewed By: mruberry

Differential Revision: D24185855

Pulled By: ngimel

fbshipit-source-id: c5d599bb8100b86b81c6901f1b8b8baefc12cb16
2020-10-08 16:00:01 -07:00
89256611b5 Doc note update for complex autograd (#45270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45270

<img width="1679" alt="Screen Shot 2020-10-07 at 1 45 59 PM" src="https://user-images.githubusercontent.com/20081078/95368324-fa7b2d00-08a3-11eb-9066-2e659a4085a2.png">
<img width="1673" alt="Screen Shot 2020-10-07 at 1 46 10 PM" src="https://user-images.githubusercontent.com/20081078/95368332-fbac5a00-08a3-11eb-9be5-77ce6deb8967.png">
<img width="1667" alt="Screen Shot 2020-10-07 at 1 46 30 PM" src="https://user-images.githubusercontent.com/20081078/95368337-fe0eb400-08a3-11eb-80a2-5ad23feeeb83.png">
<img width="1679" alt="Screen Shot 2020-10-07 at 1 46 48 PM" src="https://user-images.githubusercontent.com/20081078/95368345-00710e00-08a4-11eb-96d9-e2d544554a4b.png">
<img width="1680" alt="Screen Shot 2020-10-07 at 1 47 03 PM" src="https://user-images.githubusercontent.com/20081078/95368350-023ad180-08a4-11eb-89b3-f079480741f4.png">
<img width="1680" alt="Screen Shot 2020-10-07 at 1 47 12 PM" src="https://user-images.githubusercontent.com/20081078/95368364-0535c200-08a4-11eb-82fc-9435a046e4ca.png">

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D24203257

Pulled By: anjali411

fbshipit-source-id: cd637dade5fb40cecf5d9f4bd03d508d36e26fcd
2020-10-08 15:04:52 -07:00
e3112e3ed6 aten::set_grad_enabled should not push as it does not return a value (#45559)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45558

This assertion failure is caused by the incorrect implementation of ``aten::set_grad_enabled`` in [torch/csrc/jit/runtime/register_special_ops.cpp](https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/runtime/register_special_ops.cpp#L436). The current implementation is:

```cpp
Operator(
    "aten::set_grad_enabled(bool val) -> ()",
    [](Stack* stack) {
      torch::GradMode::set_enabled(pop(stack).toBool());
      push(stack, IValue());
    },
    aliasAnalysisConservative()),
```

which pushes a ``None`` onto the evaluation stack after calling ``set_enabled``. But according to the signature, this behavior is incorrect, as the signature says the function won't return a value. I suspect the original author was confused by the behavior of Python, which pushes a ``None`` onto the evaluation stack when a function definition does not end with a return statement carrying an explicit result value.

If ``aten::set_grad_enabled`` pushes a ``None`` onto the evaluation stack, then each time it's called, the evaluation stack will accumulate an extra ``None``. In our case, ``with torch.no_grad():`` will cause ``aten::set_grad_enabled`` to be called twice, so when the ``forward`` method finishes, the evaluation stack will be ``[None, None, Tensor]``. But the return statement of ``GraphFunction::operator()`` in [torch/csrc/jit/api/function_impl.cpp](https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/api/function_impl.cpp#L51) is ``return stack.front();``, which will try to extract a tensor out of a ``None`` and thus causes the assertion failure.

The solution is simple: just remove the push from the implementation of ``aten::set_grad_enabled``.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45559

Reviewed By: albanD

Differential Revision: D24142153

Pulled By: SplitInfinity

fbshipit-source-id: 75aad0e38bd912a437f7e1a1ee89ab4445e35b5d
2020-10-08 14:42:11 -07:00
ddcacc736d Do not rebase select nighly builds on top of master (#46038)
Summary:
Prevents the following nightly failures from happening:
https://app.circleci.com/pipelines/github/pytorch/pytorch/224752/workflows/3a01ccc2-0215-4e95-9222-bbb4f9309201/jobs/8084912

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46038

Reviewed By: seemethere

Differential Revision: D24195706

Pulled By: malfet

fbshipit-source-id: d53da554bc43841ab6573188f9465c691c601eb3
2020-10-08 14:36:44 -07:00
59e4803b94 Recommit: caffe2/plan_executor: wait for 1 minute after exception and then abort (#45981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45981

This is a recommit of previously reverted D20850851 (3fbddb92b1).

TL;DR - combining condition_variables and atomics is a bad idea

https://stackoverflow.com/questions/49622713/c17-atomics-and-condition-variable-deadlock

This also adds some ifdefs to disable the death test for mobile, xplat and tsan builds since forking doesn't play nicely with them.

Test Plan:
buck test mode/opt //caffe2/caffe2/python:hypothesis_test -- --stress-runs 1000 test_atomic_iter_with_concurrent_steps --timeout 120
  buck test mode/opt //caffe2/caffe2/python:hypothesis_test -- --stress-runs 100
  buck test mode/opt caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100

no timeouts https://www.internalfb.com/intern/testinfra/testconsole/testrun/7036874440059883/

will ensure no timeouts in OSS

Reviewed By: walterddr, dahsh

Differential Revision: D24165505

fbshipit-source-id: 17cd23bfbcd9c2826a4067a387023d5186353196
2020-10-08 14:17:30 -07:00
402abdfdf4 [NNC] cacheAccesses transform (cache_reads + cache_writes) (#45869)
Summary:
Adds a new transform to the NNC compiler, which adds support for buffer access caching. All accesses within a provided scope are redirected to a cache which is initialized or written back as necessary at the boundaries of that scope. For TVM fans, this is essentially a combination of cache_reads and cache_writes. E.g. it can do this kind of thing:

Before:
```
for (int i = 0; i < 64; i++) {
  for (int j = 0; j < 64; j++) {
    A[i, j] = i * j;
  }
}
for (int i_1 = 0; i_1 < 20; i_1++) {
  for (int j_1 = 0; j_1 < 10; j_1++) {
    B[i_1, j_1] = (A(i_1 + 30, j_1 + 40)) + (A(i_1 + 31, j_1 + 41));
  }
```

After `cacheAccesses(A->buf(), "A_local", j_loop);`

```
for (int i = 0; i < 64; i++) {
  for (int j = 0; j < 64; j++) {
    A[i, j] = i * j;
  }
}
for (int i_1 = 0; i_1 < 20; i_1++) {
  for (int i_2 = 0; i_2 < 2; i_2++) {
    for (int j_1 = 0; j_1 < 11; j_1++) {
      A_local[i_2, j_1] = A[(i_2 + i_1) + 30, j_1 + 40];
    }
  }
  for (int j_2 = 0; j_2 < 10; j_2++) {
    B[i_1, j_2] = (A_local[1, j_2 + 1]) + (A_local[0, j_2]);
  }
}
```

Or this reduction:
```
for (int l1 = 0; l1 < 4; l1++) {
  sum[l1] = 0.f;
  for (int n1_1 = 0; n1_1 < 3; n1_1++) {
    for (int m1_1 = 0; m1_1 < 2; m1_1++) {
      sum[l1] = (sum[l1]) + (scale[(6 * l1 + 2 * n1_1) + m1_1]);
    }
  }
}
```

After `l.cacheAccesses(d->buf(), "d_local", n_loop);`:

```
for (int l1 = 0; l1 < 4; l1++) {
  Allocate(d_local, float, {1});
  sum[l1] = 0.f;
  d_local[0] = 0.f;
  for (int n1_1 = 0; n1_1 < 3; n1_1++) {
    for (int m1_1 = 0; m1_1 < 2; m1_1++) {
      d_local[0] = (d_local[0]) + (scale[(6 * l1 + 2 * n1_1) + m1_1]);
    }
  }
  sum[l1] = (sum[l1]) + (d_local[0]);
  Free(d_local);
}
```

I had originally planned to write `cacheReads` and `cacheWrites` wrappers so we could use them just like their TVM cousins, but they just ended up being big masses of checking that reads or writes weren't present. They didn't feel too useful, so I removed them, but let me know.

This is based on bounds inference and inherits a few bugs present in that functionality, which I will address in a followup.

While working on this I realized that it overlaps heavily with `computeAt`: which is really just `cacheReads` + `computeInline`. I'm considering refactoring computeAt to be a wrapper around those two transforms. ZolotukhinM opinions on this?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45869

Reviewed By: mruberry

Differential Revision: D24195276

Pulled By: nickgg

fbshipit-source-id: 36a58ae265f346903187ebc4923637b628048155
2020-10-08 14:13:28 -07:00
8e8fb8542e upgrade clang-tidy to 11 (#46043)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46043

As title, this is necessary for some internal linter thing

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D24197316

Pulled By: suo

fbshipit-source-id: 07e69fd6ce1937a0caa5838d6995eeed1be5162d
2020-10-08 13:52:58 -07:00
f010df35e5 Added CUDA support for complex input for QR decomposition (#45032)
Summary:
QR decomposition now works for complex inputs on GPU.

Ref. https://github.com/pytorch/pytorch/issues/33152
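
A sketch of the newly supported call (assumes a CUDA device is available):

```
import torch

a = torch.randn(3, 3, dtype=torch.complex64, device="cuda")
q, r = torch.qr(a)
assert torch.allclose(q @ r, a, atol=1e-5)
```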

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45032

Reviewed By: ailzhang

Differential Revision: D24199105

Pulled By: anjali411

fbshipit-source-id: 249552b31fd713446e609b66e508ac54b817b98e
2020-10-08 13:24:21 -07:00
5f7545adf6 Update randomtemp to v0.3 (#46025)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45982.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46025

Reviewed By: walterddr, mruberry

Differential Revision: D24197124

Pulled By: malfet

fbshipit-source-id: fcb96655375ed7b6c784a5170c6a27e7e13465f1
2020-10-08 12:12:02 -07:00
1197a38a63 [JIT] Bind log1p and lgamma (#45791)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45791

Most of the lowering for log1p and lgamma already existed, add JIT integration.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24169536

Pulled By: eellison

fbshipit-source-id: a009c77a3471f3b5d378bad5de6d8e0880e9da3c
2020-10-08 12:06:34 -07:00
338283057b [JIT] [3/3] Make sure fusion occurs in test_tensorexpr (#45790)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45790

Making sure that more tests invoke a run with a Fusion Group.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24169534

Pulled By: eellison

fbshipit-source-id: a2666df53fbb12c64571e960f59dbe94df2437e4
2020-10-08 12:06:25 -07:00
564296f051 [2/3] [JIT] Make sure fusion occurs in test_tensorexpr (#45789)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45789

Making sure that more tests invoke a run with a Fusion Group.

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D24169535

Pulled By: eellison

fbshipit-source-id: 54d7af434772ba52144b12d15d32ae30460c0c3c
2020-10-08 12:06:16 -07:00
1b97ffa07a [1/3] [JIT] Make sure fusion occurs in test_tensorexpr file (#45788)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45788

We were only running the traced graph once, at which point it would not yet have been fused. We should run for num_profiled_runs + 1, and also assert that all nodes in the graph were fused.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24169537

Pulled By: eellison

fbshipit-source-id: 8499bb1a5bd9d2221b1f1c54d6352558cf07ba9a
2020-10-08 12:02:57 -07:00
636eb18029 Fixed median nan propagation and implemented nanmedian (#45847)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45847

Original PR here https://github.com/pytorch/pytorch/pull/45084. Created this one because I was having problems with ghstack.
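
A quick sketch of the new behavior:

```
import torch

t = torch.tensor([1., float('nan'), 2.])
torch.median(t)     # tensor(nan): NaN now propagates
torch.nanmedian(t)  # tensor(1.): NaN ignored (lower of the two middle values)
```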

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24136629

Pulled By: heitorschueroff

fbshipit-source-id: dd7c7540a33f6a19e1ad70ba2479d5de44abbdf9
2020-10-08 11:20:21 -07:00
298e0e0d57 Refactor gather_ranges_to_dense from Python to C++ (#46021)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46021

Refactor gather_ranges_to_dense from Python to C++

https://www.internalfb.com/intern/tasks/?t=71935517

Test Plan:
General build/test:
```
buck build -c python.helpers=true fbcode/caffe2
buck test -c python.helpers=true fbcode/caffe2
```

Specific Test:
```buck test mode/dev-nosan //caffe2/torch/fb/sparsenn:test -- 'test_gather_ranges_to_dense \(caffe2\.torch\.fb\.sparsenn\.tests\.sparsenn_operators_test\.SparseNNOperatorsTest\)'
```

Reviewed By: houseroad

Differential Revision: D23858186

fbshipit-source-id: 8bce7c279275c8ff7316901b455e1d1dd7e36b13
2020-10-08 11:03:06 -07:00
d360402f34 Use out variants of functions used by linalg.norm, where possible (#45641)
Summary:
Closes https://github.com/pytorch/pytorch/issues/45669

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45641

Reviewed By: ngimel

Differential Revision: D24186731

Pulled By: mruberry

fbshipit-source-id: 7e3d12ef34704bf461b8de19830e7b2f73f3739b
2020-10-08 10:55:35 -07:00
d3d8da7a8e Enable CUDA Fuser for ROCm (#45965)
Summary:
This enables the CUDA fuser on ROCm and enables its tests.

Part of this patch is based on the work of Rohith Nallamaddi, thank you.
Any errors are my own, of course.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45965

Reviewed By: seemethere

Differential Revision: D24170457

Pulled By: walterddr

fbshipit-source-id: 3dd25b3501a41d2f00acba3ce8642ce51c49c9a6
2020-10-08 10:41:56 -07:00
40828b68e1 Revert D24099167: [HTE @ clang-tidy] Enable clang-tidy configs inheritance for caffe2 project
Test Plan: revert-hammer

Differential Revision:
D24099167 (d93cae00f2)

Original commit changeset: 2e092fe678ad

fbshipit-source-id: bbc73556a1b4d341c2db445fe4ebfb6ee6ba269f
2020-10-08 10:30:50 -07:00
283ae1998c Pin libuv to 1.39 in Windows CI in order to keep version alignment in read me document (#46015)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46015

Reviewed By: mruberry

Differential Revision: D24193319

Pulled By: mrshenli

fbshipit-source-id: b300116e7ed189a888cb980b63c67d1d402b01b9
2020-10-08 10:06:05 -07:00
ea4fbb2e5e [StaticRuntime] Replace hashtable based workspace with vector<IValue> (#45892)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45892

Previously we were using a hashtable (`std::unordered_map` in OSS, `folly::F14FastMap` in fb) for the workspace, a container for all the IValues in the graph. Hashtable-based lookups can be expensive. This diff replaces the hashtable with `std::vector`, and extra bookkeeping is introduced to keep track of the indices of graph inputs/outputs in `StaticRuntime` and op inputs/outputs in `ProcessedNode`.

Reviewed By: dzhulgakov

Differential Revision: D24098763

fbshipit-source-id: 337f835ee144985029b5fa2ab98f9bcc5e3606b6
2020-10-08 09:50:30 -07:00
735d5b8907 Add complex32 dtype support to CPU/GPU implementation of (#45339)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45339

Test Plan:
Imported from OSS

GPU implementation already works as-is:
$ python -c "import torch; a = torch.tensor([1j], dtype=torch.complex32, device=torch.device('cuda')); b = a.clone(); print(b); print(a)"
tensor([0.+1.j], device='cuda:0', dtype=torch.complex32)
tensor([0.+1.j], device='cuda:0', dtype=torch.complex32)

Test for CPU implementation:
$ python -c "import torch; a = torch.tensor([1j], dtype=torch.complex32); b = a.clone(); print(b); print(a)"
tensor([0.+1.j], dtype=torch.complex32)
tensor([0.+1.j], dtype=torch.complex32)

Reviewed By: malfet

Differential Revision: D23932649

Pulled By: soulitzer

fbshipit-source-id: 394b6e1f3d462ee8a010f56f4bb8404af92a066b
2020-10-08 09:29:25 -07:00
7d4f5060ad Fix doc about operator benchmark (#45853)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45853

The method name in the README is not consistent with the actual implementation.

Reviewed By: qizzzh

Differential Revision: D24114849

fbshipit-source-id: d979e324c768708e99b8cc5b87e261f17c22a883
2020-10-08 09:13:53 -07:00
acca11b898 [torchscript] Verbose logging of code location causing the error (#45908)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45908

As the title says, the existing logging does not explain the cause of the error.

Test Plan: unit tests pass.

Reviewed By: SplitInfinity

Differential Revision: D23609965

fbshipit-source-id: 818965176f7193c62035e3d2f0547bb525fea0fb
2020-10-08 06:15:49 -07:00
52f2db752d unify reproducibility notes (#45748)
Summary:
Many of our functions carry the same warnings about result reproducibility. Make them use a common template.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45748

Reviewed By: colesbury

Differential Revision: D24089114

Pulled By: ngimel

fbshipit-source-id: e6aa4ce6082f6e0f4ce2713c2bf1864ee1c3712a
2020-10-08 02:14:57 -07:00
d93cae00f2 [HTE @ clang-tidy] Enable clang-tidy configs inheritance for caffe2 project
Summary:
The primary HTE configuration (for the `HTE@clang-tidy` project) is stored in the parent config `~/fbsource/fbcode.clang-tidy`. This diff enables inheritance of that configuration.

Note: `facebook-hte-` checks will not be used until the switch to HTE2clang-tidy is made.
Note: `clang-diagnostic-*` will start working. As a result, clang warning messages can be duplicated: once from HTE and once from clang-diagnostic.

Test Plan: N/A

Reviewed By: wfarner

Differential Revision: D24099167

fbshipit-source-id: 2e092fe678ad3e53a4cef301ce1cb737cf8401e7
2020-10-08 01:35:55 -07:00
9dc9a55bc4 Fix TypeError when torch.jit.load is passed a pathlib.Path (#45825)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45824
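
A minimal sketch of the now-working call (the file name is a placeholder):

```
import pathlib
import torch

# after this fix, a pathlib.Path is accepted the same way a str path is
model = torch.jit.load(pathlib.Path("model.pt"))
```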

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45825

Reviewed By: VitalyFedyunin

Differential Revision: D24129441

Pulled By: gmagogsfm

fbshipit-source-id: 52a76e39c163206cee2d19967e333e948adefe99
2020-10-08 01:29:29 -07:00
6e4de44501 [TensorExpr] LoopNest: add a constructor that takes Stmt instead of list of Tensors. (#45949)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45949

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24156001

Pulled By: ZolotukhinM

fbshipit-source-id: 6f4f050b04e802e274c42ed64be74c21ba79c29f
2020-10-08 00:58:13 -07:00
1036b77416 [TensorExpr] LoopNest: replace output_tensors_ with output_bufs_. (#45948)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45948

No functionality changes expected; this is just preparation for further changes in the LoopNest interface.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24156000

Pulled By: ZolotukhinM

fbshipit-source-id: f95ab07aac0aba128bc4ed5376a3251ac9c31c06
2020-10-08 00:58:10 -07:00
29da553dd9 [TensorExpr] Loopnest: unify intermediate_tensors_ and temp_bufs_. (#45947)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45947

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24155999

Pulled By: ZolotukhinM

fbshipit-source-id: d82acf6aba570f6a675eea683c306088e2a41f91
2020-10-08 00:58:08 -07:00
598caddd93 [TensorExpr] Add shorthand versions for splitWith{Mask,Tail} functions. (#45946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45946

Also, make these functions static - they are not using anything from
`LoopNest` and can be applied to any `Stmt`.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24156002

Pulled By: ZolotukhinM

fbshipit-source-id: 1c7d205f85a2a1684e07eb836af662f10d0a50fc
2020-10-08 00:58:06 -07:00
b65ffa365c [TensorExpr] Nuke Function class and directly use Tensor instead. (#45936)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45936

`Tensor` has been a view into a `Function` that was supposed to be used
for a more general case when we have multiple computations over the same
domain (aka multiple output functions). We never got to a point
where we needed this, and now have other ideas in mind on how to support
this case if need be. For now, let's just nuke `Function` to reduce the
overall system complexity.

The change should not affect any existing behavior.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24153214

Pulled By: ZolotukhinM

fbshipit-source-id: 26d5f11db5d661ff5e1135f4a49eff1c6d4c1bd5
2020-10-08 00:55:31 -07:00
c9caa828f5 Throw special exception when backend compilation is met with fatal error (#45952)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45952

Pull Request resolved: https://github.com/pytorch/glow/pull/4967

When Glow compilation meets with a nonrecoverable fatal error (the hardware is busted), we would like to throw a special exception, distinct from the normal caffe2::EnforceNotMet, so that we can signal the upper-layer application to handle it differently.

Test Plan: Manually inject an error, add LOG(FATAL) in the special exception path, and wait for the application to fatal.

Reviewed By: ipiszy

Differential Revision: D24156792

fbshipit-source-id: 4ae21bb0d36c89eac331fc52dd4682826b3ea180
2020-10-08 00:46:01 -07:00
a92b49f7c8 [Onnxifi] Don't throw exception when we cannot write out debug files (#45979)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45979

For some reason, sometimes we cannot write out the debug files. This shouldn't block the whole service, so we opt to log an error instead of throwing an exception.

Test Plan: Run the net_runner test at `/` and observe the error being printed while the test still passes.

Reviewed By: ipiszy

Differential Revision: D24165081

fbshipit-source-id: a4e1d0479d54d741e615e3a00b3003f512394fd4
2020-10-08 00:18:24 -07:00
99d3f37bd4 Run gradgradcheck on torch.fft transforms (#46004)
Summary:
Ref https://github.com/pytorch/pytorch/issues/42175

As already noted in the `torch.fft` `gradcheck` tests, `gradcheck` isn't fully working for complex types yet and the function inputs need to be real. A similar workaround works for `gradgradcheck`: viewing the complex outputs as real before returning them makes `gradgradcheck` pass.
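
A minimal sketch of that workaround, assuming a real-valued double-precision input:

```
import torch
from torch.autograd import gradgradcheck

x = torch.randn(8, dtype=torch.double, requires_grad=True)  # real input

def fft_as_real(t):
    # view the complex output as real so gradgradcheck can handle it
    return torch.view_as_real(torch.fft.fft(t))

assert gradgradcheck(fft_as_real, (x,))
```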

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46004

Reviewed By: ngimel

Differential Revision: D24187000

Pulled By: mruberry

fbshipit-source-id: 33c2986b07bac282dff1bd4f2109beb70e47bf79
2020-10-08 00:02:05 -07:00
c19b9cd18d Add torch::cuda::ncll::all2all (#45900)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45900

Use `torch::cuda::nccl::all2all` from `ProcessGroupNCCL.cpp`

Fixes https://github.com/pytorch/pytorch/issues/42517

Here is a NCCL dependency graph:
```
libnccl.a --> libtorch_cuda.so ---> libtorch_python.so
    |                                   ^
    |                                   |
    --------> libc10d.a -----------------
```
When a static library is linked into a dynamic library or an executable, the linker removes all unused/duplicate symbols from that library, unless the `-whole-archive` option is used. Before https://github.com/pytorch/pytorch/pull/42514, all NCCL calls made from `ProcessGroupNCCL.cpp` were also made from `torch/csrc/cuda/nccl.cpp`, which is compiled as part of `libtorch_cuda.so`.
But adding `ncclSend`/`ncclRecv` to `ProcessGroupNCCL.cpp` forced the linker to embed those into `libtorch_python.so`, which also resulted in linking other dependent symbols into the library.

This PR adds the `nccl[Send|Recv]` calls to `torch_cuda.so` by implementing `all2all` in `torch_cuda`, thus avoiding double-linking the static library.

A more involved, but less error-prone, solution would be to use the wrappers exported in the `torch::cuda::nccl` namespace instead of making direct NCCL API calls.

Test Plan: Imported from OSS

Reviewed By: mingzhe09088

Differential Revision: D24138011

Pulled By: malfet

fbshipit-source-id: 33305197fc7d8707b7fd3a66b543f7733b9241a1
2020-10-07 23:56:31 -07:00
ef4817fe5a Add tensor_split function, based on numpy.array_split (#45168)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/9382
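
A quick sketch of the numpy.array_split-style behavior (illustrative):

```
import torch

t = torch.arange(7)
# uneven splits are allowed: earlier chunks get the extra elements
print(torch.tensor_split(t, 3))
# (tensor([0, 1, 2]), tensor([3, 4]), tensor([5, 6]))
```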

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45168

Reviewed By: ngimel

Differential Revision: D24166164

Pulled By: mruberry

fbshipit-source-id: 795459821e52885bc99623a01a2abec060995ce6
2020-10-07 23:14:48 -07:00
b2bff9e431 Workaround for cublas bug for 45724 (#46001)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45724

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46001

Reviewed By: mruberry

Differential Revision: D24184058

Pulled By: ngimel

fbshipit-source-id: 7d2bab3206ddbc10a7cae3efd9b5e253f38400a9
2020-10-07 22:38:19 -07:00
8d14b50e94 codegen: Improve array default handing (#45163)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45163

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D24132279

Pulled By: mruberry

fbshipit-source-id: 77069e7526b35cf8d13ba448e313c90f20cc67cf
2020-10-07 22:27:28 -07:00
00b8ebe60c [FX] Preserve type annotations on generated code in Graph (#45880)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45880

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Differential Revision: D24127303

Pulled By: jamesr66a

fbshipit-source-id: 3a042bcfb0bf9f58ac318cc814dfc3cca683c7f8
2020-10-07 21:34:47 -07:00
81d40aaf96 Add [zc]heevd to the list of MKL symbols exported from torch_cpu (#46002)
Summary:
cpu implementation of `torch.symeig` uses `[zc]heev`, but MAGMA only have `d`-suffixed flavors of those functions

Fixes https://github.com/pytorch/pytorch/issues/45922

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46002

Reviewed By: walterddr

Differential Revision: D24177730

Pulled By: malfet

fbshipit-source-id: 0e9aeb60a83f8a4b8ac2a86288721bd362b6040b
2020-10-07 20:50:10 -07:00
c59c4b0d77 Fix cholesky TF32 tests (#45492)
Summary:
This test is changed one day before the landing of the tf32 tests PR, therefore the fix for this is not included in that PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45492

Reviewed By: ezyang

Differential Revision: D24101876

Pulled By: ngimel

fbshipit-source-id: cb3615b2fb8acf17abe54cd18b1faec26582d6b6
2020-10-07 20:42:06 -07:00
903acc6b83 CUDA BFloat16 support of clamp, remainder, lshift, rshift (#45247)
Summary:
Add CUDA BFloat16 support of clamp, remainder, lshift, rshift

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45247

Reviewed By: dzhulgakov

Differential Revision: D24174258

Pulled By: ngimel

fbshipit-source-id: bfcd2d1b3746bb0527d590533f3c38b9c4d0a638
2020-10-07 20:37:06 -07:00
154347d82f Fix distributed documentation for asynchronous collective Work objects (#45709)
Summary:
Closes https://github.com/pytorch/pytorch/issues/42247. Clarifies some documentation related to `Work` object semantics (the outputs of async collective functions). Clarifies the difference between CPU operations and CUDA operations (on the Gloo or NCCL backend), and provides an example where understanding a CUDA operation's wait() semantics is necessary for correct code.
![sync](https://user-images.githubusercontent.com/8039770/94875710-6f64e780-040a-11eb-8fb5-e94fd53534e5.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45709

Reviewed By: ngimel

Differential Revision: D24171256

Pulled By: rohan-varma

fbshipit-source-id: 6365a569ef477b59eb2ac0a8a9a1c1f34eb60e22
2020-10-07 19:59:51 -07:00
19da1d22fe [NNC] Registerizer V2, supporting partial and conditional replacement (#45574)
Summary:
This is a rewrite of the Registerizer, supporting scalar replacement in *vastly* more situations. As a refresher, the registerizer does this:

Before:
```
A[0] = 0;
for (int x = 0; x < 10; x++) {
  A[0] = (A[0]) + x;
}
```
After:
```
int A_ = 0;
for (int x = 0; x < 10; x++) {
  A_ = x + A_;
}
A[0] = A_;
```

Which can greatly reduce the number of accesses to main memory in a kernel. There are cases where doing this gets complicated, and the existing implementation bails out whenever it encounters multiple partial overlaps of the same buffer, or conditional accesses under any circumstances. This makes it much less useful in the presence of complex (i.e., real-world rather than example) kernels. This new version should work optimally in almost all cases (I have a few minor follow-ups).

I tested this version extensively, and found quite a few bugs in the original implementation that I'd prefer not to backport fixes for - so I'm in favor of landing this even if we don't immediately see a perf win. I believe the killer app for this kind of optimization is fused reductions, and we haven't enabled many examples of that yet.

It is safe to move two accesses of the same Tensor element to a local scalar Var if between all usages of the element there are no other Loads or Stores that may refer to it. In the comments I refer to this as overlapping the access, or "cutting" the existing AccessInfo. In the case where a candidate for registerization is cut, it may be possible to finalize the access early by writing it back to the Tensor and then create a new scalar variable after the overlapping access is complete. We will attempt to do this when it saves memory accesses.

There are a few cases that make this more challenging:

 - For: Loops change the number of real usages of a buffer by the loop extent, but only if we can pull the definition and finalization of the scalar variable out of the loop block. For loops often create accesses which are conditional on a loop var and will overlap large ranges of elements.

E.g. Before:
```
A[0] = 2;
for (int x1 = 0; x1 < 10; x1++) {
  A[0] = (A[0]) + x1;
}
for (int x2 = 1; x2 < 10; x2++) {
  A[x2] = A[x2 - 1];
}
for (int x3 = 0; x3 < 10; x3++) {
  A[0] = (A[0]) + x3;
}
```
After:
```
int A_1 = 2;
for (int x1 = 0; x1 < 10; x1++) {
  A_1 = A_1 + x1;
}
A[0] = A_1;
for (int x2 = 1; x2 < 10; x2++) {
  A[x2] = A[x2 - 1];
}
int A_2 = A[0];
for (int x3 = 0; x3 < 10; x3++) {
  A_2 = A_2 + x3;
}
A[0] = A_2;
```
- Cond: Conditions complicate lifting scalars out of internal scopes. Generally we cannot lift an access outside of a conditional scope unless there is already a reference to that same access at the higher scope, since we don't know if the condition was guarding an array access not safe at the higher scope. In the comments I refer to this as the condition "hiding" the access, and the outer access "unhiding" it.

E.g. this example:
```
if (x<5 ? 1 : 0) {
  A[x] = (A[x]) + 1;
}
A[x] = (A[x]) + 1;
if (x>5 ? 1 : 0) {
  A[x] = (A[x]) + 1;
}
```
The A[x] access can be registerized due to the unconditional access between the two conditions:
```
int A_1 = A[x];
if (x<5 ? 1 : 0) {
  A_1 = A_1 + 1;
}
A_1 = A_1 + 1;
if (x>5 ? 1 : 0) {
  A_1 = A_1 + 1;
}
A[x] = A_1;
```
But this example has no accesses that can be registerized:
```
if (x<5 ? 1 : 0) {
  A[x] = (A[x]) + 1;
}
if (x>5 ? 1 : 0) {
  A[x] = (A[x]) + 1;
}
```

- IfThenElse: Same situation as Cond, except since IfThenElse is an Expr rather than a Stmt we cannot insert the scalar definition or finalizer within the conditional scope. Accesses inside an IfThenElse can be safely combined with external accesses but cannot exist completely within.

E.g. in this example the `B[x]` access cannot be registerized, as there is no safe place to define it.
```
A[x] = IfThenElse(x<3 ? 1 : 0, (B[x]) + (B[x]), B[x]);
```

But the equivalent kernel using Cond can be registerized:
```
if (x<3 ? 1 : 0) {
  float B_1 = B[x];
  A[x] = B_1 + B_1;
} else {
  A[x] = B[x];
}
```
- Let: Accesses dependent on local variables via Let Stmts, or loop vars, cannot be raised outside of the scope of the dependent var.

E.g. no accesses in this example can be registerized:
```
for (int x = 0; x < 10; x++) {
  int y = 30;
  A[y] = x + (A[y]);
}
```

But they can in this example:
```
int y = 30;
for (int x = 0; x < 10; x++) {
  A[y] = x + (A[y]);
}
```

**Testing**

The majority of this PR is tests, over 3k lines of them, because there are many different rules to consider and they can interact together more or less arbitrarily. I'd greatly appreciate any ideas for situations we could encounter that are not covered by the tests.

**Performance**

Still working on it, will update. In many FastRRNS sub kernels this diff reduces the number of total calls to Store or Load by 4x, but since those kernels use Concat very heavily (meaning a lot of branches) the actual number encountered by any particular thread on GPU is reduced only slightly. Overall perf improved by a very small amount.

Reductions are where this optimization should really shine, and in particular the more complex the kernel gets (with extra fusions, etc.) the better this version of the registerizer should do compared to the existing version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45574

Reviewed By: albanD

Differential Revision: D24151517

Pulled By: nickgg

fbshipit-source-id: 9f0b2d98cc213eeea3fda16fee3d144d49fd79ae
2020-10-07 18:17:27 -07:00
a36f11a3a5 [FakeLowP] T76913842 Make AddFakeFp16 take int inputs (#45992)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45992

Created a template version of AddFakeFp16 to take both float and int inputs.

Test Plan: notebook with local bento kernel: N369049

Reviewed By: amylittleyang

Differential Revision: D24169720

fbshipit-source-id: 679de391224f65f6c5b3ca890eb0d157f09712f6
2020-10-07 17:43:00 -07:00
c86655a815 [JIT] Fix Dict bug in constant hashing (#45929)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45929

We were checking `and` when we should have been checking `or`.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24148804

Pulled By: eellison

fbshipit-source-id: 9c394ea10ac91a588169d934b1e3208512c71b9d
2020-10-07 17:40:17 -07:00
72e4f51bc0 [JIT] fix dict update (#45857)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45857

Fix for https://github.com/pytorch/pytorch/issues/45627

Op was calling `insert` instead of `insert_or_assign`, so it wouldn't overwrite an existing key.
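
A minimal sketch of the behavior this fixes (illustrative, not a test from the PR):

```
from typing import Dict

import torch

@torch.jit.script
def overwrite(d: Dict[str, int]) -> Dict[str, int]:
    d.update({"a": 2})  # before this fix, the existing key kept its old value
    return d

print(overwrite({"a": 1}))  # expected: {'a': 2}
```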

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D24148805

Pulled By: eellison

fbshipit-source-id: bf39c71d5d928890b82cff1a9a0985dc47c1ffac
2020-10-07 17:36:02 -07:00
de0d0bd5ee Revert D24093032: Improve logging in ProcessGroupNCCL for debugging purposes.
Test Plan: revert-hammer

Differential Revision:
D24093032 (c8d76ff7dc)

Original commit changeset: 240b03562f8c

fbshipit-source-id: dab7d54a5ba517bb308a1825b0d63ed146e5269d
2020-10-07 16:41:35 -07:00
505be08c75 [dist_optim] serialize compilation when creating dist_optim (#45871)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45871

Attempt to fix https://github.com/pytorch/pytorch/issues/45845

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D24125209

Pulled By: wanchaol

fbshipit-source-id: e3697dd6ef107d8153d2a82d78a17c66d109b4fa
2020-10-07 15:10:41 -07:00
ce82b522c8 Define objects using classes instead of namedtuples in torch.utils.data._utils.worker (#45870)
Summary:
This PR fixes a bug when torch is used with pyspark, by converting namedtuples in `torch.utils.data._utils.worker` into classes.

Before this PR, creating an IterableDataset and then running `list(torch.utils.data.DataLoader(MyIterableDataset(...), num_workers=2))` will not terminate if pyspark is also being used. This is because pyspark hijacks namedtuples to make them pickleable ([see here](https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L370)). So `_IterableDatasetStopIteration` would be modified, and then the check at `torch/utils/data/dataloader.py:1072` (as of commit 5472426b9f) is never true.
Converting the namedtuples to classes avoids this hijack and allows the iteration to correctly stop when signaled.
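
For illustration, a minimal sketch of the difference (the signal class shown is simplified, not the exact code in `worker.py`):

```
from collections import namedtuple

# pyspark rewrites namedtuple classes to make them picklable, so a worker
# signal defined as a namedtuple may be replaced by a different class at
# runtime, and isinstance() checks against the original class stop matching.
_StopSignal = namedtuple('_StopSignal', ['worker_id'])

# a plain class is left untouched by that hack, so isinstance() keeps working
class _IterableDatasetStopIteration:
    def __init__(self, worker_id: int):
        self.worker_id = worker_id
```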

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45870

Reviewed By: ngimel

Differential Revision: D24162748

Pulled By: albanD

fbshipit-source-id: 52f009784500fa594b2bbd15a8b2e486e00c37fb
2020-10-07 15:03:38 -07:00
0927e02a6a [caffe2] Do not run RemoveOpsByType on recurrent networks (#45986)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45986

Recurrent networks have subnets that are not well supported by `RemoveOpsByType`. Here we exclude recurrent networks by adding the same check as in memonger.

Test Plan:
```
buck test //caffe2/caffe2/fb/predictor:black_box_predictor_test
```

AdIndexer canary for sanity check:
https://www.internalfb.com/intern/ads/canary/430059485214766620

Differential Revision: D24167284

fbshipit-source-id: fa90d1c1f34af334a599d879af09d4c0bf7c27bd
2020-10-07 14:07:52 -07:00
c8d76ff7dc Improve logging in ProcessGroupNCCL for debugging purposes. (#45780)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45780

When training jobs running with NCCL fail sometimes it is hard to
debug the reason of the failure and our logging doesn't provide enough
information at times to narrow down the issue.

To improve the debugging experience, I've enhanced our logging to add a lot
more information about what the ProcessGroup is doing under the hood.

Closes: https://github.com/pytorch/pytorch/issues/45310

Sample output:
```
> I1002 15:18:48.539551 1822062 ProcessGroupNCCL.cpp:528] [Rank 2] NCCL watchdog thread started!
> I1002 15:18:48.539533 1821946 ProcessGroupNCCL.cpp:492] [Rank 2] ProcessGroupNCCL initialized with following options:
> NCCL_ASYNC_ERROR_HANDLING: 0
> NCCL_BLOCKING_WAIT: 1
> TIMEOUT(ms): 1000
> USE_HIGH_PRIORITY_STREAM: 0
> I1002 15:18:51.080338 1822035 ProcessGroupNCCL.cpp:530] [Rank 1] NCCL watchdog thread terminated normally
> I1002 15:18:52.161218 1821930 ProcessGroupNCCL.cpp:385] [Rank 0] Wrote aborted communicator id to store: NCCLABORTEDCOMM:a0e17500002836080c8384c50000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> I1002 15:18:52.161238 1821930 ProcessGroupNCCL.cpp:388] [Rank 0] Caught collective operation timeout for work: WorkNCCL(OpType=ALLREDUCE, TensorShape=[10], Timeout(ms)=1000)
> I1002 15:18:52.162120 1821957 ProcessGroupNCCL.cpp:530] [Rank 0] NCCL watchdog thread terminated normally
> I1002 15:18:58.539937 1822062 ProcessGroupNCCL.cpp:649] [Rank 2] Found key in store: NCCLABORTEDCOMM:a0e17500002836080c8384c50000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, from rank: 0, aborting appropriate communicators
> I1002 15:19:34.740937 1822062 ProcessGroupNCCL.cpp:662] [Rank 2] Aborted communicators for key in store: NCCLABORTEDCOMM:a0e17500002836080c8384c50000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> I1002 15:19:34.741678 1822062 ProcessGroupNCCL.cpp:530] [Rank 2] NCCL watchdog thread terminated normally
```
ghstack-source-id: 113731163

Test Plan: waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D24093032

fbshipit-source-id: 240b03562f8ccccc3d872538f5e331df598ceca7
2020-10-07 12:18:41 -07:00
8fb32b9f55 Parametrize # of longest tests in print_test_stats (#45941)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45941

This adds CLI options to the `test/print_test_stats.py` script for specifying how many of the longest tests should be printed. It also makes the following incidental changes:
- The script now has a `--help` option to describe its usage.
- The number of longest tests being shown is now displayed as a number, rather than in words.
- The median time is now displayed with the label `median_time` instead of `mean_time`, is calculated using `statistics.median` instead of raw indexing and bit shifting, and is displayed even when there are only two tests in a class.

Test Plan: Imported from OSS

Reviewed By: walterddr, seemethere

Differential Revision: D24154491

Pulled By: samestep

fbshipit-source-id: 9fa402bf0fa56badd505f87f289ac9cca1862d6b
2020-10-07 11:49:36 -07:00
9679e1affc annotate torch.autograd.* modules (#45004)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44638

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45004

Reviewed By: VitalyFedyunin

Differential Revision: D24113562

Pulled By: ezyang

fbshipit-source-id: a85018b7e08b2fe6cf2bc14a217eb418cb2b9de4
2020-10-07 10:53:41 -07:00
83d2c9a232 [quant] Add quantized Sigmoid module (#45883)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45883

Test Plan:
python test/test_quantization.py TestStaticQuantizedModule.test_sigmoid

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D24129116

fbshipit-source-id: aa960549509c60374012f35b1f5be39e90418099
2020-10-07 10:33:18 -07:00
30bf799f9c torch.matrix_exp doc fix (#45909)
Summary:
As per title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45909

Reviewed By: dzhulgakov

Differential Revision: D24147314

Pulled By: albanD

fbshipit-source-id: fc21094f4dbdd04cc2063a9639b9d1f5728cb53f
2020-10-07 10:23:37 -07:00
b186831c08 Automatic update of fbcode/foxi to 6a4e19a2aaf7ae4b9fa9597526e65b395d5e79ad (#45951)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45951

Pull Request resolved: https://github.com/pytorch/glow/pull/4966

Previous import was 4aba696ec8f31794fd42880346dc586486205e0a

Included changes:
- **[6a4e19a](https://github.com/houseroad/foxi/commit/6a4e19a)**: Add fatal error value (#20) <Yinghai Lu>

Test Plan: build

Reviewed By: houseroad

Differential Revision: D24156364

fbshipit-source-id: f833ada8d6586865e1831e2c8c632e3844c7b6a1
2020-10-07 09:55:52 -07:00
5a2773702f add test sharding to CUDA on linux (#45972)
Summary:
Splits up all the CUDA Linux tests into 2 shards to decrease total test runtime.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45972

Reviewed By: malfet

Differential Revision: D24163521

Pulled By: janeyx99

fbshipit-source-id: da6e88eb4305192fb287c4458c31199bf62354c0
2020-10-07 09:31:44 -07:00
5ce31b6f3f [ONNX] Improve error handling for adaptive_pool (#45874)
Summary:
Duplicate of https://github.com/pytorch/pytorch/issues/43032
This update would also improve error handling for interpolate with 'area' mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45874

Reviewed By: albanD

Differential Revision: D24141266

Pulled By: bzinodev

fbshipit-source-id: 7559f1d6af4f1ef3507c15a1aee76fe01fa433cd
2020-10-07 09:20:35 -07:00
1bb2d41b68 Revert D20850851: caffe2/plan_executor: wait for 1 minute after exception and then abort
Test Plan: revert-hammer

Differential Revision:
D20850851 (3fbddb92b1)

Original commit changeset: 330503775d80

fbshipit-source-id: 612c6c3c4d5586bc8ad00a112cd00fc74fb44243
2020-10-07 09:04:24 -07:00
5640b79bf8 Allow consumer ops to sync on GraphRoot's gradient (#45787)
Summary:
Currently, a GraphRoot instance doesn't have an associated stream.  Streaming backward synchronization logic assumes the instance ran on the default stream, and tells consumer ops to sync with the default stream.  If the gradient the GraphRoot instance passes to consumer backward ops was populated on a non-default stream, we have a race condition.

The race condition can exist even if the user doesn't give a manually populated gradient:
```python
with torch.cuda.stream(side_stream):
    # loss.backward() implicitly synthesizes a one-element 1.0 tensor on side_stream
    # GraphRoot passes it to consumers, but consumers first sync on default stream, not side_stream.
    loss.backward()

    # Internally to backward(), streaming-backward logic takes over, stuff executes on the same stream it ran on in forward,
    # and the side_stream context is irrelevant.  GraphRoot's interaction with its first consumer(s) is the spot where
    # the side_stream context causes a problem.
```

This PR fixes the race condition by associating a GraphRoot instance, at construction time, with the current stream(s) on the device(s) of the grads it will pass to consumers. (I think this relies on GraphRoot executing in the main thread, before the backward thread(s) fork, because the grads were populated on the main thread.)

The test demonstrates the race condition. It fails reliably without the PR's GraphRoot diffs and passes with the GraphRoot diffs.

With the GraphRoot diffs, manually populating an incoming-gradient arg for `backward` (or `torch.autograd.grad`) and the actual call to `autograd.backward` will have the same stream-semantics relationship as any other pair of ops:
```python
# implicit population is safe
with torch.cuda.stream(side_stream):
    loss.backward()

# explicit population in side stream then backward in side stream is safe
with torch.cuda.stream(side_stream):
    kickoff_grad = torch.ones_like(loss)
    loss.backward(gradient=kickoff_grad)

# explicit population in one stream then backward kickoff in another stream
# is NOT safe, even with this PR's diffs, but that unsafety is consistent with
# stream-semantics relationship of any pair of ops
kickoff_grad = torch.ones_like(loss)
with torch.cuda.stream(side_stream):
    loss.backward(gradient=kickoff_grad)

# Safe, as you'd expect for any pair of ops
kickoff_grad = torch.ones_like(loss)
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    loss.backward(gradient=kickoff_grad)
```
This PR also adds the last three examples above to cuda docs and references them from autograd docstrings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45787

Reviewed By: nairbv

Differential Revision: D24138376

Pulled By: albanD

fbshipit-source-id: bc4cd9390f9f0358633db530b1b09f9c1080d2a3
2020-10-07 08:53:53 -07:00
bb99bea774 Compress NVCC flags for Windows (#45842)
Summary:
Fixes #{issue number}
This makes the command line shorter.
Also updates `randomtemp` in which the previous version has a limitation that the length of the argument cannot exceed 260.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45842

Reviewed By: albanD

Differential Revision: D24137088

Pulled By: ezyang

fbshipit-source-id: f0b4240735306e302eb3887f54a2b7af83c9f5dc
2020-10-07 08:39:15 -07:00
be45c3401a [JIT] Make objects throw Python AttributeError on nonexistent attr access (#45911)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45911
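
A minimal sketch of the new behavior (illustrative):

```
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x

m = torch.jit.script(M())
try:
    m.does_not_exist
except AttributeError:
    print("nonexistent attributes now raise AttributeError, as in eager mode")
```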

Test Plan: Imported from OSS

Reviewed By: robieta

Differential Revision: D24140971

Pulled By: jamesr66a

fbshipit-source-id: 046a2cffff898efad5bcc36a41bf992f36f555f9
2020-10-07 01:57:29 -07:00
8cdb638c62 [FX] Track use nodes in Node (#45775)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45775

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D24091082

Pulled By: jamesr66a

fbshipit-source-id: b09bb6ae78436a7722fb135b8ec71464ef9587cd
2020-10-07 00:15:04 -07:00
205ab49612 [packaging] simpler dependency plotting (#45686)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45686

This uses an online graphviz viewer. The code is simpler, and
since it embeds all the data in the url you can just click the url
from your terminal.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D24059157

Pulled By: zdevito

fbshipit-source-id: 94d755cc2986c4226180b09ba36f8d040dda47cc
2020-10-06 23:40:00 -07:00
317b6516bc [quant] Add quantized::sigmoid that take output_scale/output_zero_point as input (#45882)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45882

Same changes as the stack for leaky_relu: https://github.com/pytorch/pytorch/pull/45702

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D24129113

fbshipit-source-id: a26da33f877d3bdeea1976b69b2bd9369c2bf196
2020-10-06 23:30:18 -07:00
ed1552a48f Add note about in-place weight modification for nn.Embedding (#45595)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/26596

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45595

Reviewed By: albanD

Differential Revision: D24143456

Pulled By: mruberry

fbshipit-source-id: a884a32809105ce16959b40ec745ec873b3c8375
2020-10-06 23:11:39 -07:00
8b39498a23 codegen: Allow string arguments to have defaults (#45665)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45665

Fixes #43944

Note that the codegen doesn't use a proper parser so, in the same way as with lists, the string `, ` cannot appear in defaults or it will be interpreted as a splitting point between arguments.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D24141835

Pulled By: ezyang

fbshipit-source-id: 578127861fd2504917f4486c44100491a2c40343
2020-10-06 21:53:56 -07:00
1b31ed3ad6 [quant] Refactor qembeddingbag to remove duplicate code (#45881)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45881

Test Plan:
python test/test_quantization.py TestQuantizedEmbeddingBagOps

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D24127892

fbshipit-source-id: 344ee71d335b8c1d668c647db88775632e099dbd
2020-10-06 21:11:55 -07:00
43dc7ef933 [quant] Support for 4-bit quantized EmbeddingBag module (#45865)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45865

Test Plan:
python test/test_quantization.py TestPostTrainingStatic.test_quantized_embedding_bag
python test/test_quantization.py TestStaticQuantizedModule.test_embedding_bag_api

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D24120995

fbshipit-source-id: c55fc6b2cfd683d14d2a05be7c04f787fdf8cc79
2020-10-06 21:11:52 -07:00
11c32611d7 [quant] Support 4-bit embedding_bag operators using the dtype quint4x2 (#45752)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45752

Use the torch.quint4x2 dtype, introduced in the previous PR, to create 4-bit packed tensors.
These packed tensors can be directly consumed by the operator.
Serialization of the packed tensors is supported using torchbind custom class.
Module support will follow in a later PR.

Test Plan:
python test/test_quantization.py TestEmbeddingBagOps

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D24120996

fbshipit-source-id: 2639353b3343ebc69e058b5ba237d3fc56728e1c
2020-10-06 21:11:49 -07:00
5c283fa292 [quant] Add 4-bit embedding_bag prepack/unpack support using quint4x2 (#45751)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45751

Use the torch.quint4x2 dtype to create 4-bit packed tensors

Test Plan:
python test/test_quantization.py TestEmbeddingBagOps

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D24120997

fbshipit-source-id: 6aba2985715a346f6894cf43d5794e104a9ab061
2020-10-06 21:06:46 -07:00
e8d8de32b4 [StaticRuntime] Implement StaticRuntime::benchmark (#45639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45639

`StaticRuntime::run_individual` mimics the caffe2 operator benchmark `SimpleNet::TEST_Benchmark`, so we can get accurate information on the operator breakdown. We found that the PyTorch AutogradProfiler adds a lot of overhead to small models, such as the adindexer precomputation_merge net: 100% for batch_size 1, 33% for batch_size 20. This implementation adds very little overhead, as shown in the test plan.

Test Plan: Test results are fb internal only.

Reviewed By: yinghai, dzhulgakov

Differential Revision: D24012088

fbshipit-source-id: f32eb420aace93e2de421a15e4209fce6a3d90f0
2020-10-06 20:54:43 -07:00
275bb5e801 Fix flakiness in caffe2/test:serialization - test_serialization_new_format_old_format_compat (#45915)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45915

Use temp file instead

Test Plan: buck test mode/opt-asan //caffe2/test:serialization -- 'test_serialization_new_format_old_format_compat \(test_serialization\.TestBothSerialization\)' --run-disabled --jobs 18 --stress-runs 10 --record-results

Reviewed By: malfet

Differential Revision: D24142278

fbshipit-source-id: 9c88330fc5664d464daa9124e67644f497353f3b
2020-10-06 18:11:58 -07:00
4fdba30500 [JIT] Add API for ignoring arbitrary module attributes (#45262)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45262

**Summary**
This commit adds an API for ignoring arbitrary module attributes during
scripting. A class attribute named `ignored_attributes` containing names
of attributes to ignore can be added to the class of the instance being
scripted. Attributes ignored in this fashion cannot be used in
`forward`, methods used by `forward` or by `exported` methods. They
are, however, copied to the `RecursiveScriptModule` wrapper and can be
used by `ignored` methods and regular Python code.
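
A minimal sketch based on the description above (illustrative; the attribute name follows this commit's wording and may differ in the final API):

```
import torch

class MyModule(torch.nn.Module):
    # names listed here are skipped by the scripting compiler
    ignored_attributes = ["debug_blob"]

    def __init__(self):
        super().__init__()
        self.debug_blob = object()  # not scriptable, but ignored

    def forward(self, x):
        return x + 1

scripted = torch.jit.script(MyModule())
```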

**Test Plan**
This commit adds unit tests to `TestScriptPy3` to test this new API.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23971882

Pulled By: SplitInfinity

fbshipit-source-id: 8c81fb415fde7b78aa2f87e5d83a477e876a7cc3
2020-10-06 18:02:06 -07:00
49af421143 Embed callgrind headers (#45914)
Summary:
Because access to https://sourceware.org/git/valgrind.git can be really slow, especially in some regions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45914

Reviewed By: seemethere

Differential Revision: D24144420

Pulled By: malfet

fbshipit-source-id: a454c8c3182c570ec344bf6468bb5e55d8b8da79
2020-10-06 17:51:10 -07:00
f5e70a7504 fix test flakiness caused by sys.getrefcount(None) (#45876)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45876

sys.getrefcount() can return different values before/after the scope() call, making exact-refcount assertions flaky.
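
For context, a small illustration of why asserting exact refcounts of None is fragile:

```
import sys

# None's refcount is process-global; any code holding references to None
# (including test infrastructure) shifts the count between two calls.
before = sys.getrefcount(None)
_ = [None] * 10          # the list now holds 10 extra references to None
after = sys.getrefcount(None)
print(before, after)     # the two values need not match
```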

Test Plan: buck test mode/opt-asan //caffe2/test:others -- 'test_none_names_refcount \(test_namedtensor\.TestNamedTensor\)' --run-disabled

Reviewed By: malfet

Differential Revision: D24123724

fbshipit-source-id: 4af0b150222cfb92dd0776a42fcab44d896a772a
2020-10-06 17:32:07 -07:00
624084e6d6 [te][llvm] Enable fused multiply-add (fma) in code generation (#45906)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45906

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D24142404

Pulled By: bertmaher

fbshipit-source-id: a8db2e66c1e65bbb255886e165a1773723cbcd20
2020-10-06 16:57:34 -07:00
f2e569461b [te] Tiled (m=32 x n=32) gemm benchmark (#45905)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45905

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D24142402

Pulled By: bertmaher

fbshipit-source-id: b39e18b6985ee1c1f654fba4498ed91ff14d8d5f
2020-10-06 16:57:31 -07:00
50f89578dd [te] Add a benchmark harness (#45875)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45875

Adds a googlebenchmark harness for perf testing programs generated by
tensorexpr, sans any pytorch wrappings (for python-level benchmarks of
tensorexpr, see benchmarks/tensorexpr).

Currently there's a harness for gemm that sets up the problem using torch (and
also measures the perf of a torch::mm to give a baseline).

Right now there's just an unoptimized implementation that is expected to be not
very fast.  More optimized versions are coming.

Sample output from my dev box:
```
Run on (48 X 2501 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 256K (x24)
  L3 Unified 30720K (x2)
--------------------------------------------------------------------------------------------
Benchmark                                     Time           CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------
Gemm/Torch/128/128/128                    73405 ns      73403 ns       8614 GFLOPS=57.1411G/s
Gemm/TensorExprNoopt/128/128/128        3073003 ns    3072808 ns        229 GFLOPS=1.36497G/s
```

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D24142403

Pulled By: bertmaher

fbshipit-source-id: 3354aaa56868a43a553acd1ad9a192f28d8e3597
2020-10-06 16:57:27 -07:00
5ff31620b7 [te] Add a 2D convolution example test (#45514)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45514

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D24142405

Pulled By: bertmaher

fbshipit-source-id: 8f064d0638b48f55a732c08938b9fcf1ba3f0415
2020-10-06 16:54:46 -07:00
14997f2125 [quant][graphmode][fx] Add warning for unsupported case (#45714)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45714

Hit the problem when writing a test like the following:
```
class M(...):
      def forward(self, x):
          x = x.some_op()
          return x
```
we need to know the scope of `x` to figure out the qconfig for `x`.

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D24069959

fbshipit-source-id: 95ac8963c802ebce5d0e54d55f5ebb42085ca8a6
2020-10-06 15:33:34 -07:00
5072728d88 Fix stride printing/parsing formatting (#45156)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45156

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24078695

Pulled By: ansley

fbshipit-source-id: dab993277d43b31105c38d12098c37653747b42a
2020-10-06 15:06:46 -07:00
255b0e839f C++ APIs CUDA Stream Note (Set/Get part) (#45754)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45754

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D24085103

Pulled By: glaringlee

fbshipit-source-id: c9641c2baadcf93b84733c037ce91b670dde5f96
2020-10-06 14:57:16 -07:00
a3662fa78c Minor gradcheck update to reduce computations (#45757)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45757

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24137143

Pulled By: anjali411

fbshipit-source-id: e0174ec03d93b1fedf27baa72c3542dac0b70058
2020-10-06 13:59:01 -07:00
e154b36685 Standardized clamp kernels to Numpy-like implementation (#43288)
Summary:
**BC-breaking note**

For ease of exposition let a_min be the value of the "min" argument to clamp, and a_max be the value of the "max" argument to clamp.

This PR changes the behavior of torch.clamp to always compute min(max(a, a_min), a_max). torch.clamp currently computes this in its vectorized CPU specializations:

78b95b6204/aten/src/ATen/cpu/vec256/vec256_double.h (L304)

but in other places it clamps differently:

78b95b6204/aten/src/ATen/cpu/vec256/vec256_base.h (L624)

78b95b6204/aten/src/ATen/native/cuda/UnaryOpsKernel.cu (L160)

These implementations are the same when a_min < a_max, but divergent when a_min > a_max. This divergence is easily triggered:

```
t = torch.arange(200).to(torch.float)
torch.clamp(t, 4, 2)[0]
: tensor(2.)

torch.clamp(t.cuda(), 4, 2)[0]
: tensor(4., device='cuda:0')

torch.clamp(torch.tensor(0), 4, 2)
: tensor(4)
```

This PR makes the behavior consistent with NumPy's clip. C++'s std::clamp's behavior is undefined when a_min > a_max, but Clang's std::clamp will return 10 in this case (although the program, per the above comment, is in error). Python has no standard clamp implementation.

**PR Summary**

Fixes the discrepancy between the AVX, CUDA, and base vector implementations of clamp, such that all implementations consistently use the min(max_vec, max(min_vec, x)) formula, making clamp equivalent to numpy.clip in all implementations.

The same fix as in https://github.com/pytorch/pytorch/issues/32587 but isolated to the kernel change only, so that the internal team can benchmark.
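
With the unified formula, the divergent example above now agrees everywhere (a sketch of the expected post-fix behavior; the second call assumes a CUDA device is available):

```
import torch

t = torch.arange(200, dtype=torch.float)
# min(max(a, 4), 2) == 2 for every element, matching numpy.clip(t, 4, 2)
print(torch.clamp(t, 4, 2)[0])           # tensor(2.)
print(torch.clamp(t.cuda(), 4, 2)[0])    # tensor(2., device='cuda:0')
```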

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43288

Reviewed By: colesbury

Differential Revision: D24079453

Pulled By: mruberry

fbshipit-source-id: 67f30d2f2c86bbd3e87080b32f00e8fb131a53f7
2020-10-06 13:42:08 -07:00
a69a78daa2 Use smaller N to speed up TestForeach (#45785)
Summary:
Between September 25 and September 27, approximately half an hour was added to the running time of `pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test`. Judging from the CircleCI data, it looks like the majority of the new time was added by the following PRs:

- https://github.com/pytorch/pytorch/issues/44550
- https://github.com/pytorch/pytorch/issues/45298

I'm not sure what to do about https://github.com/pytorch/pytorch/issues/44550, but it looks like https://github.com/pytorch/pytorch/issues/45298 increased the `N` for `TestForeach` from just 20 to include both 30 and 300. This PR would remove the 300, decreasing the test time by a couple orders of magnitude (at least when running it on my devserver), from over ten minutes to just a few seconds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45785

Reviewed By: malfet

Differential Revision: D24094782

Pulled By: samestep

fbshipit-source-id: 2476cee9d513b2b07bc384de751e08d0e5d8b5e7
2020-10-06 13:29:04 -07:00
c1af91a13a [caffe2] SliceOp axes indexing fixes. (#45432)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45431

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45432

Reviewed By: albanD

Differential Revision: D24132547

Pulled By: dzhulgakov

fbshipit-source-id: d67f7a92d806fb8ac8fc8f522b251d3a8fb83037
2020-10-06 13:21:08 -07:00
3fbddb92b1 caffe2/plan_executor: wait for 1 minute after exception and then abort (#45297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45297

If we have two concurrent substeps and one of them throws an exception and the other is blocking, we'll currently hang. This waits up to 1 minute for it to complete before terminating the process.

Test Plan: buck test caffe2/caffe2:caffe2_test_cpu -- PlanExecutorTest --stress-runs 100

Reviewed By: dahsh

Differential Revision: D20850851

fbshipit-source-id: 330503775d8062a34645ba55fe38e6770de5e3c7
2020-10-06 12:59:09 -07:00
64681d6bec Add all remaining method declarations from torch.distributed Python API to C++ (#45768)
Summary:
Also ran formatter on previous sections

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45768

Reviewed By: wanchaol

Differential Revision: D24129467

Pulled By: gmagogsfm

fbshipit-source-id: aa8a5c45c3609d5b96e5f585b699d9e3e71394c8
2020-10-06 12:36:36 -07:00
0da6730f02 [quant][graphmode][fx][eagermode] Add leaky relu support in quantization workflows (#45712)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45712

Eager mode will still be able to use the functional leaky_relu, but it will be less accurate than the LeakyReLU module.
FX graph mode will support both the functional and module forms of leaky_relu.

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D24069961

fbshipit-source-id: 8d91c3c50c0bcd068ba3072378ebb4da9549be3b
2020-10-06 12:16:04 -07:00
fb50fcaa82 [C2] Add string equality operator (#45886)
Summary:
This diff adds a string equality checking operator.

Another attempt at reverted D24042344 (cf48872d28)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45886

Test Plan: unit tests, github builds

Reviewed By: dzhulgakov

Differential Revision: D24129953

fbshipit-source-id: caa53c7eac5c67c414c37e9d93416104f72556b9
2020-10-06 12:08:26 -07:00
fcc7f272de maximum number of threads per block for sm_86 is 1536 (#45889)
Summary:
According to https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45889

Reviewed By: albanD

Differential Revision: D24131188

Pulled By: ngimel

fbshipit-source-id: 31d3038f7b1bc403751448c62b19609573c67a49
2020-10-06 12:01:33 -07:00
ba642d36ff ReplicationPad accepts 0-dim batch size. (#39137)
Summary:
This PR patches the ReplicationPad modules in `torch.nn` to be compatible with 0-dim batch sizes.

EDIT: this is part of the work on gh-12013 (make all nn layers accept empty batch size)
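
A minimal sketch of what now works:

```
import torch

pad = torch.nn.ReplicationPad2d(1)
x = torch.empty(0, 3, 8, 8)   # batch size 0
print(pad(x).shape)           # torch.Size([0, 3, 10, 10])
```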

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39137

Reviewed By: albanD

Differential Revision: D24131386

Pulled By: ngimel

fbshipit-source-id: 3d93057cbe14d72571943c8979d5937e4bbf743a
2020-10-06 11:54:32 -07:00
8b7ee33ee6 [quant] Add quantized LeakyReLU module (#45711)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45711

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D24069960

fbshipit-source-id: ccdd294308e07fd215556a63fa47191c09a1519f
2020-10-06 11:34:48 -07:00
930bddd403 Cleanup nccl.cpp (#45899)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45899

Use function overloading to avoid repeated casts. I.e., instead of using `NCCL_CHECK(from_nccl_result(...))`, add a variant of the function that takes `ncclResult_t` as an input argument.
Also add a non-pointer variant of `to_nccl_comm` to avoid the `*to_nccl_comm(&comm)` pattern.

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D24138012

Pulled By: malfet

fbshipit-source-id: 7f62a03e108cbe455910e86e894afdd1c27e8ff1
2020-10-06 11:26:14 -07:00
d1fc1555d4 [quant] Add quantized::leaky_relu that takes scale/zero_point as input (#45702)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45702

https://github.com/pytorch/pytorch/issues/45593

Previously, quantized leaky_relu did not require observation and just inherited
the quantization parameters from its input, but that does not work very well in QAT.
This PR adds a quantized::leaky_relu that observes the output, and it will
become the default leaky_relu that our quantization tools produce (eager/graph mode).

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D24067681

fbshipit-source-id: d216738344363794b82bd3d75c8587a4b9415bca
2020-10-06 10:56:45 -07:00
001a7998b4 Disabling XNNPACK integration test in tsan mode (#45850)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45850

In TSAN mode most xnnpack integration tests seem to be failing. The reason for the
failure is not entirely clear, nor is it clear whether the failures are spurious.

Test Plan: python test/test_xnnpack_integration.py

Reviewed By: xcheng16

Differential Revision: D24113885

fbshipit-source-id: dc3de3ad3d4bf0210ad67211383dbe0e842b09dd
2020-10-06 10:49:58 -07:00
3510f19c5f added some more details + debugging steps to CONTRIBUTING.md (#45903)
Summary:
When attempting to install PyTorch locally on my MacBook, I had some difficulty running the setup steps and understanding what I was really doing. I've added some clarifications and summarized some debugging steps for `python setup.py develop` to lower the barrier to entry for new contributors.

I'm seeking a lot of review here since I am not sure if what I wrote is entirely the most useful or accurate. Thank you!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45903

Reviewed By: albanD

Differential Revision: D24140343

Pulled By: janeyx99

fbshipit-source-id: a5e40d1bc804945ae7db2b95ab18cf7fe169e68a
2020-10-06 10:40:17 -07:00
abedd9a274 Reduce size of test_unsqueeze to resolve consistent timeout issue (#45877)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45877

apex_test_L0_optimizers

Test Plan: `buck test mode/dev-tsan //caffe2/test:tensorexpr -- 'test_unsqueeze \(test_tensorexpr\.TestTensorExprFuser\)' --run-disabled`

Reviewed By: malfet

Differential Revision: D24126211

fbshipit-source-id: e38ba0168b6dd44459c070c01e3e39c93d5fae42
2020-10-06 10:33:20 -07:00
9728584cca Replaced whitelist with allowlist (#45796)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45796

Reviewed By: dzhulgakov

Differential Revision: D24125214

Pulled By: VitalyFedyunin

fbshipit-source-id: 5b06c1fdaa90a60e8a6efc2e61f37fd647cf0ae7
2020-10-06 09:18:51 -07:00
a09e1098e7 Profiling allocator for mobile. (#43951)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43951

AllocationPlan: Stores the sequence of allocations, their sizes
                and lifetime of the allocations. Along with this
                it also stores the total size of a single memory
                blob, total_size, required to satisfy all the allocations.
                It also stores the offsets in the blob, of size
                total_size, corresponding to each allocation.
                Thus allocation plan contains:
                - allocation sizes
                - allocation lifetimes
                - allocation offsets
                - total size
AllocationPlanner: Takes a pointer to the allocation plan and fills
                   it up with the plan, i.e. sizes, lifetimes, offsets,
                   total size.
                   This is done via WithProfileAllocationsGuard, which
                   takes in an AllocationPlan* and constructs an
                   AllocationPlanner*, setting the thread-local
                   allocation_planner to it.
                   MobileCPUAllocator profiles allocations via
                   allocation_planner.
                   In WithValidateAllocationsGuard, allocations profiled
                   in the allocation plan are validated.
CPUProfilingAllocator:
The application owns the CPUProfilingAllocator.
Using WithProfilingAllocatorGuard, it passes in both the CPUProfilingAllocator
and the AllocationPlan created earlier. The CPUProfilingAllocator will then
manage allocations and frees according to the plan. Allocations that
are not managed by the CPUProfilingAllocator are routed through
c10::alloc_cpu and c10::free_cpu.

Test Plan:
cpu_profiling_allocator_test on mobile.

Imported from OSS

Reviewed By: dreiss

Differential Revision: D23451019

fbshipit-source-id: 98bf1dbcfa8fcfb83d505ac01095e84a3f5b778d
2020-10-06 09:09:54 -07:00
b1373a74e0 Don't export enums for CUDA sources on Windows (#45829)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45829

Reviewed By: VitalyFedyunin

Differential Revision: D24113130

Pulled By: ezyang

fbshipit-source-id: 8356c837ed3a790efecf8dfcc8fb6ee6f45bd6e2
2020-10-06 08:04:36 -07:00
be137e45cd reorganizing tests so that test1 and test2 are balanced in timing (#45778)
Summary:
Used the --shard option to split up the Python tests run from `test/run_test.py` in the CI testing script.

Also revised a help message to be more accurate for --shard.

Test results:
BEFORE:
| EVENT | TIMING  |
|---|---|
| **TEST1** | |
| | |
| test_python_nn | 35m19s |
| test_cpp_extensions | 30s |
| **total** | **35m49s** |
| **TEST2** | |
| | |
| install_torchvision | 35s |
| test_python_all_except_nn_and_cpp_extensions | 255m37s |
| test_aten | SKIPPED |
| test_libtorch | 9m8s |
| test_custom_script_ops | SKIPPED |
| test_custom_backend | SKIPPED |
| test_torch_function_benchmark | 10s |
| **total** | **4hr24m** |

AFTER THIS SHARD:
| EVENT | TIMING  |
|---|---|
| **TEST1** | |
| | |
| test_autograd | 26m30s |
| test_foreach | 69m |
| test_nn | 35m38s |
| **total** | **3h1m** |
| **TEST2** | |
| | |
| test-quantization | 41m28s |
| test_spectral_ops | 17m37s |
| test_torch | 8m56s |
| test_jit_legacy | 16m21s |
| **total** | **2h18m** |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45778

Reviewed By: albanD

Differential Revision: D24137156

Pulled By: janeyx99

fbshipit-source-id: 5873fec47aedb9f699ebbda653a4d32a9950fc13
2020-10-06 07:57:08 -07:00
67889db8aa Replaced BLACKLIST with BLOCKLIST (#45781)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41714

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45781

Reviewed By: nairbv

Differential Revision: D24136821

Pulled By: albanD

fbshipit-source-id: 0c0223bda0c5b4da75167a27d7859562db396304
2020-10-06 07:49:00 -07:00
8bc0c755be adding option to move excluding to run_test.py instead of test.sh (#45868)
Summary:
Cleaning up test.sh a tiny bit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45868

Reviewed By: albanD

Differential Revision: D24122726

Pulled By: janeyx99

fbshipit-source-id: e8254accad15ad887a000ec1401c401389393c92
2020-10-06 07:13:27 -07:00
8a1e100466 Stricter backward compatibility check (#45773)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45773

Changes the function schema's backward compatibility check to be stricter to comply with C++ API backwards compatibility capabilities.
ghstack-source-id: 113537304

Test Plan:
Updated and added tests to test_function_schema.py

Browsed through several commits to native_functions.yaml and derivatives.yaml and I don't see instances where new arguments were not already being appended.

Reviewed By: dzhulgakov

Differential Revision: D24089751

fbshipit-source-id: a21f407cdc750906d3326e3ea27928b8aa732804
2020-10-06 01:28:48 -07:00
2fbe5971b3 [pytorch/cuda] apply 16-bit mask to the index for device guard registry (#45485)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45485

Essentially this is the problem reported by ezyang: https://fb.workplace.com/groups/llvm.gcc/permalink/4053565044692080. There are two proposed fixes:
* https://github.com/pytorch/pytorch/pull/44883: this doesn't work because it fails some static assert at runtime
```
caffe2/c10/core/TensorOptions.h:553:1: error: static_assert failed due to requirement 'sizeof(c10::TensorOptions) <= sizeof(long) * 2' "TensorOptions must fit in 128-bits"
static_assert( sizeof(TensorOptions) <= sizeof(int64_t) * 2,
^
```
* https://github.com/pytorch/pytorch/pull/44885: to be tested

This diff is a temp hack to work around the problem. W/o this patch:

```
  volatile size_t device_type = static_cast<size_t>(type);
  auto p = device_guard_impl_registry[device_type].load();
  C10_LOG_FIRST_N(WARNING, 10) << "XDW-fail: " << cntr << ", Device type: " << type << ", type cast: " << device_type  << ", guard: " << p;

// output
XDW-fail: 1129, Device type: cuda, type cast: 65537, guard: 0

```

Another workaround is D23788441, which changes -O3 to -O2. So this seems to be a miscompilation for nvcc or the host compiler.

Reviewed By: ezyang

Differential Revision: D23972356

fbshipit-source-id: ab91fbbfccb6389052de216f95cf9a8265445aea
2020-10-05 22:37:47 -07:00
d44eaf63d1 torch.fft helper functions (#44877)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44877

Part of gh-42175. This implements the `torch.fft` helper functions: `fftfreq`, `rfftfreq`, `fftshift` and `ifftshift`.

* #43009 Cleanup tracer handling of optional arguments
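
A minimal usage sketch of these helpers (assuming a PyTorch build that ships the `torch.fft` namespace):

```python
import torch
import torch.fft  # explicit import needed on some older builds

freqs = torch.fft.fftfreq(8)          # sample frequencies in standard FFT order
half = torch.fft.rfftfreq(8)          # non-negative frequencies, for rfft outputs
centered = torch.fft.fftshift(freqs)  # move the zero-frequency component to the center
assert torch.equal(torch.fft.ifftshift(centered), freqs)  # ifftshift inverts fftshift
```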

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D24043473

Pulled By: mruberry

fbshipit-source-id: 35de7b70b27658a426773f62d23722045ea53268
2020-10-05 22:04:52 -07:00
e4efc420ae Correct Categorical docstring (#45804)
Summary:
Clarified that the `Categorical` distribution will actually accept input of any arbitrary tensor shape, not just 1D and 2D tensors.
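
For example (a sketch; `Categorical` treats the trailing dimension as the category dimension and any leading dimensions as batch dimensions):

```python
import torch
from torch.distributions import Categorical

probs = torch.rand(2, 3, 5)   # unnormalized is fine; Categorical normalizes
d = Categorical(probs=probs)
print(d.sample().shape)       # torch.Size([2, 3]) -- one draw per batch entry
```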

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45804

Reviewed By: dzhulgakov

Differential Revision: D24125415

Pulled By: VitalyFedyunin

fbshipit-source-id: 5fa1f07911bd85e172199b28d79763428db3a0f4
2020-10-05 21:49:10 -07:00
7eb0a71484 update persons of interest (#45803)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45803

Reviewed By: dzhulgakov

Differential Revision: D24125375

Pulled By: VitalyFedyunin

fbshipit-source-id: a892603c6449a2c15e926d2b161468690d4ec2f4
2020-10-05 21:28:00 -07:00
bf85642c4c Remove lock from GraphTask::set_exception_without_signal. (#45867)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45867

In most cases the lock ordering was hold a lock in local autograd and
then hold a lock in DistAutogradContext.

In case of `set_exception_without_signal` the lock order was in reverse and as
a result we saw potential deadlock issues in our TSAN tests. To fix this, I
removed the lock and instead just used std::atomic exchange.

In addition to this, I fixed TestE2E to ensure that we use the appropriate
timeout.

TestE2EProcessGroup was flaky for these two reasons and now is fixed.
ghstack-source-id: 113592709

Test Plan: waitforbuildbot.

Reviewed By: albanD

Differential Revision: D24120962

fbshipit-source-id: 12447b84ceae772b91e9a183c90d1e6340f44e66
2020-10-05 20:02:29 -07:00
10d86d1196 [NCCL] create NCCL communicator for send/recv on demand (#44922)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44922

For NCCL send/recv operations, we will create NCCL communicator on demand following the same design as how it's currently done for collective operations.
ghstack-source-id: 113592757

Test Plan: to add

Reviewed By: pritamdamania87

Differential Revision: D23773726

fbshipit-source-id: 0d47c29d670ddc07f7181e8485af0e02e2c9cfaf
2020-10-05 18:33:03 -07:00
59083d6176 [NCCL] Support NCCL Send/Recv (#44921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44921

This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See https://github.com/pytorch/pytorch/issues/43995 for more context.
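
A hedged sketch of what this enables (assumes a two-rank job with one GPU per rank and `init_process_group("nccl", ...)` already called; `p2p_demo` is a hypothetical helper, not code from this PR):

```python
import torch
import torch.distributed as dist

def p2p_demo(rank: int):
    # assumes dist.init_process_group("nccl", rank=rank, world_size=2) ran already
    t = torch.full((4,), float(rank), device=f"cuda:{rank}")
    if rank == 0:
        dist.send(t, dst=1)   # backed by ncclSend after this change
    else:
        dist.recv(t, src=0)   # backed by ncclRecv
        print(t)              # tensor([0., 0., 0., 0.], device='cuda:1')
```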
ghstack-source-id: 113592785

Test Plan: unittest

Reviewed By: jiayisuse

Differential Revision: D23709848

fbshipit-source-id: cdf38050379ecbb10450f3394631317b41163258
2020-10-05 18:27:57 -07:00
b04ae953b4 [FX][WIP] Mutable Graph APIs (#45227)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45227

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23880730

Pulled By: jamesr66a

fbshipit-source-id: eb4e8c14d7f6b1deb1ddd6cf38a360413a1705ed
2020-10-05 17:07:08 -07:00
1558a3657b Add LazyNVRTC (#45674)
Summary:
Instead of dynamically loading `caffe2_nvrtc`, lazyNVRTC provides the same functionality by binding all the hooks to a lazy-bind implementation, very similar to shared-library jump tables:
On the first call, each function from the list tries to get a global handle to the respective shared library and replaces itself with the dynamically resolved symbol, using the following template:
```
  auto fn = reinterpret_cast<decltype(&NAME)>(getCUDALibrary().sym(C10_SYMBOLIZE(NAME)));
  if (!fn)
    throw std::runtime_error("Can't get " C10_SYMBOLIZE(NAME));
  lazyNVRTC.NAME = fn;
  return fn(...);
```
Fixes https://github.com/pytorch/pytorch/issues/31985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45674

Reviewed By: ezyang

Differential Revision: D24073946

Pulled By: malfet

fbshipit-source-id: 1479a75e5200e14df003144625a859d312885874
2020-10-05 16:27:40 -07:00
54aaffb7c7 Avoid NaN values in torch.cdist backward for p<1 (#45720)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/36493

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45720

Reviewed By: VitalyFedyunin

Differential Revision: D24112541

Pulled By: albanD

fbshipit-source-id: 8598a9e7cc0f6f9ea46c007f2e3365970aea0116
2020-10-05 16:19:29 -07:00
4ab73c1f74 [docs] Fix EmbeddingBag docs (#45763)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45763

**Summary**
This commit updates the documentation for `EmbeddingBag` to say that for
bags of constant length with no per-sample weights, the class is
equivalent to `Embedding` followed by `torch.sum(dim=1)`. The current
docs say `dim=0` and this is readily falsifiable.

**Test Plan**
1) Tried `Embedding` + `sum` with `dim`=0,1 in interpreter and compared
to `EmbeddingBag`
```
>>> import torch
>>> weights = torch.nn.Parameter(torch.randn(10, 3))
>>> e = torch.nn.Embedding(10, 3)
>>> eb = torch.nn.EmbeddingBag(10, 3, mode="sum")
>>> e.weight = weights
>>> eb.weight = weights
# Use 2D inputs because we are trying to test the case in which bags have constant length
>>> inputs = torch.LongTensor([[4,1,2,7],[5,6,0,3]])
>>> eb(inputs)
tensor([[-2.5497, -0.1556, -0.5166],
        [ 2.2528, -0.3627,  2.5822]], grad_fn=<EmbeddingBagBackward>)
>>> torch.sum(e(inputs), dim=0)
tensor([[ 1.6181, -0.8739,  0.8168],
        [ 0.0295,  2.3274,  1.2558],
        [-0.7958, -0.4228,  0.5961],
        [-1.1487, -1.5490, -0.6031]], grad_fn=<SumBackward1>)
>>> torch.sum(e(inputs), dim=1)
tensor([[-2.5497, -0.1556, -0.5166],
        [ 2.2528, -0.3627,  2.5822]], grad_fn=<SumBackward1>)
```
So clearly `torch.sum` with `dim=0` is not correct here.

2) Built docs and viewed in browser.

*Before*
<img width="882" alt="Captura de Pantalla 2020-10-02 a la(s) 12 26 20 p  m" src="https://user-images.githubusercontent.com/4392003/94963035-557be100-04ac-11eb-986c-088965ac3050.png">

*After*
<img width="901" alt="Captura de Pantalla 2020-10-05 a la(s) 11 26 51 a  m" src="https://user-images.githubusercontent.com/4392003/95117732-ea294d80-06fd-11eb-9d6b-9b4e6c805cd0.png">

**Fixes**
This commit closes #43197.

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24118206

Pulled By: SplitInfinity

fbshipit-source-id: cd0d6b5db33e415d8e04ba04f2c7074dcecf3eee
2020-10-05 15:56:35 -07:00
78f055272c [docs] Add 3D reduction example to tensordot docs (#45697)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45697

**Summary**
This commit adds an example of a reduction over three dimensions with
`torch.tensordot`. It is unclear from existing docs whether `dims`
should be a list of pairs or a pair of lists.
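
For reference, a contraction over three dimensions along these lines (a sketch; the exact example added to the docs may differ):

```python
import torch

a = torch.randn(2, 3, 4, 5)
b = torch.randn(3, 4, 5, 6)
# dims is a pair of lists, not a list of pairs: contract dims 1,2,3 of `a`
# against dims 0,1,2 of `b`.
c = torch.tensordot(a, b, dims=([1, 2, 3], [0, 1, 2]))
print(c.shape)  # torch.Size([2, 6])
```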

**Test Plan**
Built the docs locally.

*Before*
<img width="864" alt="Captura de Pantalla 2020-10-01 a la(s) 1 35 46 p  m" src="https://user-images.githubusercontent.com/4392003/94866838-f0b17f80-03f4-11eb-8692-8f50fe3b9863.png">

*After*
<img width="831" alt="Captura de Pantalla 2020-10-05 a la(s) 12 06 28 p  m" src="https://user-images.githubusercontent.com/4392003/95121092-670af600-0703-11eb-959f-73c7797a76ee.png">

**Fixes**
This commit closes #22748.

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24118186

Pulled By: SplitInfinity

fbshipit-source-id: c19b0b7e001f8cd099dc4c2e0e8ec39310510b46
2020-10-05 15:36:59 -07:00
26a9012f84 [fx] import used modules for code gen (#45471)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45471

Instead of assuming that 'torch' is the only module used by generated code,
use the qualified names of builtin functions to generate import statements
for all builtins. This allows code to be generated correctly for user-captured functions as well.
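
An illustrative sketch of the idea (not the actual FX codegen; `imports_for` is a hypothetical helper): derive import statements from the qualified names of the functions a graph calls.

```python
def imports_for(fns):
    # collect the top-level module of each callable's qualified name
    modules = sorted({fn.__module__.split(".")[0] for fn in fns if fn.__module__})
    return "\n".join(f"import {m}" for m in modules)

import json
import math

print(imports_for([math.sqrt, json.dumps]))
# import json
# import math
```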

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23978696

Pulled By: zdevito

fbshipit-source-id: ecbff150e3de38532531cdadbfe4965468f29a38
2020-10-05 15:21:44 -07:00
5177f8de2b Revert D23398534: [pytorch][PR] [ONNX] Improve error handling for adaptive_pool
Test Plan: revert-hammer

Differential Revision:
D23398534 (45ddeb5ce6)

Original commit changeset: f2d60d40340f

fbshipit-source-id: acc9d6c3d031662c37447fcee027b0c97b8492a7
2020-10-05 15:16:59 -07:00
f18cc9c57d Change type inferred from empty annotation (#45360)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45360

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24078645

Pulled By: ansley

fbshipit-source-id: 5d37d07df75bd7a2111d44638befe53c1021ee82
2020-10-05 15:16:56 -07:00
a9a9d0b181 Rocm skip test cases (#45782)
Summary:
Skip the following test cases for ROCm (when PYTORCH_TEST_WITH_ROCM=1):
- test_reference_numerics_tan_cuda_float64 (__main__.TestUnaryUfuncsCUDA)
- test_addmv_cuda_float16 (__main__.TestTorchDeviceTypeCUDA)
- test_logspace_cuda_float64 (__main__.TestTensorCreationCUDA)
- test_gloo_backend_2gpu_module (__main__.DistributedDataParallelTest)
jeffdaily
pruthvistony

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45782

Reviewed By: VitalyFedyunin

Differential Revision: D24115581

Pulled By: xw285cornell

fbshipit-source-id: 4043a9fa19e242301b5007813c15b6b3873889c5
2020-10-05 15:12:25 -07:00
519c086418 Revert D24042344: [C2] Add string equality operator
Test Plan: revert-hammer

Differential Revision:
D24042344 (cf48872d28)

Original commit changeset: c8997c6130e3

fbshipit-source-id: 3d8aec1104a2a59c67ab4b7e77caeaf9fc94ae1d
2020-10-05 15:09:03 -07:00
9a668f94bb [jit] allow slicing multiple dimensions with indicies (#45239)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45239

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23886919

Pulled By: Lilyjjo

fbshipit-source-id: d45c2a550fa8df9960cf2ab5da9d1ae0058a967a
2020-10-05 15:03:54 -07:00
f11f9a8c1f [pytorch][improvement] Improve torch logging to identify problematic key (#45766)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45766

As per the subject, making the KeyError message more verbose.

Test Plan:
Verified that breakage can be successfully investigated with the verbose error message.
Unit tests.

Reviewed By: esqu1

Differential Revision: D24080362

fbshipit-source-id: f4e22a78809e5cff65a69780d5cbbc1e8b11b2e5
2020-10-05 14:54:52 -07:00
9f4abcad9d Automated submodule update: FBGEMM (#45713)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: fe9164007c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45713

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: VitalyFedyunin

Differential Revision: D24069807

fbshipit-source-id: 4670725be42368bdf6e29a3746c89514c5f4ee1b
2020-10-05 14:47:54 -07:00
a83696ad53 quant docs: add API summary section (#45848)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45848

This is a resubmit of the following stack:
* start: https://github.com/pytorch/pytorch/pull/45093
* end: https://github.com/pytorch/pytorch/pull/45306

The original stack was reverted due to build failure,
resubmitting.

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D24117781

Pulled By: vkuzo

fbshipit-source-id: fb767fff2b044cfbba695ca3949221904fc8931f
2020-10-05 14:42:40 -07:00
c80ec91b00 [iOS] Bump up the cocoapods version (#45862)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45862

Bump up the cocoapods version
ghstack-source-id: 113585513

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: xta0

Differential Revision: D24119158

fbshipit-source-id: e689b69628dcf802084e67c5ea627220cafcc575
2020-10-05 14:37:26 -07:00
21fa877026 [quant][test] Remove numeric equivalence test for debug and non-debug option (#45852)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45852

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D24115329

fbshipit-source-id: ad32e68cbd54431fd440c8437a4361905a5dbdad
2020-10-05 14:11:07 -07:00
14e6e50700 Refactor computeLRWorkDim (#45812)
Summary:
Move duplicated code for computing the LRWork array dimension from the CPU/CUDA implementations of apply_svd into LinearAlgebraUtils

Reduce common multiplication factor from 7 to 5, which according to the documentation should be sufficient for LAPACK-3.6+
From 122506cd8b/SRC/cgesdd.f (L186)
```
RWORK is REAL array, dimension (MAX(1,LRWORK))
Let mx = max(M,N) and mn = min(M,N).
If JOBZ = 'N',    LRWORK >= 5*mn (LAPACK <= 3.6 needs 7*mn);
else if mx >> mn, LRWORK >= 5*mn*mn + 5*mn;
else              LRWORK >= max( 5*mn*mn + 5*mn,
                                 2*mx*mn + 2*mn*mn + mn ).
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45812

Reviewed By: walterddr

Differential Revision: D24100836

Pulled By: malfet

fbshipit-source-id: 0ca86aed25077c91cf60086ed301298381d5f628
2020-10-05 13:56:02 -07:00
ffbffc0436 fixed formatting in function rstrings in torch.autograd.functional (#45849)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44426

The changes look like:
![Screen Shot 2020-10-05 at 12 34 32 PM](https://user-images.githubusercontent.com/31798555/95107954-9839f500-0708-11eb-88b0-444486f53061.png)
(compare with https://pytorch.org/docs/stable/autograd.html#torch.autograd.functional.jacobian)

and also
![Screen Shot 2020-10-05 at 12 35 15 PM](https://user-images.githubusercontent.com/31798555/95107966-9bcd7c00-0708-11eb-979a-b3578b8203da.png)
(compare with https://pytorch.org/docs/stable/autograd.html#torch.autograd.functional.hessian)

and lastly
![Screen Shot 2020-10-05 at 12 38 19 PM](https://user-images.githubusercontent.com/31798555/95107971-9e2fd600-0708-11eb-9919-5b809f5f0f20.png)
(compare with https://pytorch.org/docs/stable/autograd.html#torch.autograd.functional.hvp)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45849

Reviewed By: albanD

Differential Revision: D24114223

Pulled By: janeyx99

fbshipit-source-id: bfea5f0d594933db4b2c400291d330f747f518e8
2020-10-05 13:39:01 -07:00
615013edcb setup: Dataclasses only when < 3.7 (#45844)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45844

Someone pointed out that dataclasses were actually added to the python
stdlib in 3.7 and not 3.8, so bumping down the dependency on dataclasses
from 3.8 -> 3.7 makes sense here

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr, malfet

Differential Revision: D24113367

Pulled By: seemethere

fbshipit-source-id: 03d2d93f7d966d48a30a8e2545fd07dfe63b4fb3
2020-10-05 13:29:21 -07:00
b5a2f04089 Disallow creation of ProcessGroupNCCL without GPUs. (#45642)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45642

Prior to https://github.com/pytorch/pytorch/pull/45181, initializing a
NCCL process group would work even if no GPUs were present. However, now that
init_process_group calls `barrier()`, this fails.

In general the problem was that we could initialize ProcessGroupNCCL without
GPUs, and then if we called a method like `barrier()` the process would crash,
since we compute % numGPUs, resulting in division by zero.
ghstack-source-id: 113490343

Test Plan: waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D24038839

fbshipit-source-id: a1f1db52cabcfb83e06c1a11ae9744afbf03f8dc
2020-10-05 12:05:48 -07:00
45ddeb5ce6 [ONNX] Improve error handling for adaptive_pool (#43032)
Summary:
This would also improve error handling for interpolate with 'area' mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43032

Reviewed By: malfet

Differential Revision: D23398534

Pulled By: bzinodev

fbshipit-source-id: f2d60d40340f46e7c0499ea73c1e39945713418d
2020-10-05 11:53:14 -07:00
adc21c6db2 Rename jobs and cli switches for testing GraphExecutor configurations to something a little bit more sensical. (#45715)
Summary:
Rename jobs for testing GraphExecutor configurations to something a little bit more sensical.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45715

Reviewed By: ezyang, anjali411

Differential Revision: D24114344

Pulled By: Krovatkin

fbshipit-source-id: 89e5f54aaebd88f8c5878e060e983c6f1f41b9bb
2020-10-05 11:43:28 -07:00
cf48872d28 [C2] Add string equality operator
Summary: This diff adds a string equality checking operator.

Test Plan: Unit tests

Differential Revision: D24042344

fbshipit-source-id: c8997c6130e3438f2ae95dae69f76978e2e95527
2020-10-05 10:47:53 -07:00
162717e527 grammatically update index.rst (#45801)
Summary:
This is a follow-up PR for https://github.com/pytorch/pytorch/issues/45652, which had a rebase problem.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45801

Reviewed By: VitalyFedyunin

Differential Revision: D24111776

Pulled By: glaringlee

fbshipit-source-id: 2c727a17426be91a4df78a195de79197e1c5d120
2020-10-05 09:55:56 -07:00
3ab88c3903 Enable TorchBind tests on ROCm (#45426)
Summary:
The torchbind tests didn't work because somehow we missed the rename of caffe2_gpu to torch_... (hip for us) in https://github.com/pytorch/pytorch/issues/20774 (merged 2019-06-13, oops) and still tried to link against it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45426

Reviewed By: VitalyFedyunin

Differential Revision: D24112439

Pulled By: walterddr

fbshipit-source-id: a66a574e63714728183399c543d2dafbd6c028f7
2020-10-05 09:38:12 -07:00
e829d4fba9 [op-bench] fix jit mode (#45774)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45774

Fix RuntimeError: No such operator operator_benchmark::_consume

Test Plan: waitforsandcastle

Reviewed By: ngimel

Differential Revision: D24064982

fbshipit-source-id: 13160b6d18569e659ca1ab0ca1d444ed9947260c
2020-10-05 09:29:41 -07:00
f65ab89edd [numpy] Add torch.nan_to_num (#44592)
Summary:
Reference https://github.com/pytorch/pytorch/issues/42515

TODO:
* [x] Add tests
* [x] Add docs
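
A short usage sketch (by default NaN maps to zero and ±inf map to the largest/most negative finite value of the dtype; all three replacements can be overridden):

```python
import torch

x = torch.tensor([float('nan'), float('inf'), -float('inf'), 3.14])
print(torch.nan_to_num(x))  # NaN -> 0, inf -> dtype max, -inf -> dtype min
print(torch.nan_to_num(x, nan=0.0, posinf=1.0, neginf=-1.0))
# tensor([ 0.0000,  1.0000, -1.0000,  3.1400])
```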

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44592

Reviewed By: colesbury

Differential Revision: D24079472

Pulled By: mruberry

fbshipit-source-id: 2b67d36cba46eaa7ca16cd72671b57750bd568bc
2020-10-05 01:38:56 -07:00
e1ff46b6e5 CUDA BFloat16 TopK (#44755)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44755

Reviewed By: mruberry

Differential Revision: D23741680

Pulled By: ngimel

fbshipit-source-id: 8fce92a26663336bcb831c72202fe2623a2ddaf0
2020-10-04 11:38:00 -07:00
2ab74a4839 [FX] Make Tracer.trace() just return a Graph (#45704)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45704

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D24067982

Pulled By: jamesr66a

fbshipit-source-id: c82aa6be504d45e110055a3c4db129d0b9ac3ef5
2020-10-03 21:13:48 -07:00
8a6b919163 [StaticRuntime] Fix broken tests (#45813)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45813

Fix tests broken by D23996656 (2b48dd168d).

Test Plan:
```
buck test mode/opt //pytorch/tensorboardX:test_pytorchtb -- 'test_pytorch_graph \(pytorch\.tensorboardX\.tests\.test_pytorch_graph\.PytorchGraphTest\)'
buck test mode/opt //pytext/tests:
buck test mode/dev-nosan //mobile-vision/projects/detectron2go/tests:test_caffe2_compatibles
```

Reviewed By: yinghai

Differential Revision: D24100807

fbshipit-source-id: e2f92aadca4161f5cf9f552e922fb4d6500af3a4
2020-10-03 16:54:22 -07:00
24fa2daea6 Revert D24100389: Revert D24072697: [te] Get llvm codegen to compile with llvm9 and llvm-fb
Test Plan: revert-hammer

Differential Revision:
D24100389

Original commit changeset: b32c5163e4fb

fbshipit-source-id: 9ce7bfbcf411c0584e5d535ee107fb5a135ee6e6
2020-10-03 15:33:42 -07:00
ff568a0e6b Revert D24072697: [te] Get llvm codegen to compile with llvm9 and llvm-fb
Test Plan: revert-hammer

Differential Revision:
D24072697 (e3d2defdc8)

Original commit changeset: 7f56b9f3cbe5

fbshipit-source-id: b32c5163e4fb6df99447f95fdb82674e5ae62f22
2020-10-03 12:27:26 -07:00
3a27fc966a Test torch.svd using complex float and double numbers (take 2) (#45795)
Summary:
Adds support for magmaSvd for complex numbers

Fixes use-after-free error in `apply_symeig`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45795

Reviewed By: ezyang

Differential Revision: D24096955

Pulled By: malfet

fbshipit-source-id: 0d8d8492f89fe722bbd5aed3528f244245b496d0
2020-10-03 11:33:28 -07:00
d8a9c2c27e [iOS][CI] Fix the timeout for nightlies (#45798)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45798

Test Plan: Imported from OSS

Reviewed By: husthyc

Differential Revision: D24098451

Pulled By: xta0

fbshipit-source-id: 269517e0d54b0a07ea2ae5e2aee7f0ebc7985191
2020-10-02 23:13:30 -07:00
2b48dd168d [StaticRuntime] Integrate Static Runtime into PyTorchPredictor (#45640)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45640

Reviewed By: dzhulgakov

Differential Revision: D23996656

fbshipit-source-id: 63d88c89d1df61a04deadc472319607ed83867e5
2020-10-02 23:03:05 -07:00
546aab66c1 Revert D24027761: Update backward definition for more operators and reenable tests in test_ops.py
Test Plan: revert-hammer

Differential Revision:
D24027761 (7d809f5d8e)

Original commit changeset: c1f707c2a039

fbshipit-source-id: 30750d2f08886036fb8b2cd0ae51c7732d3b7b19
2020-10-02 18:52:57 -07:00
31621c828d Fix JIT tests when run locally in fbcode (#45776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45776

Splitting out backend and custom class registration into their own library is
not currently implemented in fbcode, so detect that we are running tests in
fbcode and disable those tests.

Test Plan: buck test mode/no-gpu mode/dev caffe2/test:jit

Reviewed By: smessmer

Differential Revision: D24085871

fbshipit-source-id: 1fcc0547880bc4be59428e2810b6a7f6e50ef798
2020-10-02 17:43:01 -07:00
53aea60bce [FX] Make output a non-special Node (#45599)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45599

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D24027586

Pulled By: jamesr66a

fbshipit-source-id: 747c25e3c7668ca45f03bed0be71fd3c9af67286
2020-10-02 17:08:17 -07:00
2fa062002e CUDA BFloat16 infrastructure (#44925)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44925

Reviewed By: agolynski

Differential Revision: D23783910

Pulled By: ngimel

fbshipit-source-id: dacac2ad87d58056bdc68bfe0b7ab1de5c2af0d8
2020-10-02 16:21:30 -07:00
8cb7280242 Revert "Remove device maps from TensorPipe for v1.7 release (#45353)" (#45762)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45762

This reverts commit 5211fb97ac4c246151f1286c78d63e0e317a8a4a.

Test Plan: Imported from OSS

Reviewed By: colesbury

Differential Revision: D24088231

Pulled By: mrshenli

fbshipit-source-id: b6ee15ec5ae137ea127bdc2db8e1842764bc01d4
2020-10-02 15:14:05 -07:00
d150d3e276 Make sure each warnings.warn only executes once inside TorchScript. (#45382)
Summary:
* Add a pass at end of runCleanupPasses to annotate `aten::warn` so that each has its unique id
* Enhanced interpreter so that it tracks which `aten::warn` has been executed before and skip them
* Improved insertInstruction so that it correctly checks for overflow

Fixes https://github.com/pytorch/pytorch/issues/45108
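
A minimal sketch of the behavior this enables (assuming `warnings.warn` inside scripted code lowers to `aten::warn`, as described above):

```python
import warnings
import torch

@torch.jit.script
def f(x: torch.Tensor) -> torch.Tensor:
    for i in range(3):
        warnings.warn("emitted once per warn site, not once per iteration")
        x = x + 1
    return x

f(torch.zeros(1))  # the warning above prints a single time
```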

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45382

Reviewed By: mrshenli

Differential Revision: D24060677

Pulled By: gmagogsfm

fbshipit-source-id: 9221bc55b9ce36b374bdf614da3fe47496b481c1
2020-10-02 14:55:10 -07:00
73e9daa35f [caffe2] Optimize Dedup version of RowWiseSparseAdagrad fused op by WarpReduce (#45649)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44275

* This diff applies the WarpReduce optimization to the dedup version of the RowWiseSparseAdagrad fused op, yielding a ~1.33x performance improvement.

* Port the approach from D23948802 to find num_dup
* Fix a likely fp16 bug in the dedup kernel

Reviewed By: jianyuh

Differential Revision: D23561994

fbshipit-source-id: 1a633fcdc924593063a67f9ce0d36eadb19a7efb
2020-10-02 14:28:24 -07:00
c31066ac9d Torch Integration Test Formatting Changes (#45740)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45740

Reviewed By: esqu1

Differential Revision: D23869021

fbshipit-source-id: 5910d44f9475bd7a53dc0478b69b39572dc8666f
2020-10-02 14:02:31 -07:00
7d809f5d8e Update backward definition for more operators and reenable tests in test_ops.py (#44444)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44444

This PR:
1. Fixes https://github.com/pytorch/pytorch/issues/41510. Updates backward formula for the following functions: `asin`, `acos`, `asinh`, `acosh`, `atan`, `atanh`, `div`, `log`, `log10`, `log2`, `log1p`, `pow`, `reciprocal`, `angle`.
2. Re-enables the tests in `test_ops.py`.
3. Adds dispatch for complex dtypes for `tanh_backward`.
4. Re-enables commented tests in `common_methods_invocation.py`.

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24027761

Pulled By: anjali411

fbshipit-source-id: c1f707c2a039149a6e04bbde53ee120d9119d99a
2020-10-02 13:37:10 -07:00
e3d2defdc8 [te] Get llvm codegen to compile with llvm9 and llvm-fb (#45726)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45726

FB has an old internal platform that uses some random llvm version
that looks sort of like llvm 7.  I've guarded that with the appropriate
LLVM_VERSION_PATCH.

I've also swapped out some of our uses of ThreadSafeModule/ThreadSafeContext
for the variants without ThreadSafe in the name.  As far as I can tell we
weren't using the bundled locks anyways, but I'm like 85% sure this is OK since
we compile under the Torch JIT lock anyways.

Test Plan: unit tests

Reviewed By: ZolotukhinM, asuhan

Differential Revision: D24072697

fbshipit-source-id: 7f56b9f3cbe5e6d54416acdf73876338df69ddb2
2020-10-02 13:33:13 -07:00
5a47a2126d Revert D24018160: [pytorch][PR] Test torch.svd using complex float and double numbers
Test Plan: revert-hammer

Differential Revision:
D24018160 (888f3c12e7)

Original commit changeset: 1b6103f5af94

fbshipit-source-id: 3040250db25995fc0d41fd0f497550dded43cad9
2020-10-02 13:33:11 -07:00
f8c1ca5dd8 Enable NamedTuple data type to work with DDP (#44220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44220

Closes https://github.com/pytorch/pytorch/issues/44009
Currently if a dataloader returns objects created with a
collections.namedtuple, this will incorrectly be cast to a tuple. As a result, if we have data of these types, there can be runtime errors during the forward pass if the module is expecting a named tuple.

Fix this in
`scatter_gather.py` to resolve the issue reported in
https://github.com/pytorch/pytorch/issues/44009
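
The gist of the fix, as a standalone sketch (hypothetical `map_structure` helper, not the actual `scatter_gather.py` code): rebuild namedtuples with their own type instead of letting them decay to plain tuples.

```python
from collections import namedtuple

def is_namedtuple(obj):
    return isinstance(obj, tuple) and hasattr(obj, "_fields")

def map_structure(fn, obj):
    if is_namedtuple(obj):
        return type(obj)(*(fn(x) for x in obj))  # keep the namedtuple subclass
    if isinstance(obj, (list, tuple)):
        return type(obj)(fn(x) for x in obj)     # plain containers stay plain
    return fn(obj)

Batch = namedtuple("Batch", ["data", "label"])
out = map_structure(lambda t: t, Batch(1, 2))
assert isinstance(out, Batch) and out._fields == ("data", "label")
```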
ghstack-source-id: 113423287

Test Plan: CI

Reviewed By: colesbury

Differential Revision: D23536752

fbshipit-source-id: 3838e60162f29ebe424e83e474c4350ae838180b
2020-10-02 13:33:08 -07:00
8619de84f2 Fix cuDNN error message when it's Conv2d (#45729)
Summary:
Originally introduced in https://github.com/pytorch/pytorch/issues/45023. When I was testing the original PR, the case was a Conv3d, so this problem was not discovered.

Arrays in `ConvolutionParams` have a fixed length of 3 or 5. This is because `max_dim` is set as a constexpr of 3, regardless of Conv2d or Conv3d. The current code makes some error messages read oddly; see the comments below.

9201c37d02/aten/src/ATen/native/cudnn/Conv.cpp (L212-L226)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45729

Reviewed By: mruberry

Differential Revision: D24081542

Pulled By: ngimel

fbshipit-source-id: 141f8946f4d0db63a723131775731272abeaa6ab
2020-10-02 13:33:06 -07:00
322855e380 type check for torch.quantization.observer (#45630)
Summary:
add type checker for observer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45630

Reviewed By: malfet

Differential Revision: D24058304

Pulled By: walterddr

fbshipit-source-id: ac1c0f5ff0d34b0445bd1364653fc5c9d7571b05
2020-10-02 13:25:41 -07:00
db8b076272 Change signature for torch.poisson (#45656)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45656

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24078609

Pulled By: ansleyadelaide

fbshipit-source-id: 97a95b08334ed0d710e032a267b940c2fc9f7f40
2020-10-02 13:14:12 -07:00
7726754e70 Add function signature for pixel_shuffle (#45661)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45661

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24078627

Pulled By: ansleyadelaide

fbshipit-source-id: 44917ff5932e4d0adcc18ce24ecfc0b5686818e3
2020-10-02 11:46:35 -07:00
6acd7b686c adding sharding option to run_test.py (#45583)
Summary:
Added a sharding option to run_test.py to enable users to run a subset of the many tests. The new `--shard` argument takes two integer values, `x` and `y`, where the larger value denotes the number of shards and the smaller denotes which shard to run.
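
A hedged sketch of invoking the new option (argument order as described above; the path assumes the repository root):

```python
# Run the first of two shards of the Python test suite.
import subprocess

subprocess.run(["python", "test/run_test.py", "--shard", "1", "2"], check=True)
```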

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45583

Reviewed By: malfet

Differential Revision: D24083469

Pulled By: janeyx99

fbshipit-source-id: 1777bd7822c95b3bf37079deff9381c6f8eaf4cc
2020-10-02 11:21:51 -07:00
3799ba83e5 [Docs] Adding Store API Docs (#45543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45543

This PR adds documentation for the c10d Store to the public docs. Previously these docs were missing although we exposed a lightly-used (but potentially useful) Python API for our distributed key-value store.
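
A brief sketch of the key-value Store API the new docs cover (a single-process demo with `world_size=1`; the exact constructor arguments shown are assumptions about the Python binding):

```python
import torch.distributed as dist
from datetime import timedelta

# host, port, world_size, is_master, timeout
store = dist.TCPStore("127.0.0.1", 29500, 1, True, timedelta(seconds=30))
store.set("first_key", "first_value")
print(store.get("first_key"))  # b'first_value'
```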
ghstack-source-id: 113409195

Test Plan: Will verify screenshots by building the docs.

Reviewed By: pritamdamania87

Differential Revision: D24005598

fbshipit-source-id: 45c3600e7c3f220710e99a0483a9ce921d75d044
2020-10-02 11:16:56 -07:00
a052597e6c Bump nightlies to 1.8.0 (#45696)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45696

Similar to https://github.com/pytorch/pytorch/pull/40519

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D24064381

Pulled By: seemethere

fbshipit-source-id: 1484b9c4fc5fa8cfa7be591a0a5d4b6e05968589
2020-10-02 11:10:34 -07:00
6e43f0db8b Use correct signatures for METH_NOARGS. (#45528)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45528

As described in https://github.com/pytorch/pytorch/issues/45419,
resolving a bunch of cpython signature issues.

#Closes: https://github.com/pytorch/pytorch/issues/45419
ghstack-source-id: 113385726

Test Plan: sentinel

Reviewed By: albanD

Differential Revision: D24000626

fbshipit-source-id: d334596f1f0256063691aa044c8fb2face260817
2020-10-02 10:43:58 -07:00
cdf93b03de Add string versions of argument funcs in jit Node (#45464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45464

Usage of Symbols to find arguments requires one to generate a nonsense symbol for inputs which don't already have one. The intent of Symbols appears to be something like an interned string, but the namespace component doesn't apply to an argument. In order to access the arguments by name without adding new symbols, versions of those functions taking std::string input were added. These can be proven valid based on the existing codepath. Additionally, a hasNamedInput convenience function was added to remove the need for a try/catch block in user code.

The primary motivation is to be able to easily handle the variable number of arguments in glow, so that the arange op may be implemented.

Reviewed By: eellison

Differential Revision: D23972315

fbshipit-source-id: 3e0b41910cf07e916186f1506281fb221725a91b
2020-10-02 10:26:29 -07:00
b234acd414 Exposes SparseToDenseMask Caffe2 Operator (#45670)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45670

Reviewed By: esqu1

Differential Revision: D23868280

fbshipit-source-id: d6afa129c073fe611cb43a170025bc3c880a4bec
2020-10-02 10:05:13 -07:00
ad31068fe9 Add a distributed package reviewer (#45744)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45744

Tag me as reviewer

Test Plan: na

Reviewed By: jiayisuse

Differential Revision: D23881569

fbshipit-source-id: 8452fa60fe3d017ae1f0da26c0ce476f2b9c170c
2020-10-02 09:56:28 -07:00
24187a0b42 Enable type check for torch.quantization.fake_quantize (#45701)
Summary:
Addresses part of https://github.com/pytorch/pytorch/issues/42969.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45701

Reviewed By: walterddr

Differential Revision: D24066672

Pulled By: samestep

fbshipit-source-id: 53bb5e7b4703738d3de86fa89fb0980f1d6251f3
2020-10-02 09:27:34 -07:00
888f3c12e7 Test torch.svd using complex float and double numbers (#45572)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45572

Reviewed By: anjali411

Differential Revision: D24018160

Pulled By: malfet

fbshipit-source-id: 1b6103f5af94e9f74b73ed23aa02c0236b199b34
2020-10-02 08:29:14 -07:00
4d08930ccb remove beta defaulting in smooth_l1_loss_backward. added to the bc whitelist (#45588)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45588

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24024312

Pulled By: bdhirsh

fbshipit-source-id: 7246e5da741fbc5641deecaf057ae9a6e44e8c34
2020-10-02 07:53:04 -07:00
869b2ca048 some documentation and style fixes to smooth_l1_loss (#45587)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45587

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24024313

Pulled By: bdhirsh

fbshipit-source-id: c50efb2934d7b9d3b090e92678319cde42c0df45
2020-10-02 07:47:31 -07:00
c703602e17 make broadcasting explanation clearer in matmul doc: #22763 (#45699)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45699

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24065584

Pulled By: bdhirsh

fbshipit-source-id: 5e2cdd00ed18ad47d24d11751cfa5bee63853cc9
2020-10-02 06:51:42 -07:00
82cc86b64c VariableKernel calls into scattered C++ api (#44158)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44158

Previously, the C++ API only supported calling ops with a gathered TensorOptions object. So even if the VariableKernel took scattered arguments,
it had to re-gather them to call into the C++ API. But a diff stacked below this one introduced a scattered API for the C++ frontend.

This reaps the benefits and makes sure that if the Variable kernel gets scattered arguments (i.e. it's a c10-full op), then it passes those on without regathering.
ghstack-source-id: 113355690

Test Plan:
vs master: https://www.internalfb.com/intern/fblearner/details/216342597/

vs prev diff: https://www.internalfb.com/intern/fblearner/details/216342688/

Reviewed By: ezyang

Differential Revision: D23512538

fbshipit-source-id: 8ee6c1cc99443a2141db85072fd6dbc52b4d77fd
2020-10-02 04:13:39 -07:00
6e2eee2b9d Add faithful C++ API (#44087)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44087

Each op taking a TensorOptions argument now has an additional overload in the C++ frontend where it takes scattered ScalarType, Layout, Device, bool instead of one TensorOptions argument.

If it is a c10-full op, then the scattered version calls into the dispatcher and the gathered version is a proxy calling into the scattered version.
If it is a non-c10-full op, then the gathered version calls into the dispatcher and the scattered version is a proxy calling into the gathered version.

This should minimize the amount of gathering and scattering needed.

This PR is also a prerequisite to remove the re-gathering of arguments that is currently happening in VariableKernel. Currently, VariableKernels gather arguments into a TensorOptions object
to call into the C++ API. In a PR stacked on top of this, VariableKernel will just directly call into the scattered C++ API introduced here and avoid the gathering step.
ghstack-source-id: 113355689

Test Plan:
waitforsandcastle

vs master: https://www.internalfb.com/intern/fblearner/details/216169815/

vs previous diff: https://www.internalfb.com/intern/fblearner/details/216169957/

Reviewed By: ezyang

Differential Revision: D23492188

fbshipit-source-id: 3e84c467545ad9371e98e09075a311bd18411c5a
2020-10-02 04:08:53 -07:00
9201c37d02 Use addmm directly for 1x1 convolution (#45557)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45274
Based on https://github.com/pytorch/pytorch/issues/44041, sets intermediate for backward computation (otherwise, backward tests are failing).
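
The equivalence this exploits, as a small sketch (an illustration only, not the kernel code): a 1x1 convolution is a matrix multiply over channels.

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 5, 5)   # N, C_in, H, W
w = torch.randn(4, 3, 1, 1)   # C_out, C_in, 1, 1
ref = F.conv2d(x, w)
# Same computation viewed as a matmul over the channel dimension:
mm = (w.view(4, 3) @ x.flatten(2)).view(2, 4, 5, 5)
print(torch.allclose(ref, mm, atol=1e-5))  # True
```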

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45557

Reviewed By: izdeby

Differential Revision: D24030655

Pulled By: ngimel

fbshipit-source-id: 368fe9440668dffc004879f8b1d2dd3787d915c9
2020-10-02 00:26:53 -07:00
1a2d3b6a75 [quant] PerChannelFloatQParams support for quint4x2 dtype (#45594)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45594

Adds support for Per-channel quantization using float qparams for 4-bit dtype
We use the new dispatch mechanism and use existing quantize/dequantize kernels to pack the
4-bit data depending on the bit_width.
Size of 4-bit quantized tensor is half that of 8-bit quantized tensor.

Test Plan:
python test/test_quantization.py TestQuantizedTensor.test_quantize_per_channel_sub_byte

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D24025595

fbshipit-source-id: dd9d0557de585dd4aaf5f138959c3523a29fb759
2020-10-01 23:59:53 -07:00
04526a49d3 [quant] creating quint4x2 dtype for quantized tensors (#44678)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44678

This is a prototype PR that introduces 4 bit qtensors. The new dtype added for this is c10::quint4x2
The underlying storage for this is still uint8_t, so we pack 2 4-bit values in a byte while quantizing it.

This change uses most of the existing scaffolding for qtensor storage. We allocate storage
based on the dtype before creating a new qtensor.

It also adds a dispatch mechanism for this dtype so we can use this to get the bitwidth, qmin and qmax info
while quantizing and packing the qtensor (when we add 2-bit qtensor)

Kernels that use this dtype should be aware of the packing format.

Test Plan:
Locally tested
```
x = torch.ones((100, 100), dtype=torch.float)
qx_8bit = torch.quantize_per_tensor(x, scale=1.0, zero_point=2, dtype=torch.quint8)
qx = torch.quantize_per_tensor(x, scale=1.0, zero_point=2, dtype=torch.quint4x2)

torch.save(x, "temp.p")
print('Size float (B):', os.path.getsize("temp.p"))
os.remove('temp.p')

torch.save(qx_8bit, "temp.p")
print('Size quantized 8bit(B):', os.path.getsize("temp.p"))
os.remove('temp.p')

torch.save(qx, "temp.p")
print('Size quantized 4bit(B):', os.path.getsize("temp.p"))
os.remove('temp.p')
```

Size float (B): 40760
Size quantized 8bit(B): 10808
Size quantized 4bit(B): 5816

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23993134

fbshipit-source-id: 073bf262f9680416150ba78ed2d932032275946d
2020-10-01 23:53:34 -07:00
a0d08b2199 Set the default bailout depth to 20 (#45710)
Summary:
This modifies the default bailout depth to 20, which gives us reasonable performance in the benchmarks we considered (fastrnns, maskrcnn, hub/benchmark, etc.).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45710

Reviewed By: robieta

Differential Revision: D24071861

Pulled By: Krovatkin

fbshipit-source-id: 472aacc136f37297b21f577750c1d60683a6c81e
2020-10-01 23:37:41 -07:00
402caaeba5 [docs] Update docs for NegativeBinomial (#45693)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45693

**Summary**
This commit updates the docstring for
`torch.distributions.NegativeBinomial` to better match actual behaviour.
In particular, the parameter currently documented as probability of
success is actually probability of failure.

**Test Plan**
1) Ran the code from the issue to make sure this is still an issue (it
is)
2) `make html` and viewed the docs in a browser.

*Before*
<img width="879" alt="Captura de Pantalla 2020-10-01 a la(s) 1 35 28 p  m" src="https://user-images.githubusercontent.com/4392003/94864456-db3a5680-03f0-11eb-977e-3bab0fb9c206.png">

*After*
<img width="877" alt="Captura de Pantalla 2020-10-01 a la(s) 2 12 24 p  m" src="https://user-images.githubusercontent.com/4392003/94864478-e42b2800-03f0-11eb-965a-51493ca27c80.png">

**Fixes**
This commit closes #42449.

Test Plan: Imported from OSS

Reviewed By: robieta

Differential Revision: D24071048

Pulled By: SplitInfinity

fbshipit-source-id: d345b4de721475dbe26233e368af62eb57a47970
2020-10-01 23:20:34 -07:00
36de05dbf6 passing all arguments to sccache wrapper script should be quoted as "$@" (#45582)
Summary:
This fixes MIOpen runtime compilation since it passes quoted arguments to the clang compiler.  This change also makes the sccache wrapper scripts consistent with the nvcc wrapper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45582

Reviewed By: seemethere, izdeby

Differential Revision: D24034477

Pulled By: malfet

fbshipit-source-id: 1964bac1e693b238e8efe9c046a39be64571e9df
2020-10-01 23:11:59 -07:00
f6dc256bc6 example of splitting up an FX graph into smaller subgraphs with own submodules (#45404)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45404

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23956147

Pulled By: Lilyjjo

fbshipit-source-id: a35e33a0b9f1ed5f3fb6e5cd146f66c29bf3d518
2020-10-01 20:40:27 -07:00
1552a926a3 migrate cuda implementation of take() from TH to ATen (#45430)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45430

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24037297

Pulled By: bdhirsh

fbshipit-source-id: 7c5f2c08e895fb0c25eec1d68c7455e4f2b1c64e
2020-10-01 20:03:01 -07:00
a015ba8dd5 migrating the take() fn from TH to ATen (#45283)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45283

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24037298

Pulled By: bdhirsh

fbshipit-source-id: 088ce39e55ee8b5a79fa501395fa9eec08d1d396
2020-10-01 19:58:09 -07:00
fc4209bd4f Fix the bucketization wrong doc for right argument (#45684)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45684

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D24057996

Pulled By: glaringlee

fbshipit-source-id: 3db1c24f3cae9747effa4b1f3c5c3baf6888c9a1
2020-10-01 18:16:49 -07:00
4c1e50eb5c remove skip annotations since we already disabled the tests wholesale (#45698)
Summary:
Remove skip annotations since we already disabled the tests wholesale

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45698

Reviewed By: mrshenli

Differential Revision: D24064547

Pulled By: Krovatkin

fbshipit-source-id: 0d154135de0c0550d6874bea3c2d42d5f4d71cb4
2020-10-01 17:47:48 -07:00
cbdba7cc1e win job for the legacy executor (#45612)
Summary:
Adds a CUDA job on Windows for the jit legacy executor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45612

Reviewed By: mrshenli

Differential Revision: D24042196

Pulled By: Krovatkin

fbshipit-source-id: 35c79c53ed569d221e79376c108bc864900ef49e
2020-10-01 17:23:55 -07:00
0393a1e8b9 add an indexer to SymbolicShape (#45450)
Summary:
A convenience indexer into `SymbolicShape`s

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45450

Reviewed By: ZolotukhinM

Differential Revision: D23971758

Pulled By: Krovatkin

fbshipit-source-id: 1f18c5f89f579072f6bf467809ea9471bf42bc2d
2020-10-01 16:57:07 -07:00
0de5824f36 [iOS][CI] Upgrade xcode version to 12.0 (#45677)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45677

Test Plan: Imported from OSS

Reviewed By: husthyc

Differential Revision: D24065647

Pulled By: xta0

fbshipit-source-id: f2535b1d93e58cf79e7075bf56b0613a3ded16eb
2020-10-01 16:53:18 -07:00
e8e0fca99e [iOS][CI] Update the dev cert (#45651)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45651

### Summary

1. Update the iOS developer certificates. The new expiration date is 10/01/2021.
2. Restore the iOS arm64 jobs and the nightly.

### Test Plan

The following CI jobs succeed

- ci/circleci: pytorch_ios_11_2_1_arm64_build
- ci/circleci: pytorch_ios_11_2_1_arm64_custom_build
- ci/circleci: pytorch_ios_11_2_1_x86_64_build

Test Plan: Imported from OSS

Reviewed By: husthyc

Differential Revision: D24065648

Pulled By: xta0

fbshipit-source-id: 758f41de8296fdfbd3cfad87e9445c2acafd5f94
2020-10-01 16:48:30 -07:00
de3a48013a Use CAFFE2_USE_MSVC_STATIC_RUNTIME to determine when to avoid waiting for global destructors on Windows (#43532)
Summary:
We are trying to build libtorch statically (BUILD_SHARED_LIBS=OFF) then link it into a DLL. Our setup hits the infinite loop mentioned [here](54c05fa34e/torch/csrc/autograd/engine.cpp (L228)) because we build with `BUILD_SHARED_LIBS=OFF` but still link it all into a DLL at the end of the day.

This PR fixes the issue by changing the condition to guard on which windows runtime the build links against using the `CAFFE2_USE_MSVC_STATIC_RUNTIME` flag. `CAFFE2_USE_MSVC_STATIC_RUNTIME` defaults to ON when `BUILD_SHARED_LIBS=OFF`, so backwards compatibility is maintained.

I'm not entirely confident I understand the subtleties of the windows runtime versus linking setup, but this setup works for us and should not affect the existing builds.

Fixes https://github.com/pytorch/pytorch/issues/44470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43532

Reviewed By: mrshenli

Differential Revision: D24053767

Pulled By: albanD

fbshipit-source-id: 1127fefe5104d302a4fc083106d4e9f48e50add8
2020-10-01 16:41:14 -07:00
4f685ecc25 [reland][quant][graphmode][fx] Merge all quantization mode (#45292) (#45672)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45672

This PR merges all quantization mode and will only expose the following top level functions:
```
prepare_fx
prepare_qat_fx
convert_fx
```
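
A hedged usage sketch of the merged entry points (the module path and qconfig-dict format are assumptions based on the graph-mode FX API of this era):

```python
import torch
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx, convert_fx

model = torch.nn.Sequential(torch.nn.Linear(4, 4)).eval()
qconfig_dict = {"": get_default_qconfig("fbgemm")}  # assumed dict format

prepared = prepare_fx(model, qconfig_dict)  # insert observers
prepared(torch.randn(2, 4))                 # calibration pass
quantized = convert_fx(prepared)            # produce the quantized module
```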

Test Plan:
Imported from OSS

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D24053439

fbshipit-source-id: 03d545e26a36bc22a73349061b751eeb35171e64
2020-10-01 15:47:11 -07:00
18253f4a48 Fix BUILD_CAFFE2 if FBGEMM and NNPACK are not built (#45610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45610

Also add to the usual documentation places that this option exists.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24058199

Pulled By: suo

fbshipit-source-id: 81574fbd042f47587e2c7820c726fac0f68af2a7
2020-10-01 14:58:55 -07:00
5959de3aeb setup: Only include dataclasses for py < 3.8 (#45611)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45611

dataclasses was made a standard library item in 3.8

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D24031740

Pulled By: seemethere

fbshipit-source-id: 15bdf1fe0d8de9b8ba7912e4a651f06b18d516ee
2020-10-01 14:52:28 -07:00
93be03cec0 [quant] torch.mean add path for unsupported QNNPACK modes (#45533)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45533

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D24030446

Pulled By: z-a-f

fbshipit-source-id: a392402ef701c5e45e244ac440bc151ef942cccd
2020-10-01 14:44:26 -07:00
4564444c91 [RFC][caffe2] TaskGroup.__repr__ shouldn't have side effects
Summary: `__repr__` calling self.tasks() ends up marking the instance as "used", which doesn't seem appropriate. I was debugging a value being passed around and then ran into `Cannot add Task to an already used TaskGroup.` because the value had been logged once.

Test Plan:
Added a unit test -- didn't see a clean public method to test it, but I'm happy to add one if that makes sense.

Will wait for sandcastle to trigger everything else; I'm not at all familiar with this code so any other recommendations would be great!

Reviewed By: cryptopic

Differential Revision: D23541198

fbshipit-source-id: 5d1ec674a1ddaedf113140133b90e0da6afa7270
2020-10-01 14:21:03 -07:00
03e4e94d24 Find single partition (#45429)
Summary:
WIP: This PR is work in progress for partitioning the FX graph module. The class _Partitioner_ generates partitions for the graph module; the class _Partition_ is a partition node among the partitions.
_Partitioner()_: create a partitioner
_partition_graph(self, fx_module: GraphModule, devices: List[str]) -> None_:
take an FX graph module and a list of devices as input and create partition ids for each node inside the graph module

_dump_partition_DAG(self) -> None_:
print out the information about each partition, including its id, its backend type (what type of device this partition uses), all the nodes included in this partition, its parent partitions, children partitions, input nodes, and output nodes.

So far, only a single partition is considered, which means there is only one device with unlimited memory.
A unit test called _test_find_single_partition()_ is added to check that all nodes in the graph are assigned to the only partition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45429

Reviewed By: izdeby

Differential Revision: D24026268

Pulled By: scottxu0730

fbshipit-source-id: 119d506f33049a59b54ad993670f4ba5d8e15b0b
2020-10-01 13:07:34 -07:00
dcda11c4d3 Disable tcuda_fuser tests in Profiling Mode (#45638)
Summary:
Disable tcuda_fuser tests in Profiling Mode to address flaky tests until the fuser switches to the new approach.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45638

Reviewed By: mrshenli

Differential Revision: D24057230

Pulled By: Krovatkin

fbshipit-source-id: 8f7a47610d9b7da6ad3057208057a5a596e1bffa
2020-10-01 12:41:57 -07:00
381f6d32a7 [docs] Fix hyperlinks for nn.CrossEntropyLoss (#45660)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45460. This PR makes it so that LogSoftmax and NLLLoss are correctly linked from the nn.CrossEntropyLoss documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45660

Test Plan:
- built and viewed docs locally

![image](https://user-images.githubusercontent.com/5652049/94816513-ee85fb80-03c9-11eb-8289-56642c133e11.png)

Reviewed By: glaringlee

Differential Revision: D24049009

Pulled By: zou3519

fbshipit-source-id: 3bd0660acb8575d753cefd2d0f1e523ca58a25b6
2020-10-01 12:18:43 -07:00
1efdbfabcc [docs] Fix back quote rendering in loss modules docs (#45662)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42855. Previously, back quotes weren't rendering correctly in
equations. This is because we were quoting things like `'mean'`. In
order to backquote properly in latex in text-mode, the back-quote needs
to be written as a back-tick.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45662

Test Plan:
- built docs locally and viewed the changes.

For NLLLoss (which is not the original module mentioned in the issue, but it has the same problem), we can see how the back quotes now render properly:

![image](https://user-images.githubusercontent.com/5652049/94819862-c5676a00-03cd-11eb-9e92-01380ee52bd6.png)

Reviewed By: glaringlee

Differential Revision: D24049880

Pulled By: zou3519

fbshipit-source-id: 61a1257994144549eb8f29f19d639aea962dfec0
2020-10-01 11:52:27 -07:00
77cd8e006b Added support for complex torch.symeig (#45121)
Summary:
This PR adds support for complex-valued input for `torch.symeig`.

TODO:
- [ ] complex cuda tests raise `RuntimeError: _th_bmm_out not supported on CUDAType for ComplexFloat`
Update: Added xfailing tests for complex dtypes on CUDA. Once support for complex `bmm` is added these tests will work.

Fixes https://github.com/pytorch/pytorch/issues/45061.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45121

Reviewed By: mrshenli

Differential Revision: D24049649

Pulled By: anjali411

fbshipit-source-id: 2cd11f0e47d37c6ad96ec786762f2da57f25dac5
2020-10-01 08:57:13 -07:00
4583edb5d6 Add NativeFunction.signature and kind. (#45131)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45131

These make it easier to group native functions together and determine
what kind of native function it is (inplace/out/functional).  Currently
they are not used but they may be useful for tools.autograd porters.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zhangguanheng66

Differential Revision: D23872526

Pulled By: ezyang

fbshipit-source-id: 1d6e429ab9a1f0fdb764be4228c5bca4dce8f24e
2020-10-01 08:46:40 -07:00
41bd5a5ee0 Switch all Sequences in tools.codegen.model to Tuple (#45127)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45127

I thought I was being clever by using Sequence, which doesn't commit to
List or Tuple, but forces read-onlyness in the type system.  However,
there is a runtime implication to using List or Tuple: Lists can't be
hashed, but Tuples can be!  This is important because I shortly want
to group by FunctionSchema, and to do this I need FunctionSchema to
be hashable.  Switch everything to Tuple for true immutability.
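
A minimal illustration of the hashability difference (plain Python, not from the PR):

```
# Tuples of hashable elements are hashable, so they can key a dict --
# which is what grouping by FunctionSchema requires. Lists are not.
print(hash((1, 2, 3)))    # works
try:
    hash([1, 2, 3])
except TypeError as e:
    print(e)              # unhashable type: 'list'
```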

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D23872527

Pulled By: ezyang

fbshipit-source-id: 5c8fae1c50a5ae47b4167543646d94ddcafff8c3
2020-10-01 08:41:53 -07:00
a242ac8c27 Update torchvision version to current latest master (#45342)
Summary:
Updating torchvision version to the current latest master.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45342

Reviewed By: seemethere

Differential Revision: D23933572

Pulled By: izdeby

fbshipit-source-id: c374156eb608e882a1e2107143e39f03b7399081
2020-10-01 08:31:38 -07:00
72bc3d9de4 Use MTA for amp grad unscaling, enforce op math type in MTA functors, and allow op lambdas (#44778)
Summary:
Amp gradient unscaling is a great use case for multi tensor apply (in fact it's the first case I wrote it for).  This PR adds an MTA unscale+infcheck functor.  Really excited to have it for `torch.cuda.amp`. izdeby your interface was clean and straightforward to use, great work!

Labeled as bc-breaking because the native_functions.yaml exposure of unscale+infcheck changes from [`_amp_non_finite_check_and_unscale_` to `_amp_foreach_non_finite_check_and_unscale_`]( https://github.com/pytorch/pytorch/pull/44778/files#diff-f1e4b2c15de770d978d0eb77b53a4077L6289-L6293).

The PR also modifies Unary/Binary/Pointwise Functors to
- do ops' internal math in FP32 for FP16 or bfloat16 inputs, which improves precision ([and throughput, on some architectures!](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions)) and has no downside for the ops we care about.
- accept an instantiated op functor rather than an op functor template (`template<class> class Op`).  This allows calling code to pass lambdas.

Open question:  As written now, the PR has MTA Functors take care of pre- and post-casting FP16/bfloat16 inputs to FP32 before running the ops.  However, alternatively, the pre- and post-math casting could be deferred/written into the ops themselves, which gives them a bit more control.  I can easily rewrite it that way if you prefer.
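
For context, a minimal sketch of the `torch.cuda.amp` loop that exercises the unscale+infcheck path (the kernel itself is invoked internally by `GradScaler`; nothing here is new API):

```
import torch

model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(torch.randn(8, 10, device="cuda")).sum()
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # unscales grads and skips the step if infs/nans are found
    scaler.update()
```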

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44778

Reviewed By: gchanan

Differential Revision: D23944102

Pulled By: izdeby

fbshipit-source-id: 22b25ccad5f69b413c77afe8733fa9cacc8e766d
2020-10-01 07:51:16 -07:00
84cf3372d1 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D24044108

fbshipit-source-id: 6dfe2f1201304fa58e42472e3f53c72cbb63d7d2
2020-10-01 05:29:03 -07:00
592b398e82 [AutoAccept][Codemod][FBSourceGoogleJavaFormatLinter] Daily arc lint --take GOOGLEJAVAFORMAT
Reviewed By: zertosh

Differential Revision: D24044052

fbshipit-source-id: 50ac5b7480ed65af94617bf8b014252ea7b27c4f
2020-10-01 05:19:37 -07:00
c36b354072 Revert D23913105: [quant][graphmode][fx] Merge all quantization mode
Test Plan: revert-hammer

Differential Revision:
D23913105 (ffcb0989e7)

Original commit changeset: 4e335286d6de

fbshipit-source-id: 5765b4e8ec917423f1745f73a9f3f235fc53423d
2020-10-01 03:12:42 -07:00
78b95b6204 Revert "Revert D24024606: [FX] Shape propagation example" (#45637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45637

This reverts commit 869b05648def7a3b01685da94d4ee36f671d5dd6.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24037870

Pulled By: jamesr66a

fbshipit-source-id: 851beb42fe72383108ceeff1fe97f388d9ad059e
2020-10-01 01:07:56 -07:00
4339f5c076 [PyTorch][QPL] Add instance_key into MOBILE_MODULE_LOAD_STATS logging. (#45518)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45518

Similar to the previous diff, add instance_key into MOBILE_MODULE_LOAD_STATS logging.
ghstack-source-id: 113149713

Test Plan:
```
09-29 11:50:23.345  6477  9351 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterLoadModel instance_key = 2015064908
09-29 11:50:23.409  6477  9351 W MobileModuleQPLObserver.cpp: TESTINGTESTING markerAnnotate instance_key = 2015064908, model_name = bi_pytext_v10
09-29 11:50:23.410  6477  9351 W MobileModuleQPLObserver.cpp: TESTINGTESTING markerAnnotate instance_key = 2015064908, model_type = FBNet
09-29 11:50:23.410  6477  9351 W MobileModuleQPLObserver.cpp: TESTINGTESTING markerAnnotate instance_key = 2015064908, op_list_string = ["aten::__getitem__.t", "aten::__is__", "aten::__isnot__", "aten::add.Tensor", "aten::append.t", "aten::cat", "aten::contiguous", "aten::conv1d", "aten::dim", "aten::embedding", "aten::eq.int", "aten::format", "aten::len.t", "aten::max.dim", "aten::mul.Tensor", "aten::permute", "aten::relu", "aten::softmax.int", "aten::tanh", "prepacked::linear_clamp_run", "prim::RaiseException", "prim::TupleIndex", "prim::TupleUnpack", "prim::Uninitialized", "prim::unchecked_cast"]
09-29 11:50:23.410  6477  9351 W MobileModuleQPLObserver.cpp: TESTINGTESTING onExitLoadModel instance_key = 2015064908
```

Reviewed By: iseeyuan

Differential Revision: D23996150

fbshipit-source-id: 7bf76af3b7e6b346afd20ab341204743c81cfe83
2020-09-30 23:31:35 -07:00
d306d0c2b1 remove redundant PE(profiling executor) jobs in CI (#45397)
Summary:
This PR removes redundant profiling jobs, since after the switch (https://github.com/pytorch/pytorch/pull/45396) PE will now be running by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45397

Reviewed By: zhangguanheng66

Differential Revision: D23966890

Pulled By: Krovatkin

fbshipit-source-id: ef184ca5fcf079580fa139b6653f8d9a6124050e
2020-09-30 22:18:02 -07:00
3da4cea658 [ONNX] Add dim_param support in export with onnx shape inference (#44920)
Summary:
* Support propagating `dim_param` in ONNX by encoding it as `ShapeSymbol` in the `SymbolicShape` of outputs. If export is called with `dynamic_axes` provided, shape inference will start with these axes set as dynamic (see the export sketch after this list).
* Add new test file `test_pytorch_onnx_shape_inference.py`, reusing all test cases from `test_pytorch_onnx_onnxruntime.py`, but focused on validating shapes for all nodes in the graph. Currently this is not enabled in the CI, since there are still quite a few existing issues and corner cases to fix. The test defaults to running only at opset 12.
* Bug fixes, such as div, _len, the peephole.cpp passes for PackPadded, and LogSoftmaxCrossEntropy.
* This PR depends on existing PRs such as #44332.
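
As a point of reference, `dynamic_axes` is passed to the standard `torch.onnx.export` entry point (a minimal sketch; names are illustrative):

```
import torch

model = torch.nn.Linear(4, 2)
dummy = torch.randn(1, 4)
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["output"],
    # axis 0 of input and output is exported as a named dim_param
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)
```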

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44920

Reviewed By: eellison

Differential Revision: D23958398

Pulled By: bzinodev

fbshipit-source-id: 00479d9bd19c867d526769a15ba97ec16d56e51d
2020-09-30 21:56:24 -07:00
ffcb0989e7 [quant][graphmode][fx] Merge all quantization mode (#45292)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45292

This PR merges all quantization mode and will only expose the following top level functions:
```
prepare_fx
prepare_qat_fx
convert_fx
```
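
A hedged sketch of how these entry points are used (the exact qconfig plumbing may differ from what ships in this PR):

```
import torch
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx, convert_fx

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.ReLU()).eval()
qconfig_dict = {"": get_default_qconfig("fbgemm")}

prepared = prepare_fx(model, qconfig_dict)  # insert observers
prepared(torch.randn(2, 4))                 # calibrate
quantized = convert_fx(prepared)            # produce the quantized module
```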

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23913105

fbshipit-source-id: 4e335286d6de225839daf51d1df54322d52d68e5
2020-09-30 21:20:34 -07:00
3f440d74fc [PyTorch][QPL] Add instance_key into MOBILE_MODULE_STATS logging. (#45517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45517

Add a unique instance_key instead of the default one into MOBILE_MODULE_STATS logging, to avoid overlaps between multiple events.
ghstack-source-id: 113149453

Test Plan:
Make sure that each event's start, annotate, and end have the same instance_key:
```
09-28 23:46:03.094 19349 21069 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod instance_key = 1123198800, method_name = forward
09-28 23:46:03.094 19349 21069 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod instance_key = 1123198800, model_name = bi_pytext_v10
09-28 23:46:03.094 19349 21069 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod instance_key = 1123198800, model_type = FBNet
09-28 23:46:03.094 19349 21069 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod instance_key = 1123198800, op_list_string = ["aten::__getitem__.t", "aten::__is__", "aten::__isnot__", "aten::add.Tensor", "aten::append.t", "aten::cat", "aten::contiguous", "aten::conv1d", "aten::dim", "aten::embedding", "aten::eq.int", "aten::format", "aten::len.t", "aten::max.dim", "aten::mul.Tensor", "aten::permute", "aten::relu", "aten::softmax.int", "aten::tanh", "prepacked::linear_clamp_run", "prim::RaiseException", "prim::TupleIndex", "prim::TupleUnpack", "prim::Uninitialized", "prim::unchecked_cast"]
09-28 23:46:03.181 19349 21069 W MobileModuleQPLObserver.cpp: TESTINGTESTING onExitRunMethod instance_key = 1123198800
09-28 23:46:04.183 19349 20896 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod instance_key = 1521608147, method_name = forward
09-28 23:46:04.184 19349 20896 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod instance_key = 1521608147, model_name = __torch__.Model
09-28 23:46:04.205 19349 20896 W MobileModuleQPLObserver.cpp: TESTINGTESTING onExitRunMethod instance_key = 1521608147
```

Reviewed By: iseeyuan

Differential Revision: D23985178

fbshipit-source-id: bcd5db8dc680e3cf8d12edf865377e80693cc23b
2020-09-30 20:13:33 -07:00
75fc263579 [TensorExpr] Add a tensor expressions tutorial. (#45527)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45527

Differential Revision: D23998787

Test Plan: Imported from OSS

Reviewed By: eellison

Pulled By: ZolotukhinM

fbshipit-source-id: 1f78ccfe8ef13bf493812cfec7f2fd4853e630ee
2020-09-30 19:35:58 -07:00
9d5607fcd9 [quant] Use PlaceholderObserver as default dynamic quant observer (#45343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45343

The current default dynamic quant observer is not correct: for dynamic quantization we don't
accumulate min/max and we don't need to calculate qparams.

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D23933995

fbshipit-source-id: 3ff497c9f5f74c687e8e343ab9948d05ccbba09b
2020-09-30 19:01:18 -07:00
2b13d9413e Re-land: Add callgrind collection to Timer #44717 (#45586)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45586

Test Plan: The unit test has been softened to be less platform sensitive.

Reviewed By: mruberry

Differential Revision: D24025415

Pulled By: robieta

fbshipit-source-id: ee986933b984e736cf1525e1297de6b21ac1f0cf
2020-09-30 17:43:06 -07:00
3a2d45304d [Experimental][Partial] New implementation for torch.distributed APIs in C++ (#45547)
Summary:
This is an attempt at refactoring the `torch.distributed` implementation. The goal is to push the Python layer's global state (like _default_pg) down to the C++ layer so that `torch.distributed` becomes more TorchScript friendly.

This PR adds the skeleton of C++ implementation, at the moment it is not included in any build (and won't be until method implementations are filled in). If you see any test failures related, feel free to revert.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45547

Reviewed By: izdeby

Differential Revision: D24024213

Pulled By: gmagogsfm

fbshipit-source-id: 2762767f63ebef43bf58e17f9447d53cf119f05f
2020-09-30 17:35:51 -07:00
0b3ad5404a [bot] Add quantization triage bot script (#45622)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45622

Copied and modified from https://github.com/pytorch/pytorch/blob/master/.github/workflows/jit_triage.yml

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D24036142

fbshipit-source-id: 41287b6a0390cabe4474c99464d74da2c0934401
2020-09-30 17:19:41 -07:00
869b05648d Revert D24024606: [FX] Shape propagation example
Test Plan: revert-hammer

Differential Revision:
D24024606 (ac9a708ed0)

Original commit changeset: 5340eab20f80

fbshipit-source-id: f465eb5e8e994b3b0bedbc779901f76b9ab16f02
2020-09-30 17:03:14 -07:00
f2c2b75e80 flush the buffer when printing the IR (#45585)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45585

I discovered this bug when I was trying to print the graph to a file. Turns out I had to close the file, but flushing should be a good safeguard in case other users forget.

Test Plan:
Tested with and without flushing.
with P144064292
without P144064767

Reviewed By: mortzur

Differential Revision: D24023819

fbshipit-source-id: 39574b3615feb28e5b5939664c04ddfb1257706a
2020-09-30 16:55:27 -07:00
6fde2df1b8 [JIT] Update JIT triage project board workflow (#45613)
Summary:
This commit updates `.github/workflows/jit_triage.yml` to use the new `oncall: jit` tag instead of the old `jit` tag.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45613

Reviewed By: izdeby

Differential Revision: D24032388

Pulled By: SplitInfinity

fbshipit-source-id: 6631a596b2f80bdb322caa74adaf0dc2cb146350
2020-09-30 16:36:23 -07:00
4be42034b6 Clear shape information before finalizing graph-mode quantization (#45282)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45282

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23909601

Pulled By: bzinodev

fbshipit-source-id: 3062cda46b15a79094a360216c35906afab7c723
2020-09-30 16:13:55 -07:00
85a70ce71f Add multiline string dedent support (#45580)
Summary:
Fixes #44842
Summary
========
This PR adds support for multiline string dedents.

Test
=====
pytest -k test_multiline_string_dedents test/test_jit.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45580

Reviewed By: wconstab

Differential Revision: D24025866

Pulled By: nikithamalgifb

fbshipit-source-id: 0f49739fb93f70f73a8f367caca2887f558a3937
2020-09-30 16:08:26 -07:00
56840f0a81 Prevent overflow in bucketize binary search
Summary: The current `median` calculation in the bucketize binary search is done in a way which is well-known to produce overflow issues ([link](https://en.wikipedia.org/wiki/Binary_search_algorithm#Implementation_issues)). This diff fixes the calculation so that overflows do not occur.

Test Plan:
Standard commit tests.

Also can test with:
```
#include <cstdint>
#include <iostream>

// Overflow-prone midpoint: (a + b) can exceed INT32_MAX even when
// both a and b are individually in range.
int32_t mp1(int32_t a, int32_t b){
        return (a+b)/2;
}

// Overflow-safe midpoint: (b - a) stays in range for search bounds like these.
int32_t mp2(int32_t a, int32_t b){
        return a+(b-a)/2;
}

int main(){
        // Check that the two formulas agree on inputs that do not overflow;
        // they diverge only once (a + b) wraps around.
        int32_t low=-1;
        for(int32_t high=1;high<10000;high++){
                if(mp1(low,high)!=mp2(low,high)){
                        std::cout<<"Ahhhh!"<<std::endl;
                }
        }
}
```

Reviewed By: drdarshan

Differential Revision: D23993920

fbshipit-source-id: 6b4567515552092de5876de6cab77df27c9cf61d
2020-09-30 15:04:11 -07:00
2596113a79 Add fuse support for batchnorm with affine=False (#45474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45474

When batchnorm's affine is set to False, its weight and bias are set to None, which fusion did not support. Added a fix that treats the weight as 1 and the bias as 0 if they are not set.
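
A hedged sketch of the fusion this enables (module names are illustrative):

```
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 1)
        self.bn = nn.BatchNorm2d(3, affine=False)  # weight/bias are None

    def forward(self, x):
        return self.bn(self.conv(x))

m = M().eval()
# With the fix, the missing weight is treated as 1 and the bias as 0.
fused = torch.quantization.fuse_modules(m, [["conv", "bn"]])
```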

Test Plan: Add unit test for testing fusing conv, batchnorm where batchnorm is in affine=False mode.

Reviewed By: z-a-f

Differential Revision: D23977080

fbshipit-source-id: 2782be626dc67553f3d27d8f8b1ddc7dea022c2a
2020-09-30 14:15:05 -07:00
6b42ca2d69 [ONNX] Update embedding_bag export (#44693)
Summary:
Adds export support for embedding_bag with a dynamic list of offsets.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44693

Reviewed By: malfet

Differential Revision: D23831980

Pulled By: bzinodev

fbshipit-source-id: 3eaff1a0f20d1bcfb8039e518d78c491be381e1a
2020-09-30 13:36:40 -07:00
ac9a708ed0 [FX] Shape propagation example (#45589)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45589

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D24024606

Pulled By: jamesr66a

fbshipit-source-id: 5340eab20f805c232bfeb37e4e2156f39a161c19
2020-09-30 13:18:23 -07:00
ffd50b8220 SET USE_DISTRIBUTED OFF when libuv is not installed (#45554)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45554

Reviewed By: izdeby

Differential Revision: D24016825

Pulled By: mrshenli

fbshipit-source-id: 332d860429626a915c06f98cad31e6db1cbc4eb1
2020-09-30 12:46:36 -07:00
c9bb990707 [c++] Distance-agnostic triplet margin loss (#45377)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45377

This PR adds a C++ implementation of the TripletMarginWithDistanceLoss, for which the Python implementation was introduced in PR #43680.  It's based on PR #44072, but I'm resubmitting this to unlink it from Phabricator.

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D24003973

fbshipit-source-id: 2d9ada7260a6f27425ff2fdbbf623dad0fb79405
2020-09-30 12:37:35 -07:00
181afd5220 Add an option to DDP to take a list of parameters to ignore upfront. (#44826)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44826

As described in https://github.com/pytorch/pytorch/issues/43690, there
is a need for DDP to be able to ignore certain parameters in the module (not
install allreduce hooks) for certain use cases. `find_unused_parameters` is
sufficient from a correctness perspective, but we can get better performance
with this upfront list if users know which params are unused, since we won't
have to traverse the autograd graph every iteration.

To enable this, we add a field `parameters_to_ignore` to DDP init and don't
pass in that parameter to reducer if that parameter is in the given list.
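
For contrast, the existing correctness-only knob is the `find_unused_parameters` flag (a minimal sketch; the upfront ignore list avoids the per-iteration graph traversal this flag implies):

```
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# assumes torch.distributed.init_process_group(...) has already run
model = torch.nn.Linear(10, 10).cuda()
ddp = DDP(model, device_ids=[0], find_unused_parameters=True)
```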
ghstack-source-id: 113210109

Test Plan: Added unittest

Reviewed By: xw285cornell, mrshenli

Differential Revision: D23740639

fbshipit-source-id: a0411712a8b0b809b9c9e6da04bef2b955ba5314
2020-09-30 11:52:50 -07:00
c112e89cc6 [quant] Make choose_qparams_optimized return Tensors to preserve dtype (#45530)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45530

Returning double values requires special handling as a return type for aten functions.
Instead, return tensors, where the type is preserved in the tensor dtype.

Test Plan:
python test/test_quantization.py TestQuantizedTensor.test_choose_qparams_optimized

Imported from OSS

Reviewed By: dskhudia

Differential Revision: D24001134

fbshipit-source-id: bec6b17242f4740ab5674be06e0fc30c35eb0379
2020-09-30 11:35:23 -07:00
ce9df084d5 [pytorch] Replace "blacklist" in test/test_mobile_optimizer.py (#45512)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45512

This diff addresses https://github.com/pytorch/pytorch/issues/41443.
It is a clone of D23205313 which could not be imported from GitHub
for strange reasons.

Test Plan: Continuous integration.

Reviewed By: AshkanAliabadi

Differential Revision: D23967322

fbshipit-source-id: 744eb92de7cb5f0bc9540ed6a994f9e6dce8919a
2020-09-30 10:43:59 -07:00
a245dd4317 add dllexport before template specialization functions for windows build (#45477)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41896

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45477

Reviewed By: zhangguanheng66

Differential Revision: D24006579

Pulled By: walterddr

fbshipit-source-id: 01e8808f0fecf9a405174fab5f348c02fb063e37
2020-09-30 10:39:23 -07:00
5539066d12 [quant][graphmode][fx] Support quantization for custom module (#44074)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44074

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23580642

fbshipit-source-id: a80b0b3e5e1f4c4a9647da872239cc0a4d58dd3b
2020-09-30 10:24:54 -07:00
51d0ae9207 Revert D24010742: [pytorch][PR] Add callgrind collection to Timer
Test Plan: revert-hammer

Differential Revision:
D24010742 (9b27e0926b)

Original commit changeset: df6bc765f8ef

fbshipit-source-id: 4c1edd57ea932896f7052716427059c924222501
2020-09-30 10:15:46 -07:00
6c4aa2a79c Revert D24002415: Some fixes to smooth_l1_loss
Test Plan: revert-hammer

Differential Revision:
D24002415 (fdbed7118e)

Original commit changeset: 980c141019ec

fbshipit-source-id: 8981b5f6d982ed66c670122e437540444cb5f39c
2020-09-30 10:00:17 -07:00
4f3920951e type check for torch.quantization.quantize_jit (#45548)
Summary:
added type signal for more jit python functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45548

Reviewed By: malfet

Differential Revision: D24010922

Pulled By: walterddr

fbshipit-source-id: 2fdd75482481adf2eddc01b915d7d5720fbb2b82
2020-09-30 09:17:00 -07:00
939e0389de Update test_multi_tensor_optimizers test (#45510)
Summary:
Following up on previous [feedback](https://github.com/pytorch/pytorch/pull/45475/files#r496330797).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45510

Reviewed By: heitorschueroff

Differential Revision: D23992304

Pulled By: izdeby

fbshipit-source-id: 4784ed8d79e09da3aa61880add6443e3a8d322e4
2020-09-30 08:59:18 -07:00
415ed434aa Add whitelist for complex backward (#45461)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45461

This PR disables autograd for all C -> C, R -> C functions which are not included in the whitelist `GRADIENT_IMPLEMENTED_FOR_COMPLEX`. In practice, there will be a RuntimeError during forward computation when the outputs are differentiable:
```
>>> x=torch.randn(4, 4, requires_grad=True, dtype=torch.cdouble)
>>> x.pow(3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: pow does not support automatic differentiation for outputs with complex dtype.
```

The implicit assumption here is that all the C -> R functions have correct backward definitions. So before merging this PR, the following functions must be tested and verified to have correct backward definitions:
`torch.abs` (updated in #39955 ), `torch.angle`, `torch.norm`, `torch.irfft`, `torch.istft`.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D23998156

Pulled By: anjali411

fbshipit-source-id: 370eb07fe56ac84dd8e2233ef7bf3a3eb8aeb179
2020-09-30 08:45:55 -07:00
7e863475d7 Upgrade ReadMe document to guide user to install libuv(1.39) in conda env on Windows platform (#45553)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45553

Reviewed By: SciPioneer

Differential Revision: D24017246

Pulled By: mrshenli

fbshipit-source-id: ec69f864a7acfbdddd60c3d2b442294ec3e34558
2020-09-30 08:28:47 -07:00
96540e918c Add ShuffleDataset with buffer (#45290)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45290

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D24001084

Pulled By: erjia-guan

fbshipit-source-id: d8a7455cf3f18e1f8c1edc53c42c1a99c8573c51
2020-09-30 07:58:15 -07:00
fdbed7118e Some fixes to smooth_l1_loss (#45532)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45532

- updated documentation
- explicitly not supporting negative values for beta (previously the
result was incorrect)
- Removing default value for beta in the backwards function, since it's
only used internally by autograd (as per convention)

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D24002415

Pulled By: bdhirsh

fbshipit-source-id: 980c141019ec2d437b771ee11fc1cec4b1fcfb48
2020-09-30 07:28:44 -07:00
e02868e12d Unify Transformer coder Constructors (#45515)
Summary:
Fixes [#45502](https://github.com/pytorch/pytorch/issues/45502)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45515

Reviewed By: zhangguanheng66, ZolotukhinM

Differential Revision: D23994644

Pulled By: glaringlee

fbshipit-source-id: b8728e8dfd8857e27246ebb11b17c2d1b48796ca
2020-09-30 07:05:41 -07:00
7566823779 Enable PE + TE (#45546)
Summary:
This PR enables PE + TE for 1.7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45546

Reviewed By: ZolotukhinM

Differential Revision: D24006940

Pulled By: Krovatkin

fbshipit-source-id: a3326077d34a023941acdb06c4907c96e7ba0115
2020-09-30 06:49:59 -07:00
9b27e0926b Add callgrind collection to Timer (#44717)
Summary:
This PR allows Timer to collect deterministic instruction counts for (some) snippets. Because of the intrusive nature of Valgrind (effectively replacing the CPU with an emulated one) we have to perform our measurements in a separate process. This PR writes a `.py` file containing the Timer's `setup` and `stmt`, and executes it within a `valgrind` subprocess along with a plethora of checks and error handling. There is still a bit of jitter around the edges due to the Python glue that I'm using, but the PyTorch signal is quite good and thus this provides a low friction way of getting signal. I considered using JIT as an alternative, but:

A) Python specific overheads (e.g. parsing) are important
B) JIT might do rewrites which would complicate measurement.

Consider the following bit of code, related to https://github.com/pytorch/pytorch/issues/44484:
```
from torch.utils._benchmark import Timer
counts = Timer(
    "x.backward()",
    setup="x = torch.ones((1,)) + torch.ones((1,), requires_grad=True)"
).collect_callgrind()

for c, fn in counts[:20]:
    print(f"{c:>12}  {fn}")
```

```
      812800  ???:_dl_update_slotinfo
      355600  ???:update_get_addr
      308300  work/Python/ceval.c:_PyEval_EvalFrameDefault'2
      304800  ???:__tls_get_addr
      196059  ???:_int_free
      152400  ???:__tls_get_addr_slow
      138400  build/../c10/core/ScalarType.h:c10::typeMetaToScalarType(caffe2::TypeMeta)
      126526  work/Objects/dictobject.c:_PyDict_LoadGlobal
      114268  ???:malloc
      101400  work/Objects/unicodeobject.c:PyUnicode_FromFormatV
       85900  work/Python/ceval.c:_PyEval_EvalFrameDefault
       79946  work/Objects/typeobject.c:_PyType_Lookup
       72000  build/../c10/core/Device.h:c10::Device::validate()
       70000  /usr/include/c++/8/bits/stl_vector.h:std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector()
       66400  work/Objects/object.c:_PyObject_GenericGetAttrWithDict
       63000  ???:pthread_mutex_lock
       61200  work/Objects/dictobject.c:PyDict_GetItem
       59800  ???:free
       58400  work/Objects/tupleobject.c:tupledealloc
       56707  work/Objects/dictobject.c:lookdict_unicode_nodummy
```

Moreover, if we backport this PR to 1.6 (just copy the `_benchmarks` folder) and load those counts as `counts_1_6`, then we can easily diff them:
```
print(f"Head instructions: {sum(c for c, _ in counts)}")
print(f"1.6 instructions:  {sum(c for c, _ in counts_1_6)}")
count_dict = {fn: c for c, fn in counts}
for c, fn in counts_1_6:
    _ = count_dict.setdefault(fn, 0)
    count_dict[fn] -= c
count_diffs = sorted([(c, fn) for fn, c in count_dict.items()], reverse=True)
for c, fn in count_diffs[:15] + [["", "..."]] + count_diffs[-15:]:
    print(f"{c:>8}  {fn}")
```

```
Head instructions: 7609547
1.6 instructions:  6059648
  169600  ???:_dl_update_slotinfo
  101400  work/Objects/unicodeobject.c:PyUnicode_FromFormatV
   74200  ???:update_get_addr
   63600  ???:__tls_get_addr
   46800  work/Python/ceval.c:_PyEval_EvalFrameDefault
   33512  work/Objects/dictobject.c:_PyDict_LoadGlobal
   31800  ???:__tls_get_addr_slow
   31700  build/../aten/src/ATen/record_function.cpp:at::RecordFunction::RecordFunction(at::RecordScope)
   28300  build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionSignature::parse(_object*, _object*, _object*, _object**, bool)
   27800  work/Objects/object.c:_PyObject_GenericGetAttrWithDict
   27401  work/Objects/dictobject.c:lookdict_unicode_nodummy
   24115  work/Objects/typeobject.c:_PyType_Lookup
   24080  ???:_int_free
   21700  work/Objects/dictobject.c:PyDict_GetItemWithError
   20700  work/Objects/dictobject.c:PyDict_GetItem
          ...
   -3200  build/../c10/util/SmallVector.h:at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool)
   -3400  build/../aten/src/ATen/native/TensorIterator.cpp:at::TensorIterator::resize_outputs(at::TensorIteratorConfig const&)
   -3500  /usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:std::unique_lock<std::mutex>::unlock()
   -3700  build/../torch/csrc/utils/python_arg_parser.cpp:torch::PythonArgParser::raw_parse(_object*, _object*, _object**)
   -4207  work/Objects/obmalloc.c:PyMem_Calloc
   -4500  /usr/include/c++/8/bits/stl_vector.h:std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector()
   -4800  build/../torch/csrc/autograd/generated/VariableType_2.cpp:torch::autograd::VariableType::add__Tensor(at::Tensor&, at::Tensor const&, c10::Scalar)
   -5000  build/../c10/core/impl/LocalDispatchKeySet.cpp:c10::impl::ExcludeDispatchKeyGuard::ExcludeDispatchKeyGuard(c10::DispatchKey)
   -5300  work/Objects/listobject.c:PyList_New
   -5400  build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionParameter::check(_object*, std::vector<pybind11::handle, std::allocator<pybind11::handle> >&)
   -5600  /usr/include/c++/8/bits/std_mutex.h:std::unique_lock<std::mutex>::unlock()
   -6231  work/Objects/obmalloc.c:PyMem_Free
   -6300  work/Objects/listobject.c:list_repeat
  -11200  work/Objects/listobject.c:list_dealloc
  -28900  build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionSignature::parse(_object*, _object*, _object**, bool)
```

Remaining TODOs:
  * Include a timer in the generated script for cuda sync.
  * Add valgrind to CircleCI machines and add a unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44717

Reviewed By: soumith

Differential Revision: D24010742

Pulled By: robieta

fbshipit-source-id: df6bc765f8efce7193893edba186cd62b4b23623
2020-09-30 05:52:54 -07:00
f5c95d5cf1 Source code level attribution in profiler (#43898)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43898

Adds a with_source parameter to enable tracking source code
(filename and line) in the profiler for eager, TorchScript, and autograd
modes
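
A hedged sketch based on the parameter named above (`with_source` is taken verbatim from this summary; shipped releases expose the equivalent functionality as `with_stack=True`):

```
import torch
import torch.autograd.profiler as profiler

x = torch.randn(10, 10)
with profiler.profile(with_source=True) as prof:
    y = (x + x).sum()
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```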

Test Plan:
python test/test_profiler.py
```
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Source Location
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  --------------------------------------------
ts_method_1                          10.43%           235.364us        36.46%           822.920us        822.920us        1                test/test_profiler.py(70): test_source
aten::add                            7.52%            169.833us        8.88%            200.439us        200.439us        1                test/test_profiler.py(69): test_source
aten::normal_                        6.26%            141.380us        6.26%            141.380us        141.380us        1                test/test_profiler.py(67): test_source
aten::add                            5.80%            130.830us        8.41%            189.800us        63.267us         3                test/test_profiler.py(72): test_source
aten::sum                            5.02%            113.340us        8.39%            189.475us        189.475us        1                test/test_profiler.py(64): ts_method_1
aten::add                            4.58%            103.346us        6.33%            142.847us        142.847us        1                test/test_profiler.py(62): ts_method_1
aten::mul                            4.05%            91.498us         9.62%            217.113us        217.113us        1                test/test_profiler.py(71): test_source
aten::add                            4.03%            90.880us         5.60%            126.405us        126.405us        1                test/test_profiler.py(58): ts_method_2
aten::empty                          3.49%            78.735us         3.49%            78.735us         19.684us         4                test/test_profiler.py(72): test_source
```

Reviewed By: ngimel

Differential Revision: D23432664

Pulled By: ilia-cher

fbshipit-source-id: 83ad7ebe0c2502494d3b48c4e687802db9c77615
2020-09-30 00:57:35 -07:00
c2c7099944 Fix docs for kwargs, q-z (#43589)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43589

Reviewed By: zhangguanheng66

Differential Revision: D24006259

Pulled By: mruberry

fbshipit-source-id: 39abd474744f152648aad201d7311b42d20efc88
2020-09-29 22:57:02 -07:00
b4ba66ae32 Print tensor shapes and convolution parameters when cuDNN exception is thrown (#45023)
Summary:
Originally proposed at https://github.com/pytorch/pytorch/issues/44473#issuecomment-690670989 by colesbury.

This PR adds the functionality to print relevant tensor shapes and convolution parameters along with the stack trace once a cuDNN exception is thrown.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45023

Reviewed By: gchanan

Differential Revision: D23932661

Pulled By: ezyang

fbshipit-source-id: 5f5f570df6583271049dfc916fac36695f415331
2020-09-29 21:55:34 -07:00
93650a82c9 Move prim::tolist math.log and aten::cpu to lite interpreter for translation model (#45482)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45482

Working on some models that need these ops on lite interpreter.

Test Plan: locally build and load/run the TS model without problem.

Reviewed By: iseeyuan

Differential Revision: D23906581

fbshipit-source-id: 01b9de2af2046296165892b837bc14a7e5d59b4e
2020-09-29 21:42:18 -07:00
4aca63d38a [TensorExpr] Change API for creating Load and Store expressions. (#45520)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45520

With this change `Load`s and `Store`s no longer accept `Placeholder`s in
their constructor and `::make` functions and can only be built with
`Buf`.
`Placeholder` gets its own `store`, `load`, `storeWithMask`, and
`loadWithMask` method for more convenient construction.

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D23998789

Pulled By: ZolotukhinM

fbshipit-source-id: 3fe018e00c1529a563553b2b215f403b34aea912
2020-09-29 20:52:38 -07:00
772ce9ac2c Fix memory corruption when running torch.svd for complex.doubles (#45486)
Summary:
According to http://www.netlib.org/lapack/explore-html/d3/da8/group__complex16_g_esing_gaccb06ed106ce18814ad7069dcb43aa27.html,
rwork should be an array of doubles, but it was allocated as an array of floats (actually ints).

Fixes crash from https://github.com/pytorch/pytorch/issues/45269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45486

Reviewed By: walterddr

Differential Revision: D23984444

Pulled By: malfet

fbshipit-source-id: 6a1b00a27de47046496ccf6a91b6e8ad283e42e6
2020-09-29 20:27:08 -07:00
ccad73ab41 Fix D23995953 import.
Summary: https://github.com/pytorch/pytorch/pull/45511 could not be properly imported

Test Plan: See https://github.com/pytorch/pytorch/pull/45511

Reviewed By: zhangguanheng66

Differential Revision: D23995953

fbshipit-source-id: a6224a67d54617ddf34c2392e65f2142c4e78ea4
2020-09-29 19:30:23 -07:00
c87ff2cb90 Enable transposed tensor copy for complex types (#45487)
Summary:
This enables a special copy operator for transposed tensors with more than 360 elements:
417e3f85e5/aten/src/ATen/native/Copy.cpp (L19)

Steps to repro: python -c "import torch; print(torch.svd(torch.randn(61, 61, dtype=torch.complex64)))"

Fixes https://github.com/pytorch/pytorch/issues/45269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45487

Reviewed By: anjali411

Differential Revision: D23984441

Pulled By: malfet

fbshipit-source-id: 10ce1d5f4425fb6de78e96adffd119e545b6624f
2020-09-29 19:22:05 -07:00
0a15646e15 CUDA RTX30 series support (#45489)
Summary:
I also opened a PR on cmake upstream: https://gitlab.kitware.com/cmake/cmake/-/merge_requests/5292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45489

Reviewed By: zhangguanheng66

Differential Revision: D23997844

Pulled By: ezyang

fbshipit-source-id: 4e7443dde9e70632ee429184f0d51cb9aa5a98b5
2020-09-29 18:19:23 -07:00
c1e6592964 Enable type-checking of torch.nn.quantized.* modules (#43110)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43029

I am not changing the following files in this PR:
* `torch/nn/quantized/dynamic/modules/rnn.py` due to https://github.com/pytorch/pytorch/issues/43072
* `torch/nn/quantized/modules/conv.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43110

Reviewed By: gchanan

Differential Revision: D23963258

Pulled By: ezyang

fbshipit-source-id: 0fb0fd13af283f6f7b3434e7bbf62165357d1f98
2020-09-29 18:14:29 -07:00
375a83e6c1 Annotate torch.utils.(tensorboard/show_pickle/hipify) (#44216)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44215

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44216

Reviewed By: gchanan

Differential Revision: D23963216

Pulled By: ezyang

fbshipit-source-id: b3fed51b2a1cbd05e3cd0222c89c38d61d8968c1
2020-09-29 18:14:26 -07:00
eb39542e67 Add typing annotations for torch.utils.data.* modules (#44136)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44135

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44136

Reviewed By: gchanan

Differential Revision: D23963273

Pulled By: ezyang

fbshipit-source-id: 939234dddbe89949bd8e5ff05d06f6c8add6935c
2020-09-29 18:12:05 -07:00
33aba57e4c Patch generate files for system protobuf (#44583)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42939

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44583

Reviewed By: albanD

Differential Revision: D23692639

Pulled By: ezyang

fbshipit-source-id: 49781f704dd6ceab7717b63225d0b4076ce33daa
2020-09-29 18:06:33 -07:00
22a34bcf4e ROCm ❤️ TensorExpr (#45506)
Summary:
This might be an alternative to reverting https://github.com/pytorch/pytorch/issues/45396.
The obvious rough edge is that I'm not really seeing the work group limits that TensorExpr produces.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45506

Reviewed By: zhangguanheng66

Differential Revision: D23991410

Pulled By: Krovatkin

fbshipit-source-id: 11d3fc4600e4bffb1d1192c6b8dd2fe22c1e064e
2020-09-29 16:52:16 -07:00
637570405b Disable multi tensor tests on ROCm (#45535)
Summary:
Disable multi tensor tests on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45535

Reviewed By: ngimel

Differential Revision: D24002557

Pulled By: izdeby

fbshipit-source-id: 608c9389e3d9cd7dac49ea42c9bb0af55662c754
2020-09-29 15:49:21 -07:00
06a566373a [PyTorch/NCCL] Fix async error handling (#45456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45456

Remove work while not holding the lock, to avoid a deadlock with the watchdog thread while the GPU is at 100% utilization.

SyncBatchNorm failure trace: P143879560

Test Plan:
**Desync test:**
BACKEND=nccl WORLD_SIZE=3 NCCL_ASYNC_ERROR_HANDLING=1 ./buck-out/gen/caffe2/test/distributed/distributed_nccl_spawn#binary.par -r test_DistributedDataParallel_desync

**SyncBatchNorm test:**
BACKEND=nccl WORLD_SIZE=3 NCCL_ASYNC_ERROR_HANDLING=1 ./buck-out/gen/caffe2/test/distributed/distributed_nccl_fork#binary.par -r test_DistributedDataParallel_SyncBatchNorm_Diff_Input_Sizes_gradient

Reviewed By: osalpekar

Differential Revision: D23972071

fbshipit-source-id: f03d9637a6ec998d64dab1a062a81e0f3697275f
2020-09-29 15:44:34 -07:00
ef41472544 Create experimental FX graph manipulation library (#44775)
Summary:
This PR adds a new GraphManipulation library for operating on GraphModule nodes.
It also adds an implementation of replace_target_nodes_with, which replaces all nodes in the GraphModule matching a specific op/target with a new specified op/target. An example use of this function would be replacing a generic operator with an optimized operator for specific sizes and shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44775

Reviewed By: jamesr66a

Differential Revision: D23874561

Pulled By: gcatron

fbshipit-source-id: e1497cd11e0bbbf1fabdf137d65c746248998e0b
2020-09-29 15:32:41 -07:00
d642992877 Quantized operators template selective (#45509)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45509

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44479

Test Plan: Imported from OSS

Reviewed By: dhruvbird

Differential Revision: D23626562

Pulled By: iseeyuan

fbshipit-source-id: c2fc8bad25f8e5e9a70eb1001b9066a711b8e8e7
2020-09-29 14:52:27 -07:00
ab5cf16b6c fix standard deviation gradient NaN behavior (#45468)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/4320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45468

Reviewed By: zhangguanheng66

Differential Revision: D23991064

Pulled By: albanD

fbshipit-source-id: d4274895f2dac8b2cdbd73e5276ce3df466fc341
2020-09-29 13:47:29 -07:00
18876b5722 Update backward formula for torch.dot and add backward definition for torch.vdot (#45074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45074

TODO: Add R -> C tests in https://github.com/pytorch/pytorch/pull/44744 (blocked on some JIT changes)

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D23975361

Pulled By: anjali411

fbshipit-source-id: 3512bd2962b588a198bc317673bd18cc96ac823f
2020-09-29 12:52:03 -07:00
147c88ef2d Add docs to a pytorch.github.io/doc/tag directory when repo is tagged (#45204)
Summary:
In coordination with jlin27.

This PR is meant to build documentation when the repo is tagged. For instance, tagging the repo with 1.7.0rc1 will push that commit's documentation to pytorch/pytorch.github.io/docs/1.7.

Subsequently tagging 1.7.0rc2 will override the 1.7 docs, as will 1.7.0, and 1.7.1. I think this is as it should be: there should be one, latest, version for the 1.7 docs. This can be tweaked differently if desired.

There is probably work that needs to be done to adjust the [versions.html](https://pytorch.org/docs/versions.html) to add the new tag?

Is there a way to test the tagging side of this without breaking the production documentation?

As an aside, the documentation is being built via the `pytorch_linux_xenial_py3_6_gcc5_4_build` image. Some projects are starting to move on from python3.6 since [it is in security-only support mode](https://devguide.python.org/#status-of-python-branches), no new binaries are being released.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45204

Reviewed By: zhangguanheng66

Differential Revision: D23996800

Pulled By: seemethere

fbshipit-source-id: a94a080348a47738c1de5832ab37b2b0d57d2d57
2020-09-29 12:31:30 -07:00
b66ac1e928 Updates nonzero's as_tuple behavior to no longer warn. (#45413)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44284.

[torch.nonzero](https://pytorch.org/docs/master/generated/torch.nonzero.html?highlight=nonzero#torch.nonzero) is distinct from [numpy.nonzero](https://numpy.org/doc/1.18/reference/generated/numpy.nonzero.html?highlight=nonzero#numpy.nonzero). The former returns a tensor by default, and the latter returns a tuple of arrays. The `as_tuple` argument was added as part of an intended deprecation process to make torch.nonzero consistent with numpy.nonzero, but this was a confusing change for users. A better deprecation path would be to offer torch.argwhere consistent with [numpy.argwhere](https://numpy.org/doc/stable/reference/generated/numpy.argwhere.html?highlight=argwhere#numpy.argwhere), which is equivalent to the default torch.nonzero behavior. Once this is offered, a change to torch.nonzero should be more straightforward with less user disruption, if we decide that's the correct change to pursue.
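
For reference, a minimal illustration of the two behaviors being kept:

```
import torch

t = torch.tensor([[0, 1], [2, 0]])
print(torch.nonzero(t))                 # tensor([[0, 1], [1, 0]]) -- one row per nonzero
print(torch.nonzero(t, as_tuple=True))  # (tensor([0, 1]), tensor([1, 0])) -- numpy-style
```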

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45413

Reviewed By: ngimel

Differential Revision: D23975015

Pulled By: mruberry

fbshipit-source-id: b59237d0d8c2df984e952b62d0a7c247b49d84dc
2020-09-29 12:16:59 -07:00
0df99ad470 Remove unnecessary __at_align32__ in int_elementwise_binary_256 (#45470)
Summary:
They were added in 4b3046ed286e92b5910769bf97f2bc6a1ad473d1 based on a
misunderstanding of `_mm256_storeu_si256`, but they
are actually unnecessary. The [documentation][1] for `_mm256_storeu_si256` says:

> Moves values from a integer vector to an **unaligned** memory location.

In this case, it's better to remove the `__at_align32__` qualifier to
give the compiler and linker more flexibility to optimize.

[1]: https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-vector-extensions/intrinsics-for-load-and-store-operations-1/mm256-storeu-si256.html

Close https://github.com/pytorch/pytorch/issues/44810

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45470

Reviewed By: zhangguanheng66

Differential Revision: D23980060

Pulled By: glaringlee

fbshipit-source-id: 12b3558b76c6e81d88a72081060fdb8674464768
2020-09-29 11:55:25 -07:00
6e55a26e10 Move mobile specific CPUCachingAllocator to c10/mobile folder. (#45364)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45364

Plus add some more comments about the usage, limitations and cons.

Test Plan: Build and run benchmark binary.

Reviewed By: gchanan

Differential Revision: D23944193

fbshipit-source-id: 30d4f4991d2185a0ab768d94c846d73730fc0835
2020-09-29 11:33:26 -07:00
b2925671b6 Updates deterministic flag to throw a warning, makes docs consistent (#45410)
Summary:
Per feedback in the recent design review. Also tweaks the documentation to clarify what "deterministic" means and adds a test for the behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45410

Reviewed By: ngimel

Differential Revision: D23974988

Pulled By: mruberry

fbshipit-source-id: e48307da9c90418fc6834fbd67b963ba2fe0ba9d
2020-09-29 11:17:33 -07:00
aa2bd7e1ae Conservative-ish persistent RNN heuristics for compute capability 8.0+ (#43165)
Summary:
Based on https://github.com/pytorch/pytorch/pull/43165#issuecomment-697033663 and tests by Vasily Volkov ([persistentRNN-speedup.xlsx](https://github.com/pytorch/pytorch/files/5298001/persistentRNN-speedup.xlsx)).  See comments in code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43165

Reviewed By: zhangguanheng66, mruberry

Differential Revision: D23991756

Pulled By: ngimel

fbshipit-source-id: 4c2c14c9002be2fec76fb21ba55b7dab79497510
2020-09-29 11:14:55 -07:00
f47fd0eb72 Updated cholesky_backward for complex inputs (#45267)
Summary:
Updated `cholesky_backward` to work correctly for complex input.
Note that the current implementation gives the conjugate of what JAX would return. anjali411, is that the correct thing to do?
Ref. https://github.com/pytorch/pytorch/issues/44895

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45267

Reviewed By: bwasti

Differential Revision: D23975269

Pulled By: anjali411

fbshipit-source-id: 9908b0bb53c411e5ad24027ff570c4f0abd451e6
2020-09-29 11:07:32 -07:00
15f85eea18 Support bfloat16 and complex dtypes for logical_not (#43537)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43537

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23751950

Pulled By: mruberry

fbshipit-source-id: d07ecd9aae263eb8e00928d4fc981e0d66066fbb
2020-09-29 11:00:05 -07:00
ea59251f51 Fix model_name not logged properly issue. (#45488)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45488

model_name logging was broken; the issue stems from the recent change that assigned the method name to the module name. This diff fixes it.
ghstack-source-id: 113103942

Test Plan:
Made sure that the model_name is now logged from module_->name().
Verified with one model that does not contain the model metadata; the model_name field is logged as below:

09-28 21:59:30.065 11530 12034 W module.cpp: TESTINGTESTING run() module = __torch__.Model
09-28 21:59:30.065 11530 12034 W module.cpp: TESTINGTESTING metadata does not have model_name assigning to __torch__.Model
09-28 21:59:30.066 11530 12034 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod log  model_name = __torch__.Model
09-28 21:59:30.066 11530 12034 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod log  method_name = labels
09-28 21:59:30.068 11530 12034 W MobileModuleQPLObserver.cpp: TESTINGTESTING onExitRunMethod()

Reviewed By: linbinyu

Differential Revision: D23984165

fbshipit-source-id: 5b00f50ea82106b695c2cee14029cb3b2e02e2c8
2020-09-29 10:37:36 -07:00
09b3e16b40 [JIT] Enable @unused syntax for ignoring properties (#45261)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45261

**Summary**
This commit enables `unused` syntax for ignoring
properties. Ignoring properties is more intuitive with this feature enabled.
`ignore` is not supported because class type properties cannot be
executed in Python the way an `ignored` function can (they exist only as
TorchScript types), and module properties that cannot be scripted
are not added to the `ScriptModule` wrapper in the first place, so they
already execute in Python.

**Test Plan**
This commit updates the existing unit tests for class type and module
properties to test properties ignored using `unused`.

Test Plan: Imported from OSS

Reviewed By: navahgar, Krovatkin, mannatsingh

Differential Revision: D23971881

Pulled By: SplitInfinity

fbshipit-source-id: 8d3cc1bbede7753d6b6f416619e4660c56311d33
2020-09-29 10:24:25 -07:00
5f49d14be2 Add mobile_optimized tag to optimized model. (#45479)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45479

Add a top level boolean attribute to the model called mobile_optimized that is set to true if it is optimized.
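
A hedged sketch of checking the new attribute (the attribute name comes from this summary; `optimize_for_mobile` is the existing optimizer entry point):

```
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

scripted = torch.jit.script(torch.nn.Linear(4, 4))
optimized = optimize_for_mobile(scripted)
print(getattr(optimized, "mobile_optimized", False))  # expected: True
```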

Test Plan: buck test //caffe2/test:mobile passes

Reviewed By: kimishpatel

Differential Revision: D23956728

fbshipit-source-id: 79c5931702208b871454319ca2ab8633596b1eb8
2020-09-29 10:06:57 -07:00
17be7c6e5c [vulkan][android][test_app] Add test_app variant that runs module on Vulkan (#44897)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44897

Test Plan: Imported from OSS

Reviewed By: dreiss

Differential Revision: D23763770

Pulled By: IvanKobzarev

fbshipit-source-id: 6ad16b7271c745313a71da64a629a764258bbc85
2020-09-29 10:00:46 -07:00
2c300fd74c [android][vulkan] Module load argument to specify device cpu/vulkan (#44896)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44896

Test Plan: Imported from OSS

Reviewed By: dreiss

Differential Revision: D23763771

Pulled By: IvanKobzarev

fbshipit-source-id: 990a386ad13c704f03345dbe09e180281af913c9
2020-09-29 09:58:22 -07:00
fe9019cbfe Reorganized Sorting.cpp method order (#45083)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45083

This PR just reorders the methods in Sorting.cpp placing related methods next to each other.

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D23908817

Pulled By: heitorschueroff

fbshipit-source-id: 1dd7b693b5135fddf5dff12303474e85ce0c2f83
2020-09-29 09:49:31 -07:00
ab5edf21b0 Revert D23789657: [wip] fast typeMeta/ScalarType conversion approach 2
Test Plan: revert-hammer

Differential Revision:
D23789657 (1ed1a2f5b0)

Original commit changeset: 5afdd52d24bd

fbshipit-source-id: 6d827be8895bcb39c8e85342eee0f7a3f5056c76
2020-09-29 09:40:53 -07:00
b3135c2056 Enable torch.cuda.amp typechecking (#45480)
Summary:
Fix `torch._C._autocast_*_nesting` declarations in __init__.pyi

Fix iterable constructor logic: not every iterable can be constructed using the `type(val)(val)` trick; for example, it does not work for `val = range(10)`, even though `isinstance(val, Iterable)` is True (see the illustration below).
Change optional resolution logic to meet mypy expectations

Fixes https://github.com/pytorch/pytorch/issues/45436
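
A minimal illustration of the iterable issue called out above (plain Python):

```
from collections.abc import Iterable

val = range(10)
print(isinstance(val, Iterable))  # True
try:
    type(val)(val)                # range(range(10))
except TypeError as e:
    print(e)                      # 'range' object cannot be interpreted as an integer
```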

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45480

Reviewed By: walterddr

Differential Revision: D23982822

Pulled By: malfet

fbshipit-source-id: 6418a28d04ece1b2427dcde4b71effb67856a872
2020-09-29 09:31:55 -07:00
df0de780c3 Add cusolver guard for cuda >= 10.1.243 (#45452)
Summary:
See https://github.com/pytorch/pytorch/issues/45403

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45452

Reviewed By: mruberry

Differential Revision: D23977009

Pulled By: ngimel

fbshipit-source-id: df66425773d7500fa37e64d5e4bcc98167016be3
2020-09-29 09:25:20 -07:00
bb19a55429 Improves fft doc consistency and makes deprecation warnings more prominent (#45409)
Summary:
This PR makes the deprecation warnings for existing fft functions more prominent and makes the torch.stft deprecation warning consistent with our current deprecation planning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45409

Reviewed By: ngimel

Differential Revision: D23974975

Pulled By: mruberry

fbshipit-source-id: b90d8276095122ac3542ab625cb49b991379c1f8
2020-09-29 09:07:49 -07:00
0a38aed025 Auto set libuv_ROOT env var for Gloo submodule on Windows platform (#45484)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45484

Reviewed By: lw

Differential Revision: D23990724

Pulled By: mrshenli

fbshipit-source-id: 1987ce7eb7d3f9d3120c07e954cd6581cd3caf59
2020-09-29 08:58:56 -07:00
6d37126a10 Makes rdiv consistent with div (#45407)
Summary:
In addition to making rdiv consistent with div, this PR significantly expands division testing, accounting for floor_divide actually performing truncation division, too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45407

Reviewed By: ngimel

Differential Revision: D23974967

Pulled By: mruberry

fbshipit-source-id: 82b46b07615603f161ab7cd1d3afaa6d886bfe95
2020-09-29 08:34:01 -07:00
7cde662f08 Add check for Complex Type to allow non integral alpha. (#45200)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45184

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45200

Reviewed By: gchanan

Differential Revision: D23940134

Pulled By: anjali411

fbshipit-source-id: cce7b1efc22ec189ba6c83e31ce712bb34997139
2020-09-29 07:36:46 -07:00
0806c58e9f Optimize view_as_complex and view_as_real (#44908)
Summary:
This avoids unnecessary memory allocations in `view_as_complex` and `view_as_real`. I construct the new tensor directly with the existing storage to avoid creating a new storage object and also use `DimVector`s to avoid allocating for the sizes and strides. Overall, this saves about 2 us of overhead from `torch.fft.fft` which currently has to call `view_as_real` and `view_as_complex` for every call.

I've used this simple benchmark to measure the overhead:
```python
In [1]: import torch
   ...: a = torch.rand(1, 2)
   ...: ac = torch.view_as_complex(a)
   ...: %timeit torch.view_as_real(ac)
   ...: %timeit torch.view_as_complex(a)
   ...: %timeit ac.real
```

Results before:
```
2.5 µs ± 62.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
2.22 µs ± 36 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.17 µs ± 8.76 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

and after:
```
1.83 µs ± 9.26 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.57 µs ± 7.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
3.47 µs ± 34.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44908

Reviewed By: agolynski

Differential Revision: D23793479

Pulled By: anjali411

fbshipit-source-id: 64b9cad70e3ec10891310cbfa8c0bdaa1d72885b
2020-09-29 07:30:38 -07:00
87f98a5b54 Updates torch.floor_divide documentation to clarify it's actually torch.trunc_divide (or torch.rtz_divide) (#45411)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/43874 for 1.7. 1.8 will need to take floor_divide through a proper deprecation process.
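
A minimal illustration of the mismatch being documented (behavior as of this release):

```
import torch

print(torch.floor_divide(torch.tensor(-5), torch.tensor(2)))  # tensor(-2): truncation toward zero
print(-5 // 2)                                                 # -3: Python's floor division
```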

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45411

Reviewed By: ngimel

Differential Revision: D23974997

Pulled By: mruberry

fbshipit-source-id: 16dd07e50a17ac76bfc93bd6b71d4ad72d909bf4
2020-09-29 05:55:44 -07:00
37f9af7f29 Missing tests about torch.xxx(out=...) (#44465)
Summary:
PR opened just to run the CI tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44465

Reviewed By: ngimel

Differential Revision: D23907565

Pulled By: mruberry

fbshipit-source-id: 620661667877f1e9a2bab17d19988e2dc986fc0f
2020-09-29 04:54:46 -07:00
56af122659 Revert D23966878: [pytorch][PR] This PR flips a switch to enable PE + TE
Test Plan: revert-hammer

Differential Revision:
D23966878 (dddb685c11)

Original commit changeset: 2010a0b07c59

fbshipit-source-id: 132556039730fd3e4babd0d7ca8daf9c8d14f728
2020-09-29 04:33:19 -07:00
1ed1a2f5b0 [wip] fast typeMeta/ScalarType conversion approach 2 (#44965)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44965

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23789657

Pulled By: bhosmer

fbshipit-source-id: 5afdd52d24bd097891ff4a7313033f7bd400165e
2020-09-29 02:39:36 -07:00
489af4ddcb [quant] Add quant APIs to save/load observer state_dict (#44846)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44846

The save function traverses the model state dict to pick out the observer stats; the
load function traverses the module hierarchy to load the state dict into module attributes depending on the observer type.
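
A hedged sketch of the round trip, assuming the helper names this PR appears to add (`get_observer_state_dict` / `load_observer_state_dict` under `torch.quantization`):
```python
import torch
import torch.quantization as tq

model = torch.nn.Sequential(torch.nn.Linear(4, 4))
model.qconfig = tq.get_default_qconfig("fbgemm")
prepared = tq.prepare(model)
prepared(torch.randn(8, 4))  # calibration pass populates the observers

# Save only the observer statistics (assumed API from this PR).
obs_state = tq.get_observer_state_dict(prepared)

# Restore them into a freshly prepared copy of the same architecture.
fresh = torch.nn.Sequential(torch.nn.Linear(4, 4))
fresh.qconfig = tq.get_default_qconfig("fbgemm")
fresh = tq.prepare(fresh)
tq.load_observer_state_dict(fresh, obs_state)
```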

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_save_observer_state_dict

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23746821

fbshipit-source-id: 05c571b62949a2833602d736a81924d77e7ade55
2020-09-29 01:52:42 -07:00
bb478810e0 [quant] torch.max_pool1d (#45152)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45152

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23846473

Pulled By: z-a-f

fbshipit-source-id: 38fd611e568e4f8b39b7a00adeb42c7b99576360
2020-09-29 01:45:22 -07:00
b86008ab75 [TensorExpr] Remove buf_ field from class Tensor. (#45390)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45390

Tensor objects should always refer to their Function's bufs. Currently
we never create a Tensor with a buffer different from that of its function,
but having it in two places seems incorrect and dangerous.

Differential Revision: D23952865

Test Plan: Imported from OSS

Reviewed By: nickgg

Pulled By: ZolotukhinM

fbshipit-source-id: e63fc26d7078427514649d9ce973b74ea635a94a
2020-09-29 01:21:57 -07:00
3c33695a6d [TensorExpr] Rename Buffer to Placeholder. (#45389)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45389

Differential Revision: D23952866

Test Plan: Imported from OSS

Reviewed By: nickgg

Pulled By: ZolotukhinM

fbshipit-source-id: 17eedd3ac17897501403482ac1866c569d247c75
2020-09-29 01:21:54 -07:00
92306b85d5 [TensorExpr] Consolidate {buffer,function,tensor}.{h.cpp} in tensor.{h,cpp}. (#45388)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45388

Classes defined in these files are closely related, so it is reasonable
to have them all in one file. The change is purely a code move.

Differential Revision: D23952867

Test Plan: Imported from OSS

Reviewed By: nickgg

Pulled By: ZolotukhinM

fbshipit-source-id: 12cfaa968bdfc4dff00509e34310a497c7b59155
2020-09-29 01:17:10 -07:00
d2623da52c replaced whitelist with allowlist (#45260)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41754

**(1)**
Initially the file was named **gen_op_registration_whitelist.py**; I renamed it to **gen_op_registration_allowlist.py**

**(2)**
There were some occurrences of **whitelist** in comments inside the file; I changed them to **allowlist**
![update1](https://user-images.githubusercontent.com/62737243/94106752-b296e780-fe59-11ea-8541-632a1dbf90d6.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45260

Reviewed By: dhruvbird

Differential Revision: D23947182

Pulled By: ljk53

fbshipit-source-id: 31b486592451dbb0605d7950e07747cbb72ab80f
2020-09-29 00:27:46 -07:00
8c309fc052 Add more tests for mt optimizers (#45475)
Summary:
Add more test cases for mt optimizers and fix Adam/AdamW

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45475

Reviewed By: soumith

Differential Revision: D23982727

Pulled By: izdeby

fbshipit-source-id: 4b24d37bd52a2fa3719d3e3a5dcf3b96990b0f5b
2020-09-28 23:59:58 -07:00
6bdb871d47 [FX] Lint pass for Graphs (#44973)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44973

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23792631

Pulled By: jamesr66a

fbshipit-source-id: d8faef0c311d8bd611ba0a7e1e2f353e3e5a1068
2020-09-28 23:00:32 -07:00
b0bdc82a00 [FX][EZ] Fix bug where copying node made non-unique name (#45311)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45311

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Differential Revision: D23917864

Pulled By: jamesr66a

fbshipit-source-id: 10d0a4017ffe160bce4ba0d830e035616bbded74
2020-09-28 22:55:20 -07:00
417e3f85e5 Support tuple inputs in NN Module test (#44853)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44853

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23750441

Pulled By: glaringlee

fbshipit-source-id: 1b111a370a726b40521134b711c35f48dda99411
2020-09-28 22:05:05 -07:00
dddb685c11 This PR flips a switch to enable PE + TE (#45396)
Summary:
This PR flips a switch to enable PE + TE
next PR: https://github.com/pytorch/pytorch/pull/45397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45396

Reviewed By: suo

Differential Revision: D23966878

Pulled By: Krovatkin

fbshipit-source-id: 2010a0b07c595992a88b3fe0792d6af315cf421e
2020-09-28 21:57:50 -07:00
50b91103a9 add self cuda time to avoid double/quadruple counting (#45209)
Summary:
In the profiler, CUDA did not report self time, so for composite functions there was no way to determine which function was really taking the time. In addition, the reported "total CUDA time" was frequently more than the total wallclock time. This PR adds "self CUDA time" to the profiler and computes total CUDA time based on self CUDA time, similar to how it is done for CPU. Also, slight formatting changes make the table more compact. Before:
```
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                  Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
aten::matmul          0.17%            890.805us        99.05%           523.401ms        5.234ms          49.91%           791.184ms        7.912ms          100
aten::mm              98.09%           518.336ms        98.88%           522.511ms        5.225ms          49.89%           790.885ms        7.909ms          100
aten::t               0.29%            1.530ms          0.49%            2.588ms          25.882us         0.07%            1.058ms          10.576us         100
aten::view            0.46%            2.448ms          0.46%            2.448ms          12.238us         0.06%            918.936us        4.595us          200
aten::transpose       0.13%            707.204us        0.20%            1.058ms          10.581us         0.03%            457.802us        4.578us          100
aten::empty           0.14%            716.056us        0.14%            716.056us        7.161us          0.01%            185.694us        1.857us          100
aten::as_strided      0.07%            350.935us        0.07%            350.935us        3.509us          0.01%            156.380us        1.564us          100
aten::stride          0.65%            3.458ms          0.65%            3.458ms          11.527us         0.03%            441.258us        1.471us          300
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 528.437ms
CUDA time total: 1.585s

Recorded timeit time:  789.0814 ms

```
Note that the recorded timeit time (with proper CUDA syncs) is half the "CUDA time total" reported by the profiler

After
```
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
        aten::matmul         0.15%     802.716us        99.06%     523.548ms       5.235ms     302.451us         0.04%     791.151ms       7.912ms           100
            aten::mm        98.20%     519.007ms        98.91%     522.745ms       5.227ms     790.225ms        99.63%     790.848ms       7.908ms           100
             aten::t         0.27%       1.406ms         0.49%       2.578ms      25.783us     604.964us         0.08%       1.066ms      10.662us           100
          aten::view         0.45%       2.371ms         0.45%       2.371ms      11.856us     926.281us         0.12%     926.281us       4.631us           200
     aten::transpose         0.15%     783.462us         0.22%       1.173ms      11.727us     310.016us         0.04%     461.282us       4.613us           100
         aten::empty         0.11%     591.603us         0.11%     591.603us       5.916us     176.566us         0.02%     176.566us       1.766us           100
    aten::as_strided         0.07%     389.270us         0.07%     389.270us       3.893us     151.266us         0.02%     151.266us       1.513us           100
        aten::stride         0.60%       3.147ms         0.60%       3.147ms      10.489us     446.451us         0.06%     446.451us       1.488us           300
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 528.498ms
CUDA time total: 793.143ms

Recorded timeit time:  788.9832 ms

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45209

Reviewed By: zou3519

Differential Revision: D23925491

Pulled By: ngimel

fbshipit-source-id: 7f9c49238d116bfd2db9db3e8943355c953a77d0
2020-09-28 21:51:13 -07:00
35596d39e9 Coalesce TLS accesses in RecordFunction constructor (#44970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44970

Right now, when RecordFunction is not active (the usual case),
we do two TLS accesses (a check for thread-local callbacks, and a check for
a thread-local boolean).
This experiments with reducing the number of TLS accesses in the RecordFunction
constructor.

Test Plan: record_function_benchmark

Reviewed By: dzhulgakov

Differential Revision: D23791165

Pulled By: ilia-cher

fbshipit-source-id: 6137ce4bface46f540ece325df9864fdde50e0a4
2020-09-28 21:42:23 -07:00
5a6a31168f add circle ci job name dimension to report test stats (#45457)
Summary:
To support anomaly detection for test time spikes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45457

Reviewed By: malfet

Differential Revision: D23975628

Pulled By: walterddr

fbshipit-source-id: f28d0f12559070004d637d5bde83289f029b15b8
2020-09-28 20:51:58 -07:00
5be954b502 Fix WorkerInfo link format (#45476)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45476

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D23982069

Pulled By: mrshenli

fbshipit-source-id: 6d932e77c1941dfd96592b388353f0fc8968dde6
2020-09-28 20:48:15 -07:00
8e47fcba5f Update docs for RPC async_execution (#45458)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45458

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23973366

Pulled By: mrshenli

fbshipit-source-id: 3697f07fa972db21746aa25eaf461c1b93293f58
2020-09-28 20:48:12 -07:00
c5ade5f698 Fix no_sync docs (#45455)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45455

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23973365

Pulled By: mrshenli

fbshipit-source-id: 87c9878cdc7310754670b83efa65ae6f877f86fb
2020-09-28 20:48:09 -07:00
6967e6295e Fix DDP docs (#45454)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45454

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23973367

Pulled By: mrshenli

fbshipit-source-id: 11f20d51d0d0f92f199e4023f02b86623867bae0
2020-09-28 20:43:22 -07:00
534f2ae582 Disable inplace abs for complex tensors (#45069)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45069

`torch.abs` is a `C -> R` function for complex input. Following the general semantics in torch, the in-place version of abs should be disabled for complex input.
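
A small sketch of the resulting semantics: the out-of-place `abs` still maps complex to real, while the in-place variant is rejected because it would change the dtype.
```python
import torch

z = torch.tensor([3 + 4j])
print(torch.abs(z))  # tensor([5.]) -- C -> R, out-of-place is fine

try:
    z.abs_()  # in-place abs would change complex -> real, so it is disallowed
except RuntimeError as e:
    print("in-place abs on complex raises:", e)
```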

Test Plan: Imported from OSS

Reviewed By: glaringlee, malfet

Differential Revision: D23818397

Pulled By: anjali411

fbshipit-source-id: b23b8d0981c53ba0557018824d42ed37ec13d4e2
2020-09-28 20:33:35 -07:00
208df1aeb8 Use python 3.8 in pytorch docker image (#45466)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45466

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D23975294

Pulled By: tierex

fbshipit-source-id: 964de7928b541121963e9de792630bcef172bb5c
2020-09-28 19:21:40 -07:00
8c66cd120b Disable complex inputs to torch.round (#45330)
Summary:
- Related to https://github.com/pytorch/pytorch/issues/44612
- Disable complex inputs to `torch.round`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45330

Reviewed By: gchanan

Differential Revision: D23970781

Pulled By: anjali411

fbshipit-source-id: b8c9ac315ae0fc872701aa132367c3171fd56185
2020-09-28 19:07:01 -07:00
0c8a6008ac Fix torch.pow when the scalar base is a complex number (#45259)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43829
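
A minimal sketch of the expected behavior after this fix (a complex Python scalar base with a real tensor exponent promotes to a complex result instead of erroring):
```python
import torch

exponents = torch.tensor([1.0, 2.0, 3.0])
print(torch.pow(2j, exponents))  # expected: tensor([0.+2.j, -4.+0.j, -0.-8.j])
```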

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45259

Reviewed By: gchanan

Differential Revision: D23962073

Pulled By: anjali411

fbshipit-source-id: 1b16afbb98f33fa7bc53c6ca296c5ddfcbdd2b72
2020-09-28 18:25:53 -07:00
a0f0cb1608 [JIT] Add test for ignored class type property (#45233)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45233

**Summary**
This commit modifies `TestClassType.test_properties` to check that
properties on class types can be ignored with the same syntax as
ignoring properties on `Modules`.

**Test Plan**
`python test/test_jit.py TestClassType.test_properties`

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23971885

Pulled By: SplitInfinity

fbshipit-source-id: f2228f61fe26dff219024668cc0444a2baa8834c
2020-09-28 18:22:19 -07:00
4af4b71fdc [JIT] Update docs for recently added features (#45232)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45232

**Summary**
This commit updates the TorchScript language reference to include
documentation on recently-added TorchScript enums. It also removed
`torch.no_grad` from the list of known unsupported `torch` modules and
classes because it is now supported.

**Test Plan**
Continuous integration.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23971884

Pulled By: SplitInfinity

fbshipit-source-id: 5e2c164ed59bc0926b11201106952cff86e9356e
2020-09-28 18:17:42 -07:00
52cbc9e4ec [TensorExpr] Always inline and DCE in the LLVM backend (#45445)
Summary:
Inline pytorch into wrapper, which is especially helpful in combination
with dead code elimination to reduce IR size and compilation times when
a lot of parameters are unused.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45445

Test Plan: CI

Reviewed By: ZolotukhinM

Differential Revision: D23969009

Pulled By: asuhan

fbshipit-source-id: a21509d07e4c130b6aa6eae5236bb64db2748a3d
2020-09-28 18:11:13 -07:00
7ac872b934 [JIT] Modify to_backend API so that it accepts wrapped modules (#43612)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43612

**Summary**
This commit modifies the `torch._C._jit_to_backend` function so that it
accepts `ScriptModules` as inputs. It already returns `ScriptModules`
(as opposed to C++ modules), so this makes sense and makes the API more
intuitive.

**Test Plan**
Continuous integration, which includes unit tests and out-of-tree tests
for custom backends.

**Fixes**
This commit fixes #41432.

Test Plan: Imported from OSS

Reviewed By: suo, jamesr66a

Differential Revision: D23339854

Pulled By: SplitInfinity

fbshipit-source-id: 08ecef729c4e1e6bddf3f483276947fc3559ea88
2020-09-28 17:17:01 -07:00
5855aa8dac Type check quasirandom (#45434)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42978.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45434

Reviewed By: walterddr

Differential Revision: D23967139

Pulled By: ajitmaths

fbshipit-source-id: bcee6627f367fd01aa9a5c10a7c24331fc1823ad
2020-09-28 16:49:38 -07:00
49b198c454 type check for torch.testing._internal.common_utils (#45375)
Summary:
part of torch.testing._internal.* effort

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45375

Reviewed By: malfet

Differential Revision: D23964315

Pulled By: walterddr

fbshipit-source-id: efdd643297f5c7f75670ffe60ff7e82fc413d18d
2020-09-28 16:28:46 -07:00
96f8755034 Fixed handling of nan for evenly_distribute_backward (#45280)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45280

Performance is the same on CPU and on CUDA is only 1-1.05x slower. This change is necessary for the future nan ops including nan(min|max|median)

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D23908796

Pulled By: heitorschueroff

fbshipit-source-id: c2b57acbe924cfa59fbd85216811f29f4af05088
2020-09-28 15:57:02 -07:00
6a206df891 20000x faster audio conversion for SummaryWriter (#44201)
Summary:
Stumbled upon a little gem in the audio conversion for `SummaryWriter.add_audio()`: two Python `for` loops to convert a float array to little-endian int16 samples. On my machine, this took 35 seconds for a 30-second 22.05 kHz excerpt. The same can be done directly in numpy in 1.65 milliseconds. (No offense, I'm glad that the functionality was there!)
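
A hedged sketch of the vectorized approach (`float_to_int16_pcm` is an illustrative name, not the actual function in the PR):
```python
import numpy as np

def float_to_int16_pcm(samples: np.ndarray) -> bytes:
    # Clip to [-1, 1], scale to the int16 range, and store little-endian ("<i2").
    clipped = np.clip(samples, -1.0, 1.0)
    return (clipped * np.iinfo(np.int16).max).astype("<i2").tobytes()

# 30 seconds of a 440 Hz tone at 22.05 kHz converts in milliseconds.
t = np.arange(22050 * 30) / 22050.0
pcm = float_to_int16_pcm(np.sin(2 * np.pi * 440 * t).astype(np.float32))
```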

Would also be ready to extend this to support stereo waveforms, or should this become a separate PR?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44201

Reviewed By: J0Nreynolds

Differential Revision: D23831002

Pulled By: edward-io

fbshipit-source-id: 5c8f1ac7823d1ed41b53c4f97ab9a7bac33ea94b
2020-09-28 15:44:29 -07:00
e54e1fe51e [package] Add dependency viz (#45214)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45214

When in verbose mode the package exporter will produce an html visualization
of dependencies of a module to make it easier to trim out unneeded code,
or debug inclusion of things that cannot be exported.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D23873525

Pulled By: zdevito

fbshipit-source-id: 6801991573d8dd5ab8c284e09572b36a35e1e5a4
2020-09-28 15:38:41 -07:00
331ebaf7cb [Distributed] Adding Python tests for the TCPStore getNumKeys and deleteKey (#45402)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45402

Previous diffs in this stack implemented the getNumKeys and deleteKey
APIs in the c10d Store as well as added tests at the C++ layer. This diff adds
tests at the Python level in test_c10d.py
ghstack-source-id: 112997161

Test Plan: Running these new python tests as well as previous C++ tests

Reviewed By: mrshenli

Differential Revision: D23955729

fbshipit-source-id: c7e0af7c884de2d488320e2a1d94aec801a782e5
2020-09-28 15:35:24 -07:00
6b65b3cbd8 [Distributed] DeleteKey API for c10d TCP Store (#45401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45401

Added a DeleteKey API for the TCP Store
ghstack-source-id: 112997162
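
A hedged usage sketch of the new Store APIs from this stack (single-process server store for illustration):
```python
import torch.distributed as dist

store = dist.TCPStore("127.0.0.1", 29500, 1, True)  # host, port, world_size, is_master
store.set("first_key", "value")
print(store.num_keys())               # may include an internal coordination key
print(store.delete_key("first_key"))  # True if the key existed and was removed
```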

Test Plan:
Modified the existing get/set test to use delete. verified that the
correct keys were deleted and that the numKeys API returned the right values

Reviewed By: mrshenli

Differential Revision: D23955730

fbshipit-source-id: 5c9f82be34ff4521c59f56f8d9c1abf775c67f9f
2020-09-28 15:30:39 -07:00
190f91e3db Adding Histogram Binning Calibration to DSNN and Adding Type Double to Caffe2 ParallelSumOp/SumReluOp
Summary: As title.

Test Plan:
FBL job without this diff failed:
f221545832

Error message:
```
NonRetryableException: AssertionError: Label is missing in training stage for HistogramBinningCalibration
```

FBL job with canary package built in this diff is running without failure:
f221650379

Reviewed By: chenshouyuan

Differential Revision: D23959508

fbshipit-source-id: c077230de29f7abfd092c84747eaabda0b532bcc
2020-09-28 15:21:31 -07:00
1097fe0088 Remove CriterionTest.test_cuda code for dtype None. (#45316)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45316

It's never used.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23919449

Pulled By: gchanan

fbshipit-source-id: f9aaeeabf3940389156bfc01bc3118d348ca4cf6
2020-09-28 15:08:09 -07:00
a4486fe7ba [ROCm] Print name irrespective of seq number assignment for roctx traces (#45229)
Summary:
Recent changes to the seq_num correlation behavior in the profiler (PR https://github.com/pytorch/pytorch/issues/42565) changed the behavior of emit_nvtx(record_shapes=True), which no longer prints the name of the operator properly.

This PR dumps the name in roctx traces irrespective of the sequence number assignment, for ROCm only.

cc: jeffdaily sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45229

Reviewed By: zou3519

Differential Revision: D23932902

Pulled By: albanD

fbshipit-source-id: c782667ff002b70b51f1cc921afd1b1ac533b39d
2020-09-28 15:03:47 -07:00
c6b7eeb654 Gh/taylorrobie/timer cleanup (#45361)
Summary:
This PR cleans up some of the rough edges around `Timer` and `Compare`
* Moves `Measurement` to be dataclass based
* Adds a bunch of type annotations. MyPy is now happy.
* Allows missing entries in `Compare`. This is one of the biggest usability issues with `Compare` right now, both from an API perspective and because the current failure mode is really unpleasant.
* Greatly expands the testing of `Compare`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45361

Test Plan: Changes to Timer are covered under existing tests, changes to `Compare` are covered by the expanded `test_compare` method.

Reviewed By: bwasti

Differential Revision: D23966816

Pulled By: robieta

fbshipit-source-id: 826969f73b42f72fa35f4de3c64d0988b61474cd
2020-09-28 14:56:43 -07:00
a77d633db1 [ONNX] Fix view for dynamic input shape (#43558)
Summary:
Export of the view op with a dynamic input shape is broken when using tensors with a 0-dim.
This fix removes the symbolic's use of the static input size to fix the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43558

Reviewed By: ailzhang

Differential Revision: D23965090

Pulled By: bzinodev

fbshipit-source-id: 628e9d7ee5d53375f25052340ca6feabf7ba7c53
2020-09-28 14:46:51 -07:00
5d1fee23b3 Remove convert_target from NN tests. (#45291)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45291

It's not necessary, you can just check if the dtype is integral.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23911963

Pulled By: gchanan

fbshipit-source-id: 230139e1651eb76226f4095e31068dded30e03e8
2020-09-28 14:21:42 -07:00
986af53be2 type check for torch.testing._internalcodegen:* (#45368)
Summary:
part of `torch.testing._internal.*` effort

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45368

Reviewed By: malfet

Differential Revision: D23950512

Pulled By: walterddr

fbshipit-source-id: 399f712d12cdd9795b0136328f512c3f86a15f24
2020-09-28 14:04:52 -07:00
7a4c417ed3 Fix typo (#45379)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45379

Registeres -> Registers in reducer.h.
ghstack-source-id: 112982279

Test Plan: N/A

Reviewed By: mrshenli

Differential Revision: D23951203

fbshipit-source-id: 96c7dc2e1e12c132339b9ac83ce1da52c812740c
2020-09-28 14:02:01 -07:00
57c18127dc [ONNX] Update div export to perform true divide (#44831)
Summary:
Related: https://github.com/pytorch/pytorch/issues/43787

Now that PyTorch div actually performs true division, update the ONNX export code to stay consistent.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44831

Reviewed By: eellison

Differential Revision: D23880316

Pulled By: bzinodev

fbshipit-source-id: 3bb8db34142ac4fed4039295ad3c4cb79487987f
2020-09-28 13:53:43 -07:00
9163e8171e Adding Type Double to Caffe2 Mean Op
Summary: Adding support for type double to caffe2 MeanOp and MeanGradientOp.

Test Plan:
All tests passed.

Example FBL job failed without this diff:
f221169563

Error message:
```
c10::Error: [enforce fail at mean_op.h:72] . Mean operator only supports 32-bit float, but input was of type double (Error from operator:
input: "dpsgd_8/Copy_3" input: "dpsgd_8/Copy_4" output: "dpsgd_8/Mean_2" name: "" type: "Mean" device_option { device_type: 0 device_id: 0 })
```

Example FBL job is running without failure with the canary package built from this diff:
f221468723

Reviewed By: chenshouyuan

Differential Revision: D23956222

fbshipit-source-id: 6c81bbc390d812ae0ac235e7d025141c8402def1
2020-09-28 13:35:29 -07:00
47debdca42 Document change for DDP enabled on Windows platform (#45392)
Summary:
Document change for DDP enabled on Windows platform

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45392

Reviewed By: gchanan

Differential Revision: D23962344

Pulled By: mrshenli

fbshipit-source-id: 8924c6ca36d68699871d8add3e0aab6542ea269c
2020-09-28 13:22:42 -07:00
722faeb2a4 [RELAND] Added optimizers based on multi tensor apply (#45408)
Summary:
Original PR https://github.com/pytorch/pytorch/pull/45299. The present PR fixes minor bugs that caused the revert.

Adding a new namespace `torch.optim._multi_tensor` with a bunch of updated optimizers. Those optimizers are using _foreach APIs which improve performance significantly.

### Tests
- updated existing tests to use both optimizers
- added `test_multi_tensor_optimizers` test to verify correctness.

### Perf results

**Adam**
timeit: 42.69 ms --> 10.16 ms
autorange: 41.96 ms --> 10.28 ms

**AdamW**
timeit: 51.38 ms --> 15.63 ms
autorange: 50.82 ms --> 16.07 ms

**SGD**
timeit: 6.28 ms --> 4.40 ms
autorange: 6.13 ms --> 4.73 ms

**RMSprop**
timeit: 28.63 ms --> 5.89 ms
autorange: 28.27 ms -->  5.76 ms

**Rprop**
timeit: 213.30 --> 178.42
autorange: 212.03 --> 178.03

**ASGD**
timeit: 21.67 --> 9.33
autorange: 21.64 --> 9.27

**Adamax**
timeit: 55.60 --> 48.29
autorange: 55.22 -> 49.13

**Perf script used**

```
import torch
import time
import torch.optim as optim
from torch.autograd import Variable
from torch.optim.lr_scheduler import ExponentialLR, ReduceLROnPlateau, StepLR
import torch.nn as nn
import time
import torchvision
import torch.utils._benchmark as benchmark_utils

device = "cuda"
model = torchvision.models.resnet.resnet101(pretrained=True).to(device)
targets = torch.randint(0, 1000, (100, 100), device=device)
criterion = nn.CrossEntropyLoss()

optimizer = optim.SGD(model.parameters(), lr=1e-3) # <----------------------- optimizer.
                                                          # would compare optim.SGD vs optim._multi_tensor.SGD
running_loss = 0.0
target = torch.empty(128, dtype=torch.long, device=device).random_(5)

optimizer.zero_grad()
inputs = torch.rand(128, 3, 100, 100, device=device , requires_grad=True)
outputs = model(inputs)
loss = criterion(outputs, target)
loss.backward()
optimizer.step()
running_loss += loss.item()

def main():
    timer = benchmark_utils.Timer(
        stmt="optimizer.step()",
        globals=globals(),
        label="str(optimizer)",
    )

    for i in range(1):
        print(f"Run: {i}\n{'-' * 40}")
        print(f"timeit:\n{timer.timeit(1000)}\n")
        print(f"autorange:\n{timer.blocked_autorange()}\n\n")

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45408

Reviewed By: gchanan

Differential Revision: D23956680

Pulled By: izdeby

fbshipit-source-id: c5eab7bf5fce14a287c15cead1cdc26e42cfed94
2020-09-28 13:14:04 -07:00
87b356d093 [static runtime] Split out graph preparation from runtime (#44131)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44131

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604305

Pulled By: bwasti

fbshipit-source-id: 7b47da4961d99074199417ef1407a788c7d80ee6
2020-09-28 13:01:23 -07:00
6ab1c0b1ca Disable a few tests in preparation to enabling PE+TE (#44815)
Summary:
Disable a few tests in preparation to enabling PE+TE
Next PR: https://github.com/pytorch/pytorch/pull/45396

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44815

Reviewed By: ZolotukhinM

Differential Revision: D23948445

Pulled By: Krovatkin

fbshipit-source-id: 93e641b7b8a3f13bd3fd3840116076553408f224
2020-09-28 12:55:12 -07:00
36c3fbc9e3 CUDA BFloat Conv (non-cuDNN) (#45007)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45007

Reviewed By: zou3519

Differential Revision: D23933174

Pulled By: ngimel

fbshipit-source-id: 84eb028f09c9197993fb9981c0efb535014e5f78
2020-09-28 11:42:42 -07:00
03342af3a3 Add env variable to bypass CUDACachingAllocator for debugging (#45294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45294

While tracking down a recent memory corruption bug we found that
cuda-memcheck wasn't finding the bad accesses, and ngimel pointed out that
it's because we use a caching allocator so a lot of "out of bounds" accesses
land in a valid slab.

This PR adds a runtime knob (`PYTORCH_NO_CUDA_MEMORY_CACHING`) that, when set,
bypasses the caching allocator's caching logic so that allocations go straight
to cudaMalloc.  This way, cuda-memcheck will actually work.

Test Plan:
Insert some memory errors and run a test under cuda-memcheck;
observe that cuda-memcheck flags an error where expected.

Specifically I removed the output-masking logic here:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/cuda_codegen.cpp#L819-L826

And ran:
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 cuda-memcheck pytest -k test_superslomo test_jit_fuser_te.py
```

Reviewed By: ngimel

Differential Revision: D23964734

Pulled By: bertmaher

fbshipit-source-id: 04efd11e8aff037b9edde80c70585cb820ee6e39
2020-09-28 11:40:04 -07:00
993628c74a Build shape expressions and remove outputs that are only used by aten::sizes (#45080)
Summary:
Currently, TE materializes all intermediate results even if they are only used for computing their shapes. This diff ports the approach the OF (Old Fuser) took to deal with this issue. Namely, given the structure of a fusion group, we infer all the sizes outside the fusion group based on the fusion group's inputs.

A simple example would be:

```
        def test_fuse(a, b):
            c = a + b
            d = c + b
            return d
```

Here we don't need to cache `c` as computing a gradient for `b` in `d = c + b` doesn't need it. We do need to compute sizes for all arguments here in case broadcasts happen.

Without this optimization, TE would need to materialize `c` so we can get its size

```
[DUMP profiling_graph_executor_impl.cpp:499] Optimized Graph:
[DUMP profiling_graph_executor_impl.cpp:499] graph(%a.1 : Tensor,
[DUMP profiling_graph_executor_impl.cpp:499]       %b.1 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:499]   %11 : Tensor = prim::DifferentiableGraph_0(%b.1, %a.1)
[DUMP profiling_graph_executor_impl.cpp:499]   return (%11)
[DUMP profiling_graph_executor_impl.cpp:499] with prim::DifferentiableGraph_0 = graph(%11 : Tensor,
[DUMP profiling_graph_executor_impl.cpp:499]       %13 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:499]   %59 : int[] = aten::size(%13) # <string>:3:44
[DUMP profiling_graph_executor_impl.cpp:499]   %62 : int[] = aten::size(%11) # <string>:3:93
[DUMP profiling_graph_executor_impl.cpp:499]   %83 : Double(1:1, requires_grad=0, device=cuda:0), %84 : Double(1:1, requires_grad=0, device=cuda:0), %85 : bool = prim::TypeCheck(%11, %13)
[DUMP profiling_graph_executor_impl.cpp:499]   %86 : Tensor, %87 : Tensor = prim::If(%85)
[DUMP profiling_graph_executor_impl.cpp:499]     block0():
[DUMP profiling_graph_executor_impl.cpp:499]       %d.4 : Double(1:1, requires_grad=0, device=cuda:0), %c.4 : Double(1:1, requires_grad=0, device=cuda:0) = prim::TensorExprGroup_0(%83, %84)
[DUMP profiling_graph_executor_impl.cpp:499]       -> (%d.4, %c.4)
[DUMP profiling_graph_executor_impl.cpp:499]     block1():
[DUMP profiling_graph_executor_impl.cpp:499]       %94 : Function = prim::Constant[name="fallback_function", fallback=1]()
[DUMP profiling_graph_executor_impl.cpp:499]       %95 : (Tensor, Tensor) = prim::CallFunction(%94, %11, %13)
[DUMP profiling_graph_executor_impl.cpp:499]       %96 : Tensor, %97 : Tensor = prim::TupleUnpack(%95)
[DUMP profiling_graph_executor_impl.cpp:499]       -> (%96, %97)
[DUMP profiling_graph_executor_impl.cpp:499]   %60 : int[] = aten::size(%87) # <string>:3:55
[DUMP profiling_graph_executor_impl.cpp:499]   %61 : int[]? = aten::_size_if_not_equal(%59, %60) # <string>:3:19
[DUMP profiling_graph_executor_impl.cpp:499]   %64 : int[]? = aten::_size_if_not_equal(%62, %60) # <string>:3:68
[DUMP profiling_graph_executor_impl.cpp:499]   %67 : int[] = aten::size(%86) # <string>:3:55
[DUMP profiling_graph_executor_impl.cpp:499]   %68 : int[]? = aten::_size_if_not_equal(%60, %67) # <string>:3:19
[DUMP profiling_graph_executor_impl.cpp:499]   %71 : int[]? = aten::_size_if_not_equal(%62, %67) # <string>:3:68
[DUMP profiling_graph_executor_impl.cpp:499]   return (%86, %61, %64, %68, %71)
[DUMP profiling_graph_executor_impl.cpp:499] with prim::TensorExprGroup_0 = graph(%1 : Double(1:1, requires_grad=0, device=cuda:0),
[DUMP profiling_graph_executor_impl.cpp:499]       %4 : Double(1:1, requires_grad=0, device=cuda:0)):
[DUMP profiling_graph_executor_impl.cpp:499]   %5 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:499]   %c.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%4, %1, %5) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2872:16
[DUMP profiling_graph_executor_impl.cpp:499]   %2 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:499]   %d.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%c.3, %1, %2) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2873:16
[DUMP profiling_graph_executor_impl.cpp:499]   return (%d.3, %c.3)
```

With this optimization we use `prim::BroadcastSizes` to compute the size of `c`. No need to materialize it.

```
[DUMP profiling_graph_executor_impl.cpp:499] Optimized Graph:
[DUMP profiling_graph_executor_impl.cpp:499] graph(%a.1 : Tensor,
[DUMP profiling_graph_executor_impl.cpp:499]       %b.1 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:499]   %11 : Tensor = prim::DifferentiableGraph_0(%b.1, %a.1)
[DUMP profiling_graph_executor_impl.cpp:499]   return (%11)
[DUMP profiling_graph_executor_impl.cpp:499] with prim::DifferentiableGraph_0 = graph(%11 : Tensor,
[DUMP profiling_graph_executor_impl.cpp:499]       %13 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:499]   %59 : int[] = aten::size(%13) # <string>:3:44
[DUMP profiling_graph_executor_impl.cpp:499]   %62 : int[] = aten::size(%11) # <string>:3:93
[DUMP profiling_graph_executor_impl.cpp:499]   %88 : Double(1:1, requires_grad=0, device=cuda:0), %89 : Double(1:1, requires_grad=0, device=cuda:0), %90 : bool = prim::TypeCheck(%11, %13)
[DUMP profiling_graph_executor_impl.cpp:499]   %91 : Tensor = prim::If(%90)
[DUMP profiling_graph_executor_impl.cpp:499]     block0():
[DUMP profiling_graph_executor_impl.cpp:499]       %d.4 : Double(1:1, requires_grad=0, device=cuda:0) = prim::TensorExprGroup_0(%88, %89)
[DUMP profiling_graph_executor_impl.cpp:499]       -> (%d.4)
[DUMP profiling_graph_executor_impl.cpp:499]     block1():
[DUMP profiling_graph_executor_impl.cpp:499]       %97 : Function = prim::Constant[name="fallback_function", fallback=1]()
[DUMP profiling_graph_executor_impl.cpp:499]       %98 : (Tensor) = prim::CallFunction(%97, %11, %13)
[DUMP profiling_graph_executor_impl.cpp:499]       %99 : Tensor = prim::TupleUnpack(%98)
[DUMP profiling_graph_executor_impl.cpp:499]       -> (%99)
[DUMP profiling_graph_executor_impl.cpp:499]   %85 : int[] = aten::size(%91)
[DUMP profiling_graph_executor_impl.cpp:499]   %86 : int[] = prim::BroadcastSizes(%59, %62)
[DUMP profiling_graph_executor_impl.cpp:499]   %61 : int[]? = aten::_size_if_not_equal(%59, %86) # <string>:3:19
[DUMP profiling_graph_executor_impl.cpp:499]   %64 : int[]? = aten::_size_if_not_equal(%62, %86) # <string>:3:68
[DUMP profiling_graph_executor_impl.cpp:499]   %68 : int[]? = aten::_size_if_not_equal(%86, %85) # <string>:3:19
[DUMP profiling_graph_executor_impl.cpp:499]   %71 : int[]? = aten::_size_if_not_equal(%62, %85) # <string>:3:68
[DUMP profiling_graph_executor_impl.cpp:499]   return (%91, %61, %64, %68, %71)
[DUMP profiling_graph_executor_impl.cpp:499] with prim::TensorExprGroup_0 = graph(%1 : Double(1:1, requires_grad=0, device=cuda:0),
[DUMP profiling_graph_executor_impl.cpp:499]       %4 : Double(1:1, requires_grad=0, device=cuda:0)):
[DUMP profiling_graph_executor_impl.cpp:499]   %5 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:499]   %c.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%4, %1, %5) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2872:16
[DUMP profiling_graph_executor_impl.cpp:499]   %2 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:499]   %d.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%c.3, %1, %2) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2873:16
[DUMP profiling_graph_executor_impl.cpp:499]   return (%d.3)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45080

Reviewed By: bertmaher

Differential Revision: D23856410

Pulled By: Krovatkin

fbshipit-source-id: 2956286eb03a4894a5baa151c35e6092466322b1
2020-09-28 10:45:56 -07:00
e5242aaf89 Update TensorPipe submodule (#45433)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45433

Primarily in order to pick up the fix landed in https://github.com/pytorch/tensorpipe/pull/225, which fixes the handling of scopes in link-local IPv6 addresses, an issue reported by a user.

Test Plan: The specific upstream change is covered by new unit tests. The submodule update will be validated by the PyTorch CI.

Reviewed By: beauby

Differential Revision: D23962289

fbshipit-source-id: 4ed762fc19c4aeb1398d1337d61b3188c4c228be
2020-09-28 10:32:06 -07:00
48d29c830d [hotfix] disable problematic cuda tests on rocm builds (#45435)
Summary:
Disable the 3 recently added CUDA tests on AMD ROCm builds/tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45435

Reviewed By: malfet

Differential Revision: D23962881

Pulled By: walterddr

fbshipit-source-id: ad4ea1f835b4722cdbdce685806cfd64376cc16f
2020-09-28 10:02:12 -07:00
e2ffdf467a docker: Add torchelastic to docker image (#45438)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45438

Adds torchelastic (as well as its dependencies) to the official docker
images

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: tierex

Differential Revision: D23963787

Pulled By: seemethere

fbshipit-source-id: 54ebb4b9c50699e543f264975dadf99badf55753
2020-09-28 09:53:07 -07:00
e4950a093a Backward support for generalized eigenvalue solver with LOBPCG in forward [only k-rank SYMEIG case] (#43002)
Summary:
As per title. Fixes [#38948](https://github.com/pytorch/pytorch/issues/38948). Therein you can find some blueprints for the algorithm being used in this PR.
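
A hedged usage sketch of the newly supported backward for the standard symmetric case (B omitted; assumes the default "ortho" method supports autograd after this PR):
```python
import torch

n, k = 16, 2
A = torch.randn(n, n, dtype=torch.float64)
A = A @ A.T + n * torch.eye(n, dtype=torch.float64)  # symmetric positive definite
A.requires_grad_(True)

eigvals, eigvecs = torch.lobpcg(A, k=k, largest=True)
eigvals.sum().backward()  # gradients now flow back to A
print(A.grad.shape)       # torch.Size([16, 16])
```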

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43002

Reviewed By: zou3519

Differential Revision: D23931326

Pulled By: albanD

fbshipit-source-id: e6994af70d94145f974ef87aa5cea166d6deff1e
2020-09-28 07:22:35 -07:00
6417a70465 Updates linalg warning + docs (#45415)
Summary:
Changes the deprecation of norm to a docs deprecation, since PyTorch components still rely on norm and some behavior, like automatically flattening tensors, may need to be ported to torch.linalg.norm. The documentation is also updated to clarify that torch.norm and torch.linalg.norm are distinct.
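
A small sketch of the distinction the docs now call out: with `dim=None`, `torch.norm` flattens a matrix and computes a vector norm, while `torch.linalg.norm` interprets `ord=2` on a matrix as the spectral norm.
```python
import torch

A = torch.tensor([[3.0, 0.0],
                  [0.0, 4.0]])

print(torch.norm(A, p=2))           # tensor(5.) -- vector 2-norm of the flattened matrix
print(torch.linalg.norm(A, ord=2))  # tensor(4.) -- largest singular value
```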

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45415

Reviewed By: ngimel

Differential Revision: D23958252

Pulled By: mruberry

fbshipit-source-id: fd54e807c59a2655453a6bcd9f4073cb2c12e8ac
2020-09-28 05:28:42 -07:00
7818a214c5 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D23959094

fbshipit-source-id: 6caa046d263114bff38a38d756099aac357e4f04
2020-09-28 05:08:46 -07:00
95a97e51b5 [ONNX] Improve scripting inplace indexing ops (#44351)
Summary:
Fix a couple of issues with scripting inplace indexing in the prepare_inplace_ops_for_onnx pass.
1) Tracing index copy (such as cases like x[1:3] = data) already applies broadcasting on the rhs if needed. The broadcasting node (aten::expand) is missing in scripting cases.

2) Inplace indexing with ellipsis (aten::copy_) is replaced with aten::index_put and then handled with slice+select in this pass.
Support for negative indices for this op is added.

Shape inference is also enabled for scripting tests using new JIT API.
A few more tests are enabled for scripting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44351

Reviewed By: ezyang

Differential Revision: D23880267

Pulled By: bzinodev

fbshipit-source-id: 78b33444633eb7ae0fbabc7415e3b16001f5207f
2020-09-28 00:32:36 -07:00
13f76f2be4 Fix preserve submodule attribute in freezing (#45143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45143

This PR prevents freezing cleaning up a submodule when user requests to
preserve a submodule.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23844969

Pulled By: bzinodev

fbshipit-source-id: 80e6db3fc12460d62e634ea0336ae2a3551c2151
2020-09-28 00:05:38 -07:00
c3bf402cbb handle onnx nll with default ignore index (#44816)
Summary:
In the ONNX NegativeLogLikelihoodLoss specification, ignore_index is optional and has no default value.
Therefore, when converting the nll op to ONNX, we need to set the ignore_index attribute even if it is not specified (e.g. ignore_index=-100).
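
A hedged export sketch (assumes opset 12+, where ONNX added NegativeLogLikelihoodLoss):
```python
import torch
import torch.nn as nn

class NLL(nn.Module):
    def __init__(self):
        super().__init__()
        self.loss = nn.NLLLoss()  # ignore_index defaults to -100 in PyTorch

    def forward(self, log_probs, target):
        return self.loss(log_probs, target)

log_probs = torch.log_softmax(torch.randn(3, 5), dim=1)
target = torch.tensor([1, 0, 4])

# The exporter should now emit ignore_index=-100 on the ONNX node even
# though the user never specified it.
torch.onnx.export(NLL(), (log_probs, target), "nll.onnx", opset_version=12)
```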

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44816

Reviewed By: ezyang

Differential Revision: D23880354

Pulled By: bzinodev

fbshipit-source-id: d0bdd58d0a4507ed9ce37133e68533fe6d1bdf2b
2020-09-27 23:26:19 -07:00
8bdbedd4ee Revert "Updates and simplifies nonzero as_tuple behavior"
This reverts commit 8b143771d0f0bcd93d925263adc8b0d6b235b398.
2020-09-27 20:58:42 -07:00
8b143771d0 Updates and simplifies nonzero as_tuple behavior 2020-09-27 20:56:30 -07:00
5b839bca78 [ONNX] Optimize export_onnx api to reduce string and model proto exchange (#44332)
Summary:
Optimize export_onnx api to reduce string and model proto exchange in export.cpp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44332

Reviewed By: bwasti, eellison

Differential Revision: D23880129

Pulled By: bzinodev

fbshipit-source-id: 1d216d8f710f356cbba2334fb21ea15a89dd16fa
2020-09-27 16:29:08 -07:00
4005afe94b [ONNX] Update narrow for dynamic inputs (#44039)
Summary:
Update narrow for dynamic inputs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44039

Reviewed By: mruberry

Differential Revision: D23742215

Pulled By: bzinodev

fbshipit-source-id: 0d58d2fe996f91a124af988a9a21ee433e842d07
2020-09-27 15:52:57 -07:00
78caa028b6 Revert D23009117: [Distributed] DeleteKey API for c10d TCP Store
Test Plan: revert-hammer

Differential Revision:
D23009117 (addf94f2d6)

Original commit changeset: 1a0d95b43d79

fbshipit-source-id: ad3fe5501267e1a0a7bf23410766f1e92b34b24d
2020-09-27 12:04:42 -07:00
f84b2e865f Revert D23878455: [Distributed] Adding Python tests for the TCPStore getNumKeys and deleteKey
Test Plan: revert-hammer

Differential Revision:
D23878455 (cf808bed73)

Original commit changeset: 0a17ecf66b28

fbshipit-source-id: 93e60b23f66324e3e5266c45abb0cec295bb3d23
2020-09-27 12:02:24 -07:00
bc5710f2f7 Benchmarks: tweak PE config settings. (#45349)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45349

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D23935518

Pulled By: ZolotukhinM

fbshipit-source-id: 5a7c508c6fc84eafbc23399f095d732b903510dc
2020-09-26 23:13:29 -07:00
a07d82982a CI: Add a run of FastRNN benchmarks in default executor/fuser configuration. (#45348)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45348

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D23935520

Pulled By: ZolotukhinM

fbshipit-source-id: efecaaab68caaaa057b354884f4ae37b6ef36983
2020-09-26 23:13:27 -07:00
8cef7326f4 Benchmarks: add 'default' options for fuser and executor. (#45347)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45347

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D23935519

Pulled By: ZolotukhinM

fbshipit-source-id: 8323fafe7828683c4d29c12a1e5722adb6f945ff
2020-09-26 23:09:02 -07:00
37a671abc7 Revert D23828257: Quantization: add API summary section
Test Plan: revert-hammer

Differential Revision:
D23828257 (d2bd556e7d)

Original commit changeset: 9311ee3f394c

fbshipit-source-id: 80b16fc123191e249e6a070ec5360a15fe91cf61
2020-09-26 22:53:10 -07:00
110aa45387 Revert D23842456: Quantization: combine previous summary with new summary
Test Plan: revert-hammer

Differential Revision:
D23842456 (278da57255)

Original commit changeset: db2399e51e9a

fbshipit-source-id: 7878257330bf83751cb17c0971a5c894bdf256ba
2020-09-26 22:53:07 -07:00
3da1061059 Revert D23916669: quant docs: add reduce_range explanation to top level doc
Test Plan: revert-hammer

Differential Revision:
D23916669 (eb39624394)

Original commit changeset: ef93fb774cb1

fbshipit-source-id: 7b56020427e76e13f847494044179c81d508db11
2020-09-26 22:48:38 -07:00
54a253fded Revert D23931987: Added optimizers based on multi tensor apply
Test Plan: revert-hammer

Differential Revision:
D23931987 (2b21e7767e)

Original commit changeset: 582134ef2d40

fbshipit-source-id: ffd500aea55fda34155442fb15e2529cb9c00100
2020-09-26 18:11:54 -07:00
e52762cbb7 Revert D23917034: quant docs: document how to customize qconfigs in eager mode
Test Plan: revert-hammer

Differential Revision:
D23917034 (7763e1d7b1)

Original commit changeset: ccf71ce4300c

fbshipit-source-id: 9ce99e880b4a22e824f4413354a0f3703e7c5c2c
2020-09-26 18:05:38 -07:00
23dfca8351 Support record_shapes in RPC profiling (#44419)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44419

Closes https://github.com/pytorch/pytorch/issues/39969

This PR adds support for propagation of input shapes over the wire when the profiler is invoked with `record_shapes=True` over RPC. Previously, we did not respect this argument.

This is done by saving the shapes as an ivalue list and recovering it as the expected type (`std::vector<std::vector<int>>` on the client). A test is added to ensure that remote ops report the same `input_shapes` as if the op were run locally.
ghstack-source-id: 112977899
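
A hedged usage sketch (assumes `rpc.init_rpc` has already been called on "worker0" and "worker1"):
```python
import torch
import torch.distributed.rpc as rpc
from torch.autograd import profiler

with profiler.profile(record_shapes=True) as prof:
    rpc.rpc_sync("worker1", torch.add,
                 args=(torch.ones(2, 3), torch.ones(2, 3)))

# Remote ops now report the same input_shapes as a local torch.add would.
print(prof.key_averages(group_by_input_shape=True).table())
```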

Reviewed By: pritamdamania87

Differential Revision: D23591274

fbshipit-source-id: 7cf3b2e8df26935ead9d70e534fc2c872ccd6958
2020-09-26 13:26:44 -07:00
19dda7c68a Fallback to CPU when remote end does not have CUDA for profiling (#44967)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44967

When enabling the profiler on the server, the server may be a different machine
that does not have CUDA while the caller does. In this case we used to crash,
but now we fall back to CPU profiling and log a warning.
ghstack-source-id: 112977906

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D23790729

fbshipit-source-id: dc6eba172b7e666842d54553f52a6b9d5f0a5362
2020-09-26 13:12:55 -07:00
2b21e7767e Added optimizers based on multi tensor apply (#45299)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45299

Adding a new namespace `torch.optim._multi_tensor` with a bunch of updated optimizers. Those optimizers are using _foreach APIs which improve performance significantly.

### Tests
- updated existing tests to use both optimizers
- added `test_multi_tensor_optimizers` test to verify correctness.

### Perf results

**Adam**
timeit: 42.69 ms --> 10.16 ms
autorange: 41.96 ms --> 10.28 ms

**AdamW**
timeit: 51.38 ms --> 15.63 ms
autorange: 50.82 ms --> 16.07 ms

**SGD**
timeit: 6.28 ms --> 4.40 ms
autorange: 6.13 ms --> 4.73 ms

**RMSprop**
timeit: 28.63 ms --> 5.89 ms
autorange: 28.27 ms -->  5.76 ms

**Rprop**
timeit: 213.30 --> 178.42
autorange: 212.03 --> 178.03

**ASGD**
timeit: 21.67 --> 9.33
autorange: 21.64 --> 9.27

**Adamax**
timeit: 55.60 --> 48.29
autorange: 55.22 -> 49.13

**Perf script used**

```
import torch
import time
import torch.optim as optim
from torch.autograd import Variable
from torch.optim.lr_scheduler import ExponentialLR, ReduceLROnPlateau, StepLR
import torch.nn as nn
import time
import torchvision
import torch.utils._benchmark as benchmark_utils

device = "cuda"
model = torchvision.models.resnet.resnet101(pretrained=True).to(device)
targets = torch.randint(0, 1000, (100, 100), device=device)
criterion = nn.CrossEntropyLoss()

optimizer = optim.SGD(model.parameters(), lr=1e-3) # <----------------------- optimizer.
                                                          # would compare optim.SGD vs optim._multi_tensor.SGD
running_loss = 0.0
target = torch.empty(128, dtype=torch.long, device=device).random_(5)

optimizer.zero_grad()
inputs = torch.rand(128, 3, 100, 100, device=device , requires_grad=True)
outputs = model(inputs)
loss = criterion(outputs, target)
loss.backward()
optimizer.step()
running_loss += loss.item()

def main():
    timer = benchmark_utils.Timer(
        stmt="optimizer.step()",
        globals=globals(),
        label="str(optimizer)",
    )

    for i in range(1):
        print(f"Run: {i}\n{'-' * 40}")
        print(f"timeit:\n{timer.timeit(1000)}\n")
        print(f"autorange:\n{timer.blocked_autorange()}\n\n")

if __name__ == "__main__":
    main()
```

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23931987

Pulled By: izdeby

fbshipit-source-id: 582134ef2d402909d27d89a45c5b588fb7130ea1
2020-09-26 12:17:43 -07:00
0fa551f0ab [c2] Fix int types for learning rate
Summary: Currently GetSingleArgument overflows, since it expects an int instead of an int64, when using a 1cycle (hill policy) annealing schedule.

Test Plan:
unittest

buck test  caffe2/caffe2/python/operator_test:learning_rate_op_test

Differential Revision: D23938169

fbshipit-source-id: 20d65df800d7a0f1dd9520705af31f63ae716463
2020-09-26 10:59:29 -07:00
cf808bed73 [Distributed] Adding Python tests for the TCPStore getNumKeys and deleteKey (#45223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45223

Previous diffs in this stack implemented the getNumKeys and deleteKey
APIs in the c10d Store as well as added tests at the C++ layer. This diff adds
tests at the Python level in test_c10d.py
ghstack-source-id: 112939763

Test Plan: Ensured these new python tests as well as previous C++ tests pass

Reviewed By: jiayisuse

Differential Revision: D23878455

fbshipit-source-id: 0a17ecf66b28d46438a77346e5bf36414e05e25c
2020-09-26 00:54:28 -07:00
addf94f2d6 [Distributed] DeleteKey API for c10d TCP Store (#43963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43963

Added a DeleteKey API for the TCP Store
ghstack-source-id: 112939762

Test Plan:
Modified the existing get/set test to use delete. verified that the
correct keys were deleted and that the numKeys API returned the right values

Reviewed By: jiayisuse

Differential Revision: D23009117

fbshipit-source-id: 1a0d95b43d79e665a69b2befbaa059b2b50a1f66
2020-09-26 00:54:21 -07:00
304e1d1e19 [Distributed] getNumKeys API to c10d TCPStore (#43962)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43962

TCPStore needs a getNumKeys API for our logging needs.
ghstack-source-id: 112939761

Test Plan: Adding tests to C++ Store Tests

Reviewed By: pritamdamania87

Differential Revision: D22985085

fbshipit-source-id: 8a0d286fbd6fd314dcc997bae3aad0e62b51af83
2020-09-26 00:49:00 -07:00
d9af3d2fcd [quant] ConvTranspose warnings (#45081)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45081

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23822449

Pulled By: z-a-f

fbshipit-source-id: f21a5f3ef4d09f703c96fff0bc413dbadeac8202
2020-09-25 22:30:14 -07:00
92189b34b7 Add get_all_users_of function to GraphManipulation (#45216)
Summary:
This PR adds the get_all_users_of function, which returns all the users of a specific node. A unit test is also added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45216

Reviewed By: ezyang

Differential Revision: D23883572

Pulled By: scottxu0730

fbshipit-source-id: 3eb68a411c3c6db39ed2506c9cb7bb7337520ee4
2020-09-25 19:32:49 -07:00
7763e1d7b1 quant docs: document how to customize qconfigs in eager mode (#45306)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45306

Adds details to the main quantization doc on how specifically
users can skip or customize quantization of layers.
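
A short sketch of the kind of eager-mode recipe the doc now spells out: set a global qconfig, then override it per submodule (None skips quantization for that layer).
```python
import torch
import torch.quantization as tq

model = torch.nn.Sequential(
    torch.nn.Linear(4, 4),
    torch.nn.Linear(4, 2),
)

model.qconfig = tq.get_default_qconfig("fbgemm")
model[1].qconfig = None       # keep the second Linear in fp32

prepared = tq.prepare(model)  # only model[0] gets observers attached
```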

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23917034

Pulled By: vkuzo

fbshipit-source-id: ccf71ce4300c1946b2ab63d1f35a07691fd7a2af
2020-09-25 18:33:35 -07:00
eb39624394 quant docs: add reduce_range explanation to top level doc (#45305)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45305

Adds an explanation of reduce_range to the main quantization
doc page.

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23916669

Pulled By: vkuzo

fbshipit-source-id: ef93fb774cb15741cd92889f114f6ab76c39f051
2020-09-25 18:33:32 -07:00
278da57255 Quantization: combine previous summary with new summary (#45135)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45135

The previous quantization summary had steps on what to do for
dynamic, static, QAT.  This PR moves these steps to comments in the
example code, so it is more clear how to accomplish the steps.

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23842456

Pulled By: vkuzo

fbshipit-source-id: db2399e51e9ae33c8a1ac610e3d7dbdb648742b0
2020-09-25 18:33:30 -07:00
d2bd556e7d Quantization: add API summary section (#45093)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45093

This adds a tl;dr; style summary of the quantization API
to the documentation. Hopefully this will make this easier
for new folks to learn how to use quantization.

This is not meant to be all-encompassing.  Future PRs
can improve the documentation further.

Test Plan:
1. build the doc as specified in https://github.com/pytorch/pytorch#building-the-documentation
2. inspect the quantization page in Chrome, format looks good

Reviewed By: jerryzh168

Differential Revision: D23828257

Pulled By: vkuzo

fbshipit-source-id: 9311ee3f394cd83af0aeafb6e2fcdc3e0321fa38
2020-09-25 18:30:51 -07:00
958c208666 [quant] conv_transpose graph patterns (#45078)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45078

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23821580

Pulled By: z-a-f

fbshipit-source-id: 813a4ef1bbc429720765d61791fe754b6678a334
2020-09-25 18:14:29 -07:00
606b1a9a2e Move xla codegen to aten. (#45241)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45241

Test Plan: Imported from OSS

Reviewed By: soumith

Differential Revision: D23926750

Pulled By: ailzhang

fbshipit-source-id: f768e24a9baeca9f9df069a62d6f8b94a853a1ee
2020-09-25 18:07:32 -07:00
32c355af5b [dist_optim] introduce distributed functional optimizer (#45221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45221

This PR introduces a distributed functional optimizer, so that the
distributed optimizer can reuse the functional optimizer APIs and
maintain its own state. This enables a TorchScript-compatible
functional optimizer when using the distributed optimizer, which helps
get rid of the GIL and improves the overall performance of training,
especially distributed model parallel training
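
For context, a sketch of the user-facing entry point that can now be backed by functional optimizers; `param_rrefs` is assumed to be a list of RRefs created elsewhere via RPC:

```python
import torch.optim as optim
from torch.distributed.optim import DistributedOptimizer

def make_dist_optimizer(param_rrefs):
    # Same construction API as before; each worker-local optimizer can
    # now be a functional implementation that holds its own state.
    return DistributedOptimizer(optim.Adagrad, param_rrefs, lr=0.05)
```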

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D23935256

Pulled By: wanchaol

fbshipit-source-id: 59b6d77ff4693ab24a6e1cbb6740bcf614cc624a
2020-09-25 17:13:10 -07:00
08caf15502 [optimizer] refactor Adam to use functional API (#44791)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44791

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D23935257

Pulled By: wanchaol

fbshipit-source-id: 6f6e22a9287f5515d2e4e6abd4dee2fe7e17b945
2020-09-25 17:13:08 -07:00
0444c372e1 [optimizer] introduce optimizer functional API, refactor Adagrad (#44715)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44715

We have provided a nice and intuitive API in Python. But in the context of large-scale distributed training (e.g. Distributed Model Parallel), users often want to use multithreaded training instead of multiprocess training, as it provides better resource utilization and efficiency.

This PR introduces a functional optimizer concept (similar to the concept of `nn.functional`): we split the optimizer into two parts: 1. optimizer state management 2. optimizer computation. We expose the computation part as a separate functional API that is available to internal and OSS developers; the caller of the functional API maintains their own state in order to call the functional API directly. While keeping the end-user API the same, the functional API is TorchScript friendly and can be used by the distributed optimizer to speed up training without the GIL.
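
A conceptual sketch of the split; this mirrors the design described above, not the PR's actual signatures:

```python
import torch

def functional_sgd_step(params, grads, lr):
    # Pure computation: the caller owns and passes in all state.
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(g, alpha=-lr)

w = torch.randn(4, requires_grad=True)
(w ** 2).sum().backward()
functional_sgd_step([w], [w.grad], lr=0.1)  # state management stays with the caller
```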

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D23935258

Pulled By: wanchaol

fbshipit-source-id: d2a5228439edb3bc64f7771af2bb9e891847136a
2020-09-25 17:10:26 -07:00
8ab2ad306d Enable torch.cuda.nccl typechecking (#45344)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45336

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45344

Reviewed By: walterddr

Differential Revision: D23935306

Pulled By: malfet

fbshipit-source-id: dd09d4f8ff7a327131764487158675027a13bf69
2020-09-25 17:02:47 -07:00
5211fb97ac Remove device maps from TensorPipe for v1.7 release (#45353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45353

Temporarily removing this feature; it will be added back after the branch cut.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D23939865

Pulled By: mrshenli

fbshipit-source-id: 7dceaffea6b9a16512b5ba6036da73e7f8f83a8e
2020-09-25 16:51:45 -07:00
439930c81b adding a beta parameter to the smooth_l1 loss fn (#44433)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44433

Not entirely sure why, but changing the type of beta from `float` to `double` in autocast_mode.cpp and FunctionsManual.h fixes my compiler errors, failing instead at link time.

Follow-up fixes: corrected some type errors and updated the function signature in a few more files; removed my usage of Scalar, making beta a double everywhere instead.
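
A minimal usage sketch of the new parameter (values are illustrative):

```python
import torch
import torch.nn.functional as F

inp, tgt = torch.randn(8), torch.randn(8)
# beta sets the |input - target| threshold where the loss switches
# from quadratic to linear; beta=1.0 matches the previous behavior.
loss = F.smooth_l1_loss(inp, tgt, beta=0.5)
```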

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23636720

Pulled By: bdhirsh

fbshipit-source-id: caea2a1f8dd72b3b5fd1d72dd886b2fcd690af6d
2020-09-25 16:36:28 -07:00
37513a1118 Use explicit templates in CUDALoops kernels (#44286)
Summary:
Reland attempt of https://github.com/pytorch/pytorch/pull/41059
Use explicit templates instead of lambdas to reduce binary size by 100-200Kb per arch per CU without affecting perf, namely:
BinaryMulDivKernel.cu 3.8Mb -> 3.5Mb
CompareEQKernel.cu 1.8Mb -> 1.7Mb
BinaryAddSubKernel.cu 2.0Mb -> 1.8Mb
BinaryBitwiseOpsKernels.cu 2.6Mb -> 2.3Mb

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44286

Reviewed By: ngimel

Differential Revision: D23859691

Pulled By: malfet

fbshipit-source-id: 2c4e86f35e0f94a62294dc5d52a3ba364db23e2d
2020-09-25 16:26:40 -07:00
a2b4177c5b Add barrier() at the end of init_process_group and new_group. (#45181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45181

`init_process_group` and `new_group` update a bunch of global
variables after initializing the actual process group. As a result, there is a
race: after initializing the process group on, say, rank 0, if we immediately
check the default process group on rank 1 (say via RPC), we might actually get
an error since rank 1 hasn't yet updated its _default_pg variable.

To resolve this issue, I've added a barrier() at the end of both of these calls.
This ensures that once these calls return, correct initialization is guaranteed
on all ranks.

Since these calls are mostly done during initialization, it should be
fine to add the overhead of a barrier() here.
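
A minimal sketch of the guarantee this change provides (backend and init_method are placeholders):

```python
import torch
import torch.distributed as dist

def worker(rank, world_size):
    # The trailing barrier means that when this returns on any rank,
    # every rank has finished updating its process-group globals.
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:23456",
                            rank=rank, world_size=world_size)
    dist.all_reduce(torch.ones(1))  # safe to use the group immediately
```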

Closes: https://github.com/pytorch/pytorch/issues/40434, https://github.com/pytorch/pytorch/issues/40378
ghstack-source-id: 112923112

Test Plan:
Reproduced the failures in
https://github.com/pytorch/pytorch/issues/40434 and
https://github.com/pytorch/pytorch/issues/40378 and verified that this PR fixes
the issue.

Reviewed By: mrshenli

Differential Revision: D23858025

fbshipit-source-id: c4d5e46c2157981caf3ba1525dec5310dcbc1830
2020-09-25 15:46:59 -07:00
3b7e4f89b2 Add deprecation warning to PG backend and make TP backend stable. (#45356)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45356

In this PR, I'm adding a warning to the PG backend mentioning that it will
be deprecated in the future. In addition, I removed the warning from the
TP backend saying that it is a beta feature.
ghstack-source-id: 112940501

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D23940144

fbshipit-source-id: d44054aa1e4ef61004a40bbe0ec45ff07829aad4
2020-09-25 15:41:00 -07:00
04be420549 [static runtime] Remove ops in static from backwards compatibility checks (#45354)
Summary:
This should get the builds green again

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45354

Reviewed By: zhangguanheng66

Differential Revision: D23939615

Pulled By: bwasti

fbshipit-source-id: e93b11bc9592205e52330bb15928603b0aea21ac
2020-09-25 14:46:42 -07:00
eee7dad376 Add torch.do_assert, which is symbolically traceable (#45188)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45188

This is a symbolically traceable alternative to Python's `assert`.
It should be useful to allow people who want to use FX to also
be able to assert things.

A bunch of TODO(before land) comments are inline - would love thoughts
on where the best place for this code to live is, and what this
function should be called (since `assert` is reserved).
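
A hedged sketch of how this would be used; `torch.do_assert` is the name proposed in this PR and may change, per the naming question above:

```python
import torch
import torch.fx

class M(torch.nn.Module):
    def forward(self, x):
        # A plain `assert` cannot be symbolically traced, because the
        # condition is a Proxy; do_assert becomes a node in the graph.
        torch.do_assert(x.sum() > 0, "expected a positive sum")
        return x * 2

traced = torch.fx.symbolic_trace(M())
```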

Test Plan:
```
python test/test_fx.py TestFX.test_symbolic_trace_assert
```

Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23861567

fbshipit-source-id: d9d6b9556140faccc0290eba1fabea401d7850de
2020-09-25 13:46:28 -07:00
7c5436d557 [RPC profiling] Add tests to ensure RPC profiling works on single threaded (#44923)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44923

This ensures that RPC profiling works in single-threaded server
scenarios and that we won't make the assumption that we'll have multiple
threads when working on this code. For example, this assumption resulted in a
bug in the previous diff (which was fixed)
ghstack-source-id: 112868469

Test Plan: CI

Reviewed By: lw

Differential Revision: D23691304

fbshipit-source-id: b17d34ade823794cbe949b70a5ab35723d974203
2020-09-25 13:24:18 -07:00
27ab9bc0f9 [RPC profiling] Extend RPC profiling to support async function execution over RPC. (#44664)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44664

Closes https://github.com/pytorch/pytorch/issues/39971. This PR adds support for functions decorated with `rpc.functions.async_execution` to be profiled over RPC as builtins, jit functions, and blocking python UDFs currently can be. The reasoning for this is to provide complete feature support in terms of RPC profiling and the various types of functions users can run.

To enable this, the PR below this enables calling `disableProfiler()` safely from another thread. We use that functionality to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only the blocking `processRPC` call as was done previously). Since when the future completes we've kicked off the async function and the future corresponding to it has completed, we are able to capture any RPCs the function would have called and the actual work done on the other node.

For example, if the following async function is ran on a server over RPC:

```
def slow_add(x, y):
    time.sleep(1)
    return torch.add(x, y)

@rpc.functions.async_execution
def slow_async_add(to, x, y):
    return rpc.rpc_async(to, slow_add, args=(x, y))
```

we expect to see the original RPC profiled, the nested RPC profiled, and the actual torch.add() work. All of these events should be recorded with the correct node id. Here is an example profiling output:

```
-----------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
Name                                                                                                                     Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  Number of Calls  Node ID
-----------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
rpc_async#slow_async_add(worker1 -> worker2)                                                                             0.00%             0.000us         0            1.012s     1.012s        1                1
aten::empty                                                                                                              7.02%             11.519us        7.02%        11.519us   11.519us      1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                           0.00%             0.000us         0            1.006s     1.006s        1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                      7.21%             11.843us        7.21%        11.843us   11.843us      1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add      71.94%            118.107us       85.77%       140.802us  140.802us     1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty    13.82%            22.695us        13.82%       22.695us   22.695us      1                3
-----------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
Self CPU time total: 164.164us
```

This PR also moves a bunch of the profiling logic to `rpc/utils.cpp` to declutter `request_callback` code.
ghstack-source-id: 112868470

Test Plan:
```
rvarm1@devbig978:fbcode  (52dd34f6)$ buck test mode/no-gpu mode/dev-nosan //caffe2/test/distributed/rpc:process_group_agent -- test_rpc_profiling_async_function --print-passing-details --stress-runs 1
```

Reviewed By: mrshenli

Differential Revision: D23638387

fbshipit-source-id: eedb6d48173a4ecd41d70a9c64048920bd4807c4
2020-09-25 13:19:26 -07:00
d5748d9a1a Enable binary ops with Scalar Lists with for foreach APIs (#45298)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45298

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23931986

Pulled By: izdeby

fbshipit-source-id: 281267cd6f90d57a169af89f9f10b0f4fcab47e3
2020-09-25 12:58:34 -07:00
f07ac6a004 Fix Windows build failure after DDP PR merged (#45335)
Summary:
This is a resubmit of PR https://github.com/pytorch/pytorch/issues/42897, together with a fix for the Windows build issue introduced by PR https://github.com/pytorch/pytorch/issues/44344.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45335

Reviewed By: zou3519

Differential Revision: D23931471

Pulled By: mrshenli

fbshipit-source-id: f49b5a114944c1450b32934b3292170be064f494
2020-09-25 12:37:50 -07:00
c8166d4b58 Add torch.cuda.comm to typechecking CI (#45350)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45350

Reviewed By: walterddr

Differential Revision: D23935750

Pulled By: malfet

fbshipit-source-id: 5a7d2d4fbc976699d80bb5caf4727c19fa2c5bc8
2020-09-25 12:13:43 -07:00
22401b850b port all JIT tests to gtest (#45264)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45264

Context for why we are porting to gtest in: https://github.com/pytorch/pytorch/pull/45018.

This PR completes the process of porting and removes unused files/macros.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23901392

Pulled By: suo

fbshipit-source-id: 89526890e1a49462f3f77718f4ee273c5bc578ba
2020-09-25 11:37:43 -07:00
5a0514e3e6 [pytorch] Update fmt to 7.0.3 (#45304)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45304

As title

Test Plan: sandcastle

Reviewed By: malfet

Differential Revision: D23916328

fbshipit-source-id: 47c76886c1f17233304dc59289ff6baa16c50b8d
2020-09-25 11:33:36 -07:00
dc9e9c118e CUDA BFloat16 neg (#45240)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45240

Reviewed By: mruberry

Differential Revision: D23933392

Pulled By: ngimel

fbshipit-source-id: 2472dc550600ff470a1044ddee39054e22598038
2020-09-25 11:25:49 -07:00
e5f6e5af13 Add Deep and wide to test and flatten/tranpose for good measure (#44129)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44129

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604302

Pulled By: bwasti

fbshipit-source-id: 5787f6f32a80b22b1b712c4116f70370dad98f12
2020-09-25 11:05:41 -07:00
d1a11618f5 [static runtime] Add _out variants and reuse memory (#44128)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44128

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604304

Pulled By: bwasti

fbshipit-source-id: 06a23cb75700a0fc733069071843b7b498e7b9e9
2020-09-25 11:03:06 -07:00
d1d9017a66 [NNC] fix Half conversion of immediates in Cuda backend (#45213)
Summary:
The Cuda HalfChecker casts up all loads and stores of Half to Float, so we do math in Float on the device. It didn't cast up HalfImmediate (i.e. constants), so mixed-size ops could be inserted. The fix is to cast those up as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45213

Reviewed By: ezyang

Differential Revision: D23885287

Pulled By: nickgg

fbshipit-source-id: 912991d85cc06ebb282625cfa5080d7525c8eba9
2020-09-25 10:53:36 -07:00
536580e976 Vectorize bitwise_not (#45103)
Summary:
Benchmark (Debian 10, Release build, gcc 8.3, no turbo, Intel(R) Xeon(R)
E-2136 CPU @ 3.30GHz):

```python
import timeit
for dtype in ('torch.int64', 'torch.int32', 'torch.int16', 'torch.int8', 'torch.uint8'):
    for n, t in [(10_000, 100000),
                 (100_000, 10000)]:
        print(f'torch.bitwise_not(a), numel() == {n} for {t} times, dtype={dtype}')
        print(timeit.timeit('torch.bitwise_not(a)', setup=f'import torch; a = torch.arange(-{n//2}, {n//2}, dtype={dtype})', number=t))
```

Before:

```
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int64
0.5479081739904359
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int64
0.3350257440470159
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int32
0.39590477803722024
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int32
0.25563537096604705
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int16
0.31152817397378385
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int16
0.20817365101538599
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int8
0.8573925020173192
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int8
0.4150037349900231
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.uint8
0.8551108679967001
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.uint8
0.37137620500288904
```

After:

```
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int64
0.5232444299617782
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int64
0.33852163201663643
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int32
0.3931163849774748
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int32
0.24392802000511438
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int16
0.3122224889229983
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int16
0.1977886479580775
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int8
0.26711542706470937
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int8
0.18208567495457828
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.uint8
0.2615354140289128
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.uint8
0.17972210398875177
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45103

Reviewed By: ailzhang

Differential Revision: D23848675

Pulled By: ezyang

fbshipit-source-id: 6dde1ab32d9a343a49de66ad9f9b062fa23824d2
2020-09-25 10:18:30 -07:00
a117d968f6 [quant][graph] Remove redundant aten::wait calls in the graph (#45257)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45257

Currently we inline fork-wait calls when we insert observers for quantization.
In the case where fork and wait are in different subgraphs, inlining the fork-wait calls
only gets rid of the fork. This leaves the aten::wait call in the graph with a torch.Tensor as input,
which is currently not supported.
To avoid this, we check in the cleanup phase that the input to all wait calls
in the graph is of type Future[Tensor].

Test Plan:
python test/test_quantization.py TestQuantizeJitPasses.test_quantize_fork_wait

Imported from OSS

Reviewed By: qizzzh

Differential Revision: D23895412

fbshipit-source-id: 3c58c6be7d7e7904eb6684085832ac21f827a399
2020-09-25 09:52:52 -07:00
8b00c4c794 [ONNX] Correct a minor typo in warning (#45187)
Summary:
The warning for batch_norm was mentioning dropout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45187

Reviewed By: glaringlee

Differential Revision: D23873215

Pulled By: ezyang

fbshipit-source-id: 1dcc82ad16522215f49b4cd0fc0e357b2094e4f2
2020-09-25 09:26:51 -07:00
b70fac75ac CMake: Fix python dependencies in codegen (#45275)
Summary:
I noticed while working on https://github.com/pytorch/pytorch/issues/45163 that edits to python files in the  `tools/codegen/api/` directory wouldn't trigger rebuilds. This tells CMake about all of the dependencies, so rebuilds are triggered automatically.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45275

Reviewed By: zou3519

Differential Revision: D23922805

Pulled By: ezyang

fbshipit-source-id: 0fbf2b6a9b2346c31b9b0384e5ad5e0eb0f70e9b
2020-09-25 09:16:38 -07:00
78fcde9c50 Trace scattered tensor options arguments (#44071)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44071

Previously, tracing re-gathered ScalarType, Layout, Device, bool into a TensorOptions object and called `tracer::addInput()` on the gathered TensorOptions argument. `tracer::addInput()` then scattered them again and added the individual scattered arguments to the traced graph. This PR avoids the extraneous gathering and re-scattering step and calls `tracer::addInput()` on the individual arguments directly. This avoids the perf hit of an unnecessary gathering step.

This applies to both c10-full and non-c10-full ops. In the case of c10-full ops, the tracing kernels takes scattered arguments and we can directly pass them to `tracer::addInput()`. In the case of non-c10-full ops, the kernel takes a `TensorOptions` argument but we still call `tracer::addInput()` on the scattered arguments.
ghstack-source-id: 112825793

Test Plan:
waitforsandcastle

vs master: https://www.internalfb.com/intern/fblearner/details/216129483/

vs previous diff: https://www.internalfb.com/intern/fblearner/details/216170069/

Reviewed By: ezyang

Differential Revision: D23486638

fbshipit-source-id: e0b53e6673cef8d7f94158e718301eee261e5d22
2020-09-25 09:04:06 -07:00
2ac7de7d53 Remove hacky_wrapper from BackendSelect kernels (#44062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44062

Previously, BackendSelect kernels were still written in the legacy way, i.e. they took one TensorOptions argument instead of scattered dtype, layout, device, pin_memory, and they used hacky_wrapper to be callable. This caused a re-wrapping step. Calling into a BackendSelect kernel required taking the individual scattered arguments, packing them into a TensorOptions, and the kernel itself then gathered them again for redispatch.

Now with this PR, BackendSelect kernels are written in the new way and no hacky_wrapper or rewrapping is needed for them.
ghstack-source-id: 112825789

Test Plan:
vs master: https://www.internalfb.com/intern/fblearner/details/216117032/

vs previous diff: https://www.internalfb.com/intern/fblearner/details/216170194/

Reviewed By: ezyang

Differential Revision: D23484192

fbshipit-source-id: e8fb49c4692404b6b775d18548b990c4cdddbada
2020-09-25 09:04:03 -07:00
043bd51b48 Remove hacky_wrapper from VariableType and TraceType (#44005)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44005

Previously, VariableType and TraceType kernels were still written in the legacy way, i.e. they took one TensorOptions argument instead of scattered dtype, layout, device, pin_memory,  and they used hacky_wrapper to be callable.

Now with this PR, variable and tracing kernels are written in the new way and no hacky_wrapper is needed for them.
ghstack-source-id: 112825791

Test Plan:
waitforsandcastle

https://www.internalfb.com/intern/fblearner/details/215954270/

Reviewed By: ezyang

Differential Revision: D23466042

fbshipit-source-id: bde730a9e3bb4cb80ad484417be1ebecbdc2d377
2020-09-25 09:01:34 -07:00
bf8cd21f2a Py transformer coder test (#43976)
Summary:
Fixes #37756 (https://github.com/pytorch/pytorch/issues/37756)

Added the missing Transformer coder Python tests, ported from the C++ API test scripts

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43976

Reviewed By: jamesr66a

Differential Revision: D23873250

Pulled By: glaringlee

fbshipit-source-id: cdeae53231e02208463e7629ba2c1f00990150ea
2020-09-25 08:22:24 -07:00
2739a7c599 Byte-for-byte compatibility fixes in codegen (#44879)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44879

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23825163

Pulled By: bdhirsh

fbshipit-source-id: 4d8028274f82c401b393c4fe1b9e32de3f4909c6
2020-09-25 08:06:50 -07:00
00e704e757 [fix] torch.repeat : dim-0 backward (#45212)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45201

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45212

Reviewed By: mrshenli

Differential Revision: D23905545

Pulled By: albanD

fbshipit-source-id: c5bf9cf481c8cf3ccc1fdbfb364006b29f67dc9f
2020-09-25 07:53:00 -07:00
76ee58e2ec [TensorExpr] Move inner loops vectorization logic to its own method (#45287)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45287

Test Plan: CI, build

Reviewed By: gmagogsfm

Differential Revision: D23913432

Pulled By: asuhan

fbshipit-source-id: 3bf8fe09753f349e3c857863a43d2b1fca5101c1
2020-09-25 02:29:36 -07:00
241afc9188 Migrate addr from the TH to Aten (CPU) (#44364)
Summary:
Related https://github.com/pytorch/pytorch/issues/24507
Fixes https://github.com/pytorch/pytorch/issues/24666

This PR modernizes the CPU implementation of the vector `outer product`.
The existing TH implementation for `torch.addr` is migrated to `aten`; `torch.ger` delegates to the `addr` functions to calculate the outer product.
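
For reference, a small sketch of the op being migrated (pure PyTorch usage, unaffected by which backend implements it):

```python
import torch

M = torch.zeros(3, 2)
v1 = torch.arange(1.0, 4.0)   # shape (3,)
v2 = torch.arange(1.0, 3.0)   # shape (2,)
# addr computes beta * M + alpha * outer(v1, v2); defaults are beta=alpha=1
out = torch.addr(M, v1, v2)
```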

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44364

Reviewed By: ezyang

Differential Revision: D23866733

Pulled By: mruberry

fbshipit-source-id: 5159ea22f0e3c991123fe7c19cc9beb6ad00301e
2020-09-25 01:18:09 -07:00
99e0a87bbb [nvFuser] Latency improvements for pointwise + reduction fusion (#45218)
Summary:
A lot of changes are in this update, some highlights:

- Added Doxygen config file
- Split the fusion IR (higher level TE like IR) from kernel IR (lower level CUDA like IR)
- Improved latency with dynamic shape handling for the fusion logic
- Prevent recompilation for pointwise + reduction fusions when not needed
- Improvements to inner dimension reduction performance
- Added input -> kernel + kernel launch parameters cache, added eviction policy
- Added reduction fusions with multiple outputs (still single reduction stage)
- Fixed code generation bugs for symbolic tiled GEMM example
- Added thread predicates to prevent shared memory from being loaded multiple times
- Improved sync threads placements with shared memory and removed read before write race
- Fixes to FP16 reduction fusions where output would come back as FP32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45218

Reviewed By: ezyang

Differential Revision: D23905183

Pulled By: soumith

fbshipit-source-id: 12f5ad4cbe03e9a25043bccb89e372f8579e2a79
2020-09-24 23:17:20 -07:00
95df8657c9 Enables test linalg (#45278)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45271.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45278

Reviewed By: ngimel

Differential Revision: D23926124

Pulled By: mruberry

fbshipit-source-id: 26692597f9a1988e5fa846f97b8430c3689cac27
2020-09-24 23:09:38 -07:00
bdf329ef8a SyncBN: preserve qconfig if it exists (#45317)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45317

Eager mode quantization depends on the presence of the `qconfig`
model attribute.  Currently, converting a model to use `SyncBatchNorm`
removes the qconfig - fixing this.  This is important if a BN is not
fused to anything during quantization convert.
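
A small sketch of the behavior this fixes; the assert describes the intended post-fix behavior:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.BatchNorm2d(4))
model[0].qconfig = torch.quantization.default_qconfig
sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
assert hasattr(sync_model[0], "qconfig")  # preserved after this change
```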

Test Plan:
```
python test/test_quantization.py TestDistributed.test_syncbn_preserves_qconfig
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23922072

fbshipit-source-id: cc1bc25c8e5243abb924c6889f78cf65a81be158
2020-09-24 22:52:07 -07:00
103fa3894a Revert D23841786: [pytorch][PR] Enable distributed package on windows, Gloo backend supported only
Test Plan: revert-hammer

Differential Revision:
D23841786 (0122299f9b)

Original commit changeset: 334ba1ed73ef

fbshipit-source-id: ec95432f9957df56a5a04e52661f5db920b7f57f
2020-09-24 22:44:33 -07:00
bc3151dee0 [quant] Remove unused qconfig argument in qat linear module (#45307)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45307

fixes: https://github.com/pytorch/pytorch/issues/35634

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23917339

fbshipit-source-id: 65f8844b98198bbf93547b3d71408c2a54605218
2020-09-24 22:15:16 -07:00
31ae8117ba [RFC] Remove per-op-registration related code in caffe2/tools/codegen/gen.py (#45134)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45134

Per-Op-Registration was a mechanism used for mobile selective build v0. Since then, a new dispatching mechanism has been built for PyTorch, and this code path isn't used any more. Remove it to simplify understanding/updating the code-generator's code-flow.
ghstack-source-id: 112723942

Test Plan: `buck build` and sandcastle.

Reviewed By: ezyang

Differential Revision: D23806632

fbshipit-source-id: d93cd324650c541d9bfc8eeff2ddb2833b988ecc
2020-09-24 22:02:49 -07:00
0122299f9b Enable distributed package on windows, Gloo backend supported only (#42897)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42095

Test cases will be committed to this PR later

mrshenli, please help to review

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42897

Reviewed By: osalpekar

Differential Revision: D23841786

Pulled By: mrshenli

fbshipit-source-id: 334ba1ed73eff2f668857390fc32d1bc7f08e5f3
2020-09-24 21:13:55 -07:00
c6500bcf14 [reland] Make grad point to bucket buffer in DDP to save memory usage (#44344)
Summary:
[test all]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44344

reland #41954

Add one argument to the DDP API to enable/disable letting grads point to views. When it is disabled, the behavior is the same as DDP today; when it is enabled, both variable.grad() and the grad in the dist autograd context point to the bucket buffer in DDP to save memory.
In this case, grad will be a view of the bucket buffer tensors; in order to make this compatible with optimizer.zero_grad(), we
made changes in #41283.

Also note that we cannot make variable.grad() point to the bucket buffer at construction time, because we want to
keep grad undefined for unused parameters.
ghstack-source-id: 112845787

Test Plan:
1. When grad_is_view=false:
a. roberta_base, peak memory usage 8250MB, p50 per iteration latency 0.923second, https://www.internalfb.com/intern/fblearner/details/218029699/?notif_channel=cli
b. resnet, peak memory usage 3089MB, p50 per iteration latency 0.120second, https://www.internalfb.com/intern/fblearner/details/218029035/?notif_channel=cli
c. accuracy benchmark, distributed=false, .accuracy 40.914535522461, .loss: 1.6370717287064; distributed=true, .accuracy: 39.966053009033, .loss: 1.6849111318588
https://www.internalfb.com/intern/fblearner/details/218035688/?notif_channel=cli
d. classy vision uru production flow, https://www.internalfb.com/intern/fblearner/details/219065811/?notif_channel=cli
e. pytext flow, https://www.internalfb.com/intern/fblearner/details/219137458/?notif_channel=cli

2. When grad_is_view=true:
a. roberta_base, peak memory usage 7183MB, p50 per iteration latency 0.908second, https://www.internalfb.com/intern/fblearner/details/217882539?tab=operator_details
b. resnet, peak memory usage 2988 MB, p50 per iteration latency 0.119second, https://www.internalfb.com/intern/fblearner/details/218028479/?notif_channel=cli
c. accuracy benchmark, distributed=false, .accuracy 41.713260650635, .loss: 1.69939661026; distributed=true, .accuracy: 39.966053009033, .loss: 1.6849111318588, https://www.internalfb.com/intern/fblearner/details/218037058/?notif_channel=cli
d. classy vision uru production flow, expected, can not work well with apex.amp https://www.internalfb.com/intern/fblearner/details/219205218/?notif_channel=cli
e. pytext flow, detach_() related error, expected, as pytext zero_grad depends on apex repo where detach_() is called. also seeing the warning in finalize_bucket_dense due to tied weights, which is expected. https://www.internalfb.com/intern/fblearner/details/219150229/?notif_channel=cli

Reviewed By: mrshenli

Differential Revision: D23588186

fbshipit-source-id: f724d325b954ef6f06ede31759bf01dd29a6f5e5
2020-09-24 20:54:51 -07:00
630bd85aae [pytorch] refine dispatch keys in native_functions.yaml (2/N) (#45284)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45284

This is the 2nd batch of the change described in #45010.

In this batch we relaxed some filters to cover more 'backend specific' ops:
* ops that do not call any 'Tensor::is_xxx()' method OR only call
  'Tensor::is_cuda()' - we are adding CUDA dispatch key anyway;
* ops that call other ATen ops but ARE differentiable - differentiability
  is a fuzzy indicator of not being 'composite';

Inherited other filters from the 1st batch:
* These ops don't already have dispatch section in native_functions.yaml;
* These ops call one or more DispatchStub (thus "backend specific");

Differential Revision: D23909901

Test Plan: Imported from OSS

Reviewed By: ailzhang

Pulled By: ljk53

fbshipit-source-id: 3b31e176324b6ac814acee0b0f80d18443bd81a1
2020-09-24 20:18:57 -07:00
7e5492e1be [minor] Fix undefined variable (#45246)
Summary:
The commit 2a37f3fd2f https://github.com/pytorch/pytorch/pull/45130 deleted the python variable `capability` which is used in later lines.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45246

Reviewed By: walterddr

Differential Revision: D23923916

Pulled By: malfet

fbshipit-source-id: c5d7fef9e4a87ccc621191200e5965710e9d6aaa
2020-09-24 20:17:13 -07:00
0f2c648c97 log metadata when model loading failed (#44430)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44430

Log metadata even when model loading fails

Test Plan: {F331550976}

Reviewed By: husthyc

Differential Revision: D23577711

fbshipit-source-id: 0504e75625f377269f1e5df0f1ebe34b8e564c4b
2020-09-24 20:09:22 -07:00
03dde4c62a Resend diff D23858329 (#45315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45314

In D23858329 (721cfbf842), we put the PriorCorrectionCalibrationPrediction unit test in an OSS file, which causes a test failure in public trunk.

This diff moves it to an FB-only test file.

Test Plan:
```
 buck test //caffe2/caffe2/python/operator_test:torch_integration_test -- test_gather_ranges_to_dense_op

buck test //caffe2/caffe2/fb/python/operator_test:torch_integration_test -- test_prior_correct_calibration_prediction_op
```
all pass.

Reviewed By: houseroad

Differential Revision: D23899012

fbshipit-source-id: 1ed97d8702e2765991e6caf5695d4c49353dae82
2020-09-24 18:41:49 -07:00
677a59dcaa [aten] Call fbgemm functions for embedding prepack/unpack (#44845)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44845

fbgemm functions are vectorized and faster

```
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/6473924484856786
Summary (total time 15.08s):
  PASS: 7
  FAIL: 0
  SKIP: 0
  FATAL: 0
  TIMEOUT: 0
  OMIT: 0
```

Performance Before:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 68.727

# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 131.500

# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 248.190

# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 172.742

# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 333.008

# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 652.423

# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 167.282

# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 398.901

# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 785.254

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 122.653

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 230.617

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 408.807

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 176.087

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 337.514

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 659.716

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 342.529

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 665.197

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 1307.923
```

Performance After:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 10.782

# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 17.443

# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 25.898

# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 13.903

# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 18.575

# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 30.650

# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 14.158

# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 19.818

# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 30.852

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 47.596

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 91.025

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 131.425

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 12.637

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 20.856

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 33.944

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 21.181

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 34.213

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 59.622
```
ghstack-source-id: 112836216

Test Plan: buck test //caffe2/test:quantization -- 'test_embedding_bag*'  --print-passing-details

Reviewed By: radkris-git

Differential Revision: D23675777

fbshipit-source-id: 0b1a787864663daecc7449295f9ab6264eac52fc
2020-09-24 17:21:03 -07:00
0b6e5ad4a9 Resolve comments in #44354. (#45150)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45150

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D23846796

Pulled By: ailzhang

fbshipit-source-id: 7bef89d833848ac3f8993c4c037acf1d4f2ca674
2020-09-24 16:40:02 -07:00
92ebb04f92 added check for NumberType (#44375)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44107

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44375

Reviewed By: mrshenli

Differential Revision: D23906728

Pulled By: eellison

fbshipit-source-id: 3b534e5dd3af1f5e43a7314953e64117cbe8ffe4
2020-09-24 16:26:59 -07:00
bee1d448e7 Fix test_rpc_profiling_remote_record_function (#45162)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45162

This test was flaky because it was not able to validate that the
overall record_function's CPU times are greater than the sum of its children.
It turns out that this is a general bug in the profiler that can be reproduced
without RPC, see https://github.com/pytorch/pytorch/issues/45160. Hence,
removing this from the test and replacing it by just validating the expected
children.

Ran the test 1000 times and they all passed.
ghstack-source-id: 112632327

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23851854

fbshipit-source-id: 5d9023acd17800a6668ba4849659d8cc902b8d6c
2020-09-24 15:57:32 -07:00
5dd288eb06 [JIT] Regularize tensorexpr fuser strategy with other fusers (#44972)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44972

Previously, our fusion strategy was:
- start at the end of the block and find a fusible node
- iteratively try to merge inputs into the fusion group, sorted topologically

This strategy works pretty well, but has the possibility of missing fusion groups. See my attached test case for an example where we wouldn't find all possible fusion groups. bertmaher found an example of a missed fusion group in one of our rnn examples (jit_premul) that caused a regression from the legacy fuser.

Here, I'm updating our fusion strategy to be the same as our other fusion passes - create_autodiff_subgraphs, and graph_fuser.cpp.

The basic strategy is:
- iterate until you find a fusible node
- try to merge the node's inputs; whenever a successful merge occurs, restart at the beginning of the node's inputs
- after you've exhausted a node, continue searching the block for fusion opportunities from that node
- continue doing this on the block until we go through an iteration without any successful merges

Since we create the fusion groups once and only re-specialize within the fusion groups, we should be running this very infrequently (it only re-triggers when we fail undefinedness specializations). Also, because it's the same algorithm as the existing fuser, it is unlikely to cause a regression.
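
A toy sketch of the fixed-point loop described above; the data structures and predicates are stand-ins, not the fuser's real code:

```python
def fuse_block(nodes, is_fusible, can_merge):
    """Repeat over the block until an iteration makes no successful merges."""
    changed = True
    while changed:
        changed = False
        for group in (n for n in nodes if is_fusible(n)):
            merged = True
            while merged:  # restart the input scan after every merge
                merged = False
                for producer in list(group["inputs"]):
                    if can_merge(group, producer):
                        group["inputs"].remove(producer)
                        group["fused"].append(producer)
                        merged = changed = True
```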

Test Plan: Imported from OSS

Reviewed By: Krovatkin, robieta

Differential Revision: D23821581

Pulled By: eellison

fbshipit-source-id: e513d1ef719120dadb0bfafc7a14f4254cd806ee
2020-09-24 15:34:21 -07:00
0137e3641d Refactor subgraph merging (#44238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44238

Refactor create_autodiff_subgraphs to use the same logic for updating output aliasing properties as the tensorexpr fuser, and factor that out into a common function in subgraph utils.

Test Plan: Imported from OSS

Reviewed By: Krovatkin, robieta

Differential Revision: D23871565

Pulled By: eellison

fbshipit-source-id: 72df253b16baf8e4aabf3d68b103b29e6a54d44c
2020-09-24 15:29:34 -07:00
1539d4a664 Add operator to compute the equalization scale (#45096)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45096

Add an operator to compute the equalization scale. This will be used in the integration of equalization into the dper int8 fixed quant scheme quantization flow.

Design docs:
https://fb.quip.com/bb7SAGBxPGNC

https://fb.quip.com/PDAOAsgoLfRr

Test Plan: buck test caffe2/caffe2/quantization/server:compute_equalization_scale_test

Reviewed By: jspark1105

Differential Revision: D23779870

fbshipit-source-id: 5e6a8c220935a142ecf8e61100a8c71932afa8d7
2020-09-24 15:19:49 -07:00
5a59330647 Add architectural support for multi-GPU. (#44059)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44059

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23820825

Pulled By: AshkanAliabadi

fbshipit-source-id: 0719b00581487a77ebadff867d1e4ac89354bf90
2020-09-24 15:11:55 -07:00
6311c5a483 Minor touchups. (#44317)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44317

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23820828

Pulled By: AshkanAliabadi

fbshipit-source-id: b83bdea9aed2fb52bd254ff15914d55a1af58c04
2020-09-24 15:07:08 -07:00
b84dd771e6 Grammatically updated the tech docs (#45192)
Summary:
Small grammatical update to the https://pytorch.org/docs/stable/tensors.html docs.

**_update1_**
![update1](https://user-images.githubusercontent.com/62737243/93969792-5c0ea800-fd8a-11ea-8c9f-0033f51a1fdc.png)

**_update2_**
![update2](https://user-images.githubusercontent.com/62737243/93969801-603ac580-fd8a-11ea-812d-d3026b9fc8a5.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45192

Reviewed By: bwasti

Differential Revision: D23877870

Pulled By: ezyang

fbshipit-source-id: 929ba3d479925b5132dbe87fad2da487408db7c7
2020-09-24 14:48:30 -07:00
cd7a682282 [caffe2] adds hypothesis test for queue ops cancel (#45178)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45178

## Motivation
* To be able to make C2 ops cancellable so we can safely exit.
* Some C2 operators are blocking and thus non-cancellable. If an error
occurs, we need to be able to safely stop all net execution so we can throw
the exception to the caller.

## Summary
* Adds a hypothesis test for queue ops cancellation.

Test Plan:
## Unit test added to verify that queue ops propagate errors

```
buck test caffe2/caffe2/python:hypothesis_test
buck test caffe2/caffe2/python:hypothesis_test -- test_safe_dequeue_blob__raises_exception_when_hang --stress-runs 1000
```

```
Summary
  Pass: 1000
  ListingSuccess: 1
```

Reviewed By: d4l3k

Differential Revision: D23847576

fbshipit-source-id: 2fc351e1ee13ea8b32d976216d2d01dfb6fcc1ad
2020-09-24 14:43:52 -07:00
71e6ce6616 [JIT] Specialize AutogradZero: merge AutogradAnyNonZero and Not(AutogradAnyNonZero) checks into one. (#44987)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44987

This PR introduces new `prim::AutogradAllZero` and
`prim::AutogradAllNonZero` ops that are used for a batched check over
multiple tensors. The specialize-autogradzero pass now generates one
check for all expected-to-be-undefined tensors, one check for all
expected-to-be-defined tensors, and a bunch of checks for size
parameters passed to `grad_sum_to_size` (this could probably be cleaned
up as well in the future).

An example of what we generated before this change:
```
%1626 : bool = prim::AutogradAnyNonZero(%0)
%1627 : bool = prim::AutogradAnyNonZero(%2)
%1628 : bool = aten::__not__(%1627)
%1629 : bool = prim::AutogradAnyNonZero(%3)
%1630 : bool = aten::__not__(%1629)
%1631 : bool = prim::AutogradAnyNonZero(%4)
%1632 : bool = aten::__not__(%1631)
%1633 : bool = prim::AutogradAnyNonZero(%5)
%1634 : bool = aten::__not__(%1633)
%1635 : bool = prim::AutogradAnyNonZero(%6)
%1636 : bool = aten::__not__(%1635)
%1637 : bool = prim::AutogradAnyNonZero(%7)
%1638 : bool = aten::__not__(%1637)
%1639 : bool = prim::AutogradAnyNonZero(%8)
%1640 : bool = aten::__not__(%1639)
%1641 : bool = prim::AutogradAnyNonZero(%9)
%1642 : bool = aten::__not__(%1641)
%1643 : bool = prim::AutogradAnyNonZero(%10)
%1644 : bool = aten::__not__(%1643)
%1645 : bool = prim::AutogradAnyNonZero(%11)
%1646 : bool = aten::__not__(%1645)
%1647 : bool = prim::AutogradAnyNonZero(%12)
%1648 : bool = aten::__not__(%1647)
%1649 : bool = prim::AutogradAnyNonZero(%13)
%1650 : bool = aten::__not__(%1649)
%1651 : bool = prim::AutogradAnyNonZero(%14)
%1652 : bool = aten::__not__(%1651)
%1653 : bool = prim::AutogradAnyNonZero(%15)
%1654 : bool = aten::__not__(%1653)
%1655 : bool = prim::AutogradAnyNonZero(%16)
%1656 : bool = aten::__not__(%1655)
%1657 : bool = prim::AutogradAnyNonZero(%17)
%1658 : bool = prim::AutogradAnyNonZero(%18)
%1659 : bool = prim::AutogradAnyNonZero(%19)
%1660 : bool = prim::AutogradAnyNonZero(%20)
%1661 : bool = aten::__is__(%self_size.16, %1625)
%1662 : bool = aten::__is__(%other_size.16, %1625)
%1663 : bool = aten::__is__(%self_size.14, %1625)
%1664 : bool = aten::__is__(%self_size.12, %1625)
%1665 : bool = prim::AutogradAnyNonZero(%ingate.7)
%1666 : bool = prim::AutogradAnyNonZero(%forgetgate.7)
%1667 : bool = prim::AutogradAnyNonZero(%cellgate.7)
%1668 : bool = prim::AutogradAnyNonZero(%30)
%1669 : bool = prim::AutogradAnyNonZero(%31)
%1670 : bool = aten::__is__(%self_size.10, %1625)
%1671 : bool = aten::__is__(%other_size.10, %1625)
%1672 : bool = prim::AutogradAnyNonZero(%34)
%1673 : bool = prim::AutogradAnyNonZero(%35)
%1674 : bool = aten::__is__(%self_size.8, %1625)
%1675 : bool = aten::__is__(%other_size.8, %1625)
%1676 : bool = aten::__is__(%self_size.6, %1625)
%1677 : bool = aten::__is__(%other_size.6, %1625)
%1678 : bool = prim::AutogradAnyNonZero(%outgate.7)
%1679 : bool = prim::AutogradAnyNonZero(%41)
%1680 : bool = prim::AutogradAnyNonZero(%42)
%1681 : bool = prim::AutogradAnyNonZero(%43)
%1682 : bool = aten::__is__(%self_size.4, %1625)
%1683 : bool = aten::__is__(%other_size.4, %1625)
%1684 : bool[] = prim::ListConstruct(%1626, %1628, %1630, %1632, %1634, %1636, %1638, %1640, %1642, %1644, %1646, %1648, %1650, %1652, %1654, %1656, %1657, %1658, %1659, %1660, %1661, %1662, %1663, %1664, %1665, %1666, %1667, %1668, %1669, %1670, %1671, %1672, %1673, %1674, %1675, %1676, %1677, %1678, %1679, %1680, %1681, %1682, %1683)
%1685 : bool = aten::all(%1684)
```

Same example after this change:
```
%1625 : None = prim::Constant()
%1626 : bool = aten::__is__(%self_size.16, %1625)
%1627 : bool = aten::__is__(%other_size.16, %1625)
%1628 : bool = aten::__is__(%self_size.14, %1625)
%1629 : bool = aten::__is__(%self_size.12, %1625)
%1630 : bool = aten::__is__(%self_size.10, %1625)
%1631 : bool = aten::__is__(%other_size.10, %1625)
%1632 : bool = aten::__is__(%self_size.8, %1625)
%1633 : bool = aten::__is__(%other_size.8, %1625)
%1634 : bool = aten::__is__(%self_size.6, %1625)
%1635 : bool = aten::__is__(%other_size.6, %1625)
%1636 : bool = aten::__is__(%self_size.4, %1625)
%1637 : bool = aten::__is__(%other_size.4, %1625)
%1638 : bool = prim::AutogradAllNonZero(%0, %17, %18, %19, %20, %ingate.7, %forgetgate.7, %cellgate.7, %30, %31, %34, %35, %outgate.7, %41, %42, %43)
%1639 : bool = prim::AutogradAllZero(%2, %3, %4, %5, %6, %7, %8, %9, %10, %11, %12, %13, %14, %15, %16)
%1640 : bool[] = prim::ListConstruct(%1626, %1627, %1628, %1629, %1630, %1631, %1632, %1633, %1634, %1635, %1636, %1637, %1638, %1639)
%1641 : bool = aten::all(%1640)
```

My performance measurements showed some changes, but I don't really
trust them and think that they are probably just noise. Below are
tables with min-aggregation over 10 runs:

FastRNN models:

| name                                             | base time (s) |   diff time (s) |   % change |
| :---                                             |          ---: |            ---: |       ---: |
| lstm[aten]:bwd                                   |     30.059927 |       29.834089 |      -0.8% |
| lstm[aten]:fwd                                   |     25.673708 |       25.700039 |       0.1% |
| lstm[cudnn]:bwd                                  |     17.866232 |       17.893120 |       0.2% |
| lstm[cudnn]:fwd                                  |     11.418444 |       11.408514 |      -0.1% |
| lstm[jit]:bwd                                    |     27.127205 |       27.141029 |       0.1% |
| lstm[jit]:fwd                                    |     17.018047 |       16.975451 |      -0.3% |
| lstm[jit_multilayer]:bwd                         |     27.502396 |       27.365149 |      -0.5% |
| lstm[jit_multilayer]:fwd                         |     16.918591 |       16.917767 |      -0.0% |
| lstm[jit_premul]:bwd                             |     22.281199 |       22.215082 |      -0.3% |
| lstm[jit_premul]:fwd                             |     14.848708 |       14.896231 |       0.3% |
| lstm[jit_premul_bias]:bwd                        |     20.761206 |       21.170969 |       2.0% |
| lstm[jit_premul_bias]:fwd                        |     15.013515 |       15.037978 |       0.2% |
| lstm[jit_simple]:bwd                             |     26.715771 |       26.697786 |      -0.1% |
| lstm[jit_simple]:fwd                             |     16.675898 |       16.545893 |      -0.8% |
| lstm[py]:bwd                                     |     56.327065 |       54.731030 |      -2.8% |
| lstm[py]:fwd                                     |     39.876324 |       39.230572 |      -1.6% |

Torch Hub models:

| name                                             | base time (s) |   diff time (s) |   % change |
| :---                                             |          ---: |            ---: |       ---: |
| test_eval[BERT_pytorch-cuda-jit]                 |      0.111706 |        0.106604 |      -4.6% |
| test_eval[LearningToPaint-cuda-jit]              |      0.002841 |        0.002801 |      -1.4% |
| test_eval[Super_SloMo-cuda-jit]                  |      0.384869 |        0.384737 |      -0.0% |
| test_eval[attension_is_all_you_nee...-cuda-jit]  |      0.123857 |        0.123923 |       0.1% |
| test_eval[demucs-cuda-jit]                       |      0.077270 |        0.076878 |      -0.5% |
| test_eval[fastNLP-cuda-jit]                      |      0.000255 |        0.000249 |      -2.3% |
| test_eval[moco-cuda-jit]                         |      0.426472 |        0.427380 |       0.2% |
| test_eval[pytorch_CycleGAN_and_pix...-cuda-jit]  |      0.026483 |        0.026423 |      -0.2% |
| test_eval[pytorch_mobilenet_v3-cuda-jit]         |      0.036202 |        0.035853 |      -1.0% |
| test_eval[pytorch_struct-cuda-jit]               |      0.001439 |        0.001495 |       3.9% |
| test_train[BERT_pytorch-cuda-jit]                |      0.247236 |        0.247188 |      -0.0% |
| test_train[Background_Matting-cuda-jit]          |      3.536659 |        3.581864 |       1.3% |
| test_train[LearningToPaint-cuda-jit]             |      0.015341 |        0.015331 |      -0.1% |
| test_train[Super_SloMo-cuda-jit]                 |      1.018626 |        1.019098 |       0.0% |
| test_train[attension_is_all_you_nee...-cuda-jit] |      0.446314 |        0.444893 |      -0.3% |
| test_train[demucs-cuda-jit]                      |      0.169647 |        0.169846 |       0.1% |
| test_train[fastNLP-cuda-jit]                     |      0.001990 |        0.001978 |      -0.6% |
| test_train[moco-cuda-jit]                        |      0.855323 |        0.856974 |       0.2% |
| test_train[pytorch_mobilenet_v3-cuda-jit]        |      0.497723 |        0.485416 |      -2.5% |
| test_train[pytorch_struct-cuda-jit]              |      0.309692 |        0.308792 |      -0.3% |

Differential Revision: D23794659

Test Plan: Imported from OSS

Reviewed By: bertmaher

Pulled By: ZolotukhinM

fbshipit-source-id: 859b68868ef839c5c6cbc7021879ee22d3144ea8
2020-09-24 14:31:49 -07:00
cbe1eac1f4 [caffe2] adds Cancel to SafeDequeueBlobsOp and SafeEnqueueBlobsOp (#45177)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45177

## Motivation
* To be able to make C2 ops cancellable so we can safely exit.
* Some C2 operators are currently blocking and thus non-cancellable. If an error
occurs, we need to be able to safely stop all net execution so we can throw
the exception to the caller.

## Summary
* When an error occurs in a net or the net is cancelled, running ops have their
`Cancel` method called.
This diff adds a `Cancel` method to `SafeEnqueueBlobsOp`
and `SafeDequeueBlobsOp` that calls queue->close() to force all
blocking ops to return (see the sketch below).
* Adds a unit test that verifies the error propagation.
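A toy Python analogue of the cancellation pattern (this is not the Caffe2 implementation; `ClosableQueue` and its methods are invented for illustration): closing the queue is what unblocks an op stuck in a blocking dequeue.

```python
import queue
import threading

class ClosableQueue(queue.Queue):
    """Toy analogue of the C2 queue: close() wakes up blocked consumers."""
    _CLOSED = object()

    def close(self):
        self.put(self._CLOSED)        # analogous to queue->close() in Cancel

    def safe_dequeue(self):
        item = self.get()             # blocks, like SafeDequeueBlobsOp
        if item is self._CLOSED:
            print("dequeue cancelled: queue closed")
            return None
        return item

q = ClosableQueue()
worker = threading.Thread(target=q.safe_dequeue)
worker.start()
q.close()                             # Cancel(): lets the blocked worker exit
worker.join()
```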

Test Plan:
## Unit test added to verify that queue ops propagate errors

```
buck test caffe2/caffe2/python:hypothesis_test -- test_safe_dequeue_blob__raises_exception_when_hang --stress-runs 1000
```

```
Summary
  Pass: 1000
  ListingSuccess: 1
```

Reviewed By: d4l3k

Differential Revision: D23846967

fbshipit-source-id: c7ddd63259e033ed0bed9df8e1b315f87bf59394
2020-09-24 14:22:46 -07:00
022ba5a78b Make ddp_comm_hook_wrapper a private method. (#44643)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44643

This method is not used anywhere else.

Also formatted the file.

Test Plan: buck test caffe2/test/distributed/algorithms/ddp_comm_hooks:test_ddp_hooks

Reviewed By: pritamdamania87

Differential Revision: D23675945

fbshipit-source-id: 2d04f94589a20913e46b8d71e6a39b70940c1461
2020-09-24 13:29:48 -07:00
e2bcdc7b69 [Caffe2] Fix LayerNormOp when batch_size == 0. (#45250)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45250

[Caffe2] Fix LayerNormOp when batch_size == 0.

Test Plan: buck test mode/dev-nosan //caffe2/caffe2/python/operator_test:layer_norm_op_test

Reviewed By: houseroad

Differential Revision: D23892091

fbshipit-source-id: 9a34654dd6880c9d14b7111fcf850e4f48ffdf91
2020-09-24 12:30:03 -07:00
c3a5aed5f7 Run pytorch_core CUDA tests on GPU using TPX
Summary:
Modify contbuild to disable sanitizers, add option to run "cuda" test using TPX RE

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: walterddr, cspanda

Differential Revision: D23854578

fbshipit-source-id: 327d7cc3655c17034a6a7bc78f69967403290623
2020-09-24 12:12:23 -07:00
c211a9102f add rocm 3.8 to nightly builds (#45222)
Summary:
Corresponding change in builder repo: https://github.com/pytorch/builder/pull/528.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45222

Reviewed By: ezyang

Differential Revision: D23894831

Pulled By: walterddr

fbshipit-source-id: c6a256ec325ddcf5836b4d293f546368d58db538
2020-09-24 12:00:30 -07:00
26001a2334 Revert D23753711: [pytorch][PR] Add foreach APIs for binary ops with ScalarList
Test Plan: revert-hammer

Differential Revision:
D23753711 (71d1b5b0e2)

Original commit changeset: bf3e8c54bc07

fbshipit-source-id: 192692e0d3fff4cade9983db0a1760fedfc9674c
2020-09-24 11:55:49 -07:00
c79d493096 added rocm 3.8 docker image (#45205)
Summary:
jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45205

Reviewed By: malfet

Differential Revision: D23906606

Pulled By: walterddr

fbshipit-source-id: 604a12bf4c97260215a1881cc96e35e7c42b4578
2020-09-24 11:18:33 -07:00
3f5eee666c Adjust TF32 tests (#44240)
Summary:
- The thresholds of some tests are bumped up. Depending on the random generator, these tests sometimes fail with errors like "0.0059 is not smaller than 0.005". I ran `test_nn.py` and `test_torch.py` 10+ times to check that they are no longer flaky.
- Add `tf32_on_and_off` to the new `matrix_exp` tests (usage shown below).
- Disable TF32 on test suites other than `test_nn.py` and `test_torch.py`
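For reference, here is the usage pattern of the decorator (a sketch; the tolerance argument and the test signature are assumed from the description above):

```python
from torch.testing._internal.common_cuda import tf32_on_and_off
from torch.testing._internal.common_utils import TestCase

class TestMatrixExp(TestCase):        # hypothetical test class for illustration
    @tf32_on_and_off(0.005)           # run with TF32 on and off, relaxed tolerance
    def test_matrix_exp(self, device, dtype):
        ...
```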

cc: ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44240

Reviewed By: mruberry

Differential Revision: D23882498

Pulled By: ngimel

fbshipit-source-id: 44a9ec08802c93a2efaf4e01d7487222478b6df8
2020-09-24 10:25:58 -07:00
b8eab8cdbd [hotfix] typo in NaiveConvolutionTranspose2d.cu (#45224)
Summary:
Fixes typo in e2f49c8
Fixes https://github.com/pytorch/pytorch/issues/45172

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45224

Reviewed By: ezyang

Differential Revision: D23879872

Pulled By: walterddr

fbshipit-source-id: c3db6d4c6f2ac0e6887862d4217a79c030647cb9
2020-09-24 10:06:29 -07:00
e57a08119b Add a warning log when there is high skew of uneven inputs in DDP training (#45238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45238

Adds a warning when there is a much higher than expected discrepancy in the
number of inputs across different processes when running with uneven inputs.
A skew in the thousands can reduce performance by a nontrivial amount, as
shown in benchmarks, so it was proposed to add this warning. Tested by
running the tests so that the threshold is hit and observing the output.
ghstack-source-id: 112773552

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23719270

fbshipit-source-id: 306264f62c1de65e733696a912bdb6e9376d5622
2020-09-24 09:50:44 -07:00
2b38c09f69 Moves prim ops from C10 back to JIT (#45144)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45144

Moves prim ops from C10 back to JIT.

These were originally moved to C10 from JIT in D19237648 (f362cd510d)
ghstack-source-id: 112775781

Test Plan:
buck test //caffe2/test/cpp/jit:jit

https://pxl.cl/1l22N

buck test adsatlas/gavel/lib/ata_processor/tests:ata_processor_test

https://pxl.cl/1lBxD

Reviewed By: iseeyuan

Differential Revision: D23697598

fbshipit-source-id: 36d1eb8c346e9b161ba6af537a218440a9bafd27
2020-09-24 09:44:20 -07:00
8507ea22b2 replace timer test with a mocked variant (#45173)
Summary:
I noticed that the recently introduced adaptive_autorange tests occasionally time out in CI, and I've been meaning to improve the Timer tests for a while. This PR allows unit tests to swap the measurement portion of `Timer` with a deterministic mock so we can thoroughly test behavior without having to worry about flaky CI measurements. It also means that the tests can be much more detailed and still finish very quickly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45173

Test Plan: You're lookin' at it.

Reviewed By: ezyang

Differential Revision: D23873548

Pulled By: robieta

fbshipit-source-id: 26113e5cea0cbf46909b9bf5e90c878c29e87e88
2020-09-24 09:42:37 -07:00
bfdf4323ac Bump up NCCL to 2.7.8 (#45251)
Summary:
Use latest NCCL

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45251

Reviewed By: mingzhe09088

Differential Revision: D23893064

Pulled By: mrshenli

fbshipit-source-id: 820dd166039e61a5aa59b4c5bbc615a7b18be8c3
2020-09-24 09:33:57 -07:00
5195d727b5 adding a test for ddp save()/load() (#44906)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44906

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23825386

Pulled By: bdhirsh

fbshipit-source-id: 2276e6e030ef9cffd78fc78c2ffe34d60a1e160e
2020-09-24 09:15:53 -07:00
f9ae296a85 renaming TestDdpCommHook class so it doesn't get picked up as a test by pytest (#44905)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44905

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23825308

Pulled By: bdhirsh

fbshipit-source-id: 17a07b3bd211850d6ecca793fd9ef3f326ca9274
2020-09-24 08:46:25 -07:00
bc591d76a1 add skip_if_rocm to all requires_nccl tests (#45158)
Summary:
The requires_nccl annotation should imply skip_if_rocm as well

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45158

Reviewed By: seemethere

Differential Revision: D23879952

Pulled By: walterddr

fbshipit-source-id: 818fb31ab75d5f02e77fe3f1367faf748855bee7
2020-09-24 08:37:49 -07:00
71d1b5b0e2 Add foreach APIs for binary ops with ScalarList (#44743)
Summary:
In this PR:
1) Added binary operations with ScalarLists.
2) Fixed _foreach_div(...) bug in native_functions
3) Covered all possible cases with scalars and scalar lists in tests
4) [minor] fixed bug in native_functions by adding "use_c10_dispatcher: full" to all _foreach functions

tested via unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44743

Reviewed By: bwasti, malfet

Differential Revision: D23753711

Pulled By: izdeby

fbshipit-source-id: bf3e8c54bc07867e8f6e82b5d3d35ff8e99b5a0a
2020-09-24 08:30:42 -07:00
bea7901e38 Enable torch.tensor typechecks (#45077)
Summary:
this fixes https://github.com/pytorch/pytorch/issues/42983.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45077

Reviewed By: ezyang

Differential Revision: D23842493

Pulled By: walterddr

fbshipit-source-id: 1c516a5ff351743a187d00cba7ed0be11678edf1
2020-09-24 08:22:06 -07:00
dc67b47bc9 Deprecate old fft functions (#44876)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44876

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23866715

Pulled By: mruberry

fbshipit-source-id: 73305eb02f92cbd1ef7d175419529d19358fedda
2020-09-24 02:39:44 -07:00
6d21d5f0b3 gtest-ify JIT tests, through the letter c (#45249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45249

Reland of https://github.com/pytorch/pytorch/pull/45055 and
https://github.com/pytorch/pytorch/pull/45020

See https://github.com/pytorch/pytorch/pull/45018 for context.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23892645

Pulled By: suo

fbshipit-source-id: e7fe58d5e1a5a0c44f4e2aec9694145afabde0fd
2020-09-24 00:21:20 -07:00
29dc3c5ec8 Sparse softmax support (CUDA) (#42307)
Summary:
This PR implements softmax support for sparse tensors.

Resolves gh-23651 for CUDA.

- [x]  sparse softmax
    - [x]  CUDA C++ implementation
    - [x]  unittests
    - [x]  update softmax documentation
    - [x]  autograd support
- [x]  sparse log_softmax
    - [x]  CUDA C++ implementation
    - [x]  unittests
    - [x]  update log_softmax documentation
    - [x]  autograd support

Here are some benchmark results (script is [here](https://gist.github.com/aocsa/fbc1827b3e49901512a33ba96092cbc1)) for `torch.sparse.softmax` and `torch.softmax`, using CPU and GPU; values are float64 scalars; timing repeat is 1000:

| size         | density | sparse CUDA | sparse CPU |
|--------------|---------|-------------|------------|
|  (32, 10000) |   0.01  |    380.2    |    687.5   |
| (32, 10000)  | 0.05    | 404.3       | 2357.9     |
| (32, 10000)  | 0.1     | 405.9       | 3677.2     |
| (512, 10000) | 0.01    | 438.0       | 5443.4     |
| (512, 10000) | 0.05    | 888.1       | 24485.0    |
| (512, 10000) | 0.1     | 1921.3      | 45340.5    |

| size         | density | dense CUDA | dense CPU |
|--------------|---------|-------------|------------|
|  (32, 10000) |   0.01  |     23.6    |   1943.2   |
| (32, 10000)  | 0.05    | 23.6        | 1954.0     |
| (32, 10000)  | 0.1     | 23.5        | 1950.0     |
| (512, 10000) | 0.01    | 639.3       | 39797.9    |
| (512, 10000) | 0.05    | 640.3       | 39374.4    |
| (512, 10000) | 0.1     | 639.6       | 39192.3    |

Times are in microseconds (us).

Quick note:  I updated the performance test again.
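Minimal usage of the API benchmarked above (assumes a CUDA-enabled build):

```python
import torch

x = torch.randn(32, 10000, dtype=torch.float64)
s = x.relu().to_sparse().cuda()                 # roughly 50% density toy input
out = torch.sparse.softmax(s, dim=1)
log_out = torch.sparse.log_softmax(s, dim=1)
```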

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42307

Reviewed By: ngimel

Differential Revision: D23774427

Pulled By: mruberry

fbshipit-source-id: bfabf726075b39dde544c10249f27ae1871f82c7
2020-09-24 00:07:30 -07:00
b3d7c2f978 [ONNX] Update ONNX docs for release (#45086)
Summary:
ONNX doc updates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45086

Reviewed By: ezyang

Differential Revision: D23880383

Pulled By: bzinodev

fbshipit-source-id: ca29782fd73024967ee7708c217a005233e7b970
2020-09-23 23:28:36 -07:00
3dd0e362db [TensorExpr] Fix min and max for integral inputs in CUDA backend (#44984)
Summary:
For integral types, isnan is meaningless. Provide specializations for
maximum and minimum which don't call it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44984

Test Plan: python test/test_jit_fuser_te.py -k TestTEFuser.test_minmax_int_ops

Reviewed By: ezyang

Differential Revision: D23885259

Pulled By: asuhan

fbshipit-source-id: 2e6da2c43c0ed18f0b648a2383d510894c574437
2020-09-23 23:19:12 -07:00
b470fa4500 Add complex number support for binary logical operators (#43174)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43174

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23684425

Pulled By: mruberry

fbshipit-source-id: 4857b16e18ec4c65327136badd7f04c74e32d330
2020-09-23 23:03:00 -07:00
0b6b735863 [fix] type promotion atan2 (#43466)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43466

Reviewed By: malfet

Differential Revision: D23834928

Pulled By: mruberry

fbshipit-source-id: 2e7e0b4fcf1a846efc171c275d65a6daffd3c631
2020-09-23 22:23:05 -07:00
6a2e9eb51c torch.fft: Multi-dimensional transforms (#44550)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44550

Part of the `torch.fft` work (gh-42175).
This adds n-dimensional transforms: `fftn`, `ifftn`, `rfftn` and `irfftn`.

This is aiming for correctness first, with the implementation on top of the existing `_fft_with_size` restrictions. I plan to follow up later with a more efficient rewrite that makes `_fft_with_size` work with arbitrary numbers of dimensions.
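A round-trip example of the new n-dimensional transforms:

```python
import torch

x = torch.randn(4, 8, 8, dtype=torch.float64)
X = torch.fft.fftn(x, dim=(-2, -1))             # complex-valued spectrum
assert torch.allclose(torch.fft.ifftn(X, dim=(-2, -1)).real, x)

r = torch.fft.rfftn(x, dim=(-2, -1))            # real input, half spectrum
back = torch.fft.irfftn(r, s=(8, 8), dim=(-2, -1))
```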

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23846032

Pulled By: mruberry

fbshipit-source-id: e6950aa8be438ec5cb95fb10bd7b8bc9ffb7d824
2020-09-23 22:09:58 -07:00
070fe15e4c Add link to profiling recipe from rpc main docs (#45235)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45235

This is so that users know that the profiler works as expected with
RPC and they can learn how to use it to profile RPC-based workloads.
ghstack-source-id: 112773748

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23777888

fbshipit-source-id: 4805be9b949c8c7929182f291a6524c3c6a725c1
2020-09-23 22:02:38 -07:00
956a25d061 Revert D23858329: [PT Model Split] Support 2 operators in PT by C2 conversion
Test Plan: revert-hammer

Differential Revision:
D23858329 (721cfbf842)

Original commit changeset: ed37118ca7f0

fbshipit-source-id: 30c700f80665be11afc608b00a77766064e60b35
2020-09-23 21:20:21 -07:00
2d00ebd29f Failing test demonstrating problems with mixed output shapes (#44455)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44455

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23886119

Pulled By: bertmaher

fbshipit-source-id: 41787930f154cf4e8a1766613c4cf33b18246555
2020-09-23 21:15:37 -07:00
c760bc8fb1 Add GlowLoadAOTModel flag (#45189)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45189

Pull Request resolved: https://github.com/pytorch/glow/pull/4902

Test Plan: Test locally

Reviewed By: yinghai

Differential Revision: D23810445

fbshipit-source-id: 56e717d80abbfe76b15d0f4249e1e399a9722753
2020-09-23 20:50:04 -07:00
60665ace17 [quant] Add optimized approach to calculate qparams for qembedding_bag (#45149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45149

The choose_qparams_optimized function calculates the optimized qparams.
It uses a greedy approach to nudge the min and max, calculates the L2 norm,
and tries to minimize the quant error via `torch.norm(x - fake_quant(x, s, z))`.
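A hedged sketch of the greedy nudge described above (this is not the actual choose_qparams_optimized implementation; the helper names and step schedule are invented): shrink whichever boundary reduces the L2 quantization error, and stop when neither does.

```python
import torch

def quant_error(x, lo, hi, bits=8):
    levels = 2 ** bits - 1
    scale = max((hi - lo) / levels, 1e-12)
    zero_point = round(-lo / scale)
    q = torch.clamp(torch.round(x / scale + zero_point), 0, levels)
    return torch.norm(x - (q - zero_point) * scale).item()  # L2 quant error

def greedy_qparams(x, steps=50):
    lo, hi = x.min().item(), x.max().item()
    step = (hi - lo) / steps
    err = quant_error(x, lo, hi)
    for _ in range(steps):
        candidates = [(quant_error(x, lo + step, hi), lo + step, hi),
                      (quant_error(x, lo, hi - step), lo, hi - step)]
        best = min(candidates)
        if best[0] >= err:            # no boundary nudge helps any more
            break
        err, lo, hi = best
    return lo, hi

print(greedy_qparams(torch.randn(1000)))
```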

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23848060

fbshipit-source-id: c6c57c9bb07664c3f1c87dd7664543e09f634aee
2020-09-23 19:00:22 -07:00
721cfbf842 [PT Model Split] Support 2 operators in PT by C2 conversion (#45231)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45231

Two operators, `PriorCorrectionCalibrationPrediction` and `GatherRangesToDense`, are not supported in PT, which prevents Glow from working.

To unblock, we first use C2->PT conversion. In the long term, we need to implement PT custom ops.

This diff does this conversion to unblock the current project.

Test Plan:
Run unit test. the Test input is from current DPER example.
All pass.
```buck test //caffe2/caffe2/python/operator_test:torch_integration_test -- test_prior_correct_calibration_prediction_op  --print-passing-details

> c2 reference output
> [0.14285715 0.27272728 0.39130434 0.5 ]

> PT converted output
> tensor([0.1429, 0.2727, 0.3913, 0.5000])

buck test //caffe2/caffe2/python/operator_test:torch_integration_test -- test_gather_ranges_to_dense_op  --print-passing-details

c2 reference output
> [array([[6, 5, 4, 3], [0, 0, 0, 0]], dtype=int64)]

> PT converted output
> [tensor([[6, 5, 4, 3], [0, 0, 0, 0]])]
```

Reviewed By: allwu, qizzzh

Differential Revision: D23858329

fbshipit-source-id: ed37118ca7f09e1cd0ad1fdec3d37f66dce60dd9
2020-09-23 18:31:57 -07:00
27c7158166 Remove __future__ imports for legacy Python2 supports (#45033)
Summary:
There is a tool called `2to3` whose `future` fixer specifically removes these; the `caffe2` directory has the most redundant imports:

```2to3 -f future -w caffe2```
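For example, the fixer simply deletes the compatibility header:

```python
# Before the fixer: a now-redundant Python 2 compatibility header
from __future__ import absolute_import, division, print_function, unicode_literals

print("hello")

# After `2to3 -f future -w`, only the code below the header remains:
# print("hello")
```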

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45033

Reviewed By: seemethere

Differential Revision: D23808648

Pulled By: bugra

fbshipit-source-id: 38971900f0fe43ab44a9168e57f2307580d36a38
2020-09-23 17:57:02 -07:00
e9aa6898ab Revert D23802296: gtest-ify JIT tests, through the letter c
Test Plan: revert-hammer

Differential Revision:
D23802296 (d2b045030e)

Original commit changeset: 20c9798a414e

fbshipit-source-id: a28d56039ca404fe94ed7572f1febd1673e3e788
2020-09-23 17:42:19 -07:00
89c570ed0a Revert D23811085: gtestify dce and fuser tests
Test Plan: revert-hammer

Differential Revision:
D23811085 (246bd9422a)

Original commit changeset: 45008e41f239

fbshipit-source-id: 94c981f565cab9b710fe52a55bbe8dbf9c179c23
2020-09-23 17:27:59 -07:00
76c185dcca [TensorExpr] When lanes differ, insert Broadcast instead of Cast (#45179)
Summary:
We need to check if dtypes differ in scalar type or lanes to decide between
Cast and Broadcast.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45179

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.SimplifyBroadcastTermExpander

Reviewed By: bwasti

Differential Revision: D23873316

Pulled By: asuhan

fbshipit-source-id: ca141be67e10c2b6c5f2ff9c11e42dcfc62ac620
2020-09-23 17:06:54 -07:00
f93ead6d37 [quant][eagermode] Custom module support (#44835)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44835

This is for feature parity with fx graph mode quantization

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23745086

fbshipit-source-id: ae2fc86129f9896d5a9039b73006a4da15821307
2020-09-23 15:39:40 -07:00
0495998862 [TensorExpr] Disallow arithmetic binary operations on Bool (#44677)
Summary:
Arithmetic operations on Bool aren't fully supported in the evaluator. Moreover,
such semantics can be implemented by the client code through insertion of
explicit casts to widen and narrow to the desired types.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44677

Test Plan:
test_tensorexpr --gtest_filter=TensorExprTest.ExprDisallowBoolArithmetic
python test/test_jit_fuser_te.py

Reviewed By: agolynski

Differential Revision: D23801412

Pulled By: asuhan

fbshipit-source-id: fff5284e3a216655dbf5a9a64d1cb1efda271a36
2020-09-23 14:59:11 -07:00
8e0fc711f4 [TensorExpr] Remove unused EvalConstExpr function (#45180)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45180

Test Plan: build

Reviewed By: ezyang

Differential Revision: D23877151

Pulled By: asuhan

fbshipit-source-id: a5d4d211c1dc85e6f7045330606163a933b9474e
2020-09-23 14:55:27 -07:00
2a1a51facb Fix typos. (#45195)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45195

Fix some typos in reducer class.
ghstack-source-id: 112673443

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D23862399

fbshipit-source-id: 0dc69e5ea1fa7d33c85d1909b2216bcd1f579f6a
2020-09-23 14:51:15 -07:00
246bd9422a gtestify dce and fuser tests (#45055)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45055

See https://github.com/pytorch/pytorch/pull/45018 for context.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23811085

Pulled By: suo

fbshipit-source-id: 45008e41f2394d2ba319745b0340392e1b3d3172
2020-09-23 14:33:22 -07:00
d2b045030e gtest-ify JIT tests, through the letter c (#45020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45020

See https://github.com/pytorch/pytorch/pull/45018 for context.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23802296

Pulled By: suo

fbshipit-source-id: 20c9798a414e9ba30869a862012cbdee0613c8b1
2020-09-23 14:28:45 -07:00
3f89b779c4 [jit] allow submodule methods inference rule be different (#43872)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43872

This PR allows recursive scripting to use a separate
submodule_stubs_fn to create its submodules with specific user-provided
rules.

Fixes https://github.com/pytorch/pytorch/issues/43729

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D23430176

Pulled By: wanchaol

fbshipit-source-id: 20530d7891ac3345b36f1ed813dc9c650b28d27a
2020-09-23 14:10:31 -07:00
9e206ee9f1 [NNC] Fix a bug in SplitWithMask when splitting multiple times (#45141)
Summary:
When doing a splitWithMask we only mask if the loop extent is not cleanly divided by the split factor. However, the logic does not simplify the extent expression, so any nontrivial loop extent will always cause a mask to be added, e.g. if the loop had previously been split. Unlike splitWithTail, the masks added by splitWithMask are always overhead and we don't have the analysis to optimize them out if they are unnecessary, so it's good to avoid inserting them if we can.

The fix is just to simplify the loop extents before doing the extent calculation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45141

Reviewed By: ezyang

Differential Revision: D23869170

Pulled By: nickgg

fbshipit-source-id: 44686fd7b802965ca4f5097b0172a41cf837a1f5
2020-09-23 14:04:58 -07:00
adb2b380ba [quant][graphmode][fx] qconfig_dict support more types of configurations (#44856)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44856

Support following format of qconfig_dict
```python
qconfig_dict = {
    # optional, global config
    "": qconfig?,

    # optional, used for module and function types
    # could also be split into module_types and function_types if we prefer
    "object_type": [
      (nn.Conv2d, qconfig?),
      (F.add, qconfig?),
      ...,
    ],

    # optional, used for module names
    "module_name": [
      ("foo.bar", qconfig?)
      ...,
    ],

    # optional, matched in order, first match takes precedence
    "module_name_regex": [
      ("foo.*bar.*conv[0-9]+", qconfig?)
      ...,
    ]
    # priority (in increasing order): global, object_type, module_name_regex, module_name
    # qconfig == None means fusion and quantization should be skipped for anything
    # matching the rule
}
```

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23751304

fbshipit-source-id: 5b98f4f823502b12ae2150c93019c7b229c49c50
2020-09-23 13:59:53 -07:00
21fabae47a Remove expensive call to PyObject_GetAttrString in PyTorch_LookupSpecial (#44684)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44684

The ad-hoc quantization benchmarking script in D23689062 recently highlighted that quantized ops were surprisingly slow after the introduction of support for custom ops in torch.fx in D23203204 (f15e27265f).

Using strobelight, it's immediately clear that up to 66% of samples were seen in `c10::get_backtrace`, which descends from `torch::is_tensor_and_append_overloaded -> torch::check_has_torch_function ->  torch::PyTorch_LookupSpecial -> PyObject_HasAttrString ->  PyObject_GetAttrString`.

I'm no expert by any means so please correct any/all misinterpretation, but it appears that:
- `check_has_torch_function` only needs to return a bool
- `PyTorch_LookupSpecial` should return `NULL` if a matching method is not found on the object
- in the impl of `PyTorch_LookupSpecial` the return value from `PyObject_HasAttrString` only serves as a bool to return early, but ultimately ends up invoking `PyObject_GetAttrString`, which raises, spawning the generation of a backtrace
- `PyObject_FastGetAttrString` returns `NULL` (stolen ref to an empty py::object if the if/else if isn't hit) if the method is not found, anyway, so it could be used singularly instead of invoking both `GetAttrString` and `FastGetAttrString`
- D23203204 (f15e27265f) compounded (but maybe not directly caused) the problem by increasing the number of invocations

so, removing it in this diff and seeing how many things break :)

before:
strobelight: see internal section
output from D23689062 script:
```
$ ./buck-out/gen/scripts/v/test_pt_quant_perf.par
Sequential(
  (0): Quantize(scale=tensor([0.0241]), zero_point=tensor([60]), dtype=torch.quint8)
  (1): QuantizedLinear(in_features=4, out_features=4, scale=0.017489388585090637, zero_point=68, qscheme=torch.per_tensor_affine)
  (2): DeQuantize()
)
fp 0.010896682739257812
q 0.11908197402954102
```

after:
strobelight: see internal section
output from D23689062 script:
```
$ ./buck-out/gen/scripts/v/test_pt_quant_perf.par
Sequential(
  (0): Quantize(scale=tensor([0.0247]), zero_point=tensor([46]), dtype=torch.quint8)
  (1): QuantizedLinear(in_features=4, out_features=4, scale=0.012683945707976818, zero_point=41, qscheme=torch.per_tensor_affine)
  (2): DeQuantize()
)
fp 0.011141300201416016
q 0.022639036178588867
```

which roughly restores original performance seen in P142370729

UPDATE: 9/22 mode/opt benchmarks
```
buck run //scripts/x:test_pt_quant_perf mode/opt
Sequential(
  (0): Quantize(scale=tensor([0.0263]), zero_point=tensor([82]), dtype=torch.quint8)
  (1): QuantizedLinear(in_features=4, out_features=4, scale=0.021224206313490868, zero_point=50, qscheme=torch.per_tensor_affine)
  (2): DeQuantize()
)
fp 0.002968311309814453
q 0.5138928890228271
```

with patch:
```
buck run //scripts/x:test_pt_quant_perf mode/opt
Sequential(
  (0): Quantize(scale=tensor([0.0323]), zero_point=tensor([70]), dtype=torch.quint8)
  (1): QuantizedLinear(in_features=4, out_features=4, scale=0.017184294760227203, zero_point=61, qscheme=torch.per_tensor_affine)
  (2): DeQuantize()
)
fp 0.0026655197143554688
q 0.0064449310302734375
```

Reviewed By: ezyang

Differential Revision: D23697334

fbshipit-source-id: f756d744688615e01c94bf5c48c425747458fb33
2020-09-23 13:52:54 -07:00
99242eca1d Dockerfile: Support CUDA 11 (#45071)
Summary:
Although PyTorch already supports CUDA 11, the Dockerfile still relies on CUDA 10. This pull request upgrades all the necessary versions such that recent NVIDIA GPUs like A100 can be used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45071

Reviewed By: ezyang

Differential Revision: D23873224

Pulled By: seemethere

fbshipit-source-id: 822c25f183dcc3b4c5b780c00cd37744d34c6e00
2020-09-23 11:38:49 -07:00
4d80c8c648 Fix inlining interface call in fork subgraph (#43790)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43790

Interface calls were not handled properly when used in a fork
subgraph. This PR fixes that issue.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23402039

Pulled By: bzinodev

fbshipit-source-id: 41adc5ee7d942250e732e243ab30e356d78d9bf7
2020-09-23 11:17:19 -07:00
da4033d32a Make cudaHostRegister actually useful on cudart. (#45159)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45159

By default, pybind11 binds void* to be capsules.  After a lot of
Googling, I have concluded that this is not actually useful:
you can't actually create a capsule from Python land, and our
data_ptr() function returns an int, which means that the
function is effectively unusable.  It didn't help that we had no
tests exercising it.

I've replaced the void* with uintptr_t, so that we now accept int
(and you can pass data_ptr() in directly).  I'm not sure if we
should make these functions accept ctypes types; unfortunately,
pybind11 doesn't seem to have any easy way to do this.

Fixes #43006

Also added cudaHostUnregister which was requested.
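With the binding fixed, `data_ptr()` (a Python int) can be passed straight through (sketch; assumes a CUDA-enabled build):

```python
import torch

t = torch.empty(1024)
cudart = torch.cuda.cudart()
# pin the tensor's storage by registering it with the CUDA runtime
cudart.cudaHostRegister(t.data_ptr(), t.numel() * t.element_size(), 0)
cudart.cudaHostUnregister(t.data_ptr())
```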

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D23849731

Pulled By: ezyang

fbshipit-source-id: 8a79986f3aa9546abbd2a6a5828329ae90fd298f
2020-09-23 11:05:44 -07:00
a5a4924c27 Warn if import torch is called from the source root. (#39995)
Summary:
This is a small developer quality of life improvement. I commonly try to run some snippet of python as I'm working on a PR and forget that I've cd-d into the local clone to run some git commands, resulting in annoying failures like:
`ImportError: cannot import name 'default_generator' from 'torch._C' (unknown location)`

This actually took a non-trivial amount of time to figure out the first time I hit it, and even now it's annoying because it happens just infrequently enough to not sit high in the mental cache.

This PR adds a check to `torch/__init__.py` and warns if `import torch` is likely resolving to the wrong thing:

```
WARNING:root:You appear to be importing PyTorch from a clone of the git repo:
  /data/users/taylorrobie/repos/pytorch
  This will prevent `import torch` from resolving to the PyTorch install
  (instead it will try to load /data/users/taylorrobie/repos/pytorch/torch/__init__.py)
  and will generally lead to other failures such as a failure to load C extensions.
```

so that the soon to follow internal import failure makes some sense. I elected to make this a warning rather than an exception because I'm not 100% sure that it's **always** wrong. (e.g. weird `PYTHONPATH` or `importlib` corner cases.)

EDIT: There are now separate cases for `cwd` vs. `PYTHONPATH`, and failure is an `ImportError`.
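A minimal sketch of such a check (the function name and heuristics are hypothetical, not the actual torch/__init__.py code):

```python
import os

def _check_not_importing_from_clone():
    cwd = os.path.realpath(os.getcwd())
    # a cwd containing setup.py and torch/__init__.py looks like a repo clone
    looks_like_repo = (os.path.isfile(os.path.join(cwd, "setup.py")) and
                       os.path.isfile(os.path.join(cwd, "torch", "__init__.py")))
    if looks_like_repo:
        raise ImportError(
            f"You appear to be importing PyTorch from a clone of the git repo: {cwd}")
```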

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39995

Reviewed By: malfet

Differential Revision: D23817209

Pulled By: robieta

fbshipit-source-id: d9ac567acb22d9c8c567a8565a7af65ac624dbf7
2020-09-23 10:55:08 -07:00
9db3871288 Update true_divide_out to use at::. (#45079)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45079

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23821701

Pulled By: ailzhang

fbshipit-source-id: 562eac10faba7a503eda0029a0b026c1fb85fe1e
2020-09-23 10:50:48 -07:00
9e30a76697 Filter strtod_l is undeclared errors from sccache log (#45183)
Summary:
This prevents DrCI from misidentifying test failures as being caused by compilation failures such as:
```
/var/lib/jenkins/workspace/build/CMakeFiles/CMakeTmp/CheckSymbolExists.c:8:19: error: use of undeclared identifier \'strtod_l\'
  return ((int*)(&strtod_l))[argc];
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45183

Reviewed By: ezyang

Differential Revision: D23859267

Pulled By: malfet

fbshipit-source-id: 283d9bd2ab712f23239b72f3758d121e2d026fb0
2020-09-23 09:49:49 -07:00
5b20bf4fd9 Added support for complex input for Cholesky decomposition (#44895)
Summary:
Cholesky decomposition now works for complex inputs.

Fixes https://github.com/pytorch/pytorch/issues/44637.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44895

Reviewed By: ailzhang

Differential Revision: D23841583

Pulled By: anjali411

fbshipit-source-id: 3b1f34a7af17827884540696f8771a0d5b1df478
2020-09-23 08:25:56 -07:00
94c3cdd994 Let rpc._all_gather use default RPC timeout (#44983)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44983

`_all_gather` was converted from `_wait_all_workers` and inherited its
fixed 5-second timeout. As `_all_gather` is meant to support a broader
set of use cases, the timeout configuration should be more flexible.
This PR makes `rpc._all_gather` use the global default RPC timeout.

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23794383

Pulled By: mrshenli

fbshipit-source-id: 382f52c375f0f25c032c5abfc910f72baf4c5ad9
2020-09-23 08:06:09 -07:00
e5bade7b2c [PyTorch Mobile] Move string op registrations to prim and make them selective (#44960)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44960

Since we have templated selective build, it should be safe to move these operators to prim so that they can be selectively built on mobile

Test Plan: CI

Reviewed By: linbinyu

Differential Revision: D23772025

fbshipit-source-id: 52cebae76e4df5a6b2b51f2cd82f06f75e2e45d0
2020-09-23 07:42:35 -07:00
76dc50e9c8 [RPC] Infer backend type if only options are given (#45065)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45065

To preserve backwards compatibility with applications that were passing in some ProcessGroupRpcBackendOptions but were not explicitly setting backend=BackendType.PROCESS_GROUP, we now infer the backend type from the options if only the latter are passed. If neither is passed, we default to TensorPipe, as before this change.
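A sketch of the new behavior (single-process setup for illustration; the rendezvous env values are placeholders):

```python
import os
import torch.distributed.rpc as rpc

os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"

opts = rpc.ProcessGroupRpcBackendOptions(rpc_timeout=60)
# No backend= argument: it is now inferred as PROCESS_GROUP from the options
# type; with neither argument, TensorPipe remains the default.
rpc.init_rpc("worker0", rank=0, world_size=1, rpc_backend_options=opts)
rpc.shutdown()
```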
ghstack-source-id: 112586258

Test Plan: Added new unit tests.

Reviewed By: pritamdamania87

Differential Revision: D23814289

fbshipit-source-id: f4be7919e0817a4f539a50ab12216dc3178cb752
2020-09-23 00:46:27 -07:00
215679573e [TensorExpr] Fix operator order in combineMultilane (#45157)
Summary:
combineMultilane used the wrong operand order when the ramp was on the left-hand side,
which matters for subtraction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45157

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.SimplifyRampSubBroadcast

Reviewed By: ailzhang

Differential Revision: D23851751

Pulled By: asuhan

fbshipit-source-id: 864d1611e88769fb43327ef226bb3310017bf858
2020-09-22 23:50:47 -07:00
7fba30c2be [quant][fx][bug] Fix error in convert step for QAT (#45050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45050

Update tests to actually test for QAT

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_linear

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23808022

fbshipit-source-id: d749ab2d215fe19238ff9d539307ffce9ef0ca9b
2020-09-22 22:48:31 -07:00
144dacd8d9 CUDA BFloat16 batched gemm (#45167)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45167

Reviewed By: mruberry

Differential Revision: D23860458

Pulled By: ngimel

fbshipit-source-id: 698de424a046963a30017b58d227fa510f85bf3f
2020-09-22 22:43:52 -07:00
989d877c95 [JIT] Do not allow creating generics with None types (#44958)
Summary:
Otherwise, invoking something like  `python -c "import torch._C;print(torch._C.ListType(None))"` will result in SIGSEGV

Discovered while trying to create a torch script for a function with the following type annotation: `Tuple[int, Ellipsis] -> None`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44958

Reviewed By: suo

Differential Revision: D23799906

Pulled By: malfet

fbshipit-source-id: 916a243007d13ed3e7a5b282dd712da3d66e3bf7
2020-09-22 21:50:40 -07:00
0a9ac98bed [reland][pytorch] refine dispatch keys in native_functions.yaml (1/N) (#45137)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45137

Reland https://github.com/pytorch/pytorch/pull/45010 - which broke
master due to merge conflict.

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D23843510

Pulled By: ljk53

fbshipit-source-id: 28aabb9da533b6b806ab8779a0ee96b695e9e242
2020-09-22 21:44:55 -07:00
25ed739ac9 [packaging] rstrip fix (#45166)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45166

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D23852505

Pulled By: zdevito

fbshipit-source-id: 6bb743b37333ae19fc24629686e8d06aef812c50
2020-09-22 21:23:47 -07:00
cb75addee4 torch.package - a way to package models and code (#45015)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45015

torch.package allows you to write packages of code, pickled python data, and
arbitrary binary and text resources into a self-contained package.

torch.package.PackageExporter writes the packages and
torch.package.PackageImporter reads them.

The importers can load this code in a hermetic way, such that code is loaded
from the package rather than the normal python import system. This allows
for the packaging of PyTorch model code and data so that it can be run
on a server or used in the future for transfer learning.

The code contained in packages is copied file-by-file from the original
source when it is created, and the file format is a specially organized
zip file. Future users of the package can unzip the package, and edit the code
in order to perform custom modifications to it.

The importer for packages ensures that code in the module can only be loaded from
within the package, except for modules explicitly listed as external using :method:`extern_module`.
The file `extern_modules` in the zip archive lists all the modules that a package externally depends on.
This prevents "implicit" dependencies where the package runs locally because it is importing
a locally-installed package, but then fails when the package is copied to another machine.
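A hedged usage sketch of the API described above (resource names are made up; `close()` on the exporter and the extern declarations are assumed):

```python
import torch
from torch.package import PackageExporter, PackageImporter

model = torch.nn.Linear(4, 4)

exporter = PackageExporter("package.zip")
exporter.extern_module("torch")               # resolved by the normal importer
exporter.extern_module("numpy")
exporter.save_pickle("model", "model.pkl", model)
exporter.close()

importer = PackageImporter("package.zip")
loaded = importer.load_pickle("model", "model.pkl")
```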

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D23824337

Pulled By: zdevito

fbshipit-source-id: 1247c34ba9b656f9db68a83e31f2a0fbe3bea6bd
2020-09-22 21:21:21 -07:00
d4a634c209 [RPC profiling] Don't wrap toHere() calls with profiling (#44655)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44655

Since `toHere()` does not execute operations over RPC and simply
transfers the value to the local node, we don't need to enable the profiler
remotely for this message. This causes unnecessary overhead and is not needed.

Since `toHere` is a blocking call, we already profile the call on the local node using `RECORD_USER_SCOPE`, so this does not change the expected profiler results (validated by ensuring all remote profiling tests pass).
ghstack-source-id: 112605610

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23641466

fbshipit-source-id: 109d9eb10bd7fe76122b2026aaf1c7893ad10588
2020-09-22 21:17:00 -07:00
70d2e4d1f6 [RPC profiling] Allow disableProfiler() to be called from another thread. (#44653)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44653

This changes the profiler per a discussion with ilia-cher offline that enables `disableProfiler()` event consolidation logic to be called from different threads (i.e. threads where the profiler was not explicitly enabled). This is needed to support the functionality enabled by D23638387 where we defer profiling event collection until executing an async callback that can execute on a different thread, to support RPC async function profiling.

This is done by introducing two flags, `cleanupTLSState` and `consolidate`, which control whether we should clean up thread local settings (we don't do this when calling `disableProfiler()` on non-main threads) and whether we should consolidate all profiled events. Backwards compatibility is ensured since both options are true by default.

Added a test in `test_misc.cpp` to test this.
ghstack-source-id: 112605620

Reviewed By: mrshenli

Differential Revision: D23638499

fbshipit-source-id: f5bbb0d41ef883c5e5870bc27e086b8b8908f46b
2020-09-22 21:16:58 -07:00
1bd6533d60 Remove thread_local RecordFunctionGuard from profiler. (#44646)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44646

Per a discussion with ilia-cher, this is not needed anymore and
removing it would make some future changes to support async RPC profiling
easier. Tested by ensuring profiling tests in `test_autograd.py` still pass.
ghstack-source-id: 112605618

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23683998

fbshipit-source-id: 4e49a439509884fe04d922553890ae353e3331ab
2020-09-22 21:15:31 -07:00
67a19fecef CUDA BFloat16 pooling (#45151)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45151

Reviewed By: ailzhang

Differential Revision: D23854056

Pulled By: ngimel

fbshipit-source-id: 32f0835218c2602a09654a9ac2d161c4eb360f90
2020-09-22 20:19:25 -07:00
666223df46 [jit] gtestify test_argument_spec.cpp (#45019)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45019

See https://github.com/pytorch/pytorch/pull/45018 for context.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23802298

Pulled By: suo

fbshipit-source-id: 0e36d095d4d81dcd5ebe6d56b3dc469d6d5482d0
2020-09-22 19:44:14 -07:00
f575df201f [quant][graphmode][jit][api] Expose preserved_attrs from finalize to convert_jit (#44490)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44490

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23631142

fbshipit-source-id: f0913f0cb4576067e2a7288326024942d12e0ae0
2020-09-22 19:37:25 -07:00
e045119956 [JIT] Add default arguments for class types (#45098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45098

**Summary**
This commit adds support for default arguments in methods of class
types. Similar to how default arguments are supported for regular
script functions and methods on scripted modules, default values are
retrieved from the definition of a TorchScript class in Python as Python
objects, converted to IValues, and then attached to the schemas of
already compiled class methods.

**Test Plan**
This commit adds a set of new tests to TestClassType to test default
arguments.

**Fixes**
This commit fixes #42562.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23844769

Pulled By: SplitInfinity

fbshipit-source-id: ceedff7703bf9ede8bd07b3abcb44a0f654936bd
2020-09-22 18:37:44 -07:00
ebde5a80bb [tensorexpr] Add flag to fuse with unknown shapes (#44401)
Summary:
This flag simply allows users to get fusion groups that will *eventually* have shapes (such that `getOperation` is valid).

This is useful for doing early analysis and compiling just in time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44401

Reviewed By: ZolotukhinM

Differential Revision: D23656140

Pulled By: bwasti

fbshipit-source-id: 9a26c202752399d1932ad7d69f21c88081ffc1e5
2020-09-22 18:17:47 -07:00
c0267c6845 [caffe2] Support data types in shape hints (#45110)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45110

A recent change in DSNN quantizes the ad embedding to 8 bits. Ad embeddings are part of the inputs to the DSNN merge net. To correctly pass shape hints of input tensors including quantized ad embeddings, we need to be able to annotate the data types in shape hints.

A note on corner cases: if the type is omitted or is not a valid type (e.g., whitespace), instead of throwing an exception I decided to return the default type, float.

Test Plan:
```
buck test caffe2/caffe2/fb/opt:shape_info_utils_test
```

Reviewed By: yinghai

Differential Revision: D23834091

fbshipit-source-id: 5e072144a7a7ff4b5126b618062dfc4041851dd3
2020-09-22 17:49:33 -07:00
b98ac20849 install ATen/native/cuda and hip headers (#45097)
Summary:
The ATen/native/cuda headers were copied to torch/include but were then not included in the final package. Further, the ATen/native/hip headers are added to the installation as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45097

Reviewed By: mruberry

Differential Revision: D23831006

Pulled By: malfet

fbshipit-source-id: ab527928185faaa912fd8cab208733a9b11a097b
2020-09-22 17:43:47 -07:00
2a37f3fd2f Relax CUDA architecture check (#45130)
Summary:
NVIDIA GPUs are binary compatible within a major compute capability revision.

This prevents "GeForce RTX 3080 with CUDA capability sm_86 is not compatible with the current PyTorch installation." messages from appearing, since CUDA 11 does not support code generation for sm_86.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45130

Reviewed By: ngimel

Differential Revision: D23841556

Pulled By: malfet

fbshipit-source-id: bcfc9e8da63dfe62cdec06909b6c049aaed6a18a
2020-09-22 17:26:47 -07:00
ccfbfe5eb5 [quant][graphmode][fx] Custom module support (#44766)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44766

There might be modules that are not symbolically traceable, e.g. LSTM (since it has
input-dependent control flow). To support quantization in these cases, the user provides
the corresponding observed and quantized versions of the custom module: the observed
custom module has observers already inserted, and the quantized version has the
corresponding ops quantized. Use
```
from torch.quantization import register_observed_custom_module_mapping
from torch.quantization import register_quantized_custom_module_mapping
register_observed_custom_module_mapping(CustomModule, ObservedCustomModule)
register_quantized_custom_module_mapping(CustomModule, QuantizedCustomModule)
```
to register the custom module mappings, we'll also need to define a custom delegate class
for symbolic trace in order to prevent the custom module from being traced:
```python
class CustomDelegate(DefaultDelegate):
      def is_leaf_module(self, m):
          return (m.__module__.startswith('torch.nn') and
                    not isinstance(m, torch.nn.Sequential)) or \
                    isinstance(m, CustomModule)
m = symbolic_trace(original_m, delegate_class=CustomDelegate)
```

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23723455

fbshipit-source-id: 50d666e29b94cbcbea5fb6bcc73b00cff87eb77a
2020-09-22 17:11:46 -07:00
7f4a27be3a [resubmit][FX] s/get_param/get_attr/ (#45147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45147

ghstack-source-id: 112605923

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23845096

fbshipit-source-id: 9ca209aa84cbaddd6e89c52b541e43b11197e2d5
2020-09-22 17:06:18 -07:00
35cdb01327 [PyTorch] Enable type check for autocast_test_lists (#45107)
Summary:
This is a sub-task for addressing https://github.com/pytorch/pytorch/issues/42969. We re-enable the type check for `autocast_test_lists`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45107

Test Plan:
`python test/test_type_hints.py` passed:
```
(pytorch) bash-5.0$ with-proxy python test/test_type_hints.py
....
----------------------------------------------------------------------
Ran 4 tests in 103.871s

OK
```

Reviewed By: walterddr

Differential Revision: D23842884

Pulled By: Hangjun

fbshipit-source-id: a39f3810e3abebc6b4c1cb996b06312f6d42ffd6
2020-09-22 16:54:26 -07:00
cddcfde81d [JIT] Fix WithTest.test_with_exceptions (#45106)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45106

**Summary**
This commit fixes `WithTest.test_with_exceptions`. It's been running
in regular Python this whole time; none of the functions created and
invoked for the test were scripted. Fortunately, the tests still pass
after being fixed.

**Test Plan**
Ran unit tests + continuous integration.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23848206

Pulled By: SplitInfinity

fbshipit-source-id: fd975ee34db9441ef4e4a4abf2fb21298166bbaa
2020-09-22 16:31:17 -07:00
d1c68a7069 Clarify that 5-D 'bilinear' grid_sample is actually trilinear (#45090)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45090

Reviewed By: ailzhang

Differential Revision: D23841046

Pulled By: zou3519

fbshipit-source-id: 941770cd5b3e705608957739026e9113e5f0c616
2020-09-22 15:10:22 -07:00
79fe794f87 [FX] Make Graphs immutable and make GraphModule recompile after assigning graph (#44830)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44830

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23743850

Pulled By: jamesr66a

fbshipit-source-id: 501b92a89ff636c26abeff13105a75462384554c
2020-09-22 15:02:11 -07:00
def433bbb6 .circleci: Upgrade all xcode 9 workers to xcode 11 (#45153)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45153

xcode 9 is being deprecated within circleci infra so we should get
everything else on a more recent version of xcode

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D23852774

Pulled By: seemethere

fbshipit-source-id: c02e162f1993d408de439fee21b340e9640e5a24
2020-09-22 14:57:43 -07:00
a4ce3f4194 Fix type hint warnings for common_methods_invocations.py (#44971)
Summary:
Fixes a subtask of https://github.com/pytorch/pytorch/issues/42969

Tested the following and no warnings were seen.

python test/test_type_hints.py
....
----------------------------------------------------------------------
Ran 4 tests in 180.759s

OK

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44971

Reviewed By: walterddr

Differential Revision: D23822274

Pulled By: visweshfb

fbshipit-source-id: e3485021e348ee0a8508a9d128f04bad721795ef
2020-09-22 13:40:46 -07:00
c253b10154 Fix incorrect EnumValue serialization issue (#44891)
Summary:
Previously, `prim::EnumValue` was serialized to `ops.prim.EnumValue`, which doesn't have the right implementation to refine the return type. This diff correctly serializes it to enum.value, thus fixing the issue.

Fixes https://github.com/pytorch/pytorch/issues/44892

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44891

Reviewed By: malfet

Differential Revision: D23818962

Pulled By: gmagogsfm

fbshipit-source-id: 6edfdf9c4b932176b08abc69284a916cab10081b
2020-09-22 11:59:45 -07:00
2b1f25885e [quant] Fix ConvTranspose mapping (#44844)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44844

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23746466

Pulled By: z-a-f

fbshipit-source-id: cb84e0fef5ab82e8ed8dd118d9fb21ee7b480ef7
2020-09-22 11:59:42 -07:00
09aee06e82 [caffe2] Replace embedding conversion ops with fbgemm functions (#44843)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44843

Replace perfkernels calls with fbgemm kernels to avoid code duplication
ghstack-source-id: 112496292

Test Plan: CI

Reviewed By: radkris-git

Differential Revision: D23675519

fbshipit-source-id: 05c285a9eeb9ea109a04a78cb442a24ee40a4aec
2020-09-22 11:57:01 -07:00
e2b40ce793 Support BFloat16 for binary logical operators on CUDA (#42485)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42485

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23684423

Pulled By: mruberry

fbshipit-source-id: edc2b46b726361d4c8bf8a4bf4e4a09197b20428
2020-09-22 11:42:34 -07:00
ef885c10d8 [pytorch] Add triplet margin loss with custom distance (#43680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43680

As discussed [here](https://github.com/pytorch/pytorch/issues/43342),
adding in a Python-only implementation of the triplet-margin loss that takes a
custom distance function.  Still discussing whether this is necessary to add to
PyTorch Core.
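A usage sketch, assuming the API landed as `nn.TripletMarginWithDistanceLoss` with a `distance_function` argument (hedged; the PR notes the design was still under discussion):

```python
import torch
import torch.nn as nn

loss_fn = nn.TripletMarginWithDistanceLoss(
    distance_function=nn.PairwiseDistance(), margin=1.0)
anchor, positive, negative = (torch.randn(8, 16, requires_grad=True)
                              for _ in range(3))
loss = loss_fn(anchor, positive, negative)
loss.backward()
```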

Test Plan:
python test/run_tests.py

Imported from OSS

Reviewed By: albanD

Differential Revision: D23363898

fbshipit-source-id: 1cafc05abecdbe7812b41deaa1e50ea11239d0cb
2020-09-22 11:35:52 -07:00
10f287539f Align casing in test_dispatch with dispatch keys. (#44933)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44933

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23778247

Pulled By: ailzhang

fbshipit-source-id: bc3725eae670b03543015afe763cb3bb16baf8f6
2020-09-22 10:50:08 -07:00
1fd48a9d1f Revert D23798016: [FX] s/get_param/get_attr/
Test Plan: revert-hammer

Differential Revision:
D23798016 (c941dd3492)

Original commit changeset: 1d2f3db1994a

fbshipit-source-id: 974d930064b37d396c5d66c905a63d45449813e5
2020-09-22 10:32:51 -07:00
8501b89a87 [ONNX] Update ort release (#45095)
Summary:
Update ort release

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45095

Reviewed By: bwasti

Differential Revision: D23832041

Pulled By: malfet

fbshipit-source-id: 39c47a87e451c4c43ba4d4e8be385cc195cc611a
2020-09-22 10:08:48 -07:00
4b42f0b613 Support Math keyword in native_functions.yaml. (#44556)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44556

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D23698386

Pulled By: ailzhang

fbshipit-source-id: f10ea839a2cfe7d16f5823a75b8b8c5f1ae22dde
2020-09-22 10:00:40 -07:00
ae286d81e0 [JIT] improve alias analysis for list constructs (#39111)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39111

In our present alias analysis, we consider any Value that enter another container as entering the heap, and thus aliasing all other heap values of the same type. There are a number of advantages to this approach:
- it is not too hard to maintain the aliasDb implementation
- it is much easier from an op schema perspective - there are many composite list ops registered internally and externally that would be tricky to register and get right if we did something more complicated
- It limits the size of the AliasDb, because a container of size 10 only contains a single memory dag element instead of 10 elements.

The downside is that we are unable to handle the simple and extremely common case of a list of tensors being used in an ATen op.

In an example like:

```
 def foo(input):
    x = torch.tensor([1, 2, 3, 4])
    y = [x, x]
    input.add_(1)
    return torch.cat(y)
```

we will consider x to be written to. Any write to any wildcard element (an element that enters a tuple, an element that is taken from a list) will mark x as written to. This can be limiting for our ability to create a functional subset and fuse graphs - as a result, 4 of the TorchVision classification models could not be functionalized.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D23828003

Pulled By: eellison

fbshipit-source-id: 9109fcb6f2ca20ca897cae71683530285da9d537
2020-09-22 09:38:59 -07:00
9fc7a942f0 Change from self to self.__class__() in _DecoratorManager to ensure a new object is created every time a function is called recursively (#44633)
Summary:
Change from self to self.__class__() in _DecoratorManager to ensure a new object is created every time a function is called recursively

Fixes https://github.com/pytorch/pytorch/issues/44531

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44633

Reviewed By: agolynski

Differential Revision: D23783601

Pulled By: albanD

fbshipit-source-id: a818664dee7bdb061a40ede27ef99e9546fc80bb
2020-09-22 09:13:39 -07:00
63fd257879 Add Ellipsis constant to the list of recognized tokens (#44959)
Summary:
Per https://docs.python.org/3.6/library/constants.html
> `Ellipsis` is the same as ellipsis literal `...`
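
A small sketch of what this enables: TorchScript should now accept the named constant wherever the `...` literal is accepted (hypothetical example):

```python
import torch

@torch.jit.script
def first_channel(x: torch.Tensor) -> torch.Tensor:
    # `Ellipsis` is now recognized by the script frontend,
    # so this is equivalent to x[..., 0]
    return x[Ellipsis, 0]

print(first_channel(torch.ones(2, 3)))  # tensor([1., 1.])
```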

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44959

Reviewed By: suo

Differential Revision: D23785660

Pulled By: malfet

fbshipit-source-id: f68461849e7d16ef68042eb96566f2c936c06b0f
2020-09-22 09:05:25 -07:00
e155fbe915 add warning when ParameterList/Dict is used with DataParallel (#44405)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44405

Test Plan: Imported from OSS

Reviewed By: agolynski

Differential Revision: D23783987

Pulled By: albanD

fbshipit-source-id: 5018b0d381cb09301d2f88a98a910854f740ace1
2020-09-22 08:58:00 -07:00
4a0aa69a66 Fix undefined variable 'namedshape' in tensor.py (#45085)
Summary:
Hot Fix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45085

Reviewed By: malfet, seemethere

Differential Revision: D23824444

Pulled By: walterddr

fbshipit-source-id: c9f37b394d281b7ef44b14c30699bb7510a362a7
2020-09-22 08:52:47 -07:00
36ec8f8fb8 [dper3] Create dper LearningRate low-level module (#44639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44639

As title; this will unblock migration of several modules that need learning rate functionality.

Test Plan:
```
buck test //dper3/dper3/modules/low_level_modules/tests:learning_rate_test
```

Reviewed By: yf225

Differential Revision: D23681733

fbshipit-source-id: 1d98cb35bf6a4ff0718c9cb6abf22401980b523c
2020-09-22 08:26:07 -07:00
58b6ab69e5 torch.sgn for complex tensors (#39955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39955

resolves https://github.com/pytorch/pytorch/issues/36323 by adding `torch.sgn` for complex tensors.
`torch.sgn` returns `x/abs(x)` for `x != 0` and returns `0 + 0j` for `x==0`
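
For illustration, a couple of values under this definition:

```python
import torch

z = torch.tensor([3 + 4j, 0j])
print(torch.sgn(z))  # tensor([0.6000+0.8000j, 0.0000+0.0000j]), i.e. z/|z| with 0 mapped to 0
```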

This PR doesn't test the correctness of the gradients. It will be done as a part of auditing all the ops in future once we decide the autograd behavior (JAX vs TF) and add gradchek.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23460526

Pulled By: anjali411

fbshipit-source-id: 70fc4e14e4d66196e27cf188e0422a335fc42f92
2020-09-22 08:24:53 -07:00
1b059f2c6d Directly use work.result() to retrieve tensor rather than passing as a separate argument (#44914)
Summary:
We currently fetch an allreduced tensor from Python into C++, storing the resulting tensor in a struct's parameter. This PR removes the extra tensor parameter from the function signature and fetches the tensor from a single place.

Fixes https://github.com/pytorch/pytorch/issues/43960

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44914

Reviewed By: rohan-varma

Differential Revision: D23798888

Pulled By: bugra

fbshipit-source-id: ad1b8c31c15e3758a57b17218bbb9dc1f61f1577
2020-09-22 06:28:47 -07:00
71aeb84ab4 Revert D23803951: [pytorch] refine dispatch keys in native_functions.yaml (1/N)
Test Plan: revert-hammer

Differential Revision:
D23803951 (339961187a)

Original commit changeset: aaced7c34427

fbshipit-source-id: fcc4fb6a2c1d79b587f62347b43f8851fe1647fd
2020-09-22 05:41:59 -07:00
339961187a [pytorch] refine dispatch keys in native_functions.yaml (1/N) (#45010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45010

The motivation of this change is to differentiate "backend specific" ops
and "generic" ops.

"backend specific" ops are those invoking backend specific kernels thus
only able to run on certain backends, e.g.: CPU, CUDA.

"generic" ops are those not *directly* invoking backend specific kernels.
They are usually calling other "backend specific" ops to get things
done. Thus, they are also referred to as "composite" ops, or "math" ops
(because they are usually pure C++ code constructed from math formulas).

The other way to see the difference is that: we have to implement new
kernels for the "backend specific" ops if we want to run these ops on a
new backend. In contrast, "generic"/"composite" ops can run on the new
backend if we've added support for all the "backend specific" ops to
which they delegate their work.

Historically we didn't make a deliberate effort to always populate
supported backends to the "dispatch" section for all the "backend specific"
ops in native_functions.yaml. So now there are many ops which don't have
"dispatch" section but are actually "backend specific" ops. Majority
of them are calling "DispatchStub" kernels, which usually only support
CPU/CUDA (via TensorIterator) or QuantizedCPU/CUDA.

The ultimate goal is to be able to differentiate these two types of ops
by looking at the "dispatch" section in native_functions.yaml.

This PR leveraged the analysis script on #44963 to populate missing
dispatch keys for a set of "backend specific" ops. As the initial step,
we only deal with the simplest case:
* These ops don't already have dispatch section in native_functions.yaml;
* These ops call one or more DispatchStub (thus "backend specific");
* These ops don't call any other aten ops - except for some common
  ones almost every op calls via framework, e.g. calling aten::eq via
  Dispatcher::checkSchemaCompatibility. Calling other nontrivial aten
  ops is a sign of being "composite", so we don't want to deal with this
  case now;
* These ops don't call Tensor::is_quantized() / Tensor::is_sparse() / etc.
  Some ops call these Tensor::is_XXX() methods to dispatch to quantized /
  sparse kernels internally. We don't deal with this case now.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23803951

Pulled By: ljk53

fbshipit-source-id: aaced7c34427d1ede72380af4513508df366ea16
2020-09-22 03:20:01 -07:00
c947ab0bb9 Added sparse support for asin and neg functions, updated log1p (#44028)
Summary:
Description:

- [x] added C++ code for sparse `asin` and `neg` ops similarly to `log1p` op
- [x] added tests
  - [x] coalesced input CPU/CUDA
  - [x] uncoalesced input CPU/CUDA
- [x] added tests for `negative`  and `arcsin`

Backprop will be addressed in another PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44028

Reviewed By: agolynski

Differential Revision: D23793027

Pulled By: mruberry

fbshipit-source-id: 5fd642808da8e528cf6acd608ca0dcd720c4ccc3
2020-09-22 02:04:38 -07:00
d126a0d4fd [iOS] Disable the iOS nightly build until the cert issue has resolved (#45094)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45094

Test Plan: Imported from OSS

Reviewed By: husthyc

Differential Revision: D23831152

Pulled By: xta0

fbshipit-source-id: 6327edba01e4d5abad63ac35680eefb22276423f
2020-09-22 01:47:41 -07:00
5aed75b21b [quant][graphmode][jit] Try to support append (#44641)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44641

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23682356

fbshipit-source-id: 09a03dfde0b1346a5764e8e28ba56e32b343d239
2020-09-21 23:13:56 -07:00
2111ec3bf3 CUDA BFloat16 losses (#45011)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45011

Reviewed By: mruberry

Differential Revision: D23805840

Pulled By: ngimel

fbshipit-source-id: 3eb60d4367c727100763879e20e9df9d58bf5ad6
2020-09-21 22:51:17 -07:00
32c1a8c79f adjust shape inference in sls tests (#44936)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44936

Need to provide the max sequence size and max element size instead of the
total. Also added a check that onnxifi was successful.

Test Plan: sls tests

Reviewed By: yinghai

Differential Revision: D23779437

fbshipit-source-id: 5048d6536ca00f0a3b0b057c4e2cf6584b1329d6
2020-09-21 22:09:55 -07:00
0dda65ac77 [ONNX] add jit pass for lists (#43820)
Summary:
Add jit preprocessing pass for adding int lists.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43820

Reviewed By: albanD

Differential Revision: D23674598

Pulled By: bzinodev

fbshipit-source-id: 35766403a073e202563bba5251c07efb7cc5cfb1
2020-09-21 22:05:25 -07:00
09e7f62ce2 Fix RPC and ProcessGroup GIL deadlock (#45088)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45088

Fixes #45082

Found a few problems while working on #44983

1. We deliberately swallow RPC timeouts during shutdown, as we haven't
found a good way to handle those. When we converted `_wait_all_workers`
into `_all_gather`, the same logic was inherited. However, as
`_all_gather` is meant to be used in more general scenarios, we should
no longer keep silent about errors. This commit lets the error propagate
from `_all_gather` and also lets `shutdown()` catch and log it.
2. After fixing (1), I found that `UnpickledPythonCall` needs to
acquire the GIL on destruction, and this can lead to deadlock when used
in conjunction with `ProcessGroup`, because the `ProcessGroup` ctor is a
synchronization point which holds the GIL. In `init_rpc`, followers
(`rank != 0`) can exit before the leader (`rank == 0`). If the two
happen together, we can hit the following: a follower exits `init_rpc`
after running `_broadcast_to_followers` and before reaching the dtor
of `UnpickledPythonCall`. Then it runs the ctor of `ProcessGroup`,
which holds the GIL and waits for the leader to join. However, the
leader is waiting for the response from `_broadcast_to_followers`,
which is blocked by the dtor of `UnpickledPythonCall`. Hence
the deadlock. This commit drops the GIL in the `ProcessGroup` ctor.
3. After fixing (2), I found that the `TensorPipe` backend
nondeterministically fails in `test_local_shutdown`, for a reason
similar to (2), but this time it is that `shutdown()` on a
follower runs before the leader finishes `init_rpc`. This commit
adds a join to the `TensorPipe` backend's `init_rpc` after `_all_gather`.

The 3rd one should be able to solve the 2nd one as well. But since
I didn't see a reason to hold GIL during `ProcessGroup` ctor, I
made that change too.

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23825592

Pulled By: mrshenli

fbshipit-source-id: 94920f2ad357746a6b8e4ffaa380dd56a7310976
2020-09-21 21:47:27 -07:00
dfc88d4fd0 [vulkan] support dimensions negative indexing (#45068)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45068

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D23816081

Pulled By: IvanKobzarev

fbshipit-source-id: bda753f3f216dac7c05b6f728a3bd6068e5d06a0
2020-09-21 21:24:16 -07:00
5621ba87a2 [vulkan] reshape op to use infer_size to expand -1 (#45104)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45104

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D23834249

Pulled By: IvanKobzarev

fbshipit-source-id: 0e3699d6a4227788d1d634349c0bf259c0ad5e8d
2020-09-21 21:08:59 -07:00
8968030f19 [WIP] Add vec256 test to linux CI (#44912)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44912

This adds the vec256 test to the Linux CI system.
The whole test takes 50 to 70 seconds.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D23772923

Pulled By: glaringlee

fbshipit-source-id: ef929b53f3ea7894abcd9510a8e0389979cab4a2
2020-09-21 21:00:29 -07:00
4b3046ed28 Vectorize int8_t on CPU (#44759)
Summary:
int8_t is not vectorized in vec256_int.h. This PR adds vectorization for
int8_t. As pointed out in https://github.com/pytorch/pytorch/issues/43033, this is an important type for vectorization because
a lot of images are loaded in this data type.

Related issue: https://github.com/pytorch/pytorch/issues/43033

Benchmark (Debian Buster,  Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz, Turbo off, Release build):

```python
import timeit
dtype = 'torch.int8'
for op in ('+', '-'):
    for n, t in [(10_000, 200000),
                (100_000, 20000)]:
        print(f'a {op} b, numel() == {n} for {t} times, dtype={dtype}')
        print(timeit.timeit(f'c = a {op} b', setup=f'import torch; a = torch.arange(1, {n}, dtype={dtype}); b = torch.arange({n}, 1, -1, dtype={dtype})', number=t))
```

Results:

Before:

```
a + b, numel() == 10000 for 200000 times, dtype=torch.int8
1.2223373489978258
a + b, numel() == 100000 for 20000 times, dtype=torch.int8
0.6108450189931318
a - b, numel() == 10000 for 200000 times, dtype=torch.int8
1.256775538000511
a - b, numel() == 100000 for 20000 times, dtype=torch.int8
0.6101213909860235
```

After:

```
a + b, numel() == 10000 for 200000 times, dtype=torch.int8
0.5713336059998255
a + b, numel() == 100000 for 20000 times, dtype=torch.int8
0.39169703199877404
a - b, numel() == 10000 for 200000 times, dtype=torch.int8
0.5838428330025636
a - b, numel() == 100000 for 20000 times, dtype=torch.int8
0.37486923701362684
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44759

Reviewed By: malfet

Differential Revision: D23786383

Pulled By: glaringlee

fbshipit-source-id: 67f5bcd344c0b5014bacbc876143231fca156713
2020-09-21 19:55:13 -07:00
f77ba0e48c Change typo 'momemtum' to 'momentum' (#45045)
Summary:
As the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45045

Reviewed By: mruberry

Differential Revision: D23808563

Pulled By: mrshenli

fbshipit-source-id: ca818377f4c23d67b037c146fef667ab8731961e
2020-09-21 19:03:26 -07:00
20f52cdd76 [hpc]optimize the torch.cat cuda kernel (#44833)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44833

The current cat CUDA kernel uses pinned memory to pass the tensor data. 1) This is much slower than passing the data through kernel arguments in constant memory; 2) the H2D copy sometimes overlaps with other H2D copies in training, which generates random delays and leads to desync issues.

For small N, we actually saw 2X improvements.

Test Plan:
benchmark
```
./buck-out/opt/gen/caffe2/benchmarks/operator_benchmark/pt/cat_test.par --tag_filter all --device cuda
```
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 38.825

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 45.440

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 38.765

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 60.075

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 65.203

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 83.941

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f0d50fc2440>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f0d50fc2440>, 111, 65], N: 5, dim: 0, device: cuda
Forward Execution Time (us) : 51.059

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7f0d50fc2b90>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7f0d50fc2b90>, 64], N: 5, dim: 1, device: cuda
Forward Execution Time (us) : 42.134

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7f0b22b7e3b0>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7f0b22b7e3b0>], N: 5, dim: 2, device: cuda
Forward Execution Time (us) : 78.333

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f0b22b7e5f0>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f0b22b7e5f0>, 32, 64], N: 50, dim: 0, device: cuda
Forward Execution Time (us) : 77.065

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7f0b22b7e680>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7f0b22b7e680>, 64], N: 50, dim: 1, device: cuda
Forward Execution Time (us) : 74.632

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7f0b22b7e710>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7f0b22b7e710>], N: 50, dim: 2, device: cuda
Forward Execution Time (us) : 81.846

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 99.291

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
Forward Execution Time (us) : 114.060

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
Forward Execution Time (us) : 478.777

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f0b22b7e7a0>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f0b22b7e7a0>], N: 100, dim: 0, device: cuda
Forward Execution Time (us) : 80.165

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f0b22b7e830>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f0b22b7e830>], N: 1000, dim: 0, device: cuda
Forward Execution Time (us) : 491.983

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f0b22b7e8c0>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f0b22b7e8c0>], N: 2000, dim: 0, device: cuda
Forward Execution Time (us) : 966.613

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f0b22b7e950>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f0b22b7e950>], N: 3000, dim: 0, device: cuda
Forward Execution Time (us) : 1500.133
```

After optimization
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 22.168

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 33.430

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 19.884

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 48.082

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 53.261

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 71.294

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f837a135200>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f837a135200>, 111, 65], N: 5, dim: 0, device: cuda
Forward Execution Time (us) : 40.165

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7f837a135950>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7f837a135950>, 64], N: 5, dim: 1, device: cuda
Forward Execution Time (us) : 32.666

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7f82e50e2440>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7f82e50e2440>], N: 5, dim: 2, device: cuda
Forward Execution Time (us) : 67.003

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f82e50e24d0>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f82e50e24d0>, 32, 64], N: 50, dim: 0, device: cuda
Forward Execution Time (us) : 67.035

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7f82e50e2560>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7f82e50e2560>, 64], N: 50, dim: 1, device: cuda
Forward Execution Time (us) : 63.803

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7f82e50e25f0>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7f82e50e25f0>], N: 50, dim: 2, device: cuda
Forward Execution Time (us) : 69.969

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 98.327

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
Forward Execution Time (us) : 112.363

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
Forward Execution Time (us) : 478.224

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f82e50e2680>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f82e50e2680>], N: 100, dim: 0, device: cuda
Forward Execution Time (us) : 63.269

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f82e50e2710>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f82e50e2710>], N: 1000, dim: 0, device: cuda
Forward Execution Time (us) : 470.141

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f82e50e27a0>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f82e50e27a0>], N: 2000, dim: 0, device: cuda
Forward Execution Time (us) : 966.668

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f82e50e2830>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f82e50e2830>], N: 3000, dim: 0, device: cuda
Forward Execution Time (us) : 1485.309
```

Reviewed By: ngimel

Differential Revision: D23727275

fbshipit-source-id: 171275ac541c649f7aeab0a2f8f0fea9486d0180
2020-09-21 18:38:25 -07:00
81bb19c9f0 [JIT] Prohibit subscripted assignments for tuple types (#44929)
Summary:
This would force jit.script to raise an error if someone tries to mutate a tuple:
```
Tuple[int, int] does not support subscripted assignment:
  File "/home/nshulga/test/tupleassignment.py", line 9
@torch.jit.script
def foo(x: Tuple[int, int]) -> int:
    x[-1] = x[0] + 1
    ~~~~~ <--- HERE
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44929

Reviewed By: suo

Differential Revision: D23777668

Pulled By: malfet

fbshipit-source-id: 8efaa4167354ffb4930ccb3e702736a3209151b6
2020-09-21 16:35:44 -07:00
9a31eee107 [vulkan] Remove duplication of op registration and clean unused vars (#44932)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44932

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D23778203

Pulled By: IvanKobzarev

fbshipit-source-id: d1bc0a5c2cdd711d8a4cd983154a4f6774987674
2020-09-21 15:57:32 -07:00
dfb8f2d51f CUDA BFloat16 addmm, addmv (#44986)
Summary:
This PR was originally authored by slayton58. I took his implementation and added some tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44986

Reviewed By: mruberry

Differential Revision: D23806039

Pulled By: ngimel

fbshipit-source-id: 305d66029b426d8039fab3c3e011faf2bf87aead
2020-09-21 14:28:27 -07:00
581a364437 CUDA BFloat16 unary ops part 1 (#44813)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44813

Reviewed By: mruberry

Differential Revision: D23805816

Pulled By: ngimel

fbshipit-source-id: 28c645dc31f094c8b6c3d3803f0b4152f0475a64
2020-09-21 14:22:31 -07:00
1cab27d485 Add a torch.hub.load_local() function that can load models from any local directory with a hubconf.py (#44204)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43622

- Moves the model loading part of `torch.hub.load()` into a new `torch.hub.load_local()` function that takes in a path to a local directory that contains a `hubconf.py` instead of a repo name.
- Refactors `torch.hub.load()` so that it now calls `torch.hub.load_local()` after downloading and extracting the repo.
- Updates `torch.hub` docs to include the new function + minor fixes.
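
A usage sketch, assuming the new function mirrors `torch.hub.load()` with the GitHub repo string replaced by a local directory path (the path and entry-point name below are placeholders, not from the PR):

```python
import torch

# '/path/to/checkout' must contain a hubconf.py exposing a 'resnet18' entry point.
model = torch.hub.load_local('/path/to/checkout', 'resnet18', pretrained=True)
```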

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44204

Reviewed By: malfet

Differential Revision: D23817429

Pulled By: ailzhang

fbshipit-source-id: 788fd83c87a94f487b558715b2809d346ead02b2
2020-09-21 14:17:21 -07:00
c941dd3492 [FX] s/get_param/get_attr/ (#45000)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45000

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D23798016

Pulled By: jamesr66a

fbshipit-source-id: 1d2f3db1994a62b95d0ced03bf958e54d30c35dd
2020-09-21 14:09:32 -07:00
9dc2bcdc07 Introducing (Const)StridedRandomAccessor + CompositeRandomAccessor + migrate sort to ATen (CPU) (#39744)
Summary:
This PR introduces a (Const)StridedRandomAccessor, a [random access iterator](https://en.cppreference.com/w/cpp/named_req/RandomAccessIterator) over a strided array, and a CompositeRandomAccessor, a random access iterator over two random access iterators.

The main motivation is to be able to use a handful of operations from STL and thrust in numerous dim-apply types of algorithms and eliminate unnecessary buffer allocations. Plus more advanced algorithms are going to be available with C++17.

Porting `sort` provides a hands-on example of how these iterators could be used.

Fixes [https://github.com/pytorch/pytorch/issues/24770](https://github.com/pytorch/pytorch/issues/24770).

Some benchmarks:
```python
import torch
from IPython import get_ipython

torch.manual_seed(13)

ipython = get_ipython()

sizes = [
        [10000, 10000],
        [1000, 1000, 100]
        ]
for size in sizes:
    t = torch.randn(*size)
    dims = len(size)

    print(f"Tensor of size {size}")
    for dim in range(dims):
        print(f"sort for dim={dim}")
        print("float:")
        ipython.magic("timeit t.sort(dim)")
    print()

```
#### Master
```
Tensor of size [10000, 10000]
sort for dim=0
float:
10.7 s ± 201 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
6.27 s ± 50.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Tensor of size [1000, 1000, 100]
sort for dim=0
float:
7.21 s ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
6.1 s ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=2
float:
3.58 s ± 27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

```
#### This PR
```
Tensor of size [10000, 10000]
sort for dim=0
float:
10.5 s ± 209 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
6.16 s ± 28.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Tensor of size [1000, 1000, 100]
sort for dim=0
float:
5.94 s ± 60.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
5.1 s ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=2
float:
3.43 s ± 8.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

```
As you can see, the legacy sorting routine is actually quite efficient. The performance gain is likely due to the improved reduction with TensorIterator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39744

Reviewed By: malfet

Differential Revision: D23796486

Pulled By: glaringlee

fbshipit-source-id: 7bddad10dfbc0a0e5cad7ced155d6c7964e8702c
2020-09-21 13:24:58 -07:00
7118d53711 add .cache to gitignore (#45017)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45017

this is the default indexing folder for clangd 11.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23817619

Pulled By: suo

fbshipit-source-id: 6a60136e591b2fec3d432ac5343cb76ac0934502
2020-09-21 12:51:35 -07:00
1a580c1021 Adding test to quantized copy for 'from float' (#43681)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43681

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D23364507

Pulled By: z-a-f

fbshipit-source-id: ef1b00937b012b0647d9b9afa054437f2bce032a
2020-09-21 12:38:59 -07:00
7de512ced8 nightly robustness fixes for linking across devices (#43771)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43761

CC rgommers ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43771

Reviewed By: glaringlee

Differential Revision: D23819835

Pulled By: malfet

fbshipit-source-id: a3be2780c4b8bdbf347d456c4d14df863c2ff8c2
2020-09-21 12:32:32 -07:00
42af2c7923 [jit] gtest-ify test_alias_analysis.cpp (#45018)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45018

Now that https://github.com/pytorch/pytorch/pull/44795 has landed, we
can convert the bulk of our cpp tests to use gtest APIs. Eventually
we'll want to get rid of our weird harness for cpp tests entirely in
favor of using regular gtest everywhere. This PR demonstrates some of
the benefits of this approach:
1. You don't need to register your test twice (once to define it, once
in tests.h).
2. Consequently, it's easier to have many individual test cases.
Failures can be reported independently (rather than having huge
functions to test entire modules).
3. Some nicer testing APIs, notably test fixtures.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23802297

Pulled By: suo

fbshipit-source-id: 774255da7716294ac573747dcd5e106e5fe3ac8f
2020-09-21 12:19:37 -07:00
92f8f75c59 Add alias dispatch key Math. (#44354)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44354

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23591481

Pulled By: ailzhang

fbshipit-source-id: 6e93c4ec99a07f3fc920ba2d09dc222e6ced5adf
2020-09-21 11:10:39 -07:00
acc2a1e5fa Update submodule gloo (#45025)
Summary:
Includes commits to fix the Windows CI failure in the "enable distributed training on Windows" PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45025

Reviewed By: beauby

Differential Revision: D23807995

Pulled By: mrshenli

fbshipit-source-id: a2f4c1684927ca66d7d3e9920ecb588fb4386f7c
2020-09-21 10:28:37 -07:00
a4aba1d465 fix compile error (#45052)
Summary:
Update the vulkanOptimizeForMobile function invocation in optimize_for_mobile.cc to align with the latest call contract from PR https://github.com/pytorch/pytorch/pull/44903.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45052

Reviewed By: malfet

Differential Revision: D23814953

Pulled By: mrshenli

fbshipit-source-id: 0fa844a8291e952715b9de35cdec0e411c42b7f9
2020-09-21 10:23:49 -07:00
ac8c7c4e9f Make Channel API accept buffer structs rather than raw pointers. (#45014)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45014

Pull Request resolved: https://github.com/pytorch/tensorpipe/pull/219

Pull Request resolved: https://github.com/pytorch/tensorpipe/pull/212

+ Introduce buffer.h defining the buffer struct(s). The `CpuBuffer`
struct is always defined, while the `CudaBuffer` struct is defined
only when `TENSORPIPE_SUPPORTS_CUDA` is true.
+ Update all channels to take a `CpuBuffer` or `CudaBuffer` for
`send`/`recv` rather than a raw pointer and a length.
+ Make the base `Channel`/`Context` classes templated on `TBuffer`,
effectively creating two channel hierarchies (one for CPU channels,
one for CUDA channels).
+ Update the Pipe and the generic channel tests to use the new API. So
far, generic channel tests are CPU only, and tests for the CUDA IPC
channel are (temporarily) disabled. A subsequent PR will take care of
refactoring tests so that generic tests work for CUDA channels. An
other PR will add support for CUDA tensors in the Pipe.

Differential Revision: D23598033

Test Plan: Imported from OSS

Reviewed By: lw

Pulled By: beauby

fbshipit-source-id: 1d6c3f91e288420858835cd5e7962e8da051b44b
2020-09-21 10:18:45 -07:00
4bbb6adff5 [NNC] fix SyncThreads insertion and reenable CudaSharedMem test (#44909)
Summary:
A previous fix for masking Cuda dimensions (https://github.com/pytorch/pytorch/issues/44733) changed the behaviour of inserting thread synchronization barriers in the Cuda CodeGen, causing the CudaSharedMemReduce_1 to be flaky and ultimately disabled.

The issue is working out where these barriers must be inserted - solving this optimally is very hard, and I think not possible without dependency analysis we don't have, so I've changed our logic to be quite pessimistic. We'll insert barriers before and after any blocks that have thread dimensions masked (even between blocks that have no data dependencies). This should be correct, but it's an area where we could improve performance. To address this somewhat, I've added a simplifier pass that removes obviously unnecessary syncThreads.

To avoid this test being flaky again, I've added a check against the generated code to ensure there is a syncThread in the right place.

Also fixed a couple of non-functional clarity issues in the generated code: added the missing newline after Stores in the CudaPrinter, and prevented the PrioritizeLoad mutator from pulling out loads contained within simple Let statements (such as those produced by the Registerizer).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44909

Reviewed By: agolynski

Differential Revision: D23800565

Pulled By: nickgg

fbshipit-source-id: bddef1f40d8d461da965685f01d00b468d8a2c2f
2020-09-21 09:27:22 -07:00
e2f49c8437 skip im2col & vol2col in cpu/cuda convolution methods (#44600)
Summary:
This fixes https://github.com/pytorch/pytorch/issues/44482.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44600

Reviewed By: ngimel

Differential Revision: D23733483

Pulled By: walterddr

fbshipit-source-id: 90e188027ef6bb08588619b6629110b5f73d63e3
2020-09-21 09:20:23 -07:00
a6895d43b6 Turn on gradgrad check for BCELoss Criterion Tests. (#44894)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44894

Looks like we added double backwards support but only turned on the ModuleTests.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23762544

Pulled By: gchanan

fbshipit-source-id: b5cef579608dd71f3de245c4ba92e49216ce8a5e
2020-09-21 07:14:22 -07:00
4810365576 Enabled torch.testing._internal.jit_utils.* typechecking. (#44985)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44985

Reviewed By: malfet

Differential Revision: D23794444

Pulled By: kauterry

fbshipit-source-id: 9893cc91780338a8223904fb574efa77fa3ab2b9
2020-09-21 01:19:06 -07:00
9f67176b82 Complex gradcheck logic (#43208)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43208

This PR adds gradcheck for complex. The logic used for complex gradcheck is described in Section 3.5.3 here: https://arxiv.org/pdf/1701.00392.pdf

More concretely, this PR introduces the following changes:
1. Updates get_numerical_jacobian to take as input a scalar value for vector (v). Adds gradcheck logic for C -> C, C-> R, R -> C. For R -> C functions, only the real value of gradient is propagated.
2. Adds backward definition for `torch.complex` and also adds a test to verify the definition added.
3. Updates backward for `mul`, `sin`, `cos`, `sinh`, `cosh`.
4. Adds tests for all `torch.real`, `torch.imag`, `torch.view_as_real`, `torch.view_as_complex`, `torch.conj`.
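
A minimal sketch of what the new support enables - checking a C -> C function with a complex128 input through the standard `gradcheck` API:

```python
import torch

x = torch.randn(4, dtype=torch.complex128, requires_grad=True)
# gradcheck raises on failure and returns True on success.
assert torch.autograd.gradcheck(lambda t: t * t, (x,))
```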

Follow up tasks:
1. Add more thorough tests for R -> C cases. Specifically, add R->C test variants for functions. for e.g., `torch.mul(complex_tensor, real_tensor)`
2. Add back commented test in `common_methods_invocation.py`.
3. Add more special case checking for complex gradcheck to make debugging easier.
4. Update complex autograd note.
5. disable complex autograd for operators not tested for complex.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23655088

Pulled By: anjali411

fbshipit-source-id: caa75e09864b5f6ead0f988f6368dce64cf15deb
2020-09-20 22:05:04 -07:00
da7863f46b Add one dimensional FFTs to torch.fft namespace (#43011)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43011

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23751850

Pulled By: mruberry

fbshipit-source-id: 8dc5fec75102d8809eeb85a3d347ba1b5de45b33
2020-09-19 23:32:22 -07:00
49db7b59e0 For logical tests, use the dtypes decorator (#42483)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42483

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23684424

Pulled By: mruberry

fbshipit-source-id: ba7ab5c3a6eaa0c16975728200f27d164ed4f852
2020-09-19 19:01:49 -07:00
60709ad1bf Adds multiply and divide aliases (#44463)
Summary:
These aliases are consistent with NumPy. Note that C++'s naming would be different (std::multiplies and std::divides), and that PyTorch's existing names (mul and div) are consistent with Python's dunders.

This also improves the instructions for adding an alias to clarify that dispatch keys should be removed when copying native_function.yaml entries to create the alias entries.
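
For illustration, the new aliases behave identically to the existing names:

```python
import torch

a, b = torch.tensor([6.0, 8.0]), torch.tensor([2.0, 4.0])
assert torch.equal(torch.multiply(a, b), torch.mul(a, b))
assert torch.equal(torch.divide(a, b), torch.div(a, b))
```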

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44463

Reviewed By: ngimel

Differential Revision: D23670782

Pulled By: mruberry

fbshipit-source-id: 9f1bdf8ff447abc624ff9e9be7ac600f98340ac4
2020-09-19 15:47:52 -07:00
faef89c89f CUDA BFloat Pooling (#44836)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44836

Reviewed By: mruberry

Differential Revision: D23800992

Pulled By: ngimel

fbshipit-source-id: 2945a27874345197cbd1d8a4fbd20816afc02c86
2020-09-19 15:43:36 -07:00
7ecfaef7ec CUDA BFloat16 layernorm (#45002)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45002

Reviewed By: mruberry

Differential Revision: D23800931

Pulled By: ngimel

fbshipit-source-id: cc213d02352907a3e945cd9fffd1de29e355a16c
2020-09-19 15:36:03 -07:00
2163d31016 histogram observer: ensure buffer shape consistency (#44956)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44956

Makes the buffers of HistogramObserver have the same shapes
in the uninitialized and initialized states.

This is useful because the detectron2 checkpointer assumes
that these states will stay the same, so it removes the
need for manual hacks around the shapes changing.
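
A sketch of the invariant this enforces (assuming the standard `HistogramObserver` entry point; the specific buffer names are an implementation detail):

```python
import torch
from torch.quantization import HistogramObserver

obs = HistogramObserver()
before = {k: v.shape for k, v in obs.state_dict().items()}
obs(torch.randn(16))  # observe some data
after = {k: v.shape for k, v in obs.state_dict().items()}
assert before == after  # buffer shapes no longer change after observation
```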

Test Plan:
```
python test/test_quantization.py TestObserver.test_histogram_observer_consistent_buffer_shape
```

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23785382

fbshipit-source-id: 1a83fd4f39b244b00747c368d5d305a07d877c92
2020-09-19 09:29:39 -07:00
0714c003ee [pytorch][tensorexpr] Make gtest-style macros in tests match actual gtest signatures (#44861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44861

We were redefining things like ASSERT_EQ to take a `__VA_ARGS__` parameter, so compiling these files with gtest (instead of pytorch's custom python-based cpp test infra) fails.

Test Plan: buck build //caffe2/test/cpp/tensorexpr

Reviewed By: asuhan

Differential Revision: D23711293

fbshipit-source-id: 8af14fa7c1f1e8169d14bb64515771f7bc3089e5
2020-09-19 07:25:05 -07:00
9e5045e978 [pytorch] clean up normalized_dynamic_type() hack (#44889)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44889

This HACK doesn't seem to be necessary any more - there is no 'real'
type in the generated Declarations.yaml file.
Verified by comparing generated code before/after.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23761624

Pulled By: ljk53

fbshipit-source-id: de996f04d77eebea3fb9297dd90a8ebeb07647bb
2020-09-18 23:49:46 -07:00
d75c402755 Add cusolver to build, rewrite MAGMA inverse with cusolver (#42403)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42265

This PR adds cusolver to the pytorch build, and enables the use of cusolver/cublas library functions for GPU `torch.inverse` on certain tensor shapes.

Specifically, when

* the tensor is two dimensional (single batch), or
* has >2 dimensions (multiple batches) and `batch_size <= 2`, or
* magma is not linked,

cusolver/cublas will be used. In other conditions, the current implementation of MAGMA will still be used.
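
In pseudocode, the dispatch rule reads roughly as follows (a sketch only; the real logic lives in the C++ sources linked below):

```python
import math

def use_cusolver(shape, magma_linked):
    # shape is the tensor's size; the last two dims are the matrix dims.
    ndim = len(shape)
    batch_size = math.prod(shape[:-2]) if ndim > 2 else 1
    return ndim == 2 or batch_size <= 2 or not magma_linked
```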

8c0949ae45/aten/src/ATen/native/cuda/BatchLinearAlgebra.cu (L742-L752)

The reason for this is that for tensors with large batch_size, `cublasXgetrfBatched` and `cublasXgetriBatched` don't perform very well. For `batch_size > 1`, we launch cusolver functions in multiple streams. This lets cusolver functions run in parallel, and can greatly increase the performance. When `batch_size > 2`, the parallel-launched cusolver functions are slightly slower than the current magma implementation, so we still use the current magma impl.

On CUDA 9.2, some numerical issues were detected, so the cusolver impl will not be used there. The cusolver impl will also not be used on platforms other than Nvidia CUDA.

060769feaf/aten/src/ATen/native/cuda/BatchLinearAlgebraLib.h (L10-L13)

Note that there is a new heuristic used before cusolver/cublas calls here:

8c0949ae45/aten/src/ATen/native/cuda/MiscUtils.h (L113-L121)

where `use_loop_launch = true` means launch single-batch cusolver functions in parallel, and `use_loop_launch = false` means use cublas_X_batched functions. When magma is enabled (only `batch_size <= 2` will be dispatched to cusolver/cublas), the heuristic will always return `true` and the cusolver calls are faster than small batch_size magma calls. When magma is disabled, this adds the functionality of `torch.inverse`, which was disabled before for all shapes (though large batch_size cublas performance may not be as good as magma's).

Checklist:
- [X] Add benchmark, cpu, gpu-before (magma), gpu-after (cusolver)
- [X] Rewrite single inverse (ndim == 2) with cusolver
- [X] Rewrite batched inverse (ndim > 2) with cublas
- [X] Add cusolver to build
- [x] Clean up functions related to `USE_MAGMA` define guard
- [x] Workaround for non-cuda platform
- [x] Workaround for cuda 9.2
- [x] Add zero size check
- [x] Add tests

Next step:

If cusolver doesn't cause any problems in the pytorch build, and there are no major performance regressions reported after this PR is merged, I will start porting other cusolver/cublas linear algebra functions to improve performance.

<details>
<summary> benchmark 73499c6 </summary>

benchmark code: https://github.com/xwang233/code-snippet/blob/master/torch.inverse/inverse-cusolver.ipynb

shape meaning:

* `[] 2 torch.float32 -> torch.randn(2, 2, dtype=torch.float32)`
* `[2] 4 torch.float32 -> torch.randn(2, 4, 4, dtype=torch.float32)`

| shape | cpu_time (ms) | gpu_time_before (magma) (ms) | gpu_time_after (ms) |
| --- | --- | --- | --- |
| [] 2 torch.float32 |  0.095 |  7.534 |  0.129  |
| [] 4 torch.float32 |  0.009 |  7.522 |  0.129  |
| [] 8 torch.float32 |  0.011 |  7.647 |  0.138  |
| [] 16 torch.float32 |  0.075 |  7.582 |  0.135  |
| [] 32 torch.float32 |  0.073 |  7.573 |  0.191  |
| [] 64 torch.float32 |  0.134 |  7.694 |  0.288  |
| [] 128 torch.float32 |  0.398 |  8.073 |  0.491  |
| [] 256 torch.float32 |  1.054 |  11.860 |  1.074  |
| [] 512 torch.float32 |  5.218 |  14.130 |  2.582  |
| [] 1024 torch.float32 |  19.010 |  18.780 |  6.936  |
| [1] 2 torch.float32 |  0.009 |  0.113 |  0.128 ***regressed |
| [1] 4 torch.float32 |  0.009 |  0.113 |  0.131 ***regressed |
| [1] 8 torch.float32 |  0.011 |  0.116 |  0.129 ***regressed |
| [1] 16 torch.float32 |  0.015 |  0.122 |  0.135 ***regressed |
| [1] 32 torch.float32 |  0.032 |  0.177 |  0.178 ***regressed |
| [1] 64 torch.float32 |  0.070 |  0.420 |  0.281  |
| [1] 128 torch.float32 |  0.328 |  0.816 |  0.490  |
| [1] 256 torch.float32 |  1.125 |  1.690 |  1.084  |
| [1] 512 torch.float32 |  4.344 |  4.305 |  2.576  |
| [1] 1024 torch.float32 |  16.510 |  16.340 |  6.928  |
| [2] 2 torch.float32 |  0.009 |  0.113 |  0.186 ***regressed |
| [2] 4 torch.float32 |  0.011 |  0.115 |  0.184 ***regressed |
| [2] 8 torch.float32 |  0.012 |  0.114 |  0.184 ***regressed |
| [2] 16 torch.float32 |  0.019 |  0.119 |  0.173 ***regressed |
| [2] 32 torch.float32 |  0.050 |  0.170 |  0.240 ***regressed |
| [2] 64 torch.float32 |  0.120 |  0.429 |  0.375  |
| [2] 128 torch.float32 |  0.576 |  0.830 |  0.675  |
| [2] 256 torch.float32 |  2.021 |  1.748 |  1.451  |
| [2] 512 torch.float32 |  9.070 |  4.749 |  3.539  |
| [2] 1024 torch.float32 |  33.655 |  18.240 |  12.220  |
| [4] 2 torch.float32 |  0.009 |  0.112 |  0.318 ***regressed |
| [4] 4 torch.float32 |  0.010 |  0.115 |  0.319 ***regressed |
| [4] 8 torch.float32 |  0.013 |  0.115 |  0.320 ***regressed |
| [4] 16 torch.float32 |  0.027 |  0.120 |  0.331 ***regressed |
| [4] 32 torch.float32 |  0.085 |  0.173 |  0.385 ***regressed |
| [4] 64 torch.float32 |  0.221 |  0.431 |  0.646 ***regressed |
| [4] 128 torch.float32 |  1.102 |  0.834 |  1.055 ***regressed |
| [4] 256 torch.float32 |  4.042 |  1.811 |  2.054 ***regressed |
| [4] 512 torch.float32 |  18.390 |  4.884 |  5.087 ***regressed |
| [4] 1024 torch.float32 |  69.025 |  19.840 |  20.000 ***regressed |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42403

Reviewed By: ailzhang, mruberry

Differential Revision: D23717984

Pulled By: ngimel

fbshipit-source-id: 54cbd9ea72a97989cff4127089938e8a8e29a72b
2020-09-18 20:43:29 -07:00
620c999979 update gloo submodule (#45008)
Summary:
Revert accidental gloo submodule changes in https://github.com/pytorch/pytorch/issues/41977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45008

Reviewed By: malfet

Differential Revision: D23799892

Pulled By: ngimel

fbshipit-source-id: e8dab244c6abad32ed60efe3c26cab40837e57c8
2020-09-18 19:02:36 -07:00
21a1b9c7cf skip more nccl tests that causes flaky timeouts on rocm build (#44996)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44996

Reviewed By: malfet

Differential Revision: D23797564

Pulled By: walterddr

fbshipit-source-id: 4d60f76bb8ae54bb04a9f4143a68623933461b2a
2020-09-18 18:53:47 -07:00
1c15452703 Update Windows builders to latest VS2019 (#44746)
Summary:
Restore https://github.com/pytorch/pytorch/issues/44706 (reverted by https://github.com/pytorch/pytorch/issues/41977), which should work around the VC compiler crash.
Update configs to use the ":stable" Windows images.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44746

Reviewed By: walterddr

Differential Revision: D23793682

Pulled By: malfet

fbshipit-source-id: bfdc36c35b920f58798a18c15642ec7efc68f00e
2020-09-18 18:46:44 -07:00
e9941a5dd4 [vulkan][py] torch.utils.optimize_for_vulkan (#44903)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44903

Test Plan: Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D23766039

Pulled By: IvanKobzarev

fbshipit-source-id: dbdf484ee7d3a7719aab105efba51b92ebc51568
2020-09-18 18:20:11 -07:00
572f7e069c Enable type check for torch.testing._internal.te_utils.* (#44927)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44927

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D23776842

Pulled By: sshawnwu

fbshipit-source-id: 65c028169a37e1f2f7d9fdce8a958234ee1caa26
2020-09-18 18:09:15 -07:00
043466f978 [FX] Pass module's qualname to is_leaf_module (#44966)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44966

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Differential Revision: D23790360

Pulled By: jamesr66a

fbshipit-source-id: 7ef569fd93646584b27af7a615fa69c8d8bbdd3b
2020-09-18 17:02:33 -07:00
40c09cfe14 [CircleCI] Fix CUDA test setup (#44982)
Summary:
CircleCI updated the windows-nvidia-2019:canary image to exclude VC++ 14.26.
Update the config to use 14.27.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44982

Reviewed By: seemethere

Differential Revision: D23794116

Pulled By: malfet

fbshipit-source-id: f3281f7d51acae4a4d06cecff01100fa77bd81ff
2020-09-18 16:20:24 -07:00
e255a4e1fd Enable bfloat16 random kernels on Windows (#44918)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44918

Reviewed By: pbelevich

Differential Revision: D23777548

Pulled By: ngimel

fbshipit-source-id: 9cf13166d7deba17bc72e402b82ed0afe347cb9b
2020-09-18 15:55:32 -07:00
06389406bb CUDA BFloat activations 1 (#44834)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44834

Reviewed By: mruberry

Differential Revision: D23752660

Pulled By: ngimel

fbshipit-source-id: 209a937e8a9afe12b7dd86ecfa493c9417fd22fb
2020-09-18 15:48:49 -07:00
76a109c930 [caffe2/aten] Fix clang build (#44934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44934

Fix build errors when using clang to build cuda sources:

```
In file included from aten/src/ATen/native/cuda/DistributionBernoulli.cu:4:
In file included from aten/src/ATen/cuda/CUDAApplyUtils.cuh:5:
caffe2/aten/src/THC/THCAtomics.cuh:321:1: error: control reaches end of non-void function [-Werror,-Wreturn-type]
}
^
1 error generated when compiling for sm_70.

In file included from aten/src/ATen/native/cuda/DistributionBernoulli.cu:4:
In file included from aten/src/ATen/cuda/CUDAApplyUtils.cuh:5:
caffe2/aten/src/THC/THCAtomics.cuh:321:1: error: control reaches end of non-void function [-Werror,-Wreturn-type]
}
^
1 error generated when compiling for sm_60.

In file included from aten/src/ATen/native/cuda/DistributionBernoulli.cu:4:
In file included from aten/src/ATen/cuda/CUDAApplyUtils.cuh:5:
caffe2/aten/src/THC/THCAtomics.cuh:321:1: error: control reaches end of non-void function [-Werror,-Wreturn-type]
}
^
1 error generated when compiling for sm_52.
```

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D23775266

fbshipit-source-id: 141e6624e2da870a8c50ff9f71fcf0717222fb17
2020-09-18 15:22:09 -07:00
fd4e21c91e Add optional string support to native_functions schema (#43010)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43010

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23751851

Pulled By: mruberry

fbshipit-source-id: 648f7430e1b7311eff28421f38e01f52d998fcbd
2020-09-18 14:57:24 -07:00
2d884f2263 Optimize Scale function (#44913)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44913

Pull Request resolved: https://github.com/pytorch/pytorch/pull/18322

Optimize Scale function

i-am-not-moving-c2-to-c10

Test Plan: buck test mode/dbg caffe2/caffe2/python/operator_test:weighted_sum_test

Reviewed By: BIT-silence

Differential Revision: D14575780

fbshipit-source-id: db333a7964581dcaff6e432ff1d6b517ba1a075f
2020-09-18 14:31:33 -07:00
374e9373b5 [jit] Pull (most) tests out of libtorch_python (#44795)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44795

Today, we build our cpp tests twice, once as a standalone gtest binary,
and once linked in `libtorch_python` so we can call them from
`test_jit.py`.

This is convenient (it means that `test_jit.py` is a single entry point
for all our tests), but has a few drawbacks:
1. We can't actually use the gtest APIs, since we don't link gtest into
`libtorch_python`. We're stuck with the subset that we want to write
polyfills for, and an awkward registration scheme (where you have to
write a test, then include it in `tests.h`).
2. More seriously, we register custom operators and classes in these
tests. In a world where we may be linking many `libtorch_python`s, this
has a tendency to cause errors with `libtorch`.

So now, only tests that explicitly require cooperation with Python are
built into `libtorch_python`. The rest are built into
`build/bin/test_jit`.

There are tests which require that we define custom classes and
operators. In these cases, I've built them into separate `.so`s that we
call `torch.ops.load_library()` on.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity, ZolotukhinM

Differential Revision: D23735520

Pulled By: suo

fbshipit-source-id: d146bf4e7eb908afa6f96b394e4d395d63ad72ff
2020-09-18 14:04:40 -07:00
af3fc9725d Extract rpc/tensorpipe_utils.{cpp,h} from rpc/utils.{cpp,h} (#44803)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44803

Test Plan: CI

Reviewed By: lw

Differential Revision: D23732022

fbshipit-source-id: 5b839c7997bbee162a14d03414ee32baabbc8ece
2020-09-18 13:51:43 -07:00
d22dd80128 Enable type check for torch.testing._internal.common_device_type. (#44911)
Summary:
This PR intends to fix the type exceptions in common_device_type.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44911

Reviewed By: walterddr

Differential Revision: D23768397

Pulled By: wuyangzhang

fbshipit-source-id: 053692583b4d6169b0eb5ffe0c3d30635c0db699
2020-09-18 13:42:11 -07:00
a47e3697ab Use iterator of DispatchKeySet. (#44682)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44682

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23698387

Pulled By: ailzhang

fbshipit-source-id: 4fa140db9254c2c9c342bf1c8dfd952469b0b779
2020-09-18 13:34:27 -07:00
6d312132e1 Beef up vmap docs and expose to master documentation (#44825)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44825

Test Plan: - build and view docs locally.

Reviewed By: ezyang

Differential Revision: D23742727

Pulled By: zou3519

fbshipit-source-id: f62b7a76b5505d3387b7816c514c086c01089de0
2020-09-18 13:26:25 -07:00
c2cf6efd96 Enable type check for torch.testing._internal.dist_utils.* (#44832)
Summary:
Addresses a sub-task of https://github.com/pytorch/pytorch/issues/44752.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44832

Reviewed By: malfet

Differential Revision: D23744260

Pulled By: samestep

fbshipit-source-id: 46aede57b4fa66a770d5df382b0aea2bd6772b9b
2020-09-18 12:50:48 -07:00
7bd8a6913d CUDA BFloat div, addcdiv, addcmul, mean, var (#44758)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44758

Reviewed By: mruberry

Differential Revision: D23752317

Pulled By: ngimel

fbshipit-source-id: 77992cf991f4e2b4b6839de73ea7e6ce2e1061c6
2020-09-18 11:51:11 -07:00
f175830558 [NNC] Fuse identical conditions in simplifier (#44886)
Summary:
Adds a pass to the IR Simplifier which fuses together the bodies of Cond statements which have identical conditions. e.g.

```
if (i < 10) {
  do_thing_1;
} else {
  do_thing_2;
}
if (i < 10) {
  do_thing_3;
}
```

is transformed into:

```
if (i < 10) {
  do_thing_1;
  do_thing_3;
} else {
  do_thing_2;
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44886

Reviewed By: glaringlee

Differential Revision: D23768565

Pulled By: nickgg

fbshipit-source-id: 3fe40d91e82bdfff8dcb8c56a02a4fd579c070df
2020-09-18 11:38:03 -07:00
09f2c6a94c Back out "Revert D23494065: Refactor CallbackManager as a friend class of RecordFunction." (#44699)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44699

Original commit changeset: 3b1ec928e3db

Previous revert (D23698861) was on the wrong diff stack. Backing out the revert.

Test Plan: Passed unit tests and previously landed.

Reviewed By: mruberry

Differential Revision: D23702258

fbshipit-source-id: 5c3e197bca412f454db5a7e86251ec85faf621c1
2020-09-18 11:08:27 -07:00
174cbff00a Improve sugared value's error message (#42889)
Summary:
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **https://github.com/pytorch/pytorch/issues/42889 Improve sugared value's error message**

I think most (if not all) cases where this code path is reached can be attributed to closing over a global variable.
Improving the error message to make this clearer to users.

close https://github.com/pytorch/pytorch/issues/41288

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42889

Reviewed By: SplitInfinity

Differential Revision: D23779347

Pulled By: gmagogsfm

fbshipit-source-id: ced702a96234040f79eb16ad998d202e360d6654
2020-09-18 11:01:40 -07:00
0063512a4b [ONNX] Updates to diagnostic tool to find missing ops (#44124)
Summary:
Moved the description of the tool and changed the function name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44124

Reviewed By: albanD

Differential Revision: D23674618

Pulled By: bzinodev

fbshipit-source-id: 5db0bb14fc106fc96358b1e0590f08e975388c6d
2020-09-18 10:32:30 -07:00
c68cc78299 Add a device parameter to RemoteModule (#44254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44254

Add a device parameter to RemoteModule, so it can be placed on any device
and not just CPU.

Original PR issue: RemoteModule enhancements #40550

Test Plan: buck test test/distributed/rpc:process_group_agent -- RemoteModule

Reviewed By: pritamdamania87

Differential Revision: D23483803

fbshipit-source-id: 4918583c15c6a38a255ccbf12c9168660ab7f6db
2020-09-18 10:31:03 -07:00
cff0e57c31 Remove Incorrect Comment in tools/build_libtorch and remove Python2 support in the module import (#44888)
Summary:
Fixes #44293 and removes Python 2 imports from the MNIST download module, as Python 2 is no longer supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44888

Reviewed By: agolynski

Differential Revision: D23785579

Pulled By: bugra

fbshipit-source-id: d9380502380876282008dd2d5feb92a446648982
2020-09-18 10:03:36 -07:00
07b7e44ed1 Stop using check_criterion_jacobian. (#44786)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44786

This predates gradcheck and gradcheck does the same and more.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23731902

Pulled By: gchanan

fbshipit-source-id: 425fd30e943194f63a663708bada8960265b8f05
2020-09-18 07:04:57 -07:00
6d178f6b8e Stop ignoring errors in cuda nn module tests. (#44783)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44783

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23731778

Pulled By: gchanan

fbshipit-source-id: 32df903a9e36bbf3f66645ee2d77efa5ed6ee429
2020-09-18 07:03:41 -07:00
df39c40054 Cleanup tracer handling of optional arguments (#43009)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43009

* **#43009 Cleanup tracer handling of optional arguments**

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23766621

Pulled By: mruberry

fbshipit-source-id: c1b46cd23b58b18ef4c03021b2514d7e692badb6
2020-09-18 06:54:09 -07:00
caea1adc35 Complex support for stft and istft (#43886)
Summary:
Ref https://github.com/pytorch/pytorch/issues/42175, fixes https://github.com/pytorch/pytorch/issues/34797

This adds complex support to `torch.stft` and `torch.istft`. Note that there are really two issues with complex here: complex signals, and returning complex tensors.

## Complex signals and windows
`stft` currently assumes all signals are real and uses `rfft` with `onesided=True` by default. Similarly, `istft` always takes a complex fourier series and uses `irfft` to return real signals.

For `stft`, I now allow complex inputs and windows by calling the full `fft` if either are complex. If the user gives `onesided=True` and the signal is complex, then this doesn't work and raises an error instead. For `istft`, there's no way to automatically know what to do when `onesided=False` because that could either be a redundant representation of a real signal or a complex signal. So there, the user needs to pass the argument `return_complex=True` in order to use `ifft` and get a complex result back.

## stft returning complex tensors
The other issue is that `stft` returns a complex result, represented as a `(... x 2)` real tensor. I think ideally we want this to return proper complex tensors, but to preserve BC I've had to add a `return_complex` argument to manage this transition. `return_complex` defaults to false for real inputs to preserve BC, but defaults to True for complex inputs where there is no BC to consider.

In order to make `return_complex` the default everywhere without a sudden BC-breaking change, a simple transition plan could be:
1. introduce `return_complex`, defaulted to false when BC is an issue but giving a warning. (this PR)
2. raise an error in cases where `return_complex` defaults to false, making it a required argument.
3. change `return_complex` default to true in all cases.
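
As a minimal sketch of the new argument (shapes and parameter values are illustrative):

```python
import torch

x = torch.randn(1024)              # a real input signal
window = torch.hann_window(400)

# Request a proper complex tensor instead of the legacy (..., 2) real layout.
spec = torch.stft(x, n_fft=400, window=window, return_complex=True)
print(spec.dtype)                  # torch.complex64

# Round-trip back to a real signal of the original length.
y = torch.istft(spec, n_fft=400, window=window, length=x.numel())
```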

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43886

Reviewed By: glaringlee

Differential Revision: D23760174

Pulled By: mruberry

fbshipit-source-id: 2fec4404f5d980ddd6bdd941a63852a555eb9147
2020-09-18 01:39:47 -07:00
e400150c3b Fixed for caffe2/opt/tvm_transformer.cc (#44249)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44249

Reviewed By: gmagogsfm

Differential Revision: D23752331

Pulled By: SplitInfinity

fbshipit-source-id: 1d7297e080bc1e065129259e406af7216f3f0665
2020-09-18 00:03:59 -07:00
f2b3480795 CUDA BFloat softmax (#44837)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44837

Reviewed By: glaringlee

Differential Revision: D23767981

Pulled By: ngimel

fbshipit-source-id: be92c25a1b66ed50a52e090db167079def6f6b39
2020-09-17 21:52:47 -07:00
1694fde7eb Fix a GroupNorm cuda bug when input does not require_grad (#44863)
Summary:
Fix https://discuss.pytorch.org/t/illegal-memory-access-when-i-use-groupnorm/95800

`dX` is a Tensor, comparing `dX` with `nullptr` was wrong.

cc BIT-silence who wrote the kernel.

The test couldn't pass with `rtol=0` and `x.requires_grad=True`, so I had to update it to `1e-5`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44863

Reviewed By: mruberry

Differential Revision: D23754101

Pulled By: BIT-silence

fbshipit-source-id: 2eb0134dd489480e5ae7113a7d7b84629104cd49
2020-09-17 19:01:28 -07:00
5dbcbea265 TorchScript with record_function (#44345)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44345

As part of enhancing profiler support for RPC, when executing TorchScript functions over RPC, we would like to be able to support user-defined profiling scopes created by `with record_function(...)`.

Since after https://github.com/pytorch/pytorch/pull/34705, we support `with` statements in TorchScript, this PR adds support for `with torch.autograd.profiler.record_function` to be used within TorchScript.

This can be accomplished via the following without this PR:
```
torch.ops.profiler._record_function_enter(...)
# Script code, such as forward pass
torch.ops.profiler._record_function_exit(...)
```

This is a bit hacky and it would be much cleaner to use the context manager now that we support `with` statements. Also, `_record_function_` type operators are internal operators that are subject to change, this change will help avoid BC issues in the future.
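
With this change, a scripted function can use the context manager directly; a minimal sketch:

```python
import torch

@torch.jit.script
def forward_with_scope(x: torch.Tensor) -> torch.Tensor:
    # User-defined profiling scope, now supported inside TorchScript.
    with torch.autograd.profiler.record_function("my_scope"):
        return torch.relu(x + 1)

with torch.autograd.profiler.profile() as prof:
    forward_with_scope(torch.randn(4))
# "my_scope" then shows up in prof.key_averages().
```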

Tested with `python test/test_jit.py TestWith.test_with_record_function -v`
ghstack-source-id: 112320645

Test Plan:
Repro instructions:
1) Change `def script_add_ones_return_any(x) -> Any` to `def script_add_ones_return_any(x) -> Tensor` in `jit/rpc_test.py`
2) `buck test mode/dev-nosan //caffe2/test/distributed/rpc:process_group_agent -- test_record_function_on_caller_rpc_async --print-passing-details`
3) The function which ideally should accept `Future[Any]` is `def _call_end_callbacks_on_future` in `autograd/profiler.py`.

python test/test_jit.py TestWith.test_with_foo -v

Reviewed By: pritamdamania87

Differential Revision: D23332074

fbshipit-source-id: 61b0078578e8b23bfad5eeec3b0b146b6b35a870
2020-09-17 18:45:00 -07:00
4a9c80e82e [pytorch][bot] update mobile op deps (#44854)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44854

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23751925

Pulled By: ljk53

fbshipit-source-id: 8e1905091bf3abaac20d97182eb88f96e905ffc2
2020-09-17 18:33:13 -07:00
9a007ba4cb [jit] stop parsing the block after seeing exit statements (#44870)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44870

fix https://github.com/pytorch/pytorch/issues/44864

Test Plan: buck test mode/dev-nosan //caffe2/test:jit -- 'test_assert_is_script'

Reviewed By: eellison

Differential Revision: D23755094

fbshipit-source-id: ca3f8b27dc6f9dc9364a22a1bce0e2f588ed4308
2020-09-17 18:09:16 -07:00
60ae6c9c18 [FX] Fix GraphModule copy methods not regenerating forward (#44806)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44806

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23738732

Pulled By: jamesr66a

fbshipit-source-id: 14e13551c6568c562f3f789b6274b6c86afefd0b
2020-09-17 17:14:38 -07:00
e14b2080be [reland] move rebuild buckets from end of first iteration to beginning of second iteration (#44798)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44798

[test all]

Update for relanding: in ddp.join(), moved _rebuild_buckets from end of backward to beginning of forward as well.

Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration
ghstack-source-id: 112279261

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D23735185

fbshipit-source-id: c26e0efeecb3511640120faa1122a2c856cd694e
2020-09-17 17:10:21 -07:00
2043fbdfb6 Enable torch.backends.cuda typechecking in CI (#44916)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44916

Reviewed By: walterddr

Differential Revision: D23769844

Pulled By: malfet

fbshipit-source-id: 3be3616fba9e2f9c6d89cc71d5f0d24ffcc45cf2
2020-09-17 15:31:38 -07:00
18b77d7d17 [TensorExpr] Add Mod support to the LLVM backend (#44823)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44823

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.LLVMElemwiseMod_LLVM

Reviewed By: glaringlee

Differential Revision: D23761996

Pulled By: asuhan

fbshipit-source-id: c3c5b2fe0d989dec04f0152ce47c5cae35ed19c9
2020-09-17 15:25:42 -07:00
e535fb3f7d [ONNX] Enable true_divide scripting export with ONNX shape inference (#43991)
Summary:
Fixes the `true_divide` symbolic to cast tensors correctly.
The logic depends on knowing input types at export time, which is a known gap for exporting scripted modules. On that end we are improving the exporter by enabling ONNX shape inference (https://github.com/pytorch/pytorch/issues/40628) and starting to increase coverage for scripting support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43991

Reviewed By: mruberry

Differential Revision: D23674614

Pulled By: bzinodev

fbshipit-source-id: 1b1b85340eef641f664a14c4888781389c886a8b
2020-09-17 14:38:24 -07:00
1c996b7170 Enable typechecking for torch.testing._internal.common_quantized.* (#44805)
Summary:
Addresses a subproblem of [Issue 42969](https://github.com/pytorch/pytorch/issues/42969)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44805

Reviewed By: malfet

Differential Revision: D23742754

Pulled By: janeyx99

fbshipit-source-id: e916a6a0c049cac318549a485d47f19363087d15
2020-09-17 14:24:32 -07:00
f5b92332c1 [TensorExpr] Fix order comparisons for unsigned types (#44857)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44857

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.LLVMCompareSelectByte*_LLVM

Reviewed By: glaringlee

Differential Revision: D23762162

Pulled By: asuhan

fbshipit-source-id: 1553429bd2d5292ccda57910326b8c70e4e6ab88
2020-09-17 14:16:54 -07:00
a153eafab7 Let logspace support bfloat16 on both CPU and CUDA (#44675)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44675

Reviewed By: ngimel

Differential Revision: D23710801

Pulled By: mruberry

fbshipit-source-id: 12d8e56f41bb635b500e89aaaf5df86a1795eb72
2020-09-17 14:13:55 -07:00
40e44c5f0a Make nuclear and frobenius norm non-out depend on out variants (#44095)
Summary:
Part of https://github.com/pytorch/pytorch/issues/24802

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44095

Reviewed By: ngimel

Differential Revision: D23735893

Pulled By: mruberry

fbshipit-source-id: bd1264b6a8e7f9220033982b0118aa962991ca88
2020-09-17 14:11:31 -07:00
086a2e7a4e [caffe2] add cost inference for FusedFakeQuantFC and FusedFakeQuantFCGradient (#44840)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44840

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44762

Move CostInferenceForFCGradient to fc_inference.cc/h to be used in multiple .cc files.

Test Plan: CI

Reviewed By: qizzzh

Differential Revision: D23714877

fbshipit-source-id: d27f33e270a93b0e053f2af592dc4a24e35526cd
2020-09-17 14:07:17 -07:00
4066022146 Do not use PRId64 in torch/csrc (#44767)
Summary:
Instead use `fmt::format()` or `%lld` and cast argument to `(long long)`
Fix typos and add helper `PyErr_SetString()` method in torch/csrc/Exceptions.h

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44767

Reviewed By: ezyang

Differential Revision: D23723671

Pulled By: malfet

fbshipit-source-id: c0101aed222184aa436b1e8768480d1531dff232
2020-09-17 14:00:02 -07:00
5d57025206 [TensorExpr] Add log1p support to the LLVM backend (#44839)
Summary:
Also corrected the Sleef_log1p registrations; the float versions had a redundant `f`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44839

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.LLVMElemwiseLog1pFloat_LLVM

Reviewed By: glaringlee

Differential Revision: D23762113

Pulled By: asuhan

fbshipit-source-id: b5cf003b5c0c1ad549c7f04470352231929ac459
2020-09-17 13:38:35 -07:00
f5440a448a CUDA BFloat16 i0 support (#44750)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44750

Reviewed By: glaringlee

Differential Revision: D23764383

Pulled By: ngimel

fbshipit-source-id: d0e784d89241e8028f97766fdac51fe1ab4c188c
2020-09-17 13:30:10 -07:00
bee97d5be0 Document the default behavior for dist.new_group() when ranks=None (#44000)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44000

This wasn't documented, so add a doc saying all ranks are used when
ranks=None
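
A minimal sketch of the documented default (assumes a process group has already been initialized):

```python
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already run on every rank.
group_all = dist.new_group()              # ranks=None: all ranks participate
group_sub = dist.new_group(ranks=[0, 1])  # explicit subset of ranks
```
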
ghstack-source-id: 111206308

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D23465034

fbshipit-source-id: 4c51f37ffcba3d58ffa5a0adcd5457e0c5676a5d
2020-09-17 11:30:37 -07:00
2558e5769d Implement sort for list of tuples (#43448)
Summary:
* Implement tuple sort by traversing contained IValue types and generate a lambda function as comparator for sort.
* Tuple, class objects can now arbitrarily nest within each other and still be sortable

Fixes https://github.com/pytorch/pytorch/issues/43219

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43448

Reviewed By: eellison

Differential Revision: D23352273

Pulled By: gmagogsfm

fbshipit-source-id: b6efa8d00e112178de8256da3deebdba7d06c0e1
2020-09-17 11:20:56 -07:00
c189328e5d CUDA BFloat16 unary ops part 2 (#44824)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44824

Reviewed By: mruberry

Differential Revision: D23752360

Pulled By: ngimel

fbshipit-source-id: 3aadaf9db9d4e4937aa38671e8589ecbeece709d
2020-09-17 10:57:43 -07:00
c1fa42497b fix legacy GET_BLOCKS code from THCUNN/common.h (#44789)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44472

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44789

Reviewed By: malfet

Differential Revision: D23732762

Pulled By: walterddr

fbshipit-source-id: c3748e365e9a1d009b00140ab0ef892da905d09b
2020-09-17 10:49:53 -07:00
24df3b7373 torch.empty_like and torch.zeros_like raise error if any memory format is provided with sparse input (#43699) (#44058)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43699

- Changed the order of `TORCH_CHECK` and `if (options.layout() == kSparse && self.is_sparse())`
inside the `empty_like` method.

- [x] Added tests

EDIT:

More details on why we cannot take the zeros_like approach:
Python code :
```python
res = torch.zeros_like(input_coalesced, memory_format=torch.preserve_format)
```
is routed to
```c++
// TensorFactories.cpp
Tensor zeros_like(
    const Tensor& self,
    const TensorOptions& options,
    c10::optional<c10::MemoryFormat> optional_memory_format) {
  if (options.layout() == kSparse && self.is_sparse()) {
    auto res = at::empty({0}, options); // to be resized
    res.sparse_resize_and_clear_(
        self.sizes(), self.sparse_dim(), self.dense_dim());
    return res;
  }
  auto result = at::empty_like(self, options, optional_memory_format);
  return result.zero_();
}
```
and reaches the `if (options.layout() == kSparse && self.is_sparse())` branch before any memory format check.

When we call in Python
```python
res = torch.empty_like(input_coalesced, memory_format=torch.preserve_format)
```
it is routed to
```c++
Tensor empty_like(
    const Tensor& self,
    const TensorOptions& options_,
    c10::optional<c10::MemoryFormat> optional_memory_format) {
  TORCH_CHECK(
    !(options_.has_memory_format() && optional_memory_format.has_value()),
    "Cannot set memory_format both in TensorOptions and explicit argument; please delete "
    "the redundant setter.");
  TensorOptions options =
      self.options()
          .merge_in(options_)
          .merge_in(TensorOptions().memory_format(optional_memory_format));
  TORCH_CHECK(
      !(options.layout() != kStrided &&
          optional_memory_format.has_value()),
      "memory format option is only supported by strided tensors");
  if (options.layout() == kSparse && self.is_sparse()) {
    auto result = at::empty({0}, options); // to be resized
    result.sparse_resize_and_clear_(
        self.sizes(), self.sparse_dim(), self.dense_dim());
    return result;
  }
  // ... (rest of empty_like elided)
```

cc pearu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44058

Reviewed By: albanD

Differential Revision: D23672494

Pulled By: mruberry

fbshipit-source-id: af232274dd2b516dd6e875fc986e3090fa285658
2020-09-17 10:25:31 -07:00
1fde54d531 [quant][qat] Ensure fake_quant and observer can be disabled on scriptmodule (#44773)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44773

The model is created and prepared using fx APIs and then scripted for training.
In order to test QAT on scriptmodel we need to be able to disable/enable fake_quant
and observer modules on it.

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_qat_and_script

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23741354

fbshipit-source-id: 3fee7aa9b049d9901313b977710f4dc1c4501532
2020-09-17 10:21:52 -07:00
361b38da19 [quant][fx] Add node name as prefix to observer module name (#44765)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44765

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_save_observer_state_dict

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23741355

fbshipit-source-id: 7185ceae5b3b520ac0beebb627c44eab7ae7d231
2020-09-17 10:17:42 -07:00
74c3dcd1d2 Revert D23725053: [pytorch][PR] change self.generator to generator
Test Plan: revert-hammer

Differential Revision:
D23725053 (a011b86115)

Original commit changeset: 89706313013d

fbshipit-source-id: 035214f0d4298d29a52f8032d364b52dfd956fe8
2020-09-17 09:42:37 -07:00
d2b4534d4d refactor initialize bucket views (#44330)
Summary:
[test all]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44330

Part of relanding PR #41954, this refactor separates initialize_bucket_views and populate_bucket_views_out, as they do different things and are called from different callsites.
ghstack-source-id: 112257271

Test Plan: unit tests

Reviewed By: mrshenli

Differential Revision: D23583347

fbshipit-source-id: a5f2041b2c4f2c2b5faba1af834c7143eaade938
2020-09-17 09:20:23 -07:00
6006e45028 .circleci: Switch to dynamic MAX_JOBS (#44729)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44729

Switches our MAX_JOBS from a hardcoded value to a more dynamic value so
that we can always utilize all of the cores that are available to us

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D23759643

Pulled By: seemethere

fbshipit-source-id: ad26480cb0359c988ae6f994e26a09f601b728e3
2020-09-17 09:16:36 -07:00
f605d7581e Implement better caching allocator for segmentation usecase. (#44618)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44618

This diff refactors the caching allocator to allow overriding its behavior by
making it a virtual class.

Test Plan: https://www.internalfb.com/intern/fblearner/details/218419618?tab=Experiment%20Results

Reviewed By: dreiss

Differential Revision: D23672902

fbshipit-source-id: 976f02922178695fab1c87f453fcb59142c258ec
2020-09-17 08:56:14 -07:00
4affbbd9f8 minor style edits to torch/testing/_internal/common_quantized.py (#44807)
Summary:
style nits

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44807

Reviewed By: malfet

Differential Revision: D23742537

Pulled By: janeyx99

fbshipit-source-id: 446343822d61f8fd9ef6dfcb8e5da4feff6522b6
2020-09-17 08:02:43 -07:00
a40ef25e30 [te] Disable flaky test CudaSharedMemReduce_1 (#44862)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44862

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23753831

Pulled By: bertmaher

fbshipit-source-id: d7d524ac34e4ca208df022a5730c2d11b3068f12
2020-09-17 07:58:16 -07:00
503c74888f Always use NewModuleTest instead of ModuleTest. (#44745)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44745

Much like CriterionTest, NewCriterionTest these are outdated formulations and we should just use the new one.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23717808

Pulled By: gchanan

fbshipit-source-id: eb91982eef23452456044381334bfc9a5bbd837e
2020-09-17 07:36:39 -07:00
28085cbd39 Fixed quantile nan propagation and implemented nanquantile (#44393)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44393

torch.quantile now correctly propagates NaN, and torch.nanquantile has been implemented, similar to numpy.nanquantile.
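
A minimal sketch of the difference:

```python
import torch

t = torch.tensor([1.0, 2.0, float('nan'), 4.0])

torch.quantile(t, 0.5)     # tensor(nan) -- NaN now propagates
torch.nanquantile(t, 0.5)  # tensor(2.)  -- NaN ignored, like numpy.nanquantile
```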

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23649613

Pulled By: heitorschueroff

fbshipit-source-id: 5201d076745ae1237cedc7631c28cf446be99936
2020-09-17 05:53:25 -07:00
99093277c0 Support Python Slice class in TorchScript (#44335)
Summary:
Implements support for the [Python Slice class](https://docs.python.org/3/c-api/slice.html) (not slice expressions, which are already supported).

Slice object can be used in any place that supports slice expression, including multi-dim tensor slicing.
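
A minimal sketch of the new capability:

```python
import torch

@torch.jit.script
def take_every_other(x: torch.Tensor) -> torch.Tensor:
    s = slice(None, None, 2)  # a Python slice object, now scriptable
    return x[s]

take_every_other(torch.arange(6))  # tensor([0, 2, 4])
```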

Fixes https://github.com/pytorch/pytorch/issues/43511
Fixes https://github.com/pytorch/pytorch/issues/43125

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44335

Reviewed By: suo, jamesr66a

Differential Revision: D23682213

Pulled By: gmagogsfm

fbshipit-source-id: f74fe25370e89fbfd2b3727d95ce4e1c4ba8dec4
2020-09-17 00:41:53 -07:00
b6f4bb0a70 Revert D23236088: [pytorch][PR] [caffe2] adds Cancel to SafeDequeueBlobsOp and SafeEnqueueBlobsOp
Test Plan: revert-hammer

Differential Revision:
D23236088 (0ccc38b773)

Original commit changeset: daa90d9ee324

fbshipit-source-id: 933c7deab177250075683a9bea143ac37f16a598
2020-09-16 23:32:50 -07:00
e18a2219dd Implement scatter reductions (CUDA), remove divide/subtract (#41977)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33394 .

This PR does two things:
1. Implement CUDA scatter reductions with revamped GPU atomic operations.
2. Remove support for divide and subtract for CPU reduction, as was discussed with ngimel.

I've also updated the docs to reflect the existence of only multiply and add.
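
A minimal sketch of the two remaining reductions (shown on CPU; the same `reduce` argument is what this PR implements for CUDA tensors):

```python
import torch

index = torch.tensor([0, 0, 1, 1, 2])
src = torch.full((5,), 3.0)
out = torch.zeros(3)

out.scatter_(0, index, src, reduce='add')       # tensor([6., 6., 3.])

out.fill_(2.0)
out.scatter_(0, index, src, reduce='multiply')  # tensor([18., 18., 6.])
```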

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41977

Reviewed By: mruberry

Differential Revision: D23748888

Pulled By: ngimel

fbshipit-source-id: ea643c0da03c9058e433de96db02b503514c4e9c
2020-09-16 23:25:21 -07:00
fdeee74590 [pytorch][vulkan] Fix downcast warnings-errors, aten_vulkan buck target
Summary:
The buck build has -Wall for downcasts, so we need to add safe_downcast<int32_t> everywhere.

BUCK build changes for aten_vulkan to include vulkan_wrapper lib

Test Plan: The next diff with segmentation demo works fine

Reviewed By: dreiss

Differential Revision: D23739445

fbshipit-source-id: b22a30e1493c4174c35075a68586defb0fccd2af
2020-09-16 20:49:34 -07:00
b61d3d8be8 Implement torch.kaiser_window (#44271)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349
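
For reference, a minimal usage sketch (parameter values are illustrative; `beta` follows the NumPy/SciPy Kaiser convention, trading main-lobe width against side-lobe level):

```python
import torch

w = torch.kaiser_window(window_length=10, periodic=True, beta=12.0)
```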

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44271

Reviewed By: ngimel

Differential Revision: D23727972

Pulled By: mruberry

fbshipit-source-id: b4c931b2eb3a536231ad6d6c3cb66e52a13286ac
2020-09-16 20:41:31 -07:00
34331b0e0f CUDA BFloat16 and other improvements on abs (#44804)
Summary:
Not sure if ROCm supports `std::abs` today, let's see the CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44804

Reviewed By: mruberry

Differential Revision: D23748837

Pulled By: ngimel

fbshipit-source-id: ccf4e63279f3e5927a85d8d8f70ba4b8c334156b
2020-09-16 20:37:07 -07:00
ba6534ae2b enable type check common_distributed (#44821)
Summary:
Enabled type checking in common_distributed by using tensors of ints

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44821

Test Plan: Run python test/test_type_hints.py; errors are no longer ignored by mypy.ini

Reviewed By: walterddr

Differential Revision: D23747466

Pulled By: alanadakotashine

fbshipit-source-id: 820fd502d7ff715728470fbef0be90ae7f128dd6
2020-09-16 19:19:36 -07:00
e48201c5cf Mention TF32 on related docs (#44690)
Summary:
cc: ptrblck

![image](https://user-images.githubusercontent.com/1032377/93168022-cbbfcb80-f6d6-11ea-8f6e-f2c8a15c5bea.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44690

Reviewed By: ngimel

Differential Revision: D23727921

Pulled By: mruberry

fbshipit-source-id: db7cc8e74cde09c13d6a57683129fd839863b914
2020-09-16 19:18:30 -07:00
79108fc16c [JIT] Improve Future subtype checking (#44570)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44570

**Summary**
This commit improves subtype checking for futures so that
`Future[T]` is considered to be a subtype of `Future[U]` if `T` is a
subtype of `U`.

**Test Plan**
This commit adds a test case to `test_async.py` that tests this.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23660588

Pulled By: SplitInfinity

fbshipit-source-id: b606137c91379debab91b9f41057f7b1605757c5
2020-09-16 18:54:51 -07:00
29664e6aa3 [FX] Further sanitize generated names (#44808)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44808

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D23739413

Pulled By: jamesr66a

fbshipit-source-id: b759c3ea613dfa717fb23977b72ff4773d9dcc99
2020-09-16 18:47:38 -07:00
204f985fc3 [NNC] Add simplification of Loop + Condition patterns. (#44764)
Summary:
Adds a new optimization to the IRSimplifier which changes this pattern:
```
for ...
  if ...
   do thing;
```
into:
```
if ...
  for ...
    do thing;
```

This should be almost strictly better.

There are many cases where this isn't safe to do (hence the tests), most obviously when the condition depends on something modified within the loop.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44764

Reviewed By: mruberry

Differential Revision: D23734463

Pulled By: nickgg

fbshipit-source-id: 51617e837de96b354fb702d0090ac65ddc523d36
2020-09-16 18:41:58 -07:00
8ec6bc7292 [pytorch][vulkan][jni] LiteModuleLoader load argument to use vulkan device
Summary:
### Java, CPP
Introducing additional parameter `device` to LiteModuleLoader to specify device on which the `forward` will work.

On the java side this is enum that contains CPU and VULKAN, passing as jint to jni side and storing it as a member field on the same level as module.

In pytorch_jni_lite.cpp: all input tensors are converted to Vulkan.

In pytorch_jni_common.cpp (which also goes to OSS): if the result tensor is not on CPU, it is moved to CPU. (At the moment the only non-CPU case is Vulkan.)

### BUCK
Introducing a `pytorch_jni_lite_with_vulkan` target that depends on `pytorch_jni_lite` and adds `aten_vulkan`.

In that case `pytorch_jni_lite_with_vulkan` can be used in place of `pytorch_jni_lite`.

Test Plan:
After the following diff with aidemo segmentation:
```
buck install -r aidemos-android
```
{F296224521}

Reviewed By: dreiss

Differential Revision: D23198335

fbshipit-source-id: 95328924e398901d76718c4d828f96e112dfa1b0
2020-09-16 18:35:22 -07:00
0ccc38b773 [caffe2] adds Cancel to SafeDequeueBlobsOp and SafeEnqueueBlobsOp (#44495)
Summary:
## Motivation

* To be able to make C2 ops cancellable so we can safely exit.
* Some C2 operators are blocking and thus non-cancellable. If an error
  occurs we need to be able to safely stop all net execution so we can throw
  the exception to the caller.

* When an error occurs in a net, or the net is cancelled, running ops will have
  their `Cancel` method called.

* This diff adds a `Cancel` method to `SafeEnqueueBlobsOp`
  and `SafeDequeueBlobsOp` that calls `queue->close()` to force all the
  blocking ops to return.
* Adds a unit test that verifies error propagation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44495

Test Plan:
## Unit Test added to verify that queue ops propagate errors
```
buck test caffe2/caffe2/python:hypothesis_test
```

Reviewed By: dzhulgakov

Differential Revision: D23236088

Pulled By: dahsh

fbshipit-source-id: daa90d9ee32483fb51195e269a52cf5987bb0a5a
2020-09-16 18:17:34 -07:00
3fa7f515a5 [pytorch][bot] update mobile op deps (#44700)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44700

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23719486

Pulled By: ljk53

fbshipit-source-id: 39219ceeee51861f90b228fdfe2ab59ac8a9704d
2020-09-16 17:20:15 -07:00
6befc09465 Fix misuse of PyObject_IsSubclass (#44769)
Summary:
PyObject_IsSubclass may set the Python live-exception bit if the given object is not a class. `IsNamedTuple` is currently using it incorrectly, which may trip all subsequent Python operations in a debug-build Python. A normal release-build Python is not affected because `assert` is a no-op in release builds.

Fixes https://github.com/pytorch/pytorch/issues/43577

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44769

Reviewed By: jamesr66a

Differential Revision: D23725584

Pulled By: gmagogsfm

fbshipit-source-id: 2dabd4f8667a045d5bf75813500876c6fd81542b
2020-09-16 16:19:01 -07:00
43fe034514 [JIT] Disallow plain Optional type annotation without arg (#44586)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44586

**Summary**
This commit disallows plain `Optional` type annotations without
any contained types both in type comments and in-line as
Python3-style type annotations.

**Test Plan**
This commit adds a unit test for these two situations.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23721517

Pulled By: SplitInfinity

fbshipit-source-id: ead411e94aa0ccce227af74eb0341e2a5331370a
2020-09-16 16:07:26 -07:00
574f9af160 [NCCL] Add option to run NCCL on high priority cuda stream (#43796)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43796

This diff adds an option for the process group NCCL backend to pick high priority cuda streams.

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D23404286

fbshipit-source-id: b79ae097b7cd945a26e8ba1dd13ad3147ac790eb
2020-09-16 16:00:41 -07:00
161490d441 Move torch/version.py generation to cmake (#44577)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44577

I would like to to move this to cmake so that I can depend on it
happening from other parts of the build.

This PR pulls out the logic for determining the version string and
writing the version file into its own module. `setup.py` still receives
the version string and uses it as before, but now the code for writing
out `torch/version.py` lives in a custom command in torch/CMakeLists.txt

I noticed a small inconsistency in how version info is populated.
`TORCH_BUILD_VERSION` is populated from `setup.py` at configuration
time, while `torch/version.py` is written at build time. So if, e.g., you
configured cmake on a certain git rev, then built on another, the
two versions would be inconsistent.

This does not appear to matter, so I opted to preserve the existing
behavior.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23734781

Pulled By: suo

fbshipit-source-id: 4002c9ec8058503dc0550f8eece2256bc98c03a4
2020-09-16 15:49:22 -07:00
ffe127e4f1 [JIT] Disallow plain Tuple type annotation without arg (#44585)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44585

**Summary**
This commit disallows plain `Tuple` type annotations without any
contained types both in type comments and in-line as Python3-style
type annotations.

**Test Plan**
This commit adds a unit test for these two situations.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23721515

Pulled By: SplitInfinity

fbshipit-source-id: e11c77a4fac0b81cd535c37a31b9f4129c276592
2020-09-16 15:49:19 -07:00
09a84071a3 enable mypy check for jit_metaprogramming_utils (#44752)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42969
Enables the mypy check for jit_metaprogramming_utils.py and fixes all errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44752

Reviewed By: walterddr

Differential Revision: D23741285

Pulled By: qxu-fb

fbshipit-source-id: 21e36ca5d25c8682fb93b806e416b9e1db76f71e
2020-09-16 15:44:37 -07:00
3f5bb2bade [quant] Support clone for per channel affine quantized tensor (#44573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44573

fixes: https://github.com/pytorch/pytorch/issues/33309
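
A minimal sketch of the now-working operation (values are illustrative):

```python
import torch

x = torch.randn(2, 3)
scales = torch.tensor([0.1, 0.2, 0.3], dtype=torch.double)
zero_points = torch.zeros(3, dtype=torch.long)

q = torch.quantize_per_channel(x, scales, zero_points, axis=1, dtype=torch.qint8)
q2 = q.clone()  # previously lost the per-channel quantizer state
assert torch.equal(q2.q_per_channel_scales(), q.q_per_channel_scales())
```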

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D23663828

fbshipit-source-id: 9a021a22b6075b1e94b3f91c0c101fbb9246ec0e
2020-09-16 15:37:44 -07:00
7b3432caff [TensorExpr] Support boolean in simplifier (#44659)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44659

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.ConstantFoldCastToBool

Reviewed By: ngimel

Differential Revision: D23714675

Pulled By: asuhan

fbshipit-source-id: 4c18d972b628d5ad55bad58eddd5f6974e043d9c
2020-09-16 15:30:19 -07:00
ac0d13cc88 Vectorize complex copy. (#44722)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44722

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D23731276

Pulled By: ezyang

fbshipit-source-id: 4902c4b79577ae3c70aca94828006b12914ab7f9
2020-09-16 15:15:12 -07:00
78b806ab4a [JIT] Disallow plain List type annotation without arg (#44584)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44584

**Summary**
This commit extends the work done in #38130 and disallows plain
Python3-style `List` type annotations.

**Test Plan**
This commit extends `TestList.test_no_element_type_annotation` to the
Python3-style type annotation.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23721514

Pulled By: SplitInfinity

fbshipit-source-id: 48957868286f44ab6d5bf5e1bf97f0a4ebf955df
2020-09-16 15:08:04 -07:00
cb3b8a33f1 [JIT] Disallow plain Dict type annotation without arg (#44334)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44334

**Summary**
This commit detects and prohibits the case in which `typing.Dict` is
used as an annotation without type arguments (i.e. `typing.Dict[K, V]`).
At present, `typing.Dict` is always assumed to have two arguments, and
when it is used without them, `typing.Dict.__args__` is nonempty and
contains some `typing.TypeVar` instances, which have no JIT type equivalent.
Consequently, trying to convert `typing.Dict` to a JIT type results in
a `c10::DictType` with `nullptr` for its key and value types, which can cause
a segmentation fault.

This is fixed by returning a `DictType` from
`jit.annotations.try_ann_to_type` only if the key and value types are converted
successfully to a JIT type and returning `None` otherwise.
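
For illustration, a sketch of what is accepted versus what is now rejected:

```python
import torch
from typing import Dict

@torch.jit.script
def good(x: Dict[str, int]) -> int:  # fully specified annotation: fine
    return len(x)

# A plain `Dict` annotation (no key/value arguments) is now rejected with
# a clear error at script time instead of risking a segfault:
#
#   @torch.jit.script
#   def bad(x: Dict) -> int: ...
```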

**Test Plan**
This commit adds a unit test to `TestDict` that tests the plain `Dict`
annotations throw an error.

**Fixes**
This commit closes #43530.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23610766

Pulled By: SplitInfinity

fbshipit-source-id: 036b10eff6e3206e0da3131cfb4997d8189c4fec
2020-09-16 14:38:28 -07:00
5027c161a9 Add TORCH_SELECTIVE_NAME to AMP definitions (#44711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44711

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23711425

Pulled By: ezyang

fbshipit-source-id: d4b0ef77893af80fe9b74791e66825e223ae221d
2020-09-16 14:25:17 -07:00
82ab167cce [NNC] Fix masking for all block and thread dimensions in CudaCodeGen (#44733)
Summary:
Unifies a number of partial solutions to the thread and block dimension extent masking, including the NoThreadIdxWriter and my last fix https://github.com/pytorch/pytorch/issues/44325. The NoThreadIdxWriter is gone in favour of tracking the current loop extents and masking any statements that have a lower rank than the launch parameters in any Block or Thread dimension, which handles both the "no" and "smaller" axis binding cases.

For example it will transform the following:
```
for i in 0..10 // blockIdx.x
  for j in 0..10 // threadIdx.x
    do thing(i, j);
  for k in 0..5 // threadIdx.x
    do other thing(i, k);
```

Into:
```
do thing(blockIdx.x, threadIdx.x);
if (threadIdx.x < 5) {
  do other thing(blockIdx.x, threadIdx.x);
}
```

And handle the case where statements are not bound by any axis, eg.
```
do outer thing;
for i in 0..10 // blockIdx.x
  for j in 0..10 // threadIdx.x
    do thing(i, j);
  do other thing(i);
```

will become:

```
if (blockIdx.x < 1) {
  if (threadIdx.x < 1) {
    do outer thing;
  }
}
syncthreads();
do thing(blockIdx.x, threadIdx.x);
syncthreads();
if (threadIdx.x < 1) {
  do other thing(blockIdx.x);
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44733

Reviewed By: mruberry

Differential Revision: D23736878

Pulled By: nickgg

fbshipit-source-id: 52d08626ae8043d53eb937843466874d479a6768
2020-09-16 14:23:47 -07:00
a3835179a1 [FakeLowP] Addressing FakeLowP OSS issues. (#44819)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44819

[12:39 AM] Cherckez, Tal
please review the following patch.
should address these issues that our validation team found:
A) test_op_nnpi_fp16: hypothesis to trigger max_example*max_example.
B) batchnorm: batchNorm has derived from unit test which doesnt have setting required for hypothesis. hence default value as 100 getting set.

Test Plan:
buck test //caffe2/caffe2/contrib/fakelowp/test/...
https://our.intern.facebook.com/intern/testinfra/testrun/5910974543950859

Reviewed By: hyuen

Differential Revision: D23740970

fbshipit-source-id: 16fcc49f7bf84a5d7342786f671cd0b4e0fc87d3
2020-09-16 13:56:11 -07:00
07d9cc80a4 Fix error code checks for triangular_solve (CPU) (#44720)
Summary:
Added missing error checks for the CPU version of `triangular_solve`.
Fixes https://github.com/pytorch/pytorch/issues/43141.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44720

Reviewed By: mruberry

Differential Revision: D23733400

Pulled By: ngimel

fbshipit-source-id: 9837e01b04a6bfd9181e08d46bf96329f292cae0
2020-09-16 13:54:45 -07:00
f3bd984e44 Move the description comment of compute_bucket_assignment_by_size from cpp to the header file. (#44703)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44703

The description of this public function should be in the header file.

Also fix some typos.

Test Plan: N/A.

Reviewed By: pritamdamania87

Differential Revision: D23703661

fbshipit-source-id: 24ae63de9498e321b31dfb2efadb44183c6370df
2020-09-16 13:44:14 -07:00
20ac736200 Remove py2 compatible future imports (#44735)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44735

Reviewed By: mruberry

Differential Revision: D23731306

Pulled By: ezyang

fbshipit-source-id: 0ba009a99e475ddbe22981be8ac636f8a1c8b02f
2020-09-16 12:55:57 -07:00
6debe825be [vulkan] glsl shaders relaxed precision mode to cmake option (#43076)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43076

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D23143354

Pulled By: IvanKobzarev

fbshipit-source-id: 7b3ead1e63cf8acf6e8e547080a8ead7a2db994b
2020-09-16 12:51:34 -07:00
e9c6449b46 [FX][EZ] Allow constructing GraphModule with dict for root (#44679)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44679

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23696766

Pulled By: jamesr66a

fbshipit-source-id: fe18b7b579c1728d00589bd5fd5e54c917cc61fe
2020-09-16 12:43:23 -07:00
1718b16d15 [Caffe2] gcs_cuda_only is trivial if CUDA not available (#44578)
Summary:
Make `gcs_cuda_only` and `gcs_gpu_only` return empty device lists if CUDA/GPU (CUDA or ROCm) is not available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44578

Reviewed By: walterddr

Differential Revision: D23664227

Pulled By: malfet

fbshipit-source-id: 176b5d964c0b02b8379777cd9a38698c11818690
2020-09-16 12:24:08 -07:00
c44e4878ae Enable torch.backends.quantized typechecks (#44794)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44794

Reviewed By: walterddr

Differential Revision: D23734353

Pulled By: malfet

fbshipit-source-id: 491bd7c8f147759715eb296d7537a172685aa066
2020-09-16 12:21:20 -07:00
1cd5ba49c6 Add batching rule for "is_complex", "conj" (#44649)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44649

To unblock #43208, which adds "is_complex" checks to backward formulas
that are being tested for batched gradient support with vmap.
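
A minimal sketch of what the new batching rules unblock, assuming the prototype `torch.vmap` entry point of the time:

```python
import torch

x = torch.randn(3, 4, dtype=torch.complex64)

# conj now has a batching rule, so it composes with vmap.
y = torch.vmap(torch.conj)(x)
assert torch.equal(y, torch.conj(x))
```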

Test Plan: - `pytest test/test_vmap.py -v`

Reviewed By: anjali411

Differential Revision: D23685356

Pulled By: zou3519

fbshipit-source-id: 29e41a9296336f6d1008e3040cade4c643bf5ebf
2020-09-16 12:19:46 -07:00
cce7680a23 Add bound method tests for async_execution with RRef helper (#44716)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44716

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D23707326

Pulled By: mrshenli

fbshipit-source-id: a2f8db17447e9f82c9f6ed941ff1f8cb9090ad74
2020-09-16 12:01:07 -07:00
257c6d0fde Make async_execution compatible with RRef helpers (#44666)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44666

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D23691989

Pulled By: mrshenli

fbshipit-source-id: b36f4b1c9d7782797a0220434a8272610a23e83e
2020-09-16 12:01:05 -07:00
924717bf51 Add _get_type() API to RRef (#44663)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44663

The new API returns the type of the data object referenced by this
`RRef`. On the owner, this is same as `type(rref.local_value())`.
On a user, this will trigger an RPC to fetch the `type` object from
the owner. After this function is run once, the `type` object is
cached by the `RRef`, and subsequent invocations no longer trigger
RPC.
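
A minimal sketch, assuming RPC has been initialized and a peer named "worker1" exists:

```python
import torch
import torch.distributed.rpc as rpc

rref = rpc.remote("worker1", torch.add, args=(torch.ones(2), 1))
print(rref._get_type())  # <class 'torch.Tensor'>; cached after the first call
```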

closes #33210

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D23691990

Pulled By: mrshenli

fbshipit-source-id: a2d87cd601a691dd75164b6bcd7315245e9cf6bd
2020-09-16 11:59:22 -07:00
6954ae1278 Vec256 Test cases (#42685)
Summary:
[Tests for Vec256 classes https://github.com/pytorch/pytorch/issues/15676](https://github.com/pytorch/pytorch/issues/15676)

Testing
Current list:

- [x] Blends
- [x] Memory: UnAlignedLoadStore
- [x] Arithmetic: Plus, Minus, Multiplication, Division
- [x] Bitwise: BitAnd, BitOr, BitXor
- [x] Comparison: Equal, NotEqual, Greater, Less, GreaterEqual, LessEqual
- [x] MinMax: Minimum, Maximum, ClampMin, ClampMax, Clamp
- [x] SignManipulation: Absolute, Negate
- [x] Interleave: Interleave, DeInterleave
- [x] Rounding: Round, Ceil, Floor, Trunc
- [x] Mask: ZeroMask
- [x] SqrtAndReciprocal: Sqrt, RSqrt, Reciprocal
- [x] Trigonometric: Sin, Cos, Tan
- [x] Hyperbolic: Tanh, Sinh, Cosh
- [x] InverseTrigonometric: Asin, ACos, ATan, ATan2
- [x] Logarithm: Log, Log2, Log10, Log1p
- [x] Exponents: Exp, Expm1
- [x] ErrorFunctions: Erf, Erfc, Erfinv
- [x] Pow: Pow
- [x] LGamma: LGamma
- [x] Quantization: quantize, dequantize, requantize_from_int
- [x] Quantization: widening_subtract, relu, relu6
Missing:
- [ ] Constructors, initializations
- [ ] Conversion , Cast
- [ ] Additional: imag, conj, angle (note: imag and conj only checked for float complex)

#### Notes on tests and testing framework
- some math functions are tested within a domain range
- mostly, the testing framework randomly tests against the std implementation within the domain, or within the implementation domain for some math functions.
- some functions are tested against the local version. ~~For example, std::round and the vector version of round differ, so it was tested against the local version~~
- round was tested against PyTorch's at::native::round_impl. ~~For double type on **VSX, vec_round failed for (even)+0.5 values**~~; it was solved by using vec_rint
- ~~**complex types are not tested**~~ **After enabling complex testing, due to precision and domain issues some of the complex functions failed for VSX and x86 AVX as well. I will either test them against the local implementation or check within the accepted domain**
- ~~quantizations are not tested~~ Added tests for the quantize, dequantize, requantize_from_int, relu, relu6, and widening_subtract functions
- the testing framework should be improved further
- ~~For now `-DBUILD_MOBILE_TEST=ON` will be used for Vec256Test too~~
Vec256 Test cases will be built for each CPU_CAPABILITY

Fixes: https://github.com/pytorch/pytorch/issues/15676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42685

Reviewed By: malfet

Differential Revision: D23034406

Pulled By: glaringlee

fbshipit-source-id: d1bf03acdfa271c88744c5d0235eeb8b77288ef8
2020-09-16 11:48:02 -07:00
e6101f5507 fixes lda condition for blas functions, fixes bug with beta=0 in addmv slow path (#44681)
Summary:
Per the title: if `beta=0` and the slow path was taken, `nan` and `inf` in the result were not masked, as they are for other linear algebra functions. Similarly, since `mv` is implemented as `addmv` with `beta=0`, wrong results were sometimes produced on the `mv` slow path.
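
A minimal sketch of the fixed behavior:

```python
import torch

inp = torch.tensor([float('nan'), float('inf')])
mat = torch.randn(2, 3)
vec = torch.randn(3)

# With beta=0 the `inp` term must be ignored entirely, so its nan/inf
# values must not leak into the result (the slow-path bug fixed here).
out = torch.addmv(inp, mat, vec, beta=0)
assert torch.isfinite(out).all()

# torch.mv is implemented as addmv with beta=0, so it was affected too.
```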

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44681

Reviewed By: mruberry

Differential Revision: D23708653

Pulled By: ngimel

fbshipit-source-id: e2d5d3e6f69b194eb29b327e1c6f70035f3b231c
2020-09-16 11:47:56 -07:00
570102ce85 Remove many unused THC pointwise math operators (#44230)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44230

Reviewed By: albanD

Differential Revision: D23701185

Pulled By: ngimel

fbshipit-source-id: caf7b7a815b37d50232448d6965e591508546bd7
2020-09-16 11:47:51 -07:00
07d07e3c6c Remove EXPERIMENTAL_ENUM_SUPPORT feature guard (#44243)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44243

Reviewed By: ZolotukhinM

Differential Revision: D23605979

Pulled By: gmagogsfm

fbshipit-source-id: 098ae69049c4664ad5d1521c45b8a7dd22e72f6c
2020-09-16 11:45:59 -07:00
3e6bb5233f Reference amp tutorial (recipe) from core amp docs (#44725)
Summary:
https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html is live.  Core amp docs should reference it.

Also, I fixed some typos in the `zero_grad` docs that we ignored when git was behaving weirdly during ngimel's merge of https://github.com/pytorch/pytorch/pull/44423.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44725

Reviewed By: mruberry

Differential Revision: D23723807

Pulled By: ngimel

fbshipit-source-id: ca0b76365f8ca908bd978e3b38bf81857fa6c2a3
2020-09-16 11:37:58 -07:00
a011b86115 change self.generator to generator (#44461)
Summary:
bug fix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44461

Reviewed By: mruberry

Differential Revision: D23725053

Pulled By: ngimel

fbshipit-source-id: 89706313013d9eae96aaaf144924867457efd2c0
2020-09-16 11:32:17 -07:00
ee493e1a91 CUDA bfloat compare ops (#44748)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44748

Reviewed By: mruberry

Differential Revision: D23725997

Pulled By: ngimel

fbshipit-source-id: 4f89dce3a8b8f1295ced522011b59e60d756e749
2020-09-16 11:32:14 -07:00
eb75cfb9c0 Back out "Revert D23323486: DPP Async Tracing" plus windows build fix. (#44702)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44702

Original commit changeset: c6bd6d277aca

This diff caused the Windows build to fail due to a compiler bug in VS2019 (lambda capture of a constant int value). This back-out works around the issue with an explicit capture of the const int value.

Test Plan: Tested and previously landed.

Reviewed By: mruberry

Differential Revision: D23703215

fbshipit-source-id: f9ef23be97540bc9cf78a855295fb8c69f360459
2020-09-16 11:32:11 -07:00
ced8727d88 Fix a broken link in CONTRIBUTING.md (#44701)
Summary:
as the title says :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44701

Reviewed By: ngimel

Differential Revision: D23724919

Pulled By: mrshenli

fbshipit-source-id: 5ca5ea974ee6a94ed132dbe7892a9b4b9c3dd9be
2020-09-16 11:30:05 -07:00
5e717f0d5e delete the space for the docs rendering (#44740)
Summary:
see the docs rendering of `jacobian` and `hessian` at https://pytorch.org/docs/stable/autograd.html

![image](https://user-images.githubusercontent.com/20907377/93268949-f0618500-f762-11ea-9ec6-ddd062540c59.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44740

Reviewed By: ngimel

Differential Revision: D23724899

Pulled By: mrshenli

fbshipit-source-id: f7558ff53989e5dc7e678706207be2ac7ce22c66
2020-09-16 11:13:45 -07:00
a5cc151b8c Build EigenBlas as static library (#44747)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44747

Reviewed By: ezyang

Differential Revision: D23717927

Pulled By: malfet

fbshipit-source-id: c46fbcf5a55895cb984dd4c5301fbcb784fc17d5
2020-09-16 10:25:26 -07:00
b63b684394 Consolidate CODEOWNERS file for distributed package. (#44763)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44763

The file had separate rules for RPC and DDP/c10d, consolidated all of
it together and placed all the distributed rules together.
ghstack-source-id: 112140871

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D23721162

fbshipit-source-id: d41c757eb1615376d442bd6b2802909624bd1d3f
2020-09-16 10:19:25 -07:00
dbf17a1d4c Fixing a few links in distributed CONTRIBUTING.md (#44753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44753

ghstack-source-id: 112132781

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D23719077

fbshipit-source-id: 3d943dfde100d175f417554fc7fca1fdb295129f
2020-09-16 10:14:19 -07:00
06036f76b6 CUDA BFloat16 pow (#44760)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44760

Reviewed By: ngimel

Differential Revision: D23727936

Pulled By: mruberry

fbshipit-source-id: 8aa89e989294347d7f593b1a63ce4a1dbfdf783e
2020-09-16 10:01:21 -07:00
63469da3bb Add a test to ensure DDP join works with RPC (#44439)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44439

Adds a test to ddp_under_dist_autograd_test to ensure that the uneven-inputs
join() API works properly when DDP + RPC are combined. We test that when
running in outside-DDP mode (DDP applied to the whole hybrid module) we can
correctly process uneven inputs across different trainers.
ghstack-source-id: 112156980

Test Plan: CI

Reviewed By: albanD

Differential Revision: D23612409

fbshipit-source-id: f1e328c096822042daaba263aa8747a9c7e89de7
2020-09-16 09:51:43 -07:00
3f512b0de2 [quant][qat] Ensure observers and fq modules are scriptable (#44749)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44749

Ensure fx module is scriptable after calling prepare_qat on it

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_qat_and_script

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23718380

fbshipit-source-id: abf63ffb21e707f7def8f6c88246877f5aded58c
2020-09-16 09:30:07 -07:00
b85568a54a [CI] Add profiling-te benchmarks. (#44756)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44756

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23719728

Pulled By: ZolotukhinM

fbshipit-source-id: 739940e02a6697fbed2a43a13682a6e5268f710b
2020-09-15 21:33:03 -07:00
d66520ba08 [TensorExpr] Fuser: try merging adjacent fusion groups. (#43671)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43671

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23360796

Pulled By: ZolotukhinM

fbshipit-source-id: 60ec318fe77ae9f2c821d9c4d106281845266e0f
2020-09-15 21:31:02 -07:00
2efc618f19 lr_schedule.py redundant code (#44613)
Summary:
The subclass sets "self.last_epoch" when this is set in the parent class's init function. Why would we need to set last_epoch twice? I think calling "super" resets last_epoch anyway, so I am not sure why we would want to include this in the subclass. Am I missing something?

For the record, I am just a Pytorch enthusiast. I hope my question isn't totally silly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44613

Reviewed By: albanD

Differential Revision: D23691770

Pulled By: mrshenli

fbshipit-source-id: 080d9acda86e1a2bfaafe2c6fcb8fc1544f8cf8a
2020-09-15 20:28:39 -07:00
2c1b215b48 [fx] remove delegate, replace with tracer (#44566)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44566

The Delegate objects were confusing. They were supposed to be a way to
configure how tracing works, but in some cases they appeared necessary
for constructing graphs, which was not true. This makes the organization
clearer by removing Delegate and moving its functionality into a Tracer class,
similar to how pickle has a Pickler class.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23683177

Pulled By: zdevito

fbshipit-source-id: 7605a34e65dfac9a487c0bada39a23ca1327ab00
2020-09-15 16:52:22 -07:00
993b4651fd Convert num_kernels to int64 before calling into CUDA GET_BLOCKS (#44688)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44688

this fixes https://github.com/pytorch/pytorch/issues/44472

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D23699819

Pulled By: soulitzer

fbshipit-source-id: 7ecfe78d09344178d1e6c7e1503417feb6beff6c
2020-09-15 15:10:55 -07:00
fb085d90e3 Revert D23583017: move rebuild buckets from end of first iteration to beginning of second iteration
Test Plan: revert-hammer

Differential Revision:
D23583017 (f5d231d593)

Original commit changeset: ef67f79437a8

fbshipit-source-id: fd914b7565aba6a5574a32b31403525abb80ff07
2020-09-15 15:10:52 -07:00
26a91a9f04 [WIP][JIT] Add benchmarking support of NV Fuser with FP16 dtype support (#44101)
Summary:
Modified files in `benchmarks/tensorexpr` to add support for NVIDIA's Fuser for the jit compiler.

This support has some modifications besides adding an option to support the NVIDIA fuser:

* Adds FP16 Datatype support
* Fixes SOL/Algo calculations to generally use the data type instead of being fixed to 4 bytes
* Adds IR printing and kernel printing knobs
* Adds a knob `input_iter` to create ranges of inputs currently only for reductions
* Adds further reduction support for Inner and Outer dimension reductions that are compatible with the `input_iter` knob.
* Added `simple_element`, `reduce2d_inner`, and `reduce2d_outer` to isolate performance on elementwise  and reduction operations in the most minimal fashion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44101

Reviewed By: ngimel

Differential Revision: D23713658

Pulled By: bertmaher

fbshipit-source-id: d6b83cfab559aefe107c23b3c0f2df9923b3adc1
2020-09-15 15:10:49 -07:00
2f4c31ce3a [jit] Speed up saving in case of many classes (#44589)
Summary:
There's an annoying O(N^2) in the module export logic that makes saving some models (those with many classes) take an eternity.

I'm not super familiar with this code, so rather than properly untangling the deps to make it a pure hash lookup, I just added a side lookup table for raw pointers. It's still quadratic, but it's O(num_classes^2) instead of O(num_classes * num_references), which already gives huge savings.
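
A hypothetical Python sketch of the side-table idea (the real change is in the C++ module export code; the names here are made up):

```python
class_positions = {}   # id(cls) -> index in serialization order
ordered_classes = []

def class_index(cls):
    # O(1) lookup keyed by raw pointer identity, instead of an O(N) scan
    # over everything serialized so far.
    key = id(cls)
    if key not in class_positions:
        class_positions[key] = len(ordered_classes)
        ordered_classes.append(cls)
    return class_positions[key]
```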

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44589

Test Plan:
Tested with one of the offending models - just loading and saving a TorchScript file:

```
Before:
load 1.9239683151245117
save 165.74712467193604

After:
load 1.9409027099609375
save 1.4711427688598633
```

Reviewed By: suo

Differential Revision: D23675278

Pulled By: dzhulgakov

fbshipit-source-id: 8f3fa7730941085ea20d9255b49a149ac1bf64fe
2020-09-15 15:10:45 -07:00
285ba0d068 Enable fp16 for UniformFill (#44540)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44540

Support output type to be fp16 for UniformFill

Reviewed By: jianyuh

Differential Revision: D23558030

fbshipit-source-id: 53a5b2c92cfe78cd11f55e6ee498e1bd682fe4a1
2020-09-15 15:09:18 -07:00
69839ea3f6 [NNC] make inlining immediate (take 3) (#44231)
Summary:
This is a reup of https://github.com/pytorch/pytorch/issues/43885 with an extra commit which should fix the bugs that caused it to be reverted. Read that for general context.

The issue here was that we were still using the side maps `tensor_to_stmt_` and `stmt_to_tensor_`, which get invalidated by any transform of the IR (not just by transforms other than computeInline). I added a comment about this but didn't actually address our usages of it.

I've removed these maps and changed the `getLoopBodyFor` and `getLoopStatementsFor` helpers to search the root stmt directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44231

Reviewed By: albanD

Differential Revision: D23689688

Pulled By: nickgg

fbshipit-source-id: 1c6009a880f8c0cebf2300fd06b5cc9322bffbf9
2020-09-15 11:12:24 -07:00
8df0400a50 Fix fallback graph in specialize autogradzero (#44654)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44654

Previously we weren't creating a fallback graph as intended in specialize autograd zero, so if a Tensor failed one of our undefinedness checks we would run the backward normally without reprofiling & optimizing.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23691764

Pulled By: eellison

fbshipit-source-id: 10c6fa79518c84a6f5ef2bfbd9ea10843af751eb
2020-09-15 11:12:20 -07:00
4ce6af35c4 Enable fp16 for CUDA SparseLengthsSum/Mean (#44089)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44089

Add support of fp16 as input type in SparseLengthSum/Mean caffe2 operator

Reviewed By: xianjiec

Differential Revision: D23436877

fbshipit-source-id: 02fbef2fde17d4b0abea9ca5d17a36aa989f98a0
2020-09-15 11:10:54 -07:00
07cba8b1fc Run vmap tests in CI (#44656)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44656

All this time, test_vmap wasn't running in the CI. Fortunately all the
tests pass locally for me. h/t to anjali411 for pointing this out.

Test Plan: - Wait for CI

Reviewed By: anjali411

Differential Revision: D23689355

Pulled By: zou3519

fbshipit-source-id: 543c3e6aed0af77bfd6ea7a7549337f8230e3d32
2020-09-15 10:59:00 -07:00
d62994a94d ci: Add anaconda pruning to CI pipeline (#44651)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44651

Adds pruning for our anaconda channels (pytorch-nightly, pytorch-test)
into our CI pipeline so that it gets run on a more consistent basis.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D23692851

Pulled By: seemethere

fbshipit-source-id: fa69b506b73805bf2ffbde75d221aef1ee3f753e
2020-09-15 10:51:05 -07:00
1d733d660d [docs] torch.min/max: remove incorrect warning from docs (#44615)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44195

cc: mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44615

Reviewed By: ngimel

Differential Revision: D23703525

Pulled By: mruberry

fbshipit-source-id: 471ebd764be667e29c03a30f3ef341440adc54d2
2020-09-15 10:42:08 -07:00
6bc77f4d35 Use amax/maximum instead of max in optimizers (#43797)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43797

Reviewed By: malfet

Differential Revision: D23406641

Pulled By: mruberry

fbshipit-source-id: 0cd075124aa6533b21375fe2c90c44a5d05ad6e6
2020-09-15 10:39:40 -07:00
9c364da9b9 Fix doc builds for bool kwargs (#44686)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43669

The bool will still link to https://docs.python.org/3/library/functions.html#bool.
Tested using bmm:
![image](https://user-images.githubusercontent.com/16063114/93156438-2ad11080-f6d6-11ea-9b81-96e02ee68d90.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44686

Reviewed By: ngimel

Differential Revision: D23703823

Pulled By: mruberry

fbshipit-source-id: 7286afad084f5ab24a1254ad84e5d01907781c85
2020-09-15 10:34:58 -07:00
f5d231d593 move rebuild buckets from end of first iteration to beginning of second iteration (#44326)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44326

Part of relanding PR #41954, this refactoring moves the rebuild_buckets call from the end of the first iteration to the beginning of the second iteration
ghstack-source-id: 112011490

Test Plan: unit tests

Reviewed By: mrshenli

Differential Revision: D23583017

fbshipit-source-id: ef67f79437a820d9b5699b651803622418499a83
2020-09-15 09:51:33 -07:00
5f692a67db qat conv_fused.py: one more patch for forward compatibility (#44671)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44671

See comments inline - the FC between
https://github.com/pytorch/pytorch/pull/38478 and
https://github.com/pytorch/pytorch/pull/38820 was broken,
patching it.

Test Plan: Verified with customer hitting the issue that this fixes their issue.

Reviewed By: jerryzh168

Differential Revision: D23694029

fbshipit-source-id: a5e1733334e22305a111df750b190776889705d0
2020-09-15 09:43:29 -07:00
72b5665c4f Upgrade oneDNN (mkl-dnn) to v1.6 (#44706)
Summary:
- Bump oneDNN (mkl-dnn) to 1.6 for bug fixes
    - Fixes https://github.com/pytorch/pytorch/issues/42446. RuntimeError: label is redefined for convolutions with large filter size on Intel AVX512
    - Implemented workaround for internal compiler error when building oneDNN with Microsoft Visual Studio 2019 (https://github.com/pytorch/pytorch/pull/43169)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44706

Reviewed By: ngimel

Differential Revision: D23705967

Pulled By: albanD

fbshipit-source-id: 65e8fecc52a76c9f3324403a8b60ffa8a8948bc6
2020-09-15 09:30:01 -07:00
7036e91abd Revert D23323486: DPP Async Tracing
Test Plan: revert-hammer

Differential Revision:
D23323486 (71673b31f9)

Original commit changeset: 4b6ca6c0e320

fbshipit-source-id: c6bd6d277aca070bef2de3522c2a60e23b4395ad
2020-09-15 01:19:23 -07:00
2435d941b1 Fix FP16 fastAtomicAdd for one case where tensor start address is not 32 bit aligned (#44642)
Summary:
For https://github.com/pytorch/pytorch/issues/44206 and https://github.com/pytorch/pytorch/issues/42218, I'd like to update trilinear interpolate backward and grid_sample backward to use `fastAtomicAdd`.

As a prelude, I spotted a UB risk in `fastAtomicAdd`. I think the existing code incurs a misaligned `__half2` atomicAdd when `index` is odd and `tensor` is not 32-bit aligned (`index % 2 == 1` and `reinterpret_cast<std::uintptr_t>(tensor) % sizeof(__half2) == 1`). In this case we think we're `!low_bit` and go down the `!low_bit` code path, but in fact we are `low_bit`. It appears the original [fastAtomicAdd PR](https://github.com/pytorch/pytorch/pull/21879#discussion_r295040377)'s discussion did not consider that case explicitly.

I wanted to push my tentative fix for discussion ASAP, cc'ing jjsjann123 and mkolod as the original authors of `fastAtomicAdd`. (I'm also curious why we need to `reinterpret_cast<std::uintptr_t>(tensor...` for the address modding, but that's minor.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44642

Reviewed By: mruberry

Differential Revision: D23699820

Pulled By: ngimel

fbshipit-source-id: 0db57150715ebb45e6a1fb36897e46f00d61defd
2020-09-14 22:07:29 -07:00
2fd142a2ef Small clarification to amp gradient penalty example (#44667)
Summary:
requested by https://discuss.pytorch.org/t/what-is-the-correct-way-of-computing-a-grad-penalty-using-amp/95827/3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44667

Reviewed By: mruberry

Differential Revision: D23692768

Pulled By: ngimel

fbshipit-source-id: 83c61b94e79ef9f86abed2cc066f188dce0c8456
2020-09-14 21:56:09 -07:00
aedce773ed Deleted docker images for rocm 3.3 and rocm 3.5 (#44672)
Summary:
jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44672

Reviewed By: malfet

Differential Revision: D23694924

Pulled By: xw285cornell

fbshipit-source-id: 0066dc4b36c366588e1f309c82e7e1dc2ce8eec1
2020-09-14 21:50:41 -07:00
c71ce10cfc add dilation to transposeconv's _output_padding method (#43793)
Summary:
This PR adds dilation to the _ConvTransposeNd._output_padding method and tests it using a bunch of different-sized inputs.

Fixes https://github.com/pytorch/pytorch/issues/14272
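
For illustration, a small example of the path this touches (shapes follow the standard transposed-conv size formula):

```python
import torch

m = torch.nn.ConvTranspose2d(8, 8, kernel_size=3, stride=2, dilation=2)
x = torch.randn(1, 8, 10, 10)

# Default output: (10 - 1)*2 + 2*(3 - 1) + 1 = 23
print(m(x).shape)                        # torch.Size([1, 8, 23, 23])

# Passing output_size goes through _output_padding, which now accounts
# for dilation when validating the requested size.
print(m(x, output_size=(24, 24)).shape)  # torch.Size([1, 8, 24, 24])
```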

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43793

Reviewed By: zou3519

Differential Revision: D23493313

Pulled By: ezyang

fbshipit-source-id: bca605c428cbf3a97d3d24316d8d7fde4bddb307
2020-09-14 21:28:27 -07:00
ed862d3682 Split CUDA_NVCC_FLAGS by space (#44603)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44603

Reviewed By: albanD

Differential Revision: D23692320

Pulled By: ezyang

fbshipit-source-id: 6a63d94ab8b88e7a82f9d65f03523d6ef639c754
2020-09-14 20:25:37 -07:00
2c4b4aa81b Revert D23494065: Refactor CallbackManager as a nested class of RecordFunction.
Test Plan: revert-hammer

Differential Revision:
D23494065 (63105fd5b1)

Original commit changeset: 416d5bf6c942

fbshipit-source-id: 3b1ec928e3db0cc203bb63ec4db3da1584b9b884
2020-09-14 19:43:50 -07:00
e7d782e724 [JIT] Add property support for ScriptModules (#42390)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42390

**Summary**
This commit extends support for properties to include
ScriptModules.
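
A minimal sketch of the kind of module this enables (illustrative names):

```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = 2.0

    @property
    def doubled_scale(self) -> float:
        return 2 * self.scale

    def forward(self, x):
        return x * self.doubled_scale   # the property is compiled with the module

m = torch.jit.script(M())
print(m(torch.ones(3)))   # tensor([4., 4., 4.])
```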

**Test Plan**
This commit adds a unit test that has a ScriptModule with
a user-defined property.

`python test/test_jit_py3.py TestScriptPy3.test_module_properties`

Test Plan: Imported from OSS

Reviewed By: eellison, mannatsingh

Differential Revision: D22880298

Pulled By: SplitInfinity

fbshipit-source-id: 74f6cb80f716084339e2151ca25092b6341a1560
2020-09-14 18:49:21 -07:00
63105fd5b1 Refactor CallbackManager as a nested class of RecordFunction. (#44645)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44645

Moved CallbackManager as a nested class of RecordFunction to allow private access to the call handles and context without exposing them publicly. It still hides the singleton instance of the CallbackManager inside record_function.cpp.

Test Plan: Unit tests.

Reviewed By: ilia-cher

Differential Revision: D23494065

fbshipit-source-id: 416d5bf6c9426e112877fbd233a6f4dff7bef455
2020-09-14 18:44:40 -07:00
71673b31f9 DPP Async Tracing (#44252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44252

Add tracing to the DPP client. Because DPP requests are async, we need to be able to start a trace event in one thread and potentially end it in a different thread. RecordFunction and LibgpumonObserver previously assumed each trace event starts and finishes in the same thread, so they used a thread-local context to track enter and exit callbacks. Async events break this assumption. This change attaches the event context to the RecordFunction object so we do not need to use thread-local context.

Test Plan:
Tested with dpp perf test and able to collect trace.

{F307824044}

Reviewed By: ilia-cher

Differential Revision: D23323486

fbshipit-source-id: 4b6ca6c0e32028fb38a476cd1f44c17a001fc03b
2020-09-14 18:43:14 -07:00
e107ef5ca2 Add type annotations for torch.nn.utils.* (#43080)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43013

Redo of gh-42954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43080

Reviewed By: albanD

Differential Revision: D23681334

Pulled By: malfet

fbshipit-source-id: 20ec78aa3bfecb7acffc12eb89d3ad833024394c
2020-09-14 17:52:37 -07:00
551494b01d [JIT] Fix torch.tensor for empty multidimensional-typed lists (#44652)
Summary:
We were hitting an assert error when an empty `List[List[int]]` was passed in - this fixes that error by not recursing into 0-element tensors.
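
A minimal repro of the fixed case (a sketch; the exact output shape depends on torch.tensor's empty-list handling):

```python
import torch
from typing import List

@torch.jit.script
def make_empty():
    xs: List[List[int]] = []
    return torch.tensor(xs)   # previously tripped an internal assert

print(make_empty())
```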

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44652

Reviewed By: ZolotukhinM

Differential Revision: D23688247

Pulled By: eellison

fbshipit-source-id: d48ea24893044fae96bc39f76c0f1f9726eaf4c7
2020-09-14 17:28:23 -07:00
2254e5d976 Add note comments to enforce nondeterministic alert documentation (#44140)
Summary:
This PR fulfills Ed's request (https://github.com/pytorch/pytorch/pull/41692#discussion_r473122076) for a strategy to keep the functions that have nondeterministic alerts fully documented.

Part of https://github.com/pytorch/pytorch/issues/15359

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44140

Reviewed By: colesbury

Differential Revision: D23644469

Pulled By: ezyang

fbshipit-source-id: 60936ccced13f071c620f7d25ef6dcbca338de7f
2020-09-14 16:48:22 -07:00
a91c2be2a9 Automated submodule update: FBGEMM (#44647)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 1d710393d5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44647

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: dskhudia

Differential Revision: D23684528

fbshipit-source-id: 316ff2e448707a6e5a83248c9b22e58118bc8741
2020-09-14 16:43:59 -07:00
686e281bcf Updates div to perform true division (#42907)
Summary:
This PR:

- updates div to perform true division
- makes torch.true_divide an alias of torch.div

This follows on work in previous PyTorch releases that first deprecated div performing "integer" or "floor" division, then prevented it by throwing a runtime error.
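
After this change, both of the calls below perform true division, even on integer inputs:

```python
import torch

a = torch.tensor([5, 3])
b = torch.tensor([2, 2])

print(torch.div(a, b))          # tensor([2.5000, 1.5000])
print(torch.true_divide(a, b))  # same result: true_divide is now an alias of div
```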

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42907

Reviewed By: ngimel

Differential Revision: D23622114

Pulled By: mruberry

fbshipit-source-id: 414c7e3c1a662a6c3c731ad99cc942507d843927
2020-09-14 15:50:38 -07:00
e594c30bc2 [quant][graphmode][fx] Support fp16 dynamic quantization for linear (#44582)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44582

Test Plan:
test_quantize_fx.py

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23665974

fbshipit-source-id: 19ba6c61a9c77ef570b00614016506e9a2729f7c
2020-09-14 15:43:08 -07:00
43406e218a [ONNX] Update ONNX shape inference (#43929)
Summary:
* Support sequence type (de)serialization, enables onnx shape inference on sequence nodes.
* Fix shape inference with block input/output: e.g. Loop and If nodes.
* Fix bugs in symbolic discovered by coverage of onnx shape inference.
* Improve debuggability: added more jit logs. For simplicity, the default log level, when jit logging is enabled, will not dump IR graphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43929

Reviewed By: albanD

Differential Revision: D23674604

Pulled By: bzinodev

fbshipit-source-id: ab6aacb16d0e3b9a4708845bce27c6d65e567ba7
2020-09-14 15:36:19 -07:00
89aed1a933 [vulkan][op] avg_pool2d (#42675)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42675

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22978765

Pulled By: IvanKobzarev

fbshipit-source-id: 64938d8965aeeb408dd5c40d688eca13fb7ebb8a
2020-09-14 15:07:34 -07:00
8f327cd6c5 [vulkan][op] add.Scalar, mul.Scalar (#42674)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42674

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22978763

Pulled By: IvanKobzarev

fbshipit-source-id: 9fd97d394205e3fa51992ee99d5bfafc33f75efa
2020-09-14 15:03:22 -07:00
f7cfbac89b [ONNX] Update len symbolic (#43824)
Summary:
Update len symbolic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43824

Reviewed By: izdeby

Differential Revision: D23575765

Pulled By: bzinodev

fbshipit-source-id: 0e5c8c8d4a5297f65e2dc43168993350f784c776
2020-09-14 15:00:44 -07:00
da11d932bc [ONNX] Update arange op to support out argument (#43777)
Summary:
Update arange op to support out argument

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43777

Reviewed By: albanD

Differential Revision: D23674583

Pulled By: bzinodev

fbshipit-source-id: 6fb65e048c6b1a551569d4d2a33223522d2a960c
2020-09-14 14:56:17 -07:00
62ebad4ff9 [ONNX] Export new_empty and new_zeros (#43506)
Summary:
Adding symbolic to export new_empty and new_zeros

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43506

Reviewed By: houseroad

Differential Revision: D23674574

Pulled By: bzinodev

fbshipit-source-id: ecfcdbd4845fd3a3c6618a060129fbeee4df5dd7
2020-09-14 14:48:34 -07:00
d0a56cab07 [quant] Fixing the output shape for the linear (#44513)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44513

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23637508

Pulled By: z-a-f

fbshipit-source-id: d19d4c1b234b05e8d9813e864863d937b6c35bf5
2020-09-14 14:31:00 -07:00
742654d1b6 [quant] ConvTranspose1d / ConvTranspose2d (#40371)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40371

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D22158981

Pulled By: z-a-f

fbshipit-source-id: defbf6fbe730a58d5b155dcb2460dd969797215c
2020-09-14 14:25:06 -07:00
84949672bf Fix exception chaining in test/ (#44193)
Summary:
## Motivation
This PR fixes https://github.com/pytorch/pytorch/issues/43770 and is the continuation of https://github.com/pytorch/pytorch/issues/43836.

## Description of the change
This PR fixes exception chaining only in files under `test/` where appropriate.
To fix exception chaining, I used either of the following (a short sketch of both styles follows this list):
1. `raise new_exception from old_exception` where `new_exception` itself seems not descriptive enough to debug or `old_exception` delivers valuable information.
2. `raise new_exception from None` where raising both of `new_exception` and `old_exception` seems a bit noisy and redundant.
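
A sketch of both styles (the module name and messages are made up):

```python
# Style 1: keep the original exception as the cause.
try:
    import some_test_dependency  # hypothetical module
except ImportError as e:
    raise RuntimeError("this test requires some_test_dependency") from e

# Style 2: suppress a noisy/redundant cause.
try:
    value = int("not-a-number")
except ValueError:
    raise KeyError("bad config entry") from None
```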

## List of lines containing `raise` in `except` clause:
I wrote [this simple script](https://gist.github.com/akihironitta/4223c1b32404b36c1b349d70c4c93b4d) using [ast](https://docs.python.org/3.8/library/ast.html#module-ast) to list lines where `raise`ing in `except` clause.

- [x] f8f35fddd4/test/test_cpp_extensions_aot.py (L16)
- [x] f8f35fddd4/test/test_jit.py (L2503)
- [x] f8f35fddd4/test/onnx/model_defs/word_language_model.py (L22)
- [x] f8f35fddd4/test/onnx/verify.py (L73)
- [x] f8f35fddd4/test/onnx/verify.py (L110)
- [x] f8f35fddd4/test/onnx/test_verify.py (L31)
- [x] f8f35fddd4/test/distributed/test_c10d.py (L255)
- [x] f8f35fddd4/test/distributed/test_c10d.py (L2992)
- [x] f8f35fddd4/test/distributed/test_c10d.py (L3025)
- [x] f8f35fddd4/test/distributed/test_c10d.py (L3712)
- [x] f8f35fddd4/test/distributed/test_distributed.py (L3180)
- [x] f8f35fddd4/test/distributed/test_distributed.py (L3198)
- [x] f8f35fddd4/test/distributed/test_data_parallel.py (L752)
- [x] f8f35fddd4/test/distributed/test_data_parallel.py (L776)
- [x] f8f35fddd4/test/test_type_hints.py (L151)
- [x] f8f35fddd4/test/test_jit_fuser.py (L771)
- [x] f8f35fddd4/test/test_jit_fuser.py (L773)
- [x] f8f35fddd4/test/test_dispatch.py (L105)
- [x] f8f35fddd4/test/test_distributions.py (L4738)
- [x] f8f35fddd4/test/test_nn.py (L9824)
- [x] f8f35fddd4/test/test_namedtensor.py (L843)
- [x] f8f35fddd4/test/test_jit_fuser_te.py (L875)
- [x] f8f35fddd4/test/test_jit_fuser_te.py (L877)
- [x] f8f35fddd4/test/test_dataloader.py (L31)
- [x] f8f35fddd4/test/test_dataloader.py (L43)
- [x] f8f35fddd4/test/test_dataloader.py (L365)
- [x] f8f35fddd4/test/test_dataloader.py (L391)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44193

Reviewed By: albanD

Differential Revision: D23681529

Pulled By: malfet

fbshipit-source-id: 7c2256ff17334625081137b35baeb816c1e53e0b
2020-09-14 14:20:16 -07:00
a188dbdf3f Check for index-rank consistency in FunctionInliner (#44561)
Summary:
When caller / callee pairs are inserted into the mapping, verify that
the arity of the buffer access is consistent with its declared rank.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44561

Test Plan: CI, test_tensorexpr --gtest_filter=TensorExprTest.DetectInlineRankMismatch

Reviewed By: albanD

Differential Revision: D23684342

Pulled By: asuhan

fbshipit-source-id: dd3a0cdd4c2492853fa68381468e0ec037136cab
2020-09-14 14:07:22 -07:00
b5dd6e3e61 split torch.testing._internal.* and add type checking for torch.testing._internal.common_cuda (#44575)
Summary:
First step to fix https://github.com/pytorch/pytorch/issues/42969.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44575

Reviewed By: malfet

Differential Revision: D23668740

Pulled By: walterddr

fbshipit-source-id: eeb3650b1780aaa5727b525b4e6182e1bc47a83f
2020-09-14 14:04:02 -07:00
cfba33bde3 Fix the ELU formula in the docs (#43764)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43389.

This PR replaces the old ELU formula in the docs, which yields wrong results for negative alphas, with a new one that fixes the issue and relies on cases notation, making the formula more straightforward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43764

Reviewed By: ailzhang

Differential Revision: D23425532

Pulled By: albanD

fbshipit-source-id: d0931996e5667897d926ba4fc7a8cc66e8a66837
2020-09-14 14:01:56 -07:00
9d4943daaf [quant] conv_transpose1d / conv_transpose2d (#40370)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40370

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D22158979

Pulled By: z-a-f

fbshipit-source-id: f5cb812c9953efa7608f06cf0188de447f73f358
2020-09-14 13:45:28 -07:00
ecac8294a6 enable type checking for torch._classes (#44576)
Summary:
Fix https://github.com/pytorch/pytorch/issues/42980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44576

Reviewed By: malfet

Differential Revision: D23668741

Pulled By: walterddr

fbshipit-source-id: 4201ea3187a40051ebff53d28c8e571ea1a61126
2020-09-14 13:26:46 -07:00
ad7a2eb1c9 Simplify nested Min and Max patterns. (#44142)
Summary:
Improve simplification of nested Min and Max patterns.

Specifically, handles the following pattern simplications:
  * `Max(A, Max(A, Const)) => Max(A, Const)`
  * `Max(Min(A, B), Min(A, C)) => Min(A, Max(B, C))`
  * `Max(Const, Max(A, OtherConst)) => Max(A, Max(Const, OtherConst))`
     - This case can have an arbitrarily long chain of Max ops. For example: `Max(5, Max(x, Max(y, Max(z, 8)))) => Max(Max(Max(x, 8), y), z)`

Similarly, for the case of Min as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44142

Reviewed By: albanD

Differential Revision: D23644486

Pulled By: navahgar

fbshipit-source-id: 42bd241e6c2af820566744c8494e5dee172107f4
2020-09-14 13:24:46 -07:00
199435af90 Update median doc to note return value of even-sized input (#44562)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44562

Add a note that torch.median returns the smaller of the two middle elements for even-sized input, and refer the user to torch.quantile for the mean of the middle values.

fixes https://github.com/pytorch/pytorch/issues/39520
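
For example:

```python
import torch

x = torch.tensor([1., 2., 3., 4.])
print(torch.median(x))         # tensor(2.): the smaller of the two middle values
print(torch.quantile(x, 0.5))  # tensor(2.5000): the mean of the two middle values
```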

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23657208

Pulled By: heitorschueroff

fbshipit-source-id: 2747aa652d1e7f10229d9299b089295aeae092c2
2020-09-14 13:18:33 -07:00
a475613d1d [static runtime] Swap to out-variant compatible nodes (#44127)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44127

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604306

Pulled By: bwasti

fbshipit-source-id: 18ccfb9b466b822e28130be3d5c4fae36c76820b
2020-09-14 12:38:25 -07:00
856510c96d [JIT] Dont optimize shape info in batch_mm (#44565)
Summary:
We run remove-profile-nodes and specialize types before batch_mm, so we cannot run peepholes on the type information of tensors: these properties have not been guarded, so they are not guaranteed to be correct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44565

Reviewed By: albanD

Differential Revision: D23661538

Pulled By: eellison

fbshipit-source-id: 0dd23a65714f047f49b4db4ec582b21870925fe1
2020-09-14 12:34:20 -07:00
e261e0953e Fix centos8 gcc (#44644)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44198 properly this time

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44644

Reviewed By: albanD

Differential Revision: D23684909

Pulled By: malfet

fbshipit-source-id: cea6f6e2ae28138f6b93a6513d1abd36d14ae573
2020-09-14 12:28:09 -07:00
ace81b6794 Remove an extra empty line in the warning comments. (#44622)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44622

Remove an extra empty line in the warning comments.

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D23674070

fbshipit-source-id: 4ee570590c66a72fb808e9ee034fb773b833efcd
2020-09-14 11:15:35 -07:00
21a09ba94d Fix lerp.cu bug when given discontiguous out tensor (#44559)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44559

Please refer to the discussion at the bottom of https://github.com/pytorch/pytorch/pull/43541 about the bug.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23655403

Pulled By: heitorschueroff

fbshipit-source-id: 10e4ce5c2fe7bf6e95bcfac4033202430292b03f
2020-09-14 11:03:02 -07:00
95a69a7d09 adds list_gpu_processes function (#44616)
Summary:
per title, to make it easier to track the creation of stray contexts:
```
python -c "import torch; a=torch.randn(1, device='cuda'); print(torch.cuda.memory.list_gpu_processes(0)); print(torch.cuda.memory.list_gpu_processes(1))"
GPU:0
process      79749 uses      601.000 MB GPU memory
GPU:1
no processes are running
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44616

Reviewed By: mruberry

Differential Revision: D23675739

Pulled By: ngimel

fbshipit-source-id: ffa14cad9d7144e883de13b1c2c6817bd432f53a
2020-09-14 09:54:32 -07:00
105132b891 Move ONNX circle ci build to torch and remove all caffe2 CI job/workflows (#44595)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44595

Reviewed By: seemethere

Differential Revision: D23670280

Pulled By: walterddr

fbshipit-source-id: b32633912f6c8b4606be36b90f901e636567b355
2020-09-14 09:50:13 -07:00
bd257a17a1 Add HIP/ROCm version to collect_env.py (#44106)
Summary:
This adds HIP version info to the `collect_env.py` output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44106

Reviewed By: VitalyFedyunin

Differential Revision: D23652341

Pulled By: zou3519

fbshipit-source-id: a1f5bce8da7ad27a1277a95885934293d0fd43c5
2020-09-14 09:19:18 -07:00
7040a070e3 [torch] Minor: Avoid ostreamstring in Operator's canonicalSchemaString() (#44442)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44442

I noticed lock contention on startup as lookupByLiteral() was
calling registerPendingOperators() - some calls were holding the
lock for 10+ ms, as operators were being registered.

canonicalSchemaString() was using ostringstream, which isn't typically
particularly fast (partly because of C++ spec locale requirements).
If we replace it with regular C++ string appends, it's somewhat faster
(which isn't hard when comparing with stringstream), albeit with a bit
more codegen.

This cuts out about 1.4 seconds spent under the OperatorRegistry lock
(as part of registerPendingOperators) in the first couple of minutes of
run time (mostly front-loaded) when running sync SGD.

As an example, before:
   registerPendingOperators 12688 usec for 2449 operators
After:
   registerPendingOperators 6853 usec for 2449 operators
ghstack-source-id: 111862971

Test Plan: buck test mode/dev-nosan caffe2/test/cpp/...

Reviewed By: ailzhang

Differential Revision: D23614515

fbshipit-source-id: e712f9dac5bca0b1876e11fb8f0850402f03873a
2020-09-14 08:24:16 -07:00
c68a99bd61 [numpy] Add torch.exp2 (#44184)
Summary:
Reference https://github.com/pytorch/pytorch/issues/42515

TODO
* [x] Add tests
* [x] Add docs
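
For reference, the new op computes elementwise 2**x, mirroring numpy.exp2:

```python
import torch

x = torch.tensor([0., 1., 3., 10.])
print(torch.exp2(x))   # tensor([   1.,    2.,    8., 1024.])
```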

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44184

Reviewed By: ngimel

Differential Revision: D23674237

Pulled By: mruberry

fbshipit-source-id: 7f4fb1900fad3051cd7fc9d3d7f6d985c5fb093c
2020-09-14 04:05:37 -07:00
870f647040 Automated submodule update: FBGEMM (#44581)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 0725301da5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44581

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: dskhudia, VitalyFedyunin

Differential Revision: D23665173

fbshipit-source-id: 03cee22335eef0517e561827795bbe2036942ea0
2020-09-13 21:26:56 -07:00
68a5c361ae Adding Adapative Autorange to benchmark utils. (#44607)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44219

Rebasing https://github.com/pytorch/pytorch/pull/44288 and fixing the git history.

This allows users to benchmark code without having to specify how long to run the benchmark. It runs the benchmark until the variance (IQR / median) is low enough that we can be confident in the measurement.
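
A sketch of the resulting API (assuming the `adaptive_autorange` name introduced by this PR):

```python
import torch
from torch.utils.benchmark import Timer

t = Timer(
    stmt="torch.mm(a, b)",
    setup="a = torch.rand(64, 64); b = torch.rand(64, 64)",
)
# Keeps collecting measurements until IQR / median is small enough,
# rather than requiring the caller to pick a run count up front.
print(t.adaptive_autorange())
```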

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44607

Test Plan: There are unit tests, and we manually tested using Examples posted in git.

Reviewed By: robieta

Differential Revision: D23671208

Pulled By: bitfort

fbshipit-source-id: d63184290b88b26fb81c2452e1ae701c7d513d12
2020-09-13 20:55:40 -07:00
8daaa3bc7e Fix latex error in heaviside docs (#44481)
Summary:
This fixes a `katex` error I was getting trying to build the docs:
```
ParseError: KaTeX parse error: Undefined control sequence: \0 at position 55: …gin{cases}
```

This failure was introduced in https://github.com/pytorch/pytorch/issues/42523.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44481

Reviewed By: colesbury

Differential Revision: D23627700

Pulled By: mruberry

fbshipit-source-id: 9cc09c687a7d9349da79a0ac87d6c962c9cfbe2d
2020-09-13 16:42:19 -07:00
fe26102a0e Enable TE in test_jit.py (#44200)
Summary:
Enable TE in test_jit.py and adjust/fix tests accordingly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44200

Reviewed By: SplitInfinity

Differential Revision: D23673624

Pulled By: Krovatkin

fbshipit-source-id: 5999725c7aacc6ee77885eb855a41ddfb4d9a8d8
2020-09-13 15:58:20 -07:00
7862827269 [pytorch] Add variadic run_method for lite intepreter (#44337)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44337

Add a new run_method to mobile Module which is variadic (takes any number of arguments) to match full jit.
ghstack-source-id: 111909068

Test Plan: Added new unit test to test_jit test suite

Reviewed By: linbinyu, ann-ss

Differential Revision: D23585763

fbshipit-source-id: 007cf852290f03615b78c35aa6f7a21287ccff9e
2020-09-13 13:26:30 -07:00
bcf97b8986 [JIT] Cleanup some places where we log graphs in executors. (#44588)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44588

1) SOURCE_DUMP crashes when invoked on a backward graph since
   `prim::GradOf` nodes can't be printed as sources (they don't have
   schema).
2) Dumping graph each time we execute an optimized plan produces lots of
   output in tests where we run the graph multiple times (e.g.
   benchmarks). Outputting that at the lowest verbosity level seems
   like overkill.
3) Duplicated log statement is removed.

Differential Revision: D23666812

Test Plan: Imported from OSS

Reviewed By: bertmaher

Pulled By: ZolotukhinM

fbshipit-source-id: b9a30e34fd39c85f3e13c3f1e3594e157e1c130f
2020-09-13 11:31:02 -07:00
82da6b3702 [JIT] Fix jit-log verbosity selection logic. (#44587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44587

Currently it's skewed by one.

The following test demonstrates it:
```
$ cat test.py

import torch
def foo(a,b):
    return a*a*b
torch._C._jit_set_profiling_executor(True)
torch._C._jit_set_profiling_mode(True)
torch._C._jit_override_can_fuse_on_cpu(True)
torch._C._jit_set_texpr_fuser_enabled(True)
f = torch.jit.script(foo)
for _ in range(10):
    f(torch.rand(10), torch.rand(10))

$ cat test_logging_levels.sh

PYTORCH_JIT_LOG_LEVEL="tensorexpr_fuser"    python test.py 2>&1 | grep DUMP   >& /dev/null && echo OK || echo FAIL
PYTORCH_JIT_LOG_LEVEL="tensorexpr_fuser"    python test.py 2>&1 | grep UPDATE >& /dev/null && echo FAIL || echo OK
PYTORCH_JIT_LOG_LEVEL="tensorexpr_fuser"    python test.py 2>&1 | grep DEBUG  >& /dev/null && echo FAIL || echo OK

PYTORCH_JIT_LOG_LEVEL=">tensorexpr_fuser"   python test.py 2>&1 | grep DUMP   >& /dev/null && echo OK || echo FAIL
PYTORCH_JIT_LOG_LEVEL=">tensorexpr_fuser"   python test.py 2>&1 | grep UPDATE >& /dev/null && echo OK || echo FAIL
PYTORCH_JIT_LOG_LEVEL=">tensorexpr_fuser"   python test.py 2>&1 | grep DEBUG  >& /dev/null && echo FAIL || echo OK

PYTORCH_JIT_LOG_LEVEL=">>tensorexpr_fuser"  python test.py 2>&1 | grep DUMP   >& /dev/null && echo OK || echo FAIL
PYTORCH_JIT_LOG_LEVEL=">>tensorexpr_fuser"  python test.py 2>&1 | grep UPDATE >& /dev/null && echo OK || echo FAIL
PYTORCH_JIT_LOG_LEVEL=">>tensorexpr_fuser"  python test.py 2>&1 | grep DEBUG  >& /dev/null && echo OK || echo FAIL
```

Before this change:
```
OK
FAIL
OK
OK
OK
FAIL
OK
OK
OK
```

With this change everything passes.

Differential Revision: D23666813

Test Plan: Imported from OSS

Reviewed By: bertmaher

Pulled By: ZolotukhinM

fbshipit-source-id: 4adaa5a3d06deadf54eae014a0d76588cdc5e20a
2020-09-13 11:29:25 -07:00
6d4a605ce9 Fix bug simplifying if-then-else when it can be removed (#44462)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44462

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23671157

Pulled By: bertmaher

fbshipit-source-id: b9b92ad0de1a7bd9bc1fcac390b542d885d0ca58
2020-09-13 10:29:28 -07:00
7e91728f68 Deprecates calling linspace and logspace without setting steps explicitly (#43860)
Summary:
**BC-breaking note**

This change is BC-breaking for C++ callers of linspace and logspace if they were providing a steps argument that could not be converted to an optional.

**PR note**

This PR deprecates calling linspace and logspace without setting steps explicitly by:

- updating the documentation to warn that not setting steps is deprecated
- warning (once) when linspace and logspace are called without steps being specified

A test for this behavior is added to test_tensor_creation_ops. The warning only appears once per process, however, so the test would pass even if no warning were thrown. Ideally there would be a mechanism to force all warnings, including those from TORCH_WARN_ONCE, to trigger.
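
For example:

```python
import torch

torch.linspace(0, 1, steps=5)  # explicit steps: no warning
torch.linspace(0, 1)           # deprecated: warns (once) that steps should be set explicitly
```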

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43860

Reviewed By: izdeby

Differential Revision: D23498980

Pulled By: mruberry

fbshipit-source-id: c48d7a58896714d184cb6ff2a48e964243fafc90
2020-09-13 06:09:19 -07:00
e703c17967 Revert D23584071: [dper3] Create dper LearningRate low-level module
Test Plan: revert-hammer

Differential Revision:
D23584071 (a309355be3)

Original commit changeset: f6656531b1ca

fbshipit-source-id: b0a93f4286053fb8576a70278edca3a7d89c722b
2020-09-12 20:45:30 -07:00
a309355be3 [dper3] Create dper LearningRate low-level module
Summary: As title; this will unblock migration of several modules that need learning rate functionality.

Test Plan:
```
buck test //dper3/dper3/modules/low_level_modules/tests:learning_rate_test
```

WIP: need to add more learning rate tests for the different policies

Reviewed By: yf225

Differential Revision: D23584071

fbshipit-source-id: f6656531b1caba38c3e3a7d6e16d9591563391e2
2020-09-12 15:33:29 -07:00
0743d013a6 fuse layernorm + quantize (#44232)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44232

Enhance layernorm to optionally quantize its output.
Add fusion code to replace instances of layernorm + quantization.

Test Plan:
tested layernorm
net_runner

P141557987

Reviewed By: venkatacrc

Differential Revision: D23510893

fbshipit-source-id: 32f57ba2090d35d86dcc951e0f3f6a8901ab3153
2020-09-12 13:32:33 -07:00
6f2c3c39d2 Add SNPE deps for caffe2 benchmark android binary
Summary:
Adding SNPE dependencies to caffe2_benchmark so that it can benchmark SNPE models on portal devices.

Also need to change ndk_libcxx to gnustl until SNPE is updated to work with the NDK.

Test Plan: Tested on top of the stack.

Reviewed By: linbinyu

Differential Revision: D23569397

fbshipit-source-id: a6281832804ed4fbb5a8406f436caeae1ff4fd2b
2020-09-12 12:34:56 -07:00
05c1f1d974 [ROCm] remove thrust workaround in ScanKernels (#44553)
Summary:
Remove ROCm workaround added in https://github.com/pytorch/pytorch/issues/39180.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44553

Reviewed By: mruberry

Differential Revision: D23663988

Pulled By: ngimel

fbshipit-source-id: 71b2fd7db006d9d3459b908a996c4d96838ba742
2020-09-11 21:12:43 -07:00
d191caa3e7 Cleanup workarounds for compiler bug of ROCm (#44579)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44579

Reviewed By: mruberry

Differential Revision: D23664481

Pulled By: ngimel

fbshipit-source-id: ef698f26455e5827c5b5c0e5d42a1c95bcac8af4
2020-09-11 21:10:33 -07:00
8641b55158 fix dangling ptr in embedding_bag (#44571)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44571

Test Plan: Imported from OSS

Reviewed By: malfet, ngimel

Differential Revision: D23661007

Pulled By: glaringlee

fbshipit-source-id: e4a54acd0de55f275828c1d1289a1f069de07291
2020-09-11 20:40:44 -07:00
82b4477948 Pass the input tensor vector by const reference. (#44340)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44340

Changed the constructor of GradBucket to pass the input by const
reference, avoiding unnecessary explicit move semantics. Since
previously the declaration and definition were separated, passing the input
tensor vector by value looked quite bizarre.

Test Plan: buck test caffe2/torch/lib/c10d:ProcessGroupGlooTest

Reviewed By: pritamdamania87

Differential Revision: D23569939

fbshipit-source-id: db761d42e76bf938089a0b38e98e76a05bcf4162
2020-09-11 18:03:56 -07:00
ab5fee2784 Move the inline implementations of GradBucket class to the header. (#44339)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44339

Moved the inline implementations of GradBucket class to the header for
succinctness and readability. This coding style is also consistent with
reducer.h under the same directory.

Test Plan: buck test caffe2/torch/lib/c10d:ProcessGroupGlooTest

Reviewed By: pritamdamania87

Differential Revision: D23569701

fbshipit-source-id: 237d9e2c5f63a6bcac829d0fcb4a5ba3bede75e5
2020-09-11 18:01:37 -07:00
1f0dcf39fc [JIT] dont optimize device dtype on inline (#43363)
Summary:
Follow up to https://github.com/pytorch/pytorch/pull/36404

Adding prim::device and prim::dtype to the list of skipped peepholes when we run inlining. In the long term, another fix may be to not encode shape/dtype info on the traced graph, because it is not guaranteed to be correct. This is blocked by ONNX currently.

Partial fix for https://github.com/pytorch/pytorch/issues/43134

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43363

Reviewed By: glaringlee

Differential Revision: D23383987

Pulled By: eellison

fbshipit-source-id: 2e9c5160d39d690046bd9904be979d58af8d3a20
2020-09-11 17:29:54 -07:00
d729e2965e [TensorExpr] Do not inline autodiff graphs if they contain prim::TypeCheck nodes. (#44564)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44564

Before this change we sometimes inlined autodiff subgraphs containing
fusion groups. This happened because we didn't look for 'unsupported'
nodes recursively (maybe we should), and the fusion groups were inside
if-nodes.

The problem was detected by bertmaher in 'LearningToPaint' benchmark
investigation where this bug caused us to keep constantly hitting
fallback paths of the graph.

Test Plan: Imported from OSS

Reviewed By: bwasti

Differential Revision: D23657049

Pulled By: ZolotukhinM

fbshipit-source-id: 7c853424f6dce4b5c344d6cd9c467ee04a8f167e
2020-09-11 17:28:53 -07:00
64b4307d47 [NNC] Cuda Codegen - mask loops bound to block/thread dimensions (#44325)
Summary:
Fix an issue where loops of different sizes are bound to the same CUDA dimension / metavar.

More info and tests coming soon...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44325

Reviewed By: colesbury

Differential Revision: D23628859

Pulled By: nickgg

fbshipit-source-id: 3621850a4cc38a790b62ad168d32e7a0e2462fad
2020-09-11 16:48:16 -07:00
2ae74c0632 Compile less legacy code when BUILD_CAFFE2 is set to False (take 2) (#44453)
Summary:
2nd attempt to land https://github.com/pytorch/pytorch/pull/44079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44453

Reviewed By: walterddr, seemethere

Differential Revision: D23619528

Pulled By: malfet

fbshipit-source-id: c7c206ebd327dcf3994789bd47008b05ff862fe7
2020-09-11 16:27:47 -07:00
566b8d0650 handle missing NEON vst1_*_x2 intrinsics (#44198) (#44199)
Summary:
CentOS 8 on AArch64 has the vld1_* intrinsics but lacks the vst1q_f32_x2 one.

This patch checks for it and handles it separately from the vld1_* ones.

Fixes https://github.com/pytorch/pytorch/issues/44198

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44199

Reviewed By: seemethere

Differential Revision: D23641273

Pulled By: malfet

fbshipit-source-id: c2053c8e0427705eaeeeb82ec030925bff22623a
2020-09-11 16:02:44 -07:00
db24c5c582 Change code coverage option name (#43999)
Summary:
According to the [documentation](https://github.com/pytorch/pytorch/blob/master/tools/setup_helpers/cmake.py#L265), only options starting with `BUILD_` / `USE_` / `CMAKE_` in `CMakeLists.txt` can be imported from environment variables.

 ---
This diff was originally intended to enable `c++` source coverage with `CircleCI` and `codecov.io`, but we will finish that in the future. You can find the related information in the diff history. The following was the original procedure:

Based on [this pull request](1bda5e480c), life becomes much easier this time.
1. In `build.sh`:
- Enable the coverage build option for c++
- `apt-get install lcov`

2. In `test.sh`:
- run `lcov`

3. In `pytorch-job-specs.yml`:
- copy coverage.info to the `test/` folder and upload it to codecov.io

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43999

Test Plan: Test on github

Reviewed By: malfet

Differential Revision: D23464656

Pulled By: scintiller

fbshipit-source-id: b2365691f04681d25ba5c00293fbcafe8e8e0745
2020-09-11 15:55:05 -07:00
b6f0ea0c71 [quant][graphmode][fx][fix] Remove qconfig in convert (#44526)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44526

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23641960

fbshipit-source-id: 546da1c16694d1e1dfb72629085acaae2165e759
2020-09-11 15:51:47 -07:00
42f9f2f38f [fix] ReduceOps throw error if dim is repeated (#44281)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44273

TODO

* [x] Add test
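
A minimal repro of the new behavior (the exact error message is paraphrased):

```python
import torch

x = torch.ones(2, 3)
try:
    x.sum(dim=(0, 0))   # a repeated dim now raises instead of silently misbehaving
except RuntimeError as e:
    print(e)            # e.g. "dim 0 appears multiple times in the list of dims"
```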

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44281

Reviewed By: zhangguanheng66

Differential Revision: D23569004

Pulled By: ezyang

fbshipit-source-id: 1ca6523fef168c8ce252aeb7ca418be346b297bf
2020-09-11 15:34:06 -07:00
f3a79b881f add lcov to oss for beautiful html report (#44568)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44568

With `lcov`, we can generate beautiful HTML. It's better than the current file report and line report. Therefore, in the OSS gcc flow, remove the `export` code and the `file/line level report` code and only use the HTML report.

But in clang, since such a tool is not available, we will still use the file report and line report that we generate ourselves.

Test Plan:
Test in docker ubuntu machine.
## Measurement
1. After running `atest`, it takes about 15 mins to collect code coverage and generate the report.
```
# gcc code coverage
python oss_coverage.py --run-only=atest
```

## Presentation
**The html result looks like:**

*Top Level:*

{F328330856}

*File Level:*

{F328336709}

Reviewed By: malfet

Differential Revision: D23550784

fbshipit-source-id: 1fff050e7f7d1cc8e86a6a200fd8db04b47f5f3e
2020-09-11 15:29:24 -07:00
c2b40b056a Filter default tests for clang coverage in oss
Summary: Some tests like `test_dataloader.py` are not able to run under `clang` in OSS, because they generate intermediate files that are too large (~40 GB) to be merged by `llvm`. Skip them when the user doesn't specify the `--run-only` option.

Test Plan: Tested locally. Still, running `clang` coverage in default mode is not recommended, because it takes too much space.

Reviewed By: malfet

Differential Revision: D23549829

fbshipit-source-id: 0737e6e9dcbe3f38de00580ee6007906e743e52f
2020-09-11 15:28:15 -07:00
a82ea6a91f [quant][graphmode][fx][fix] Support None qconfig in convert (#44524)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44524

None qconfig is not handled previously
closes: https://github.com/pytorch/pytorch/issues/44438

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23640269

fbshipit-source-id: 8bfa88c8c78d4530338d9d7fa9669876c386d91f
2020-09-11 15:22:25 -07:00
1fb5883072 removing conv filters from conv pattern matching (#44512)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44512

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23637409

Pulled By: z-a-f

fbshipit-source-id: ad5be0fa6accfbcceaae9171bf529772d87b4098
2020-09-11 15:16:29 -07:00
dd4bbe1a79 Add iterator like functionality for DispatchKeySet (#44066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44066

Add STL Input iterator to DispatchKeySet:
* The iterator is able to iterate from the first non-undefined DispatchKey
to NumDispatchKeys.
* The iterator is invalidated once the underlying DispatchKeySet is invalidated.

Note see http://www.cplusplus.com/reference/iterator/ for comparisons of
different iterators.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23611405

Pulled By: linux-jedi

fbshipit-source-id: 131b287d60226a1d67a6ee0f88571f8c4d29f9c3
2020-09-11 15:08:15 -07:00
e2bb34e860 Batched grad support for: slice, select, diagonal (#44505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44505

Added batching rules for slice_backward, select_backward, and
diagonal_backward.

Test Plan: - new tests: `pytest test/test_vmap.y -v -k "BatchedGrad"`

Reviewed By: agolynski, anjali411

Differential Revision: D23650409

Pulled By: zou3519

fbshipit-source-id: e317609d068c88ee7bc07fab88b2b3acb8fad7e1
2020-09-11 14:59:58 -07:00
7632484000 Add some batched gradient tests (#44494)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44494

These tests check (most) operations that are useful for Bayesian logistic
regression (BLR) models. Said operators are basically those found in the
log_prob functions of Distributions objects. This PR is not a general,
structured solution for testing batched gradients (see "Alternative
solution" for that), but I wanted to test a small subset of operations
to confirm that the BLR use case works.

There will be follow-up PRs implementing support for some missing
operations for the BLR use case.

Alternative solution
=====================

Ideally, and in the future, I want to autogenerate tests from
common_method_invocations and delete all of the manual tests
introduced by this PR. However, if we were to do this now,
we would need to store the following additional metadata somewhere:
- operator name, supports_batched_grad, allow_vmap_fallback_usage

We could store that metadata as a separate table from
common_method_invocations, or add two columns to
common_method_invocations. Either way that seems like a lot of work and
the situation will get better once vmap supports batched gradients for
all operators (on the fallback path).

I am neutral between performing the alternative approach now vs. just
manually writing out some tests for these operations, so I picked the
easier approach. Please let me know if you think it would be better to
pursue the alternative approach now.

Test Plan: - `pytest test/test_vmap.py -v -k "BatchedGrad"`

Reviewed By: anjali411

Differential Revision: D23650408

Pulled By: zou3519

fbshipit-source-id: 2f26c7ad4655318a020bdaab5c767cd3956ea5eb
2020-09-11 14:59:54 -07:00
ab6126b50e [rpc][jit] support remote call in TorchScript (#43046)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43046

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23621108

Pulled By: wanchaol

fbshipit-source-id: e8152c6cdd3831f32d72d46ac86ce22f3f13c651
2020-09-11 14:59:51 -07:00
3e5df5f216 [rpc][jit] support rpc_sync in TorchScript (#43043)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43043

This add the support for rpc_sync in TorchScript in a way similar to
rpc_async

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23252039

Pulled By: wanchaol

fbshipit-source-id: 8a05329cb8a24079b2863178b73087d47273914c
2020-09-11 14:59:47 -07:00
8bec7cfa91 [rpc] rename some functions (#43042)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43042

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23228894

Pulled By: wanchaol

fbshipit-source-id: 3702b7826ecb455073fabb9dc5dca804c0e092b2
2020-09-11 14:58:39 -07:00
70dfeb44bd MinMax based observers: respect device affinity for state_dict (#44537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44537

Originally, the `min_val`, `max_val`, `min_vals`, `max_vals`
attributes of observers were Tensors but not buffers.  They had custom
state_dict save/load code to ensure their state was saved.

At some point, these attributes became buffers, and the custom
save/load code remained. This introduced a subtle bug:
* create model A, move it to a device (cpu/cuda) and save its state_dict
* create model B, load its state dict.
* `min_val|min_vals|max_val|max_vals` would always be loaded to model A's device, even if the rest of model B was on a different device
* the above is inconsistent with how save/load on different devices is expected to work (see https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-model-across-devices)

In practice, the case people would sometimes hit is (see the sketch after this list):
* model A is on CPU, state dict is saved
* model B is created and moved to GPU, state_dict from model A is loaded
* assertions throw when operations are attempted across different devices
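
A hedged sketch of that scenario using the eager-mode quantization API (the helper is illustrative):

```python
import torch
import torch.quantization as tq

def make_prepared():
    m = torch.nn.Sequential(torch.nn.Linear(4, 4))
    m.qconfig = tq.get_default_qconfig("fbgemm")
    return tq.prepare(m)

model_a = make_prepared()         # observers (min_val/max_val) live on CPU
state = model_a.state_dict()

model_b = make_prepared().cuda()  # every other buffer/param is on GPU
model_b.load_state_dict(state)    # with this fix, min_val/max_val follow model_b's device
```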

This PR fixes the behavior by removing the custom save/load where
possible and letting the default `nn.Module` save/load code handle
device assignment.  We special case `PerChannelMinMaxObserver` and its
children to allow for loading buffers of different size, which is
normal.

There are some followups to also enable this for HistogramObserver
and FakeQuantize, which can be done in separate PRs due to higher
complexity.

Test Plan:
```
python test/test_quantization.py TestObserver.test_state_dict_respects_device_affinity
```

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23644493

fbshipit-source-id: 0dbb6aa309ad569a91a663b9ee7e44644080032e
2020-09-11 14:48:56 -07:00
192c4111a3 Simplify target handling in nn gradcheck. (#44507)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44507

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23635799

Pulled By: gchanan

fbshipit-source-id: 75090d6a48771e5c92e737a0829fbfa949f7c8a7
2020-09-11 13:25:59 -07:00
8a574c7104 [Cmake] Drop quotation marks around $ENV{MAX_JOBS} (#44557)
Summary:
Solves `the '-j' option requires a positive integer argument` error on some systems when MAX_JOBS is not defined

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44557

Reviewed By: vkuzo

Differential Revision: D23653511

Pulled By: malfet

fbshipit-source-id: 7d86fb7fb6c946c34afdc81bf2c3168a74d00a1f
2020-09-11 12:57:11 -07:00
2b8f0b2023 [caffe2] adds Cancel to OperatorBase and NetBase (#44145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44145

## Motivation

* To be able to make C2 ops cancellable so we can safely exit.
* Some C2 operators are blocking and thus non-cancellable. If an error
  occurs, we need to be able to safely stop all net execution so we can throw
  the exception to the caller.

## Summary
*  Adds `NetBase::Cancel()` to NetBase, which iterates over the entire list of
   operators and calls Cancel on each.
* Cancel on all ops was added to Net since there's nothing async-specific about it.
* `AsyncSchedulingNet` calls the parent Cancel.
* To preserve backwards compatibility, `AsyncSchedulingNet`'s Cancel still calls
   `CancelAndFinishAsyncTasks`.
* Adds `Cancel()` to `OperatorBase`.

Reviewed By: dzhulgakov

Differential Revision: D23279202

fbshipit-source-id: e1bb0ff04a4e1393f935dbcac7c78c0baf728550
2020-09-11 12:50:26 -07:00
5579b53a7f Fix SmoothL1Loss when target.requires_grad is True. (#44486)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44486

SmoothL1Loss had a completely different (and incorrect, see #43228) path when target.requires_grad was True.

This PR does the following (a short example follows the list):

1) adds derivative support for target via the normal derivatives.yaml route
2) kills the different (and incorrect) path for when target.requires_grad was True
3) modifies the SmoothL1Loss CriterionTests to verify that the target derivative is checked.
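
For example:

```python
import torch
import torch.nn.functional as F

inp = torch.randn(3, requires_grad=True)
target = torch.randn(3, requires_grad=True)

F.smooth_l1_loss(inp, target).backward()
print(inp.grad, target.grad)   # target.grad now comes from the standard derivative path
```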

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23630699

Pulled By: gchanan

fbshipit-source-id: 0f94d1a928002122d6b6875182867618e713a917
2020-09-11 12:13:36 -07:00
b7ef4eec46 [NNC] Add loop slicing transforms (#43854)
Summary:
Add new transforms `sliceHead` and `sliceTail` to `LoopNest`, for example:

Before transformation:
```
for x in 0..10:
  A[x] = x*2
```

After `sliceHead(x, 4)`:

```
for x in 0..4:
  A[x] = x*2
for x in 4..10:
  A[x] = x*2
```

After `sliceTail(x, 1)`:
```
for x in 0..4:
  A[x] = x*2
for x in 4..9:
  A[x] = x*2
for x in 9..10:
  A[x] = x*2
```

`sliceHead(x, 10)` and `sliceTail(x, 10)` are no-ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43854

Test Plan: Tests are added in `test_loopnest.cpp`; they cover the basic transformations and also test the combination with other transformations such as `splitWithTail`.

Reviewed By: nickgg

Differential Revision: D23417366

Pulled By: cheng-chang

fbshipit-source-id: 06c6348285f2bafb4be3286d1642bfbe1ea499bf
2020-09-11 12:09:12 -07:00
39bb455e36 Update fallback kernel for Autograd keys. (#44349)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44349

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23589807

Pulled By: ailzhang

fbshipit-source-id: 0e4b0bf3e07bb4e35cbf1bda22f7b03193eb3dc4
2020-09-11 12:04:52 -07:00
11fb51d093 [quant][graphmode][fx][fix] Support dictionary output (#44508)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44508

Bug fix for dictionary output

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23636182

fbshipit-source-id: 0c00cd6b9747fa3f8702d7f7a0d5edb31265f466
2020-09-11 11:29:20 -07:00
442957d8b6 [pytorch] Remove mobile nonvariadic run_method (#44235)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44235

Removes nonvariadic run_method() from mobile Module entirely (to be later replaced by a variadic version). All use cases should have been migrated to use get_method() and Method::operator() in D23436351
ghstack-source-id: 111848220

Test Plan: CI

Reviewed By: iseeyuan

Differential Revision: D23484577

fbshipit-source-id: 602fcde61e13047a34915b509da048b9550103b1
2020-09-11 10:23:08 -07:00
a61318a535 [pytorch] Replace mobile run_method with get_method and operator() (#44202)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44202

In preparation for changing mobile run_method() to be variadic, this diff:

* Implements get_method() for mobile Module, which is similar to find_method but expects the method to exist.
* Replaces calls to the current nonvariadic implementation of run_method() by calling get_method() and then invoking the operator() overload on Method objects.
ghstack-source-id: 111848222

Test Plan: CI, and all the unit tests which currently contain run_method that are being changed.

Reviewed By: iseeyuan

Differential Revision: D23436351

fbshipit-source-id: 4655ed7182d8b6f111645d69798465879b67a577
2020-09-11 10:23:06 -07:00
cdf5e2ae86 add typing annotations for a few torch.utils.* modules (#43806)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43431. Depends on [gh-43862](https://github.com/pytorch/pytorch/pull/43862) (EDIT: now merged)

Modules:
- torch.utils.mkldnn
- torch.utils.mobile_optimizer
- torch.utils.bundled_inputs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43806

Reviewed By: gmagogsfm

Differential Revision: D23635151

Pulled By: SplitInfinity

fbshipit-source-id: a85b75a7927dde6cc55bcb361f8ff601ffb0b2a1
2020-09-11 10:20:55 -07:00
7d78a6fcdd Update interpolate to use new upsample overloads (#43025)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43025

- Use new overloads that better reflect the arguments to interpolate.
- More uniform interface for upsample ops allows simplifying the Python code.
- Also reorder overloads in native_functions.yaml to give them priority.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/37177

ghstack-source-id: 106938111

Test Plan:
test_nn has pretty good coverage.

Relying on CI for ONNX, etc.

Didn't test FC because this change is *not* forward compatible.

To ensure backwards compatibility, I ran this code before this change

```python
def test_func(arg):
    interp = torch.nn.functional.interpolate
    with_size = interp(arg, size=(16,16))
    with_scale = interp(arg, scale_factor=[2.1, 2.2], recompute_scale_factor=False)
    with_compute = interp(arg, scale_factor=[2.1, 2.2])
    return (with_size, with_scale, with_compute)

traced_func = torch.jit.trace(test_func, torch.randn(1,1,1,1))

sample = torch.randn(1, 3, 7, 7)
output = traced_func(sample)

assert not torch.allclose(output[1], output[2])

torch.jit.save(traced_func, "model.pt")
torch.save((sample, output), "data.pt")
```

then this code after this change

```python
model = torch.jit.load("model.pt")
sample, golden = torch.load("data.pt")
result = model(sample)
for r, g in zip(result, golden):
    assert torch.allclose(r, g)
```

Reviewed By: AshkanAliabadi

Differential Revision: D21209991

fbshipit-source-id: 5b2ebb7c3ed76947361fe532d1dbdd6faa3544c8
2020-09-11 09:59:14 -07:00
df6ea62526 Add nondeterministic check to new upsample overloads
Summary: I think these were missed due to a code landing race condition.

Test Plan: Fixes CUDA tests with PR 43025 applied.

Reviewed By: iseeyuan, AshkanAliabadi

Differential Revision: D23639566

fbshipit-source-id: 1322d7708e246b075a66588e7e54f4e12092477f
2020-09-11 09:58:07 -07:00
3de2c0b42f Fix L1Loss when target.requires_grad is True. (#44471)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44471

L1Loss had a completely different (and incorrect, see #43228) path when target.requires_grad was True.

This PR does the following:

1) adds derivative support for target via the normal derivatives.yaml route
2) kills the different (and incorrect) path for when target.requires_grad was True
3) modifies the L1Loss CriterionTests to verify that the target derivative is checked.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23626008

Pulled By: gchanan

fbshipit-source-id: 2828be16b56b8dabe114962223d71b0e9a85f0f5
2020-09-11 09:51:16 -07:00
ea55820606 [dper3] Export PackSegments and UnpackSegments to Pytorch
Summary: As title.

Test Plan:
```
buck test //caffe2/caffe2/python/operator_test/:torch_integration_test -- test_pack_segments
```

Reviewed By: yf225

Differential Revision: D23610495

fbshipit-source-id: bd8cb61f2284a08a54091a4f982f01fcf681f215
2020-09-11 09:29:24 -07:00
b73b44f976 [PyTorch Mobile] Move some string ops to register_prim_ops.cpp and make them selective (#44500)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44500

Some user models are using those operators. Unblock them while keeping the ops selective.

Test Plan: CI

Reviewed By: linbinyu

Differential Revision: D23634769

fbshipit-source-id: 55841d1b07136b6a27b6a39342f321638dc508cd
2020-09-11 09:24:35 -07:00
567c51cce9 In common_distributed, fix TEST_SKIPS multiprocessing manager (#44525)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44525

Since `TEST_SKIPS` is a global `multiprocessing.Manager` dict, this was causing
issues when one test would fail and make the rest of the tests fail during
setup due to networking errors.

See the failed CI job: https://app.circleci.com/pipelines/github/pytorch/pytorch/212491/workflows/0450151d-ca09-4cf6-863d-272de6ed917f/jobs/7389065 for an example, where `test_ddp_backward` failed but then caused the rest of the tests to fail at the line `test_skips.update(TEST_SKIPS)`.

To fix this issue, at the end of every test we revert `TEST_SKIPS` back to a regular dict, and redo the conversion to a `multiprocessing.Manager` in the next test, which prevents these errors.
ghstack-source-id: 111844724

Test Plan: CI

Reviewed By: malfet

Differential Revision: D23641618

fbshipit-source-id: 27ce823968ece9804bb4dda898ffac43ef732b89
2020-09-11 09:16:33 -07:00
d07d25a8c5 Fix MSELoss when target.requires_grad is True. (#44437)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44437

MSELoss had a completely different (and incorrect, see https://github.com/pytorch/pytorch/issues/43228) path when target.requires_grad was True.

This PR does the following:
1) adds derivative support for target via the normal derivatives.yaml route
2) kills the different (and incorrect) path for when target.requires_grad was True
3) modifies the MSELoss CriterionTests to verify that the target derivative is checked.

TODO:
1) do we still need check_criterion_jacobian when we run grad/gradgrad checks?
2) ensure the Module tests check when target.requires_grad
3) do we actually test when reduction='none' and reduction='mean'?

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23612166

Pulled By: gchanan

fbshipit-source-id: 4f74d38d8a81063c74e002e07fbb7837b2172a10
2020-09-11 08:51:28 -07:00
9a3b83cbf2 Update submodule gloo to have latest commits to enable it can work on Windows (#44529)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44529

Reviewed By: rohan-varma

Differential Revision: D23650123

Pulled By: mrshenli

fbshipit-source-id: b5b891cbcec51a14379d6604af63c714c32d93e7
2020-09-11 08:47:02 -07:00
b6b1c01adf torch.view_as_complex fails with segfault for a zero dimensional tensor (#44175)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44061
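
A minimal repro sketch; after the fix the call below is expected to raise a RuntimeError (view_as_complex needs a last dimension of size 2) instead of segfaulting:

```python
import torch

x = torch.tensor(1.0)        # zero-dimensional tensor
torch.view_as_complex(x)     # previously a segfault; now a RuntimeError
```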

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44175

Reviewed By: colesbury

Differential Revision: D23628103

Pulled By: anjali411

fbshipit-source-id: 6f70b5824150121a1617c0757499832923ae02b5
2020-09-11 08:35:49 -07:00
a9754fb860 Use TP Tensor.metadata to carry device info (#44396)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44396

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D23602576

Pulled By: mrshenli

fbshipit-source-id: c639789979b2b71fc165efbcf70f37b4c39469df
2020-09-11 08:33:22 -07:00
f44de7cdc3 Add missing rpc.shutdown() (#44417)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44417

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D23626208

Pulled By: mrshenli

fbshipit-source-id: 4ff8cad0e1193f99518804c21c9dd26ae718f4eb
2020-09-11 08:32:15 -07:00
77cc7d1ecd C++ APIs Transformer NN Module Top Layer (#44333)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44333

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23584010

Pulled By: glaringlee

fbshipit-source-id: 990026e3f1b5ae276776e344ea981386cb7528fe
2020-09-11 08:25:27 -07:00
09892de815 Clarify track_running_stats docs; Make SyncBatchNorm track_running_stats behavior consistent (#44445)
Summary:
context: https://github.com/pytorch/pytorch/pull/38084

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44445

Reviewed By: colesbury

Differential Revision: D23634216

Pulled By: mrshenli

fbshipit-source-id: d1242c694dec0e7794651f8031327625eb9989ee
2020-09-11 08:20:34 -07:00
30fccc53a9 [NNC] Don't attempt to refactor conditional scalars (#44223)
Summary:
Fixes a bug in the NNC registerizer for Cuda where it would hoist reads out of a conditional context when trying to cache them. As a quick fix, prevent scalar replacement if a usage is within a condition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44223

Reviewed By: gchanan

Differential Revision: D23551247

Pulled By: nickgg

fbshipit-source-id: 17a7bf2be4c8c3dd8a9ab7997dce9aea200c3685
2020-09-11 04:22:16 -07:00
c967e7724e [quant] conv_transpose1d_prepack / conv_transpose1d_unpack (#40360)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40360

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D22158982

Pulled By: z-a-f

fbshipit-source-id: 844d02806554aaa68b521283703e630cc544d419
2020-09-11 04:12:28 -07:00
8b8986662f [JIT] Remove profiling nodes in autodiff forward graph (#44420)
Summary:
Previously we were not removing profiling nodes in graphs that required grad and contained diff graphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44420

Reviewed By: bertmaher

Differential Revision: D23607482

Pulled By: eellison

fbshipit-source-id: af095f3ed8bb3c5d09610f38cc7d1481cbbd2613
2020-09-11 02:59:39 -07:00
c6febc6480 [JIT] Add a python hook for a function to interpret JIT graphs. (#44493)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44493

This function allows executing a graph exactly as it is, without going
through a graph executor, which would run passes on the graph before
interpreting it. I found this feature extremely helpful when I worked on
a stress-testing script to shake out bugs from the TE fuser: I needed to
run a very specific set of passes on a graph and nothing else, and
then execute exactly that graph.
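
The message doesn't name the Python binding, so the sketch below is purely hypothetical: the `_jit_interpret_graph` name and its calling convention are assumptions for illustration, not the confirmed API.

```python
import torch

@torch.jit.script
def f(x):
    return x * 2 + 1

# Hypothetical binding: run the graph as-is, with no executor passes applied.
out = torch._C._jit_interpret_graph(f.graph, (torch.ones(3),))
```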

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23632505

Pulled By: ZolotukhinM

fbshipit-source-id: ea81fc838933743e2057312d3156b77284d832ef
2020-09-11 02:55:26 -07:00
51ed31269e Replace FutureMessage with c10::ivalue::Future in DistEngine. (#44239)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44239

As part of https://github.com/pytorch/pytorch/issues/41574, use
c10::ivalue::Future everywhere in DistEngine.
ghstack-source-id: 111645070

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D23553507

fbshipit-source-id: 1b51ba13d1ebfa6c5c70b12028e9e96ce8ba51ff
2020-09-11 01:03:42 -07:00
b5d75dddd9 Enable lerp on half type; fix output memory format (#43541)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43541

Reviewed By: zou3519

Differential Revision: D23499592

Pulled By: ezyang

fbshipit-source-id: 9efdd6cbf0a334ec035ddd467667ba874b892549
2020-09-10 21:50:35 -07:00
0c58a017bd [quant][eagermode][refactor] Add set/get method for quantization and fusion mappings (#43990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43990

Allow user to register custom quantization and fusion patterns

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23485344

fbshipit-source-id: 4f0174ee6d8000d83de0f73cb370e9a1941d54aa
2020-09-10 21:29:39 -07:00
f7278473d3 [NCCL] Fix NCCL_BLOCKING_WAIT functionality with Async Error Handling (#44411)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44411

This basically aborts errored NCCL communicators if either blocking
wait or async error handling is enabled. Otherwise we might abort NCCL
communicators when neither is enabled, and this could result in subsequent GPU
operations using corrupted data.
ghstack-source-id: 111839264

Test Plan: Succesful Flow run: f217591683

Reviewed By: jiayisuse

Differential Revision: D23605382

fbshipit-source-id: 6c16f9626362be3b0ce2feaf0979b2dff97ce61b
2020-09-10 20:57:55 -07:00
6ee41974e3 Speedup Linux nightly builds (#44532)
Summary:
`stdbuf` affects not only the process it launches, but all of its subprocesses, which has a very negative effect on the IPC communication between nvcc and the C++ preprocessor and results in a 2x slowdown, for example:

```
$ time /usr/local/cuda/bin/nvcc /pytorch/aten/src/THC/generated/THCTensorMathPointwiseByte.cu -c ...
real	0m34.623s
user	0m31.736s
sys	0m2.825s
```
but
```
time stdbuf -i0 -o0 -e0 /usr/local/cuda/bin/nvcc /pytorch/aten/src/THC/generated/THCTensorMathPointwiseByte.cu -c ...
real	1m14.113s
user	0m37.989s
sys	0m36.104s
```
because the OS spends lots of time transferring the preprocessed source back to nvcc byte by byte, as requested via the stdbuf call

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44532

Reviewed By: ngimel

Differential Revision: D23643411

Pulled By: malfet

fbshipit-source-id: 9fdaf8b8a49574e6b281f68a5dd9ba9d33464dff
2020-09-10 20:32:08 -07:00
69f6d94caa Register diag_backward, diagonal_backward, infinitely...gelu_backward as operators (#44422)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44422

See #44052 for context.

Test Plan:
- `pytest test/test_autograd.py -v`
- `pytest test/test_nn.py -v`

Reviewed By: mrshenli

Differential Revision: D23607691

Pulled By: zou3519

fbshipit-source-id: 09fbcd66b877af4fa85fd9b2f851ed3912ce84d6
2020-09-10 18:43:18 -07:00
7ff7e6cfc8 Register cummaxmin_backward, cumprod_backward as operators (#44410)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44410

See #44052 for context. One of the cumprod_backward overloads was unused
so I just deleted it.

Test Plan: - `pytest test/test_autograd.py -v`

Reviewed By: mrshenli

Differential Revision: D23605503

Pulled By: zou3519

fbshipit-source-id: f9c5b595e62d2d6e71f26580ba96df15cc9de4f7
2020-09-10 18:43:15 -07:00
08b431f54c Add trace_backward, masked_select_backward, and take_backward as ops (#44408)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44408

See #44052 for context.

Test Plan: - `pytest test/test_autograd.py -v`

Reviewed By: mrshenli

Differential Revision: D23605504

Pulled By: zou3519

fbshipit-source-id: b9b1646d13caa6e536d08669c29bfc2ad8ff89a3
2020-09-10 18:41:07 -07:00
41f62b17e7 Fix DDP join() API in the case of model.no_sync() (#44427)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44427

Closes https://github.com/pytorch/pytorch/issues/44425

DDP join API currently does not work properly with `model.no_sync()`, see https://github.com/pytorch/pytorch/issues/44425 for details. This PR fixes the problem via the approach mentioned in the issue, namely scheduling an allreduce that tells joined ranks whether to sync in the backwards pass or not. Tests are added for skipping gradient synchronization for various `sync_interval`s.
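
A minimal sketch of the interaction being exercised (gloo on localhost; port, sync interval, and batch counts are arbitrary; ranks intentionally see uneven numbers of batches):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank: int, world_size: int):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(4, 4))
    sync_interval = 2
    num_batches = 3 + rank                    # uneven inputs across ranks

    with model.join():                        # joined ranks shadow collectives
        for i in range(num_batches):
            x = torch.randn(2, 4)
            if (i + 1) % sync_interval != 0:
                with model.no_sync():         # skip gradient sync this step
                    model(x).sum().backward()
            else:
                model(x).sum().backward()     # sync accumulated gradients

    dist.destroy_process_group()

if __name__ == "__main__":
    torch.multiprocessing.spawn(run, args=(2,), nprocs=2)
```
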
ghstack-source-id: 111786479

Reviewed By: pritamdamania87

Differential Revision: D23609070

fbshipit-source-id: e8716b7881f8eee95e3e3499283e716bd3d7fe76
2020-09-10 18:31:40 -07:00
129d52aef2 Fix uniqueness check in movedim (#44307)
Summary:
Noticed this bug in `torch.movedim` (https://github.com/pytorch/pytorch/issues/41480). [`std::unique`](https://en.cppreference.com/w/cpp/algorithm/unique) only guarantees uniqueness for _sorted_ inputs. The current check lets through non-unique values when they aren't adjacent to each other in the list, e.g. `(0, 1, 0)` wouldn't raise an exception and instead the algorithm fails later with an internal assert.
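
A minimal sketch of the case being fixed:

```python
import torch

t = torch.zeros(2, 3, 4)
# Non-adjacent repeats like (0, 1, 0) previously slipped past the
# std::unique-based check (valid only for sorted input) and tripped an
# internal assert later; with the fix they raise a RuntimeError up front.
torch.movedim(t, (0, 1, 0), (0, 1, 2))
```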

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44307

Reviewed By: mrshenli

Differential Revision: D23598311

Pulled By: zou3519

fbshipit-source-id: fd6cc43877c42bb243cfa85341c564b6c758a1bf
2020-09-10 17:41:07 -07:00
c48f511c7e Moves some of TestTorchMathOps to OpInfos (#44277)
Summary:
This PR fixes three OpInfo-related bugs and moves some functions from TestTorchMathOps to be tested using the OpInfo pattern. The bugs are:

- A skip test path in test_ops.py incorrectly formatted its string argument
- Decorating the tests in common_device_type.py was incorrectly always applying decorators to the original test, not the op-specific variant of the test. This could cause the same decorator to be applied multiple times, overriding past applications.
- make_tensor was incorrectly constructing tensors in some cases

The functions moved are:

- asin
- asinh
- sinh
- acosh
- tan
- atan
- atanh
- tanh
- log
- log10
- log1p
- log2

In a follow-up PR most or all of the remaining functions in TestTorchMathOps will be refactored as OpInfo-based tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44277

Reviewed By: mrshenli, ngimel

Differential Revision: D23617361

Pulled By: mruberry

fbshipit-source-id: edb292947769967de9383f6a84eb327f027509e0
2020-09-10 17:31:50 -07:00
2e744b1820 Support work.result() to get result tensors for allreduce for Gloo, NCCL backends (#43970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43970

It is a resubmission of #43386

Original commit changeset: 27fbeb161706
ghstack-source-id: 111775070
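
A minimal single-process sketch of the API (gloo backend; address and port are arbitrary):

```python
import os
import torch
import torch.distributed as dist

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29501"
dist.init_process_group("gloo", rank=0, world_size=1)

t = torch.ones(2)
work = dist.all_reduce(t, async_op=True)
work.wait()
print(work.result())          # list of result tensors of the allreduce
dist.destroy_process_group()
```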

Test Plan:
Added checks to the existing unit test and ran it on a GPU devserver.
Verified that the test that was failing in the original diff also passes: https://app.circleci.com/pipelines/github/pytorch/pytorch/210229/workflows/86bde47b-f2da-48e3-a618-566ae2713102/jobs/7253683

Reviewed By: pritamdamania87

Differential Revision: D23455047

fbshipit-source-id: b8dc4a30b95570d68a482c19131674fff2a3bc7c
2020-09-10 17:13:37 -07:00
91b16bff1e Disable PyTorch iOS ARM64 builds until cert problem is fixed (#44499)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44499

Reviewed By: seemethere, xta0

Differential Revision: D23634961

Pulled By: malfet

fbshipit-source-id: e32ae29c42c351bcb4f48bc52d4082ae56545e5b
2020-09-10 16:24:11 -07:00
1dd3fae3d2 [pytorch] Add logging to mobile Method run (#44234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44234

Changes mobile Method to point to a mobile Module directly instead of the Module ivalue in order to access metadata for logging/debugging, and then adds said logging.
ghstack-source-id: 111775806

Test Plan:
CI/existing unit tests to test BC
Testing fb4a logging:
Built fb4a on D23436351 (because usage of run_method isn't replaced yet in this diff), and then checked the Scuba logs to see that the appropriate ad clicks were logged (one ad for Buzzfeed shopping and another about Netflix from Bustle)

{F328510687}
{F328511201}
[Scuba sample of QPL metrics](https://www.internalfb.com/intern/scuba/query/?dataset=qpl_metrics%2Fpytorch_employee&pool=uber&view=samples_client&drillstate=%7B%22sampleCols%22%3A[%22device_model%22%2C%22instance_id_sampled%22%2C%22method%22%2C%22ios_device_class%22%2C%22points_path%22%2C%22userid_sampled%22%2C%22client_sample_rate%22%2C%22browser_name%22%2C%22ios_device_name%22%2C%22points%22%2C%22is_employee%22%2C%22is_test_user%22%2C%22network_only_queries%22%2C%22annotations%22%2C%22oncall_shortname%22%2C%22environment_tags%22%2C%22revoked_queries%22%2C%22annotations_bool%22%2C%22points_data%22%2C%22annotations_double_array%22%2C%22annotations_string_array%22%2C%22revoked_steps%22%2C%22points_set%22%2C%22device_os_version%22%2C%22ota_version_rollout%22%2C%22steps%22%2C%22vadar_calculation_result%22%2C%22app_name%22%2C%22client_push_phase%22%2C%22vadar%22%2C%22release_channel%22%2C%22interaction_class%22%2C%22exposures%22%2C%22annotations_double%22%2C%22deviceid_sampled%22%2C%22is_logged_in%22%2C%22device_os%22%2C%22time%22%2C%22major_os_ver%22%2C%22annotations_int_array%22%2C%22duration_ns%22%2C%22app_build%22%2C%22bucket_id%22%2C%22cache_and_network_queries%22%2C%22value%22%2C%22vadar_v2%22%2C%22quicklog_event%22%2C%22unixname%22%2C%22vadar_calculation_result_v2%22%2C%22trace_tags%22%2C%22annotations_int%22%2C%22quicklog_module%22%2C%22push_phase%22%2C%22year_class%22%2C%22country%22%2C%22capped_duration%22%2C%22ram_class%22%2C%22weight%22%2C%22carrier%22%2C%22app_id%22%2C%22app_version%22%2C%22react_bundle_version%22%2C%22logging_source%22%2C%22is_unsampled_for_scuba%22%2C%22instrumentation_errors%22%2C%22android_cpu_abi_list%22%2C%22days_after_release%22%2C%22cpu_cores%22%2C%22user_bucket%22%2C%22quicklog_action%22%2C%22server_scuba_sample_rate%22%2C%22points_vector%22%2C%22annotations_bool_array%22%2C%22android_device_class%22%2C%22browser_full_version%22%2C%22major_app_ver%22]%2C%22derivedCols%22%3A[]%2C%22mappedCols%22%3A[]%2C%22enumCols%22%3A[]%2C%22hideEmptyColumns%22%3Afalse%2C%22focused_event%22%3A%22%22%2C%22show_metadata%22%3A%22false%22%2C%22start%22%3A%222020-09-08%2011%3A27%3A00%22%2C%22end%22%3A%22start%20%2B%201%20minute%22%2C%22timezone%22%3A%22America%2FLos_Angeles%22%2C%22samplingRatio%22%3A%221%22%2C%22num_samples%22%3A%22100%22%2C%22aggregateList%22%3A[]%2C%22param_dimensions%22%3A[]%2C%22modifiers%22%3A[]%2C%22order%22%3A%22none%22%2C%22order_desc%22%3Atrue%2C%22filterMode%22%3A%22DEFAULT%22%2C%22constraints%22%3A[[%7B%22column%22%3A%22quicklog_event%22%2C%22op%22%3A%22eq%22%2C%22value%22%3A[%22[%5C%22MOBILE_MODULE_STATS%5C%22]%22]%7D%2C%7B%22column%22%3A%22userid_sampled%22%2C%22op%22%3A%22eq%22%2C%22value%22%3A[%22[%5C%22100013484978975%5C%22]%22]%7D]]%2C%22c_constraints%22%3A[[]]%2C%22b_constraints%22%3A[[]]%2C%22metrik_view_params%22%3A%7B%22should_use_legacy_colors%22%3Afalse%2C%22columns_skip_formatting%22%3A[]%2C%22view%22%3A%22samples_client%22%2C%22width%22%3A%221358%22%2C%22height%22%3A%22912%22%2C%22tableID%22%3A%22qpl_metrics%2Fpytorch_employee%22%2C%22fitToContent%22%3Afalse%2C%22format_tooltip_in_percent%22%3Afalse%2C%22use_y_axis_hints_as_limits%22%3Atrue%2C%22has_dynamic_context_menu%22%3Atrue%2C%22has_context_menu%22%3Afalse%2C%22legend_mode%22%3A%22nongrid%22%2C%22connect_nulls%22%3Atrue%2C%22timezone_offset%22%3A420%2C%22timezone%22%3A%22America%2FLos_Angeles%22%2C%22y_min_hint%22%3A0%2C%22should_render_plugins_menu%22%3Afalse%7D%7D&normalized=1599581160)
[Scuba sample showing ad source; just the bottom two results](https://www.internalfb.com/intern/scuba/query/?dataset=business_integrity_webpage_semantic&pool=uber&drillstate=%7B%22sampleCols%22%3A[%22from_custom_sampling%22%2C%22data_version%22%2C%22scribe_category_type%22%2C%22page_id%22%2C%22name%22%2C%22source_url%22%2C%22time%22%2C%22title_semantic%22%2C%22major_version%22%2C%22server_protocol%22%2C%22custom_sampling_enabled%22%2C%22ad_id%22%2C%22appversion%22%2C%22clienttime%22%2C%22isemployee%22%2C%22title%22%2C%22images%22%2C%22weight%22%2C%22carrier%22%2C%22is_ad%22%2C%22locale%22%2C%22appid%22%2C%22ip_country%22%2C%22iab_models%22]%2C%22derivedCols%22%3A[]%2C%22mappedCols%22%3A[]%2C%22enumCols%22%3A[]%2C%22return_remainder%22%3Afalse%2C%22should_pivot%22%3Afalse%2C%22is_timeseries%22%3Afalse%2C%22hideEmptyColumns%22%3Afalse%2C%22main_dimension%22%3A%22time%22%2C%22start%22%3A%22-5%20minutes%22%2C%22samplingRatio%22%3A%221%22%2C%22compare%22%3A%22none%22%2C%22axes%22%3A%22linked%22%2C%22overlay_types%22%3A[]%2C%22minBucketSamples%22%3A%22%22%2C%22dimensions%22%3A[]%2C%22scale_type%22%3A%22absolute%22%2C%22num_samples%22%3A%22100%22%2C%22metric%22%3A%22avg%22%2C%22fill_missing_buckets%22%3A%22connect%22%2C%22smoothing_bucket%22%3A%221%22%2C%22top%22%3A%227%22%2C%22markers%22%3A%22%22%2C%22timezone%22%3A%22America%2FLos_Angeles%22%2C%22end%22%3A%22now%22%2C%22show_p95_ci%22%3Afalse%2C%22time_bucket%22%3A%22auto%22%2C%22compare_mode%22%3A%22normal%22%2C%22aggregateList%22%3A[]%2C%22param_dimensions%22%3A[]%2C%22modifiers%22%3A[]%2C%22order%22%3A%22none%22%2C%22order_desc%22%3Atrue%2C%22filterMode%22%3A%22DEFAULT%22%2C%22constraints%22%3A[[%7B%22column%22%3A%22major_version%22%2C%22op%22%3A%22eq%22%2C%22value%22%3A[%22[%5C%22288%5C%22]%22]%7D]]%2C%22c_constraints%22%3A[[]]%2C%22b_constraints%22%3A[[]]%2C%22metrik_view_params%22%3A%7B%22should_use_legacy_colors%22%3Afalse%2C%22columns_skip_formatting%22%3A[]%2C%22view%22%3A%22time_view%22%2C%22width%22%3A%221358%22%2C%22height%22%3A%22912%22%2C%22tableID%22%3A%22business_integrity_webpage_semantic%22%2C%22fitToContent%22%3Afalse%2C%22format_tooltip_in_percent%22%3Afalse%2C%22use_y_axis_hints_as_limits%22%3Atrue%2C%22has_dynamic_context_menu%22%3Atrue%2C%22has_context_menu%22%3Afalse%2C%22legend_mode%22%3A%22nongrid%22%2C%22connect_nulls%22%3Atrue%2C%22timezone_offset%22%3A420%2C%22timezone%22%3A%22America%2FLos_Angeles%22%2C%22y_min_hint%22%3A0%2C%22should_render_plugins_menu%22%3Afalse%7D%7D&view=samples_client&normalized=1599587280)

Reviewed By: iseeyuan

Differential Revision: D23548687

fbshipit-source-id: 3e63085663f5fd8de90a4c7dbad0a17947aee973
2020-09-10 15:26:33 -07:00
a2a81e1335 Add a CONTRIBUTING.md for the distributed package. (#44224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44224

The purpose of this file is to help developers on PT distributed get
up to speed on the code structure and layout for PT Distributed.
ghstack-source-id: 111644842

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D23548377

fbshipit-source-id: 561d5b8e257642de172def8fdcc1311fae20690b
2020-09-10 14:58:00 -07:00
4bead6438a Enable torch.autograd typechecks (#44451)
Summary:
To help with further typing, move dynamically added native contributions from `torch.autograd` to `torch._C._autograd`.
Fix an invalid error handling pattern in
89ac30afb8/torch/csrc/autograd/init.cpp (L13-L15):
`PyImport_ImportModule` already raises a Python exception, and nullptr should be returned to properly propagate the error to the Python runtime.

All native methods/types are available in `torch/autograd/__init__.py` after `torch._C._init_autograd()` has been called.
Use f-strings instead of `.format` in test_type_hints.py.
Fixes https://github.com/pytorch/pytorch/issues/44450

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44451

Reviewed By: ezyang

Differential Revision: D23618261

Pulled By: malfet

fbshipit-source-id: fa5f739d7cff8410641128b55b810318c5f636ae
2020-09-10 13:37:29 -07:00
cc5a1cf616 [JIT] Erase shapes before fallback graph (#44434)
Summary:
Previously the specialized types were copied over to the fallback function, even though the tensors passed to the fallback were not of those specialized types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44434

Reviewed By: SplitInfinity

Differential Revision: D23611943

Pulled By: eellison

fbshipit-source-id: 2ea88a97529409f6c5c4c1f59a14b623524933de
2020-09-10 12:07:31 -07:00
b3f0297a94 ConvPackedParams: remove legacy format (#43651)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43651

This is a forward compatibility follow-up to
https://github.com/pytorch/pytorch/pull/43086/. We switch the
conv serialization to output the v2 format instead of the v1 format.

The plan is to land this 1 - 2 weeks after the base PR.

Test Plan:
```
python test/test_quantization.py TestSerialization.test_conv2d_graph_v2
python test/test_quantization.py TestSerialization.test_conv2d_nobias_graph_v2
```

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23355480

fbshipit-source-id: 4cb04ed8b90a0e3e452297a411d641a15f6e625f
2020-09-10 11:47:34 -07:00
d232fec1f1 Partly fix cuda builds of dper broken by caffe2 c++
Summary:
cuda builds using clang error out when building caffe2 due to an incorrect std::move

This does not fix all known errors, but it's a step in the right direction.

Differential Revision: D23626667

fbshipit-source-id: 7d9df886129f671ec430a166dd22e4af470afe1e
2020-09-10 11:37:49 -07:00
38c10b4f30 [NCCL] Fix the initialization of futureNCCLCallbackStreams (#44347)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44347

Cloned from Pull Request resolved: https://github.com/pytorch/pytorch/pull/44097, because the original author Sinan has completed the internship and now is unable to submit this diff.

As johnsonpaul mentioned in D23277575 (7d517cf96f), it looks like all processes were allocating memory on GPU-ID=0.

I was able to reproduce it by running the `test_ddp_comm_hook_allreduce_with_then_hook_nccl` unit test of `test_c10d.py` and running `nvidia-smi` while the test was running. The issue was reproduced as:
```
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0   3132563      C   python                                       777MiB |
|    0   3132564      C   python                                       775MiB |
|    4   3132564      C   python                                       473MiB |
+-----------------------------------------------------------------------------+
```
I realized that, as we initialize ProcessGroupNCCL, both processes were initially allocating memory on GPU 0.

We later also realized that I had forgotten the `isHighPriority` input of `getStreamFromPool`, so `futureNCCLCallbackStreams_.push_back(std::make_shared<at::cuda::CUDAStream>(at::cuda::getStreamFromPool(device_index)));` was just creating a vector of GPU 0 streams. After I changed `at::cuda::getStreamFromPool(device_index)` to `at::cuda::getStreamFromPool(false, device_index)`, `nvidia-smi` looked like:
```
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    673925      C   python                                       771MiB |
|    0    673926      C   python                                       771MiB |
|    1    673925      C   python                                       771MiB |
|    1    673926      C   python                                       771MiB |
|    2    673925      C   python                                       771MiB |
|    2    673926      C   python                                       771MiB |
|    3    673925      C   python                                       771MiB |
|    3    673926      C   python                                       771MiB |
|    4    673925      C   python                                       771MiB |
|    4    673926      C   python                                       771MiB |
|    5    673925      C   python                                       771MiB |
|    5    673926      C   python                                       771MiB |
|    6    673925      C   python                                       771MiB |
|    6    673926      C   python                                       771MiB |
|    7    673925      C   python                                       707MiB |
|    7    673926      C   python                                       623MiB |
+-----------------------------------------------------------------------------+
```
This confirms that we were just getting GPU 0 streams for the callback. I think this does not explain the `fp16_compress` stability issue, because we were able to reproduce that even without any `then` callback, just calling a copy from fp32 to fp16 before allreduce. However, this can explain other issues where `allreduce` was not on par with `no_hook`. I'll run some additional simulations with this diff.

I tried to replace `getStreamFromPool` with `getDefaultCUDAStream(deviceIndex)` and it wasn't causing additional memory usage. In this diff, I temporarily solved the issue by just initializing null pointers for each device in the constructor and setting the callback stream for the corresponding devices inside `ProcessGroupNCCL::getNCCLComm`. After the fix it looks like the memory issue was resolved:
```
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0   2513142      C   python                                       745MiB |
|    4   2513144      C   python                                       747MiB |
+-----------------------------------------------------------------------------+
```
I could use a dictionary instead of a vector for `futureNCCLCallbackStreams_`, but since the number of devices is fixed, I think it isn't necessary. Please let me know what you think in the comments.
ghstack-source-id: 111485483

Test Plan:
`test_c10d.py` and some perf tests. Also check `nvidia-smi` while running tests to validate memory looks okay.

This diff also fixes the regression in HPC tests as we register a hook:

{F322730175}

See https://fb.quip.com/IGuaAbD8bnvy (474fdd7e2d) for details.

Reviewed By: pritamdamania87

Differential Revision: D23495436

fbshipit-source-id: ad08e1d94343252224595d7c8a279fe75e244822
2020-09-10 11:25:38 -07:00
cb90fef770 Fix return value of PyErr_WarnEx ignored (SystemError) (#44371)
Summary:
This PR fixes unexpected `SystemError` when warnings are emitted and warning filters are set.

## Current behavior

```
$ python -Werror
>>> import torch
>>> torch.range(1, 3)
UserWarning: torch.range is deprecated in favor of torch.arange and will be removed in 0.5. Note that arange generates values in [start; end), not [start; end].

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
SystemError: <built-in method range of type object at 0x7f38c7703a60> returned a result with an error set
```

## Expected behavior

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UserWarning: torch.range is deprecated and will be removed in a future release because its behavior is inconsistent with Python's range builtin. Instead, use torch.arange, which produces values in [start, end).
```

## Note

Python exception must be raised if `PyErr_WarnEx` returns `-1` ([python docs](https://docs.python.org/3/c-api/exceptions.html#issuing-warnings)). This PR fixes warnings raised in the following code:
```py
import torch

torch.range(1, 3)
torch.autograd.Variable().volatile
torch.autograd.Variable().volatile = True
torch.tensor(torch.tensor([]))
torch.tensor([]).new_tensor(torch.tensor([]))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44371

Reviewed By: mrshenli

Differential Revision: D23598410

Pulled By: albanD

fbshipit-source-id: 2fbcb13fe4025dbebaf1fd837d4c8e0944e05010
2020-09-10 10:15:21 -07:00
f9a0d0c21e Allow Tensor-likes in torch.autograd.gradcheck (#43877)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43877

Reviewed By: zou3519

Differential Revision: D23493257

Pulled By: ezyang

fbshipit-source-id: 6cdaabe17157b484e9491189706ccc15420ac239
2020-09-10 09:02:17 -07:00
c8914afdfa Merge criterion_tests and new_criterion_tests. (#44398)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44398

These end up executing the same tests, so no reason to have them separate.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23600855

Pulled By: gchanan

fbshipit-source-id: 0952492771498bf813f1bf8e1d7c8dce574ec965
2020-09-10 08:29:59 -07:00
fa158c4ca6 Combine criterion and new criterion tests in test_jit. (#43958)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43958

There is not any difference between these tests (I'm merging them), so let's merge them in the JIT as well.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23452337

Pulled By: gchanan

fbshipit-source-id: e6d13cdb164205eec3dbb7cdcd0052b02c961778
2020-09-10 08:28:14 -07:00
af9cad761a Stop ignoring NotImplementedErrors in cuda CriterionTests. (#44381)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44381

Perhaps this was necessary when the test was originally introduced, but it's difficult to figure out what is actually tested.  And I don't think we actually use NotImplementedErrors.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23598646

Pulled By: gchanan

fbshipit-source-id: aa18154bfc4969cca22323e61683a301198823be
2020-09-10 08:18:33 -07:00
208ad45b4b fix scripts (#44464)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44464

Reviewed By: agolynski

Differential Revision: D23624921

Pulled By: colesbury

fbshipit-source-id: 72bed69edcf467a99eda9a3b97e894015c992dce
2020-09-10 08:13:48 -07:00
356aa54694 [Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D23621463

fbshipit-source-id: 1cd7e94e480c7073c9a0aad55aeba98de4b96164
2020-09-10 04:24:43 -07:00
6c98d904c0 handle the case of -0.0 on tanh quantization (#44406)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44406

this fix makes fakelowp identical to hw

- mask out the floating point number with 0x7fff so we are always dealing with positive numbers
- the dsp implementation is correct; ice-ref suffers from this same problem
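
A sketch of the masking trick on fp16 bit patterns (numpy used for illustration):

```python
import numpy as np

bits = np.array(-0.0, dtype=np.float16).view(np.uint16)  # 0x8000: sign bit set
masked = bits & np.uint16(0x7FFF)                        # 0x0000: same as +0.0

print(hex(int(bits)), hex(int(masked)))                  # 0x8000 0x0
```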

Test Plan: - tested with test_fusions.py, can't enable the test until the fix in ice-ref appears

Reviewed By: venkatacrc

Differential Revision: D23603878

fbshipit-source-id: a72d93a4bc811f98d1b5e82ddb204be028addfeb
2020-09-10 01:18:45 -07:00
28a23fce4c Deprecate torch.norm and torch.functional.norm (#44321)
Summary:
Part of https://github.com/pytorch/pytorch/issues/24802

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44321

Reviewed By: mrshenli

Differential Revision: D23617273

Pulled By: mruberry

fbshipit-source-id: 6f88b5cb097fd0acb9cf0e415172c5a86f94e9f2
2020-09-10 01:16:41 -07:00
7b547f086f To fix extra memory allocation when using circular padding (#39273)
Summary:
For fixing https://github.com/pytorch/pytorch/issues/39256

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39273

Reviewed By: anjali411

Differential Revision: D23471811

Pulled By: mruberry

fbshipit-source-id: fb324b51baea765311715cdf14642b334f335733
2020-09-10 00:15:31 -07:00
65d4a6b7c0 [ROCm] fix cub hipify mappings (#44431)
Summary:
Fixes ROCm-specific workarounds introduced by https://github.com/pytorch/pytorch/issues/44259.  This adds new hipify mappings that properly handle cub outside of caffe2 sources.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44431

Reviewed By: mrshenli

Differential Revision: D23617417

Pulled By: ngimel

fbshipit-source-id: 5d16afb6b8e6ec5ed049c51571866b0878d534ca
2020-09-09 23:39:25 -07:00
28bd4929bd [NNC] Make it able to normalize loop with variable start (#44133)
Summary:
Loops with variable start can also be normalized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44133

Test Plan: updated testNormalizeStartVariable.

Reviewed By: navahgar

Differential Revision: D23507097

Pulled By: cheng-chang

fbshipit-source-id: 4e9aad1cd4f4a839f59a00bf8ddf97637a1a6648
2020-09-09 23:05:57 -07:00
c515881137 Add reset_grad() function (#44423)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44423

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42754

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23010859

Pulled By: ngimel

fbshipit-source-id: 56eec43eba88b98cbf714841813977c68f983564
2020-09-09 22:05:45 -07:00
6324ef4ced [caffe2] Speed up compilation of aten-op.cc (#44440)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44440

`aten-op.cc` takes a long time to compile due to the large generated constructor. For each case, the `std::function` constructor and the initialization functions are inlined, producing a huge amount of intermediate code that takes a long time to optimize, given that many compiler optimization passes are superlinear in the function size.

This diff moves each case to a separate function, so that each one is cheap to optimize, and the constructor is just a large jump table, which is easy to optimize.

Reviewed By: dzhulgakov

Differential Revision: D23593741

fbshipit-source-id: 1ce7a31cda10d9b0c9d799716ea312a291dc0d36
2020-09-09 21:21:48 -07:00
89ac30afb8 [JIT] Propagate type sharing setting to submodule compilation (#44226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44226

**Summary**
At present, the `share_types` argument to `create_script_module` is used
to decide whether to reuse a previously created type for a top-level
module that has not yet been compiled. However, that setting does not apply
to the compilation of submodules of the top-level module; types are
still reused if possible.

This commit modifies `create_script_module` so that the `share_types`
flag is honoured during submodule compilation as well.

**Test Plan**
This commit adds a unit test to `TestTypeSharing` that checks that
submodule types are not shared or reused when `share_types` is set to
`False`.

**Fixes**
This commit fixes #43605.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23602371

Pulled By: SplitInfinity

fbshipit-source-id: b909b8b6abbe3b4cb9be8319ac263ade90e83bd3
2020-09-09 20:06:35 -07:00
d3b6d5caf1 [JIT] Add support for del to TS classes (#44352)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44352

**Summary**
This commit adds support for `del` with class instances. If a class
implements `__delitem__`, then `del class_instance[key]` is syntactic
sugar for `class_instance.__delitem__(key)`.
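
A minimal sketch of the new sugar (class and key names are illustrative):

```python
import torch
from typing import Dict

@torch.jit.script
class Bag(object):
    def __init__(self):
        self.items: Dict[str, int] = {"a": 1, "b": 2}

    def __delitem__(self, key: str):
        del self.items[key]

@torch.jit.script
def drop(key: str) -> int:
    bag = Bag()
    del bag[key]              # desugars to bag.__delitem__(key)
    return len(bag.items)

print(drop("a"))              # 1
```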

**Test Plan**
This commit adds a unit test to TestClassTypes to test this feature.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23603102

Pulled By: SplitInfinity

fbshipit-source-id: 28ad26ddc9a693a58a6c48a0e853a1c7cf5c9fd6
2020-09-09 19:52:35 -07:00
058d7228ec Expose the interface of nesterov of SGD Optimizer from caffe2 to dper
Summary:
Expose the interface of `nesterov` of SGD Optimizer from caffe2 to dper.

The dper sgd optimizer (https://fburl.com/diffusion/chpobg0h) already refers to the NAG SgdOptimizer in caffe2: https://fburl.com/diffusion/uat2lnan, so we just need to add the parameter 'nesterov' to the dper sgd optimizer.

Analysis of run results: N345540.

- train_ne increases as momentum (m) decreases.
- for m=0.95, 0.9: eval_ne is lower with NAG than production (no NAG, m = 0.95).
- for m=0.99: eval_ne with or without NAG is higher than production. It indicates larger variance in validation and overfit in training (lower train_ne).

Test Plan:
1. unit tests:
`buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test -- test_sgd_without_nesterov`
`buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test -- test_sgd_with_nesterov`
.
1. build dper front end package: `flow-cli canary   ads.dper3.workflows.sparse_nn.train --mode opt --entitlement      ads_global --run-as-secure-group      team_ads_ml_ranking`. The build result (refreshed) is here https://www.internalfb.com/intern/buck/build/2a368b55-d94b-45c1-8617-2753fbce994b. Flow package version is ads_dper3.canary:856b545cc6b249c0bd328f845adeb0d2.
.
2. To build dper back end package: `flow-cli canary  dper.workflows.dper3.train --mode opt --entitlement      ads_global --run-as-secure-group      team_ads_ml_ranking`. The build result (refreshed) is here: https://www.internalfb.com/intern/buck/build/70fa91cd-bf6e-4a08-8a4d-41e41a77fb52. Flow package version is aml.dper2.canary:84123a34be914dfe86b1ffd9925869de.
.
3. Compare prod with NAG-enabled runs:
a) refreshed prod run (m=0.95): f213877098
NAG enabled run (m=0.95): f213887113
.
b) prod run (m=0.9): f214065288
NAG enabled run (m=0.9): f214066319
.
c) prod run (m=0.99): f214065804
NAG enabled run (m=0.99): f214066725
.
d) changed the data type of nesterov to `bool` and launched a validation run
NAG enabled (m=0.95): f214500597

Reviewed By: ustctf

Differential Revision: D23152229

fbshipit-source-id: 61703ef6b4e72277f4c73171640fb8afc6d31f3c
2020-09-09 19:37:00 -07:00
5ee31308e6 [caffe2] exposes Net cancellation through pybind state (#44043)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44043

To invoke `cancel` from the net instance in Python, we expose it through pybind state.

Reviewed By: dzhulgakov

Differential Revision: D23249660

fbshipit-source-id: 45a1e9062dca811746fcf2e5e42199da8f76bb54
2020-09-09 18:13:13 -07:00
e028ad0762 Fix HashStoreTests and move to Gtest (#43384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43384

Much like the FileStoreTests, the HashStoreTests were also run in a single blob and threw exceptions upon failure. This modularizes the test by separating each function into separate gtest test cases.
ghstack-source-id: 111690834

Test Plan: Confirmed that the tests pass on devvm.

Reviewed By: jiayisuse

Differential Revision: D23257579

fbshipit-source-id: 7e821f0e9ee74c8b815f06facddfdb7dc2724294
2020-09-09 17:56:33 -07:00
69a3ff005d Modularize FileStoreTest and move to Gtest (#43383)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43383

FileStore Test currently has a large blob of tests that throw
exceptions upon failure. This PR modularizes each test so they can run
independently, and migrates the framework to gtest.
ghstack-source-id: 111690831

Test Plan: Confirmed tests pass on devvm

Reviewed By: jiayisuse

Differential Revision: D22879473

fbshipit-source-id: 6fa5468e594a53c9a6b972757068dfc41645703e
2020-09-09 17:56:30 -07:00
a7fba7de22 Convert StoreTestUtils to Gtest (#43382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43382

StoreTestCommon defines standard helper functions that are used by all of our Store tests. These helpers currently throw exceptions upon failure, this PR changes them to use gtest assertions instead.
ghstack-source-id: 111690833

Test Plan: Tested the 2 PR's above this on devvm

Reviewed By: jiayisuse

Differential Revision: D22828156

fbshipit-source-id: 9e116cf2904e05ac0342a441e483501e00aad3dd
2020-09-09 17:55:25 -07:00
b69c28d02c Improving ModuleList indexing error msg (#43361)
Summary:
Follow-up to https://github.com/pytorch/pytorch/pull/41946/, suggesting enumeration of the module as an alternative if a user tries indexing into a ModuleList/Sequential with an index that is not an integer literal.
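
The suggested pattern, as a minimal sketch:

```python
import torch
import torch.nn as nn

class Stack(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(4, 4) for _ in range(3))

    def forward(self, x):
        # TorchScript cannot index a ModuleList with a non-constant value;
        # iterating (or enumerating) the list works instead.
        for layer in self.layers:
            x = layer(x)
        return x

scripted = torch.jit.script(Stack())
print(scripted(torch.randn(2, 4)).shape)
```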

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43361

Reviewed By: mrshenli

Differential Revision: D23602388

Pulled By: eellison

fbshipit-source-id: 51fa28d5bc45720529b3d45e92d367ee6c9e3316
2020-09-09 16:22:57 -07:00
c010ef7f0c use non-overflowing divide in cuda kernel util GET_BLOCKS (#44391)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43476.
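
A worked sketch of the overflow, using Python ints to emulate a 32-bit signed add (the exact formula used in GET_BLOCKS may differ; this just shows why dividing first is safer):

```python
INT32_MAX = 2**31 - 1
N, threads = INT32_MAX, 1024

naive = N + threads - 1                        # exceeds INT32_MAX...
wrapped = (naive + 2**31) % 2**32 - 2**31      # ...so a 32-bit add wraps
print(wrapped // threads)                      # negative, nonsense block count

safe = N // threads + (N % threads != 0)       # divide first: no overflow
print(safe)                                    # 2097152
```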

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44391

Reviewed By: mrshenli

Differential Revision: D23602424

Pulled By: walterddr

fbshipit-source-id: 40ed81547f933194ce5bf4a5bcebdb3434298bc1
2020-09-09 16:20:41 -07:00
ba6ddaf04c [pyper] export caffe2 bucketize GPU operator to pytorch
Summary: Exporting the Bucketize operator on CUDA. Also adding unit test.

Test Plan: buck test mode/dev-nosan caffe2/torch/fb/sparsenn:gpu_test -- test_bucketize

Differential Revision: D23581321

fbshipit-source-id: 7f21862984c04d840410b8718db93006f526938a
2020-09-09 16:08:53 -07:00
e0c65abd38 Revert D23568330: [pytorch][PR] Moves some of TestTorchMathOps to OpInfos
Test Plan: revert-hammer

Differential Revision:
D23568330 (a953a825cc)

Original commit changeset: 03e69fccdbfd

fbshipit-source-id: 04ec6843c5eb3c84ddf226dad0088172d9bed84d
2020-09-09 15:48:56 -07:00
fc51047af5 Small fixes in Dependency.cmake and run_test.py (#44414)
Summary:
Do not add gencode flags to NVCC_FLAGS twice: they are first added in `cmake/public/cuda.cmake`, so there is no need to do it again in `cmake/Dependencies.cmake`.
Copy `additional_unittest_args` before appending local options to it in the `run_test()` method.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44414

Reviewed By: seemethere

Differential Revision: D23605733

Pulled By: malfet

fbshipit-source-id: 782a0da61650356a978a892fb03c66cb1a1ea26b
2020-09-09 15:09:33 -07:00
b0bcdbb1ab [JIT] Support partially specified sizes/strides in IRParser (#44113)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44113

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23508149

Pulled By: Lilyjjo

fbshipit-source-id: b6b2d32109fae599bc5347dae742b67a2e4a0a49
2020-09-09 14:45:51 -07:00
3674264947 [quant] quantized path for ConstantPadNd (#43304)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43304

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23231946

Pulled By: z-a-f

fbshipit-source-id: 8c77f9a81f5a36c268467a190b5b954df0a8f5a4
2020-09-09 14:04:41 -07:00
032480d365 fix typo in embedding_bag_non_contiguous_weight test (#44382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44382

This is to fix a typo that was introduced in #44032.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23601316

Pulled By: glaringlee

fbshipit-source-id: 17d6de5900443ea46c7a6ee9c7614fe6f2d92890
2020-09-09 13:30:36 -07:00
a00d36b0e7 [PyTorch][Mobile] Insert the module name as name() to metadata dict if metadata doesn't contain "model_name" (#44400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44400

This diff does the identical thing as D23549149 (398409f072) does. A fix included for OSS CI: pytorch_windows_vs2019_py36_cuda10.1_test1
ghstack-source-id: 111679745

Test Plan:
- CI
- OSS CI

Reviewed By: xcheng16

Differential Revision: D23601050

fbshipit-source-id: 8ebdcd8fdc5865078889b54b0baeb397a90ddc40
2020-09-09 13:01:17 -07:00
24efd29d19 Check commutativity for computed dispatch table and add a test to check entries. (#44088)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44088

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23492793

Pulled By: ailzhang

fbshipit-source-id: 37502f2a8a4d755219b400fcbb029e49d6cdb6e9
2020-09-09 12:48:34 -07:00
48c47db8fe [NCCL] Add Environment Variable to guard Async Error Handling feature (#44163)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44163

In this PR, we introduce a new environment variable
(NCCL_ASYNC_ERROR_HANDLING), which guards the asynchronous error handling
feature. We intend to eventually turn this feature on by default for all users,
but this is a temporary measure so that the change in behavior from hanging to
crashing does not become the default for users all of a sudden.
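
Opting in, as a minimal sketch (the variable is presumably read when the NCCL process group is constructed, so it should be set before `init_process_group`):

```python
import os

os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"   # opt in to async error handling

import torch.distributed as dist
# dist.init_process_group("nccl", ...)          # errored/timed-out collectives
                                                # now raise instead of hanging
```
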
ghstack-source-id: 111637788

Test Plan:
CI/Sandcastle. We will turn on this env var by default in
torchelastic and HPC trainer soon.

Reviewed By: jiayisuse

Differential Revision: D23517895

fbshipit-source-id: e7cd244b2ddf2dc0800ff7df33c73a6f00b63dcc
2020-09-09 12:26:25 -07:00
211ece7267 [NCCL] ProcessGroupNCCL Destructor Blocks on WorkNCCL Completion (#41054)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41054

**This Commit:**
ProcessGroupNCCL destructor now blocks until all WorkNCCL objects have either been aborted or completed and removed from the work vector.

**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.

ghstack-source-id: 111614314

Test Plan:
1. **DDP Sanity Check**: First we have a sanity check based on the PyTorch DDP benchmark. This verifies that the baseline DDP training with NCCL for  standard CU workloads works well (esp. with standard models like Resnet50 and BERT). Here is a sample Flow: f213293473

1. **HPC Performance Benchmarks**: This stack has undergone thorough testing and profiling on the Training Cluster with varying number of nodes. This introduces 1-1.5% QPS regression only (~200-400 QPS regression for 8-64 GPUs).

1. **HPC Accuracy Benchmarks**: We've confirmed NE parity with the existing NCCL/DDP stack without this change.

1. **Kernel-Specific Benchmarks**: We have profiled other approaches for this system (such as cudaStreamAddCallback) and performed microbenchmarks to confirm the current solution is optimal.

1. **Sandcastle/CI**: Apart from the recently fixed ProcessGroupNCCL tests, we will also introduce a new test for desynchronization scenarios.

Reviewed By: jiayisuse

Differential Revision: D22054298

fbshipit-source-id: 2b95a4430a4c9e9348611fd9cbcb476096183c06
2020-09-09 12:26:22 -07:00
afbf2f140b [NCCL] WorkNCCL Helper Functions (#41053)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41053

**This Commit:**
Some minor refactoring - added helper to check if `WorkNCCL` objects have timed out. Adding a new finish function to ProcessGroupNCCL::WorkNCCL that avoids notifying CV and uses `lock_guard`. Also renaming the timeoutCVMutex mutex to be more descriptive.

**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.

ghstack-source-id: 111614315

Test Plan: See D22054298 for verification of correctness and performance

Reviewed By: jiayisuse

Differential Revision: D21943520

fbshipit-source-id: b27ee329f0da6465857204ee9d87953ed6072cbb
2020-09-09 12:26:18 -07:00
f8f7b7840d [NCCL] Abort Errored and Timed Out NCCL Communicators from Watchdog Thread (#41052)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41052

**This Commit:**
The watchdog thread checks for errored or timed-out `WorkNCCL` objects and aborts all associated NCCL communicators. For now, we also process these aborted communicators as with the existing watchdog logic (by adding them to abortedCommIds and writing aborted communicator ids to the store).

**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.

ghstack-source-id: 111614313

Test Plan: See D22054298 for verification of correctness and performance

Reviewed By: jiayisuse

Differential Revision: D21943151

fbshipit-source-id: 337bfcb8af7542c451f1e4b3dcdfc5870bdec453
2020-09-09 12:26:15 -07:00
4e5c55ef69 [NCCL] Use cudaEventQuery to Poll for GPU operation errors (#41051)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41051

**This Commit:**
In the workCleanupThread, we process completion and exception handling for WorkNCCL objects corresponding to collective calls that have either completed GPU execution or have already thrown an exception. This way, we throw an exception from the workCleanupThread for failed GPU operations. This approach replaces the previous (lower-performance) approach of enqueuing a callback on the CUDA stream to process failures.

**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.

ghstack-source-id: 111614319

Test Plan: See D22054298 for verification of correctness and performance

Reviewed By: jiayisuse

Differential Revision: D21938498

fbshipit-source-id: df598365031ff210afba57e0c7be865e3323ca07
2020-09-09 12:26:12 -07:00
1df24fd457 [NCCL] Timeout Loop Thread for Async Error Handling (#41050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41050

**This Commit:**
We introduce a workVector to track live workNCCL objects corresponding to collective operations. Further, we introduce a workCleanupLoop, which busy-polls the vector of workNCCL objects and removes them upon completion.

**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.

Test Plan: See D22054298 for verification of correctness and performance

Reviewed By: jiayisuse

Differential Revision: D21916637

fbshipit-source-id: f8cadaab0071aaad1c4e31f9b089aa23cba0cfbe
2020-09-09 12:25:06 -07:00
15cbd1cf4b Preserve .ninja_log in build artifacts (#44390)
Summary:
Helpful for later analysis of build-time trends.
Also, save .whl files out of the regular Linux build job.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44390

Reviewed By: walterddr

Differential Revision: D23602049

Pulled By: malfet

fbshipit-source-id: 4d55c9aa2d161a7998ad991a3da0436da83f70ad
2020-09-09 12:19:46 -07:00
ef4475f902 [Reland] Optimize code path for adaptive_avg_pool2d when output size is (1, 1) (#44211)
Summary:
Reland of https://github.com/pytorch/pytorch/issues/43986

DO NOT MERGE YET. XLA failure seems real.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44211

Reviewed By: mrshenli

Differential Revision: D23590505

Pulled By: ngimel

fbshipit-source-id: 6ee516b0995bfff6efaf740474c82cb23055d274
2020-09-09 12:08:14 -07:00
37093f4d99 Benchmarks: make fuser and executor configurable from command line. (#44291)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44291

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23569089

Pulled By: ZolotukhinM

fbshipit-source-id: ec25b2f0bba303adaa46c3e85b1a9ce4fa3cf076
2020-09-09 11:59:35 -07:00
364d03a67c Misc. FakeLowP OSS cleanup (#44331)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44331

Summary of issues (from Tal Cherckez), just to have a clear list:
* `std::clamp` forces the user to use C++17
* using `settings` without `given` fails the test
* avoid using `max_examples` for tests

(Note: this ignores all push blocking failures!)

Test Plan: https://www.internalfb.com/intern/testinfra/testconsole/testrun/6192449509073222/

Reviewed By: hyuen

Differential Revision: D23581440

fbshipit-source-id: fe9fbc341f8fca02352f531cc622fc1035d0300c
2020-09-09 11:53:43 -07:00
758c2b96f5 BUG: make cholesky_solve_out do broadcast, error checking (#43137)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42695

test, fix `cholesky_solve_out` to use error checking and broadcasting from `cholesky_solve`. Test segfaults before, passes after the fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43137

Reviewed By: izdeby

Differential Revision: D23568589

Pulled By: malfet

fbshipit-source-id: 41b67ba964b55e59f1897eef0d96e0f6e1725bef
2020-09-09 11:38:36 -07:00
683380fc91 Use compile time cudnn version if linking with it statically (#44402)
Summary:
This should prevent torch_python from linking the entire cudnn library statically just to query its version

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44402

Reviewed By: seemethere

Differential Revision: D23602720

Pulled By: malfet

fbshipit-source-id: 185b15b789bd48b1df178120801d140ea54ba569
2020-09-09 11:33:41 -07:00
6ec8fabc29 Fix frac in CUDA fuser (#44152)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44152

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23528506

fbshipit-source-id: bfd468d72fa55ce317f88ae83e1f2d5eee041aa0
2020-09-09 11:10:08 -07:00
350130a69d Prevent the TE fuser from getting datatypes it can't handle (#44160)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44160

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D23528508

Pulled By: bertmaher

fbshipit-source-id: 03b22725fb2666f441cb504b35397ea6d155bb85
2020-09-09 11:10:04 -07:00
960c088a58 [te] Fix casting of unsigned char, and abs(int) (#44157)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44157

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D23528507

Pulled By: bertmaher

fbshipit-source-id: c5ef0422a91a4665b616601bed8b7cd137be39f9
2020-09-09 11:08:36 -07:00
7c464eed16 Skipping CUDA tests in ProcessGroupGloo and logs (#42488)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42488

Currently, ProcessGroupGloo tests do not emit logs if the test was
skipped due to CUDA not being available or there not being enough CUDA devices. This PR clarifies
the reason for skipping through these logs.
ghstack-source-id: 111638111

Test Plan: tested on devvm and devgpu

Reviewed By: jiayisuse

Differential Revision: D22879396

fbshipit-source-id: d483ca46b5e22ed986521262c11a1c6dbfbe7efd
2020-09-09 10:52:52 -07:00
2a87742ffa Autocast wrappers for RNN cell apis (#44296)
Summary:
Should fix https://github.com/pytorch/pytorch/issues/42605.
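For reference, a minimal usage sketch of the path this PR covers (device and shapes are illustrative assumptions, not from the PR):

```python
import torch

# An RNN cell exercised under autocast (assumes a CUDA device is available).
cell = torch.nn.LSTMCell(16, 32).cuda()
x = torch.randn(8, 16, device="cuda")
hx = torch.randn(8, 32, device="cuda")
cx = torch.randn(8, 32, device="cuda")
with torch.cuda.amp.autocast():
    # the new wrappers let the cell run under autocast with consistent dtypes
    hx, cx = cell(x, (hx, cx))
```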

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44296

Reviewed By: izdeby

Differential Revision: D23580447

Pulled By: ezyang

fbshipit-source-id: 86027b693fd2b648f043ab781b84ffcc1f72854d
2020-09-09 09:44:59 -07:00
a953a825cc Moves some of TestTorchMathOps to OpInfos (#44277)
Summary:
This PR fixes three OpInfo-related bugs and moves some functions from TestTorchMathOps to be tested using the OpInfo pattern. The bugs are:

- A skip test path in test_ops.py incorrectly formatted its string argument
- Decorating the tests in common_device_type.py was incorrectly always applying decorators to the original test, not the op-specific variant of the test. This could cause the same decorator to be applied multiple times, overriding past applications.
- make_tensor was incorrectly constructing tensors in some cases

The functions moved are:

- asin
- asinh
- sinh
- acosh
- tan
- atan
- atanh
- tanh
- log
- log10
- log1p
- log2

In a follow-up PR, most or all of the remaining functions in TestTorchMathOps will be refactored as OpInfo-based tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44277

Reviewed By: ngimel

Differential Revision: D23568330

Pulled By: mruberry

fbshipit-source-id: 03e69fccdbfd560217c34ce4e9a5f20e10d05a5e
2020-09-09 09:41:03 -07:00
f044b17ae2 Disable a test (#44348)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44348

Reviewed By: mrshenli

Differential Revision: D23592524

Pulled By: Krovatkin

fbshipit-source-id: 349057606ce39dd5de24314c9ba8f40516d2ae1c
2020-09-09 08:36:19 -07:00
cfd3620b76 Don't use VCOMP if Intel OMP is used (#44280)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44096.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44280

Reviewed By: malfet

Differential Revision: D23568557

Pulled By: ezyang

fbshipit-source-id: bd627e497a9f71be9ba908852bf3ae437b1a5c94
2020-09-09 08:12:34 -07:00
d23f3170ef Remove pybind11 from required submodules (#44278)
Summary:
pybind11 can be taken from the system, in which case the submodule is not used. Hence the check here limits the usage unnecessarily.

ccing malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44278

Reviewed By: malfet

Differential Revision: D23568552

Pulled By: ezyang

fbshipit-source-id: 7fd2613251567f649b12eca0b1fe7663db9cb58d
2020-09-09 08:07:13 -07:00
8acce55015 Dump optimized graph when logging in already-optimized PE (#44315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44315

I find it more intuitive to dump the optimized graph if we have one;
when I first saw the unoptimized graph being dumped I thought we had failed to
apply any optimizations.

Test Plan: Observe output by hand

Reviewed By: Lilyjjo

Differential Revision: D23578813

Pulled By: bertmaher

fbshipit-source-id: e2161189fb0e1cd53aae980a153aea610871662a
2020-09-09 01:28:48 -07:00
7a64b0c27a Export Node::isBefore/isAfter for PythonAPI (#44162)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44162

This diff exports the Node::isBefore/isAfter methods to the Python API.

Test Plan: Tested locally. Please let me know if there is a set of unit tests to be passed.

Reviewed By: soumith

Differential Revision: D23514448

fbshipit-source-id: 7ef709b036370217ffebef52fd93fbd68c464e89
2020-09-09 00:57:08 -07:00
135ebbde6d [Caffe2] Add RMSNormOp (#44338)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44338

Add RMSNormOp in Caffe2
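
For reference, a minimal sketch of the standard RMSNorm computation; the op's exact signature and epsilon are not spelled out here, so treat the names below as assumptions:

```python
import torch

def rms_norm(x, gamma, beta, eps=1e-6):
    # normalize by the root-mean-square over the last dimension, then affine-transform
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * gamma + beta
```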

Test Plan: buck test mode/dev-nosan //caffe2/caffe2/python/operator_test:rms_norm_op_test

Reviewed By: houseroad

Differential Revision: D23546424

fbshipit-source-id: 8f3940a0bb42230bfa647dc66b5e359cc84491c6
2020-09-08 23:50:44 -07:00
106459acac Rename test_distributed to test_distributed_fork (#42932)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42932

Follow up from https://github.com/pytorch/pytorch/pull/41769, rename `test_distributed` to `test_distributed_fork` to make it explicit that it forks.

New command to run test:
`python test/run_test.py -i distributed/test_distributed_fork -v`
ghstack-source-id: 111632568

Test Plan: `python test/run_test.py -i distributed/test_distributed_fork -v`

Reviewed By: izdeby

Differential Revision: D23072201

fbshipit-source-id: 48581688b6c5193a309e803c3de38e70be980872
2020-09-08 23:13:37 -07:00
b22abbe381 Enable test_distributed to work with spawn mode (#41769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41769

Currently the tests in `test_distributed` only work with the `fork` mode multiprocessing, this PR introduces support for `spawn` mode multiprocessing as well (while keeping the `fork` mode intact).

Motivations for the change:
1) Spawn multiprocessing is the default on MacOS, so it better emulates how MacOS users would use distributed
2) With python 3.8+, spawn is the default on linux, so we should have test coverage for this
3) PT multiprocessing suggests using spawn/forkserver over fork, for sharing cuda tensors: https://pytorch.org/docs/stable/multiprocessing.html
4) Spawn is better supported with respect to certain sanitizers such as TSAN, so adding this sanitizer coverage may help us uncover issues.

How it is done:
1) Move `test_distributed` tests in `_DistTestBase` class to a shared file `distributed_test` (similar to how the RPC tests are structured)
2) For `Barrier`, refactor the setup of temp directories, as the current version did not work with spawn: each process would get a different randomly generated directory and thus would write to different barriers.
3) Add all the relevant builds to run internally and in OSS.
Running test_distributed with spawn mode in OSS can be done with:
`python test/run_test.py -i distributed/test_distributed_spawn -v`

Reviewed By: izdeby

Differential Revision: D22408023

fbshipit-source-id: e206be16961fd80438f995e221f18139d7e6d2a9
2020-09-08 23:11:12 -07:00
1d01fcdc24 [quant] fill_ path for quantized tensors (#43303)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43303

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23231947

Pulled By: z-a-f

fbshipit-source-id: fd5110ff15a073f326ef590436f8c6e5a2608324
2020-09-08 21:34:06 -07:00
4aacfab221 Resolve Autograd key for disable_variable_dispatch flag. (#44268)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44268

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23561042

Pulled By: ailzhang

fbshipit-source-id: 6f35cd9a543bea3f9e294584f1db7c3622ebb741
2020-09-08 21:27:52 -07:00
ecc6358dbe Port nonzero cuda from THC to ATen (#44259)
Summary:
1) Ports nonzero from THC to ATen
2) replaces most thrust uses with cub, to avoid synchronization and to improve performance. There is still one necessary synchronization point, communicating number of nonzero elements from GPU to CPU
3) slightly changes algorithm, now we first compute the number of nonzeros, and then allocate correct-sized output, instead of allocating full-sized output as was done before, to account for possibly all elements being non-zero
4) unfortunately, since the last transforms are still done with thrust, 2) is slightly beside the point, however it is a step towards a future without thrust
4) hard limits the number of elements in the input tensor to MAX_INT. Previous implementation allocated a Long tensor with the size ndim*nelements, so that would be at least 16 GB for a tensor with MAX_INT elements. It is reasonable to say that larger tensors could not be used anyway.
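
For intuition, a minimal 1-D sketch of the count-then-allocate approach, using a prefix sum for compaction roughly the way a cub-style select does (plain PyTorch stands in for the CUDA kernels):

```python
import torch

def nonzero_1d_two_pass(flat):
    mask = flat != 0
    n = int(mask.sum())                        # pass 1: count (the one GPU->CPU sync)
    out = torch.empty(n, dtype=torch.long, device=flat.device)
    slot = torch.cumsum(mask.long(), 0) - 1    # prefix sum assigns each nonzero its output slot
    src = torch.arange(flat.numel(), device=flat.device)
    out[slot[mask]] = src[mask]                # pass 2: scatter indices into the exact-sized output
    return out

assert torch.equal(nonzero_1d_two_pass(torch.tensor([0., 3., 0., 5.])),
                   torch.tensor([1, 3]))
```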

Benchmarking is done for tensors with approximately half non-zeros
<details><summary>Benchmarking script</summary>
<p>

```
import torch
from torch.utils._benchmark import Timer
from torch.utils._benchmark import Compare
import sys

device = "cuda"
results = []
for numel in (1024 * 128,):#, 1024 * 1024, 1024 * 1024 * 128):
    inp = torch.randint(2, (numel,), device="cuda", dtype=torch.float)
    for ndim in range(2,3):#(1,4):
        if ndim == 1:
            shape = (numel,)
        elif ndim == 2:
            shape = (1024, numel // 1024)
        else:
            shape = (1024, 128, numel // 1024 // 128)
        inp = inp.reshape(shape)
        repeats = 3
        timer = Timer(stmt="torch.nonzero(inp, as_tuple=False)", label="Nonzero", sub_label=f"number of elts {numel}",
        description = f"ndim {ndim}", globals=globals())
        for i in range(repeats):
            results.append(timer.blocked_autorange())
        print(f"\rnumel {numel} ndim {ndim}", end="")
        sys.stdout.flush()

comparison = Compare(results)
comparison.print()
```
</p>
</details>

### Results
Before:
```
[--------------------------- Nonzero ---------------------------]
                                 |  ndim 1  |   ndim 2  |   ndim 3
 1 threads: ------------------------------------------------------
       number of elts 131072     |    55.2  |     71.7  |     90.5
       number of elts 1048576    |   113.2  |    250.7  |    497.0
       number of elts 134217728  |  8353.7  |  23809.2  |  54602.3

 Times are in microseconds (us).
```
After:
```
[-------------------------- Nonzero --------------------------]
                                |  ndim 1  |  ndim 2  |  ndim 3
1 threads: ----------------------------------------------------
      number of elts 131072     |    48.6  |    79.1  |    90.2
      number of elts 1048576    |    64.7  |   134.2  |   161.1
      number of elts 134217728  |  3748.8  |  7881.3  |  9953.7

Times are in microseconds (us).

```
There's a real regression for smallish 2D tensors due to the added work of computing the number of nonzero elements; however, for other sizes there are significant gains, and memory requirements are drastically lower. Perf gains would be even larger for tensors with fewer nonzeros.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44259

Reviewed By: izdeby

Differential Revision: D23581955

Pulled By: ngimel

fbshipit-source-id: 0b99a767fd60d674003d83f0848dc550d7a363dc
2020-09-08 20:52:51 -07:00
bd8e38cd88 [TensorExpr] Fuser: check node inputs' device before merging the node into a fusion group. (#44241)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44241

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23554192

Pulled By: ZolotukhinM

fbshipit-source-id: fb03262520303152b83671603e08e7aecc24f5f2
2020-09-08 19:32:23 -07:00
646ffd4886 [quant] Move EmbeddingBag eager quantization to static (#44217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44217

Move the tests to static ones as well

Test Plan:
python test/test_quantization.py TestStaticQuantizedModule.test_embedding_bag_api

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23547386

fbshipit-source-id: 41f81c31e1613098ecf6a7eff601c7dcd4b09c76
2020-09-08 19:05:02 -07:00
57b87aaf59 [quant] Add quantized Embedding module (#44208)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44208

Add quantized module in static quantization namespace. Embedding
quantization requires only weights to be quantized so it is static.
Internally it calls the embedding_bag_byte op with the offsets set corresponding to the
indices.

Future PR will move EmbeddingBag quantization from dynamic to static as well.

Test Plan:
python test/test_quantization.py test_embedding_api

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23547384

fbshipit-source-id: eddc6fb144b4a771060e7bab5853656ccb4443f0
2020-09-08 19:04:59 -07:00
6013a29fc0 [quant] Support quantization of embedding lookup operators (#44207)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44207

Use the existing embedding_bag operator, but set offsets to [0, 1, ..., len(indices) - 1] so that each index forms its own bag.
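
A sketch of the trick: giving every index its own bag makes embedding_bag behave like a per-row embedding lookup (values below are illustrative):

```python
import torch
import torch.nn.functional as F

weight = torch.randn(10, 8)
indices = torch.tensor([3, 1, 4, 1, 5])
offsets = torch.arange(indices.numel())  # [0, 1, ..., len(indices) - 1]: one index per bag
# each "bag" holds exactly one index, so the sum-mode result equals weight[indices]
out = F.embedding_bag(indices, weight, offsets, mode="sum")
assert torch.allclose(out, weight[indices])
```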

Test Plan:
python test/test_quantization.py TestEmbeddingOps.test_embedding_byte

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23547385

fbshipit-source-id: ccce348bc192c6a4a65a8eca4c8b90f99f40f1b1
2020-09-08 19:03:59 -07:00
f27be2f781 [caffe2] fix wrong comment (#42735)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42735

We use reduced precision only for embedding table (not for momentum) in RowWiseSparseAdagrad

Test Plan: .

Reviewed By: jianyuh

Differential Revision: D23003939

fbshipit-source-id: 062290d94b160100bc4c2f48b797833819f8e88a
2020-09-08 18:54:24 -07:00
f9146b4598 fix lint (#44346)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44346

Reviewed By: jamesr66a

Differential Revision: D23589324

Pulled By: eellison

fbshipit-source-id: a4e22b69196909ec200ac3e262f04d2aaf78e9cf
2020-09-08 18:29:44 -07:00
6269b6e0f0 [quant][graphmode][fx][api] Call fuse in prepare (#43984)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43984

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23459261

fbshipit-source-id: 6b56b0916d76df67b9cc2f4be1fcee905d604019
2020-09-08 18:09:26 -07:00
be94dba429 [NNC] fix support for FP16 in CudaCodgen (#44209)
Summary:
Fixes a bug where FP16 values could be incorrectly cast to a half type that doesn't have a cast operator. The fix inserts the CUDA-specific cast to float during handling of the Cast node, rather than as a wrapper around printing Loads and Stores. Two main changes: the HalfChecker now inserts the casts to float explicitly in the IR, and the PrioritizeLoad mutator now consumes both Loads and a Cast which immediately precedes a load.

Tested with test_jit_fuser_te.py and test_tensorexpr.py, plus C++ tests obv.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44209

Reviewed By: izdeby

Differential Revision: D23575577

Pulled By: nickgg

fbshipit-source-id: 808605aeb2af812758f96f9fdc11b07e08053b46
2020-09-08 18:00:39 -07:00
9f54bcc522 [quant][graphmode][fx] Support inplace option (#43983)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43983

Support the inplace option in the APIs

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23459260

fbshipit-source-id: 80409c7984f17d1a4e13fb1eece8e18a69ee43b3
2020-09-08 17:39:13 -07:00
0351d31722 add rocm nightly build (#44250)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44250

Reviewed By: izdeby

Differential Revision: D23585431

Pulled By: walterddr

fbshipit-source-id: c798707f5cb55f720e470bc40f30ab82718e0ddf
2020-09-08 17:09:32 -07:00
40d138f7c1 Added alpha overloads for add/sub ops with lists (#43413)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43413

Test Plan: Imported from OSS

Reviewed By: cpuhrsch

Differential Revision: D23331896

Pulled By: izdeby

fbshipit-source-id: 2e7484339fec533e21224f18979fddbeca649d2c
2020-09-08 17:02:08 -07:00
00b5bd536f fx quant: add docblocks to _find_matches and _find_quants (#43928)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43928

Improving readability, no logic change.

Test Plan:
CI

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23440249

fbshipit-source-id: a7ebfc7ad15c73e26b9a94758e7254413cc17d29
2020-09-08 16:13:11 -07:00
6dd53fb58d [fix] output of embedding_bag with non-contiguous weight (#44032)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43723

Use `weight.contiguous()` on the fast path, as it expects a contiguous tensor.

TODO:
* [x] Add tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44032

Reviewed By: izdeby

Differential Revision: D23502200

Pulled By: glaringlee

fbshipit-source-id: 4a7b546b3e8b1ad35c287a634b4e990a1ccef874
2020-09-08 16:07:13 -07:00
43e38d60d6 [quant][graphmode][fx] Support quantize per channel in all cases (#44042)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44042

Missed one case last time

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23479345

fbshipit-source-id: 30e6713120c494e9fab5584de4df9b25bec83d32
2020-09-08 15:45:14 -07:00
49e979bfde Set default compiler differently according to platform (#43890)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43890

1. auto-detect the default compiler type from `CXX` in OSS, and use `clang` as the default compiler type in fbcode (because auto-detection would report `gcc` as the default compiler on a devserver).

2. change `compiler type` from the strings `"CLANG"`/`"GCC"` to an enum type
3. rename the function `get_cov_type` to `detect_compiler_type`
4. auto-set the default pytorch folder for users in OSS

Test Plan:
on devserver:
```
buck run :coverage //caffe2/c10:
```

on oss:
```
python oss_coverage.py --run-only=atest
```

Reviewed By: malfet

Differential Revision: D23420034

fbshipit-source-id: c0ea88188578bb1343a286f2090eb8a74cdf3982
2020-09-08 14:57:35 -07:00
1fcccd6a18 [FX] Minor fixups in Graph printout (#44214)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44214

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D23545501

Pulled By: jamesr66a

fbshipit-source-id: dabb3b051ed4da213b2087979ade8a649288bd5d
2020-09-08 14:45:32 -07:00
47ac9bb105 Enable temp disabled tests in test_jit_fuser_te.py (#44222)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44222

Reviewed By: izdeby

Differential Revision: D23582214

Pulled By: Krovatkin

fbshipit-source-id: 27caa3ea02ce10b163212f6a45a81b446898953d
2020-09-08 14:40:32 -07:00
54931ebb7b Release saved variable from DifferentiableGraphBackward (#42994)
Summary:
When backward ops execute via the autograd engine's evaluate_function(), fn.release_variables() is called to release the SavedVariables. For eager-mode ops, this releases the saved inputs that were required for the backward grad function. However, with TorchScript we get a DifferentiableGraph, and DifferentiableGraphBackward() doesn't implement release_variables(). This causes the SavedVariables to stay alive longer. Implement release_variables() for DifferentiableGraphBackward to release these SavedVariables early.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42994

Reviewed By: izdeby

Differential Revision: D23503172

Pulled By: albanD

fbshipit-source-id: d87127498cfa72883ae6bb31d0e6c7056c4c36d4
2020-09-08 14:36:52 -07:00
63d62d3e44 Skips test_addcmul_cuda if using ROCm (#44304)
Summary:
This test is failing consistently on linux-bionic-rocm3.7-py3.6-test2. Relevant log snippet:

```
03:43:11 FAIL: test_addcmul_cuda_float16 (__main__.TestForeachCUDA)
03:43:11 ----------------------------------------------------------------------
03:43:11 Traceback (most recent call last):
03:43:11   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 818, in wrapper
03:43:11     method(*args, **kwargs)
03:43:11   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 258, in instantiated_test
03:43:11     result = test(self, *args)
03:43:11   File "test_foreach.py", line 83, in test_addcmul
03:43:11     self._test_pointwise_op(device, dtype, torch._foreach_addcmul, torch._foreach_addcmul_, torch.addcmul)
03:43:11   File "test_foreach.py", line 58, in _test_pointwise_op
03:43:11     self.assertEqual(tensors, expected)
03:43:11   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1153, in assertEqual
03:43:11     exact_dtype=exact_dtype, exact_device=exact_device)
03:43:11   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1127, in assertEqual
03:43:11     self.assertTrue(result, msg=msg)
03:43:11 AssertionError: False is not true : Tensors failed to compare as equal! With rtol=0.001 and atol=1e-05, found 10 element(s) (out of 400) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.00048828125 (-0.46484375 vs. -0.46533203125), which occurred at index (11, 18).
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44304

Reviewed By: malfet, izdeby

Differential Revision: D23578316

Pulled By: mruberry

fbshipit-source-id: 558eecf42677383e7deaa4961e12ef990ffbe28c
2020-09-08 13:14:25 -07:00
de89261abe Reduce sccache log levels for RocM to a default state (#44310)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44310

Reviewed By: walterddr

Differential Revision: D23576966

Pulled By: malfet

fbshipit-source-id: c7fa063ec2be92de8f3768aaa3e6a032913004f7
2020-09-08 12:55:23 -07:00
477f489137 Don't register a fallback for private use to let extensions do it themselves (#44149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44149

Thanks Christian Puhrsch for reporting.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D23574739

Pulled By: ezyang

fbshipit-source-id: 8c9d0d78e6970139e0103cd1e0004b743e3c7f9e
2020-09-08 12:30:26 -07:00
caf23d110f [JIT] Unshare types for modules that define() in __init__ (#44233)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44233

**Summary**
By default, scripting tries to share concrete and JIT types across
compilations. However, this can lead to incorrect results if a module
extends `torch.jit.ScriptModule`, and injects instance variables into
methods defined using `define`.

This commit detects when this has happened and disables type sharing
for the compilation of the module that uses `define` in `__init__`.

**Test Plan**
This commit adds a test to TestTypeSharing that tests this scenario.

**Fixes**
This commit fixes #43580.
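
A minimal sketch of the problematic pattern (the module and values below are assumptions for illustration):

```python
import torch

class M(torch.jit.ScriptModule):
    def __init__(self, bias):
        super().__init__()
        # injects an instance-specific value into a method created via define()
        self.define(f"def forward(self, x):\n    return x + {bias}\n")

a, b = M(1.0), M(2.0)
x = torch.zeros(3)
# with incorrect type sharing, b could reuse a's compiled forward; after the fix they differ
assert not torch.equal(a(x), b(x))
```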

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23553870

Pulled By: SplitInfinity

fbshipit-source-id: d756e87fcf239befa0012998ce29eeb25728d3e1
2020-09-08 12:16:45 -07:00
4e0ac120e9 [FX] Only copy over training attr if it's there (#44314)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44314

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Differential Revision: D23578189

Pulled By: jamesr66a

fbshipit-source-id: fb7643f28582bd5009a826663a937fbe188c50bc
2020-09-08 11:50:08 -07:00
fd8e2064e0 quant: switch observers to use min_max (#42957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42957

Switches observers to use the new min_max function to calculate
min and max at the same time.  We see around 45-50% speedup on
representative input shapes on the microbenchmarks for all observers except `HistogramObserver`.

Test Plan:
CI for correctness

performance:
```
cd benchmarks/operator_benchmark
// repeat (before diff, after diff) x (cpu, cuda)
python -m pt.qobserver_test --tag_filter all --device cpu
/*
    * before, cpu: https://our.intern.facebook.com/intern/paste/P138633280/
    * before, cuda: https://our.intern.facebook.com/intern/paste/P138639473/
    * after, cpu: https://our.intern.facebook.com/intern/paste/P138635458/
    * after, cuda: https://our.intern.facebook.com/intern/paste/P138636344/
*/
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D23093995

fbshipit-source-id: 9f416d144109b5b80baf089eb4bcfabe8fe358d5
2020-09-08 11:39:44 -07:00
de980f937b skip test_tanhquantize for now (#44312)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44312

This test is failing now when running on the card. Let's disable it while Intel is investigating the issue.

Test Plan: Sandcastle

Reviewed By: hyuen

Differential Revision: D23577475

fbshipit-source-id: 84f957c69ed75e0e0f563858b8b8ad7a2158da4e
2020-09-08 11:21:41 -07:00
8d212d3f7a add 'run_duration' stats for binary builds to scuba (#44251)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44251

Reviewed By: seemethere

Differential Revision: D23575312

Pulled By: walterddr

fbshipit-source-id: 29d737f5bee1540d6595d4d0ca1386b9ce5ab2ee
2020-09-08 11:13:00 -07:00
1130de790c Automated submodule update: FBGEMM (#44177)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: d5ace7ca70

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44177

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: dskhudia

Differential Revision: D23533561

fbshipit-source-id: 9e580f8dbfb83e57bebc28f8e459caa0c5fc7317
2020-09-08 10:12:21 -07:00
5de805d8a7 [dper3] Export Caffe2 operator LearningRate to PyTorch
Summary: Exports the operator to PyTorch, to be made into a low-level module.

Test Plan:
```
buck test //caffe2/caffe2/python/operator_test:torch_integration_test -- test_learning_rate
```

Reviewed By: yf225

Differential Revision: D23545582

fbshipit-source-id: 6b6d9aa6a47b2802ccef0f87c1263c6cc2d2fdf6
2020-09-08 08:50:09 -07:00
cce5982c4c Add unary ops: exp and sqrt (#42537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42537

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient in cases when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process. We need to reduce the number of kernels that we start.
As an example, we should be looking at [NVIDIAs Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What is 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can fall back to a regular for-loop where the op is applied to each tensor individually through the dispatch mechanisms. There are a few checks that decide whether the op will be performed via the 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.

----------------
**In this PR**
Adding APIs:
```
torch._foreach_exp(TensorList tl1)
torch._foreach_exp_(TensorList tl1)
torch._foreach_sqrt(TensorList tl1)
torch._foreach_sqrt_(TensorList tl1)
```
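
A minimal usage sketch (the out-of-place variants return a new list; the trailing-underscore variants mutate each tensor in place):

```python
import torch

tensors = [torch.rand(5) for _ in range(10)]  # same dtype, device and size, per the restrictions
outs = torch._foreach_exp(tensors)            # returns a new list of tensors
torch._foreach_sqrt_(tensors)                 # mutates every tensor in the list
```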

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors

**Plan for the next PRs**
1. APIs
- Pointwise Ops

2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS

Reviewed By: cpuhrsch

Differential Revision: D23331889

Pulled By: izdeby

fbshipit-source-id: 8b04673b8412957472ed56361954ca3884eb9376
2020-09-07 19:57:34 -07:00
6134ac17ba Revert D23561500: Benchmarks: re-enable profiling-te configuration (try 2).
Test Plan: revert-hammer

Differential Revision:
D23561500 (589a2024c8)

Original commit changeset: 7fe86d34afa4

fbshipit-source-id: 10e48f230402572fcece56662ad4413ac0bd3cb5
2020-09-07 19:10:30 -07:00
7c61f57bec test_ops: skipTest only takes a single argument (#44181)
Summary:
Fixes a broken skipTest from https://github.com/pytorch/pytorch/issues/43451, e.g. in the ROCm CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44181

Reviewed By: ngimel

Differential Revision: D23568608

Pulled By: malfet

fbshipit-source-id: 557048bd5f0086ffac38d1c48255badb63869899
2020-09-07 18:32:59 -07:00
0e64b02912 FindCUDA error handling (#44236)
Summary:
Check the return code of `nvcc --version` and, if it's not zero, print a warning and mark CUDA as not found.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44236

Test Plan: Run `CUDA_NVCC_EXECUTABLE=/foo/bar cmake ../`

Reviewed By: ezyang

Differential Revision: D23552336

Pulled By: malfet

fbshipit-source-id: cf9387140a8cdbc8dab12fcc4bfaf55ae8e6a502
2020-09-07 18:17:55 -07:00
5d748e6d22 [TensorExpr] Re-enable tests. (#44218)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44218

Differential Revision: D23546100

Test Plan: Imported from OSS

Reviewed By: ngimel

Pulled By: ZolotukhinM

fbshipit-source-id: 4c4c5378ec9891ef72b60ffb59081a009e0df049
2020-09-07 15:52:03 -07:00
589a2024c8 Benchmarks: re-enable profiling-te configuration (try 2). (#44270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44270

The previous PR (#44212) was reverted since I didn't update the
`upload_scribe.py` script and it was looking for 'executor_and_fuser'
field in the json which now is replaced with two separate fields:
'executor' and 'fuser'.

Differential Revision: D23561500

Test Plan: Imported from OSS

Reviewed By: ngimel

Pulled By: ZolotukhinM

fbshipit-source-id: 7fe86d34afa488a0e43d5ea2aaa7bc382337f470
2020-09-07 15:50:39 -07:00
10dd25dcd1 Add binary ops for _foreach APIs (#42536)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42536

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient in cases when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process. We need to reduce the number of kernels that we start.
As an example, we should be looking at [NVIDIAs Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What is 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can fall back to a regular for-loop where the op is applied to each tensor individually through the dispatch mechanisms. There are a few checks that decide whether the op will be performed via the 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.

----------------
**In this PR**
Adding APIs:
```
torch._foreach_sub(TensorList tl1, TensorList tl2)
torch._foreach_sub_(TensorList self, TensorList tl2)
torch._foreach_mul(TensorList tl1, TensorList tl2)
torch._foreach_mul_(TensorList self, TensorList tl2)
torch._foreach_div(TensorList tl1, TensorList tl2)
torch._foreach_div_(TensorList self, TensorList tl2)

torch._foreach_sub(TensorList tl1, Scalar scalar)
torch._foreach_sub_(TensorList self, Scalar scalar)
torch._foreach_mul(TensorList tl1, Scalar scalar)
torch._foreach_mul_(TensorList self, Scalar scalar)
torch._foreach_div(TensorList tl1, Scalar scalar)
torch._foreach_div_(TensorList self, Scalar scalar)
```
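
A minimal usage sketch of the list-list and list-scalar variants:

```python
import torch

a = [torch.rand(4) for _ in range(3)]
b = [torch.rand(4) for _ in range(3)]
c = torch._foreach_mul(a, b)   # elementwise per tensor pair; returns a new list
torch._foreach_sub_(a, 1.0)    # scalar variant, applied in place to each tensor in a
```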

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors

**Plan for the next PRs**
1. APIs
- Unary Ops for list
- Pointwise Ops

2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS

Reviewed By: cpuhrsch

Differential Revision: D23331891

Pulled By: izdeby

fbshipit-source-id: 18c5937287e33e825b2e391e41864dd64e226f19
2020-09-07 10:29:32 -07:00
626e410e1d Revert D23544563: Benchmarks: re-enable profiling-te configuration.
Test Plan: revert-hammer

Differential Revision:
D23544563 (ac1f471fe2)

Original commit changeset: 98659e8860fa

fbshipit-source-id: 5dab7044699f59c709e64d178758f5f462ebb788
2020-09-06 21:01:19 -07:00
1b2da9ed82 Expose alias key info in dumpState and update test_dispatch. (#44081)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44081

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23492794

Pulled By: ailzhang

fbshipit-source-id: 27a2978591900463bda2e92e0201c9fd719f9792
2020-09-06 18:43:05 -07:00
514f20ea51 Histogram Binning Calibration
Summary:
Adding a calibration module called histogram binning:

Divide the prediction range (e.g., [0, 1]) into B bins. In each bin, use two parameters to store the number of positive examples and the number of examples that fall into this bucket. So we basically have a histogram for the model prediction.

As a result, for each bin, we have a statistical value for the real CTR (num_pos / num_example). We use this statistical value as the final calibrated prediction if the pre-calibration prediction falls into the corresponding bin.

In this way, the predictions within each bin should be well-calibrated if we have sufficient examples. That is, we have a fine-grained calibrated model by this calibration module.

Theoretically, this calibration layer can fix any uncalibrated model or prediction if we have sufficient bins and examples. It provides the potential to use any kind of training weight allocation to our training data, without worrying about the calibration issue.
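
A minimal sketch of the idea in plain PyTorch; the bin count, names, and the empty-bin fallback are assumptions, not this module's exact behavior:

```python
import torch

B = 100                      # number of bins over the prediction range [0, 1]
bin_pos = torch.zeros(B)     # positives seen per bin
bin_cnt = torch.zeros(B)     # total examples seen per bin

def update(preds, labels):
    idx = (preds * B).long().clamp_(0, B - 1)
    bin_pos.index_add_(0, idx, labels.float())
    bin_cnt.index_add_(0, idx, torch.ones_like(preds))

def calibrate(preds):
    idx = (preds * B).long().clamp_(0, B - 1)
    ctr = bin_pos[idx] / bin_cnt[idx].clamp(min=1)    # per-bin empirical CTR
    return torch.where(bin_cnt[idx] > 0, ctr, preds)  # fall back when a bin is empty
```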

Test Plan:
buck test dper3/dper3/modules/calibration/tests:calibration_test -- test_histogram_binning_calibration

buck test dper3/dper3_models/ads_ranking/tests:model_paradigm_e2e_tests -- test_sparse_nn_histogram_binning_calibration

All tests passed.

Example workflows:
f215431958

{F326445092}

f215445048

{F326445223}

Reviewed By: chenshouyuan

Differential Revision: D23356450

fbshipit-source-id: c691b66c51ef33908c17575ce12e5bee5fb325ff
2020-09-06 17:11:16 -07:00
ac1f471fe2 Benchmarks: re-enable profiling-te configuration. (#44212)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44212

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23544563

Pulled By: ZolotukhinM

fbshipit-source-id: 98659e8860fa951d142e0f393731c4a769463c6c
2020-09-06 10:22:16 -07:00
bb861e1d69 Ports CUDA var and std reduce all (with no out argument) to ATen, fixes var docs (#43858)
Summary:
When var and std are called without args (other than unbiased) they currently call into TH or THC. This PR:

- Removes the THC var_all and std_all functions and updates CUDA var and std to use the ATen reduction
- Fixes var's docs, which listed its arguments in the incorrect order
- Adds new tests comparing var and std with their NumPy counterparts

Performance appears to have improved as a result of this change. I ran experiments on 1D tensors, 1D tensors with every other element viewed ([::2]), 2D tensors and 2D transposed tensors. Some notable datapoints:

- torch.randn((8000, 8000))
  - var measured 0.0022215843200683594s on CUDA before the change
  - var measured 0.0020322799682617188s on CUDA after the change
- torch.randn((8000, 8000)).T
  - var measured .015128850936889648 on CUDA before the change
  - var measured 0.001912832260131836 on CUDA after the change
- torch.randn(8000 ** 2)
  - std measured 0.11031460762023926 on CUDA before the change
  - std measured 0.0017833709716796875 on CUDA after the change

Timings for var and std are, as expected, similar.

On the CPU, however, the performance change from making the analogous update was more complicated, and ngimel and I decided not to remove CPU var_all and std_all. ngimel wrote the following script that showcases how single-threaded CPU inference would suffer from this change:

```
import torch
import numpy as np
from torch.utils._benchmark import Timer
from torch.utils._benchmark import Compare
import sys
base = 8
multiplier = 1

def stdfn(a):
    meanv = a.mean()
    ac = a-meanv
    return torch.sqrt(((ac*ac).sum())/a.numel())

results = []
num_threads=1
for _ in range(7):
    size = base*multiplier
    input = torch.randn(size)

    tasks = [("torch.var(input)", "torch_var"),
             ("torch.var(input, dim=0)", "torch_var0"),
             ("stdfn(input)", "stdfn"),
             ("torch.sum(input, dim=0)", "torch_sum0")
            ]
    timers = [Timer(stmt=stmt, num_threads=num_threads, label="Index", sub_label=f"{size}",
    description=label, globals=globals()) for stmt, label in tasks]
    repeats = 3

    for i, timer in enumerate(timers * repeats):
        results.append(
            timer.blocked_autorange()
        )
        print(f"\r{i + 1} / {len(timers) * repeats}", end="")
        sys.stdout.flush()
    multiplier *=10
print()

comparison = Compare(results)

comparison.print()
```

The TH timings using this script on my devfair are:

```
[------------------------------ Index ------------------------------]
               |  torch_var  |  torch_var0  |   stdfn   |  torch_sum0
1 threads: ----------------------------------------------------------
      8        |     16.0    |      5.6     |     40.9  |       5.0
      80       |     15.9    |      6.1     |     41.6  |       4.9
      800      |     16.7    |     12.0     |     42.3  |       5.0
      8000     |     27.2    |     72.7     |     51.5  |       6.2
      80000    |    129.0    |    715.0     |    133.0  |      18.0
      800000   |   1099.8    |   6961.2     |    842.0  |     112.6
      8000000  |  11879.8    |  68948.5     |  20138.4  |    1750.3
```

and the ATen timings are:

```
[------------------------------ Index ------------------------------]
               |  torch_var  |  torch_var0  |   stdfn   |  torch_sum0
1 threads: ----------------------------------------------------------
      8              |       4.3   |       5.4    |     41.4  |       5.4
      80            |       4.9   |       5.7    |     42.6  |       5.4
      800          |      10.7   |      11.7    |     43.3  |       5.5
      8000        |      69.3   |      72.2    |     52.8  |       6.6
      80000      |     679.1   |     676.3    |    129.5  |      18.1
      800000    |    6770.8   |    6728.8    |    819.8  |     109.7
      8000000  |   65928.2   |   65538.7    |  19408.7  |    1699.4
```

which demonstrates that performance is analogous to calling the existing var and std with `dim=0` on a 1D tensor. This would be a significant performance hit. Another simple script shows that performance is mixed with default (multi-threaded) settings, too:

```
import torch
import time

# Benchmarking var and std, 1D with varying sizes
base = 8
multiplier = 1

op = torch.var
reps = 1000

for _ in range(7):
    size = base * multiplier
    t = torch.randn(size)
    elapsed = 0
    for _ in range(reps):
        start = time.time()
        op(t)
        end = time.time()
        elapsed += end - start
    multiplier *= 10

    print("Size: ", size)
    print("Avg. elapsed time: ", elapsed / reps)
```

```
var cpu TH vs ATen timings

Size:  8
Avg. elapsed time:  1.7853736877441406e-05 vs 4.9788951873779295e-06 (ATen wins)
Size:  80
Avg. elapsed time:  1.7803430557250977e-05 vs 6.156444549560547e-06 (ATen wins)
Size:  800
Avg. elapsed time:  1.8569469451904296e-05 vs 1.2302875518798827e-05 (ATen wins)
Size:  8000
Avg. elapsed time:  2.8756141662597655e-05 vs. 6.97789192199707e-05 (TH wins)
Size:  80000
Avg. elapsed time:  0.00026622867584228516 vs. 0.0002447957992553711 (ATen wins)
Size:  800000
Avg. elapsed time:  0.0010556647777557374 vs 0.00030616092681884767 (ATen wins)
Size:  8000000
Avg. elapsed time:  0.009990205764770508 vs 0.002938544034957886 (ATen wins)

std cpu TH vs ATen timings

Size:  8
Avg. elapsed time:  1.6681909561157225e-05 vs. 4.659652709960938e-06 (ATen wins)
Size:  80
Avg. elapsed time:  1.699185371398926e-05 vs. 5.431413650512695e-06 (ATen wins)
Size:  800
Avg. elapsed time:  1.768803596496582e-05 vs. 1.1279821395874023e-05 (ATen wins)
Size:  8000
Avg. elapsed time:  2.7791500091552735e-05  vs 7.031106948852539e-05 (TH wins)
Size:  80000
Avg. elapsed time:  0.00018650460243225096 vs 0.00024368906021118164 (TH wins)
Size:  800000
Avg. elapsed time:  0.0010522041320800782 vs 0.0003039860725402832 (ATen wins)
Size:  8000000
Avg. elapsed time:  0.009976618766784668 vs. 0.0029211788177490234 (ATen wins)
```

These results show the TH solution still performs better than the ATen solution with default threading for some sizes.

It seems like removing CPU var_all and std_all will require an improvement in ATen reductions. https://github.com/pytorch/pytorch/issues/40570 has been updated with this information.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43858

Reviewed By: zou3519

Differential Revision: D23498981

Pulled By: mruberry

fbshipit-source-id: 34bee046c4872d11c3f2ffa1b5beee8968b22050
2020-09-06 09:40:54 -07:00
83a6e7d342 Adds inequality testing aliases for better NumPy compatibility (#43870)
Summary:
This PR adds the following aliases:

- not_equal for torch.ne
- greater for torch.gt
- greater_equal for torch.ge
- less for torch.lt
- less_equal for torch.le

These aliases are consistent with NumPy's naming for these functions.
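
A quick sketch showing the aliases compute the same comparisons:

```python
import torch

a = torch.tensor([1, 2, 3])
b = torch.tensor([2, 2, 2])
assert torch.equal(torch.not_equal(a, b), torch.ne(a, b))
assert torch.equal(torch.greater_equal(a, b), torch.ge(a, b))
```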

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43870

Reviewed By: zou3519

Differential Revision: D23498975

Pulled By: mruberry

fbshipit-source-id: 78560df98c9f7747e804a420c1e53fd1dd225002
2020-09-06 09:36:23 -07:00
671160a963 Revert D23557576: Revert D23519521: [dper3] replace LengthsGather lowlevel module's PT implementation to use caffe2 op
Test Plan: revert-hammer

Differential Revision:
D23557576

Original commit changeset: 33631299eabe

fbshipit-source-id: 704d36a16346f047b30e2da8be882062135f8617
2020-09-06 01:50:43 -07:00
e358d516c8 Revert D23549149: [PyTorch][Mobile] Insert the module name as name() to metadata dict if metadata doesn't contain "model_name"
Test Plan: revert-hammer

Differential Revision:
D23549149 (398409f072)

Original commit changeset: fad742a8d4e6

fbshipit-source-id: bd92a2033a804d3e6a2747b4fda4ca527991a993
2020-09-06 00:06:35 -07:00
70c8daf439 Apply selective build on RNN operators (#44132)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43985

Added
```
def(detail::SelectiveStr<true>, ...)
impl(detail::SelectiveStr<true>, ...)
```
in torch/library, which can also be used for other templated selective registration.

Size savings for this diff:
fbios-pika: 78 KB
igios: 87 KB

Test Plan: Imported from OSS

Reviewed By: ljk53, smessmer

Differential Revision: D23459774

Pulled By: iseeyuan

fbshipit-source-id: 86d34cfe8e3f852602f203db06f23fa99af2c018
2020-09-05 23:47:51 -07:00
68297eeb1a Add support for integer dim arg in torch.linalg.norm (#43907)
Summary:
Since PR https://github.com/pytorch/pytorch/issues/43262 is merged, this works now.

Part of https://github.com/pytorch/pytorch/issues/24802
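
A usage sketch of what this enables (shapes are illustrative):

```python
import torch

t = torch.randn(3, 4)
torch.linalg.norm(t, dim=1)  # an integer dim now works; previously a tuple like dim=(1,) was needed
```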

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43907

Reviewed By: anjali411

Differential Revision: D23471964

Pulled By: mruberry

fbshipit-source-id: ef2f11f78343fc866f752c9691b0c1fa687353ba
2020-09-05 23:16:36 -07:00
719d29dab5 Implement torch.i0 and torch.kaiser_window (#43132)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349
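
A usage sketch of the two new functions:

```python
import torch

torch.i0(torch.tensor([0.0, 1.0, 2.0]))           # zeroth-order modified Bessel function of the first kind
torch.kaiser_window(10, periodic=True, beta=12.0)  # Kaiser window of length 10
```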

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43132

Reviewed By: smessmer

Differential Revision: D23479072

Pulled By: mruberry

fbshipit-source-id: 4fb1de44830771c6a7222cf19f7728d9ac7c043b
2020-09-05 23:11:47 -07:00
4fc29e9c43 Revert D23519521: [dper3] replace LengthsGather lowlevel module's PT implementation to use caffe2 op
Test Plan: revert-hammer

Differential Revision:
D23519521 (8c64bb4f47)

Original commit changeset: ed9bd16a8af3

fbshipit-source-id: 33631299eabec05a1a272bfd0040d96203cf62a0
2020-09-05 20:43:04 -07:00
396469f18c Explicitly forbidden the other inherited methods of RemoteModule. (#43895)
Summary:
Throw exceptions when methods other than forwardXXX are used.

Original PR issue: RemoteModule enhancements #40550

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43895

Test Plan: buck test test/distributed/rpc:process_group_agent -- RemoteModule

Reviewed By: rohan-varma

Differential Revision: D23392842

Pulled By: SciPioneer

fbshipit-source-id: 7c09a55a03f9f0b7e9f9264a42bfb907607f4651
2020-09-05 14:48:56 -07:00
199c73be0f [quant][pyper] Support quantization of ops in fork-wait subgraph (#44048)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44048

Inline the fork-wait calls to make sure we can see the ops to be quantized in the main graph.

Also fix the InlineForkWait JIT pass to account for the case where the aten::wait call isn't present in the main graph
and a future tensor is returned from the subgraph.

Example

```
graph(%self.1 : __torch__.dper3.core.interop.___torch_mangle_6325.DperModuleWrapper,
       %argument_1.1 : Tensor,
       %argument_2.1 : Tensor):
   %3 : Future[Tensor[]] = prim::fork_0(%self.1, %argument_1.1, %argument_2.1) # :0:0
   return (%3)
 with prim::fork_0 = graph(%self.1 : __torch__.dper3.core.interop.___torch_mangle_5396.DperModuleWrapper,
       %argument_1.1 : Tensor,
       %argument_2.1 : Tensor):
   %3 : __torch__.dper3.core.interop.___torch_mangle_6330.DperModuleWrapper = prim::GetAttr[name="x"](%self.1)
   %4 : __torch__.dper3.core.interop.___torch_mangle_5397.DperModuleWrapper = prim::GetAttr[name="y"](%self.1)
   %5 : __torch__.dper3.core.interop.___torch_mangle_6327.DperModuleWrapper = prim::GetAttr[name="z"](%4)
   %6 : Tensor = prim::CallMethod[name="forward"](%5, %argument_1.1, %argument_2.1) # :0:0
   %7 : None = prim::CallMethod[name="forward"](%3, %6) # :0:0
   %8 : Tensor[] = prim::ListConstruct(%6)
   return (%8)
```

Test Plan:
python test/test_quantization.py test_interface_with_fork

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23481003

fbshipit-source-id: 2e756be73c248319da38e053f021888b40593032
2020-09-05 12:06:19 -07:00
164b96c34c [quant][pyper] make embedding_bag quantization static (#44008)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44008

embedding_bag requires only quantization of weights (no dynamic quantization of inputs),
so the type of quantization is essentially static (without calibration).
This will enable pyper to do fc and embedding_bag quantization using the same API call.

Test Plan:
python test/test_quantization.py test_embedding_bag

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23467019

fbshipit-source-id: 41a61a17ee34bcb737ba5b4e19fb7a576d4aeaf9
2020-09-05 12:06:16 -07:00
a0ae416d60 [quant] Support aten::embedding_bag quantization in graph mode (#43989)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43989

When we trace the model, it produces an aten::embedding_bag node in the graph.
Add the necessary passes in graph mode to support quantizing it as well.

Test Plan:
python test/test_quantization.py TestQuantizeDynamicJitOps.test_embedding_bag

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23460485

fbshipit-source-id: 328c5e1816cfebb10ba951113f657665b6d17575
2020-09-05 12:05:06 -07:00
15a7368115 Add const to getTensors method of GradBucket. (#44126)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44126

Add const to getTensors method of GradBucket.

Test Plan: buck test caffe2/torch/lib/c10d:ProcessGroupGlooTest

Reviewed By: sinannasir, jiayisuse

Differential Revision: D23504088

fbshipit-source-id: 427d9591042e0c03cde02629c1146ff1e5e027f9
2020-09-05 09:19:42 -07:00
5bd2902796 [JIT] Remove references to no longer generated _tanh_backward and _sigmoid_backward (#44138)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44138

If you look at the sigmoid and tanh backward they are composed of other ops: https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/runtime/symbolic_script.cpp#L786
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/runtime/symbolic_script.cpp#L164

So tanh_backward and sigmoid_backward are no longer generated and are now legacy ops.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23543603

Pulled By: eellison

fbshipit-source-id: ce8353e53043cf969b536aac47c9576d66d4ce02
2020-09-05 01:41:36 -07:00
df67f0beab [TensorExpr fuser] Guard nodes that have tensor output properties determined by non-tensor inputs (#44137)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44137

We only insert guards on Tensor types, so we rely on the output
of a node being uniquely determined by its input types.
Bail if any non-Tensor input affects the output type
and cannot be reasoned about statically.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23543602

Pulled By: eellison

fbshipit-source-id: abd6fe0b1fd7fe6fc251694d4cd442b19c032dd7
2020-09-05 01:40:18 -07:00
5a0d65b06b Further expand coverage of addmm/addmv, fix 0 stride (#43980)
Summary:
- test beta=0, self=nan (see the sketch after this list)
- test transposes
- fixes broadcasting of addmv
- not supporting tf32 yet, will do it in future PR together with other testing fixes
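
A minimal sketch of the first case in the list above, using stable public ops (shapes are arbitrary): with `beta=0` the `self` argument is ignored, so NaNs in it must not propagate.

```python
import torch

m = torch.randn(2, 3)
v = torch.randn(3)
nan_self = torch.full((2,), float("nan"))
# beta=0 means `self` is ignored entirely, so the NaNs must not leak
# into the result.
out = torch.addmv(nan_self, m, v, beta=0)
assert not torch.isnan(out).any()
```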

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43980

Reviewed By: mruberry

Differential Revision: D23507559

Pulled By: ngimel

fbshipit-source-id: 14ee39d1a0e13b9482932bede3fccb61fe6d086d
2020-09-04 23:03:23 -07:00
d07a36e0c1 Revert D23490149: [pytorch][PR] Compile less legacy code when BUILD_CAFFE2 is set to False
Test Plan: revert-hammer

Differential Revision:
D23490149 (15e99b6ff6)

Original commit changeset: a76382c30d83

fbshipit-source-id: 75057fa9af2c19eb976962552118bf0a99911b38
2020-09-04 22:59:39 -07:00
618b4dd763 fx quant prepare: clarify naming (#44125)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44125

In `Quantizer._prepare`, `observed` was used for two different variables
with different types.  Making the names a bit cleaner and removing the
name conflict.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```

Imported from OSS

Reviewed By: dskhudia

Differential Revision: D23504109

fbshipit-source-id: 0f73eac3d6dd5f72ad5574a4d47d33808a70174a
2020-09-04 21:29:56 -07:00
a940f5ea5d torchscript graph mode quant: remove benchmark filter (#44165)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44165

Allows convolutions to be quantized if the `torch.backends.cudnn.benchmark`
flag was set.

Not for land yet, just testing.

Test Plan:
in the gist below, the resulting graph now has quantized convolutions
https://gist.github.com/vkuzo/622213cb12faa0996b6700b08d6ab2f0

Imported from OSS

Reviewed By: supriyar

Differential Revision: D23518775

fbshipit-source-id: 294f678c6afbd3feeb89b7a6655bc66ac9f8bfbc
2020-09-04 21:25:35 -07:00
8c64bb4f47 [dper3] replace LengthsGather lowlevel module's PT implementation to use caffe2 op
Summary: Use a more efficient C++ implementation in a caffe2 op to get rid of control flow statements here.

Test Plan:
- Ran `buck test dper3/dper3/modules/low_level_modules/tests:single_operators_test`
- Ran `buck-out/gen/dper3/dper3_models/experimental/pytorch/ads_model_generation_script.par --model_type="inline_cvr_post_imp" --model_version="april_2020" --gen_inference_model` and observed files getting generated:
```
[ashenoy@devbig086.ash8 ~/fbsource/fbcode] ls -l /tmp/ashenoy/inline_cvr_post_imp_april_2020/
total 278332
-rw-r--r--. 1 ashenoy users 71376941 Sep  3 23:10 serialized_inline_cvr_post_imp_april_2020_model_inference.pt
-rw-r--r--. 1 ashenoy users 71437424 Sep  3 22:09 serialized_inline_cvr_post_imp_april_2020_model_inference_shrunk.pt
-rw-r--r--. 1 ashenoy users    14952 Sep  3 22:38 serialized_inline_cvr_post_imp_april_2020_model_io_metadata_map.pt
-rw-r--r--. 1 ashenoy users    14952 Sep  3 21:42 serialized_inline_cvr_post_imp_april_2020_model_io_metadata_map_shrunk.pt
-rw-r--r--. 1 ashenoy users 67001662 Sep  3 22:38 serialized_inline_cvr_post_imp_april_2020_model_main.pt
-rw-r--r--. 1 ashenoy users 67126415 Sep  3 21:42 serialized_inline_cvr_post_imp_april_2020_model_main_shrunk.pt
-rw-r--r--. 1 ashenoy users  3945257 Sep  3 22:34 serialized_inline_cvr_post_imp_april_2020_model_preproc.pt
-rw-r--r--. 1 ashenoy users  4077266 Sep  3 21:37 serialized_inline_cvr_post_imp_april_2020_model_preproc_shrunk.pt
```
- Ran `buck-out/gen/dper3/dper3_models/experimental/pytorch/ads_model_generation_script.par --model_type="ctr_mbl_feed" --model_version="april_2020" --gen_inference_model` and observed model files getting generated:
```
[ashenoy@devbig086.ash8 ~/fbsource/fbcode] ls -l /tmp/ashenoy/ctr_mbl_feed_april_2020/
total 170304
-rw-r--r--. 1 ashenoy users  2641870 Sep  3 23:06 ctr_mbl_feed_april_2020_prod_eval_training_options
-rw-r--r--. 1 ashenoy users  2641870 Sep  3 23:06 ctr_mbl_feed_april_2020_prod_train_training_options
-rw-r--r--. 1 ashenoy users 42225079 Sep  3 23:59 serialized_ctr_mbl_feed_april_2020_model_inference.pt
-rw-r--r--. 1 ashenoy users 42576708 Sep  3 22:33 serialized_ctr_mbl_feed_april_2020_model_inference_shrunk.pt
-rw-r--r--. 1 ashenoy users    11194 Sep  3 23:29 serialized_ctr_mbl_feed_april_2020_model_io_metadata_map.pt
-rw-r--r--. 1 ashenoy users    11194 Sep  3 22:05 serialized_ctr_mbl_feed_april_2020_model_io_metadata_map_shrunk.pt
-rw-r--r--. 1 ashenoy users 39239139 Sep  3 23:29 serialized_ctr_mbl_feed_april_2020_model_main.pt
-rw-r--r--. 1 ashenoy users 39250842 Sep  3 22:05 serialized_ctr_mbl_feed_april_2020_model_main_shrunk.pt
-rw-r--r--. 1 ashenoy users  2839097 Sep  3 23:24 serialized_ctr_mbl_feed_april_2020_model_preproc.pt
-rw-r--r--. 1 ashenoy users  2944239 Sep  3 22:01 serialized_ctr_mbl_feed_april_2020_model_preproc_shrunk.pt
```

Reviewed By: houseroad

Differential Revision: D23519521

fbshipit-source-id: ed9bd16a8af3cca3a865d9614d67d07f01d8b18a
2020-09-04 21:19:53 -07:00
398409f072 [PyTorch][Mobile] Insert the module name as name() to metadata dict if metadata doesn't contain "model_name" (#44227)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44227

As title
ghstack-source-id: 111490242

Test Plan: CI

Reviewed By: xcheng16

Differential Revision: D23549149

fbshipit-source-id: fad742a8d4e6f844f83495514cd60ff2bf0d5bcb
2020-09-04 21:18:12 -07:00
15e99b6ff6 Compile less legacy code when BUILD_CAFFE2 is set to False (#44079)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44079

Reviewed By: walterddr

Differential Revision: D23490149

Pulled By: malfet

fbshipit-source-id: a76382c30d83127d180ec63ac15093a7297aae53
2020-09-04 20:04:21 -07:00
f3bf6a41ca [ONNX] Update repeat op (#43430)
Summary:
Update the repeat op so that the inputs to the sizes argument can be a mixture of dynamic and constant inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43430

Reviewed By: houseroad

Differential Revision: D23494257

Pulled By: bzinodev

fbshipit-source-id: 90c5e90e4f73e98f3a9d5c8772850e72cecdf0d4
2020-09-04 18:53:31 -07:00
3699274ce2 [DPER3] AOT integration
Summary: Integrate aot flow with model exporter.

Test Plan:
buck test dper3/dper3_backend/delivery/tests:dper3_model_export_test

replayer test see D23407733

Reviewed By: ipiszy

Differential Revision: D23313689

fbshipit-source-id: 39ae8d578ed28ddd6510db959b65974a5ff62888
2020-09-04 18:37:22 -07:00
8b17fd2516 Add remote_parameters() into RemoteModule class. (#43906)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43906

This method returns a list of RRefs of remote parameters that can be fed into the DistributedOptimizer.

Original PR issue: RemoteModule enhancements #40550

Test Plan: buck test caffe2/test/distributed/rpc:process_group_agent -- RemoteModule

Reviewed By: rohan-varma

Differential Revision: D23399586

fbshipit-source-id: 4b0f1ccf2e47c8a9e4f79cb2c8668f3cdbdff820
2020-09-04 16:22:40 -07:00
8f37ad8290 [BUILD] Guard '#pragma unroll' with COMPILING_FOR_MIN_SIZE
Summary: Disable unroll hints when COMPILING_FOR_MIN_SIZE is on. We were seeing hundreds of errors in the build because the optimization was not being performed.

Test Plan: Smoke builds

Differential Revision: D23513255

fbshipit-source-id: 87da2fdc3c1146e8ffcacf14a49d5151d313f367
2020-09-04 15:55:28 -07:00
3d7c22a2ce [ONNX] Enable new scripting passes for functionalization and remove_mutation (#43791)
Summary:
Duplicate of https://github.com/pytorch/pytorch/issues/41413
This PR initiates the process of updating the torchscript backend interface used by the ONNX exporter.

Replace the jit lower graph pass with the freeze module pass.

Enable ScriptModule tests for ONNX operator tests (ORT backend) and model tests by default.

Replace the jit remove_inplace_ops pass with remove_mutation, consolidating all passes for handling inplace ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43791

Reviewed By: houseroad

Differential Revision: D23421872

Pulled By: bzinodev

fbshipit-source-id: a98710c45ee905748ec58385e2a232de2486331b
2020-09-04 15:21:45 -07:00
70bbd08402 [FX] Fix forward merge conflict breakage (#44221)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44221

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23547373

Pulled By: jamesr66a

fbshipit-source-id: df47fce0f6ff2988093208fc8370544b7985288d
2020-09-04 15:12:33 -07:00
4562b212db Fix potential divide by zero for CostInferenceForRowWiseSparseAdagrad
Summary: Fix the potential divide by zero error in CostInferenceForRowWiseSparseAdagrad, when n has zero elements

Test Plan:
Ran buck test caffe2/caffe2/python/operator_test:adagrad_test
Result: https://our.intern.facebook.com/intern/testinfra/testrun/562950122086369

Reviewed By: idning

Differential Revision: D23520763

fbshipit-source-id: 191345bd24f5179a9dbdb41c6784eab102cfe89c
2020-09-04 14:14:49 -07:00
2ad5a82c43 [fx] get rid of graph_module.root (#44092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44092

Instead, submodules and weights are installed directly on the
graph_module by transferring the original modules. This makes it more
likely that scripting will succeed (since we no longer have submodules
that are not used in the trace). It also prevents layered transforms
from having to special case handling of the `root` module. GraphModules
can now be re-traced as part of the input to other transforms.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23504210

Pulled By: zdevito

fbshipit-source-id: f79e5c4cbfc52eb0ffb5d6ed89b37ce35a7dc467
2020-09-04 11:35:32 -07:00
0c2bc4fe20 Revert D23468286: [pytorch][PR] Optimize code path for adaptive_avg_pool2d when output size is (1, 1)
Test Plan: revert-hammer

Differential Revision:
D23468286 (f8f35fddd4)

Original commit changeset: cc181f705fea

fbshipit-source-id: 3a1db0eef849e0c2f3c0c64040d2a8b799644fa3
2020-09-04 11:28:15 -07:00
6474057c76 Revert D23503636: [pytorch][PR] [NNC] make inlining immediate (take 2) and fix bugs
Test Plan: revert-hammer

Differential Revision:
D23503636 (70aecd2a7f)

Original commit changeset: cdbdc902b7a1

fbshipit-source-id: b5164835f874a56213de4bed9ad690164eae9230
2020-09-04 10:58:23 -07:00
539d029d8c [ONNX] Fix split export using slice (#43670)
Summary:
Fix for exporting split with fixed output shape using slice.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43670

Reviewed By: houseroad

Differential Revision: D23420318

Pulled By: bzinodev

fbshipit-source-id: 09c2b58049fe32dca2f2977d91dd64de6ee9a72f
2020-09-04 10:52:44 -07:00
af13faf18b [FX] __str__ for GraphModule and Graph (#44166)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44166

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23520801

Pulled By: jamesr66a

fbshipit-source-id: f77e3466e435127ec01e66291964395f32a18992
2020-09-04 10:46:43 -07:00
0e3cf6b8d2 [pytorch] remove code analyzer build folder between builds (#44148)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44148

Automatically remove the build_code_analyzer folder each time build.sh is run
ghstack-source-id: 111458413

Test Plan:
Run build.sh with different options and compare the outputs (should be different).
Ex:
`ANALYZE_TORCH=1 DEPLOY=1 BASE_OPS_FILE=/path/to/baseops MOBILE_BUILD_FLAGS='-DBUILD_MOBILE_AUTOGRAD=OFF' tools/code_analyzer/build.sh `

should produce a shorter file than
`ANALYZE_TORCH=1 DEPLOY=1 BASE_OPS_FILE=/path/to/baseops MOBILE_BUILD_FLAGS='-DBUILD_MOBILE_AUTOGRAD=ON' tools/code_analyzer/build.sh`

Reviewed By: iseeyuan

Differential Revision: D23503886

fbshipit-source-id: 9b95d4365540da0bd2d27760e1315caed5f44eec
2020-09-04 10:38:12 -07:00
f38e7aee71 Updates to SCCACHE for ROCm case (#44155)
Summary:
- Collecting sccache trace logs
- Change the SCCACHE_IDLE_TIMEOUT to unlimited

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44155

Reviewed By: ngimel

Differential Revision: D23516192

Pulled By: malfet

fbshipit-source-id: aa93052d7b9a1832eeaa8e81ee8706aeb9f7a508
2020-09-04 10:11:18 -07:00
2a1fc56694 replace the white list from default mappings (#41802)
Summary:
Replaced "whitelist" from default_mappings.py
Fixes https://github.com/pytorch/pytorch/issues/41756

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41802

Reviewed By: ngimel

Differential Revision: D23521452

Pulled By: malfet

fbshipit-source-id: 019a2d5c06dc59dc53d6c48b70fb35b216299cf4
2020-09-04 10:04:28 -07:00
4d431881d1 Control NCCL build parallelism via MAX_JOBS environment var (#44167)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44167

Reviewed By: walterddr, ngimel

Differential Revision: D23522419

Pulled By: malfet

fbshipit-source-id: 31b25a71fef3e470bdf382eb3698e267326fa354
2020-09-04 10:02:53 -07:00
6aba58cfd3 Limit MAX_JOBS to 18 for linux binary builds (#44168)
Summary:
Because those jobs are running in a Docker2XLarge+ container that has 20 cores.
Unfortunately `nproc` returns the number of cores available on the host rather than the number of cores available to the container.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44168

Reviewed By: walterddr, ngimel

Differential Revision: D23539558

Pulled By: malfet

fbshipit-source-id: 3df858722e153a8fcbe8ef6370b1a9c1993ada5b
2020-09-04 09:58:17 -07:00
6cecf7ec68 Enable test_cublas_config_deterministic_error for windows (#42796)
Summary:
test_cublas_config_deterministic_error can pass on Windows, so enable it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42796

Reviewed By: seemethere

Differential Revision: D23520002

Pulled By: malfet

fbshipit-source-id: eccedbbf202b1cada795071a34e266b2c635c2cf
2020-09-04 09:52:57 -07:00
9a5a732866 Register some backwards functions as operators (#44052)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44052

Summary
=======

This PR registers the following backwards functions as operators:
- slice_backward
- select_backward
- gather_backward
- index_select_backward (the backward function for index_select)
- select_index_backward (prevously known as index_select_backward, but is actually the backward function for max.dim, min.dim, etc)

In the future, I'd like to register more backward functions as operators
so that we can write batching rules for the backward functions. Batching
rules for backward functions makes it so that we can compute batched
gradients.

Motivation
==========
The rationale behind this PR is that a lot of backwards functions (27 in total)
are incompatible with BatchedTensor due to using in-place operations.
Sometimes we can allow the in-place operations, but other times we can't.
For example, consider select_backward:

```
Tensor select_backward(const Tensor& grad, IntArrayRef input_sizes, int64_t dim, int64_t index) {
  auto grad_input = at::zeros(input_sizes, grad.options());
  grad_input.select(dim, index).copy_(grad);
  return grad_input;
}
```

and consider the following code:
```
x = torch.randn(5, requires_grad=True)
def select_grad(v):
    torch.autograd.grad(x[0], x, v)

vs = torch.randn(B0)
batched_grads = vmap(select_grad)(vs)
```

For the batched gradient use case, `grad` is a BatchedTensor.
The physical version of `grad` has size `(B0,)`.
However, select_backward creates a `grad_input` of shape `(5)`, and
tries to copy `grad` to a slice of it.

Other approaches
================

I've considered the following:
- register select_backward as an operator (this PR)
- have a branch inside select_backward for if `grad` is batched.
    - this is OK, but what if we have more tensor extensions that want to override this?
- modify select_backward to work with BatchedTensor, by creating a new operator for the "select + copy_ behavior".
    - select + copy_ isn't used elsewhere in derivative formulas so this doesn't seem useful

Test Plan
=========

- `pytest test/test_autograd.py -v`
- Registering backward functions may impact performance. I benchmarked
select_backward to see if registering it as an operator led to any noticable
performance overheads: https://gist.github.com/zou3519/56d6cb53775649047b0e66de6f0007dc.
The TL;DR is that the overhead is pretty minimal.

Test Plan: Imported from OSS

Reviewed By: ezyang, fbhuba

Differential Revision: D23481183

Pulled By: zou3519

fbshipit-source-id: 125af62eb95824626dc83d06bbc513262ee27350
2020-09-04 08:30:39 -07:00
0c01f136f3 [BE] Use f-string in various Python functions (#44161)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44161

Reviewed By: seemethere

Differential Revision: D23515874

Pulled By: malfet

fbshipit-source-id: 868cf65aedd58fce943c08f8e079e84e0a36df1f
2020-09-04 07:38:25 -07:00
28b1360d24 [Codemod][FBSourceGoogleJavaFormatLinter] Daily arc lint --take GOOGLEJAVAFORMAT
Reviewed By: zertosh

Differential Revision: D23536088

fbshipit-source-id: d4c6c26ed5bad4e8c1b80ac1c05bd86b36cb6aaa
2020-09-04 07:30:50 -07:00
f8f35fddd4 Optimize code path for adaptive_avg_pool2d when output size is (1, 1) (#43986)
Summary:
Benchmark:

code: https://github.com/xwang233/code-snippet/blob/master/adaptive-avg-pool2d-output-1x1/adap.ipynb

| shape | time_before (ms) | time_after (ms) |
| --- | --- | --- |
| (2, 3, 4, 4), torch.contiguous_format, cpu  |  0.035 |  0.031 |
| (2, 3, 4, 4), torch.contiguous_format, cuda  |  0.041 |  0.031 |
| (2, 3, 4, 4), torch.channels_last, cpu  |  0.027 |  0.029 |
| (2, 3, 4, 4), torch.channels_last, cuda  |  0.031 |  0.034 |
| (2, 3, 4, 4), non_contiguous, cpu  |  0.037 |  0.026 |
| (2, 3, 4, 4), non_contiguous, cuda  |  0.062 |  0.033 |
| (4, 16, 32, 32), torch.contiguous_format, cpu  |  0.063 |  0.055 |
| (4, 16, 32, 32), torch.contiguous_format, cuda  |  0.043 |  0.031 |
| (4, 16, 32, 32), torch.channels_last, cpu  |  0.052 |  0.064 |
| (4, 16, 32, 32), torch.channels_last, cuda  |  0.190 |  0.033 |
| (4, 16, 32, 32), non_contiguous, cpu  |  0.048 |  0.035 |
| (4, 16, 32, 32), non_contiguous, cuda  |  0.062 |  0.033 |
| (8, 128, 64, 64), torch.contiguous_format, cpu  |  0.120 |  0.109 |
| (8, 128, 64, 64), torch.contiguous_format, cuda  |  0.043 |  0.044 |
| (8, 128, 64, 64), torch.channels_last, cpu  |  1.303 |  0.260 |
| (8, 128, 64, 64), torch.channels_last, cuda  |  1.237 |  0.049 |
| (8, 128, 64, 64), non_contiguous, cpu  |  0.132 |  0.128 |
| (8, 128, 64, 64), non_contiguous, cuda  |  0.062 |  0.031 |
| (16, 256, 224, 224), torch.contiguous_format, cpu  |  17.232 |  14.807 |
| (16, 256, 224, 224), torch.contiguous_format, cuda  |  1.930 |  1.930 |
| (16, 256, 224, 224), torch.channels_last, cpu  |  245.025 |  24.345 |
| (16, 256, 224, 224), torch.channels_last, cuda  |  15.593 |  1.944 |
| (16, 256, 224, 224), non_contiguous, cpu  |  11.738 |  6.460 |
| (16, 256, 224, 224), non_contiguous, cuda  |  0.524 |  0.251 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43986

Reviewed By: anjali411

Differential Revision: D23468286

Pulled By: ngimel

fbshipit-source-id: cc181f705feacb2f86df420d648cc59fda69fdb7
2020-09-04 03:37:33 -07:00
ef28ee50b0 [Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D23536086

fbshipit-source-id: 56e9c70a6998086515f59d74c5d8a2280ac2f669
2020-09-04 03:33:32 -07:00
98ad5ff41f [te] Disable reductions by default (#44122)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44122

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D23504769

Pulled By: bertmaher

fbshipit-source-id: 1889217cd22da529e46ab30c9319a5646267e4ec
2020-09-03 23:37:45 -07:00
a37c199b8b [c2][cuda] small improvement to dedup adagrad by avoiding recompute of x_ij (#44173)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44173

It has a small 10~15% speed improvement.

Test Plan:
== Correctness ==
`buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_sum_gradient '`

Reviewed By: jianyuh

Differential Revision: D23494030

fbshipit-source-id: cdb7ee716a7e559903b72ed9f93bf106813f88fa
2020-09-03 22:50:53 -07:00
2f8a43341d Add API for onnxifi with AOT Glow ONNX (#44021)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44021

Pull Request resolved: https://github.com/pytorch/glow/pull/4854

Test Plan: Added `test_onnxifi_aot.py`

Reviewed By: yinghai

Differential Revision: D23307003

fbshipit-source-id: e6d4f3e394f96fd22f80eb2b8a686cf8171a54c0
2020-09-03 22:46:20 -07:00
d221256888 [Message] Add what to do for missing operators.
Summary: As title.

Test Plan: N/A

Reviewed By: gaurav-work

Differential Revision: D23502416

fbshipit-source-id: a341eb10030e3f319266019ba4c02d9d9a0a6298
2020-09-03 22:41:27 -07:00
addfd7a9b9 Add tests against autograd precedence and multiple dispatch. (#44037)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44037

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23480154

Pulled By: ailzhang

fbshipit-source-id: 28b68e67975397c76ce6c73ceaeec9d5cc934635
2020-09-03 22:19:08 -07:00
b60ffcdfdd Enable typechecks for torch.nn.quantized.modules.linear (#44154)
Summary:
Also import `Optional` directly from `typing` rather than from `_jit_internal`
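
I.e., the preferred form after this change:

```python
# Preferred: take Optional from the standard library...
from typing import Optional
# ...rather than relying on the re-export in torch._jit_internal.
```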

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44154

Reviewed By: seemethere

Differential Revision: D23511833

Pulled By: malfet

fbshipit-source-id: f78c5fd679c002b218e4d287a9e56fa198171981
2020-09-03 19:52:49 -07:00
538d3bd364 Enable CUDA 11 jobs for Windows nightly builds (#44086)
Summary:
Fixes https://github.com/pytorch/pytorch/pull/43366/files#r474333051.
Testing with https://github.com/pytorch/pytorch/pull/44007.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44086

Reviewed By: ezyang

Differential Revision: D23493553

Pulled By: malfet

fbshipit-source-id: 34b3e5b2e8dece5e97db9d507c34d61d33bd0863
2020-09-03 17:45:31 -07:00
69e38828f5 [quant] conv_transpose2d_prepack/conv_transpose2d_unpack (#40351)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40351

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D22158983

Pulled By: z-a-f

fbshipit-source-id: 3ca064c2d826609724b2740fcc9b9eb40556168d
2020-09-03 17:21:32 -07:00
c40e3f9f98 [android][jni] Support Tensor MemoryFormat in java wrappers (#40785)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40785

The main goal of this change is to support creating Tensors specifying blob in NHWC (ChannelsLast) format.

ChannelsLast is supported only for 4-dim tensors; this is enforced on the LibTorch side. I have not added asserts on the Java side, both to avoid double asserts and in case this limitation is changed in the future.

Additional changes in `aten/src/ATen/templates/Functions.h`:

`from_blob` creates an `at::empty({0}, options)` tensor first and sets its Storage with sizes and strides afterwards.

But as ChannelsLast is only for 4-dim tensors - it fails on that creation, as dim==1.

I've added `zero_sizes()` function that returns `{0, 0, 0, 0}` for ChannelsLast and ChannelsLast3d.

Test Plan: Imported from OSS

Reviewed By: dreiss

Differential Revision: D22396244

Pulled By: IvanKobzarev

fbshipit-source-id: 02582d748a554e0f859aefe71cd2c1e321fb8979
2020-09-03 17:01:35 -07:00
70aecd2a7f [NNC] make inlining immediate (take 2) and fix bugs (#43885)
Summary:
A rework of `computeInline` which makes it work a bit better, particularly when combined with other transformations. Previously we stored Functions that were inlined and then deferred the actual inlining of the function body until prepareForCodegen was called. This has an issue when transformations are applied to the LoopNest: the function body can be different from what appears in the root_stmt and result in inlining that a) fails, b) reverses other transformations or c) a weird unpredictable combination of the two.

This PR changes that behaviour so that the inlining occurs in the root stmt immediately, which means it reflects any previous transformations and any future transformations have a true view of the internal IR. It also has the benefit that inspecting the root statement gives an accurate view of it without needing to call prepareForCodegen. I also removed the difference between `computeInline` and `computeInlineWithRand` and we handle calls to `rand()` in all branches.

This is a rework of https://github.com/pytorch/pytorch/issues/38696, with the agreed changes from ZolotukhinM and zheng-xq: we should only inline if the dimensions are trivial (ie. they are vars not exprs).

This PR is mostly tests, and I fixed a bunch of bugs I found along the way. Partial list:
* When inlining an expression involving rand, we would create random vars equal to the dimensionality of the enclosing Tensor, not the produced Tensor - meaning we'd use an incorrect value if the inlined tensor was smaller. E.g.: `X[i] = rand(); A[i, j] = X[i]` would produce a tensor where `A[0, 0] != A[0, 1]`. This is fixed by inserting the Let binding of the random variable at the correct loop body.
* When inlining we'd replace all calls to `rand()` rather than just those present in the Tensor being inlined.
* `rand()` was treated symbolically by the simplifier and we would aggregate or cancel calls to `rand()`. Have fixed the hasher to hash all calls to `rand()` distinctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43885

Reviewed By: gmagogsfm

Differential Revision: D23503636

Pulled By: nickgg

fbshipit-source-id: cdbdc902b7a14d269911d978a74a1c11eab004fa
2020-09-03 16:49:24 -07:00
bc4a00c197 [TVM] Support Fused8BitRowwiseQuantizedToFloat op (#44098)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44098

Reviewed By: yinghai

Differential Revision: D23470129

fbshipit-source-id: 1959e2167859f7cbc16e1423b957072bbc743ece
2020-09-03 16:39:53 -07:00
3105d8a9b2 [TensorExpr] Fuser: rely on input types when checking whether a device is supported. (#44139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44139

Also, make sure that we're checking that condition when we're starting a
new fusion group, not only when we merge a node into an existing fusion
group. Oh, and one more: add a test checking that we're rejecting graphs
with unspecified shapes.

Differential Revision: D23507510

Test Plan: Imported from OSS

Reviewed By: bertmaher

Pulled By: ZolotukhinM

fbshipit-source-id: 9c268825ac785671d7c90faf2aff2a3e5985ac5b
2020-09-03 16:27:14 -07:00
71510c60ad fx qat: respect device affinity (#44115)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44115

Fixes device affinity in the FX prepare pass for QAT. Before this PR, observers
were always created on CPU. After this PR, observers are created on the
same device as the rest of the model. This will enable QAT prepare to
work regardless of whether users move the model to cuda before or after
calling this pass.
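
A rough sketch of the affinity idea; the helper name and the single-device assumption are illustrative, not the actual pass internals:

```python
import torch

def model_device(model: torch.nn.Module) -> torch.device:
    # Create observers on whichever device the model already lives on,
    # instead of unconditionally defaulting to CPU.
    devices = {p.device for p in model.parameters()}
    devices |= {b.device for b in model.buffers()}
    assert len(devices) <= 1, "expected the model to live on one device"
    return devices.pop() if devices else torch.device("cpu")

print(model_device(torch.nn.Linear(4, 2)))  # cpu
```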

Test Plan:
```
python test/test_quantization.py TestQuantizeFx.test_qat_prepare_device_affinity
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D23502291

fbshipit-source-id: ec4ed20c21748a56a25e3395b35ab8640d71b5a8
2020-09-03 16:16:59 -07:00
7816d53798 [JIT] Add mypy type annotations for JIT (#43862)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43862

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23491151

Pulled By: SplitInfinity

fbshipit-source-id: 88367b89896cf409bb9ac3db7490d6779efdc3a4
2020-09-03 15:09:24 -07:00
9dd8670d7d [jit] Better match behavior of loaded ScriptModules vs. freshly created ones (#43298)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43298

IR emitter uses `ModuleValue` to represent ScriptModules and emit IR for
attribute access, submodule access, etc.

`ModuleValue` relies on two pieces of information, the JIT type of the
module, and the `ConcreteModuleType`, which encapsulates Python-only
information about the module.

ScriptModules loaded from a package used to create a dummy
ConcreteModuleType without any info in it. This led to divergences in
behavior during compilation.

This PR makes the two ways of constructing a ConcreteModuleType equivalent,
modulo any py-only information (which, by definition, is never present in
packaged files anyway).

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23228738

Pulled By: suo

fbshipit-source-id: f6a660f42272640ca1a1bb8c4ee7edfa2d1b07cc
2020-09-03 15:03:39 -07:00
74f18476a2 [jit] fix segfault in attribute lookup on loaded ScriptModules (#43284)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43284

The IR emitter looks for attributes on modules like:
1. Check the JIT type for the attribute
2. Check the originating Python class, in order to fulfill requests for, e.g. static methods or ignored methods.

In the case where you do:
```
inner_module = torch.jit.load("inner.pt")
wrapped = Wrapper(inner_module)  # wrap the loaded ScriptModule in an nn.Module
torch.jit.script(wrapped)
```

The IR emitter may check for attributes on `inner_module`. There is no
originating Python class for `inner_module`, since it was directly
compiled from the serialized format.

Due to a bug in the code, we don't guard for this case and a segfault
results if the wrapper asks for an undefined attribute. The lookup in
this case looks like:
1. Check the JIT type for the attribute (not there!)
2. Check the originating Python class (this is a nullptr! segfault!)

This PR guards this case and properly just raises an attribute missing
compiler error instead of segfaulting.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23224337

Pulled By: suo

fbshipit-source-id: 0cf3060c427f2253286f76f646765ec37b9c4c49
2020-09-03 15:01:59 -07:00
e64879e180 [tensorexpr] Alias analysis tests (#44110)
Summary:
Some tests for alias analysis.

The first aliases at the module level and the second at the input level.

Please let me know if there are other alias situations!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44110

Reviewed By: nickgg

Differential Revision: D23509473

Pulled By: bwasti

fbshipit-source-id: fbfe71a1d40152c8fbbd8d631f0a54589b791c34
2020-09-03 14:52:47 -07:00
6868bf95c6 [JIT] Fuser match on schemas not node kind (#44083)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44083

Match on the complete schema of a node instead of its node kind when deciding to fuse it. Previously we matched on node kind, which could fail with something like `aten::add(int, int)`; if a new overload was added to an op without corresponding NNC support, we would still fuse it.

Follow ups are:
 - bail when an output tensor type isn't uniquely determined by the input types (e.g. aten::add, where the second input could be either a float or an int)
- remove NNC lowering for _tanh_backward & _sigmoid_backward
- Validate that we support all of the overloads here. I optimistically added ops that included Tensors; it's possible that we do not support every overload here. This isn't a regression, and this PR is at least improving our failures in that regard.

I can do any of these as part of this PR if desired, but there are a number of failures people have run into that this PR fixes so I think it would be good to land this sooner than later.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D23503704

Pulled By: eellison

fbshipit-source-id: 3ce971fb1bc3a7f1cbaa38f1ed853e2db3d67c18
2020-09-03 14:47:19 -07:00
9b3c72d46e [pytorch] Make mobile find_method return an optional (#43965)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43965

As part of a larger effort to unify the API between the lite interpreter and full JIT:
- implement torch::jit::mobile::Method, a proxy for torch::jit::mobile::Function
- add support for overloaded operator() to mobile Method and Function
- mobile find_method now returns a c10::optional<Method> (so signature matches full jit)
- moves some implementation of Function from module.cpp to function.cpp
ghstack-source-id: 111161942

Test Plan: CI

Reviewed By: iseeyuan

Differential Revision: D23330762

fbshipit-source-id: bf0ba0d711d9566c92af31772057ecd35983ee6d
2020-09-03 14:46:18 -07:00
f91bdbeabd Enable function calls in TEFuser and SpecializeAutogradZero (#43866)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43866

Reviewed By: ezyang

Differential Revision: D23452798

Pulled By: Krovatkin

fbshipit-source-id: 2cff4c905bf1b5d9de56e7869458ffa6fce1f1b5
2020-09-03 14:42:52 -07:00
e05fa2f553 [quant] Prep for conv_transpose packing (#39714)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39714

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D22087071

Pulled By: z-a-f

fbshipit-source-id: 507f8a414026eb4c9926f68c1e94d2f56119bca6
2020-09-03 14:10:32 -07:00
352a32e7f3 [caffe2] fix clang build
Summary:
* multiple -Wpessimizing-moves
* `static` within  `__host__` `__device__` function

Test Plan:
```lang=bash
buck build -c fbcode.cuda_use_clang=true fblearner/flow/projects/dper:workflow
```

Reviewed By: andrewjcg

Differential Revision: D23506573

fbshipit-source-id: 1490a1267e39e067d3ef836ef9b1cd5d7a28f724
2020-09-03 14:02:27 -07:00
f3da9e3b50 Enable Enum pickling/unpickling. (#43188)
Summary:
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **https://github.com/pytorch/pytorch/issues/43188 Enable Enum pickling/unpickling.**
* https://github.com/pytorch/pytorch/issues/42963 Add Enum TorchScript serialization and deserialization support
* https://github.com/pytorch/pytorch/issues/42874 Fix enum constant printing and add FileCheck to all Enum tests
* https://github.com/pytorch/pytorch/issues/43121 Add Enum convert back to Python object support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43188

Reviewed By: zdevito

Differential Revision: D23365141

Pulled By: gmagogsfm

fbshipit-source-id: f0c93d4ac614dec047ad8640eb6bd9c74159b558
2020-09-03 13:51:02 -07:00
d0421ff1cc Benchmarks: add scripts for FastRNNs results comparison. (#44134)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44134

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23505810

Pulled By: ZolotukhinM

fbshipit-source-id: d0b3d70d4c2a44a8c3773631d09a25a98ec59370
2020-09-03 13:44:42 -07:00
3806c939bd Polish DDP join API docstrings (#43973)
Summary:
Polishes DDP join API docstrings and makes a few minor cosmetic changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43973

Reviewed By: zou3519

Differential Revision: D23467238

Pulled By: rohan-varma

fbshipit-source-id: faf0ee56585fca5cc16f6891ea88032336b3be56
2020-09-03 13:39:45 -07:00
442684cb25 Enable typechecks for torch.nn.modules.[activation|upsampling] (#44093)
Summary:
Add missing `hardsigmoid`, `silu`, `hardswish` and `multi_head_attention_forward` to functional.pyi.in.
Embed some typing annotations into functional.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44093

Reviewed By: ezyang

Differential Revision: D23494384

Pulled By: malfet

fbshipit-source-id: 27023c16ff5951ceaebb78799c4629efa25f7c5c
2020-09-03 13:20:04 -07:00
a153f69417 Fix replaceAtenConvolution for BC. (#44036)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44036

Running replaceAtenConvolution on older traced models won't work, as the
_convolution signature has changed and replaceAtenConvolution was
changed to account for that; the old behavior was not preserved in
that change. This change restores the old behavior while keeping the new one.

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23476775

fbshipit-source-id: 73a0c2b7387f2a8d82a8d26070d0059972126836
2020-09-03 12:57:57 -07:00
ba65cce2a2 Fix transposed conv2d rewrite pattern to account for convolution api (#44035)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44035

This accounts for the convolution API change.

Also added a test to capture such cases in the future.

Test Plan:
python test/test_xnnpack_integration.py

Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23476773

fbshipit-source-id: a62c4429351c909245106a70b4c60b1bacffa817
2020-09-03 12:55:43 -07:00
55ff9aa185 Test TE fuser unary ops and fix sigmoid(half) (#44094)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44094

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D23494950

Pulled By: bertmaher

fbshipit-source-id: 676c4e57267c4ad92065ea90b06323918dd5b0de
2020-09-03 12:48:46 -07:00
bfa1fa5249 Update rocm-3.5.1 build job to rocm-3.7 (#44123)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44123

Reviewed By: seemethere

Differential Revision: D23504193

Pulled By: malfet

fbshipit-source-id: 3570dc0aa879a3fdd43f3ecd41ee9e745006cfde
2020-09-03 12:39:30 -07:00
49215d7f26 For CriterionTests, have check_gradgrad actually only affect gradgrad checks. (#44060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44060

Right now it skips grad checks as well.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23484018

Pulled By: gchanan

fbshipit-source-id: 24a8f1af41f9918aaa62bc3cd78b139b2f8de1e1
2020-09-03 12:29:32 -07:00
42f9897983 Mark bucketize as not subject to autograd (#44102)
Summary:
Bucketize returns integers; currently this triggers an internal assert, so we apply the mechanism used for other non-differentiable ops (argmax etc.) to this case.
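
For illustration (boundary values chosen arbitrarily):

```python
import torch

boundaries = torch.tensor([0.0, 1.0, 2.0])
x = torch.tensor([0.5, 1.5])
# bucketize returns integer bucket indices, which are not differentiable,
# so the op is marked as not subject to autograd.
print(torch.bucketize(x, boundaries))  # tensor([1, 2])
```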

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44102

Reviewed By: zou3519

Differential Revision: D23500048

Pulled By: albanD

fbshipit-source-id: fdd869cd1feead6616b532b3e188bd5512adedea
2020-09-03 12:05:47 -07:00
91b0d1866a add tanh + quantize unit test (#44076)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44076

add fakelowp test for tanh + quantize

Test Plan: net runner

Reviewed By: venkatacrc

Differential Revision: D23339662

fbshipit-source-id: 96c2cea12b41bf3df24aa46e601e053dca8e9481
2020-09-03 12:00:36 -07:00
de672e874d [JIT] Improve error message for unsupported Optional types (#44054)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44054

**Summary**
This commit improves the error message that is printed when an
`Optional` type annotation with an unsupported contained type is
encountered. At present, the `Optional` is printed as-is, and
`Optional[T]` is syntactic sugar for `Union[T, None]`, so that is what
shows up in the error message and can be confusing. This commit modifies
the error message so that it prints `T` instead of `Union[T, None]`.
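
The equivalence behind the confusing old message can be checked directly (a minimal illustration):

```python
from typing import List, Optional, Union

# Optional[T] is plain sugar for Union[T, None], which is why the old
# message surfaced the Union form instead of T.
assert Optional[List] == Union[List, None]
```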

**Test Plan**
Continuous integration.

Example of old message:
```
AssertionError: Unsupported annotation typing.Union[typing.List, NoneType] could not be resolved.
```
Example of new message:
```
AssertionError: Unsupported annotation typing.Union[typing.List, NoneType] could not be resolved because typing.List could not be resolved.
```

**Fixes**
This commit fixes #42859.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23490365

Pulled By: SplitInfinity

fbshipit-source-id: 2aa9233718e78cf1ba3501ae11f5c6f0089e29cd
2020-09-03 11:55:06 -07:00
d11603de38 [TensorExpr] Benchmarks: set number of profiling runs to 2 for PE. (#44112)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44112

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23500904

Pulled By: ZolotukhinM

fbshipit-source-id: d0dd54752b7ea5ae11f33e865c96d2d61e98d573
2020-09-03 11:29:35 -07:00
b10c527a1f [pytorch][bot] update mobile op deps (#44100)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44100

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23496532

Pulled By: ljk53

fbshipit-source-id: 1e5b9059482e423960349d1361a7a98718c2d9ed
2020-09-03 11:24:26 -07:00
f96b91332f [caffe2.proto] Add AOTConfig (#44020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44020

Pull Request resolved: https://github.com/pytorch/glow/pull/4853

Add AOT config

Reviewed By: yinghai

Differential Revision: D23414435

fbshipit-source-id: 3c48acf29889fcf63def37a48de382e675e0e1f3
2020-09-03 11:07:45 -07:00
c59e11bfbb Add soft error reporting to capture all the inference runtime failure. (#44078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44078

When PyTorch mobile inference fails and throws an exception, if the caller catches it and does not crash the app, we are not able to track the inference failure.

So we are adding native soft error reporting to capture all the failures occurring during module loading and running, including both crashing and non-crashing failures. Since c10::Error has good error messaging stack handling (D21202891 (a058e938f9)), we are utilizing it for the error handling and message print out.
ghstack-source-id: 111307080

Test Plan:
Verified that the soft error reporting is sent through module.cpp when operator is missing, make sure a logview mid is generated with stack trace: https://www.internalfb.com/intern/logview/details/facebook_android_softerrors/5dd347d1398c1a9a73c804b20f7c2179/?selected-logview-tab=latest.

Error message with context is logged below:

```
soft_error.cpp		[PyTorchMobileInference] : Error occured during model running entry point: Could not run 'aten::embedding' with arguments from the 'CPU' backend. 'aten::embedding' is only available for these backends: [BackendSelect, Named, Autograd, Autocast, Batched, VmapMode].

BackendSelect: fallthrough registered at xplat/caffe2/aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Named: registered at xplat/caffe2/aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Autograd: fallthrough registered at xplat/caffe2/aten/src/ATen/core/VariableFallbackKernel.cpp:31 [backend fallback]
Autocast: fallthrough registered at xplat/caffe2/aten/src/ATen/autocast_mode.cpp:253 [backend fallback]
Batched: registered at xplat/caffe2/aten/src/ATen/BatchingRegistrations.cpp:317 [backend fallback]
VmapMode: fallthrough registered at xplat/caffe2/aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]

Exception raised from reportError at xplat/caffe2/aten/src/ATen/core/dispatch/OperatorEntry.cpp:261 (m
```

Reviewed By: iseeyuan

Differential Revision: D23428636

fbshipit-source-id: 82d5d9c054300dff18d144f264389402d0b55a8a
2020-09-03 10:54:43 -07:00
5973b44d9e Rename NewCriterionTest to CriterionTest. (#44056)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44056

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23482573

Pulled By: gchanan

fbshipit-source-id: dde0f1624330dc85f48e5a0b9d98fb55fdb72f68
2020-09-03 10:29:20 -07:00
7d95eb8633 [fbgemm] manual submodule update (#44082)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44082

The automated submodule update is running into some test failures and I am not sure how I can rebase it.

automated submodule update:
https://github.com/pytorch/pytorch/pull/43817

Test Plan: CI tests

Reviewed By: jianyuh

Differential Revision: D23489240

fbshipit-source-id: a49b01786ebf0a59b719a0abf22398e1eafa90af
2020-09-03 10:07:46 -07:00
c10f30647f Fix CUDA debug nightly build failure (#44085)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43607.
Tested in https://github.com/pytorch/pytorch/pull/44007.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44085

Reviewed By: malfet

Differential Revision: D23493663

Pulled By: ezyang

fbshipit-source-id: 4c01f3fc5a52814a23773a56b980c455851c2686
2020-09-03 09:12:52 -07:00
98320061ad DDP Communication hook: (Patch) Fix the way we pass future result to buckets. (#43734)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43734

Following the additional GH comments on the original PR https://github.com/pytorch/pytorch/pull/43307.
ghstack-source-id: 111327130

Test Plan: Run `python test/distributed/test_c10d.py`

Reviewed By: smessmer

Differential Revision: D23380288

fbshipit-source-id: 4b8889341c57b3701f0efa4edbe1d7bbc2a82ced
2020-09-03 08:59:10 -07:00
768c2b0fb2 Fix THPVariable_float_scalar (#43842)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43842

Reviewed By: ailzhang

Differential Revision: D23426892

Pulled By: ezyang

fbshipit-source-id: 63318721fb3f4a57d417f9a87e57c74f6d4e6e18
2020-09-03 08:39:41 -07:00
b6e2b1eac7 BatchedFallback: stop emitting the entire schema in the fallback warning (#44051)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44051

Instead, just emit the operator name. The entire schema is pretty wordy
and doesn't add any additional information.

Test Plan: - modified test: `pytest test/test_vmap.py -v`

Reviewed By: ezyang

Differential Revision: D23481184

Pulled By: zou3519

fbshipit-source-id: 9fbda61fc63565507b04c8b87e0e326a2036effa
2020-09-03 08:33:51 -07:00
cae52b4036 Merge CriterionTest into NewCriterionTest. (#44055)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44055

There is no functional change here.  Another patch will rename NewCriterionTest to CriterionTest.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23482572

Pulled By: gchanan

fbshipit-source-id: de364579067e2cc9de7df6767491f8fa3a685de2
2020-09-03 08:14:34 -07:00
15643de941 With fixes, Back out "Back out "Selective meta programming preparation for prim ops""
Summary: Original commit changeset: b2c712a512a2

Test Plan: CI

Reviewed By: jiatongzhou

Differential Revision: D23477710

fbshipit-source-id: 177ee56a82234376b7a5c3fc33441f8acfd59fea
2020-09-03 08:02:20 -07:00
24ca6aab02 Improves type-checking guards. (#43339)
Summary:
PR https://github.com/pytorch/pytorch/issues/38157 fixed type checking for mypy by including `if False` guards on some type-checker-only imports. However other typecheckers - [like pyright](https://github.com/microsoft/pylance-release/issues/262#issuecomment-677758245) - will respect this logic and ignore the imports. Using [`if TYPE_CHECKING`](https://docs.python.org/3/library/typing.html#typing.TYPE_CHECKING) instead means both mypy and pyright will work correctly.

[For background, an example of where the current code fails](https://github.com/microsoft/pylance-release/issues/262) is if you make a file `tmp.py` with the contents
```python
import torch
torch.ones((1,))
```
Then [`pyright tmp.py --lib`](https://github.com/microsoft/pyright#command-line) will fail with a `"ones" is not a known member of module` error. This is because it can't find the `_VariableFunctions.pyi` stub file, as pyright respects the `if False` logic. After adding the `TYPE_CHECKING` guard, all works correctly.

Credit to erictraut for suggesting the fix.
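
A minimal before/after sketch; the guarded module name is hypothetical:

```python
from typing import TYPE_CHECKING

# Old pattern: only mypy special-cases `if False:`, so pyright skips it.
if TYPE_CHECKING:
    # Evaluated by static type checkers only, never at runtime, and
    # respected by both mypy and pyright.
    import expensive_stub_only_module  # hypothetical module name
```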

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43339

Reviewed By: agolynski

Differential Revision: D23348142

Pulled By: ezyang

fbshipit-source-id: c8a58122a7b0016845c311da39a1cc48748ba03f
2020-09-03 07:45:53 -07:00
b6d5973e13 Delete THCStream.cpp (#43733)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43733

Reviewed By: malfet

Differential Revision: D23405121

Pulled By: ezyang

fbshipit-source-id: 95fa80b5dcb11abaf4d2507af15646a98029c80d
2020-09-03 07:41:24 -07:00
68a1fbe308 Allow criterion backwards test on modules requiring extra args (i.e. CTCLoss). (#44050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44050

We don't actually turn on the CTCLoss tests since they fail, but this allows you to toggle check_forward_only and for the code to actually run.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23481091

Pulled By: gchanan

fbshipit-source-id: f2a3b0a2dee27341933c5d25f1e37a878b04b9f6
2020-09-03 07:41:21 -07:00
5f89aa36cf Actually run backward criterion tests. (#44030)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44030

This looks to have been a mistake from https://github.com/pytorch/pytorch/pull/9287.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23476274

Pulled By: gchanan

fbshipit-source-id: 81ed9d0c9a40d49153fc97cd69fdcd469bec0c73
2020-09-03 07:39:13 -07:00
665feda15b Adds opinfo-based autograd tests and (un)supported dtype tests (#43451)
Summary:
This PR adds a new test suite, test_ops.py, designed for generic tests across all operators with OpInfos. It currently has two kinds of tests:

- it validates that the OpInfo has the correct supported dtypes by verifying that unsupported dtypes throw an error and supported dtypes do not
- it runs grad and gradgrad checks on each op and its variants (method and inplace) that has an OpInfo

This is a significant expansion and simplification of the current autogenerated autograd tests, which spend considerable time processing their inputs. As an alternative, this PR extends OpInfos with "SampleInputs" that are much easier to use. These sample inputs are analogous to the existing tuples in `method_tests()`.
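
A hedged sketch of the shape of such a sample input; the names below are illustrative stand-ins, not the exact test-suite classes:

```python
from collections import namedtuple
import torch

# Each OpInfo carries a callable producing ready-to-run inputs, replacing
# positional tuples that needed preprocessing before use.
SampleInput = namedtuple("SampleInput", ["input", "args", "kwargs"])

def sample_inputs_add(device, dtype):
    t = torch.ones(2, 2, device=device, dtype=dtype)
    return [SampleInput(t, args=(t,), kwargs={"alpha": 2})]

for s in sample_inputs_add("cpu", torch.float32):
    torch.add(s.input, *s.args, **s.kwargs)
```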

Future PRs will extend OpInfo-based testing to other uses of `method_tests()`, like test_jit.py, to ensure that new operator tests can be implemented entirely using an OpInfo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43451

Reviewed By: albanD

Differential Revision: D23481723

Pulled By: mruberry

fbshipit-source-id: 0c2cdeacc1fdaaf8c69bcd060d623fa3db3d6459
2020-09-03 02:50:48 -07:00
ab7606702c Rectified a few grammatical errors in documentation (#43695)
Summary:
Rectified a few grammatical errors in the PyTorch documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43695

Reviewed By: anjali411

Differential Revision: D23451600

Pulled By: ezyang

fbshipit-source-id: bc7b34c240fde1b31cac811080befa2ff2989395
2020-09-02 23:59:45 -07:00
40fec4e739 [TensorExpr] Fuser: do not fuse ops with 0-dim tensors. (#44073)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44073

We don't have proper support for it yet on the NNC side or in the JIT IR->NNC lowering.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D23487905

Pulled By: ZolotukhinM

fbshipit-source-id: da0da7478fc8ce7b455176c95d8fd610c94352c1
2020-09-02 22:59:04 -07:00
3da82aee03 [JIT] Remove profile nodes before BatchMM. (#43961)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43961

Currently we're removing prim::profile nodes and embedding the type info
directly in the IR right before the fuser, because it is difficult to
fuse in the presence of prim::profile nodes. It turns out that BatchMM has
a similar problem: it doesn't work when there are prim::profile nodes in
the graph. These two passes run next to each other, so we could simply
remove prim::profile nodes slightly earlier: before the BatchMM pass.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23453266

Pulled By: ZolotukhinM

fbshipit-source-id: 92cb50863962109b3c0e0112e56c1f2cb7467ff1
2020-09-02 22:57:39 -07:00
ae7699829c Remove THC max and min, which are no longer used (#43903)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43903

Reviewed By: smessmer

Differential Revision: D23493225

Pulled By: ezyang

fbshipit-source-id: bc89d8221f3351da0ef3cff468ffe6a91dae96a6
2020-09-02 22:05:05 -07:00
32e0cedc53 [ONNX] Move tests to test_pytorch_onnx_onnxruntime (#42684)
Summary:
Move tests to test_pytorch_onnx_onnxruntime from test_utility_fun

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42684

Reviewed By: smessmer

Differential Revision: D23480360

Pulled By: bzinodev

fbshipit-source-id: 8876ba0a0c3e1d7104511de7a5cca5262b32f574
2020-09-02 21:47:38 -07:00
bc45c47aa3 Expand the coverage of test_addmm and test_addmm_sizes (#43831)
Summary:
- This test is very fast and very important, so it makes no sense to mark it as slowTest
- This test should also run on CUDA
- This test should check alpha and beta support
- This test should check `out=` support
- manual computation should use list instead of index_put because list is much faster
- precision for TF32 needs to be fixed. Will do it in future PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43831

Reviewed By: ailzhang

Differential Revision: D23435032

Pulled By: ngimel

fbshipit-source-id: d1b8350addf1e2fe180fdf3df243f38d95aa3f5a
2020-09-02 20:51:49 -07:00
f5ba489f93 Move dependent configs to CUDA-10.2 (#44057)
Summary:
Move `multigpu`, `noavx` and `slow` test configs to CUDA-10.2, but keep them as master-only tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44057

Reviewed By: walterddr, seemethere

Differential Revision: D23482732

Pulled By: malfet

fbshipit-source-id: a6b050701cbc1d8f176ebb302f7f5076a78f1f58
2020-09-02 20:07:48 -07:00
a76a56d761 Add "torch/testing/_internal/data/*.pt" to .gitignore (#43941)
Summary:
I usually get this extra "legacy_conv2d.pt" file in my git "changed files". I found that this is from tests with `download_file`
42c895de4d/test/test_nn.py (L410-L426)

and its definition (see `data_dir` for download output location)
f17d7a5556/torch/testing/_internal/common_utils.py (L1338-L1357)

I assume a file "generated" by a test should not be tracked in VCS? Also, if the file is updated on the server, users may still use the old version if they have already downloaded it before.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43941

Reviewed By: anjali411

Differential Revision: D23451264

Pulled By: ezyang

fbshipit-source-id: 7fcdfb24685a7e483914cc46b3b024df798bf7f7
2020-09-02 20:00:31 -07:00
37658b144b Remove useless py2 compatibility import __future__, part 1 (#43808)
Summary:
To avoid conflicts, this PR does not remove all imports. More are coming in further PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43808

Reviewed By: wanchaol

Differential Revision: D23436675

Pulled By: ailzhang

fbshipit-source-id: ccc21a1955c244f0804277e9e47e54bfd23455cd
2020-09-02 19:15:11 -07:00
b2a9c3baa9 [TVM] Support fp16 weights in c2_frontend (#44070)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44070

Reviewed By: yinghai

Differential Revision: D23444253

fbshipit-source-id: 0bfa98172dfae835eba5ca7cbe30383ba964c2a6
2020-09-02 19:07:35 -07:00
b2aaf212aa [TensorExpr] Add option to enforce TensorExprKernel fallbacks. (#43972)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43972

It is useful when debugging to disable the NNC backend to see whether
the bug is there or in the fuser logic.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23455624

Pulled By: ZolotukhinM

fbshipit-source-id: f7c0452a29b860afc806e2d58acf35aa89afc060
2020-09-02 18:34:24 -07:00
6a6552576d rename _min_max to _aminmax (#44001)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44001

This is to align with the naming in numpy and in
https://github.com/pytorch/pytorch/pull/43092
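
A usage sketch, assuming the renamed private op keeps the full-reduction min/max-pair semantics of `_min_max`:

```python
import torch

x = torch.arange(6.0).reshape(2, 3)
# One fused reduction instead of separate x.min() and x.max() calls.
mn, mx = torch._aminmax(x)
print(mn.item(), mx.item())  # 0.0 5.0
```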

Test Plan:
```
python test/test_torch.py TestTorchDeviceTypeCPU.test_aminmax_cpu_float32
python test/test_torch.py TestTorchDeviceTypeCUDA.test_aminmax_cuda_float32
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23465298

fbshipit-source-id: b599035507156cefa53942db05f93242a21c8d06
2020-09-02 18:07:55 -07:00
486a9fdab2 _min_max.dim: CUDA implementation (#42943)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42943

Adds a CUDA kernel for _min_max_val.dim

Test Plan:
correctness:
```
python test/test_torch.py TestTorchDeviceTypeCUDA.test_minmax_cuda_float32
```

performance: ~50% savings on a tensor representative of quantization workloads: https://gist.github.com/vkuzo/3e16c645e07a79dd66bcd50629ff5db0

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23086797

fbshipit-source-id: 04a2d310f64a388d48ab8131538dbd287900ca4a
2020-09-02 18:07:51 -07:00
834279f4ab _min_max_val.dim: CPU implementation (#42894)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42894

Continuing the min_max kernel implementation, this PR adds the
CPU path when a dim is specified. The next PR will replicate this for CUDA.

Note: after a discussion with ngimel, we are taking the fast path
of calculating the values only and not the indices, since that is what
is needed for quantization; calculating indices would require support
for reductions with 4 outputs, which is additional work. So, the API
doesn't fully match `min.dim` and `max.dim` (see the sketch below).

Flexible on the name, let me know if something else is better.
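
A minimal sketch of the difference (the values-only call is hypothetical, named after this PR's title):

```
import torch

x = torch.randn(4, 8)

# The existing reductions return (values, indices) pairs along a dim.
min_vals, min_idxs = torch.min(x, dim=1)
max_vals, max_idxs = torch.max(x, dim=1)

# The op described here computes only the two value tensors in one pass:
# min_vals, max_vals = torch._min_max_val(x, dim=1)  # hypothetical call
```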

Test Plan:
correctness:
```
python test/test_torch.py TestTorchDeviceTypeCPU.test_minmax_cpu_float32
```

performance: seeing a 49% speedup on a min+max tensor with similar shapes
to what we care about for quantization observers (bench:
https://gist.github.com/vkuzo/b3f24d67060e916128a51777f9b89326). For
other shapes (more dims, different dim sizes, etc), I've noticed a
speedup as low as 20%, but we don't have a good use case to optimize
that so perhaps we can save that for a future PR.

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23086798

fbshipit-source-id: b24ce827d179191c30eccf31ab0b2b76139b0ad5
2020-09-02 18:07:47 -07:00
78994d165f min_max kernel: add CUDA (#42868)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42868

Adds a CUDA kernel for the _min_max function.

Note: this is a re-submit of https://github.com/pytorch/pytorch/pull/41805;
it was faster to resubmit than to resurrect that one. Thanks to durumu
for writing the original implementation!

Future PRs will add index support, docs, and hook this up to observers.

Test Plan:
```
python test/test_torch.py TestTorchDeviceTypeCUDA.test_minmax_cuda_float32
```

Basic benchmarking shows a 50% reduction in time to calculate min + max:
https://gist.github.com/vkuzo/b7dd91196345ad8bce77f2e700f10cf9

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23057766

fbshipit-source-id: 70644d2471cf5dae0a69343fba614fb486bb0891
2020-09-02 18:06:03 -07:00
33d51a9b32 Respect canFuseOn{CPU,GPU} in TE fuser (#43967)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43967

Test Plan: Imported from OSS

Reviewed By: asuhan

Differential Revision: D23469048

Pulled By: bertmaher

fbshipit-source-id: 1005a7ae08974059ff9d467492caa3a388070eeb
2020-09-02 18:00:25 -07:00
041573c8cd Add Cost Inference for AdaGrad and RowWiseSparseAdagrad
Summary: Add cost inference for AdaGrad and RowWiseSparseAdagrad

Test Plan:
Ran `buck test caffe2/caffe2/python/operator_test:adagrad_test`
Result: https://our.intern.facebook.com/intern/testinfra/testrun/5629499567799494

Reviewed By: bwasti

Differential Revision: D23442607

fbshipit-source-id: 67800fb82475696512ad19a43067774247f8b230
2020-09-02 17:52:40 -07:00
2f044d4ee5 Fix CI build (#44068)
Summary:
Some of our machines have only 1 device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44068

Reviewed By: wanchaol

Differential Revision: D23485730

Pulled By: izdeby

fbshipit-source-id: df6bc0aba18feefc50c56a8f376103352fa2a2ea
2020-09-02 17:09:30 -07:00
129f406062 Make torch.conj() a composite function and return self for real tensors (#43270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43270

`torch.conj` is a very commonly used operator for complex tensors, but it is mathematically a no-op for real tensors. Switching to TensorFlow-style gradients for complex tensors (as discussed in #41857) would involve adding `torch.conj()` to the backward definitions of many operators. In order to preserve autograd performance for real tensors and maintain NumPy compatibility for `torch.conj`, this PR updates `torch.conj()` so that it behaves the same for complex tensors but returns the `self` tensor for non-complex dtypes. The documentation states that the returned tensor for a real input shouldn't be mutated. We could perhaps return an immutable tensor for this case in the future when that functionality is available (zdevito ezyang).
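
A minimal sketch of the resulting behavior, assuming the semantics described above:

```
import torch

c = torch.tensor([1 + 1j, 2 - 2j])
r = torch.tensor([1.0, 2.0])

print(torch.conj(c))  # tensor([1.-1.j, 2.+2.j]): imaginary part negated

# For real dtypes conj is a mathematical no-op; per this PR the result
# aliases the input (so it should not be mutated).
print(torch.conj(r).data_ptr() == r.data_ptr())  # True under this PR
```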

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23460493

Pulled By: anjali411

fbshipit-source-id: 3b3bf0af55423b77ff2d0e29f5d2c160291ae3d9
2020-09-02 17:06:04 -07:00
f9efcb646b fx quant: clarify state in Quantizer object (#43927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43927

Adds uninitialized placeholders for various state
used throughout the Quantizer object, with documentation
on what they are. No logic change.

Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestQuantizeFx
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23439473

fbshipit-source-id: d4ae83331cf20d81a7f974f88664ccddca063ffc
2020-09-02 16:34:00 -07:00
f15e27265f [torch.fx] Add support for custom op (#43248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43248

We add support for `__torch_function__` overrides for C++ custom ops. The logic is the same as for the other components, like torch.nn.Module.
Refactored some code a little bit to make it reusable.

Test Plan: buck test //caffe2/test:fx -- test_torch_custom_ops

Reviewed By: bradleyhd

Differential Revision: D23203204

fbshipit-source-id: c462a86e407e46c777171da32d7a40860acf061e
2020-09-02 16:08:37 -07:00
7a77d1c5c2 [FX] Only copy over forward() from exec (#44006)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44006

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23466542

Pulled By: jamesr66a

fbshipit-source-id: 12a1839ddc65333e3e3d511eeb53206f06546a87
2020-09-02 15:35:49 -07:00
402e9953df [pytorch][bot] update mobile op deps (#44018)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44018

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23470528

Pulled By: ljk53

fbshipit-source-id: b677e1c5677fc8929713ee108df69098502c50ea
2020-09-02 14:34:33 -07:00
297c938729 Add _foreach_add(TensorList tl1, TensorList tl2) and _foreach_add_(TensorList tl1, TensorList tl2) APIs (#42533)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42533

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors: launching a lot of kernels slows down the whole process, so we need to reduce the number of kernel launches.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What is 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can fall back to a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. A few checks decide whether the op will be performed via the 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.

----------------
**In this PR**
- Adding a `_foreach_add(TensorList tl1, TensorList tl2)` API
- Adding a `_foreach_add_(TensorList tl1, TensorList tl2)` API
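
For illustration, a minimal usage sketch of the two APIs added here:

```
import torch

xs = [torch.ones(3) for _ in range(4)]
ys = [torch.full((3,), 2.0) for _ in range(4)]

out = torch._foreach_add(xs, ys)  # out-of-place: list of x + y tensors
torch._foreach_add_(xs, ys)       # in-place: each x is updated to x + y
```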

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists

**Plan for the next PRs**
1. APIs
- Binary Ops for list with Scalar
- Binary Ops for list with list
- Unary Ops for list
- Pointwise Ops

2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23331894

Pulled By: izdeby

fbshipit-source-id: 876dd1bc82750f609b9e3ba23c8cad94d8d6041c
2020-09-02 12:18:28 -07:00
f6f9d22228 [ONNX] Export KLDivLoss (#41858)
Summary:
Enable export for KLDivLoss

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41858

Reviewed By: mrshenli

Differential Revision: D22918004

Pulled By: bzinodev

fbshipit-source-id: e3debf77a4cf0eae0df6ed5a72ee91c43e482b62
2020-09-02 11:45:13 -07:00
4716284904 Update persons_of_interest.rst (#44031)
Summary:
Adding Geeta to the POI for TorchServe

cc chauhang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44031

Reviewed By: jspisak

Differential Revision: D23476439

Pulled By: soumith

fbshipit-source-id: 6936d46c201e1437143d85e1dce24da355857628
2020-09-02 10:56:27 -07:00
b167402e2e [redo] Fix SyncBatchNorm forward pass for non-default process group (#43861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43861

This is a redo of https://github.com/pytorch/pytorch/pull/38874, and
fixing my original bug from
https://github.com/pytorch/pytorch/pull/38246.

Test Plan:
CI

Imported from OSS

Reviewed By: supriyar

Differential Revision: D23418816

fbshipit-source-id: 2a3a3d67fc2d03bb0bf30a87cce4e805ac8839fb
2020-09-02 10:44:46 -07:00
544a56ef69 [JIT] Always map node output in vmap (#43988)
Summary:
Previously, when merging a node without a subgraph, we would map the node's outputs to the corresponding subgraph values, but when merging a node with a subgraph, the node's outputs would be absent from the value mapping. This PR makes it so they are included.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43988

Reviewed By: ZolotukhinM

Differential Revision: D23462116

Pulled By: eellison

fbshipit-source-id: 232c081261e9ae040df0accca34b1b96a5a5af57
2020-09-02 10:30:43 -07:00
276158fd05 .circleci: Remove un-needed steps from binary builds (#43974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43974

We already install devtoolset7 in our docker images for binary builds,
and tclsh shouldn't be needed since we're not relying on unbuffer
anymore.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D23462531

Pulled By: seemethere

fbshipit-source-id: 83cbb8b0782054f0b543dab8d11fa6ac57685272
2020-09-02 09:57:52 -07:00
73f009a2aa refactor manual function definitions (#43711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43711

This makes them available in forward if needed.

No change to the file content, just a copy-paste.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23454146

Pulled By: albanD

fbshipit-source-id: 6269a4aaf02ed53870fadf8b769ac960e49af195
2020-09-02 09:23:21 -07:00
a6789074fc Implement ChannelShuffle op with XNNPACK (#43602)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43602

Test Plan: Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D23334952

Pulled By: kimishpatel

fbshipit-source-id: 858ef3db599b1c521ba3a1855c9a3c35fe3b02b0
2020-09-02 09:18:25 -07:00
df8da5cb5a fx quant: make load_arg function more clear (#43923)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43923

Readability improvements to `Quantizer.convert.load_arg` to make
things easier to read:
1. add docblock
2. `arg` -> `arg_or_args`, to match what's actually happening
3. `loaded_arg` -> `loaded_args`, to match what's actually happening

Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestQuantizeFx
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23438745

fbshipit-source-id: f886b324d2e2e33458b72381499e37dccfc3bd30
2020-09-02 09:06:05 -07:00
77ef77e5fa fx quant: rename matches -> is_match (#43914)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43914

Renames the `matches` function to `is_match`, since there is also
a list named `matches` that we pass around in `Quantizer`,
and it would be good to reduce name conflicts.

Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23435601

fbshipit-source-id: 394af11e0120cfb07dedc79d5219247330d4dfd6
2020-09-02 09:06:01 -07:00
6f5282adc8 add quantization debug util to pretty print FX graphs (#43910)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43910

Adds a debug function to get a representation of all nodes in the
graph, such as

```
name          op      target         args               kwargs
x             plchdr  x              ()                 {}
linear_weight gt_prm  linear.weight  ()                 {}
add_1         cl_fun  <bi_fun add>   (x, linear_weight) {}
linear_1      cl_mod  linear         (add_1,)           {}
relu_1        cl_meth relu           (linear_1,)        {}
sum_1         cl_fun  <bi_meth sum>  (relu_1,)          {'dim': -1}
topk_1        cl_fun  <bi_meth topk> (sum_1, 3)         {}
```

using only the Python standard library. This is useful for printing the
internal state of graphs when working on FX code.

Has some on-by-default logic to shorten things so that node reprs for
toy models and unit tests fit into 80 chars.

Flexible on function name and location, I care more that this is
accessible from both inside PT as well as from debug scripts which
are not checked in.
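
For reference, a minimal sketch of a similar printout; the formatting is illustrative, not this PR's exact helper:

```
import torch
import torch.fx

class M(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x + 1).sum(dim=-1)

gm = torch.fx.symbolic_trace(M())
# One row per node: name, op kind, target, and positional args.
for node in gm.graph.nodes:
    print(f"{node.name:12} {node.op:15} {str(node.target):24} {node.args}")
```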

Test Plan:
see
https://gist.github.com/vkuzo/ed0a50e5d6dc7442668b03bb417bd603 for
example usage

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23435029

fbshipit-source-id: 1a2df797156a19cedd705e9e700ba7098b5a1376
2020-09-02 09:04:44 -07:00
b6b5ebc345 Add torch.vdot (#43004)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42747
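
For reference, a minimal usage sketch: unlike `torch.dot`, `vdot` conjugates its first argument (matching NumPy's `np.vdot`):

```
import torch

a = torch.tensor([1 + 2j, 3 - 1j])
b = torch.tensor([2 - 1j, 1 + 1j])

torch.vdot(a, b)  # sum(conj(a) * b)
torch.dot(a, b)   # sum(a * b), no conjugation
```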

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43004

Reviewed By: mruberry

Differential Revision: D23318935

Pulled By: anjali411

fbshipit-source-id: 12d4824b7cb42bb9ca703172c54ec5c663d9e325
2020-09-02 09:00:30 -07:00
14ebb2c67c Allow no-bias MKLDNN Linear call (#43703)
Summary:
MKLDNN linear incorrectly assumes that bias is defined and will fail for no-bias calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43703

Reviewed By: glaringlee

Differential Revision: D23373182

Pulled By: bwasti

fbshipit-source-id: 1e817674838a07d237c02eebe235c386cf5b191e
2020-09-02 08:54:50 -07:00
c88ac25679 Check for internal memory overlap in some indexing-type functions (#43423)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43423

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23298652

Pulled By: zou3519

fbshipit-source-id: c13c59aec0c6967ef0d6365d782c1f4c98c04227
2020-09-02 08:51:50 -07:00
5807bb92d3 TensorIteratorConfig: Check memory overlap by default (#43422)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43422

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D23298653

Pulled By: zou3519

fbshipit-source-id: a7b66a8a828f4b35e31e8be0c07e7fe9339181f2
2020-09-02 08:50:29 -07:00
cd58114c6c Adjust level of verbosity of debug dumps in graph executor T74227880 (#43682)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43682

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23397980

Pulled By: Lilyjjo

fbshipit-source-id: b0114efbd63b2a29eb14086b0a8963880023c2a8
2020-09-02 08:45:16 -07:00
8722952dbd Add benchmark for channel_shuffle operator (#43509)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43509

Test Plan: Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D23299972

Pulled By: kimishpatel

fbshipit-source-id: 6189d209859da5a41067eb9e8317e3bf7a0fc754
2020-09-02 08:15:19 -07:00
6512032699 [Static Runtime] Add OSS build for static runtime benchmarks (#43881)
Summary:
Adds CMake option.  Build with:

```
BUILD_STATIC_RUNTIME_BENCHMARK=ON python setup.py install
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43881

Reviewed By: hlu1

Differential Revision: D23430708

Pulled By: bwasti

fbshipit-source-id: a39bf54e8d4d044a4a3e4273a5b9a887daa033ec
2020-09-02 08:00:18 -07:00
c61a16b237 Kill dead code in common_nn as part of merging Criterion and NewCriterionTests. (#43956)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43956

See https://github.com/pytorch/pytorch/pull/43769 and https://github.com/pytorch/pytorch/pull/43776 for proof this code is dead.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23452217

Pulled By: gchanan

fbshipit-source-id: 6850aab2daaa1c321a6b7714f6f113f364f41973
2020-09-02 07:54:05 -07:00
95f912ab13 Use NewCriterionTest in test_cpp_api_parity.py. (#43954)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43954

CriterionTest is basically dead -- see https://github.com/pytorch/pytorch/pull/43769 and https://github.com/pytorch/pytorch/pull/43776.

The only exception is the cpp parity test, but the difference there doesn't actually have any effect -- get_target has unpack=True, but none of the examples require unpacking (I checked).

As a pre-requisite for merging these tests, have the cpp parity test start using the NewCriterionTest.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23452144

Pulled By: gchanan

fbshipit-source-id: 5dca1eb0878b882c93431d3b0e880b5bb1764522
2020-09-02 07:53:03 -07:00
4bb5d33076 is_numpy_scalar should also consider bool and complex types (#43644)
Summary:
Before this PR,

```python
import torch
import numpy as np

a = torch.tensor([1, 2], dtype=torch.bool)
c = np.array([1, 2], dtype=np.bool)
print(a[0] == c[0])

a = torch.tensor([1, 2], dtype=torch.complex64)
c = np.array([1, 2], dtype=np.complex64)
print(a[0] == c[0])

# This case is still broken
a = torch.tensor([1 + 1j, 2 + 2j], dtype=torch.complex64)
c = np.array([1 + 1j, 2 + 2j], dtype=np.complex64)
print(a[0] == c[0])
```

outputs

```
False
False
False
```

After this PR, it outputs:

```
tensor(True)
/home/user/src/pytorch/torch/tensor.py:25: ComplexWarning: Casting complex values to real discards the imaginary part return f(*args, **kwargs)
tensor(True)
tensor(False)
```

Related issue: https://github.com/pytorch/pytorch/issues/43579

cc anjali411 mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43644

Reviewed By: ailzhang

Differential Revision: D23425569

Pulled By: anjali411

fbshipit-source-id: a868209376b30cea601295e54015c47803923054
2020-09-02 07:41:50 -07:00
7000c2efb5 [2/2][PyTorch][Mobile] Added mobile module metadata logging (#43853)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43853

Add QPL logging for mobile module's metadata
ghstack-source-id: 111113492

(Note: this ignores all push blocking failures!)

Test Plan:
- CI

- Load the model trained by `mobile_model_util.py`

- Local QPL logger standard output.
{F319012106}

Reviewed By: xcheng16

Differential Revision: D23417304

fbshipit-source-id: 7bc834f39e616be1eccfae698b3bccdf2f7146e5
2020-09-01 22:27:10 -07:00
1dd658f28f [Codemod][GleanFbcode] Remove dead includes in caffe2/test (#43953)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43953

Reviewed By: malfet

Differential Revision: D23445556

fbshipit-source-id: 89cd6833aa06f35c5d3c99d698abb08cd61ae4ab
2020-09-01 21:48:28 -07:00
c259146477 add missing NEON {vld1,vst1}_*_x2 intrinsics (#43683)
Summary:
Workaround for issue https://github.com/pytorch/pytorch/issues/43265.
Add the missing intrinsics until gcc-7 gets the missing patches backported.

Fixes https://github.com/pytorch/pytorch/issues/43265.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43683

Reviewed By: albanD

Differential Revision: D23467867

Pulled By: malfet

fbshipit-source-id: 7c138dd3de3c45852a60f2cfe8b4d7f7cf76bc7e
2020-09-01 21:19:39 -07:00
137a4fcc3b Back out "Selective meta programming preparation for prim ops"
Summary:
The diff D22618309 (bacee6aa2e) breaks CYA ACP e2e tests. (https://www.internalfb.com/intern/ods/chart/?rapido=%7B%22queries%22%3A[%7B%22entity%22%3A%22regex(assistant%5C%5C.cya%5C%5C..*acp.*)%2C%5Cn%2C%20!regex(assistant%5C%5C.cya%5C%5C..*fair.*)%2C%22%2C%22key%22%3A%22overview.pct_passed_x_1000%2C%22%2C%22transform%22%3A%22formula(%2F%20%241%201000.0)%2C%22%2C%22reduce_keys%22%3Atrue%2C%22datatypes%22%3A[%22raw%22]%2C%22reduce%22%3A%22%22%2C%22id%22%3A%22ds1%22%2C%22source%22%3A%22ods%22%2C%22active%22%3Atrue%7D]%2C%22period%22%3A%7B%22minutes_back%22%3A720%2C%22time_type%22%3A%22dynamic%22%7D%7D&view=%7B%22type%22%3A%22line_chart_client%22%2C%22params%22%3A%7B%22title%22%3A%22Pass%20Rates%20of%20All%20Continuous%20Runs%20in%20PROD%22%2C%22haspoints%22%3Afalse%2C%22state%22%3A%22published%22%2C%22title_use_v2%22%3Atrue%2C%22tooltip_outside%22%3Atrue%2C%22series_names_preg_replace_list%22%3A[%7B%22series_name_preg_replace_list_group%22%3Anull%2C%22pattern%22%3A%22%2Fassistant%5C%5C.cya%5C%5C.(%5C%5Cw%2B)%5C%5C.([%5E%3A]%2B)%3A%3A.*%2F%22%2C%22replacement%22%3A%22%241%2F%242%22%7D]%2C%22sort_by_series_name%22%3A%22ASC%22%2C%22use_y_axis_hints_as_limits%22%3Atrue%7D%7D&version=2)

So I back out the diff.

Test Plan:
```
cya test -n aloha.acp.arv2.prod --tp ~/tmp/cyaTests/assistant/cya/aloha_acp/whatsapp_call_who_ondevice_oacr.yaml --device_no_new_conn --retries 0
Installing: finished in 13.4 sec
More details at https://www.internalfb.com/intern/buck/build/c48882e8-1032-43ca-ba8f-8
Running "aloha.acp.arv2.prod (acp)" [1 tests] with endpoint "https://prod.facebookvirtualassistant.com"
.
  %100.0 tests passed:  1/1
  Avg turn duration:    12.6s
  P99 turn duration:    24.4s
  CTP report:  https://our.intern.facebook.com/intern/testinfra/testrun/2814749804232321

[jaeholee@32384.od ~/fbsource (7934576f)]$
```

Differential Revision: D23464555

fbshipit-source-id: b2c712a512a207c4813585f4ee57fdb5607317c6
2020-09-01 21:05:45 -07:00
263412e536 Rename is_complex_t -> is_complex (#39906)
Summary:
`is_complex_t` is a bad name. For example, in std there is `std::is_same` but no `std::is_same_t`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39906

Reviewed By: mrshenli

Differential Revision: D22665013

Pulled By: anjali411

fbshipit-source-id: 4b71745f5e2ea2d8cf5845d95ada4556c87e040d
2020-09-01 21:04:19 -07:00
9db90fe1f3 [TensorExpr] Remove unused functions in kernel.cpp (#43966)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43966

Test Plan: build.

Reviewed By: ZolotukhinM

Differential Revision: D23456660

Pulled By: asuhan

fbshipit-source-id: c13411b61cf62dd5d038e7246f79a8682822b472
2020-09-01 20:25:16 -07:00
8fd9fe93be [quant][graphmode][fx] Support dynamic quantization without calibration (#43952)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43952

Run the weight observer for dynamic quantization before inserting quant/dequant nodes

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D23452123

fbshipit-source-id: c322808fa8025bbadba36c2e5ab89f59e85de468
2020-09-01 19:09:48 -07:00
fbea2ee917 broadcast_object API for c10d (#43887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43887

As part of addressing #23232, this PR adds support for `broadcast_object_list`, which is an API to broadcast arbitrary picklable objects to all the other ranks. This has been a long-requested feature, so it would be good for PyTorch to support it natively.

The implementation approach follows a similar approach as https://github.com/pytorch/pytorch/pull/42189. The input is a list of objects to be broadcast, and the operation is in place, meaning all ranks in the group will have their input list modified to contain the broadcast objects from the src rank.

Note that the API is designed to match the tensor-based collectives, other than supporting async_op. For now, it is a blocking call. If we see demand for async_op support, we will have to make more progress on merging work/future to support it.
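
A minimal usage sketch, assuming a process group has already been initialized:

```
import torch.distributed as dist

if dist.get_rank() == 0:
    objects = [{"lr": 0.1}, "hello", 42]   # picklable objects on src rank
else:
    objects = [None, None, None]           # placeholders, same length

dist.broadcast_object_list(objects, src=0)
# After the call, every rank sees [{"lr": 0.1}, "hello", 42] in place.
```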
ghstack-source-id: 111180436

Reviewed By: mrshenli

Differential Revision: D23422577

fbshipit-source-id: fa700abb86eff7128dc29129a0823e83caf4ab0e
2020-09-01 18:54:17 -07:00
4134b7abfa Pass CC env variable as ccbin argument to nvcc (#43931)
Summary:
This is the common behavior when one builds PyTorch (or any other CUDA project) using CMake, so it should hold true for Torch CUDA extensions as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43931

Reviewed By: ezyang, seemethere

Differential Revision: D23441793

Pulled By: malfet

fbshipit-source-id: 1af392107a94840331014fda970ef640dc094ae4
2020-09-01 17:26:08 -07:00
0ffe3d84d5 [quant][graphmode][fx] Support dynamic quantization without calibration (#43892)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43892

Run the weight observer in the convert function, so users do not need to run calibration

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23429758

fbshipit-source-id: 5bc222e3b731789ff7a86463c449690a58dffb7b
2020-09-01 17:01:48 -07:00
d15b9d980c [quant][graphmode][fx][refactor] Move patterns to separate files (#43891)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43891

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23429759

fbshipit-source-id: f19add96beb7c8bac323ad78f74588ca1393040c
2020-09-01 16:37:33 -07:00
8d53df30ea [FX] Better error when unpacking Proxy (#43740)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43740

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D23380964

Pulled By: jamesr66a

fbshipit-source-id: 9658ef1c50d0f9c4de38781a7485002487f6d3f7
2020-09-01 16:28:50 -07:00
ec7f14943c [OSS] Update README.md -- Explain more complex arguments and functionalities
Summary: Update `README.md` for OSS to explain the usage of `--run`, `--export`, and `--summary`

Test Plan: Test locally.

Reviewed By: malfet

Differential Revision: D23431508

fbshipit-source-id: 368b8dd8cd5099f39c7f5bc985203c417bf7af39
2020-09-01 16:10:33 -07:00
e49dd9fa05 Delete raise_from from torch._six (#43981)
Summary:
No need for a compatibility wrapper in the Python 3 world

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43981

Reviewed By: seemethere

Differential Revision: D23458325

Pulled By: malfet

fbshipit-source-id: 00f822895625f4867c22376fe558c50316f5974d
2020-09-01 15:46:18 -07:00
5e97f251a8 Enable TF32 support for cuDNN (#40737)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40737
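
For context, a minimal sketch of the user-facing toggle (assuming the `torch.backends.cudnn.allow_tf32` flag associated with this change):

```
import torch

# TF32 in cuDNN convolutions is gated by a global flag.
torch.backends.cudnn.allow_tf32 = True   # allow TF32 math on Ampere+
torch.backends.cudnn.allow_tf32 = False  # force full FP32 precision
```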

Reviewed By: mruberry

Differential Revision: D22801525

Pulled By: ngimel

fbshipit-source-id: ac7f7e728b4b3e01925337e8c9996f26a6433fd2
2020-09-01 15:34:24 -07:00
93fbbaab2a Update README.md in oss (#43893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43893

Update `README.md` in OSS: provide more examples, starting from the most common use and moving to more specialized uses. Make `README.md` friendlier and more specific.

Test Plan: `README.md` doesn't need test.

Reviewed By: malfet, seemethere

Differential Revision: D23420203

fbshipit-source-id: 1a4c146393fbcaf2893321e7892740edf5d0c248
2020-09-01 14:58:28 -07:00
24eea364f7 Check SparseAdam params are dense on init (#41966) (#43668)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41966

Raises a ValueError if a user attempts to create a SparseAdam optimizer with sparse parameter tensors.
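
A minimal sketch of the new check, assuming the error type described above:

```
import torch

p = torch.randn(4, 4).to_sparse().requires_grad_()
try:
    torch.optim.SparseAdam([p])
except ValueError:
    print("sparse parameters are rejected at construction time")
```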

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43668

Reviewed By: glaringlee

Differential Revision: D23388109

Pulled By: ranman

fbshipit-source-id: 1fbcc7527d49eac6fae9ce51b3307c609a6ca38b
2020-09-01 14:25:59 -07:00
bacee6aa2e Selective meta programming preparation for prim ops (#43540)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43540

selected_mobile_ops.h, which contains the whitelist of root operators, is generated at BUCK build time. It's used for templated selective build when XPLAT_MOBILE_BUILD is defined.

ghstack-source-id: 111014372

Test Plan: CI and BSB

Reviewed By: ljk53

Differential Revision: D22618309

fbshipit-source-id: ddf813904892f99c3f4ae0cd14ce8b27727be5a2
2020-09-01 13:51:44 -07:00
a1a23669f2 [FX] Pickle serialization of GraphModule via forward source (#43674)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43674

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Differential Revision: D23362396

Pulled By: jamesr66a

fbshipit-source-id: cb8181edff70643b7bbe548cc6b0957328d4eedd
2020-09-01 13:31:18 -07:00
73f7d63bc9 [FX] Support tensor-valued constants (#43666)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43666

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Differential Revision: D23359110

Pulled By: jamesr66a

fbshipit-source-id: 8569a2db0ef081ea7d8e81d7ba26a92bc12ed423
2020-09-01 13:30:04 -07:00
06c277f38e [TVM] Support slice op (#43969)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43969

Reviewed By: yinghai

Differential Revision: D23413340

fbshipit-source-id: 20168bd573b81ce538e3589b72aba9590c3c055e
2020-09-01 12:34:30 -07:00
5472426b9f Reset DataLoader workers instead of creating new ones (#35795)
Summary:
This PR needs discussion as it changes the behavior of `DataLoader`. It can be closed if it's not considered good practice.

Currently, the `DataLoader` spawns a new `_BaseDataLoaderIter` object every epoch.
In the case of the multiprocess DataLoader, every epoch the worker processes are re-created and they make a copy of the original `Dataset` object.
If users want to cache data or do some tracking on their datasets, all their data will be wiped out every epoch. Notice that this doesn't happen when the number of workers is 0, giving an inconsistency between the multiprocess and serial data loaders.

This PR keeps the `_BaseDataLoaderIter` object alive and just resets it between epochs, so the workers remain active, as do their own `Dataset` objects. People seem to file issues about this often.
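
A minimal sketch of the inconsistency described above (hypothetical dataset, illustrating why per-worker caches are lost):

```
import torch
from torch.utils.data import Dataset, DataLoader

class CachingDataset(Dataset):
    def __init__(self):
        self.cache = {}
    def __len__(self):
        return 8
    def __getitem__(self, i):
        if i not in self.cache:
            self.cache[i] = torch.randn(2)  # stands in for an expensive load
        return self.cache[i]

ds = CachingDataset()
loader = DataLoader(ds, num_workers=2)
for epoch in range(2):
    for _ in loader:
        pass
# With num_workers=0 the cache survives across epochs; with workers that
# are re-created each epoch, their copies of the cache are thrown away.
```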

Pull Request resolved: https://github.com/pytorch/pytorch/pull/35795

Reviewed By: ailzhang

Differential Revision: D23426612

Pulled By: VitalyFedyunin

fbshipit-source-id: e16950036bae35548cd0cfa78faa06b6c232a2ea
2020-09-01 11:48:00 -07:00
db6bd9d60b rename input argument interested-folder to interest-only -- be consistent with other arguments (#43889)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43889

1. Rename input argument `interested-folder` to `interest-only` -- consistent with `run-only` and `coverage-only`, and shorter

Test Plan: Test on devserver and linux docker.

Reviewed By: malfet

Differential Revision: D23417338

fbshipit-source-id: ce9711e75ca3a1c30801ad6bd1a620f3b06819c5
2020-09-01 11:46:23 -07:00
bc64efae48 Back out "Revert D19987020: [pytorch][PR] Add the sls tensor train op" (#43938)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43938

resubmit

Test Plan: unit test included

Reviewed By: mruberry

Differential Revision: D23443493

fbshipit-source-id: 7b68f8f7d1be58bee2154e9a498b5b6a09d11670
2020-09-01 11:42:12 -07:00
7035cd0f84 Revert D23216393: Support work.result() to get result tensors for allreduce for Gloo, NCCL backends
Test Plan: revert-hammer

Differential Revision:
D23216393 (0b2694cd11)

Original commit changeset: fed5e37fbabb

fbshipit-source-id: 27fbeb1617066fa3f271a681cb089622027d6689
2020-09-01 10:32:38 -07:00
63a0bb0ab9 Add typing annotations for torch.nn.quantized.dynamic.modules.rnn (#43186)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43185

xref: [gh-43072](https://github.com/pytorch/pytorch/issues/43072)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43186

Reviewed By: ezyang

Differential Revision: D23441259

Pulled By: malfet

fbshipit-source-id: 80265ae7f3a70f0087e620969dbd4aa8ca17c317
2020-09-01 10:25:10 -07:00
8ca3913f47 Introduce BUILD_CAFFE2 flag (#43673)
Summary:
Introduce the BUILD_CAFFE2 flag; it defaults to `ON`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43673

Reviewed By: malfet

Differential Revision: D23381035

Pulled By: walterddr

fbshipit-source-id: 1f4582987fa0c4a911f0b18d311c04fdbf8dd8f0
2020-09-01 10:18:23 -07:00
76ca365661 [pytorch][bot] update mobile op deps (#43937)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43937

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23443927

Pulled By: ljk53

fbshipit-source-id: 526ca08dfb5bd32527bff98b243da90dbbf2ea49
2020-09-01 10:07:52 -07:00
e3cb582e05 Error printing extension support for multiline errors (#43807)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43807

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23407457

Pulled By: Lilyjjo

fbshipit-source-id: 05a6a50dc39c00474d9087ef56028a2c183aa53a
2020-09-01 10:02:43 -07:00
224232032c Move Autograd to an alias dispatch key (#43070)
Summary:
This PR moves `DispatchKey::Autograd` to an alias dispatch key mapping to `AutogradCPU, AutogradCUDA, AutogradXLA, AutogradOther, AutogradPrivate*` keys.

A few things are handled in this PR:
- Update alias dispatch key mapping and precompute dispatchTable logic
- Move `Autograd` key from `always_included` set to TensorImpl constructor.
- Update `dummyTensor` constructor to take `requires_grad` as optional argument so that it's closer to the real application in op_registration_test.
- Use `BackendSelect` key for both backend select before and after autograd layer. (1 liner in backend_select codegen)

A few planned followups ordered by priority:
- [cleanup] Update `test_dispatch.py` to include testing `Autograd`.
- [cleanup] Add Math alias key and move catchAll to Math. (to remove 2.2 in `computeDispatchTableEntryWithDebug`)
- [new feature] Add support for Math in native_functions.yaml
- [cleanup] Add iterator like functionality to DispatchKeySet
- [cleanup/large] Only add Autograd backend keys when tensor requires grad. (cc: ljk53 ?)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43070

Reviewed By: ezyang

Differential Revision: D23281535

Pulled By: ailzhang

fbshipit-source-id: 9ad00b17142e9b83304f63cf599f785500f28f71
2020-09-01 09:05:29 -07:00
13a48ac1f3 MaxPool1d without indices optimization (#43745)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43745

This is part of a larger effort to refactor and optimize the pooling code. Previously I started working on MaxPool2d here https://github.com/pytorch/pytorch/pull/43267 but since it uses MaxPool1d as a subroutine, it made more sense to work on 1D first and get it tested and optimized and then move up to 2D and then 3D.

Below are some benchmarking results, the python script I used is under the results.

## Benchmarking
```
Name (time in us)                            Min                   Max                Mean             StdDev              Median                 IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_googlenet[(3, 2, 0, 1, 0)-new]      79.7659 (1.03)     1,059.6327 (5.32)      90.6280 (1.01)     19.1196 (1.41)      84.2176 (1.01)       2.4289 (1.0)     1079;2818       11.0341 (0.99)       9055           1
test_googlenet[(3, 2, 0, 1, 0)-old]     505.1531 (6.55)       830.8962 (4.17)     563.4763 (6.29)     65.3974 (4.81)     538.3361 (6.43)      80.5371 (33.16)      242;99        1.7747 (0.16)       1742           1
test_googlenet[(3, 2, 0, 1, 1)-new]      80.2949 (1.04)       233.0020 (1.17)      97.6498 (1.09)     19.1228 (1.41)      89.2282 (1.07)      18.5743 (7.65)     1858;741       10.2407 (0.92)       9587           1
test_googlenet[(3, 2, 0, 1, 1)-old]     513.5350 (6.66)       977.4677 (4.91)     594.4559 (6.63)     69.9372 (5.15)     577.9080 (6.90)      79.8218 (32.86)      503;84        1.6822 (0.15)       1675           1
test_googlenet[(3, 2, 1, 1, 0)-new]      77.1061 (1.0)        199.1168 (1.0)       89.6529 (1.0)      13.5864 (1.0)       83.7557 (1.0)        7.5139 (3.09)    1419;1556       11.1541 (1.0)        7434           1
test_googlenet[(3, 2, 1, 1, 0)-old]     543.6055 (7.05)       964.5708 (4.84)     636.9867 (7.11)     84.0732 (6.19)     616.7777 (7.36)     100.4562 (41.36)      434;65        1.5699 (0.14)       1552           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_inception[(3, 2, 0, 1, 0)-new]      84.5827 (1.00)       184.2827 (1.0)       90.5438 (1.01)      9.6324 (1.0)       89.3027 (1.05)      4.5672 (1.03)      637;759       11.0444 (0.99)       6274           1
test_inception[(3, 2, 0, 1, 0)-old]     641.2268 (7.59)     1,704.8977 (9.25)     686.9383 (7.65)     57.2499 (5.94)     682.5905 (8.01)     58.3753 (13.17)       86;21        1.4557 (0.13)        802           1
test_inception[(3, 2, 0, 1, 1)-new]      84.5008 (1.0)      1,093.6335 (5.93)      89.8233 (1.0)      14.0443 (1.46)      85.2682 (1.0)       4.4331 (1.0)      802;1106       11.1330 (1.0)        9190           1
test_inception[(3, 2, 0, 1, 1)-old]     643.7078 (7.62)       851.4188 (4.62)     687.4905 (7.65)     41.1116 (4.27)     685.1386 (8.04)     60.2733 (13.60)      286;14        1.4546 (0.13)       1300           1
test_inception[(3, 2, 1, 1, 0)-new]     106.0739 (1.26)       258.5649 (1.40)     115.3597 (1.28)     17.5436 (1.82)     106.9643 (1.25)      5.5470 (1.25)     894;1402        8.6685 (0.78)       7635           1
test_inception[(3, 2, 1, 1, 0)-old]     651.0504 (7.70)       955.2278 (5.18)     698.0295 (7.77)     45.5097 (4.72)     692.8109 (8.13)     64.6794 (14.59)      145;15        1.4326 (0.13)        909           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_large_batch_size[new]       2.9608 (1.0)        5.1127 (1.0)        3.3096 (1.0)      0.1936 (1.0)        3.3131 (1.0)      0.2093 (1.0)          71;6  302.1515 (1.0)         297           1
test_large_batch_size[old]     130.6583 (44.13)    152.9521 (29.92)    137.1385 (41.44)    7.4352 (38.40)    135.1784 (40.80)    5.1358 (24.53)         1;1    7.2919 (0.02)          7           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_large_channel_size[new]      2.9696 (1.0)       5.5595 (1.0)       3.5997 (1.0)      0.5836 (1.0)       3.3497 (1.0)      0.3445 (1.0)         58;54  277.8014 (1.0)         277           1
test_large_channel_size[old]     19.6838 (6.63)     22.6637 (4.08)     21.1775 (5.88)     0.8610 (1.48)     21.3739 (6.38)     1.4930 (4.33)         13;0   47.2199 (0.17)         36           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_large_width[new]      1.7714 (1.0)       2.4104 (1.0)       1.8988 (1.0)      0.0767 (1.0)       1.8911 (1.0)      0.0885 (1.0)         86;13  526.6454 (1.0)         373           1
test_large_width[old]     19.5708 (11.05)    22.8755 (9.49)     20.7987 (10.95)    0.7009 (9.14)     20.6623 (10.93)    0.8584 (9.70)         14;1   48.0799 (0.09)         46           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_multithreaded[new]      15.0560 (1.0)       24.2891 (1.0)       16.1627 (1.0)      1.5657 (1.0)       15.7182 (1.0)      0.7598 (1.0)           4;6  61.8709 (1.0)          65           1
test_multithreaded[old]     115.7614 (7.69)     120.9670 (4.98)     118.3004 (7.32)     1.6259 (1.04)     118.4164 (7.53)     1.9613 (2.58)          2;0   8.4531 (0.14)          8           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
```

### Benchmarking script
To run the benchmark make sure you have pytest-benchmark installed with `pip install pytest-benchmark` and use the following command: `pytest benchmark.py --benchmark-sort='name'`

```
import torch
import pytest

def _test_speedup(benchmark, batches=1, channels=32, width=32,
                  kernel_size=2, stride=None, padding=0, dilation=1, ceil_mode=False, return_indices=False):
    torch.set_num_threads(1)
    x = torch.randn((batches, channels, width))
    model = torch.nn.MaxPool1d(kernel_size, stride, padding, dilation, return_indices, ceil_mode)
    benchmark(model, x)

pytest.mark.benchmark(group="inception")
pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
pytest.mark.parametrize("params", [(3, 2), (3, 2, 0, 1, True), (3, 2, 1)],
                         ids=["(3, 2, 0, 1, 0)",
                              "(3, 2, 0, 1, 1)",
                              "(3, 2, 1, 1, 0)"])
def test_inception(benchmark, params, return_indices):
    _test_speedup(benchmark, 10, 64, 147, *params, return_indices=return_indices)

pytest.mark.benchmark(group="googlenet")
pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
pytest.mark.parametrize("params", [(3, 2), (3, 2, 0, 1, True), (3, 2, 1)],
                         ids=["(3, 2, 0, 1, 0)",
                              "(3, 2, 0, 1, 1)",
                              "(3, 2, 1, 1, 0)"])
def test_googlenet(benchmark, params, return_indices):
    _test_speedup(benchmark, 10, 64, 112, *params, return_indices=return_indices)

pytest.mark.benchmark(group="large batch size")
pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_large_batch_size(benchmark, return_indices):
    _test_speedup(benchmark, 100000, 1, 32, return_indices=return_indices)

pytest.mark.benchmark(group="large channel size")
pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_large_channel_size(benchmark, return_indices):
    _test_speedup(benchmark, 1, 100000, 32, return_indices=return_indices)

pytest.mark.benchmark(group="large width")
pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_large_width(benchmark, return_indices):
    _test_speedup(benchmark, 1, 32, 100000, return_indices=return_indices)

pytest.mark.benchmark(group="multithreading")
pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_multithreaded(benchmark, return_indices):
    x = torch.randn((40, 10000, 32))
    model = torch.nn.MaxPool1d(2, return_indices=return_indices)
    benchmark(model, x)
```

## Discussion

The new algorithm is on average 7x faster than the old one. But because the old algorithm had many issues with how it parallelized the code and made use of the cache, one can come up with input parameters (like large batch size) that will make the new algorithm much faster than the original one.

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D23425348

Pulled By: heitorschueroff

fbshipit-source-id: 3fa3f9b8e71200da48424a95510124a83f50d7b2
2020-09-01 08:40:01 -07:00
a044c039c0 updated documentation to streamline setup (#42850)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42850

Reviewed By: mrshenli

Differential Revision: D23449055

Pulled By: osandoval-fb

fbshipit-source-id: 6db695d4fe5f6d9b7bb2895c85c855db4779516b
2020-09-01 08:25:48 -07:00
b1f19c20d6 Run function check and out check in TestTensorDeviceOps (#43830)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43830

Reviewed By: ailzhang

Differential Revision: D23438101

Pulled By: mruberry

fbshipit-source-id: b581ce779ea2f50ea8dfec51d5469031ec7a0a67
2020-09-01 08:21:53 -07:00
9b98bcecfa torch.cat and torch.stack batching rules (#43798)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43798

These are relatively straightforward.

Test Plan: - `pytest test/test_vmap.py -v`

Reviewed By: ezyang

Differential Revision: D23405000

Pulled By: zou3519

fbshipit-source-id: 65c78da3dee43652636bdb0a65b636fca69e765d
2020-09-01 08:12:46 -07:00
dbc4218f11 Batching rules for: torch.bmm, torch.dot (#43781)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43781

Test Plan: - `pytest test/test_vmap.py -v`

Reviewed By: ezyang

Differential Revision: D23400843

Pulled By: zou3519

fbshipit-source-id: a901bba6dc2d8435d314cb4dac85bbd5cd4ee2a5
2020-09-01 08:12:43 -07:00
fa12e225d3 Batching rule for torch.mv (#43780)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43780

The general strategy is:
- unsqueeze the physical inputs enough
- pass the unsqueezed physical inputs to at::matmul
- squeeze any extra dimensions
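
A minimal sketch of that strategy using plain ops (shapes are illustrative):

```
import torch

B, n, m = 4, 3, 5
mats = torch.randn(B, n, m)   # batched matrix operand
vecs = torch.randn(B, m)      # batched vector operand

# unsqueeze -> matmul -> squeeze, mirroring the steps above
out = torch.matmul(mats, vecs.unsqueeze(-1)).squeeze(-1)
assert out.shape == (B, n)
```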

Test Plan: - `pytest test/test_vmap.py -v`

Reviewed By: ezyang

Differential Revision: D23400842

Pulled By: zou3519

fbshipit-source-id: c550eeb935747c08e3b083609ed307a4374b9096
2020-09-01 08:12:41 -07:00
2789a4023b TestVmapOperators: add structured tests that batching rules get invoked (#43731)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43731

After this PR, for each test in TestVmapOperators, TestVmapOperators
tests that the test never invokes the slow vmap fallback path. The
rationale behind this change is that TestVmapOperators is used for
testing batching rules and we want confidence that the batching rules
actually get invoked.

We set this up using a similar mechanism to the CUDA memory leak check:
(bff741a849/torch/testing/_internal/common_utils.py (L506-L511))

This PR also implements the batching rule for `to.dtype_layout`; the new
testing caught that we were testing vmap on `to.dtype_layout` but it
didn't actually have a batching rule implemented!

Test Plan: - New tests in `pytest test/test_vmap.py -v` that test the mechanism.

Reviewed By: ezyang

Differential Revision: D23380729

Pulled By: zou3519

fbshipit-source-id: 6a4b97a7fa7b4e1c5be6ad80d6761e0d5b97bb8c
2020-09-01 08:11:35 -07:00
0b2694cd11 Support work.result() to get result tensors for allreduce for Gloo, NCCL backends (#43386)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43386

Resolves #43178

ghstack-source-id: 111109716

Test Plan: Added checks to existing unit test and ran it on gpu devserver.

Reviewed By: rohan-varma

Differential Revision: D23216393

fbshipit-source-id: fed5e37fbabbd2ac4a9055b20057fffe3c416c0b
2020-09-01 08:05:55 -07:00
a67246b2d4 Add reduction string test for ctc_loss. (#43884)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43884

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23427907

Pulled By: gchanan

fbshipit-source-id: 889bd92e9d3e0528b57e3952fc83e25bc7abe293
2020-09-01 07:01:54 -07:00
fab012aa28 Revert "Added support for Huber Loss (#37599)" (#43351)
Summary:
This reverts commit 11e5174926d807a540fc7b54fb45a26ec0c5d9c0 due to [comment](https://github.com/pytorch/pytorch/pull/37599#pullrequestreview-471950192).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43351

Reviewed By: pbelevich, seemethere

Differential Revision: D23249511

Pulled By: vincentqb

fbshipit-source-id: 18b8b346f00eaf0ef7376b06579d404a84add4de
2020-09-01 06:34:26 -07:00
c14a3613a8 Fix NaN propagation in TE fuser's min/max implementation (#43609)
Summary:
Per the eager-mode source of truth, NaNs must be propagated by min/max.
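
For reference, the eager-mode behavior being matched:

```
import torch

x = torch.tensor([1.0, float("nan"), 3.0])
print(torch.max(x))  # tensor(nan): NaN propagates
print(torch.min(x))  # tensor(nan)
```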

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43609

Reviewed By: ZolotukhinM

Differential Revision: D23349184

Pulled By: bertmaher

fbshipit-source-id: 094eb8b89a02b27d5ecf3988d0f473c0f91e4afb
2020-09-01 02:10:13 -07:00
820c4b05a9 [ONNX] Update slice symbolic function (#42935)
Summary:
During scripting, a combination of shape (or size()) and slice (e.g. x.shape[2:]) produces the following error:
 slice() missing 1 required positional argument: 'step'
This happens because aten::slice has 2 signatures:

- aten::slice(Tensor self, int dim, int start, int end, int step) -> Tensor
- aten::slice(t[] l, int start, int end, int step) -> t[]

and when a list is passed instead of a tensor, the second of the two slice signatures is called; since it has 4 instead of 5 arguments, it produces the above exception.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42935

Reviewed By: houseroad

Differential Revision: D23398435

Pulled By: bzinodev

fbshipit-source-id: 4151a8f878c520cea199b265973fb476b17801fe
2020-09-01 02:08:48 -07:00
f1624b82b5 Preserve python backtrace in autograd engine errors. (#43684)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43684

This PR attempts to address #42560 by capturing the appropriate
exception_ptr in the autograd engine and passing it over to the Future.

As part of this change, there is a significant change to the Future API: we
now only accept an exception_ptr as part of setError.

For the example in #42560, the exception trace would now look like:

```
> Traceback (most recent call last):
>   File "test_autograd.py", line 6914, in test_preserve_backtrace
>     Foo.apply(t).sum().backward()
>   File "torch/tensor.py", line 214, in backward
>     torch.autograd.backward(self, gradient, retain_graph, create_graph)
>   File "torch/autograd/__init__.py", line 127, in backward
>     allow_unreachable=True)  # allow_unreachable flag
>   File "torch/autograd/function.py", line 87, in apply
>     return self._forward_cls.backward(self, *args)
>   File "test_autograd.py", line 6910, in backward
>     raise ValueError("something")
> ValueError: something
```
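
For reference, a minimal repro consistent with that trace (the forward body is assumed; the other names are taken from the trace above):

```
import torch

class Foo(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.clone()

    @staticmethod
    def backward(ctx, grad):
        raise ValueError("something")

t = torch.randn(3, requires_grad=True)
Foo.apply(t).sum().backward()  # now surfaces the Python traceback above
```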
ghstack-source-id: 111109637

Test Plan: waitforbuildbot

Reviewed By: albanD

Differential Revision: D23365408

fbshipit-source-id: 1470c4776ec8053ea92a6ee1663460a3bae6edc5
2020-09-01 01:28:47 -07:00
825c109eb7 [reland][quant][graphmode][fx] Add support for weight prepack folding (#43728) (#43902)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43902

Trace back from the weight node until we hit getattr, reconstruct the graph module with the traced nodes,
and run the graph module to pack the weight. Then replace the original chain of ops with the packed weight.

Test Plan:
Imported from OSS

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23432431

fbshipit-source-id: 657f21a8287494f7f87687a9d618ca46376d3aa3
2020-09-01 00:26:19 -07:00
6da26cf0d9 Update torch.range warning message regarding the removal version number (#43569)
Summary:
`torch.range` still hasn't been removed, long after version 0.5. This PR fixes the warning message. Alternatively, we could remove `torch.range`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43569

Reviewed By: ngimel

Differential Revision: D23408233

Pulled By: mruberry

fbshipit-source-id: 86c4f9f018ea5eddaf80b78a3c54dfa41cfc6fa6
2020-08-31 22:23:32 -07:00
85d91a3230 [TensorExpr] Check statements in test_kernel.cpp (#43911)
Summary:
Check statements and fix all the warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43911

Test Plan: test_tensorexpr

Reviewed By: ZolotukhinM

Differential Revision: D23441092

Pulled By: asuhan

fbshipit-source-id: f671eef4b4eb9b51acb15054131152ae650fedbd
2020-08-31 22:16:25 -07:00
f229d2c07b Revert D23335106: [quant][graphmode][fix] Fix insert quant dequant for observers without qparams
Test Plan: revert-hammer

Differential Revision:
D23335106 (602209751e)

Original commit changeset: 84af2884d521

fbshipit-source-id: 8d227fe2048b532016407d8ecfbaa6ffd1c313fd
2020-08-31 22:12:37 -07:00
69080e9e7e simplify profile text output by displaying only top-level ops statistics (#42262)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42262

Test Plan:
Imported from OSS
```
==================================================================================================================================================================================
TEST
-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
Name                           Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Input Shapes
-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
aten::add_                     3.61%            462.489us        3.61%            462.489us        462.489us        1                [[3, 20], [3, 20], []]
aten::slice                    1.95%            249.571us        1.95%            250.018us        250.018us        1                [[3, 80], [], [], [], []]
aten::lstm                     1.89%            242.534us        22.41%           2.872ms          2.872ms          1                [[5, 3, 10], [], [], [], [], [], [], [], []]
aten::lstm                     1.68%            215.852us        18.18%           2.330ms          2.330ms          1                [[5, 3, 10], [], [], [], [], [], [], [], []]
aten::lstm                     1.68%            215.767us        18.49%           2.370ms          2.370ms          1                [[5, 3, 10], [], [], [], [], [], [], [], []]
aten::lstm                     1.60%            205.014us        20.15%           2.582ms          2.582ms          1                [[5, 3, 10], [], [], [], [], [], [], [], []]
aten::lstm                     1.55%            198.213us        18.53%           2.375ms          2.375ms          1                [[5, 3, 10], [], [], [], [], [], [], [], []]
aten::addmm                    0.95%            122.359us        1.01%            129.857us        129.857us        1                [[80], [3, 20], [20, 80], [], []]
aten::stack                    0.29%            36.745us         0.63%            80.179us         80.179us         1                [[], []]
aten::add_                     0.28%            35.694us         0.28%            35.694us         35.694us         1                [[3, 20], [3, 20], []]
-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
Self CPU time total: 12.817ms

-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
Name                           Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Input Shapes
-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
aten::mul                      11.45%           1.467ms          12.88%           1.651ms          11.006us         150              [[3, 20], [3, 20]]
aten::lstm                     8.41%            1.077ms          97.76%           12.529ms         2.506ms          5                [[5, 3, 10], [], [], [], [], [], [], [], []]
aten::addmm                    7.65%            979.982us        11.38%           1.459ms          29.182us         50               [[80], [3, 20], [20, 80], [], []]
aten::sigmoid_                 6.78%            869.295us        9.74%            1.249ms          8.327us          150              [[3, 20]]
aten::add_                     5.82%            745.801us        5.82%            745.801us        14.916us         50               [[3, 20], [3, 20], []]
aten::slice                    5.58%            715.532us        6.61%            847.445us        4.237us          200              [[3, 80], [], [], [], []]
aten::unsafe_split             4.24%            544.015us        13.25%           1.698ms          33.957us         50               [[3, 80], [], []]
aten::tanh                     3.11%            398.881us        6.05%            775.024us        15.500us         50               [[3, 20]]
aten::empty                    3.04%            389.055us        3.04%            389.055us        1.319us          295              [[], [], [], [], [], []]
aten::sigmoid                  2.96%            379.686us        2.96%            379.686us        2.531us          150              [[3, 20], [3, 20]]
-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
Self CPU time total: 12.817ms

==================================================================================================================================================================================
TEST
==================================================================================================================================================================================
This report only display top-level ops statistics
-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
Name                           Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Input Shapes
-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
aten::lstm                     1.89%            242.534us        22.41%           2.872ms          2.872ms          1                [[5, 3, 10], [], [], [], [], [], [], [], []]
aten::lstm                     1.68%            215.852us        18.18%           2.330ms          2.330ms          1                [[5, 3, 10], [], [], [], [], [], [], [], []]
aten::lstm                     1.68%            215.767us        18.49%           2.370ms          2.370ms          1                [[5, 3, 10], [], [], [], [], [], [], [], []]
aten::lstm                     1.60%            205.014us        20.15%           2.582ms          2.582ms          1                [[5, 3, 10], [], [], [], [], [], [], [], []]
aten::lstm                     1.55%            198.213us        18.53%           2.375ms          2.375ms          1                [[5, 3, 10], [], [], [], [], [], [], [], []]
-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
Self CPU time total: 12.817ms

==================================================================================================================================================================================
This report only display top-level ops statistics
-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
Name                           Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Input Shapes
-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
aten::lstm                     8.41%            1.077ms          97.76%           12.529ms         2.506ms          5                [[5, 3, 10], [], [], [], [], [], [], [], []]
-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
Self CPU time total: 12.817ms

Total time based on python measurements:  13.206ms
CPU time measurement python side overhead: 3.03%
```

Reviewed By: ilia-cher

Differential Revision: D22830328

Pulled By: ilia-cher

fbshipit-source-id: c9a71be7b23a8f84784117c788faa43caa96f545
2020-08-31 21:41:40 -07:00
d7ee84c9b5 Update determinism documentation (#41692)
Summary:
Add user-facing documentation for set_deterministic
Also update grammar and readability in Reproducibility page

Issue https://github.com/pytorch/pytorch/issues/15359
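
A minimal usage sketch of the API being documented (hedged; `torch.set_deterministic` is the name current at the time of this commit and was later renamed):

```python
import torch

# Ask PyTorch to error out on ops known to be nondeterministic,
# instead of silently producing run-to-run differences.
torch.set_deterministic(True)

# Determinism still requires seeding the RNGs yourself.
torch.manual_seed(0)
```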

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41692

Reviewed By: ailzhang

Differential Revision: D23433061

Pulled By: mruberry

fbshipit-source-id: 4c4552950803c2aaf80f7bb4792d2095706d07cf
2020-08-31 21:06:24 -07:00
69fbc705d8 Remaining changes of #43578 (#43921)
Summary:
Not all of https://github.com/pytorch/pytorch/issues/43578 was merged. This PR contains the remaining part.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43921

Reviewed By: ailzhang

Differential Revision: D23438504

Pulled By: mruberry

fbshipit-source-id: 9c5e26346dfc423b7a440b8a986420a27349090f
2020-08-31 20:42:07 -07:00
3c2f6d2ecf [caffe2] Extend dedup SparseAdagrad fusion with stochastic rounding FP16 (#43124)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43124

Add stochastic rounding FP16 support for the dedup version of the SparseAdagrad fusion.
ghstack-source-id: 111037723

Test Plan:
```
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_sum_gradient \(caffe2\.caffe2\.fb\.net_transforms\.tests\.fuse_sparse_ops_test\.TestFuseSparseOps\)' --print-passing-details
```

https://our.intern.facebook.com/intern/testinfra/testrun/5629499566042000

```
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_mean_gradient \(caffe2\.caffe2\.fb\.net_transforms\.tests\.fuse_sparse_ops_test\.TestFuseSparseOps\)' --print-passing-details
```

https://our.intern.facebook.com/intern/testinfra/testrun/1125900076333177

Reviewed By: xianjiec

Differential Revision: D22893851

fbshipit-source-id: 81c7a7fe4b0d2de0e6b4fc965c5d23210213c46c
2020-08-31 20:35:22 -07:00
f17d7a5556 Fix exception chaining in torch/ (#43836)
Summary:
## Motivation
Fixes https://github.com/pytorch/pytorch/issues/43770.

## Description of the change
This PR fixes exception chaining only in files under `torch/` where appropriate.
To fix exception chaining, I used either:
1. `raise new_exception from old_exception` where `new_exception` itself seems not descriptive enough to debug or `old_exception` delivers valuable information.
2. `raise new_exception from None` where raising both of `new_exception` and `old_exception` seems a bit noisy and redundant.
I subjectively chose which one to use from the above options.
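
For illustration, a minimal sketch of the two styles (a generic example, not one of the torch/ call sites listed below; `some_optional_dep` is a hypothetical module):

```python
# Style 1: keep the original exception as context, since it aids debugging.
try:
    import some_optional_dep  # hypothetical optional dependency
except ImportError as e:
    raise RuntimeError("feature X requires some_optional_dep") from e

# Style 2: suppress the original when chaining it would just be noise.
try:
    value = int("not-a-number")
except ValueError:
    raise TypeError("expected a numeric string") from None
```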

## List of lines containing raise in except clause:
I wrote [this simple script](https://gist.github.com/akihironitta/4223c1b32404b36c1b349d70c4c93b4d) using [ast](https://docs.python.org/3.8/library/ast.html#module-ast) to list lines where `raise`ing in `except` clause.

- [x] 000739c31a/torch/jit/annotations.py (L35)
- [x] 000739c31a/torch/jit/annotations.py (L150)
- [x] 000739c31a/torch/jit/annotations.py (L158)
- [x] 000739c31a/torch/jit/annotations.py (L231)
- [x] 000739c31a/torch/jit/_trace.py (L432)
- [x] 000739c31a/torch/nn/utils/prune.py (L192)
- [x] 000739c31a/torch/cuda/nvtx.py (L7)
- [x] 000739c31a/torch/utils/cpp_extension.py (L1537)
- [x] 000739c31a/torch/utils/tensorboard/_pytorch_graph.py (L292)
- [x] 000739c31a/torch/utils/data/dataloader.py (L835)
- [x] 000739c31a/torch/utils/data/dataloader.py (L849)
- [x] 000739c31a/torch/utils/data/dataloader.py (L856)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L186)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L189)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L424)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L1279)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L1283)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L1356)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L1388)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L1391)
- [ ] 000739c31a/torch/testing/_internal/common_utils.py (L1412)
- [x] 000739c31a/torch/testing/_internal/codegen/random_topo_test.py (L310)
- [x] 000739c31a/torch/testing/_internal/codegen/random_topo_test.py (L329)
- [x] 000739c31a/torch/testing/_internal/codegen/random_topo_test.py (L332)
- [x] 000739c31a/torch/testing/_internal/jit_utils.py (L183)
- [x] 000739c31a/torch/testing/_internal/common_nn.py (L4789)
- [x] 000739c31a/torch/onnx/utils.py (L367)
- [x] 000739c31a/torch/onnx/utils.py (L659)
- [x] 000739c31a/torch/onnx/utils.py (L892)
- [x] 000739c31a/torch/onnx/utils.py (L897)
- [x] 000739c31a/torch/serialization.py (L108)
- [x] 000739c31a/torch/serialization.py (L754)
- [x] 000739c31a/torch/distributed/rpc/_testing/faulty_agent_backend_registry.py (L76)
- [x] 000739c31a/torch/distributed/rpc/backend_registry.py (L260)
- [x] 000739c31a/torch/distributed/distributed_c10d.py (L184)
- [x] 000739c31a/torch/_utils_internal.py (L57)
- [x] 000739c31a/torch/hub.py (L494)
- [x] 000739c31a/torch/contrib/_tensorboard_vis.py (L16)
- [x] 000739c31a/torch/distributions/lowrank_multivariate_normal.py (L100)
- [x] 000739c31a/torch/distributions/constraint_registry.py (L142)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43836

Reviewed By: ailzhang

Differential Revision: D23431212

Pulled By: malfet

fbshipit-source-id: 5f7f41b391164a5ad0efc06e55cd58c23408a921
2020-08-31 20:26:23 -07:00
da32bf4cc6 Move type annotations for remaining torch.utils stub files inline (#43406)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43406

Reviewed By: mruberry

Differential Revision: D23319736

Pulled By: malfet

fbshipit-source-id: e25fbb49f27aa4893590b022441303d6d98263a9
2020-08-31 18:44:09 -07:00
602209751e [quant][graphmode][fix] Fix insert quant dequant for observers without qparams (#43606)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43606

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23335106

fbshipit-source-id: 84af2884d52118c069fc43a9f166dc336a8a87c8
2020-08-31 18:27:53 -07:00
7db7da7151 [reland][quant][graphmode][fx] Add top level APIs (#43581) (#43901)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43901

Add APIs similar to those of eager mode and graph mode on TorchScript:
- fuse_fx
- quantize_fx (for both post training static and qat)
- quantize_dynamic_fx (for post training dynamic)
- prepare_fx (for both post training static and qat)
- prepare_dynamic_fx (for post training dynamic)
- convert_fx (for all modes)

Test Plan:
Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23432430

fbshipit-source-id: fc99eb75cbecd6ee7a3aa6c8ec71cd499ff7e3c1
2020-08-31 18:24:26 -07:00
deb5fde51c [TensorExpr] Make KernelSumMultipleAxes much faster (#43905)
Summary:
Reduce input size, skip the dtype conversion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43905

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.KernelSum*

Reviewed By: ailzhang

Differential Revision: D23433398

Pulled By: asuhan

fbshipit-source-id: 0d95ced3c1382f10595a9e5745bf4bef007cc913
2020-08-31 17:58:43 -07:00
ee53a335c0 [ONNX] Floordiv (#43022)
Summary:
Add export of floordiv op

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43022

Reviewed By: houseroad

Differential Revision: D23398493

Pulled By: bzinodev

fbshipit-source-id: f929a88b3bc0c3867e8fbc4e50afdf0c0c71553d
2020-08-31 17:54:40 -07:00
f73ba88946 Avoid resizing in MinMaxObserver (#43789)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43789

Since it's a single element, resizing is unnecessary; in some cases we may not be able to resize the
buffers anyway.

Test Plan: unit tests

Reviewed By: supriyar

Differential Revision: D23393108

fbshipit-source-id: 46cd7f73ed42a05093662213978a01ee726433eb
2020-08-31 17:41:39 -07:00
98b846cd1d [JIT] Remove loop peeling from the profiling executor pipeline. (#43847)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43847

It seems to slow down two fastRNN benchmarks and does not speed up others.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23416197

Pulled By: ZolotukhinM

fbshipit-source-id: 598144561979e84bcf6bccf9b0ca786f5af18383
2020-08-31 17:26:55 -07:00
d69d603061 [JIT] Specialize autograd zero: actually remove the original graph after we created its versioned copy. (#43900)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43900

The original code assumed that the versioning if was inserted at the
beginning of the graph, while in fact it was inserted at the end. We're
now also not removing `profile_optional` nodes, relying on DCE to clean
them up later (the reason we're not removing them is that deletion could
invalidate the insertion point being used).

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23432175

Pulled By: ZolotukhinM

fbshipit-source-id: 1bf55affaa3f17af1bf71bad3ef64edf71a3e3fb
2020-08-31 17:26:51 -07:00
f150f924d3 [JIT] Specialize autograd zero: fix the guarding condition. (#43846)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43846

We are looking for tensors that are expected to be undefined (according
to the profile info) and should be checking for them to satisfy the
following condition: "not(have any non-zero)", which is equivalent to
"tensor is all zeros". The issue was that we've been checking tensors
that were expected *not* to be undefined.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23416198

Pulled By: ZolotukhinM

fbshipit-source-id: 71e22f552680f68f2af29f427b7355df9b1a4278
2020-08-31 17:25:50 -07:00
9b820fe904 Fix ImportError in the OSS land. (#43912)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43912

Fixed the ImportError: cannot import name 'compute_ulp_error' from 'caffe2.python.oss.fakelowp.test_utils'

Test Plan: test_op_nnpi_fp16.py

Reviewed By: hyuen

Differential Revision: D23435218

fbshipit-source-id: be0b240ee62090d06fdc8efac85fb1c32803da0d
2020-08-31 16:48:54 -07:00
7137327646 log message at per-test level for perfpipe_pytorch_test_times (#43752)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43752

Test Plan:
{F315930458}

{F315930459}

Reviewed By: walterddr, malfet

Differential Revision: D23387998

Pulled By: dhuang29

fbshipit-source-id: 2da8b607c049a6f8f21d98dbb25e664ea6229f27
2020-08-31 16:22:44 -07:00
4c19a1e350 Move torch/autograd/grad_mode.pyi stubs inline (#43415)
Summary:
- Add `torch._C` bindings from `torch/csrc/autograd/init.cpp`
- Renamed `torch._C.set_grad_enabled` to `torch._C._set_grad_enabled`
  so it doesn't conflict with torch.set_grad_enabled anymore

This is a continuation of gh-38201. All I did was resolve merge conflicts and finish the annotation of `_DecoratorContextManager.__call__` that ezyang started in the first commit.

~Reverts commit b5cd3a80bbc, which was only motivated by not having `typing_extensions` available.~ (JIT can't be made to understand `Literal[False]`, so keep as is).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43415

Reviewed By: ngimel

Differential Revision: D23301168

Pulled By: malfet

fbshipit-source-id: cb5290f2e556b4036592655b9fe54564cbb036f6
2020-08-31 16:14:41 -07:00
e941a462a3 Enable gcc coverage in OSS (#43883)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43883

Check the result of GCC coverage in OSS is reasonable and ready to ship.

The number of executable lines is not the same between `gcc` and `clang` for the following reasons:
* The following lines are counted in `clang` but not in `gcc`:
1. empty lines, or lines with only "{" or "}"
2. some comments
3. `#define ...` -- not supported by gcc according to the official documentation

* Besides, a statement that spans more than one line is counted as only one executable line in gcc, but as several lines in clang

## Advantage of `gcc` coverage
1. Much faster
- the code coverage tool's runtime is only **4 min** (*amazing!*) with `gcc`, compared to **3 hours!!** with `clang`, to analyze all the tests' artifacts
2. Uses less disk
- `Clang`'s artifacts take up as much as 170G, but `GCC`'s take 980M

Besides, also update `README.md`.

Test Plan:
Compare the result in OSS `clang` and OSS `gcc` with the same command:
```
python oss_coverage.py --run-only atest test_nn.py --interested-folder=aten
```

----

## GCC
**Summary**
> time: 0:15:45
summary percentage: 44.85%

**Report and Log**
[File Coverage Report](P140825162)
[Line Coverage Report](P140825196)
[Log](P140825385)

------

## CLANG

**Summary**
> time: 0:21:35
summary percentage: 44.08%

**Report and Log**
[File Coverage Report](P140825845)
[Line Coverage Report](P140825923)
[Log](P140825950)

----------

# Run all tests
```
# run all tests and get coverage over Pytorch
python oss_coverage.py
```
**Summary**
> time: 1:27:20. ( time to run tests:  1:23:33)
summary percentage: 56.62%

**Report and Log**
[File Coverage Report](P140837175)
[Log](P140837121)

Reviewed By: malfet

Differential Revision: D23416772

fbshipit-source-id: a6810fa4d8199690f10bd0a4f58a42ab2a22182b
2020-08-31 16:11:33 -07:00
da0e93a8c3 Move fbcode related coverage code to fb/ folder and add TARGETS (#43800)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43800

1. Move fbcode-related coverage code to the fb/ folder and add TARGETS so that we can use buck run to run the tool, solving the import problem.

2. Write `README.md` to give users guidance about the tool

Test Plan:
On devserver:
```
buck run //caffe2/fb/code_coverage/tool:coverage -- //caffe2/c10:
```

More examples in README.md

Reviewed By: malfet

Differential Revision: D23404988

fbshipit-source-id: 4942cd0e0fb7bd28a5e884d9835b93f00adb7b92
2020-08-31 16:10:33 -07:00
3682df77db Implementing NumPy-like function torch.heaviside() (#42523)
Summary:
- Related with https://github.com/pytorch/pytorch/issues/38349
- Implementing the NumPy-like function `torch.heaviside()`.
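
A quick usage sketch: the second argument supplies the output value where the input is exactly zero, matching NumPy's semantics:

```python
import torch

x = torch.tensor([-1.5, 0.0, 2.0])
torch.heaviside(x, torch.tensor(0.5))
# tensor([0.0000, 0.5000, 1.0000]) -- 0 below zero, `values` at zero, 1 above
```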

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42523

Reviewed By: ngimel

Differential Revision: D23416743

Pulled By: mruberry

fbshipit-source-id: 9975bd9c9fa73bd0958fe9879f79a692aeb722d5
2020-08-31 15:54:56 -07:00
7680d87a76 Let linspace support bfloat16 and complex dtypes (#43578)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43578
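
A hedged usage sketch of the newly supported dtypes (real endpoints are used here, with `dtype` selecting the bfloat16/complex result type):

```python
import torch

torch.linspace(0, 1, steps=5, dtype=torch.bfloat16)
torch.linspace(0, 1, steps=5, dtype=torch.cfloat)  # complex result type
```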

Reviewed By: malfet

Differential Revision: D23413690

Pulled By: mruberry

fbshipit-source-id: 8c24f7b054269e1317fe53d26d523fea4decb164
2020-08-31 14:54:22 -07:00
3278beff44 Skip target determination for codecov test (#43899)
Summary:
Python code coverage tests should not rely on target determination as it will negatively impact the coverage score

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43899

Reviewed By: seemethere

Differential Revision: D23432069

Pulled By: malfet

fbshipit-source-id: 341fcadafaab6bd96d33d23973e01f7d421a6593
2020-08-31 14:43:12 -07:00
ffca81e38b [pytorch][bot] update mobile op deps (#43871)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43871

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23422523

Pulled By: ljk53

fbshipit-source-id: 95f2a1b6a2d25b13618c65944a2b919922083fb8
2020-08-31 14:42:12 -07:00
4e4626a23d Join-based API to support DDP uneven inputs (#42577)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42577

Closes https://github.com/pytorch/pytorch/issues/38174. Implements a join-based API to support training with the DDP module in the scenario where different processes have different no. of inputs. The implementation follows the description in https://github.com/pytorch/pytorch/issues/38174. Details are available in the RFC, but as a summary, we make the following changes:

#### Approach
1) Add a context manager `torch.nn.parallel.distributed.join`
2) In the forward pass, we schedule a "present" allreduce where non-joined processes contribute 1 and joined processes contribute 0. This lets us keep track of joined processes and know when all procs are joined.
3) When a process depletes its input and exits the context manager, it enters "joining" mode and attempts to "shadow" the collective comm. calls made in the model's forward and backward pass. For example, we schedule the same allreduces in the same order as the backward pass, but with zeros.
4) We adjust the allreduce division logic to divide by the effective world size (no. of non-joined procs) rather than the absolute world size to maintain correctness.
5) At the end of training, the last joined process is selected to be the "authoritative" model copy

We also make some misc. changes such as adding a `rank` argument to `_distributed_broadcast_coalesced` and exposing some getters/setters on `Reducer` to support the above changes.
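
As a hedged sketch (not code from this PR), usage under the new context manager might look like the following; `net`, `rank`, and `local_batches` are hypothetical placeholders set up elsewhere, and the manager is shown as a method on the DDP instance:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes init_process_group has already run on every rank.
model = DDP(net.to(rank), device_ids=[rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

with model.join():                      # joined ranks shadow collective calls
    for inputs in local_batches:        # per-rank batch counts may differ
        optimizer.zero_grad()
        model(inputs).sum().backward()  # allreduce divides by unjoined ranks
        optimizer.step()
```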

#### How is it tested?
We have tests covering the following models/scenarios:
- [x] Simple linear model
- [x] Large convolutional model
- [x] Large model with module buffers that are broadcast in the forward pass (resnet). We verify this with a helper function `will_sync_module_buffers` and ensure this is true for ResNet (due to batchnorm)
- [x] Scenario where a rank calls join() without iterating at all, so without rebuilding buckets (which requires collective comm)
- [x] Model with unused params (with find unused parameters=True)
- [x] Scenarios where different processes iterate for a varying number of different iterations.
- [x] Test consistency in tie-breaking when multiple ranks are the last ones to join
- [x] Test that we divide by the effective world_size (no. of unjoined processes)

#### Performance implications

###### Trunk vs PR patched, 32 GPUs, batch size = 32
P50, forward + backward + optimizer batch latency & total QPS: 0.121 264/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.087 369/s vs 0.087 368/s

###### join(enable=True) vs without join, 32 GPUs, batch size = 32, even inputs
P50, forward + backward + optimizer batch latency & total QPS: 0.120 265/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.088 364/s vs 0.087 368/s

###### join(enable=False) vs without join, 32 GPUs, batch size = 32, even inputs
P50 forward + backward + optimizer batch latency & total QPS: 0.121 264/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.087 368/s vs 0.087 368/s

###### join(enable=True) with uneven inputs (offset = 2000), 32 GPUs, batch size = 32
P50 forward + backward + optimizer batch latency & total QPS: 0.183 174/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.150 213/s vs 0.087 368/s

###### join(enable=True) with uneven inputs ((offset = 2000)), 8 GPUs, batch size = 32
P50 forward + backward + optimizer batch latency & total QPS: 0.104 308/s vs 0.104 308/s
P50 backwards only batch latency & total QPS: 0.070 454/s vs 0.070 459/s

The uneven-inputs benchmark above was conducted with 32 GPUs, with 4 GPUs immediately depleting their inputs and entering "join" mode (i.e. not iterating at all) while the other 28 iterated as normal. It looks like there is a pretty significant perf hit for this case when there are uneven inputs and multi-node training. Strangely, this does not reproduce on a single node (8 GPUs).

#### Limitations
1) This is only implemented for MPSD, not SPMD. Per a discussion with mrshenli we want to encourage the use of MPSD over SPMD for DDP.
2) This does not currently work with SyncBN or custom collective calls made in the model's forward pass. This is because the `join` class only shadows the `broadcast` for buffers in the forward pass, the gradient allreduces in the bwd pass, unused parameters reduction, and (optionally) the rebuild buckets broadcasting in the backwards pass. Supporting this will require additional design thought.
3) Has not been tested with the [DDP comm. hook](https://github.com/pytorch/pytorch/issues/39272) as this feature is still being finalized/in progress. We will add support for this in follow up PRs.
ghstack-source-id: 111033819

Reviewed By: mrshenli

Differential Revision: D22893859

fbshipit-source-id: dd02a7aac6c6cd968db882c62892ee1c48817fbe
2020-08-31 13:29:03 -07:00
2f52748515 Publish all_gather_object and gather_object docs (#43772)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43772
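
A minimal sketch of the newly documented collectives, which gather arbitrary picklable Python objects rather than tensors (assumes a process group is already initialized):

```python
import torch.distributed as dist

# Every rank contributes one object; every rank receives all of them.
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, {"rank": dist.get_rank()})
# `gathered` now holds one dict per rank on every process.
```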

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D23398495

Pulled By: rohan-varma

fbshipit-source-id: 032e1d628c0c0f2dec297226167471698c56b605
2020-08-31 13:28:00 -07:00
f7bae5b6b1 Revert D23385091: [quant][graphmode][fx] Add top level APIs
Test Plan: revert-hammer

Differential Revision:
D23385091 (eb4199b0a7)

Original commit changeset: b789e54e1a0f

fbshipit-source-id: dc3dd9169d34beab92488d78d42d7e7d05e771d1
2020-08-31 12:18:29 -07:00
68304c527a Revert D23385090: [quant][graphmode][fx] Add support for weight prepack folding
Test Plan: revert-hammer

Differential Revision:
D23385090 (ef08f92076)

Original commit changeset: 11341f0af525

fbshipit-source-id: fe2bcdc16106923a2cee99eb5cc0a1e9c14ad2c5
2020-08-31 12:17:28 -07:00
0394c5a283 [fix] torch.multinomial : fix for 0 size dim (#43775)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43768

TO-DO:
* [x] Add test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43775

Reviewed By: ZolotukhinM

Differential Revision: D23421979

Pulled By: ngimel

fbshipit-source-id: 949fcdd30f18d17ae1c372fa6ca6a0b8d0d538ce
2020-08-31 11:57:42 -07:00
3c8b1d73c9 Update aliasing in tensorexpr fuser (#43743)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43743

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D23385205

Pulled By: eellison

fbshipit-source-id: 097a15d5bcf216453e1dd144d6117108b3deae4d
2020-08-31 11:52:26 -07:00
5da8a7bf2d use types in the IR instead of vmap (#43742)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43742

We can remove all prim::profiles, update the values to their specialized profiled types, and then later guard the input graphs based on the input types of the fusion group. After that we remove specialized tensor types from the graph. This gets rid of having to update the vmap and removes all of the profile nodes in fusing.

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D23385206

Pulled By: eellison

fbshipit-source-id: 2c84bd1d1c38df0d7585e523c30f7bd28f399d7c
2020-08-31 11:52:23 -07:00
259e5b7d71 Add passes to profiling executor pipeline (#43636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43636

We weren't running inlining in the forward graph of differentiable subgraphs, and we weren't getting rid of all profiles as part of optimization.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23358804

Pulled By: eellison

fbshipit-source-id: 05ede5fa356a15ca385f899006cb5b35484ef620
2020-08-31 11:52:20 -07:00
a7e7981c0b Use prim::TensorExprGroup interned symbol (#43635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43635

Intern the symbol, no functional changes. Aliasing need to be looked at but this should be done in a separate PR; this PR is just changing the symbol.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23358806

Pulled By: eellison

fbshipit-source-id: f18bcd142a0daf514136f019ae607e4c3f45d9f8
2020-08-31 11:52:16 -07:00
1c0faa759e Update requires grad property (#43634)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43634

Because differentiable graphs detach the gradients of input Tensors, creating and inlining differentiable graphs changes the requires_grad property of tensors in the graph. In the legacy executor this was not a problem, as the Fuser would simply ignore the gradient property: it was an invariant that the LegacyExecutor only passed tensors with grad = False. This is not the case with the profiler, as the Fuser does its own guarding.

Updating the type also helps with other typechecks, e.g. the ones specializing the backward, and with debugging the graph.

Other possibilities considered were:
- Fuser/Specialize AutogradZero always guards against requires_grad=False regardless of the profiled type
- Re-profile forward execution of differentiable graph

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23358803

Pulled By: eellison

fbshipit-source-id: b106998accd5d0f718527bc00177de9af5bad5fc
2020-08-31 11:51:06 -07:00
2bede78a05 add qr_backward functionality for wide case (#42216)
Summary:
Unblocks implementation of https://github.com/pytorch/pytorch/issues/27036. Note that this PR ***does not*** fix #27036.
Currently QR decomposition only has support for the square and tall (a.k.a. skinny) cases.
This PR adds functionality for wide A matrices/tensors, includes 3 unit tests for the new case,
and restructures the `qr_backward` method to use the same Walther method as a helper.

cc albanD t-vi

I don't have a GPU machine so I haven't tested on CUDA, but everything passes on my local machine on CPU.

The basic idea of the PR is noted in the comments in the `Functions.cpp` file, but I'll note it here too for clarity:

let $A_{m,n}$ be a matrix with $m < n$, and partition it as $A_{m,n} = [\, X_{m,m} \mid Y_{m,n-m} \,]$. Take the QR of $X$, writing $X = QU$; this $Q$ is the same as the $Q$ from QR on the entire $A$ matrix. Then transform $Y$ with the rotation $Q$ obtained from $X$ to get $V = Q^{T}Y$, so that $R = [\, U \mid V \,]$, and similarly for the grads of each piece: if $\bar{A}$ is `grad_A`, then $\bar{A} = [\, \bar{X} \mid \bar{Y} \,]$ and $\bar{R} = [\, \bar{U} \mid \bar{V} \,]$, with $\bar{Y} = Q\bar{V}$, where $\bar{V}$ is the `narrow()` of `grad_R`.
$\bar{X}$ is calculated very similarly to the original Walther formula (exactly the same in the tall and square cases) but is slightly modified here for wide matrices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42216

Reviewed By: glaringlee

Differential Revision: D23373118

Pulled By: albanD

fbshipit-source-id: 3702ba7e7e23923868c02cdb7e10a96036052344
2020-08-31 11:46:45 -07:00
69dd0bab90 [RPC profiling] Add test to ensure using record_function works for RPC (#43657)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43657

We didn't have a test ensuring that functions run over RPC while being profiled can use `with record_function()` to profile specific blocks in the function execution. This is useful, for example, if the user wants information about specific blocks in a function run over RPC that is composed of many torch ops and some custom logic.

Currently, this will not work if the function is TorchScripted since `with record_function()` is not torchscriptable yet. We can add support for this in future PRs so that torchscript RPC functions can also be profiled like this.
ghstack-source-id: 111033981

Reviewed By: mrshenli

Differential Revision: D23355215

fbshipit-source-id: 318d92e285afebfeeb2a7896b4959412c5c241d4
2020-08-31 11:43:09 -07:00
4ef12be900 Add __complex__ (#43844)
Summary:
fixes https://github.com/pytorch/pytorch/issues/43833
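
Based on the linked issue, a sketch of what the new dunder enables (calling Python's built-in `complex()` on a 0-dim tensor, analogous to `float(torch.tensor(1.0))`):

```python
import torch

t = torch.tensor(1 + 2j)
z = complex(t)  # -> (1+2j), via the newly added Tensor.__complex__
```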

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43844

Reviewed By: ZolotukhinM

Differential Revision: D23422000

Pulled By: ngimel

fbshipit-source-id: ebc6a27a9b04c77c3977e6c184cefce9e817cc2f
2020-08-31 11:39:41 -07:00
c5d0f091b2 addmm/addmv should accept complex alpha and beta (#43827)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43827
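
A usage sketch (hedged): with complex inputs, `out = beta * input + alpha * (mat1 @ mat2)` now accepts complex scaling factors:

```python
import torch

M = torch.randn(2, 2, dtype=torch.cfloat)
a = torch.randn(2, 3, dtype=torch.cfloat)
b = torch.randn(3, 2, dtype=torch.cfloat)
out = torch.addmm(M, a, b, beta=1 + 1j, alpha=2j)  # complex alpha and beta
```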

Reviewed By: malfet

Differential Revision: D23415869

Pulled By: ngimel

fbshipit-source-id: a47b76df5fb751f76d36697f5fd95c69dd3a6efe
2020-08-31 11:35:58 -07:00
89452a67de [fx] GraphModule.src -> GraphModule.code (#43655)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43655

Pure, unadulterated bikeshed. The good stuff.

This makes things more consistent with ScriptModule.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23401528

Pulled By: suo

fbshipit-source-id: 7dd8396365f118abcd045434acd9348545314f44
2020-08-31 11:26:05 -07:00
1390cad2d8 [NNC] Hook up registerizer to Cuda codegen [2/x] (#42878)
Summary:
Insert the registerizer into the Cuda Codegen pass list, to enable scalar replacement and close the gap in simple reduction performance.

First up the good stuff, benchmark before:
```
          Column sum          Caffe2             NNC          Simple          Better
           (10, 100)          5.7917          9.7037          6.9386          6.0448
          (100, 100)          5.9338          14.972          7.1139          6.3254
        (100, 10000)          21.453          741.54          145.74          12.555
        (1000, 1000)          8.0678          122.75          22.833          9.0778

             Row sum          Caffe2             NNC          Simple          Better
           (10, 100)          5.4502          7.9661          6.1469          5.5587
          (100, 100)          5.7613          13.897           21.49          5.5808
        (100, 10000)          21.702          82.398          75.462          22.793
        (1000, 1000)          22.527             129          176.51          22.517

```

After:
```
          Column sum          Caffe2             NNC          Simple          Better
           (10, 100)          6.0458          9.4966          7.1094           6.056
          (100, 100)          5.9299          9.1482          7.1693           6.593
        (100, 10000)          21.739          121.97          162.63          14.376
        (1000, 1000)          9.2374           29.01          26.883          10.127

             Row sum          Caffe2             NNC          Simple          Better
           (10, 100)          5.9773          8.1792          7.2307          5.8941
          (100, 100)          6.1456          9.3155          24.563          5.8163
        (100, 10000)          25.384          30.212          88.531          27.185
        (1000, 1000)          26.517          32.702          209.31          26.537
```

Speedup is about 3-8x depending on the size of the data (increasing with bigger inputs).

The gap between NNC and simple is closed or eliminated - remaining issue appears to be kernel launch overhead. Next up is getting us closer to the _Better_ kernel.

It required a lot of refactoring and bug fixes on the way:
* Refactored flattening of parallelized loops out of the CudaPrinter and into its own stage, so we can transform the graph in the stage between flattening and printing (where registerization occurs).
* Made AtomicAddFuser less pessimistic: it will now recognize that if an Add to a buffer depends on all used Block and Thread vars, it has no overlap and does not need to be atomic. This allows registerization to apply to these stores.
* Fixed PrioritizeLoad mutator so that it does not attempt to separate the Store and Load to the same buffer (i.e. reduction case).
* Moved CudaAnalysis earlier in the process, allowing later stages to use the analyzed bufs.
* Fixed a bug in the Registerizer where when adding a default initializer statement it would use the dtype of the underlying var (which is always kHandle) instead of the dtype of the Buf.
* Fixed a bug in the IRMutator where the logic for replacing Allocate statements was inverted, so they were replaced only if they did not change.
* Added simplification of simple Division patterns to the IRSimplifier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42878

Reviewed By: glaringlee

Differential Revision: D23382499

Pulled By: nickgg

fbshipit-source-id: 3640a98fd843723abad9f54e67070d48c96fe949
2020-08-31 10:39:46 -07:00
63dbef3038 Better msg (#43848)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43848

Missing space in logging.

Test Plan: build

Reviewed By: hl475

Differential Revision: D23416698

fbshipit-source-id: bf7c494f33836601f5f380c03a0910f419c2e62b
2020-08-31 10:36:59 -07:00
ef08f92076 [quant][graphmode][fx] Add support for weight prepack folding (#43728)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43728

Trace back from the weight node until we hit getattr, reconstruct the graph module with the traced nodes,
and run the graph module to pack the weight. Then replace the original chain of ops with the packed weight.

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23385090

fbshipit-source-id: 11341f0af525a02ecec36f163a9cd35dee3744a1
2020-08-31 10:35:11 -07:00
eb4199b0a7 [quant][graphmode][fx] Add top level APIs (#43581)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43581

Add APIs similar to those of eager mode and graph mode on TorchScript:
- fuse_fx
- quantize_fx (for both post training static and qat)
- quantize_dynamic_fx (for post training dynamic)
- prepare_fx (for both post training static and qat)
- prepare_dynamic_fx (for post training dynamic)
- convert_fx (for all modes)

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23385091

fbshipit-source-id: b789e54e1a0f3af6b026fd568281984e253e0433
2020-08-31 10:12:55 -07:00
42c895de4d Properly check that reduction strings are valid for l1_loss, smoothl1_loss, and mse_loss. (#43527)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43527
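
A sketch of what the stricter check buys (hedged): a misspelled reduction string now fails loudly instead of being silently accepted:

```python
import torch
import torch.nn.functional as F

a, b = torch.randn(3), torch.randn(3)
F.mse_loss(a, b, reduction="mean")      # valid: "none" | "mean" | "sum"
# F.mse_loss(a, b, reduction="meann")   # now raises a ValueError
```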

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23306786

Pulled By: gchanan

fbshipit-source-id: f3b7c9c02ae02813da116cb6b247a95727c47587
2020-08-31 09:53:56 -07:00
b8d34547ee [quant][graphmode][fx][fix] enable per channel quantization for functional ops (#43534)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43534

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23310857

fbshipit-source-id: ff7a681ee55bcc51f564e9de78319249b989366c
2020-08-31 09:35:25 -07:00
6ea89166bd Rewrite of ATen code generator (#42629)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42629

How to approach reviewing this diff:

- The new codegen itself lives in `tools/codegen`. Start with `gen.py`, then read `model.py` and then the `api/` folder. The comments at the top of the files describe what is going on. The CLI interface of the new codegen is similar to the old one, but (1) it is no longer necessary to explicitly specify cwrap inputs (and now we will error if you do so) and (2) the default settings for source and install dir are much better; to the extent that if you run the codegen from the root source directory as just `python -m tools.codegen.gen`, something reasonable will happen.
- The old codegen is (nearly) entirely deleted; every Python file in `aten/src/ATen` was deleted except for `common_with_cwrap.py`, which now permanently finds its home in `tools/shared/cwrap_common.py` (previously cmake copied the file there), and `code_template.py`, which now lives in `tools/codegen/code_template.py`. We remove the copying logic for `common_with_cwrap.py`.
- All of the inputs to the old codegen are deleted.
- Build rules now have to be adjusted to not refer to files that no longer exist, and to abide by the (slightly modified) CLI.
- LegacyTHFunctions files have been generated and checked in. We expect these to be deleted as these final functions get ported to ATen. The deletion process is straightforward; just delete the functions of the ones you are porting. There are 39 more functions left to port.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D23183978

Pulled By: ezyang

fbshipit-source-id: 6073ba432ad182c7284a97147b05f0574a02f763
2020-08-31 09:00:22 -07:00
576880febf Print all traceback for nested backwards in detect_anomaly (#43626)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43405.

This pull request adds the ability to print all tracebacks when `detect_anomaly` detects `nan` in nested backward operations.
The way I did it is by assigning a node as a parent to all nodes it produces during its backward calculation. Then, if one of the children produces `nan`, it will print the traceback from the parent and grandparents (if any).

The parent is assigned in `parent_node_` member in `Node` class which is accessible in C++ by function `node->parent()` and in Python by `node.parent_function`.
A node has a parent iff:

1. it is created from a backward operation, and
2. created when anomaly mode and grad mode are both enabled.

An example of this feature:

    import torch

    def example():
        x = torch.tensor(1.0, requires_grad=True)
        y = torch.tensor(1e-8, requires_grad=True)  # small to induce nan in n-th backward
        a = x * y
        b = x * y
        z1 = a / b  # can produce nan in n-th backward as long as https://github.com/pytorch/pytorch/issues/43414 is unsolved
        z = z1 * z1
        gy , = torch.autograd.grad( z , (y,), create_graph=True)
        gy2, = torch.autograd.grad(gy , (y,), create_graph=True)
        gy3, = torch.autograd.grad(gy2, (y,), create_graph=True)
        gy4, = torch.autograd.grad(gy3, (y,), create_graph=True)
        return gy4

    with torch.autograd.detect_anomaly():
        gy4 = example()

with output:

    example.py:16: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
      with torch.autograd.detect_anomaly():
    /home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py:190: UserWarning: Error detected in DivBackward0. Traceback of forward call that caused the error:
      File "example.py", line 17, in <module>
        gy4 = example()
      File "example.py", line 12, in example
        gy3, = torch.autograd.grad(gy2, (y,), create_graph=True)
      File "/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 190, in grad
        return Variable._execution_engine.run_backward(
     (Triggered internally at  ../torch/csrc/autograd/python_anomaly_mode.cpp:61.)
      return Variable._execution_engine.run_backward(
    /home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py:190: UserWarning:

    Traceback of forward call that induces the previous calculation:
      File "example.py", line 17, in <module>
        gy4 = example()
      File "example.py", line 11, in example
        gy2, = torch.autograd.grad(gy , (y,), create_graph=True)
      File "/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 190, in grad
        return Variable._execution_engine.run_backward(
     (Triggered internally at  ../torch/csrc/autograd/python_anomaly_mode.cpp:65.)
      return Variable._execution_engine.run_backward(
    /home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py:190: UserWarning:

    Traceback of forward call that induces the previous calculation:
      File "example.py", line 17, in <module>
        gy4 = example()
      File "example.py", line 8, in example
        z1 = a / b  # can produce nan in n-th backward as long as https://github.com/pytorch/pytorch/issues/43414 is unsolved
     (Triggered internally at  ../torch/csrc/autograd/python_anomaly_mode.cpp:65.)
      return Variable._execution_engine.run_backward(
    Traceback (most recent call last):
      File "example.py", line 17, in <module>
        gy4 = example()
      File "example.py", line 13, in example
        gy4, = torch.autograd.grad(gy3, (y,), create_graph=True)
      File "/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 190, in grad
        return Variable._execution_engine.run_backward(
    RuntimeError: Function 'DivBackward0' returned nan values in its 1th output.

cc & thanks to albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43626

Reviewed By: malfet

Differential Revision: D23397499

Pulled By: albanD

fbshipit-source-id: aa7435ec2a7f0d23a7a02ab7db751c198faf3b7d
2020-08-31 08:23:07 -07:00
1cdb9d2ab5 Test runner for batched gradient computation with vmap (#43664)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43664

This PR implements the test runner for batched gradient computation with
vmap. It also implements the batching rule for sigmoid_backward and
tests that one can compute batched gradients with sigmoid (and batched
2nd gradients).
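
A hedged sketch of the pattern being tested, using the prototype `torch.vmap` to batch `autograd.grad` over basis vectors (which recovers the Jacobian of an elementwise op like sigmoid):

```python
import torch

x = torch.randn(3, requires_grad=True)
y = torch.sigmoid(x)

def vjp(v):
    # One vector-Jacobian product; vmap batches this over rows of `v`.
    return torch.autograd.grad(y, x, v, create_graph=True)[0]

jacobian = torch.vmap(vjp)(torch.eye(3))  # diagonal, since sigmoid is elementwise
```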

Test Plan: - New tests: `python test/test_vmap.py -v`

Reviewed By: ezyang

Differential Revision: D23358555

Pulled By: zou3519

fbshipit-source-id: 7bb05b845a41b638b7cca45a5eff1fbfb542a51f
2020-08-31 08:21:41 -07:00
1dcc4fb6b7 Kill unused _pointwise_loss function. (#43523)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43523

The code is also wrong, see https://github.com/pytorch/pytorch/issues/43228.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23305461

Pulled By: gchanan

fbshipit-source-id: 9fe516d87a4243d5ce3c29e8822417709a1d6346
2020-08-31 07:58:04 -07:00
a860be898e [resubmit] Add amax/amin (#43819)
Summary:
Resubmit for landing next week.
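
For context, a brief usage sketch: unlike `torch.max`/`torch.min` with a `dim` argument, `amax`/`amin` return only the values, never the indices:

```python
import torch

x = torch.tensor([[1., 3.],
                  [2., 0.]])
torch.amax(x, dim=0)  # tensor([2., 3.])
torch.amin(x, dim=1)  # tensor([1., 0.])
```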

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43819

Reviewed By: ngimel

Differential Revision: D23421906

Pulled By: mruberry

fbshipit-source-id: 23dd60d1e365bb1197d660c3bfad7ee07ba3e97f
2020-08-31 04:54:48 -07:00
8fb7c50250 Enable complex blas for ROCm. (#43744)
Summary:
Revert "Skips some complex tests on ROCm (https://github.com/pytorch/pytorch/issues/42759)".  This reverts commit 55b1706775726418ddc5dd3b7756ea0388c0817c.

Use new cuda_to_hip_mappings.py from https://github.com/pytorch/pytorch/issues/43004.

Fixes https://github.com/pytorch/pytorch/pull/42383#issuecomment-670771922

CC sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43744

Reviewed By: glaringlee

Differential Revision: D23391263

Pulled By: ngimel

fbshipit-source-id: ddf734cea3ba69c24f0d79cf1b87c05cdb45ec3d
2020-08-30 22:43:54 -07:00
08126c9153 [ONNX] Utilize ONNX shape inference for ONNX exporter (#40628)
Summary:
It is often the case that the conversion from a torch operator to an onnx operator requires the input rank/dtype/shape to be known. Previously, the conversion depended on the tracer to provide this info, leaving a gap in the conversion of scripted modules.

We are extending the export with support from onnx shape inference. If enabled, onnx shape inference will be called whenever an onnx node is created. This is the first PR introducing the initial look of the feature. More and more cases will be supported following this PR.

* Added pass to run onnx shape inference on a given node. The node has to have namespace `onnx`.
* Moved helper functions from `export.cpp` to a common place for re-use.
* This feature is currently experimental, and can be turned on through flag `onnx_shape_inference` in internal api `torch.onnx._export`.
* Currently skipping ONNX Sequence ops, If/Loop and ConstantOfShape due to limitations. Support will be added in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40628

Reviewed By: mrshenli

Differential Revision: D22709746

Pulled By: bzinodev

fbshipit-source-id: b52aeeae00667e66e0b0c1144022f7af9a8b2948
2020-08-30 18:35:46 -07:00
3aeb70db0b Documents sub properly, adds subtract alias (#43850)
Summary:
`torch.sub` was undocumented, so this PR adds its documentation, analogous to `torch.add`'s documentation, and adds the alias `torch.subtract` for `torch.sub`, too. This alias comes from NumPy (see https://numpy.org/doc/stable/reference/generated/numpy.subtract.html?highlight=subtract#numpy.subtract)
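
A quick sketch of the alias in action (identical semantics to `torch.sub`, including the `alpha` scaling factor):

```python
import torch

a, b = torch.tensor([3., 5.]), torch.tensor([1., 2.])
assert torch.equal(torch.subtract(a, b, alpha=2), torch.sub(a, b, alpha=2))
```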

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43850

Reviewed By: ngimel

Differential Revision: D23416908

Pulled By: mruberry

fbshipit-source-id: 6c4d2ebaf6ecae91f3a6efe484ce6c4dad96f016
2020-08-30 15:44:56 -07:00
3dc9645430 Disable RocM CircleCI jobs (#42630)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42630

Reviewed By: seemethere

Differential Revision: D22957640

Pulled By: malfet

fbshipit-source-id: 9f7d633310c653fcd14e66755168c0e559307b69
2020-08-30 11:41:40 -07:00
7b835eb887 Update CUDA11 docker container (#42200)
Summary:
- no more `-rc`
- add magma

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42200

Reviewed By: ZolotukhinM, mruberry

Differential Revision: D23411686

Pulled By: malfet

fbshipit-source-id: 04532bc1cc65b3e14ddf29e8bf61a7a3b4c706ad
2020-08-30 11:39:20 -07:00
5021ec826b Fix docs for kwargs, f-p (#43586)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43586

Reviewed By: glaringlee

Differential Revision: D23390667

Pulled By: mruberry

fbshipit-source-id: dd51a4a48ff4e2fc10675ec817a206041957982f
2020-08-30 10:13:36 -07:00
1830e4f08c Remove unnamed namespace in headers (#43689)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43689

Test Plan: Imported from OSS

Reviewed By: eellison, asuhan

Differential Revision: D23367636

Pulled By: bertmaher

fbshipit-source-id: ddb6d34d2f7cadff3a591c3650e1dd1b401c3d2d
2020-08-29 22:45:53 -07:00
ab3ea95e90 #include <string> in loopnest.h (#43835)
Summary:
This file is causing a compilation failure on my gcc-10.1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43835

Reviewed By: bhosmer

Differential Revision: D23416417

Pulled By: ZolotukhinM

fbshipit-source-id: d0c2998347438fb729212574d52ce20dd6faae85
2020-08-29 19:06:44 -07:00
628db9699f Vulkan command buffer and pool. (#42930)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42930

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252333

Pulled By: AshkanAliabadi

fbshipit-source-id: 738385e0058edf3d3b34173e1b1011356adb7b3c
2020-08-29 17:48:19 -07:00
d1df098956 Vulkan resource cache. (#42709)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42709

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252339

Pulled By: AshkanAliabadi

fbshipit-source-id: 977ab3fdedfe98789a48dd263127529d8be0ed37
2020-08-29 17:48:17 -07:00
87e8f50aae Vulkan descriptor and descriptor layout cache. (#42642)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42642

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252337

Pulled By: AshkanAliabadi

fbshipit-source-id: 075acc8c093e639bb24a0d4653d5c922b36a1128
2020-08-29 17:48:14 -07:00
15aaeb8867 Vulkan pipeline and pipeline layout cache. (#42395)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42395

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252334

Pulled By: AshkanAliabadi

fbshipit-source-id: 6b4e88f9794a7879d47a1cdb671076d50f1944d9
2020-08-29 17:48:12 -07:00
387dc24c92 Vulkan memory allocator. (#42786)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42786

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252332

Pulled By: AshkanAliabadi

fbshipit-source-id: 14e848ad81b4ba1367e8cf719343a51995457827
2020-08-29 17:48:10 -07:00
287fb273cd Vulkan (source and binary) shader and shader layout cache. (#42325)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42325

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252336

Pulled By: AshkanAliabadi

fbshipit-source-id: f3f26c78366be45c90a370db9194d88defbf08d8
2020-08-29 17:48:08 -07:00
6373063a98 Generic Vulkan object cache. (#42394)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42394

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252340

Pulled By: AshkanAliabadi

fbshipit-source-id: 34e753964b94153ed6ed1fcaa7f3b4a7c6b5f340
2020-08-29 17:48:06 -07:00
4e39c310eb Move torch/csrc/utils/hash.h to c10/util/hash.h. (#42503)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42503

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252331

Pulled By: AshkanAliabadi

fbshipit-source-id: 3c4c0e27b9a7eec8560e374c2a3ba5f1c65dae48
2020-08-29 17:47:00 -07:00
7f967c08b8 Document the beta=0 behavior of BLAS functions (#43823)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43823
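
A sketch of the documented guarantee (hedged paraphrase: when `beta == 0`, the `input` tensor is ignored entirely, so `nan`/`inf` in it do not propagate):

```python
import torch

bad = torch.full((2, 2), float('nan'))
out = torch.addmm(bad, torch.eye(2), torch.eye(2), beta=0)
assert torch.isfinite(out).all()  # nan in `bad` is ignored when beta == 0
```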

Reviewed By: mruberry

Differential Revision: D23413899

Pulled By: ngimel

fbshipit-source-id: d3c4e5631db729a3f3d5eb9290c76cb1aa529f74
2020-08-29 13:03:16 -07:00
cc52386096 Revert D19987020: [pytorch][PR] Add the sls tensor train op
Test Plan: revert-hammer

Differential Revision:
D19987020 (f31b111a35)

Original commit changeset: e3ca7b00a374

fbshipit-source-id: a600c747a45dfb51e0882196e382a21ccaa7b989
2020-08-29 12:46:11 -07:00
45ba836876 Revert "Revert D23252335: Refactor Vulkan context into its own files. Use RAII." (#43628)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43628

This reverts commit 6c772515ed1a87ec676382492ff3c019c6d194c3.

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23356714

Pulled By: AshkanAliabadi

fbshipit-source-id: a44af3b3c7b00a097eae1b0c9a00fdabc7ab6f86
2020-08-29 12:39:22 -07:00
f31b111a35 Add the sls tensor train op (#33525)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33525

Reviewed By: wx1988

Differential Revision: D19987020

Pulled By: lly-zero-one

fbshipit-source-id: e3ca7b00a374a75ee42716c4e6236bf168ebebf1
2020-08-29 12:16:44 -07:00
550fb2fd52 Expand the coverage of test_blas_empty (#43822)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43822

Reviewed By: mruberry

Differential Revision: D23413359

Pulled By: ngimel

fbshipit-source-id: fcdb337e32ed2d1c791fa0762d5233b346b26d14
2020-08-29 12:13:15 -07:00
60ad7e9c04 [TensorExpr] Make sum available from Python (#43730)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43730

Test Plan:
python test/test_jit_fuser_te.py -k TestTEFuser.test_sum
test_tensorexpr --gtest_filter=TensorExprTest.KernelSum*

Reviewed By: ZolotukhinM

Differential Revision: D23407600

Pulled By: asuhan

fbshipit-source-id: e6da4690ae6d802f9be012e39e61b7467aa5285c
2020-08-29 10:38:21 -07:00
8a41fa4718 [Selective Build] Move register_prim_ops and register_special_ops to app level (#43539)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43539

Move the two source files out of the base internal mobile library to the app level, making them ready for app-based selective build. The open-source build should not be affected; the file list change in build_variables.bzl affects the internal build only.

ghstack-source-id: 111006135

Test Plan: CI

Reviewed By: ljk53

Differential Revision: D23287661

fbshipit-source-id: 9b2d688544e79e0fca9c84730ef0259952cd8abe
2020-08-29 03:12:28 -07:00
d10056652b Enable torch.half for lt and masked_select (#43704)
Summary:
Enable testing of those options in `TestTorchDeviceTypeCPU.test_logical_cpu` and `TestTorchDeviceTypeCPU.test_masked_select_cpu_float16`
Add `view_as_real` testing for `torch.complex32` type

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43704

Reviewed By: albanD

Differential Revision: D23373070

Pulled By: malfet

fbshipit-source-id: 00f17f23b48513379a414227aea91e2d3c0dd5f9
2020-08-29 02:37:26 -07:00
931b8b4ac8 Use ivalue::Future in autograd engine and DistEngine. (#43676)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43676

This is one part of https://github.com/pytorch/pytorch/issues/41574 to
ensure we consolidate everything around ivalue::Future.

I've removed the use of torch/csrc/utils/future.h from the autograd engines and
used ivalue::Future instead.
ghstack-source-id: 110895545

Test Plan: waitforbuildbot.

Reviewed By: albanD

Differential Revision: D23362415

fbshipit-source-id: aa109b3f8acf0814d59fc5264a85a8c27ef4bdb6
2020-08-29 02:15:26 -07:00
000739c31a Function calls for fallback paths (#43274)
Summary:
This PR adds an API to package unoptimized/fallback blocks as function calls. It's mainly meant to be used by the TensorExpressionsFuser and SpecializeAutogradZero passes: both specialize the original graph but would also like to provide a fallback path in case the assumptions under which the graph was specialized do not hold for some inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43274

Reviewed By: malfet

Differential Revision: D23406961

Pulled By: Krovatkin

fbshipit-source-id: ef21fc9ad886953461b09418d02c75c58375490c
2020-08-28 23:31:02 -07:00
8538a79bfe [jit][static] Basic executor (#43647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43647

Nothing fancy, just a basic implementation of the graph executor without using stack machine.

Reviewed By: bwasti

Differential Revision: D23208413

fbshipit-source-id: e483bb6ad7ba8591bbe1767e669654d82f42c356
2020-08-28 23:20:07 -07:00
6aaae3b08b [ONNX] Addition of diagnostic tool API (#43020)
Summary:
Added initial diagnostic tool API

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43020

Reviewed By: malfet

Differential Revision: D23398459

Pulled By: bzinodev

fbshipit-source-id: 7a6d9164a19e3ba51676fbcf645c4d358825eb42
2020-08-28 23:04:59 -07:00
58148c85f4 Use template OperatorGenerator for prim and special operator registration (#43481)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43481

Apply OperatorGenerator for prim and special operator registration. It does not affect the existing build by default. However, if a whitelist of operators exists, only the operators in the whitelist will be registered. This has the potential to save up to 200 KB of binary size, depending on usage.

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D23287251

Pulled By: iseeyuan

fbshipit-source-id: 3ca39fbba645bad8d69e69195f3680e4f6d633c5
2020-08-28 21:18:00 -07:00
8997a4b56b [typing] Enable typing in torch.quantization.fuse_modules typechecks … (#43786)
Summary:
Enable typing in torch.quantization.fuse_modules typechecks during CI.

Fixes #42971

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43786

Reviewed By: malfet

Differential Revision: D23403258

Pulled By: yizhouyu

fbshipit-source-id: 4cd24a4fcf1408341a210fa50f574887b6db5e0e
2020-08-28 20:42:23 -07:00
eae92b7187 Updated README.md by correcting grammatical errors (#43779)
Summary:
Fixed grammatical errors and punctuation so that it can be more understandable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43779

Reviewed By: ZolotukhinM

Differential Revision: D23407849

Pulled By: malfet

fbshipit-source-id: 09c064ce68d0f37f8023c2ecae8775fc00541a2c
2020-08-28 20:30:03 -07:00
13c7c6227e Python/C++ API Parity: TransformerDecoder (#42886)
Summary:
Fixes [#37756](https://github.com/pytorch/pytorch/issues/37756)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42886

Reviewed By: zhangguanheng66

Differential Revision: D23385631

Pulled By: glaringlee

fbshipit-source-id: 610a2fabb4c25b2dfd37b33287215bb8872d653d
2020-08-28 20:13:53 -07:00
64906497cd Revert D23391941: [pytorch][PR] Implementing NumPy-like function torch.heaviside()
Test Plan: revert-hammer

Differential Revision:
D23391941 (a1eae6d158)

Original commit changeset: 7b942321a625

fbshipit-source-id: c2a7418a1fedaa9493300945c30e2392fc0d08ee
2020-08-28 19:16:58 -07:00
47e489b135 Make ExtraFilesMap return bytes instead of str (#43241)
Summary:
This is useful in case we want to store binary files using the `ScriptModule.save(..., _extra_files=...)` functionality. With Python 3 we can just use bytes and not bother with conversions.

I had to copy-paste from pybind sources; maybe we should upstream it, but it'd mean adding a bunch of template arguments to `bind_map`, which is a bit untidy.

Let me know if there's a better place to park this function (it seems to be the only invocation of `bind_map` so I put it in the same file)
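
For illustration, a minimal round-trip through `_extra_files` (file name and payload made up), sketching the behavior after this change:

```python
import torch

m = torch.jit.script(torch.nn.Linear(2, 2))
# Store an arbitrary binary payload alongside the serialized module.
m.save("model.pt", _extra_files={"meta.bin": b"\x00\x01\x02"})

# On load, the dict values are populated in place; with this change they
# come back as bytes rather than str.
files = {"meta.bin": ""}
torch.jit.load("model.pt", _extra_files=files)
print(files["meta.bin"])
```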

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43241

Reviewed By: zdevito

Differential Revision: D23205244

Pulled By: dzhulgakov

fbshipit-source-id: 8f291eb4294945fe1c581c620d48ba2e81b3dd9c
2020-08-28 19:11:33 -07:00
1a79d7bb28 DDP communication hook examples (#43310)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43310

In this diff, we prepared some example DDP communication hooks [#40848](https://github.com/pytorch/pytorch/pull/40848):

1\. `allreduce_hook`: This DDP communication hook just calls ``allreduce`` using ``GradBucket`` tensors. Once gradient tensors are aggregated across all workers, its ``then`` callback takes the mean and returns the result. If a user registers this hook, DDP results are expected to be the same as in the case where no hook was registered; hence, it won't change the behavior of DDP, and users can use it as a reference or modify it to log useful information or for other purposes without affecting DDP behavior.

2\. `allgather_then_aggregate_hook`: Similar to ``allreduce_hook``, this hook first gathers ``GradBucket`` tensors, and its ``then`` callback aggregates the gathered gradient tensors and takes the mean. Instead of ``allreduce``, this hook uses ``allgather``. Note that with W workers, both the computation and communication time scale as O(W) for allgather compared to O(logW) for allreduce. Therefore, this hook is expected to be much slower than ``allreduce_hook``, although both essentially do the same thing with the gradients.

3\. `fp16_compress_hook`: This DDP communication hook implements a simple gradient compression approach that converts ``GradBucket`` tensors whose type is assumed to be ``torch.float32`` to half-precision floating point format (``torch.float16``) and allreduces those ``float16`` gradient tensors. Once the compressed gradient tensors are allreduced, its ``then`` callback, ``decompress``, converts the aggregated result back to ``float32`` and takes the mean (a minimal sketch of this idea follows the list).

4\. `quantization_pertensor_hook` does per-tensor quantization, using the idea in https://pytorch.org/docs/master/generated/torch.quantize_per_tensor.html. Note that we separately send scale and zero_point (two floats per rank) before the quantized tensors.

5\. `quantization_perchannel_hook` does per-channel quantization, similar to https://pytorch.org/docs/master/generated/torch.quantize_per_channel.html. The main motivation is that, after the initial QSGD study diff, we realized that for considerably large gradient tensors (e.g. one containing 6 million floats), dividing the tensor into smaller channels (512-float chunks) and quantizing each independently may significantly increase the resolution and yield lower error.
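
A minimal sketch of the fp16 compression idea from (3), written against plain `torch.distributed` collectives rather than the actual `GradBucket`-based hook signature (the flattened-gradient input and explicit `world_size` are simplifications for illustration):

```python
import torch
import torch.distributed as dist

def fp16_compress(flat_grad: torch.Tensor, world_size: int) -> torch.Tensor:
    # Compress float32 gradients to float16 before communicating.
    compressed = flat_grad.to(torch.float16)
    # Sum the compressed gradients across all workers.
    dist.all_reduce(compressed)
    # "decompress": convert back to float32 and take the mean.
    return compressed.to(torch.float32).div_(world_size)
```
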
ghstack-source-id: 110923269

Test Plan:
python torch/distributed/algorithms/ddp_comm_hooks/test_ddp_hooks.py
Couldn't download test skip set, leaving all tests enabled...
.....
----------------------------------------------------------------------
Ran 4 tests in 26.724s

OK

Internal testing:
```
buck run mode/dev-nosan //caffe2/test/distributed/algorithms/ddp_comm_hooks:test_ddp_hooks
```

Reviewed By: malfet

Differential Revision: D22937999

fbshipit-source-id: 274452e7932414570999cb978ae77a97eb3fb0ec
2020-08-28 18:59:14 -07:00
68b9daa9bf Add torch.linalg.norm (#42749)
Summary:
Adds `torch.linalg.norm` function that matches the behavior of `numpy.linalg.norm`.

Additional changes:
* Add support for dimension wrapping in `frobenius_norm` and `nuclear_norm`
* Fix `out` argument behavior for `nuclear_norm`
* Fix issue where `frobenius_norm` allowed duplicates in `dim` argument
* Add `_norm_matrix`

Closes https://github.com/pytorch/pytorch/issues/24802
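
A few representative calls, mirroring `numpy.linalg.norm` semantics:

```python
import torch

v = torch.randn(5)
A = torch.randn(3, 3)

torch.linalg.norm(v)         # vector 2-norm
torch.linalg.norm(A)         # Frobenius norm (default for matrices)
torch.linalg.norm(A, ord=2)  # spectral norm (largest singular value)
torch.linalg.norm(A, dim=1)  # per-row vector norms
```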

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42749

Reviewed By: ngimel

Differential Revision: D23336234

Pulled By: mruberry

fbshipit-source-id: f0aba3089a3a0bf856aa9c4215e673ff34228fac
2020-08-28 18:28:33 -07:00
cd0bab8d8d [ONNX] Where op (#41544)
Summary:
Extending where op export

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41544

Reviewed By: malfet

Differential Revision: D23279515

Pulled By: bzinodev

fbshipit-source-id: 4627c95ba18c8a5ac8d06839c343e06e71c46aa7
2020-08-28 18:15:01 -07:00
a1eae6d158 Implementing NumPy-like function torch.heaviside() (#42523)
Summary:
- Related with https://github.com/pytorch/pytorch/issues/38349
- Implementing the NumPy-like function `torch.heaviside()`.
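
A quick illustration of the semantics (as in `numpy.heaviside`, the second argument supplies the value used where the input is zero):

```python
import torch

x = torch.tensor([-1.5, 0.0, 2.0])
values = torch.tensor([0.5])   # result where x == 0
torch.heaviside(x, values)     # -> tensor([0.0000, 0.5000, 1.0000])
```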

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42523

Reviewed By: glaringlee

Differential Revision: D23391941

Pulled By: mruberry

fbshipit-source-id: 7b942321a62567a5fc0a3679a289f4c4c19e6134
2020-08-28 18:11:20 -07:00
633d239409 [torch.fx] Pass placeholders through delegate too (#43432)
Summary:
It's useful if we add additional attributes to nodes in the graph - it's easier to set the attribute on all nodes, even if the value happens to be None.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43432

Reviewed By: jamesr66a

Differential Revision: D23276433

Pulled By: dzhulgakov

fbshipit-source-id: c69e7cb723bbbb4dba3b508a3d6c0e456fe610df
2020-08-28 18:07:52 -07:00
3f0120edb4 Revert D23360705: [pytorch][PR] Add amax/amin
Test Plan: revert-hammer

Differential Revision:
D23360705 (bcec8cc3f9)

Original commit changeset: 5bdeb08a2465

fbshipit-source-id: 76a9e199823c7585e55328bad0778bcd8cd49381
2020-08-28 18:01:25 -07:00
7d517cf96f [NCCL] Dedicated stream to run all FutureNCCL callbacks. (#43447)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43447

Two main better-engineering motivations to run all FutureNCCL callbacks on a dedicated stream:
1. Each time a then callback was called, we would get a stream from the pool and run the callback on that stream. If we observe the stream traces using that approach, we would see a lot of streams and debugging would become more complicated. If we have a dedicated stream to run all then callback operations, the trace results will be much cleaner and easier to follow.
2. getStreamFromPool may eventually return the default stream or a stream that is used for other operations. This can cause slowdowns.

Unless the ``then`` callback takes longer than the preceding allreduce, this approach will be as performant as the previous one.
ghstack-source-id: 110909401

Test Plan:
Perf trace runs to validate the desired behavior:
See the dedicated stream 152 is running the then callback operations:

{F299759342}

I ran pytorch.benchmark.main.workflow using resnet50 and 32 GPUs, registering allreduce with a ``then`` hook.
See f213777896 [traces](https://www.internalfb.com/intern/perfdoctor/results?run_id=26197585)

After updates, same observation: see f214890101

Reviewed By: malfet

Differential Revision: D23277575

fbshipit-source-id: 67a89900ed7b70f3daa92505f75049c547d6b4d9
2020-08-28 17:26:23 -07:00
3f5ea2367e Adding a version serialization type to ConvPackedParam (#43086)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43086

This PR changes the format of `ConvPackedParam` in a nearly backwards-compatible way:
* a new format is introduced which has more flexibility and a lower on-disk size
* custom pickle functions are added to `ConvPackedParams` which know how to load the old format
* the custom pickle functions are **not** BC because the output type of `__getstate__` has changed.  We expect this to be acceptable as no user flows are actually broken (loading a v1 model with v2 code works), which is why we whitelist the failure.
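
A generic sketch of the versioned `__getstate__`/`__setstate__` pattern described above, in plain Python for illustration (the actual implementation lives in the quantized conv code; names here are made up):

```python
class PackedParams:
    _VERSION = 2

    def __getstate__(self):
        # v2 prefixes the payload with a version tag; v1 stored the payload alone.
        return (self._VERSION, self.payload)

    def __setstate__(self, state):
        if isinstance(state, tuple) and state[0] == 2:
            self.payload = state[1]   # new v2 format
        else:
            self.payload = state      # legacy v1 format still loads
```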

Test plan (TODO finalize):

```
// adhoc testing of saving v1 and loading in v2: https://gist.github.com/vkuzo/f3616c5de1b3109cb2a1f504feed69be

// test that loading models with v1 conv params format works and leads to the same numerics
python test/test_quantization.py TestSerialization.test_conv2d_graph
python test/test_quantization.py TestSerialization.test_conv2d_nobias_graph

// test that saving and loading models with v2 conv params format works and leads to same numerics
python test/test_quantization.py TestSerialization.test_conv2d_graph_v2
python test/test_quantization.py TestSerialization.test_conv2d_nobias_graph_v2

// TODO before land:
// test numerics for a real model
// test legacy ONNX path
```

Note: this is a newer copy of https://github.com/pytorch/pytorch/pull/40003

Test Plan: Imported from OSS

Reviewed By: dreiss

Differential Revision: D23347832

Pulled By: vkuzo

fbshipit-source-id: 06bbe4666421ebad25dc54004c3b49a481d3cc92
2020-08-28 15:41:30 -07:00
af4ecb3c11 quantized conv: add support for graph mode BC testing, and increase coverage (#43524)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43524

1. adds support for testing BC on data format and numerics for graph mode
quantized modules
2. using the above, adds coverage for quantized conv2d on graph mode

Test Plan:
```
python test/test_quantization.py TestSerialization.test_conv2d_nobias
python test/test_quantization.py TestSerialization.test_conv2d_graph
python test/test_quantization.py TestSerialization.test_conv2d_nobias_graph
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D23335222

fbshipit-source-id: 0c9e93a940bbf6c676c2576eb62fcc725247588b
2020-08-28 15:40:22 -07:00
4cb8d306e6 Add _foreach_add_(TensorList tensors, Scalar scalar) API (#42531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42531

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient in cases where we work with a lot of small feature tensors: launching a lot of kernels slows down the whole process. We need to reduce the number of kernels that we launch.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What is 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can fall back to a regular for-loop where the op is applied to each tensor individually through the dispatch mechanisms. There are a few checks that decide whether the op will be performed via a 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.

---------------
**In this PR**
- Adding a `std::vector<Tensor> _foreach_add_(TensorList tensors, Scalar scalar)` API
- Resolving some additional comments from previous [PR](https://github.com/pytorch/pytorch/pull/41554).

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists

**Plan for the next PRs**
1. APIs
- Binary Ops for list with Scalar
- Binary Ops for list with list
- Unary Ops for list
- Pointwise Ops

2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.
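
A minimal example of the in-place API added in this PR (a private, underscore-prefixed entry point):

```python
import torch

tensors = [torch.ones(2, 2) for _ in range(10)]
# Adds the scalar to every tensor in the list in place; on CUDA this can
# use a single fused kernel instead of one launch per tensor.
torch._foreach_add_(tensors, 1.0)
```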

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23331892

Pulled By: izdeby

fbshipit-source-id: c585b72e1e87f6f273f904f75445618915665c4c
2020-08-28 14:34:46 -07:00
20abfc21e4 Adds arctanh, arcsinh aliases, simplifies arc* alias dispatch (#43762)
Summary:
Adds two more "missing" NumPy aliases: arctanh and arcsinh, and simplifies the dispatch of other arc* aliases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43762

Reviewed By: ngimel

Differential Revision: D23396370

Pulled By: mruberry

fbshipit-source-id: 43eb0c62536615fed221d460c1dec289526fb23c
2020-08-28 13:59:19 -07:00
0564d7a652 Land code coverage tool for OSS (#43778)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43778

Move code_coverage_tool from experimental folder to caffe2/tools folder.

Delete `TODO` and fb-related code.

Test Plan: Test locally

Reviewed By: malfet

Differential Revision: D23399983

fbshipit-source-id: 92316fd3cc88409d087d2dc6ed0be674155b3762
2020-08-28 13:56:15 -07:00
89e2a3591e Add 1% threshold to codecov (#43783)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43783

Reviewed By: seemethere

Differential Revision: D23402196

Pulled By: malfet

fbshipit-source-id: bd11d6edc6d1f15bd227636a549b9ea7b3aca256
2020-08-28 13:51:23 -07:00
b23e9cdd64 .circleci: Add slash to end of s3 cp (#43792)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43792

This fixes the issue we had with the nightlies not being uploaded
properly, basically what was happening was that `aws s3 cp` doesn't
automatically distinguish between prefixes that are already
"directories" vs a single file with the same name.

This means that if you'd like to upload a file to a "directory" in S3
you need to suffix your destination with a slash.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D23402074

Pulled By: seemethere

fbshipit-source-id: 6085595283fcbbbab0836ccdfe0f8aa2a6abd7c8
2020-08-28 13:37:25 -07:00
776c2d495f [JIT] IRParser: store list attributes as generic ivalue lists. (#43785)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43785

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23400565

Pulled By: ZolotukhinM

fbshipit-source-id: e248eb1854c4ec40da9455d4279ea6e47b1f2a16
2020-08-28 13:27:28 -07:00
bcec8cc3f9 Add amax/amin (#43092)
Summary:
Add max/min operators that only return values.

## Some important decisions to discuss
| **Question**                          | **Current State** |
|---------------------------------------|-------------------|
| Expose torch.max_values to python?    | No                |
| Remove max_values and only keep amax? | Yes               |
| Should amax support named tensors?    | Not in this PR    |

## Numpy compatibility

Reference: https://numpy.org/doc/stable/reference/generated/numpy.amax.html

| Parameter                                                                                                                                                                                                                                              | PyTorch Behavior                                                                  |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------|
| `axis`:  None or int or tuple of ints, optional. Axis or axes along which to operate. By default, flattened input is used. If this is a tuple of ints, the maximum is selected over multiple axes, instead of a single axis or all the axes as before. | Named `dim`, behavior same as `torch.sum` (https://github.com/pytorch/pytorch/issues/29137)                                |
| `out`: ndarray, optional. Alternative output array in which to place the result. Must be of the same shape and buffer length as the expected output.                                                                                                   | Same                                                                              |
| `keepdims`: bool, optional. If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.                                      | implemented as `keepdim`                                                          |
| `initial`: scalar, optional. The minimum value of an output element. Must be present to allow computation on empty slice.                                                                                                                              | Not implemented in this PR. Better to implement for all reductions in the future. |
| `where`: array_like of bool, optional. Elements to compare for the maximum.                                                                                                                                                                            | Not implemented in this PR. Better to implement for all reductions in the future. |

**Note from numpy:**
> NaN values are propagated, that is if at least one item is NaN, the corresponding max value will be NaN as well. To ignore NaN values (MATLAB behavior), please use nanmax.

PyTorch has the same behavior
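
Typical usage of the values-only reduction:

```python
import torch

t = torch.arange(12.0).reshape(3, 4)
torch.amax(t)                            # global max: tensor(11.)
torch.amax(t, dim=1)                     # row maxima, no indices returned
torch.amax(t, dim=(0, 1), keepdim=True)  # multi-dim reduction, NumPy-style
```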

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43092

Reviewed By: ngimel

Differential Revision: D23360705

Pulled By: mruberry

fbshipit-source-id: 5bdeb08a2465836764a5a6fc1a6cc370ae1ec09d
2020-08-28 12:51:03 -07:00
f4695203c2 Fixes fft function calls for C++ API (#43749)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43732.

Requires importing the fft namespace in the C++ API, just like the Python API does, to avoid clobbering torch::fft the function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43749

Reviewed By: glaringlee

Differential Revision: D23391544

Pulled By: mruberry

fbshipit-source-id: d477d0b6d9a689d5c154ad6c31213a7d96fdf271
2020-08-28 12:41:30 -07:00
dc5d365514 Fix bug in caching allocator. (#43719)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43719

Accidentally this slipped through: with guard did not update the current
context

Test Plan: cpu_caching_allocator_test

Reviewed By: linbinyu

Differential Revision: D23374453

fbshipit-source-id: 1d3ef21cc390d0a8bde98fb1b5c2175b40ab571b
2020-08-28 11:56:23 -07:00
be3ec6ab3e [caffe2][torch] correctly re-raise Manifold StorageException
Summary:
1) Manifold raises StorageException when it sees an error: https://fburl.com/diffusion/kit3me8a
2) torch re-raises the exception: https://fburl.com/diffusion/zbw9wmpu
The issue is that StorageException's first argument is a bool (canRetry), while the re-raise passes a str as the first argument, as in all Python exceptions.

Test Plan:
Existing tests should pass. +
```
In [1]: from manifold.clients.python import StorageException
In [2]: getattr(StorageException, "message", None)
Out[2]: <attribute 'message' of 'manifold.blobstore.blobstore.types.StorageException' objects>
In [3]: getattr(Exception, "message", None) is None
Out[3]: True

Reviewed By: haijunz

Differential Revision: D23195514

fbshipit-source-id: baa1667dbba4086db6ec93f009e400611ac9b938
2020-08-28 11:41:10 -07:00
b72da0cf28 OneDNN: report error for dilation max_pooling and replace AT_ERROR with TORCH_CHECK in oneDNN codes (#43538)
Summary:
Fix https://github.com/pytorch/pytorch/issues/43514.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43538

Reviewed By: agolynski

Differential Revision: D23364302

Pulled By: ngimel

fbshipit-source-id: 8d17752cf33dcacd34504e32b5e523e607cfb497
2020-08-28 10:57:19 -07:00
1f7434d1ea Fix 'module' to 'model' in quantize_dynamic doc (#43693)
Summary:
Fixes issue https://github.com/pytorch/pytorch/issues/43503

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43693

Reviewed By: malfet

Differential Revision: D23397641

Pulled By: mrshenli

fbshipit-source-id: bc216cea4f0a30c035e84a6cfebabd3755ef1305
2020-08-28 10:44:43 -07:00
a76184fe1e grammatical error fix (#43697)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43697

Reviewed By: malfet

Differential Revision: D23397655

Pulled By: mrshenli

fbshipit-source-id: fb447dcde4f83bc6650f0faa0728a1867cfa5213
2020-08-28 10:38:46 -07:00
b630c1870d Add stateful XNNPack deconvolution2d operator to torch. (#43233)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43233

XNNPack is already being used for the convolution2d operation. Add the
ability for it to be used with transpose convolution.

Test Plan: buck run caffe2/test:xnnpack_integration

Reviewed By: kimishpatel

Differential Revision: D23184249

fbshipit-source-id: 3fa728ce1eaca154d24e60f800d5e946d768c8b7
2020-08-28 10:31:36 -07:00
58a7e73a95 [TensorExpr] Block Codegen (#40054)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40054

Reviewed By: ZolotukhinM

Differential Revision: D22061350

Pulled By: protonu

fbshipit-source-id: 004f7c316629b16610ecdbb97e43036c72c65067
2020-08-28 09:53:42 -07:00
9063bcee04 Don't proceed into setup.py too far if Python version is unsupported (#42870)
Summary:
This prevents confusing errors when the interpreter encounters some
syntax errors in the middle.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42870

Reviewed By: albanD

Differential Revision: D23269265

Pulled By: ezyang

fbshipit-source-id: 61f62cbe294078ad4a909fa87aa93abd08c26344
2020-08-28 09:04:55 -07:00
c177d25edf TensorIterator: Check for memory overlap in all nullary_ops (#43421)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43421

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23298654

Pulled By: zou3519

fbshipit-source-id: 71b401f6ea1e3b50b830fef650927cc5b3fb940f
2020-08-28 08:40:25 -07:00
dc0722e9b7 TensorIterator: Check for memory overlap in all compare_ops (#43420)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43420

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23298650

Pulled By: zou3519

fbshipit-source-id: 171cd17a3012880a5d248ffd0ea6942fbfb6606f
2020-08-28 08:40:22 -07:00
065ebdb92f TensorIterator: Check for memory overlap in all binary_ops (#43419)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43419

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23298655

Pulled By: zou3519

fbshipit-source-id: 82e0ff308a6a7e46b4342d57ddb4c1d73745411a
2020-08-28 08:40:19 -07:00
bdee8e02c0 TensorIterator: Check memory overlap in all unary_ops (#43418)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43418

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23298651

Pulled By: zou3519

fbshipit-source-id: 84be498f5375813fd10cf30b8beabbd2d15210a3
2020-08-28 08:39:13 -07:00
0ab83f7f9f Fixed undefined behavior in BatchedFallback (#43705)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43705

This was causing fb-internal flakiness. I'm surprised that the ASAN
builds don't catch this behavior.

The problem is that dereferencing the end() pointer of a vector is
undefined behavior. This PR fixes one callsite where BatchedFallback
dereferences the end() pointer and adds an assert to make sure another
callsite doesn't do that.

Test Plan:
- Make sure all tests pass (`pytest test/test_vmap.py -v`)
- It's hard to write a new test for this because most of the time this
doesn't cause a crash. It really depends on what lives at the end()
pointer.

Reviewed By: ezyang

Differential Revision: D23373352

Pulled By: zou3519

fbshipit-source-id: 61ea0be80dc006f6d4e73f2c5badd75096f63e56
2020-08-28 08:09:17 -07:00
8e507ad00e Update the div formula for numerical stability (#43627)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43414

See the issue for numerical improvements and a quick benchmark.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43627

Reviewed By: agolynski

Differential Revision: D23350124

Pulled By: albanD

fbshipit-source-id: 19d51640b3f200db37c32d2233a4244480e5a15b
2020-08-28 07:49:35 -07:00
b29375840a Revert D23379383: Land code_coverage_tool to caffe2/tools folder
Test Plan: revert-hammer

Differential Revision:
D23379383 (f06d3904f2)

Original commit changeset: f6782389ebb1

fbshipit-source-id: 33a26761deb58dfe81314ea912bf485c5fc962b7
2020-08-28 07:19:12 -07:00
c7787f7fbf [numpy compatibility]Fix argmin/argmax when multiple max/min values (#42004)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41998
Fixes https://github.com/pytorch/pytorch/issues/22853

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42004

Reviewed By: ngimel

Differential Revision: D23049003

Pulled By: mruberry

fbshipit-source-id: a6fddbadfec4b8696730550859395ce4f0cf50d6
2020-08-28 06:42:42 -07:00
26161e8ab6 [Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D23393950

fbshipit-source-id: 6a31b7ab6961cba88014f41b3ed1eda108edebab
2020-08-28 05:38:13 -07:00
f06d3904f2 Land code_coverage_tool to caffe2/tools folder
Summary:
Move `code_coverage_tool` from `experimental` folder to `caffe2/tools` folder.

Not sure whether the fb-related code is something we don't want to share with OSS. Can reviewers please help me check `fbcode_coverage.py` and the files in the `fbcode/` folder?

Test Plan: Test locally

Reviewed By: malfet

Differential Revision: D23379383

fbshipit-source-id: f6782389ebb1b147eaf6d3664b5955db79d24ff3
2020-08-27 18:44:40 -07:00
654ab209c6 [JIT] Disable broken tests (#43750)
Summary:
These started failing after **https://github.com/pytorch/pytorch/pull/43633** for indecipherable reasons; temporarily disable them. The errors on the PRs were
```
Downloading workspace layers
  workflows/workspaces/3ca9ca71-7449-4ae1-bb7b-b7612629cc62/0/8607ba99-5ced-473b-b60a-0025b48739a6/0/105.tar.gz - 8.4 MB
Applying workspace layers
  8607ba99-5ced-473b-b60a-0025b48739a6
```
which is not too helpful...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43750

Reviewed By: ZolotukhinM

Differential Revision: D23388060

Pulled By: eellison

fbshipit-source-id: 96afa0160ec948049f3e194787a0a7ddbeb5124a
2020-08-27 18:12:57 -07:00
1a21c92364 [ONNX] Update in scatter ONNX export when scalar src has different type (#43440)
Summary:
`torch.scatter` allows `src` to be of a different type when `src` is a scalar. This requires an explicit Cast op to be inserted in the ONNX graph because ONNX `ScatterElements` does not allow different types. This PR updates the export of `torch.scatter` with this logic (see the example below).
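
The scalar-`src` form in question (an int scalar scattered into a float tensor):

```python
import torch

x = torch.zeros(3, 5)                    # float32 destination
index = torch.tensor([[0, 1, 2, 0, 1]])
x.scatter_(0, index, 1)                  # scalar src of a different type
```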

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43440

Reviewed By: hl475

Differential Revision: D23352317

Pulled By: houseroad

fbshipit-source-id: c9eeddeebb67fc3c40ad01def134799ef2b4dea6
2020-08-27 16:45:37 -07:00
87d7c362b1 [JIT] Add JIT support for torch.no_grad (#41371)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41371

**Summary**
This commit enables the use of `torch.no_grad()` in a with item of a
with statement within JIT. Note that the use of this context manager as
a decorator is not supported.
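
For example, a function like the following now scripts and runs (while `@torch.no_grad()` as a decorator remains unsupported):

```python
import torch

@torch.jit.script
def fn(x: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():  # supported as a with item after this change
        y = x * 2
    return y
```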

**Test Plan**
This commit adds a test case to the existing with statements tests for
`torch.no_grad()`.

**Fixes**
This commit fixes #40259.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D22649519

Pulled By: SplitInfinity

fbshipit-source-id: 7fa675d04835377666dfd0ca4e6bc393dc541ab9
2020-08-27 15:32:57 -07:00
8032dbc117 Add Rowwise Prune PyTorch op (#42708)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42708

Add rowwise prune pytorch op.

This operator introduces sparsity to the 'weights' matrix with the help
of the importance indicator 'mask'.

A row is considered important and not pruned if the mask value for that
particular row is 1(True) and not important otherwise.
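
A plain-Python sketch of the semantics only (not the op's actual signature): rows whose mask entry is True survive.

```python
import torch

weights = torch.randn(4, 3)
mask = torch.tensor([True, False, True, True])
pruned = weights[mask]   # keeps rows 0, 2 and 3; row 1 is pruned away
```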

Test Plan:
buck test caffe2/torch/fb/sparsenn:test -- rowwise_prune
buck test caffe2/test:pruning

Reviewed By: supriyar

Differential Revision: D22849432

fbshipit-source-id: 456f4f77c04158cdc3830b2e69de541c7272a46d
2020-08-27 15:16:23 -07:00
3a0e35c9f2 [pytorch] deprecate static dispatch (#43564)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43564

Static dispatch was originally introduced for mobile selective build.

Since we have added selective build support for dynamic dispatch and
tested it in FB production for months, we can deprecate static dispatch
to reduce the complexity of the codebase.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23324452

Pulled By: ljk53

fbshipit-source-id: d2970257616a8c6337f90249076fca1ae93090c7
2020-08-27 14:52:48 -07:00
3afd24d62c [pytorch] check in default generated op dependency graph (#43570)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43570

Add the default op dependency graph to the source tree - use it if user runs
custom build in dynamic dispatch mode without providing the graph.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23326988

Pulled By: ljk53

fbshipit-source-id: 5fefe90ca08bb0ca20284e87b70fe1dba8c66084
2020-08-27 14:51:44 -07:00
9a2d4d550e update build flags for benchmark binaries
Summary:
As suggested by Shoaib Meenai, we should use mode/ndk_libcxx to replace mode/gnustl.

This diff updates all build flags for caffe2 and pytorch in aibench. For easy management, I created two mode files in xplat/caffe2/mode and deleted buckconfig.ptmobile.pep.

Test Plan:
caffe2
```
buck run aibench:run_bench -- -b aibench/specifications/models/caffe2/squeezenet/squeezenet.json --remote --devices s9f
```
https://our.intern.facebook.com/intern/aibench/details/433604719423848

full jit
```
buck run aibench:run_bench -- -b aibench/specifications/models/pytorch/fbnet/fbnet_mobile_inference.json --platform android/full_jit --framework pytorch --remote --devices SM-G960F-8.0.0-26
```
https://our.intern.facebook.com/intern/aibench/details/189359776958060

lite interpreter
```
buck run aibench:run_bench -- -b aibench/specifications/models/pytorch/fbnet/fbnet_mobile_inference.json --platform android --framework pytorch --remote --devices s9f
```
https://our.intern.facebook.com/intern/aibench/details/568178969092066

Reviewed By: smeenai

Differential Revision: D23338089

fbshipit-source-id: 62f4ae2beb004ceaab1f73f4de8ff9e0c152d5ee
2020-08-27 14:40:01 -07:00
01f974eb1e Specialize optionals for grad_sum_to_size (#43633)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43633

In the backward graph, _grad_sum_to_size is inserted whenever a possibly broadcasting op is called:
`"aten::_grad_sum_to_size(Tensor(a) self, int[]? size) -> Tensor(a)"`
If a broadcast occurred, a sum is called; otherwise the second input is None and it is a no-op. Most of the time it's a no-op (> 90% of the time in the fast RNNs benchmark).

We can get rid of this op by profiling the optionality of the second input. I added `prim::profile_optional` to do this, which counts the number of times it saw a None value and the number of times it saw a value present. When specializing the backward graph, we insert checks for values we profiled as None, and in the optimized block can remove the grad_sum_to_size calls that use those values.

In the future we may revisit this when NNC supports reductions and we want to replace grad_sum_to_size with sums as well, but I think this is worth landing now.

Test Plan: Imported from OSS

Reviewed By: bwasti, ZolotukhinM

Differential Revision: D23358809

Pulled By: eellison

fbshipit-source-id: a30a148ca581370789d57ba082d23cbf7ef2cd4d
2020-08-27 14:35:37 -07:00
a19fd3a388 Add undefined specializations in backward (#43632)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43632

Specialize the backward graph by guarding on the undefinedness of the input tensors. The graph will look like:
```
ty1, ty2, succesful_checks = prim::TypeCheck(...)
if (succesful_checks)
-> optimized graph
else:
-> fallback graph
```

Specializing on the undefinedness of tensors allows us to clean up the
```
if any_defined(inputs):
 outputs = <original_computation>
else:
 outputs = autograd zero tensors
```
blocks that make up the backward graph, so that we can fuse the original_computation nodes together.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23358808

Pulled By: eellison

fbshipit-source-id: f5bb28f78a4a3082ecc688a8fe0345a8a098c091
2020-08-27 14:35:35 -07:00
a4cf4c2437 refactor tests (#43631)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43631

I added a new test for just profiler stuff - I don't think the test should go in test_jit.py. Maybe this should just go in test_tensorexpr_fuser, but I'm not really testing tensorexpr stuff either... LMK

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23358810

Pulled By: eellison

fbshipit-source-id: 074238e1b60e4c4a919a052b7a5312b790ad5d82
2020-08-27 14:35:33 -07:00
e189ef5577 Refactor pass to class (#43630)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43630

No functional changes here - just refactoring specialize autograd zero to a class, and standardizing its API to take in a shared_ptr<Graph>

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23358805

Pulled By: eellison

fbshipit-source-id: 42e19ef2e14df66b44592252497a47d03cb07a7f
2020-08-27 14:35:30 -07:00
d1c4d75c14 Add API for unexecuted op (#43629)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43629

We have a few places where we count the size of a block/subgraph - it's nice to have a shared API to ignore operators that are not executed in the optimized graph (will be used when I add a new profiling node in PR ^^)

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23358807

Pulled By: eellison

fbshipit-source-id: 62c745d9025de94bdafd9f748f7c5a8574cace3f
2020-08-27 14:34:05 -07:00
5da97a38d1 Check if input is ChannelsLast or ChannelsLast3d for quantized AdaptivePool3d. (#42780)
Summary:
cc z-a-f, vkuzo. This serves as a very simple first step to the issue mentioned in https://github.com/pytorch/pytorch/issues/42779.

# Description
Since `ChannelsLast` and `ChannelsLast3d` are not equivalent [(MemoryFormat.h)](4e93844ab1/c10/core/MemoryFormat.h (L27)), the "fast" path for `NDHWC` is ignored.

This PR would produce the expected behaviour for 4 (5 if including batch) dimensional tensors.

# Benchmarks
## Notes
- For channels `< 8`, it is actually slower than before.
- For `qint32`, it is actually `2x` slower than before.
- For channels `> 8`, the execution time decreases up to `9-10` times in the benchmarks.
- While execution time does improve, it remains slower than the `contiguous` variant when channels `> 64`.

## C++
<img width="1667" alt="before_after_py" src="https://user-images.githubusercontent.com/37529096/89711911-5da22d80-d9e1-11ea-9b30-0c23d46c2c93.png">

## Python
<img width="1523" alt="before_after_cpp" src="https://user-images.githubusercontent.com/37529096/89711906-58dd7980-d9e1-11ea-9696-1963f394198a.png">

## Reproduce
See https://github.com/pytorch/pytorch/issues/42779.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42780

Reviewed By: smessmer

Differential Revision: D23035424

Pulled By: z-a-f

fbshipit-source-id: 15594846f66b73c22d2371eb8e47c472324d6139
2020-08-27 14:23:57 -07:00
cdc3e232e9 Add __str__ and __repr__ bindings to SourceRange (#43601)
Summary:
Added the bindings for `__str__` and `__repr__` methods for SourceRange

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43601

Test Plan:
`python test/test_jit.py`

cc gmagogsfm

Reviewed By: agolynski

Differential Revision: D23366500

Pulled By: gmagogsfm

fbshipit-source-id: ab4be6e8f9ad5f67a323554437878198483f4320
2020-08-27 12:30:47 -07:00
04ccd3ed77 Fix bazel dependencies (#43688)
Summary:
Add `header_template_rule` to `substitution.bzl`
Use it in BUILD.bazel to specify dependencies on autogenerated headers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43688

Test Plan: bazel build --sandbox_writable_path=$HOME/.ccache -c dbg :caffe2

Reviewed By: seemethere

Differential Revision: D23374702

Pulled By: malfet

fbshipit-source-id: 180dd996d1382df86258bb6abab9f2c7e964152e
2020-08-27 12:11:34 -07:00
bff741a849 Improve save_for_mobile cxx binary (#43721)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43721

We can combine the optimization pass and save_for_mobile to reduce friction. Since a lite interpreter model can also be used in full JIT, I don't think we need the option to save it as a full JIT model.

Also
- improved usage message
- print op list before and after optimization pass

Test Plan:
```
buck run //xplat/caffe2:optimize_for_mobile -- --model=/home/linbin/sparkspot.pt

Building: finished in 12.4 sec (100%) 2597/2597 jobs, 2 updated
  Total time: 12.5 sec

pt_operator_library(
        name = "old_op_library",
        ops = [
                "aten::_convolution",
                "aten::adaptive_avg_pool2d",
                "aten::add_.Tensor",
                "aten::batch_norm",
                "aten::mul.Tensor",
                "aten::relu_",
                "aten::softplus",
                "aten::sub.Tensor",
        ],
)

pt_operator_library(
        name = "new_op_library",
        ops = [
                "aten::adaptive_avg_pool2d",
                "aten::add_.Tensor",
                "aten::batch_norm",
                "aten::mul.Tensor",
                "aten::relu_",
                "aten::softplus",
                "aten::sub.Tensor",
                "prepacked::conv2d_clamp_run",
        ],
)

The optimized model for lite interpreter was saved to /home/linbin/sparkspot_mobile_optimized.bc
```

```
buck run //xplat/caffe2:optimize_for_mobile -- --model=/home/linbin/sparkspot.pt --backend=vulkan
```

Reviewed By: kimishpatel

Differential Revision: D23363533

fbshipit-source-id: f7fd61aaeda5944de5bf198e7f93cacf8368babd
2020-08-27 11:01:12 -07:00
3830998ac3 [fx] When generating names, avoid shadowing builtins (#43653)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43653

When nodes are created without an explicit name, a name is generated for
it based on the target. In these cases, we need to avoid shadowing
builtin names. Otherwise, code like:
```
a.foo.bar
```
results in pretty-printed code like:
```
getattr = a.foo
getattr_1 = getattr.bar
```

While this is technically allowed in Python, it's probably a bad idea,
and more importantly is not supported by TorchScript (where `getattr` is
hardcoded).

This PR changes the name generation logic to avoid shadowing all
builtins and language keywords. We already do this for PyTorch
built-ins, so just extend that logic. So now the generated code will
look like:

```
getattr_1 = a.foo
getattr_2 = getattr_1.bar
```
Fixes #43522

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23357420

Pulled By: suo

fbshipit-source-id: 91e9974adc22987eca6007a2af4fb4fe67f192a8
2020-08-27 10:43:56 -07:00
5a1aa0e21e [reland][quant][graphmode][fx] Add e2e test on torchvision (#43587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43587

Add tests for graph mode quantization on torchvision and make sure it matches
current eager mode quantization

Test Plan:
Imported from OSS

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23331253

fbshipit-source-id: 0445a44145d99837a2c975684cd0a0b7d965c8f9
2020-08-27 10:12:07 -07:00
73dcfc5e78 Update RNN op registration format (#43599)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43599

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D23350223

Pulled By: iseeyuan

fbshipit-source-id: 94c528799e31b2ffb02cff675604e7cce639687f
2020-08-27 07:27:14 -07:00
288a2effa0 Operator generator based on templated selective build. (#43456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43456

Introduce the template OperatorGenerator, which returns an optional Operator; it's null if the templated bool value is false.

RegisterOperators() is updated to take the optional Operator. A null will not be registered.

With this update the selective operator registration can be done at compile time. Tests are added to show an operator can be registered if it's in a whitelist and it will not be registered if it's not in the whitelist.

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D23283563

Pulled By: iseeyuan

fbshipit-source-id: 456e0c72b2f335256be800aeabb797bd83bcf0b3
2020-08-27 07:26:07 -07:00
c25d0015f0 Autograd code clean up (#43167)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43167

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23222358

Pulled By: anjali411

fbshipit-source-id: b738c63b294bcee7d680fa64c6300007d988d218
2020-08-27 07:07:52 -07:00
de84db2a9d [TensorExpr] Add aten::sum lowering to the kernel (#43585)
Summary:
Handles all dimensions and selected dimensions, per PyTorch semantics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43585

Test Plan: test_tensorexpr

Reviewed By: bertmaher

Differential Revision: D23362382

Pulled By: asuhan

fbshipit-source-id: e8d8f1197a026be0b46603b0807d996a0de5d58c
2020-08-27 02:46:47 -07:00
48e08f884e C++ APIs TransformerEncoder (#43187)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43187

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23182770

Pulled By: glaringlee

fbshipit-source-id: 968846138d4b1c391a74277216111dba8b72d683
2020-08-27 01:31:46 -07:00
f63d06a57b Fix docs for kwargs, a-e (#43583)
Summary:
To reduce the chance of conflicts, not all ops are fixed. Ops starting with letter `f` will be fixed in separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43583

Reviewed By: ZolotukhinM

Differential Revision: D23330347

Pulled By: mruberry

fbshipit-source-id: 3387cb1e495faebd16fb183039197c6d90972ad4
2020-08-27 00:14:05 -07:00
a070c619b9 [FX] Native callables in FX lowering (#43426)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43426

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23273427

Pulled By: jamesr66a

fbshipit-source-id: 3a9d04486c72933d8afd9c181578fe98c3d825b0
2020-08-27 00:00:03 -07:00
79e6aaeb4c pull empty() out of use_c10_dispatcher: full (#43572)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43572

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D23326019

Pulled By: bhosmer

fbshipit-source-id: 10a4d7ffe33b4be4ae45396725456c6097ce1757
2020-08-26 22:51:06 -07:00
01b5c06254 [fix] handle empty args in chain_matmul (#43553)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41817

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43553

Reviewed By: agolynski

Differential Revision: D23342586

Pulled By: mruberry

fbshipit-source-id: c6349f8fa9fcefcf03681d92c085a21265d1e690
2020-08-26 18:54:46 -07:00
28be3ef2f2 Fix hipify script for pytorch extensions (#43528)
Summary:
PyTorch extensions can have .cpp or .h files which contain CUDA code that needs to be hipified. The current hipify script logic has overly strict conditions to determine which files get considered for hipification: https://github.com/pytorch/pytorch/blob/master/torch/utils/hipify/hipify_python.py#L146

These conditions might apply well to pytorch/caffe2 source code, but are overconstrained for third-party extensions.
`is_pytorch_file` conditions: https://github.com/pytorch/pytorch/blob/master/torch/utils/hipify/hipify_python.py#L549
`is_caffe2_gpu_file` conditions: https://github.com/pytorch/pytorch/blob/master/torch/utils/hipify/hipify_python.py#L561

This PR relaxes these conditions if we're hipifying a pytorch extension (specified by `is_pytorch_extension=True`) and considers all the file extensions specified using the `extensions` parameter: https://github.com/pytorch/pytorch/blob/master/torch/utils/hipify/hipify_python.py#L820
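
Roughly how an extension build might invoke it; the exact `hipify` signature should be checked against hipify_python.py, and apart from `is_pytorch_extension` and `extensions` (named in this PR) the arguments below are assumptions:

```python
from torch.utils.hipify import hipify_python

hipify_python.hipify(
    project_directory="my_ext",                # assumed layout
    output_directory="my_ext",
    extensions=(".cu", ".cuh", ".h", ".cpp"),  # all of these are considered
    is_pytorch_extension=True,                 # relax pytorch/caffe2-only checks
)
```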

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43528

Reviewed By: mruberry

Differential Revision: D23328272

Pulled By: ngimel

fbshipit-source-id: 1e9c3a54ae2da65ac596a7ecd5539f3e14eeed88
2020-08-26 18:41:48 -07:00
c4e5ab6ff2 [TensorExpr] Disable a flaky test. (#43678)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43678

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D23363651

Pulled By: ZolotukhinM

fbshipit-source-id: 9557fbfda28633cea169836b02d034e9c950bc71
2020-08-26 18:35:24 -07:00
00c1501bc0 [JIT] Cast return values of functions returning Any (#42259)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42259

**Summary**
This commit modifies IR generation to insert explicit cast that cast
each return value to `Any` when a function is annotated as returning `Any`.
This precludes the failure in type unification (see below) that caused
this issue.

Issue #41962 reported that the use of an `Any` return type in
combination with different code paths returning values of different
types causes a segmentation fault. This is because the exit transform
pass tries to unify the different return types, fails, but silently sets
the type of the if node to c10::nullopt. This causes problems later in
shape analysis when that type object is dereferenced.

**Test Plan**
This commit adds a unit test that checks that a function similar to the
one in #41962 can be scripted and executed.

**Fixes**
This commit fixes #41962.

Differential Revision: D22883244

Test Plan: Imported from OSS

Reviewed By: eellison, yf225

Pulled By: SplitInfinity

fbshipit-source-id: 523d002d846239df0222cd07f0d519956e521c5f
2020-08-26 18:24:11 -07:00
f73e32cd04 Reduce amount of work done within a global lock within ParallelLoadOp (#43508)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43508

Differential Revision: D22952007

fbshipit-source-id: 11e28d20175271e6068edce8cb36f9fcf867a02a
2020-08-26 18:19:40 -07:00
0bf27d64f4 Fix NaN propagation in fuser's min/max implementation (#43590)
Summary:
fmax/fmin propagate the number if one argument is NaN, which doesn't match the eager mode behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43590

Reviewed By: mruberry

Differential Revision: D23338664

Pulled By: bertmaher

fbshipit-source-id: b0316a6f01fcf8946ba77621efa18f339379b2d0
2020-08-26 17:31:06 -07:00
033b7ae3ef implement NumPy-like functionality maximum, minimum (#42579)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349

Implement NumPy-like functions `maximum` and `minimum`.
The `maximum` and `minimum` functions compare input tensors element-wise, returning a new tensor with the element-wise maxima/minima.

If one of the elements being compared is NaN, then that element is returned. Neither `maximum` nor `minimum` supports complex inputs.

This PR also promotes the overloaded versions of torch.max and torch.min, by re-dispatching binary `torch.max` and `torch.min` to `torch.maximum` and `torch.minimum`.
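
Quick example, including the NaN-propagation behavior noted above:

```python
import torch

a = torch.tensor([1.0, float("nan"), 3.0])
b = torch.tensor([2.0, 0.0, 2.0])
torch.maximum(a, b)   # -> tensor([2., nan, 3.]); NaN is propagated
torch.minimum(a, b)   # -> tensor([1., nan, 2.])
```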

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42579

Reviewed By: mrshenli

Differential Revision: D23153081

Pulled By: mruberry

fbshipit-source-id: 803506c912440326d06faa1b71964ec06775eac1
2020-08-26 16:56:12 -07:00
9ca338a9d4 [ONNX] Modified slice node in inplace ops pass (#43275)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43275

Reviewed By: hl475

Differential Revision: D23352540

Pulled By: houseroad

fbshipit-source-id: 7fce3087c333efe3db4b03e9b678d0bee418e93a
2020-08-26 16:51:20 -07:00
1bda5e480c Add Python code coverage (#43600)
Summary:
Replace  `test` with  `coverage_test` stage for `pytorch-linux-bionic-py3.8-gcc9` configuration
Add `coverage.xml` to the list of ignored files
Add `codecov.yml` that maps installed pytorch folders back to original locations
Cleanup coverage option utilization in `run_test.py` and adapt it towards combining coverage reports across the runs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43600

Reviewed By: seemethere

Differential Revision: D23351877

Pulled By: malfet

fbshipit-source-id: acf78ae4c8f3e23920a76cce1d50f2821b83eb06
2020-08-26 16:16:03 -07:00
88e35fb8bd Skip SVD tests when no lapack (#43566)
Summary:
These tests are failing on one of my system that does not have lapack

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43566

Reviewed By: ZolotukhinM

Differential Revision: D23325378

Pulled By: mruberry

fbshipit-source-id: 5d795e460df0a2a06b37182d3d4084d8c5c8e751
2020-08-26 15:58:31 -07:00
cf26050e29 [pytorch] Move TensorIteratorConfig method implementation to cpp file (#43554)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43554

Move function implementations in the TensorIteratorConfig Class from TensorIterator.h to TensorIterator.cpp to avoid this issue: https://github.com/pytorch/pytorch/issues/43300

Reviewed By: malfet

Differential Revision: D23319007

fbshipit-source-id: 6cc3474994ea3094a294f795ac6998c572d6fb9b
2020-08-26 15:18:37 -07:00
6c28df7ceb [fx] add test for args/kwargs handling (#43640)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43640

+ added a `self.checkGraphModule` utility function to wrap the common
test assert pattern.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23356262

Pulled By: suo

fbshipit-source-id: a50626dcb01246d0dbd442204a8db5958cae23ab
2020-08-26 14:39:25 -07:00
5a15f56668 match batchmatmul on 1.0.0.6 (#43559)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43559

- remove the MKL strided gemm since it was acting weird in some cases; use the plain for-loop gemm for now. It will have performance implications, but this closes the gap for the ctr_instagram_5x model
- reproduced the failure scenario of batchmatmul on ctr_instagram_5x by increasing the dimensions of the inputs
- added an option in netrunner to skip bmm if needed

Test Plan:
- net runner passes with ctr_instagram 5x
- bmm unit test repros the discrepancy fixed

Reviewed By: amylittleyang

Differential Revision: D23320857

fbshipit-source-id: 7d5cfb23c1b0d684e1ef766f1c1cd47bb86c9757
2020-08-26 14:35:31 -07:00
769b9381fc DDP Communication hook: Fix the way we pass future result to buckets. (#43307)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43307

I identified a bug with the DDP communication hook while running accuracy benchmarks: I was getting `loss=nan`.

Looks like when we re-`initialize_bucketviews` with the value of `future_work`, `Reducer::mark_variable_ready_dense`'s `bucket_view.copy_(grad)` wasn't copying the grads back to the contents, since `bucket_view` no longer had any relationship with `contents` after being re-initialized with something else. Across multiple iterations, this was causing problems.
I solved this by adding two states for `bucket_view`:
```
    // bucket_views_in[i].copy_(grad) and
    // grad.copy_(bucket_views_out[i])
    // provide convenient ways to move grad data in/out of contents.
    std::vector<at::Tensor> bucket_views_in;
    std::vector<at::Tensor> bucket_views_out;
```

I included two additional unit tests where we run multiple iterations for better test coverage:
1) `test_accumulate_gradients_no_sync_allreduce_hook`
2) `test_accumulate_gradients_no_sync_allreduce_with_then_hook`.

ghstack-source-id: 110728299

Test Plan:
Run `python test/distributed/test_c10d.py`, some perf&accuracy benchmarks.

New tests:
`test_accumulate_gradients_no_sync_allreduce_hook`
`test_accumulate_gradients_no_sync_allreduce_with_then_hook`

Acc benchmark results look okay:
f214188350

Reviewed By: agolynski

Differential Revision: D23229309

fbshipit-source-id: 329470036cbc05ac12049055828495fdb548a082
2020-08-26 14:22:09 -07:00
0521c71241 [D23047144 Duplicate][2/3][lite interpreter] add metadata when saving and loading models for mobile (#43584)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43584

1. add `metadata.pkl` to the `.bc` file, which includes the model info that we are interested in
2. load `metadata.pkl` as an attribute `unordered_map<string, string>` in the module
ghstack-source-id: 110730013

Test Plan:
- CI
```buck build //xplat/caffe2:jit_module_saving
```
```buck build //xplat/caffe2:torch_mobile_core
```

Reviewed By: xcheng16

Differential Revision: D23330080

fbshipit-source-id: 5d65bd730b4b566730930d3754fa1bf16aa3957e
2020-08-26 14:07:49 -07:00
306eb3def7 Additional error checking for torch.cuda.nccl APIs. (#43247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43247

`torch.cuda.nccl` APIs didn't throw appropriate errors when called
with inputs/outputs of the wrong type, which resulted in some cryptic
errors instead.

Adding some error checks with explicit error messages for these APIs.
ghstack-source-id: 110683546
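
A hedged sketch of the kind of misuse these checks target; the exact exception type and message are assumptions, not quoted from the PR:

```python
import torch

inputs = torch.randn(4)   # a bare CPU tensor, not a collection of CUDA tensors
try:
    torch.cuda.nccl.all_reduce(inputs)
except TypeError as e:
    print(e)              # now an explicit type-check message
```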

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D23206069

fbshipit-source-id: 8107b39d27f4b7c921aa238ef37c051a9ef4d65b
2020-08-26 13:50:00 -07:00
db1fbc5729 [OACR][NLU] Add aten::str operator (#43573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43573

We recently updated the Stella NLU model in D23307228, and the App started to crash with `Following ops cannot be found:{aten::str, }`.

Test Plan: Verified by installing the assistant-playground app on Android.

Reviewed By: czlx0701

Differential Revision: D23325409

fbshipit-source-id: d670242868774bb0aef4be5c8212bc3a3f2f667c
2020-08-26 13:27:11 -07:00
6459f0a077 added rocm 3.7 docker image (#43576)
Summary:
Added bionic rocm 3.7 docker image

- jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43576

Reviewed By: malfet

Differential Revision: D23352310

Pulled By: seemethere

fbshipit-source-id: fd544b3825d8c25587f5765332c0a8ed1fa63c6e
2020-08-26 12:39:46 -07:00
a91e1cedc5 Reduce number of hypothesis tests in CI (#43591)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43591

Going from 100 randomized inputs to 50 doesn't change the coverage balance that much, but it speeds up test runtime
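
As an aside, a minimal hypothesis sketch of capping randomized inputs; `settings(max_examples=...)` is real hypothesis API, while the test body is illustrative:

```python
from hypothesis import given, settings, strategies as st

@settings(max_examples=50)      # cap the number of randomized inputs
@given(st.integers())
def test_int_str_roundtrip(x):
    assert int(str(x)) == x
```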

Test Plan: CI

Reviewed By: orionr, seemethere

Differential Revision: D23332393

fbshipit-source-id: 7a8ff9127ee3e045a83658a7a670a844f3862987
2020-08-26 11:54:49 -07:00
2a4d312027 Allow GPU skip decorators to report the right number of GPUs required in (#43468)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43468

Closes https://github.com/pytorch/pytorch/issues/41378.
https://github.com/pytorch/pytorch/pull/41973 enhanced the skip decorators to
report the right number of GPUs required, but this information was not passed to
the main process where the message is actually displayed. This PR uses a
`multiprocessing.Manager()` so that the dictionary modification is reflected
correctly in the main process.
ghstack-source-id: 110684228
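
A minimal sketch of why `multiprocessing.Manager` helps here: a managed dict proxies writes back to the parent process, unlike a plain dict:

```python
import multiprocessing as mp

def child(d):
    d["skip_reason"] = "Need at least 4 CUDA devices"

if __name__ == "__main__":
    with mp.Manager() as manager:
        d = manager.dict()                 # proxy dict shared with children
        p = mp.Process(target=child, args=(d,))
        p.start()
        p.join()
        print(d["skip_reason"])            # the child's write is visible here
```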

Test Plan:
With this diff, we can run a test in such as in https://github.com/pytorch/pytorch/pull/42577 that requires 4 GPUs on a 2 GPU machine, and we get the expected message:

```
test_ddp_uneven_inputs_replicated_error (test_distributed.TestDistBackend) ... skipped 'Need at least 4 CUDA devices'
```

Reviewed By: mrshenli

Differential Revision: D23285790

fbshipit-source-id: ac32456ef3d0b1d8f1337a24dba9f342c736ca18
2020-08-26 11:44:13 -07:00
25dcc28cd6 [jit][static] Replace deepcopy with copy (#43182)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43182

We should avoid using `deepcopy` on the module because it involves copying the weights.

Comparing the implementation of `c10::ivalue::Object::copy()` vs `c10::ivalue::Object::deepcopy()`, the only difference is `deepcopy` copies the attributes (slots) while `copy` does not.
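
A plain-Python sketch of the same distinction (not the `c10::ivalue::Object` code itself):

```python
import copy
import torch

class Module:
    def __init__(self):
        self.weight = torch.randn(3)   # stands in for a module slot

m = Module()
shallow = copy.copy(m)        # like copy(): shares the weight tensor
deep = copy.deepcopy(m)       # like deepcopy(): duplicates the weight data

shallow.weight.add_(1.0)
print(torch.equal(m.weight, shallow.weight))  # True: same storage
print(torch.equal(m.weight, deep.weight))     # False: independent copy
```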

Reviewed By: bwasti

Differential Revision: D23171770

fbshipit-source-id: 3cd711c6a2a19ea31d1ac1ab2703a0248b5a4ef3
2020-08-26 11:15:49 -07:00
51861cc9b1 .circleci: Add CUDA 11 to nightly binary builds (#43366)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43366

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D23348556

Pulled By: seemethere

fbshipit-source-id: 0cd129c5c27ffceec80636384762c3ff7bf74fdc
2020-08-26 10:11:01 -07:00
42f6c3b1f4 Raise error on device mismatch in addmm (#43505)
Summary:
Fixes gh-42282

This adds a device-mismatch check to `addmm` on CPU and CUDA. It seems like the dispatcher always selects the CUDA version here if any of the inputs are on GPU, so in theory the CPU check is unnecessary, but it's probably better to err on the side of caution.
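
A hedged sketch of the failure mode being guarded (reproducing it requires a CUDA machine):

```python
import torch

cpu = torch.randn(2, 2)
gpu = torch.randn(2, 2, device="cuda")
try:
    torch.addmm(cpu, gpu, gpu)     # mixed CPU/CUDA inputs
except RuntimeError as e:
    print(e)                       # explicit device-mismatch error
```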

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43505

Reviewed By: mruberry

Differential Revision: D23331651

Pulled By: ngimel

fbshipit-source-id: 8eb2f64f13d87e3ca816bacec9d91fe285d83ea0
2020-08-26 09:37:57 -07:00
7beeef2c69 .jenkins: Remove openssh installs (#43597)
Summary:
openssh should be installed by either the CircleCI machines or the
Jenkins workers, so we shouldn't need to install it ourselves to
get ssh functionality.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43597

Reviewed By: ezyang

Differential Revision: D23333479

Pulled By: seemethere

fbshipit-source-id: 17a1ad0200a9df7d4818ab1ed44c8488ec8888fb
2020-08-26 09:36:53 -07:00
573940f8d7 Fix type annotation errors in torch.functional (#43446)
Summary:
Closes gh-42968

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43446

Reviewed By: albanD

Differential Revision: D23280962

Pulled By: malfet

fbshipit-source-id: de5386a95a20ecc814c39cbec3e4252112340b3a
2020-08-26 08:27:59 -07:00
2b70f82737 fix typo in test_dataloader test_multiprocessing_contexts (take 2) (#43588)
Summary:
2nd attempt to land https://github.com/pytorch/pytorch/pull/43343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43588

Reviewed By: seemethere

Differential Revision: D23332284

Pulled By: malfet

fbshipit-source-id: d78faf468c56af2f176dbdd2ce4bd51f0b5df6fd
2020-08-25 21:11:53 -07:00
c1553ff94b Benchmarks: temporarily disable profiling-te configuration. (#43603)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43603

We are in the midst of landing a big rework of the profiling executor, and
benchmarks are expected to fail while we are in the transitional state.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D23334818

Pulled By: ZolotukhinM

fbshipit-source-id: 99ff17c6f8ee18d003f6ee76ff0e719cea68c170
2020-08-25 21:00:10 -07:00
3ec24f02af [TensorExpr] Start using typecheck in the fuser. (#43173)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43173

With this change the fuser starts to generate typechecks for the inputs of
a fusion group. For each fusion group we generate a typecheck and an if
node: the true block contains the fused subgraph, the false block
contains the unoptimized original subgraph.

Differential Revision: D23178230

Test Plan: Imported from OSS

Reviewed By: eellison

Pulled By: ZolotukhinM

fbshipit-source-id: f56e9529613263fb3e6575869fdb49973c7a520b
2020-08-25 18:13:32 -07:00
b763666f9f [JIT] Subgraph utils: add an optional vmap argument to the API to allow retrieving value mappings. (#43235)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43235

This functionality is needed when we want to not lose track of
nodes/values as we merge and unmerge them into other nodes. For
instance, if we have a side data structure with some meta information
about values or nodes, this new functionality would allow us to keep that
metadata up to date after merging and unmerging nodes.

Differential Revision: D23202648

Test Plan: Imported from OSS

Reviewed By: eellison

Pulled By: ZolotukhinM

fbshipit-source-id: 350d21a5d462454166f8a61b51d833551c49fcc9
2020-08-25 18:13:29 -07:00
d18566c617 [TensorExpr] Fuser: disallow aten::slice nodes. (#43365)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43365

We don't have shape inference for them yet.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23253418

Pulled By: ZolotukhinM

fbshipit-source-id: 9c38778b8a616e70f6b2cb5aab03d3c2013b34b0
2020-08-25 18:13:27 -07:00
8dc4b415eb [TensorExpr] Fuser: only require input shapes to be known (output shapes can be inferred). (#43171)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43171

Differential Revision: D23178228

Test Plan: Imported from OSS

Reviewed By: eellison

Pulled By: ZolotukhinM

fbshipit-source-id: e3465066e0cc4274d28db655de274a51c67594c4
2020-08-25 18:13:25 -07:00
f6b7c6da19 [TensorExpr] Fuser: move canHandle and some other auxiliary functions into TensorExprFuser class. (#43170)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43170

Differential Revision: D23178227

Test Plan: Imported from OSS

Reviewed By: eellison

Pulled By: ZolotukhinM

fbshipit-source-id: 3c3a0215344fb5942c4f3078023fef32ad062fe9
2020-08-25 18:12:01 -07:00
f35e069622 Back out "Make grad point to bucket buffer in DDP to save memory usage" (#43557)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43557

Back out the diff that caused some errors in PyText distributed training

Test Plan: Tested by rayhou who verified reverting the diff works

Differential Revision: D23320238

fbshipit-source-id: caa0fe74404059e336cd95fdb41373f58ecf486e
2020-08-25 18:04:39 -07:00
58666982fb check in intel nnpi 1007 into fbcode/tp2
Summary: As title

Test Plan:
* Details of conducted tests can be found in https://fb.workplace.com/groups/527892364588452/permalink/615694119141609/
* Sandcastle

Reviewed By: arunm-git

Differential Revision: D23198458

fbshipit-source-id: dd8d34a985dced66a5624a21e5d4a7e9a499ce39
2020-08-25 17:59:11 -07:00
b3f8834033 Batching rule for torch.pow, torch.result_type (#43515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43515

This PR adds a batching rule for torch.pow. This required adding a
batching rule for torch.result_type.
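
A hedged usage sketch, assuming the prototype `torch.vmap` API of this era:

```python
import torch

# torch.vmap was an experimental prototype at this point and may warn
x = torch.randn(3, 5)
y = torch.randn(3, 5)
out = torch.vmap(torch.pow)(x, y)     # torch.pow batched over the leading dim
print(out.shape)                      # torch.Size([3, 5])

# torch.result_type is what the batching rule consults for the output dtype
print(torch.result_type(x, torch.tensor(2.0, dtype=torch.float64)))  # float64
```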

Test Plan: - added new tests: `pytest test/test_vmap.py -v`

Reviewed By: cpuhrsch

Differential Revision: D23302737

Pulled By: zou3519

fbshipit-source-id: 2cade358750f6cc3abf45f81f2394900600927cc
2020-08-25 17:55:53 -07:00
c9f125bf70 Black to Block for various files (#42913)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41735 #41736 https://github.com/pytorch/pytorch/issues/41737 #41738: all areas where "black" is mentioned are replaced with "block"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42913

Reviewed By: houseroad

Differential Revision: D23112873

Pulled By: malfet

fbshipit-source-id: a515b56dc2ed20aa75741c577988d95f750b364c
2020-08-25 17:43:31 -07:00
348e78b086 Evenly distribute output grad into all matching inputs for min/max/median (#43519)
Summary:
cc: ngimel mruberry
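
The summary is terse; the title carries the change. A hedged sketch of the new backward semantics for a tied max:

```python
import torch

x = torch.tensor([1.0, 3.0, 3.0], requires_grad=True)
x.max().backward()
print(x.grad)   # tensor([0.0000, 0.5000, 0.5000]): grad split across the ties
```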

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43519

Reviewed By: albanD

Differential Revision: D23312235

Pulled By: ngimel

fbshipit-source-id: 678bda54996df7f29acf96add928bb7042fc2069
2020-08-25 16:36:33 -07:00
be637fd5f6 Revert D23306683: [quant][graphmode][fx] Testing torchvision
Test Plan: revert-hammer

Differential Revision:
D23306683 (62dcd253e3)

Original commit changeset: 30d27e225d45

fbshipit-source-id: e661334d187d3d6756facd36f2ebdb3ab2cd2e26
2020-08-25 15:24:02 -07:00
05f27b18fb Back out D23047144 "[2/3][lite interpreter] add metadata when saving and loading models for mobile"
Summary:
Original commit changeset: f368d00f7bae

Back out "[2/3][lite interpreter] add metadata when saving and loading models for mobile"

D23047144 (e37f871e87)

Pull Request: https://github.com/pytorch/pytorch/pull/43516

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: xcheng16

Differential Revision: D23304639

fbshipit-source-id: 970ca3438c1858f8656cbcf831ffee2c4a551110
2020-08-25 14:58:38 -07:00
5ca6cbbd93 Remove unnecessary copies in ProcessGroupGloo for multiple inputs allreduce (#43543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43543

Closes https://github.com/pytorch/pytorch/issues/14691. The extra copy is not needed in the multiple-outputs case, because gloo allreduce
will broadcast the result tensor to all the outputs. See
https://github.com/facebookincubator/gloo/issues/152 and commit
9cabb5aaa4
for more details. Came across this when debugging https://github.com/pytorch/pytorch/pull/42577.

This effectively reverts https://github.com/pytorch/pytorch/pull/14688 while still keeping the tests.

Tested by ensuring `test_allreduce_basics` in `test_c10d.py` still works as expected.
ghstack-source-id: 110636498

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23173945

fbshipit-source-id: d1ae08f84b4ac9919c53080949b8fffcb2fe63a8
2020-08-25 14:01:26 -07:00
9b05fbd92e Correct the windows docs (#43479)
Summary:
Fixes https://discuss.pytorch.org/t/i-cannot-use-the-pytorch-that-was-built-successfully-from-source-dll-initialization-routine-failed-error-loading-caffe2-detectron-ops-gpu-dll/93243/5?u=peterjc123.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43479

Reviewed By: mrshenli, ngimel

Differential Revision: D23294211

Pulled By: ezyang

fbshipit-source-id: d67df7d0355c2783153d780c94f959758b246d36
2020-08-25 13:41:24 -07:00
3df398a3a8 Update the QR documentation to include a warning about when the QR.backward is well-defined. (#43547)
Summary:
As per title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43547

Reviewed By: mruberry

Differential Revision: D23318829

Pulled By: albanD

fbshipit-source-id: 4764ebe1ad440e881b1c4c88b16fb569ef8eb0fa
2020-08-25 13:19:25 -07:00
62dcd253e3 [quant][graphmode][fx] Testing torchvision (#43526)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43526

Add tests for graph mode quantization on torchvision and make sure it matches
current eager mode quantization

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23306683

fbshipit-source-id: 30d27e225d4557bfc1d9aa462086e416aa9a9c0e
2020-08-25 13:02:14 -07:00
9420c773d0 Revert D23299452: [pytorch][PR] fix typo in test_dataloader test_multiprocessing_contexts
Test Plan: revert-hammer

Differential Revision:
D23299452 (6a2d7a05c4)

Original commit changeset: 9489c48b83bc

fbshipit-source-id: e8c15d338dd89d8e92f3710e9cf149149bd2e763
2020-08-25 12:34:49 -07:00
ebc0fc4dfc Polish the nightly.py docs in CONTRIBUTING a little (#43494)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43494

Reviewed By: mruberry

Differential Revision: D23296032

Pulled By: ngimel

fbshipit-source-id: c85a6d4c39cbb60644f79136a6f21fd49c813b61
2020-08-25 12:13:27 -07:00
3dcfe84861 Grammatical corrections (#43473)
Summary:
**A few documentation corrections.**

1. [...] If there is hard-to-debug error in one of your TorchScript **models**, you can use this flag [...]
2. [...] Since TorchScript (scripting and tracing) **is** disabled with this flag [...]

**Before corrections (as of now):**
![before-fix](https://user-images.githubusercontent.com/45713346/90977203-d8bc2580-e543-11ea-9609-fbdf5689dcb9.jpg)

**After corrections:**
![after-fix](https://user-images.githubusercontent.com/45713346/90977209-dbb71600-e543-11ea-8259-011618efd95b.jpg)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43473

Reviewed By: mruberry

Differential Revision: D23296167

Pulled By: ngimel

fbshipit-source-id: 932c9b25cc79d6e266e5ddb3744573b0bd63d925
2020-08-25 12:09:14 -07:00
f32ca57c5e Fix typo in LSTMCell document (#43395)
Summary:
Fixes typo in document

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43395

Reviewed By: mruberry

Differential Revision: D23312561

Pulled By: ngimel

fbshipit-source-id: 28340c96faf52c17acfe9f6b1dd94b71ea4d60ce
2020-08-25 12:04:59 -07:00
f8e9e7ad4a Allocating warp to an input index in compute_cuda_kernel (#43354)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43354

Instead of assigning a thread to an input index for repeating that index, we assign a warp to an index. This helps us avoid the costly uncoalesced memory accesses and branch divergence that occur when each thread repeats the index.

Test Plan: Run trainer to test

Reviewed By: ngimel

Differential Revision: D23230917

fbshipit-source-id: 731e912c844f1d859b0384fcaebafe69cb4ab56a
2020-08-25 10:47:50 -07:00
76894062dc move wholearchive to link option (#43485)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43216

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43485

Reviewed By: glaringlee

Differential Revision: D23318735

Pulled By: malfet

fbshipit-source-id: 90c316d3d5ed51afcff356e6d9219950f119a902
2020-08-25 10:36:10 -07:00
1089ff404c Refactored the duplicate code into a function in _ConvNd (#43525)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43525

Reviewed By: ngimel

Differential Revision: D23306593

Pulled By: jerryzh168

fbshipit-source-id: 3427cd2b9132a203858477b6c858d59b00e1282e
2020-08-25 10:00:07 -07:00
8ecfa9d9a2 [cmake] End support for python3.5 for pytorch (#43105)
Summary:
PyTorch uses f-strings in its Python code.
Python support for f-strings started with version 3.6.
Using Python version 3.5 or older fails the build with the latest release/master.
This patch checks the version of the Python used for the build and mandates that it be 3.6 or higher.
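
A minimal Python sketch of the equivalent guard (the actual check lives in CMake):

```python
import sys

# f-strings require Python 3.6; fail early with a clear message
if sys.version_info < (3, 6):
    raise RuntimeError("PyTorch requires Python 3.6 or newer to build")
```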

Signed-off-by: Parichay Kapoor <kparichay@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43105

Reviewed By: glaringlee

Differential Revision: D23301481

Pulled By: malfet

fbshipit-source-id: e9b4f7bffce7384c8ade3b7d131b10cf58f5e8a0
2020-08-25 09:42:42 -07:00
6a2d7a05c4 fix typo in test_dataloader test_multiprocessing_contexts (#43343)
Summary:
https://github.com/pytorch/pytorch/issues/22990 added a multiprocessing_context argument to DataLoader, but a typo in the test causes the wrong DataLoader class to be used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43343

Reviewed By: glaringlee

Differential Revision: D23299452

Pulled By: malfet

fbshipit-source-id: 9489c48b83bce36f46d350cad902f7ad96e1eec4
2020-08-25 09:36:56 -07:00
b430347a60 Address JIT/Mypy issue with torch._VF (#43454)
Summary:
- `torch._VF` is a hack to work around the lack of support for `torch.functional` in the JIT
- that hack hides `torch._VF` functions from Mypy
- could be worked around by re-introducing a stub file for `torch.functional`, but that's undesirable
- so instead try to make both happy at the same time: the type ignore comments are needed for Mypy, and don't seem to affect the JIT after excluding them from the `get_type_line()` logic

Encountered this issue while trying to make `mypy` run on `torch/functional.py` in gh-43446.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43454

Reviewed By: glaringlee

Differential Revision: D23305579

Pulled By: malfet

fbshipit-source-id: 50e490693c1e53054927b57fd9acc7dca57e88ca
2020-08-25 09:23:54 -07:00
f02753fabb Support AMP in nn.parallel (#43102)
Summary:
Take care of the state of autocast in `parallel_apply`, so there is no need to decorate model implementations.
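
A hedged usage sketch (requires CUDA; `torch.cuda.amp.autocast` is the real API):

```python
import torch
from torch import nn
from torch.cuda.amp import autocast

model = nn.DataParallel(nn.Linear(8, 8).cuda())
with autocast():
    # autocast state now propagates into parallel_apply's worker threads,
    # so the wrapped module needs no autocast decoration of its own
    out = model(torch.randn(4, 8, device="cuda"))
```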

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43102

Reviewed By: ngimel

Differential Revision: D23294610

Pulled By: mrshenli

fbshipit-source-id: 0fbe0c79de976c88cadf2ceb3f2de99d9342d762
2020-08-25 08:38:49 -07:00
cbdaa20c88 [serialize] Expose zip file alignment calculation functions (#43531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43531

It's useful for building some tooling out of tree to manipulate zip files in a PyTorch-y way

Test Plan: contbuild

Reviewed By: houseroad

Differential Revision: D23277361

fbshipit-source-id: e15fad20e792d1e41018d32fd48295cfe74bea8c
2020-08-25 02:32:58 -07:00
d1d32003bb force pytorch tensors to contiguous before calling c2 ops
Summary: Per title; this makes the c2 wrappers safer, as contiguity of torch inputs is not guaranteed

Test Plan: covered by existing tests

Reviewed By: dzhulgakov

Differential Revision: D23310137

fbshipit-source-id: 3fe12abc7e394b8762098d032200778018e5b591
2020-08-24 23:04:13 -07:00
675f3f0482 Fix "save binary size" steps (#43529)
Summary:
The `pip3` alias might not be available, so call `python3 -mpip` to be on the safe side.
Should fix failures like this one:
https://app.circleci.com/pipelines/github/pytorch/pytorch/203448/workflows/3837b2d6-b089-4a19-b797-38bdf989c82e/jobs/6913032/parallel-runs/0/steps/0-109

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43529

Reviewed By: seemethere

Differential Revision: D23307306

Pulled By: malfet

fbshipit-source-id: b55e6782b29f1a1f56787902cbb85b3c3d20370c
2020-08-24 19:25:33 -07:00
f80b695a75 Properly format db.h and db.cc (#43027)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43027

Format db.h and db.cc using the default formatter.

This change was split off of D22705434.

Test Plan: Wait for sandcastle.

Reviewed By: rohithmenon, marksantaniello

Differential Revision: D23113765

fbshipit-source-id: 3f02d55bfb055bda0fcba5122336fa001562d42e
2020-08-24 18:29:45 -07:00
7b243a4d46 [quant][graphmode][fx][test][refactor] Refactor tests for graph mode quantization on fx (#43445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43445

changed the interface for checkGraphModule to make the arguments more explicit
as requested in https://github.com/pytorch/pytorch/pull/43437

Test Plan:
TestQuantizeFx

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23280586

fbshipit-source-id: 5b5859e326d149a5aacb1d15cbeee69667cc9109
2020-08-24 17:58:55 -07:00
87905b5856 [pytorch] add option to include autograd for code analyzer (#43155)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43155

Update the code_analyzer build.sh script to be able to take additional build flags in the mobile build/analysis

Test Plan:
Checkout associated PR or copy contents of build.sh into PyTorch repo (must be run from root of PyTorch repo)

To run with inclusion of autograd dependencies (note BUILD_MOBILE_AUTOGRAD is still an experimental build flag): `ANALYZE_TORCH=1 DEPLOY=1 BASE_OPS_FILE=/path/to/baseopsfile MOBILE_BUILD_FLAGS='-DBUILD_MOBILE_AUTOGRAD=ON' tools/code_analyzer/build.sh`

Reviewed By: ljk53

Differential Revision: D23065754

fbshipit-source-id: d83a7ad62ad366a84725430ed020adf4d56687bd
2020-08-24 15:04:43 -07:00
284ff04792 [quant] Support set API for EmbeddingBag quantization (#43433)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43433

Add support for torch.quint8 dtype

Test Plan: Imported from OSS

Reviewed By: radkris-git

Differential Revision: D23277002

fbshipit-source-id: 4204bc62f124b4fd481aaa6aa47b9437978c43ee
2020-08-24 14:33:35 -07:00
e37f871e87 [2/3][lite interpreter] add metadata when saving and loading models for mobile
Summary:
1. add `metadata.pkl` to the `.bc` file, which includes the model info that we are interested in
2. load `metadata.pkl` as an attribute `unordered_map<string, string>` in the module

Test Plan:
- CI
```buck build //xplat/caffe2:jit_module_saving
```
```buck build //xplat/caffe2:torch_mobile_core
```

Reviewed By: xcheng16

Differential Revision: D23047144

fbshipit-source-id: f368d00f7baef2d3d15f89473cdb146467aa1e0b
2020-08-24 13:40:52 -07:00
ed8b08a3ba Update quantize_jit to handle new upsample overloads (#43407)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43407

ghstack-source-id: 110404846

Test Plan:
test_general_value_ops passes with D21209991 applied.
(Without this diff D21209991 breaks that test.)

Reviewed By: jerryzh168

Differential Revision: D23256503

fbshipit-source-id: 0f75e50a9f7fccb5b4325604319a5f76b42dfe5e
2020-08-24 13:33:47 -07:00
e08e93f946 Reland of benchmark code (#43428)
Summary:
Reland of the benchmark code that broke the slow tests because the GPUs were running out of memory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43428

Reviewed By: ngimel

Differential Revision: D23296136

Pulled By: albanD

fbshipit-source-id: 0002ae23dc82f401604e33d0905d6b9eedebc851
2020-08-24 13:27:26 -07:00
4cfac34075 [ROCm] allow .jenkins/pytorch/test.sh to run on centos (#42197)
Summary:
This doesn't fix any reported issue. We validate ROCm PyTorch on Ubuntu and CentOS. For CentOS, we must modify the test.sh script to let it run there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42197

Reviewed By: ezyang, ngimel

Differential Revision: D23175669

Pulled By: malfet

fbshipit-source-id: 0da435de6fb17d2ca48e924bec90ef61ebbb5042
2020-08-24 13:12:49 -07:00
35a36c1280 Implement JIT Enum type serialization and deserialization (#43460)
Summary:
[Re-review tips: nothing changed other than a type in python_ir.cpp to fix a windows build failure]

* Adds code printing for enum type
* Enhances enum type to include all contained enum names and values
* Adds code parsing for enum type in deserialization
* Enabled serialization/deserialization tests in most TestCases. (With a few dangling issues to be addressed in later PRs to keep this PR from growing too large)
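
A hedged sketch of the round-trip this stack builds toward; `Color`, `pick`, and the file name are illustrative, and API details may differ in this era:

```python
import torch
from enum import Enum

class Color(Enum):       # Color/pick/"pick.pt" are illustrative names
    RED = 1
    GREEN = 2

@torch.jit.script
def pick(c: Color) -> int:
    if c == Color.RED:
        return 1
    return 2

torch.jit.save(pick, "pick.pt")      # printing serializes the enum type
loaded = torch.jit.load("pick.pt")   # parsing restores it
print(loaded(Color.GREEN))           # 2
```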

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43460

Reviewed By: albanD

Differential Revision: D23284929

Pulled By: gmagogsfm

fbshipit-source-id: e3e81d6106f18b7337ac3ff5cd1eeaff854904f3
2020-08-24 12:04:31 -07:00
0fa99d50bc Enable torch.cuda.memory typechecking (#43444)
Summary:
Add a number of function prototypes defined in torch/csrc/cuda/Module.cpp to `__init__.pyi.in`

Fixes https://github.com/pytorch/pytorch/issues/43442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43444

Reviewed By: ezyang

Differential Revision: D23280221

Pulled By: malfet

fbshipit-source-id: 7d67dff7b24c8d7b7e72c919e6e7b847f242ef83
2020-08-24 11:46:04 -07:00
7024ce8a2c [quant] Add benchmarks for quantized embeddingbag module (#43296)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43296

Use common config for float and quantized embedding_bag modules

Test Plan:
```
python -m pt.qembeddingbag_test

 Benchmarking PyTorch: qEmbeddingBag
 Mode: Eager
 Name: qEmbeddingBag_embeddingbags10_dim4_modesum_input_size8_offset0_sparseTrue_include_last_offsetTrue_cpu
 Input: embeddingbags: 10, dim: 4, mode: sum, input_size: 8, offset: 0, sparse: True, include_last_offset: True, device: cpu
Forward Execution Time (us) : 35.738

 Benchmarking PyTorch: qEmbeddingBag
 Mode: Eager
 Name: qEmbeddingBag_embeddingbags10_dim4_modesum_input_size8_offset0_sparseTrue_include_last_offsetFalse_cpu
 Input: embeddingbags: 10, dim: 4, mode: sum, input_size: 8, offset: 0, sparse: True, include_last_offset: False, device: cpu
Forward Execution Time (us) : 62.708

python -m pt.embeddingbag_test

 Benchmarking PyTorch: embeddingbag
 Mode: Eager
 Name: embeddingbag_embeddingbags10_dim4_modesum_input_size8_offset0_sparseTrue_include_last_offsetTrue_cpu
 Input: embeddingbags: 10, dim: 4, mode: sum, input_size: 8, offset: 0, sparse: True, include_last_offset: True, device: cpu
Forward Execution Time (us) : 46.878

 Benchmarking PyTorch: embeddingbag
 Mode: Eager
 Name: embeddingbag_embeddingbags10_dim4_modesum_input_size8_offset0_sparseTrue_include_last_offsetFalse_cpu
 Input: embeddingbags: 10, dim: 4, mode: sum, input_size: 8, offset: 0, sparse: True, include_last_offset: False, device: cpu
Forward Execution Time (us) : 103.904

```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23245531

fbshipit-source-id: 81b44fde522238d3eef469434e93dd7f94b528a8
2020-08-24 09:51:03 -07:00
7cc1efec13 Add lite SequentialSampler to torch mobile (#43299)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43299

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23228415

Pulled By: ann-ss

fbshipit-source-id: eebe54353a128783f039c7dac0e2dd765a61940d
2020-08-24 09:45:24 -07:00
c972e6232a Implement batching rules for basic arithmetic ops (#43362)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43362

Batching rules implemented for: addition, subtraction, division, and
multiplication.

I refactored the original `mul_batching_rule` into a templated function
so that one can insert arbitrary binary operations into it.

add, sub, rsub, mul, and div all work the same way. However, other
binary operations work slightly differently (I'm still figuring out the
differences and why they're different) so those may need a different
implementation.

Test Plan: - "pytest test/test_vmap.py -v": new tests

Reviewed By: ezyang

Differential Revision: D23252317

Pulled By: zou3519

fbshipit-source-id: 6d36cd837a006a2fd31474469323463c1bd797fc
2020-08-24 08:43:36 -07:00
db78c07ced Enable torch.cuda.nvtx typechecking (#43443)
Summary:
Add pyi file covering torch._C.nvtx submodule

Fixes https://github.com/pytorch/pytorch/issues/43436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43443

Reviewed By: ezyang

Differential Revision: D23280188

Pulled By: malfet

fbshipit-source-id: 882860cce9feb0b5307c8b7c887f4a2f2c1548a2
2020-08-24 08:20:12 -07:00
2f9c9796f1 [Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D23290730

fbshipit-source-id: ee3ffbd6f9c0fade4586d8f4f8c8dd3d310d1f33
2020-08-24 05:36:38 -07:00
c4e841654d Add alias torch.negative to torch.neg. (#43400)
Summary:
xref https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43400

Reviewed By: albanD

Differential Revision: D23266011

Pulled By: mruberry

fbshipit-source-id: ca20b30d99206a255cf26438b09c3ca1f99445c6
2020-08-24 01:15:04 -07:00
1f0cfbaaad [fx] add type annotations (#43083)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43083

This adds type annotations to all classes, arguments, and returns
for fx. This should make it easier to understand the code, and
encourage users of the library to also write typed code.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23145853

Pulled By: zdevito

fbshipit-source-id: 648d91df3f9620578c1c51408003cd5152e34514
2020-08-23 15:38:33 -07:00
b349f58c21 [fx] enabling typechecking of fx files (#43082)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43082

Fixes all present errors in mypy. Does not try to add annotations everywhere.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23145854

Pulled By: zdevito

fbshipit-source-id: 18e483ed605e89ed8125971e84da1a83128765b7
2020-08-23 15:37:29 -07:00
a97ca93c0e remove prim::profile and special-casing (#43160)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43160

Reviewed By: ZolotukhinM

Differential Revision: D23284421

Pulled By: Krovatkin

fbshipit-source-id: 35e97aad299509a682ae7e95d7cef53301625309
2020-08-22 23:52:36 -07:00
d70b263e3a [DPER3] Separate user embeddings and ad embeddings in blob reorder
Summary:
Separate user embeddings and ad embeddings in blobsOrder. New order:
1. meta_net_def
2. preload_blobs
3. user_embeddings (embeddings in remote request only net)
4. ad_embeddings (embeddings in remote other net)

Add a field requestOnlyEmbeddings in meta_net_def to record user_embeddings.

This is for flash verification.

Test Plan:
buck test dper3/dper3_backend/delivery/tests:blob_reorder_test

Run a flow with canary package f211282476
Check the net: n326826, request_only_embeddings are recorded as expected

Reviewed By: ipiszy

Differential Revision: D23008305

fbshipit-source-id: 9360ba3d078f205832821005e8f151b8314f0cf2
2020-08-22 23:40:04 -07:00
4dc8f3be8c Creates test_tensor_creation_ops.py test suite (#43104)
Summary:
As part of our continued refactoring of test_torch.py, this takes tests for tensor creation ops like torch.eye, torch.randint, and torch.ones_like and puts them in test_tensor_creation_ops.py. There are three test classes in the new test suite: TestTensorCreation, TestRandomTensorCreation, TestLikeTensorCreation. TestViewOps and tests for construction of tensors from NumPy arrays have been left in test_torch.py. These might be refactored separately into test_view_ops.py and test_numpy_interop.py in the future.

Most of the tests ported from test_torch.py were left as is or received a signature change to make them nominally "device generic." Future work will need to review test coverage and update the tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43104

Reviewed By: ngimel

Differential Revision: D23280358

Pulled By: mruberry

fbshipit-source-id: 469325dd1a734509dd478cc7fe0413e276ffb192
2020-08-22 23:18:54 -07:00
35351ff409 Fix ToC Link (#43427)
Summary:
CC ezyang - no code here

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43427

Reviewed By: albanD

Differential Revision: D23273866

Pulled By: mrshenli

fbshipit-source-id: ca07d286410f367cc78549828e517510a86d63ec
2020-08-22 19:51:24 -07:00
e4af45f3aa Fix bugs in vec256_float_neon.h (#43321)
Summary:
Fixing NEON vector conversion problems.

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43321

Reviewed By: pbelevich

Differential Revision: D23241536

Pulled By: kimishpatel

fbshipit-source-id: 37a4e10989c9342ae5e8c78f6875b7aad785dd76
2020-08-22 17:27:18 -07:00
b003f2cc28 Enable input pointer caching in XNNPACK integration. (#42840)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42840

By caching input/output pointers and input parameters, we enable the use
of the caching allocator and check whether we get the same input/output pointers.
If so, we skip the setup steps.

Test Plan:
python test/test_xnnpack_integration.py

Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D23044585

fbshipit-source-id: ac676cff77f264d8ccfd792d1a540c76816d5359
2020-08-22 16:50:17 -07:00
b52e6d00f9 Change quantizer to account for input tensor's memory format. (#42178)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42178

This otherwise introduces unnecessary calls to `contiguous` in the rest of
the network, where certain ops want channels-last format.

Test Plan:
Quantization tests.

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D22796479

fbshipit-source-id: f1ada1c2eeed84991b9b195120699b943ef6e421
2020-08-22 16:48:50 -07:00
b1d31428e7 Reduce number of prim::profile (#43147)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43147

Reviewed By: colesbury

Differential Revision: D23190137

Pulled By: Krovatkin

fbshipit-source-id: bf5f29a76e5ebfb5b9d3b6adee424e213c25891b
2020-08-22 16:06:30 -07:00
8efa898349 [ONNX] Export split_to_sequence as slice when output number is static (#42744)
Summary:
Optimize the exported graph to emit slice nodes for aten::split when the number of split outputs is fixed. Previously, in some cases these were exported as onnx::SplitToSequence, which is dynamic in tensor output count.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42744

Reviewed By: houseroad

Differential Revision: D23172465

Pulled By: bzinodev

fbshipit-source-id: 11e432b4ac1351f17e48356c16dc46f877fdf7da
2020-08-22 09:11:25 -07:00
ec9e6e07bc [quant][graphmode][fx] Add support for general value ops (#43439)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43439

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23278585

fbshipit-source-id: ad29f39482cf4909068ce29555470ef430ea17f6
2020-08-22 08:52:28 -07:00
47e1b7a8f1 Set CONSTEXPR_EXCEPT_WIN_CUDA as const while it is not constexpr (#43380)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42467

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43380

Reviewed By: albanD

Differential Revision: D23278930

Pulled By: pbelevich

fbshipit-source-id: 6ce0bc9fd73cd0ead46c414fdea5f6fb7e9fec3e
2020-08-22 03:25:37 -07:00
d94b10a832 Revert D23223281: Add Enum TorchScript serialization and deserialization support
Test Plan: revert-hammer

Differential Revision:
D23223281 (f269fb83c1)

Original commit changeset: 716d1866b777

fbshipit-source-id: da1ad8387b7d7aad9ff69e1ebeb5cd0b9394c2df
2020-08-22 02:38:12 -07:00
915fd1c8fc centralize autograd dispatch key set (#43387)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43387

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D23258687

Pulled By: bhosmer

fbshipit-source-id: 3718f74fc7324db027f87eda0b90893a960aa56e
2020-08-22 00:46:02 -07:00
88b564ce39 [quant][graphmode][fx] Add support for general shape ops (#43438)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43438

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23278583

fbshipit-source-id: 34b73390d47c7ce60528444da77c4096432ea2cb
2020-08-21 23:07:20 -07:00
192c4b0050 [quant][graphmode][fx] Add support for clamp (#43437)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43437

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23278584

fbshipit-source-id: 266dc68c9ca30d9160a1dacf28dc7781b3d472c2
2020-08-21 20:21:50 -07:00
40c77f926c Add prim::TypeCheck operation (#43026)
Summary:
TypeCheck is a new operation to check the shapes of tensors against
 expected shapes. TypeCheck is a variadic operation. An example:

 %t0 : Tensor = ...
 %t1 : Tensor = ...
 %2 : FLOAT(20, 20), %3 : FLOAT(30, 30), %1 : bool =
 prim::TypeCheck(%t0, %t1)
 prim::If(%1)

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43026

Reviewed By: ZolotukhinM

Differential Revision: D23115830

Pulled By: bzinodev

fbshipit-source-id: fbf142126002173d2d865cf4b932dea3864466b4
2020-08-21 20:03:24 -07:00
98307a2821 Fix bfloat16 erfinv get incorrect value problem for cpu path (#43399)
Summary:
Fix https://github.com/pytorch/pytorch/issues/43344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43399

Reviewed By: albanD

Differential Revision: D23264789

Pulled By: pbelevich

fbshipit-source-id: 8b77c0f6ca44346e44599844fb1e172fdbd9df6c
2020-08-21 19:59:37 -07:00
5e04bb2c1c caffe2: expose CPUContext RandSeed for backwards compatibility with external RNG (#43239)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43239

This is an incremental step as part of the process to migrate caffe2 random number generator off of std::mt19937 and to instead use at::mt19937+at::CPUGeneratorImpl. The ATen variants are much more performant (10x faster).

This adds a way to get the CPUContext RandSeed for tail use cases that require a std::mt19937 and borrow the CPUContext one.

Test Plan: This isn't used anywhere within the caffe2 codebase. Compile should be sufficient.

Reviewed By: dzhulgakov

Differential Revision: D23203280

fbshipit-source-id: 595c1cb447290604ee3ef61d5b5fc079b61a4e14
2020-08-21 19:36:38 -07:00
fb12992b5d Call qnnpack's conv setup only if input pointer has changed. (#42008)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42008

With the caching allocator we have increased the likelihood of getting the
same input pointer. Given that, we can cache the qnnpack operator and input
pointer and check whether the input pointer is the same. If so, we can skip
the setup step.

Test Plan:
Ran one of the quantized models to observe
1. No pagefaults due to indirection buffer reallocation.
2. Much less time spent in indirection buffer population.

Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22726973

fbshipit-source-id: 2dd2a6a6ecf1b5cfa7dde65e384b36a6eab052d7
2020-08-21 19:10:40 -07:00
04aa42a073 Refactor qconv to reduce allocations. (#42007)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42007

The zero buffer and indirection pointers were allocated on every iteration.
With this refactor we create the op once for the qnnpack conv struct and keep
repopulating the indirection pointer as necessary.

For deconv, much of the op creation was moved outside so that we can avoid
creating and destroying ops every time.

Test Plan:
CI quantization tests.
deconvolution-test

Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22726972

fbshipit-source-id: 07c03a4e90b397c36aae537ef7c0b7d81d4adc1a
2020-08-21 19:10:37 -07:00
2a08566b8f Simple caching allocator for CPU. (#42006)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42006

This PR introduces a simple CPU caching allocator. It is specifically
intended for mobile use cases and for inference. Nothing in the
implementation prevents it from serving other use cases;
however, its simplicity may not be suitable everywhere.
It simply tracks allocations by size and relies on deterministic,
repeatable behavior where allocations of the same sizes are made on every
inference.
Thus, after the first allocation, when the pointer is returned to the
allocator, instead of handing it back to the system, the allocator caches it
for subsequent use.
Memory is freed automatically at the end of the process, or it can be
explicitly freed. This is enabled at the moment in DefaultMobileCPUAllocator only.
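
A toy Python sketch of the size-keyed caching idea (not the C++ allocator API):

```python
from collections import defaultdict

class SimpleCachingAllocator:
    """Caches freed blocks by size and reuses them on same-size requests."""

    def __init__(self):
        self._free = defaultdict(list)        # size -> cached blocks

    def allocate(self, size):
        if self._free[size]:
            return self._free[size].pop()     # cache hit: reuse
        return bytearray(size)                # cache miss: ask the system

    def free(self, block):
        self._free[len(block)].append(block)  # keep it instead of releasing

alloc = SimpleCachingAllocator()
a = alloc.allocate(1024)
alloc.free(a)
assert alloc.allocate(1024) is a   # the next inference reuses the block
```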

Test Plan:
android test: cpu_caching_allocator_test

Imported from OSS

Reviewed By: dreiss

Differential Revision: D22726976

fbshipit-source-id: 9a38b1ce34059d5653040a1c3d035bfc97609e6c
2020-08-21 19:09:22 -07:00
abe878ce96 Allow Freezing of Module containing interface attribute (#41860)
Summary:
This patch allows freezing a model that utilizes interfaces. Freezing works
under the user's assumption that the interface module does not alias
any value used in the model.

To enable freezing of such modules, an extra parameter was added:

torch._C._freeze_module(module, ignoreInterfaces = True)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41860

Reviewed By: eellison

Differential Revision: D22670566

Pulled By: bzinodev

fbshipit-source-id: 41197a724bc2dca2e8495a0924c224dc569f62a4
2020-08-21 18:57:13 -07:00
490d41aaa6 [quant][graphmode][fx] Add support for instance_norm (#43377)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43377

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23257045

fbshipit-source-id: 7f4ad5d81f21bf0b8b9d960b054b20dc889e6c3b
2020-08-21 18:32:50 -07:00
a5a6a3e633 add support for optional int list with scalar fill (#43262)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43262

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23212049

Pulled By: bhosmer

fbshipit-source-id: c7ceb2318645c07d36c3f932c981c9ee3c414f82
2020-08-21 18:24:36 -07:00
f269fb83c1 Add Enum TorchScript serialization and deserialization support (#42963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42963

* Adds code printing for enum type
* Enhances enum type to include all contained enum names and values
* Adds code parsing for enum type in deserialization
* Enabled serialization/deserialization tests in most TestCases. (With a few dangling issues to be addressed in later PRs to keep this PR from growing too large)

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D23223281

Pulled By: gmagogsfm

fbshipit-source-id: 716d1866b7770dfb7bd8515548cfe7dc4c4585f7
2020-08-21 18:13:27 -07:00
aa53b2d427 Workaround bugs in user side embedding meta info and better msgs (#43355)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43355

There seem to be some bugs where we cannot guarantee that blobs in `PARAMETERS_BLOB_TYPE_FULLY_REMOTE_REQUEST_ONLY` and `PARAMETERS_BLOB_TYPE_DISAGG_ACC_REMOTE_OTHER` are disjoint. Hence we need to work around this.

Also makes the msg more informative.

Test Plan:
```
flow-cli test-locally --mode opt dper.workflows.evaluation.eval_workflow --parameters-file=/mnt/shared/yinghai/v0_ctr_mbl_feed_1120_onnx.json
```

Reviewed By: ehsanardestani

Differential Revision: D23141538

fbshipit-source-id: 8e311f8fc0e40eff6eb2c778213f78592e6bf079
2020-08-21 17:18:51 -07:00
aec917a408 [quant][graphmode][fx] Add support for layer_norm (#43376)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43376

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23257048

fbshipit-source-id: 47a04a5221bcaf930d574f879d515e3dff2d1f6d
2020-08-21 16:38:16 -07:00
089bb1a8e4 [quant][graphmode][fx] Add support for elu (#43375)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43375

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23257043

fbshipit-source-id: 22360610d87ef98d25871daff3fdc3dbb3ec5bdb
2020-08-21 16:07:36 -07:00
5a02c6b158 [quant][graphmode][fx] Add support for hardswish (#43374)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43374

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23257044

fbshipit-source-id: 2cdf12e104db6e51ffa0324eb602e68132a646ef
2020-08-21 16:06:32 -07:00
93f1b5c8da Mobile backward compatibility (#42413)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42413

When a default argument is added, it does not break backward compatibility (BC) for full-jit, but does break BC for mobile bytecode. For example, https://github.com/pytorch/pytorch/pull/40737. To make bytecode BC in this case, we

1. Introduce kMinSupportedBytecodeVersion. The loaded model version should be between kMinSupportedBytecodeVersion and kProducedBytecodeVersion.
2. If an operator is updated, and we can handle BC, bump the kProducedBytecodeVersion (for example, from 3 to 4).
3. If model version is at the older version of the operator, add an adapter function at loading. For the added default arg, we push this default arg to stack before calling the actual operator function.
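
A hedged sketch of the version gate described in points 1-3; the constant names come from the summary, while their values and the loader logic are illustrative:

```python
kMinSupportedBytecodeVersion = 3   # illustrative values
kProducedBytecodeVersion = 4

def check_bytecode_version(model_version: int) -> None:
    # models between the two bounds load; anything else is rejected
    if not (kMinSupportedBytecodeVersion <= model_version <= kProducedBytecodeVersion):
        raise RuntimeError(f"Unsupported bytecode version: {model_version}")

check_bytecode_version(3)   # ok: an older model served via the adapter path
check_bytecode_version(4)   # ok: a model at the produced version
```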

Test Plan: Imported from OSS

Reviewed By: xcheng16

Differential Revision: D22898314

Pulled By: iseeyuan

fbshipit-source-id: 90d339f8e1365f4bb178db8db7c147390173372b
2020-08-21 15:45:52 -07:00
e96871ea46 [quant][graphmode][fx] Add support for mul and mul relu (#43373)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43373

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23257047

fbshipit-source-id: b7f9fcef965d6368018e05cff09260f0eb6f3b50
2020-08-21 15:31:00 -07:00
6c772515ed Revert D23252335: Refactor Vulkan context into its own files. Use RAII.
Test Plan: revert-hammer

Differential Revision:
D23252335 (054073c60d)

Original commit changeset: 43144446f2f3

fbshipit-source-id: 442b914f47a82efee18cfd84aab893e22d1defdd
2020-08-21 15:10:06 -07:00
8eb3de76ba Fix enum constant printing and add FileCheck to all Enum tests (#42874)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42874

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D23222894

Pulled By: gmagogsfm

fbshipit-source-id: 86495a350d388c82276933d24a2ca3c0f59af8da
2020-08-21 14:55:46 -07:00
ff454cc429 [quant][grapphmode][fx][test][refactor] Refactor quantized add test (#43372)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43372

So that adding more binary op tests are easier

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23257046

fbshipit-source-id: 661acd4c38abdc892c9db8493b569226b13e0d0d
2020-08-21 14:53:23 -07:00
109ea59afc [quant][graphmode][fx] Add support for batchnorm relu (#43335)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43335

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23243563

fbshipit-source-id: 3c562f519b90e0157761a00c89eca63af8b909f2
2020-08-21 14:32:51 -07:00
9e87a8ddf4 [quant][graphmode][fx] Add support for batchnorm (#43334)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43334

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23243560

fbshipit-source-id: 0a7bc331293bbc3db85616bf43a995d3b112beb6
2020-08-21 14:31:49 -07:00
054073c60d Refactor Vulkan context into its own files. Use RAII. (#42273)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42273

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252335

Pulled By: AshkanAliabadi

fbshipit-source-id: 43144446f2f3530e6cb2a85706a9afc60771347d
2020-08-21 14:28:38 -07:00
3d76f7065e [quant][graphmode][fx] Add support for cat (#43333)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43333

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23243562

fbshipit-source-id: 5c8eab2af592a9ea4afa713fb884e34e0ffd82b1
2020-08-21 12:54:50 -07:00
26be4dcfa1 [quant][graphmode][fx] Add support for add relu (#43332)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43332

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23243564

fbshipit-source-id: 3cd1786c6356aaa234d31b50f12ad6ddc38d5664
2020-08-21 12:54:41 -07:00
452a473729 [quant][graphmode][fx] Add support for add (#43331)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43331

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23243561

fbshipit-source-id: 5a6399d25cc881728cf298c77570ce2aaf3ca22e
2020-08-21 12:52:37 -07:00
6e48c88e09 .circleci: Prefer using env-file for docker run (#43293)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43293

'docker run' has the capability to use a file for environment variables,
we should prefer to use that instead of having it be sourced per command
in the docker container.

Also opens the door for cutting down on the total number of commands we
need to echo into a script to then execute as a 'docker exec' command.

The plus side of this approach is that BASH_ENV is persisted through all
of the steps, so there's no need to do any exports or worry about
environment variables not persisting through jobs.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23227059

Pulled By: seemethere

fbshipit-source-id: be425aa21b420b9c6e96df8b2177f508ee641a20
2020-08-21 12:48:35 -07:00
100649d6a9 Normalize loops with non-zero start. (#43179)
Summary:
This diff normalizes for-loops that have non-zero loop starts so that they always start from 0. Given a for-loop, this normalization changes the loop start to be 0 and adjusts the loop end and all accesses to the index variable within the loop body appropriately.
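
In Python terms, the transformation rewrites a loop with a non-zero start into one starting at 0 (a sketch of the idea, not the TensorExpr API):

```python
def body(i):
    print(i)

start, stop = 5, 10

# before normalization: the loop starts at `start`
for i in range(start, stop):
    body(i)

# after normalization: the loop starts at 0 and every use of the index
# inside the body is shifted by `start`
for j in range(0, stop - start):
    body(j + start)
```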

This diff also adds tests for several cases of normalization and also tests normalization in conjunction with `splitwithTail` transformation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43179

Reviewed By: nickgg

Differential Revision: D23220534

Pulled By: navahgar

fbshipit-source-id: 64be0c72e4dbc76906084f7089dea81ae07d6020
2020-08-21 12:37:27 -07:00
74781ab5b8 Revert D23242101: [pytorch][PR] Implement first draft of autograd benchmark.
Test Plan: revert-hammer

Differential Revision:
D23242101 (c2511bdfa4)

Original commit changeset: a2b92d5a4341

fbshipit-source-id: bda562d15565f074b448022d180ec8f959c6ecc9
2020-08-21 12:22:57 -07:00
650590da0d [quant][graphmode][fx] Add support for conv module + relu (#43287)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43287

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23221735

fbshipit-source-id: 2513892a1928f92c09d7e9a24b2ea12b00de218d
2020-08-21 12:13:02 -07:00
3293fdfa80 [quant] Enable from_float for quantized Embedding_Bag (#43176)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43176

Convert floating point nn.EmbeddingBag module to
nn.quantized.dynamic.EmbeddingBag module

Test Plan:
python test/test_quantization.py TestDynamicQuantizedModule.test_embedding_bag_api
python test/test_quantization.py TestPostTrainingDynamic.test_embedding_quantization

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23200196

fbshipit-source-id: 090f47dbf7aceab9c719cbf282fad20fe3e5a983
2020-08-21 11:46:03 -07:00
b354b422ee [quant] Make offsets an optional argument (#43090)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43090

To match the floating point module

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23167518

fbshipit-source-id: 29db596e10731be4cfed7efd18f33a0b3dbd0ca7
2020-08-21 11:46:00 -07:00
4db8ca1129 [quant] Create nn.quantized.dynamic.EmbeddingBag (#43088)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43088

Create a quantized module that the user can use to perform embedding bag quantization.
The module uses EmbeddingPackedParams to store the weights, which can be serialized/deserialized
using TorchBind custom classes (C++ get/setstate code).
A following PR will add support for `from_float` to convert a float module to a quantized module.

Test Plan:
python test/test_quantization.py TestDynamicQuantizedModule.test_embedding_bag_api

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23167519

fbshipit-source-id: 029d7bb44debf78c4ef08bfebf267580ed94d033
2020-08-21 11:45:02 -07:00
f20a04fa2d [TensorExpr] Simplify conditional select (#43350)
Summary:
Fold conditional select when both sides are constant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43350

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.ConditionalSelectFold*

Reviewed By: pbelevich

Differential Revision: D23256602

Pulled By: asuhan

fbshipit-source-id: ec04b1e4ae64f59fa574047f2d7af55a717a5262
2020-08-21 11:15:48 -07:00
743cff4a1a Fix PackedGemmMatrixFP16 repacking (#43320)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43320

Previous impl seem to be buggy although I don't why. New impl is copied from https://fburl.com/diffusion/cing6mxv

Reviewed By: jianyuh

Differential Revision: D23235964

fbshipit-source-id: 780b6e388ef895232e3ba34b125c2492b1cee60c
2020-08-21 10:58:18 -07:00
e57b89c8dc Adds arccos, arcsin, arctan aliases (#43319)
Summary:
These aliases are consistent with NumPy (see, for example, https://numpy.org/doc/stable/reference/generated/numpy.arccos.html?highlight=acos).

Note that PyTorch's existing names are consistent with Python (see https://docs.python.org/3.10/library/math.html?highlight=acos#math.acos) and C++ (see, for example, https://en.cppreference.com/w/cpp/numeric/math/acos).
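
A quick sanity check of the aliasing (names per the NumPy convention cited above):

```python
import torch

x = torch.tensor([0.5])
assert torch.equal(torch.arccos(x), torch.acos(x))   # alias, same result
assert torch.equal(torch.arcsin(x), torch.asin(x))
assert torch.equal(torch.arctan(x), torch.atan(x))
```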

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43319

Reviewed By: pbelevich

Differential Revision: D23260426

Pulled By: mruberry

fbshipit-source-id: 98a6c97f69d1f718a396c2182e938a7a260c0889
2020-08-21 10:53:17 -07:00
3aec1185e0 Enables bfloat16 x [float16, complex64, complex128] type promotion (#43324)
Summary:
Implements bfloat16 type promotion consistent with JAX (see https://jax.readthedocs.io/en/latest/type_promotion.html), addressing issue https://github.com/pytorch/pytorch/issues/43049.

- bfloat16 x float16 -> float32
- bfloat16 x complex64 -> complex64
- bfloat16 x complex128 -> complex128
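
A quick sketch of the rules above:

```python
import torch

bf = torch.ones(2, dtype=torch.bfloat16)
print((bf + torch.ones(2, dtype=torch.float16)).dtype)    # torch.float32
print((bf + torch.ones(2, dtype=torch.complex64)).dtype)  # torch.complex64
print((bf + torch.ones(2, dtype=torch.complex128)).dtype) # torch.complex128
```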

Existing tests, after updates, are sufficient to validate the new behavior.

cc xuhdev

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43324

Reviewed By: albanD

Differential Revision: D23259823

Pulled By: mruberry

fbshipit-source-id: ca9c2c7d0325faced1f884f3c37edf8fa8c8b089
2020-08-21 10:48:04 -07:00
478fb925e6 [jit] PyTorchStreamReader::getAllRecord should omit archive name prefix (#43317)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43317

Previous version was returning the path with a prefix so subsequent `getRecord` would fail.

There's only one place in PyTorch codebase that uses this function (introduced in https://github.com/pytorch/pytorch/pull/29339 ) and it's unlikely that anyone else is using it - it's not a public API anyway.

Test Plan: unittest

Reviewed By: houseroad

Differential Revision: D23235241

fbshipit-source-id: 6f7363e6981623aa96320f5e39c54e65d716240b
2020-08-21 10:39:57 -07:00
0bd35de30e Add Enum convert back to Python object support (#43121)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43121

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D23222628

Pulled By: gmagogsfm

fbshipit-source-id: 6850c56ced5b52943a47f627b2d1963cc9239408
2020-08-21 10:36:51 -07:00
f4b6ef9c56 Do not define the macro "isnan" (#43242)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43242

This causes "std::isnan" to produce confusing error messages (std::std has not been declared).
Instead, simply let isnan be exposed in the global namespace.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23214374

Pulled By: ezyang

fbshipit-source-id: 9615116a980340e36376a20f2e546e4d36839d4b
2020-08-21 10:08:38 -07:00
7b520297dc Remove erroneous trailing backslashes (#43318)
Summary:
They were likely copied from some macro definition, but they do not
belong to macro definitions here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43318

Reviewed By: pbelevich

Differential Revision: D23241526

Pulled By: mrshenli

fbshipit-source-id: e0b5eddfde2c882bb67f56d84ee79281cc5fc941
2020-08-21 08:21:56 -07:00
c2511bdfa4 Implement first draft of autograd benchmark. (#40586)
Summary:
It is quite a lot of code because I pulled some code from torchaudio and torchvision to work around issues getting the latest versions with PyTorch built from source, since I can't build those libs from source (a dependency is missing for torchaudio).

The compare script generates table as follows:
| model | task | speedup | mean (before) | var (before) | mean (after) | var (after) |
| -- | -- | -- | -- | -- | -- | -- |
| resnet18 | vjp | 1.021151844124464 | 1.5627719163894653 | 0.005164200905710459 | 1.5304011106491089 | 0.003979875706136227 |
| resnet18 | vhp | 0.9919114430761606 | 6.8089728355407715 | 0.019538333639502525 | 6.86449670791626 | 0.014775685034692287 |
| resnet18 | jvp | 0.9715963084255123 | 5.720699310302734 | 0.08197150379419327 | 5.887938499450684 | 0.018408503383398056 |
| ppl_simple_reg | vjp | 0.9529183269165618 | 0.000362396240234375 | 7.526952949810095e-10 | 0.00038030146970413625 | 7.726220357939795e-11 |
| ppl_simple_reg | vhp | 0.9317708619586977 | 0.00048058031825348735 | 5.035701855504726e-10 | 0.0005157709238119423 | 3.250243477137538e-11 |
| ppl_simple_reg | jvp | 0.8609755877018406 | 0.00045447348384186625 | 9.646707044286273e-11 | 0.0005278587341308594 | 1.4493808930815533e-10 |
| ppl_simple_reg | hvp | 0.9764100147808232 | 0.0005881547695025802 | 7.618464747949361e-10 | 0.0006023645401000977 | 6.370915461850757e-10 |
| ppl_simple_reg | jacobian | 1.0019173715134297 | 0.0003612995205912739 | 2.2979899233499523e-11 | 0.0003606081008911133 | 1.2609764794835332e-11 |
| ppl_simple_reg | hessian | 1.0358429970264393 | 0.00206911563873291 | 2.590938796842579e-09 | 0.0019975185859948397 | 2.8916853356264482e-09 |
| ppl_robust_reg | vjp | 1.0669910916521521 | 0.0017304659122601151 | 3.1047047155396967e-09 | 0.0016218185191974044 | 4.926861585374809e-09 |
| ppl_robust_reg | vhp | 1.0181130455462972 | 0.0029563189018517733 | 2.6359153082466946e-08 | 0.0029037236236035824 | 1.020585038702393e-08 |
| ppl_robust_reg | jvp | 0.9818360373406179 | 0.0026934861671179533 | 6.981357714153091e-09 | 0.00274331565015018 | 3.589908459389335e-08 |
| ppl_robust_reg | hvp | 1.0270848910527002 | 0.005576515104621649 | 3.2798087801211295e-08 | 0.005429458804428577 | 6.438724398094564e-08 |
| ppl_robust_reg | jacobian | 1.0543611284155785 | 0.00167675013653934 | 2.3236829349571053e-08 | 0.001590299652889371 | 1.2011492245278532e-08 |
| ppl_robust_reg | hessian | 1.0535378727082656 | 0.01643357239663601 | 1.8450685956850066e-06 | 0.015598463825881481 | 2.1876705602608126e-07 |
| wav2letter | vjp | 1.0060408105086573 | 0.3516994118690491 | 1.4463969819189515e-05 | 0.349587619304657 | 9.897866402752697e-05 |
| wav2letter | vhp | 0.9873655295086051 | 1.1196287870407104 | 0.00474404776468873 | 1.133955717086792 | 0.009759620763361454 |
| wav2letter | jvp | 0.9741820317882822 | 0.7888165712356567 | 0.0017476462526246905 | 0.8097219467163086 | 0.0018235758179798722 |
| transfo | vjp | 0.9883954031921641 | 2.8865864276885986 | 0.008410997688770294 | 2.9204773902893066 | 0.006901870481669903 |
| transfo | vhp | 1.0111290842971339 | 8.374398231506348 | 0.014904373325407505 | 8.282224655151367 | 0.04449500888586044 |
| transfo | jvp | 1.0080534543381963 | 6.293097972869873 | 0.03796082362532616 | 6.24282169342041 | 0.010179692879319191 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40586

Reviewed By: pbelevich

Differential Revision: D23242101

Pulled By: albanD

fbshipit-source-id: a2b92d5a4341fe1472711a685ca425ec257d6384
2020-08-21 07:36:26 -07:00
0cb52cb458 Autograd better error (#43308)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/5025

Thanks for the conversation in the issue thread. Hopefully this fixes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43308

Reviewed By: ezyang

Differential Revision: D23241918

Pulled By: suraj813

fbshipit-source-id: e1efac13f5ce590196f227149f011c973c2bbdde
2020-08-21 05:50:33 -07:00
da036250cd Add benchmark for performance comparison (#43221)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43221

Test Plan: Example: https://www.internalfb.com/intern/paste/P139226521/

Reviewed By: kimishpatel

Differential Revision: D23197567

Pulled By: kimishpatel

fbshipit-source-id: 7d0f8e653c62f0bee5795618e712d07effbd460a
2020-08-20 23:11:40 -07:00
da70976e66 [ONNX] Add support for operator add between tensor list (#41888)
Summary:
E.g.
```python
outs = []
outs += [torch.randn(3,4)]
outs = outs + [torch.randn(4,5), torch.randn(5,6)]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41888

Reviewed By: houseroad

Differential Revision: D23172880

Pulled By: bzinodev

fbshipit-source-id: 93865106e3de5908a993e0cfa82f626ba94dab7e
2020-08-20 22:38:23 -07:00
c64594f5cc Extends test_unary_ufunc.py with numerics, contiguity, domain tests (#42965)
Summary:
This PR:

- ports the tests in TestTorchMathOps to test_unary_ufuncs.py
- removes duplicative tests for the tested unary ufuncs from test_torch.py
- adds a new test, test_reference_numerics, that validates the behavior of our unary ufuncs vs. reference implementations on empty, scalar, 1D, and 2D tensors that are contiguous, discontiguous, and that contain extremal values, for every dtype the unary ufunc supports
- adds support for skipping tests by regex; this behavior is used to make the test suite pass on Windows, MacOS, and ROCm builds, which have a variety of issues, and on Linux builds (see https://github.com/pytorch/pytorch/issues/42952)
- adds a new OpInfo helper, `supports_dtype`, to facilitate test writing
- extends unary ufunc op info to include reference, domain, and extremal value handling information
- adds OpInfos for `torch.acos` and `torch.sin`

These improvements reveal that our testing has been incomplete on several systems, especially with larger float values and complex values, and several TODOs have been added for follow-up investigations. Luckily when writing tests that cover many ops we can afford to spend additional time crafting the tests and ensuring coverage.

Follow-up PRs will:

- refactor TestTorchMathOps into test_unary_ufuncs.py
- continue porting tests from test_torch.py to test_unary_ufuncs.py (where appropriate)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42965

Reviewed By: pbelevich

Differential Revision: D23238083

Pulled By: mruberry

fbshipit-source-id: c6be317551453aaebae9d144f4ef472f0b3d08eb
2020-08-20 22:02:00 -07:00
e31cd46278 Add alias torch.fix for torch.trunc to be compatible with NumPy. (#43326)
Summary:
xref https://github.com/pytorch/pytorch/issues/42515
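
A minimal check of the alias:

```python
import torch

t = torch.tensor([-1.7, -0.2, 0.2, 1.7])
assert torch.equal(torch.fix(t), torch.trunc(t))  # tensor([-1., -0., 0., 1.])
```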

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43326

Reviewed By: pbelevich

Differential Revision: D23249089

Pulled By: mruberry

fbshipit-source-id: 6afa9eb20493983d084e0676022c6245e7463e05
2020-08-20 21:47:39 -07:00
17f9edda42 Bias Correction Implementation (#41845)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41845

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D22661503

Pulled By: edmundw314

fbshipit-source-id: a88c349c6cc15b1c66aa6dee7593ef3df588eb85
2020-08-20 21:40:33 -07:00
665da61d2b Replace Conv1d with Conv2d (#42867)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42867

Test Plan: Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D23177916

Pulled By: kimishpatel

fbshipit-source-id: 68cc40cf42d03e5b8432dc08f9933a4409c76e25
2020-08-20 21:36:51 -07:00
e8139624f2 Search on system path for Vulkan headers and libraries as a last resort. (#43301)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43301

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252338

Pulled By: AshkanAliabadi

fbshipit-source-id: 8eefe98eedf9dbeb570565bfb13ab61b1d6bca0e
2020-08-20 21:14:09 -07:00
217ddea93a [quant] Make OP_LIST_TO_FUSER_METHOD public (#43286)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43286

We need to use this in graph mode quantization on fx

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23221734

fbshipit-source-id: 7c3c3840ce5bdc185b962e081aff1618f4c58e85
2020-08-20 20:19:13 -07:00
844d469ae7 Remove proprietary notices
Summary:
These were added accidentally (probably by an IDE) during a refactor.
These files have always been Open Source.

Test Plan: CI

Reviewed By: xcheng16

Differential Revision: D23250761

fbshipit-source-id: 4974430c0e28dd3269424d38edb36f4f71508157
2020-08-20 20:14:59 -07:00
9984d33542 [quant][graphmode][fx] Add support for conv module (#43285)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43285

Porting op tests from test_quantize_jit.py

(Note: this ignores all push blocking failures!)

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23221733

fbshipit-source-id: c1f0f7ae0c82379143aa33fc1af7284d8303174b
2020-08-20 19:53:30 -07:00
7c50c2f79e Reimplement per-operator selective build (#39401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39401

This uses the technique proposed by smessmer in D16451848 to selectively
register operators without codegen.  See the Note inside for more
details.

This PR has feature parity with the old selective build apparatus:
it can whitelist schema def()s, impl()s, and on a per dispatch key
basis.  It has expanded dispatch key whitelisting, whereas previously
manually written registrations were not whitelisted at all.  (This
means we may be dropping dispatch keys where we weren't previously!)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D21905593

Pulled By: ezyang

fbshipit-source-id: d4870f800c66be5ce57ec173c9b6e14a52c4a48b
2020-08-20 19:10:02 -07:00
e32d014f46 remove empty override pretty_print (#43341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43341

This removes the empty pretty_print() override, since it shadows the implementation in the Module base class, which is not the intended behavior here.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D23244616

Pulled By: glaringlee

fbshipit-source-id: 94b8dfd3697dfc450f53b3b4eee6e9c13cafba7b
2020-08-20 18:48:29 -07:00
ad8294d35b [vulkan][ci] Vulkan tests running on linux build via swiftshader (added to docker) (#42614)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42614

Vulkan backend linux build (USE_VULKAN=1), with Vulkan tests run against a software Vulkan implementation via [swiftshader](https://github.com/google/swiftshader)

The Vulkan linux build needs the Vulkan SDK, and running the tests needs SwiftShader.
SwiftShader needs to be compiled with the clang toolchain, so both were added to the bionic-clang-9 docker image.

The Vulkan SDK is downloaded from AWS;
SwiftShader is cloned from GitHub, and since it has many submodules, the commit hash is pinned in the install_swiftshader script.

To pass all the tests:
Disabled adaptive_avg_pool2d_2, as it needs at::view, which will land in https://github.com/pytorch/pytorch/pull/42676; it can be re-enabled after that.

Changed the strides, padding, and dilation params in the tests to vectors.

Docker image rebuild:
https://app.circleci.com/pipelines/github/pytorch/pytorch/200251/workflows/465f911f-f170-47e1-954e-b9605d91abd8/jobs/6700311
Vulkan Linux Build:
https://app.circleci.com/pipelines/github/pytorch/pytorch/200251/workflows/465f911f-f170-47e1-954e-b9605d91abd8/jobs/6701604
Vulkan Linux Test:
https://app.circleci.com/pipelines/github/pytorch/pytorch/200251/workflows/465f911f-f170-47e1-954e-b9605d91abd8/jobs/6703026

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D23174038

Pulled By: IvanKobzarev

fbshipit-source-id: 431c72e31743ca0c0b82a497420f6330a311b35b
2020-08-20 18:40:32 -07:00
5cf8592663 Fix backward compatibility test (#43371)
Summary:
Drop `.out` suffix from allow_list pattern added by https://github.com/pytorch/pytorch/issues/43272

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43371

Reviewed By: pbelevich

Differential Revision: D23256914

Pulled By: malfet

fbshipit-source-id: 10168b55b98c24c84ac2676963049d1eca5c182d
2020-08-20 18:29:10 -07:00
9a1f2b3617 .circleci: Use dynamic docker image for android (#43356)
Summary:
We recently upgraded to a dynamic docker image and this android build
job was missed during that transition

Fixes https://github.com/pytorch/pytorch/issues/43338

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43356

Reviewed By: pbelevich

Differential Revision: D23253175

Pulled By: seemethere

fbshipit-source-id: 4831d4fe554a126e202e788444a63516d34b3d72
2020-08-20 17:42:26 -07:00
e10aa47615 Fix at::native::view_as_real() for ComplexHalf Tensors (#43279)
Summary:
Add a ComplexHalf case to toValueType, which fixes how view_as_real and view_as_complex slice a complex tensor into a floating-point one; this path is used to generate tensors of random complex values, see:
018b4d7abb/aten/src/ATen/native/DistributionTemplates.h (L200)
Also add the ability to convert a python complex object to `c10::complex<at::Half>`

Add `torch.half` and `torch.complex32` to the list of `test_randn` dtypes
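
A sketch of what the fix enables (ComplexHalf support was experimental at the time, so dtype availability may vary by build):

```python
import torch

z = torch.randn(3, dtype=torch.complex64).to(torch.complex32)
r = torch.view_as_real(z)
print(r.dtype, r.shape)  # torch.float16 torch.Size([3, 2])
```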

Fixes https://github.com/pytorch/pytorch/issues/43143

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43279

Reviewed By: mrshenli

Differential Revision: D23230296

Pulled By: malfet

fbshipit-source-id: b4bb66c4c81dd867e72ab7c4563d73f6a4d80a44
2020-08-20 17:38:06 -07:00
b0ec336477 [quant][graphmode][fx][test] Add per op test for graph mode quant on fx (#43229)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43229

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D23201692

fbshipit-source-id: 37fa54dcf0a9d5029f1101e11bfd4ca45b422641
2020-08-20 17:32:02 -07:00
2b7108a96f Update hardcoded pytorch_android_gradle_custom_build_single hash (#43340)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43340

This doesn't fix https://github.com/pytorch/pytorch/issues/43338 but
it gets us a little more up to date.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D23243933

Pulled By: ezyang

fbshipit-source-id: ce2773c55864d1a6f6628ba60bb9ad6aee4aba14
2020-08-20 15:37:43 -07:00
97d594b9f7 Make grad point to bucket buffer in DDP to save memory usage (#41954)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41954
Make both variable.grad() and the grad in the dist autograd context point to the bucket buffer in DDP to save memory.
In this case, grad will be a view of the bucket buffer tensors; to make this compatible with optimizer.zero_grad(), we
made changes in https://github.com/pytorch/pytorch/pull/41283.

Also note that we cannot make variable.grad() point to the bucket buffer at construction time, because we want to
keep grad undefined for unused parameters.
ghstack-source-id: 110260297

Test Plan:
unit tests,

For roberta_base model with ~1GB parameters, peak memory dropped ~1GB (8250MB-7183MB).  Per iteration latency (0.982s ->0.909s), 8% speed up
https://www.internalfb.com/intern/fblearner/details/211713882?tab=operator_details
https://www.internalfb.com/intern/fblearner/details/211772923?tab=operator_details

For resnet model with ~97M parameters, peak memory dropped ~100MB (3089MB -> 2988MB). Per iteration latency has no change (0.122s -> 0.123s)
https://www.internalfb.com/intern/fblearner/details/211713577?tab=operator_details
https://www.internalfb.com/intern/fblearner/details/211712582?tab=operator_details

accuracy benchmark is expected as well
https://www.internalfb.com/intern/fblearner/details/213237067?tab=Outputs

Reviewed By: mrshenli

Differential Revision: D22707857

fbshipit-source-id: b5e767cfb34ccb3d067db2735482a86d59aea7a4
2020-08-20 15:33:44 -07:00
51bab0877d Fix torch.hub for new zipfile format. (#42333)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42239

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42333

Reviewed By: VitalyFedyunin

Differential Revision: D23215210

Pulled By: ailzhang

fbshipit-source-id: 161ead8b457c11655dd2cab5eecfd0edf7ae5c2b
2020-08-20 14:54:02 -07:00
dae2973fae [quant][graphmode][fx] Add graph mode quantization on fx (#43175)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43175

This PR added graph mode quantization on fx: https://github.com/pytorch/pytorch/pull/42741
Currently it matches eager mode quantization for torchvision with static/dynamic/qat
ddp/synbn test is still wip

Test Plan:
python test/test_quantization.py TestQuantizeFx

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23178602

fbshipit-source-id: 8e7e0322846fbda2cfa79ad188abd7235326f879
2020-08-20 14:50:09 -07:00
c89d2c6bf2 Replace black_list with block_list (#42088)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42088

Reviewed By: pbelevich

Differential Revision: D22794582

Pulled By: SplitInfinity

fbshipit-source-id: e256353befefa2630b99f9bcf0b79df3a7a8dcbd
2020-08-20 14:34:02 -07:00
a12fe1a242 Minor RPC doc fixes (#43337)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43337

Test Plan: Imported from OSS

Reviewed By: osalpekar

Differential Revision: D23242698

Pulled By: osalpekar

fbshipit-source-id: 7757fc43824423e3a6efd4da44c69995f64a6015
2020-08-20 14:17:07 -07:00
5006d24302 Make TensorPipe the default backend for RPC (#43246)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43246

Test Plan: Imported from OSS

Reviewed By: osalpekar

Differential Revision: D23206042

Pulled By: osalpekar

fbshipit-source-id: 258481ea9e753cd36c2787183827ca3b81d678e3
2020-08-20 14:17:02 -07:00
d0a6819b0e [ROCm] skip test_rpc in .jenkins/pytorch/test.sh (#43305)
Summary:
https://github.com/pytorch/pytorch/issues/42636 added test_rpc, but this test binary is not built for ROCm.  Skip this test for ROCm builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43305

Reviewed By: pbelevich

Differential Revision: D23233087

Pulled By: mrshenli

fbshipit-source-id: 29cd81e88a543c922a988e09d5f789becf4b74e4
2020-08-20 14:15:27 -07:00
c66ca7a48d vmap: Fix bug with x * 0.1 (#43218)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43218

Previously, `vmap(lambda x: x * 0.1)(torch.ones(3))` would return a
float64 tensor(!!). This is because there is a subtle bug in the
batching rule: the batching rule receives:
- a batched tensor for x
- a scalar tensor: tensor(0.1, dtype=torch.float64)
The batching rule decides to expand the scalar tensor to the same
size as x and then multiplies the two tensors, promoting the output to
a float64 tensor. However, this isn't correct: we should treat the
scalar tensor like a scalar tensor. When adding a FloatTensor to a
double scalar tensor, we don't usually promote the type.

Another example of a bug this PR fixes is the following:
`vmap(torch.mul)(torch.ones(3), torch.ones(3, dtype=torch.float64))`
Multiplying a scalar float tensor with a scalar double tensor produces a
float tensor, but the above produced a float64 tensor before this PR due to
mistakenly type-promoting the tensors.
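
A quick repro sketch of the fixed behavior (assumes a `torch.vmap` entry point; vmap's location has moved between releases, e.g. `torch.func.vmap` in recent builds):

```python
import torch

# After the fix, multiplying a batched float32 tensor by a Python float
# stays float32 instead of being promoted to float64.
out = torch.vmap(lambda x: x * 0.1)(torch.ones(3))
print(out.dtype)  # torch.float32
```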

Test Plan:
- new test: `pytest test/test_vmap.py -v`
- I refactored some tests a bit.

Reviewed By: cpuhrsch

Differential Revision: D23195418

Pulled By: zou3519

fbshipit-source-id: 33b7da841e55b47352405839f1f9445c4e0bc721
2020-08-20 13:44:31 -07:00
0dc41ff465 [pytorch] add flag for autograd ops to mobile builds (#43154)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43154

Adds the build flag `BUILD_MOBILE_AUTOGRAD` which toggles whether autograd files should be included for a PyTorch mobile build (default off).
ghstack-source-id: 110369406

Test Plan: CI

Reviewed By: ljk53

Differential Revision: D23061913

fbshipit-source-id: bc3d6683ab17f158990d83e4fae0a011d5adeca1
2020-08-20 12:39:55 -07:00
4fc9e958c4 [quant] Add benchmarks for embedding_bag conversion ops (#43291)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43291

Test Float2Fused and Fused2Float conversion operators for embedding_bag byte and 4-bit ops

Test Plan:
```
python -m pt.qembedding_pack_tes
```

Imported from OSS

Reviewed By: radkris-git

Differential Revision: D23231641

fbshipit-source-id: a2afe51bba52980d2e96dfd7dbc183327e9349fd
2020-08-20 11:26:20 -07:00
c8bc298d6c streamline stride propagation logic in TensorIterator (#42922)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41314 among other things.
This PR streamlines layout propagation logic in TensorIterator and removes almost all cases of channels-last hardcoding. The new rules and changes are as follows:
1) behavior of undefined `output` and defined output of the wrong (e.g. 0) size is always the same (before this PR the behavior was divergent)
2) in obvious cases (unary operation on memory-dense tensors, binary operations on memory-dense tensors with the same layout) strides are propagated (before propagation was inconsistent) (see footnote)
3) in other cases the output permutation is obtained as the inverse permutation of sorting the inputs by strides. Sorting is done with a comparator obeying the following rules: strides of broadcasted dimensions are set to 0, and 0 compares equal to anything. Strides of non-broadcasted dimensions (including dimensions of size `1`) participate in sorting. Precedence is given to the first input; in case of a tie in the first input, the corresponding dimensions are considered first, and if that does not indicate that a swap is needed, strides of the same dimension in subsequent inputs are considered. See changes in `reorder_dimensions` and `compute_strides`. Note that first inspecting the dimensions of the first input allows us to better recover its permutation (and we select this behavior because it more reliably propagates channels-last strides), but in some rare cases it could result in a worse traversal order for the second tensor.

These rules are enough to recover previously hard-coded behavior related to channels last, so all existing tests are passing.
In general, these rules will produce intuitive results, and in most cases permutation of the full size input (in case of broadcasted operation) will be recovered, or permutation of the first input (in case of same sized inputs) will be recovered, including cases with trivial (1) dimensions. As an example of the latter, the following tensor
```
x=torch.randn(2,1,3).permute(1,0,2)
```
will produce output with the same stride (3,3,1) in binary operations with 1d tensor. Another example is a tensor of size N1H1 that has strides `H,H,1,1` when contiguous and `H, 1, 1, 1` when channels-last. The output retains these strides in binary operations when another 1d tensor is broadcasted on this one.

Footnote: for ambiguous cases where all inputs are memory dense and have the same physical layout that can nevertheless correspond to different permutations, such as NC11-sized physically contiguous tensors, a regular contiguous tensor is returned, and thus the permutation information of the input is lost (so an NC11 channels-last input had the strides `C, 1, C, C`, but the output will have the strides `C, 1, 1, 1`). This behavior is unchanged from before and consistent with numpy, but it still makes sense to change it. The current blocker for doing so is the performance of `empty_strided`. Once we make it on par with `empty` we should be able to propagate layouts in these cases. For now, to avoid slowing down the common contiguous case, we default to contiguous.
The table below shows how in some cases current behavior loses permutation/stride information, whereas new behavior propagates permutation.
| code                                                                                                                                                                                           | old                                                   | new                                                  |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------|------------------------------------------------------|
| #strided tensors<br>a=torch.randn(2,3,8)[:,:,::2].permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride()) | (2, 24, 8) <br>(6, 3, 1) <br>(1, 12, 4) <br>(6, 3, 1) | (2, 24, 8)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) |
| #memory dense tensors<br>a=torch.randn(3,1,1).as_strided((3,1,1), (1,3,3))<br>print(a.stride(), (a+torch.randn(1)).stride())<br>a=torch.randn(2,3,4).permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride())                                                                                                                                                                                               |  (1, 3, 3) (1, 1, 1)<br>(1, 12, 4)<br>(6, 3, 1)<br>(1, 12, 4)<br>(6, 3, 1)                                                       |  (1, 3, 3) (1, 3, 3)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42922

Reviewed By: ezyang

Differential Revision: D23148204

Pulled By: ngimel

fbshipit-source-id: 670fb6188c7288e506e5ee488a0e11efc8442d1f
2020-08-20 10:50:35 -07:00
ca9d4401d4 .circleci: Remove manual docker installation (#43277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43277

Docker added native support for GPUs with the release of 19.03 and
CircleCI's infrastructure is all on Docker 19.03 as of now.

This also removes all references to `nvidia-docker` in the `.circleci` fodler.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23217570

Pulled By: seemethere

fbshipit-source-id: af297c7e82bf264252f8ead10d1a154354b24689
2020-08-20 10:36:03 -07:00
66a79bf114 .circleci: Don't quote glob for conda upload (#43297)
Summary:
Globs don't get expanded if you quote them in a bash script...
apparently.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43297

Reviewed By: malfet

Differential Revision: D23227626

Pulled By: seemethere

fbshipit-source-id: d124025cfcaacbfb68167a062ca487c08f7f6bc9
2020-08-20 10:24:27 -07:00
397325a109 Make _compute_linear_combination.out a true out function (#43272)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43272

Was missing kwarg-onlyness.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D23215506

Pulled By: ezyang

fbshipit-source-id: 2c282c9a534fa8ea1825c31a24cb2441f0d6b234
2020-08-20 09:00:17 -07:00
f9a766bb39 Increase deadline time for load_save tests (#43205)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43205

A number of tests that forward to `TestLoadSaveBase.load_save` are all marked as flaky due to them regularly taking much longer to start up than hypothesis' default timeout of 200ms. This diff fixes the problem by removing the timeout for `load_save`. This is alright as these tests aren't meant to be testing the performance of these operators.

I would set the deadline to 60s if I could; however, it appears that the caffe2 github CI uses a different version of hypothesis that doesn't allow using `dateutil.timedelta`, so instead of trying to figure out an approach that works on both, I've just removed the deadline time.

I've also tagged all existing tasks WRT these failures.

Differential Revision: D23175752

fbshipit-source-id: 324f9ff034df1ac4874797f04f50067149a6ba48
2020-08-20 08:41:24 -07:00
a2ae2d3203 Nightly Pull (#43294)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40829

This addresses remaining issues/improvements in https://github.com/pytorch/pytorch/issues/40829 that were brought up prior to https://github.com/pytorch/pytorch/issues/42635 being merged.  Namely, this changes the name of the script and adds separate `checkout` and `pull` subcommands. I have tested it locally and everything appears to work.  Please let me know if you encounter any issues. I hope that this supports a more natural workflow.

CC ezyang rgommers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43294

Reviewed By: pbelevich

Differential Revision: D23241849

Pulled By: ezyang

fbshipit-source-id: c24556024d7e5d14b9a5006e927819d4ad370dd7
2020-08-20 08:34:18 -07:00
6a09df99e1 Fix ASAN error in QNNPACK's integration of qlinear_dynamic. (#41967)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41967

Test Plan: `buck test fbandroid/mode/asan xplat/assistant/oacr/nlu/tests:nlu_testsAndroid` no longer reports an error.

Reviewed By: kimishpatel, xuwenfang

Differential Revision: D22715307

Pulled By: AshkanAliabadi

fbshipit-source-id: bec7296b345125ec5243ee6e6c484246ecfca3b7
2020-08-20 07:46:34 -07:00
60b524f271 Update torch.Tensor.is_set_to documentation (#43052)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/30350

Preview:

![image](https://user-images.githubusercontent.com/5676233/90250018-69d72200-de09-11ea-8984-7401cfd6c719.png)
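
A quick usage sketch of the documented behavior:

```python
import torch

t = torch.randn(2, 3)
u = torch.empty(0)
u.set_(t)  # u now shares t's storage, size, and stride
print(u.is_set_to(t))                  # True
print(torch.empty(2, 3).is_set_to(t))  # False: different storage
```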

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43052

Reviewed By: mrshenli

Differential Revision: D23173066

Pulled By: suraj813

fbshipit-source-id: d90a11490739068ea448d975548a71e07180bd77
2020-08-20 07:40:00 -07:00
4e964f3b97 Make Windows CUDA-11 tests master only (#43234)
Summary:
According to the correlation analysis, CUDA-10.1 vs CUDA-11 test failures are quite dependent on each other

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43234

Reviewed By: ezyang, seemethere

Differential Revision: D23204289

Pulled By: malfet

fbshipit-source-id: c53c5f87e55f2dabbb6735a0566c314c204ebc69
2020-08-19 21:05:46 -07:00
3eb31325fc refactor torch/cuda/nccl.h to remove direct dependency on NCCL in libtorch_python (#42687)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42687

Reviewed By: malfet

Differential Revision: D23145834

Pulled By: walterddr

fbshipit-source-id: c703a953a54a638852f6e5a1479ca95ae6a10529
2020-08-19 20:16:53 -07:00
6e1127ea3f [NCCL] Changed FutureNCCL's then callback logic for better efficiency. (#42869)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42869

We realized that when we invoke a simple callback that divides the tensors by `world_size` after `allreduce`, the performance was almost 50% lower in terms of QPS compared to the case where a simple `allreduce` hook is used with no `then` callback.

The main problem was as we call `work.wait()` before invoking `then` callback, we were synchronizing `work`'s stream with the default PyTorch stream inside [`runHook`](https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/reducer.cpp#L609) and stalling the backward computation.

In this PR, we ensure that FutureNCCL's `then` callback does not stall the backward computation. Assuming single-process single-device, `FutureNCCL` gets a new stream from the device's pool using `at::cuda::getStreamFromPool` to run `callback`, and before invoking the `callback` inline it synchronizes `WorkNCCL`'s stream with the callback's stream rather than the default stream.

ghstack-source-id: 110208431

Test Plan: Run performance benchmark tests to validate performance issue is resolved. Also, `python test/distributed/test_c10d.py` to avoid any odd issues.

Reviewed By: pritamdamania87

Differential Revision: D23055807

fbshipit-source-id: 60e50993f1ed97497514eac5cb1018579ed2a4c5
2020-08-19 19:42:22 -07:00
97d62bcd19 Modify Circle CI script to upload test report for analysis. (#43180)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43180

Reviewed By: VitalyFedyunin

Differential Revision: D23195934

Pulled By: walterddr

fbshipit-source-id: 5b9b411c3ea769951b5b1a456b5f7696b8ba0a92
2020-08-19 19:38:25 -07:00
0617156f0e [vulkan] fix invalid memory op and tests (#43312)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43312

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D23232809

Pulled By: IvanKobzarev

fbshipit-source-id: 11b070b6e082bac72e21dd4c25c9c675bbc8c4a3
2020-08-19 19:34:08 -07:00
aad1ff9f18 [quant][cleanup]test_qlinear_legacy should be under TestDynamicQuantizedLinear. (#40084)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40084

This is just a nit diff (I hit a merge conflict while writing some unit tests).
This move was a nit left over from D21628596 (655f1ea176).

Test Plan: buck test test:quantization -- test_qlinear_legacy

Reviewed By: supriyar

Differential Revision: D22065463

fbshipit-source-id: 96ceaa53355349af7157f38b3a6366c550eeec6f
2020-08-19 18:50:46 -07:00
410d5b95b2 [jit] fix str -> Device implicit conversions (#43213)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43213

A reversed isSubtypeOf caused erroreous conversions to be inserted.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23192787

Pulled By: zdevito

fbshipit-source-id: 4a90b19d99a4fc889e55568ced850f08dadbc3fe
2020-08-19 16:05:11 -07:00
018b4d7abb Automated submodule update: FBGEMM (#43251)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 685149bbc0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43251

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: YazhiGao

Differential Revision: D23207016

fbshipit-source-id: 54e13b246bb5189260ed11316ddf3d26d52c6b24
2020-08-19 11:42:16 -07:00
eb7fc2e98f .circleci: Simplify binary upload process (#43159)
Summary:
Binary uploads were gated into 3 separate scripts, making it difficult to
actually contribute changes. This simplifies that by consolidating all 3
scripts into a single script, and then further consolidates things by
putting them all into the same job.

This also further simplifies things by separating upload jobs into their
own function under binary_build_definitions.py, since following the
conditional logic tree under the generic function was too difficult.

Testing this change here: https://github.com/pytorch/pytorch/pull/43161

Proof of success:
* [libtorch](https://app.circleci.com/pipelines/github/pytorch/pytorch/201868/workflows/54ce962f-f35b-4d97-93a7-bee186b14ead/jobs/6791347)
* [conda](https://app.circleci.com/pipelines/github/pytorch/pytorch/201868/workflows/54ce962f-f35b-4d97-93a7-bee186b14ead/jobs/6794359)
* [manywheel](https://app.circleci.com/pipelines/github/pytorch/pytorch/201868/workflows/54ce962f-f35b-4d97-93a7-bee186b14ead/jobs/6794253)

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43159

Reviewed By: malfet

Differential Revision: D23175174

Pulled By: seemethere

fbshipit-source-id: a2de64c033df99b03a124d3a0a2c92560af62c37
2020-08-19 11:34:14 -07:00
d467ac8ff0 [GLOO] handle empty split size (#43256)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43256

* Handle empty split sizes by moving to computeLengthsAndOffsets()
* Enable GLOO alltoall python tests
ghstack-source-id: 109292763

Test Plan:
buck build mode/dev-nosan caffe2/torch/lib/c10d:ProcessGroupGlooTest

./trainer_cmd.sh -p 16 -n 8 -d gloo (modify ./trainer_cmd.sh a bit)

Reviewed By: mingzhe09088

Differential Revision: D22961600

fbshipit-source-id: b9e90dadf7b45323b8af2e6cab2e156043b7743b
2020-08-19 11:14:06 -07:00
7d10298067 Implement Tensor.to batching rule (#43206)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43206

The batching rule is the same as the unary pointwise batching rules:
given a BatchedTensor, we unwrap it, call Tensor.to, and then re-wrap
it.
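
A hedged Python sketch of that unwrap/call/re-wrap pattern (the real rule is implemented in C++; these names are illustrative, not PyTorch internals):

```python
from dataclasses import dataclass
import torch

@dataclass
class Batched:
    value: torch.Tensor  # physical tensor, with the batch dim folded in
    bdim: int            # which physical dim is the batch dim

def to_batching_rule(b: Batched, *args, **kwargs) -> Batched:
    # Unwrap, run Tensor.to on the physical tensor, re-wrap with the same bdim.
    return Batched(b.value.to(*args, **kwargs), b.bdim)

out = to_batching_rule(Batched(torch.ones(2, 3), 0), torch.float64)
print(out.value.dtype, out.bdim)  # torch.float64 0
```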

Test Plan: - `pytest test/test_vmap.py -v -k`

Reviewed By: ezyang

Differential Revision: D23189053

Pulled By: zou3519

fbshipit-source-id: 51b4e41b1cd34bd082082ec4fff3c643002edbaf
2020-08-19 10:54:26 -07:00
1e248caba8 [CircleCI] Use canary images until VC++ 14.27 issue is resolved (#43220)
Summary:
Should fix binary build issue on Windows, and promptly error out if images are updated to a different version of VC++

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43220

Reviewed By: ezyang

Differential Revision: D23198530

Pulled By: malfet

fbshipit-source-id: 0c80361ad7dcfb7aaffccc306b7d741671bedc11
2020-08-19 10:28:19 -07:00
bc0e1e8ed2 Add dataclasses to base Docker images. (#43217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43217

Dataclasses is part of standard library in Python 3.7 and there
is a backport for it in Python 3.6.  Our code generation will
start using it, so add it to the default library set.
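
For reference, a minimal example of the stdlib feature being installed (the class name here is illustrative, not an actual codegen type):

```python
from dataclasses import dataclass

@dataclass
class NativeFunction:
    name: str
    structured: bool = False

print(NativeFunction("add"))  # NativeFunction(name='add', structured=False)
```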

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D23214028

Pulled By: ezyang

fbshipit-source-id: a2ae20b9fa8f0b22966ae48506d4ddea203e7459
2020-08-19 09:56:23 -07:00
06d43dc69a default ice-ref to c-step (#4812)
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/4812

if no compilation options are passed, default to c-step

fixed the FC and batchmatmul implementations to match C-step
fixed the fakelowp map calling to make sure we use the fp32 substitution of operators
updated the accumulator test to make it pass with fp32

Test Plan:
fakelowp tests
glow/test/numerics
net_runner

Reviewed By: jfix71

Differential Revision: D23086534

fbshipit-source-id: 3fbb8c4055bb190becb39ce8cdff6671f8558734
2020-08-19 09:50:34 -07:00
fa6b34b54c 2 Bit Embedding Conversion Operator support. (#43077)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43077

The 2-bit embedding weight conversion operation is quite similar to the
4-bit embedding weight conversion.

The diff contains both the
1. 2bit packing op `embedding_bag_2bit_prepack`.
2. 2bit unpacking op `embedding_bag_2bit_unpack`.

Comments about the op are inline with the op definition.

Test Plan: buck test caffe2/test:quantization -- test_embedding_bag_2bit_unpack

Reviewed By: supriyar

Differential Revision: D23143262

fbshipit-source-id: fd8877f049ac1f7eb4bc580e588dc95f8b1edef0
2020-08-18 23:20:30 -07:00
ab366d0f5f Fix some mistakes in native_functions.yaml (#43156)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43156

- supports_named_tensor no longer does anything, so I have removed
  it.  I'm guessing these were cargo culted from some old occurrences
  of it in native_functions.yaml

- comma, not period, in variants

In my upcoming codegen rewrite, there will be strict error checking
for these cases (indeed, that is how I found these problems), so
I do not add error testing here.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23183977

Pulled By: ezyang

fbshipit-source-id: a47d342152badfb8aea248a819ad94fd93dd6ab2
2020-08-18 23:13:20 -07:00
27ec91b0c9 remove thunk fix now that ROCm CI images are >= ROCm 3.5 (#43226)
Summary:
Also, relax BUILD_ENVIRONMENT exact match to rocm when installing pip packages for tests.

CC ezyang xw285cornell sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43226

Reviewed By: colesbury

Differential Revision: D23200460

Pulled By: xw285cornell

fbshipit-source-id: 11cd889cc320d0249d7ebea4da261bfe779e82ac
2020-08-18 23:10:15 -07:00
8094228f26 update path in CI script to access ninja (#43236)
Summary:
This relaxes the assumption that test.sh will be run in the CI environment by the CI user.

CC ezyang xw285cornell sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43236

Reviewed By: colesbury

Differential Revision: D23205981

Pulled By: ezyang

fbshipit-source-id: 302743cb03c9e9c6bfcdd478a6cd920b536dc29b
2020-08-18 21:43:41 -07:00
7c923a1025 Optimize linux CI build/test matrix (#43240)
Summary:
Make CUDA-10.1 configs build-only, as CUDA-10.1 and CUDA-10.2 test matrix is almost identical, and now, since CUDA-11 is out perhaps it's time to stop testing CUDA-10.1.
Make CUDA-9.2+GCC_5.4 an important (i.e. running on PR) build only config, because of the big overlap between  CUDA-9.2-GCC7 and CUDA-9.2-GCC5.4 test coverage.
Make CUDA-11 libtorch tests important rather than CUDA-10.2.

As result of the change, every PR will be built against CUDA-9.2, CUDA-10.2 and CUDA-11 and tested against CUDA-10.2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43240

Reviewed By: ezyang

Differential Revision: D23205129

Pulled By: malfet

fbshipit-source-id: 70932e8b2167cce9fd621115c8bf24b1c81ed621
2020-08-18 20:39:32 -07:00
e41ca2d9fa In copy_weights_to_flat_buf_views() explicitly construct tuple (#43244)
Summary:
In some versions of GCC, the tuple constructor from an initializer list is marked as explicit, which results in the following compilation error:
```
/var/lib/jenkins/workspace/aten/src/ATen/native/cudnn/RNN.cpp: In function 'std::tuple<at::Tensor, std::vector<at::Tensor, std::allocator<at::Tensor> > > at::native::cudnn_rnn::copy_weights_to_flat_buf_views(at::TensorList, int64_t, int64_t, int64_t, int64_t, int64_t, bool, bool, cudnnDataType_t, const c10::TensorOptions&, bool, bool, bool)':
/var/lib/jenkins/workspace/aten/src/ATen/native/cudnn/RNN.cpp:687:35: error: converting to 'std::tuple<at::Tensor, std::vector<at::Tensor, std::allocator<at::Tensor> > >' from initializer list would use explicit constructor 'constexpr std::tuple<_T1, _T2>::tuple(_U1&&, _U2&&) [with _U1 = at::Tensor&; _U2 = std::vector<at::Tensor>&; <template-parameter-2-3> = void; _T1 = at::Tensor; _T2 = std::vector<at::Tensor>]'
     return {weight_buf, params_arr};
```
This regression was introduced by https://github.com/pytorch/pytorch/pull/42385

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43244

Reviewed By: pbelevich

Differential Revision: D23205656

Pulled By: malfet

fbshipit-source-id: 51470386ad95290c7c99d733fc1fe655aa27d009
2020-08-18 19:31:51 -07:00
d06f1818ad Fix codegen/cuda gcc-5.4 compilation issues (#43223)
Summary:
Most of the fixes are for the same old enum-is-not-hashable error.
In manager.cpp, use std::unordered_map::emplace rather than `insert` to avoid an error triggered by missed copy elision.
This regression was introduced by https://github.com/pytorch/pytorch/pull/43129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43223

Reviewed By: albanD, seemethere

Differential Revision: D23198330

Pulled By: malfet

fbshipit-source-id: 576082f7a4454dd29182892c9c4e0b51a967d456
2020-08-18 17:19:07 -07:00
d5bc2a8058 Remove std::complex from c10::Half (#39833)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39833

Reviewed By: mrshenli

Differential Revision: D22644987

Pulled By: anjali411

fbshipit-source-id: 5ae5db10b12d410560eca43234efa04b711a639c
2020-08-18 15:22:36 -07:00
6c99d5611d [tensorexpr] Fix promotion of booleans (#43097)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43097

Boolean arguments weren't promoted, so if you tried to write a comparison with
types such as `Tensor(Bool) == Int` you'd fail typechecking inside the TE
engine.

Test Plan: Imported from OSS

Reviewed By: protonu, zheng-xq

Differential Revision: D23167926

Pulled By: bertmaher

fbshipit-source-id: 47091a815d5ae521637142a5c390e8a51a776906
2020-08-18 15:19:38 -07:00
da5df7e2d2 Remove use of term "blacklist" from tools/autograd/gen_python_functions.py (#42047)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41720

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42047

Reviewed By: colesbury

Differential Revision: D23197785

Pulled By: SplitInfinity

fbshipit-source-id: 8ef38518f479e5e96b6a51bc420b0df5b35b447c
2020-08-18 15:11:22 -07:00
3951457ca5 [FX] Add in resnet + quantization tests (#43157)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43157

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23173327

Pulled By: jamesr66a

fbshipit-source-id: 724d0f5399d389cdaa53917861b2113c33b9b5f9
2020-08-18 15:00:18 -07:00
dd194c1612 add _save_parameters to serialize map (#43163)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43163

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23175287

Pulled By: ann-ss

fbshipit-source-id: ddfd734513c07e8bdbec108f26d1ca1770d098a6
2020-08-18 14:58:04 -07:00
2e6e295ecc refactor _save_parameters to _save_data (#43162)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43162

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23175286

Pulled By: ann-ss

fbshipit-source-id: 6f930b98c367242fd4efbf51cb1d09995f7c4b40
2020-08-18 14:57:03 -07:00
888ae1b3d8 Introducing Matrix exponential (#40161)
Summary:
Implements (batched) matrix exponential. Fixes [https://github.com/pytorch/pytorch/issues/9983](https://github.com/pytorch/pytorch/issues/9983).

The algorithm follows:
```
 Bader, P.; Blanes, S.; Casas, F.
 Computing the Matrix Exponential with an Optimized Taylor Polynomial Approximation.
 Mathematics 2019, 7, 1174.
```
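
A usage sketch, assuming the `torch.matrix_exp` entry point this PR introduces (it also accepts batched matrices, as the message states):

```python
import torch

# expm of the zero matrix is the identity.
print(torch.matrix_exp(torch.zeros(2, 2)))

# Exponentiating a skew-symmetric generator yields a rotation matrix:
# [[cos 1, -sin 1], [sin 1, cos 1]].
G = torch.tensor([[0.0, -1.0], [1.0, 0.0]])
print(torch.matrix_exp(G))
```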

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40161

Reviewed By: zhangguanheng66

Differential Revision: D22951372

Pulled By: ezyang

fbshipit-source-id: aa068cb76d5cf71696b333d3e72cee287b3089e3
2020-08-18 14:15:10 -07:00
dfdd797723 Replace all AT_ASSERTM under ATen CUDA kernels. (#42989)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42989

Test Plan: Imported from OSS

Reviewed By: colesbury

Differential Revision: D23190011

Pulled By: ezyang

fbshipit-source-id: 7489598d7d920f32334943c1bf12bba74208a96c
2020-08-18 13:50:49 -07:00
493b3c2c7c Replace all AT_ASSERTM under ATen CPU kernels. (#41876)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41876

Test Plan: Imported from OSS

Reviewed By: colesbury

Differential Revision: D23190010

Pulled By: ezyang

fbshipit-source-id: 238f1cd8db283805d6e892de7549763d0aa13316
2020-08-18 13:49:15 -07:00
0744dd6166 Fix shapes in the MarginRankingLoss docs (#43131)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42884

I did some additional research and considering the first few lines of the docs (`Creates a criterion that measures the loss given inputs x1, x2, two 1D mini-batch Tensors, and a label 1D mini-batch tensor y (containing 1 or -1`) and the provided tests, this loss should be used primarily with 1-D tensors. More advanced users (that may use this loss in non-standard ways) can easily check the source and see that the definition accepts inputs/targets of arbitrary dimension as long as they match in shape or are broadcastable.
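
The standard 1-D usage described above looks like this:

```python
import torch

loss_fn = torch.nn.MarginRankingLoss(margin=0.5)
x1 = torch.randn(4)
x2 = torch.randn(4)
y = torch.tensor([1.0, -1.0, 1.0, -1.0])  # +1: x1 should rank higher; -1: x2
print(loss_fn(x1, x2, y))
```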

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43131

Reviewed By: colesbury

Differential Revision: D23192011

Pulled By: mrshenli

fbshipit-source-id: c412c28daf9845c0142ea33b35d4287e5b65fbb9
2020-08-18 13:44:16 -07:00
fbf274f5a7 Autocast support for cudnn RNNs (#42385)
Summary:
Should close https://github.com/pytorch/pytorch/issues/36428.

The cudnn RNN API expects weights to occupy a flat buffer in memory with a particular layout.  This PR implements a "speed of light" fix:  [`_cudnn_rnn_cast_reflatten`](https://github.com/pytorch/pytorch/pull/42385/files#diff-9ef93b6a4fb5a06a37c562b83737ac6aR327) (the autocast wrapper assigned to `_cudnn_rnn`) copies weights to the right slices of a flat FP16 buffer with a single read/write per weight (as opposed to casting them to FP16 individually then reflattening the individual FP16 weights, which would require 2 read/writes per weight).

It isn't pretty but IMO it doesn't make rnn bindings much more tortuous than they already are.

The [test](https://github.com/pytorch/pytorch/pull/42385/files#diff-e68a7bc6ba14f212e5e7eb3727394b40R2683) tries a forward under autocast and a backward for the full cross product of RNN options and input/weight/hidden dtypes.  As for all FP16list autocast tests, forward output and backward grads are checked against a control where inputs (including RNN module weights in this case) are precasted to FP16 on the python side.

Not sure who to ask for review, tagging ezyang and ngimel because Ed wrote this file (almost 2 years ago) and Natalia did the most recent major [surgery](https://github.com/pytorch/pytorch/pull/12600).

Side quests discovered:
- Should we update [persistent RNN heuristics](dbdd28207c/aten/src/ATen/native/cudnn/RNN.cpp (L584)) to include compute capability 8.0?  Could be another PR but seems easy enough to include.
- Many (maybe all?!) the raw cudnn API calls in [RNN.cpp](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cudnn/RNN.cpp) are deprecated in cudnn 8.  I don't mind taking the AI to update them since my mental cache is full of rnn stuff, but that would be a substantial separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42385

Reviewed By: zhangguanheng66

Differential Revision: D23077782

Pulled By: ezyang

fbshipit-source-id: a2afb1bdab33ba0442879a703df13dc87f03ec2e
2020-08-18 13:37:42 -07:00
0a9c35aba3 maybe minor fix to dispatch/backend_fallback_test.cpp? (#42990)
Summary:
I think you want to push rewrapped `rets`, not `args`, back to the stack.

Doesn't matter for test purposes because tests only check if/when fallbacks were called, they don't check outputs for correctness.  But it avoids reader confusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42990

Reviewed By: mrshenli

Differential Revision: D23168277

Pulled By: ezyang

fbshipit-source-id: 2559f0707acdca2e3deac09006bc66ce3c788ea3
2020-08-18 13:01:35 -07:00
e39b43fd76 Issue 43057 (#43063)
Summary:
A small change that adds a docstring that can be found with
`getattr(nn.Module, nn.Module.forward.__name__, None).__doc__`

Fixes https://github.com/pytorch/pytorch/issues/43057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43063

Reviewed By: mrshenli

Differential Revision: D23161782

Pulled By: ezyang

fbshipit-source-id: 95456f858e2b6a0e41ae551ea4ec2e78dd35ee3f
2020-08-18 12:50:53 -07:00
5d608d45cf Added Encoder Layer constructor with default parameters (#43130)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37756

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43130

Reviewed By: colesbury

Differential Revision: D23189803

Pulled By: mrshenli

fbshipit-source-id: 53f3fca838828ddd728d8b44c36745bab5acee1f
2020-08-18 11:09:49 -07:00
53bbf5a48b Update README.md (#43100)
Summary:
The changes are minor.
1. Add back the external links so that readers can find out more about external tools on how to accelerate PyTorch.
2. Fix typo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43100

Reviewed By: colesbury

Differential Revision: D23192251

Pulled By: mrshenli

fbshipit-source-id: dde54b7942ebff5bbe3d58ad95744c6d95fe60fe
2020-08-18 11:04:36 -07:00
ee74c2e5be Compress fatbin to fit into 32bit indexing (#43074)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39968

tested with `TORCH_CUDA_ARCH_LIST='3.5 5.2 6.0 6.1 7.0 7.5 8.0+PTX'`: before this PR the build was failing, and with this PR it succeeds.

With `TORCH_CUDA_ARCH_LIST='7.0 7.5 8.0+PTX'`, `libtorch_cuda.so` with symbols changes from 2.9GB -> 2.2GB

cc: ptrblck mcarilli jjsjann123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43074

Reviewed By: mrshenli

Differential Revision: D23176095

Pulled By: malfet

fbshipit-source-id: 7b3e6d049fc080e519f21e80df05ef68e7bea57e
2020-08-18 09:48:54 -07:00
b92b556a12 Add shape inference to SparseLengthsSumSparse ops (#43181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43181

att

Test Plan:
```
buck test caffe2/caffe2/opt:bound_shape_inference_test
```

Reviewed By: ChunliF

Differential Revision: D23097145

fbshipit-source-id: 3e4506308446f28fbeb01dcac97dce70c0443975
2020-08-18 09:36:53 -07:00
b3bda94393 [NVFuser] Enable E2E BCast-PWise-Reduction fusions (#43129)
Summary:
Had a bunch of merged commits that shouldn't have been there; reverted them to prevent conflicts. Lots of new features; highlights are listed below.

**Overall:**

- Enables pointwise fusion, single (but N-D) broadcast -- pointwise fusion, single (but N-D) broadcast -- pointwise -- single (but N-D) reduction fusion.

**Integration:**

- Separate "magic scheduler" logic that takes a fusion and generates code generator schedule
- Reduction fusion scheduling with heuristics closely matching eagermode (unrolling supported, but no vectorize support)
- 2-Stage caching mechanism, one on contiguity, device, type, and operations, the other one is input size->reduction heuristic

**Code Generation:**

- More generic support in code generation for computeAt
- Full rework of loop nest generation and Indexing to more generically handle broadcast operations
- Code generator has automatic kernel launch configuration (including automatic allocation of grid reduction buffers)
- Symbolic (runtime) tilling on grid/block dimensions is supported
- Simplified index generation based on user-defined input contiguity
- Automatic broadcast support (similar to numpy/pytorch semantics)
- Support for compile time constant shared memory buffers
- Parallelized broadcast support (i.e. block reduction -> block broadcast support)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43129

Reviewed By: mrshenli

Differential Revision: D23162207

Pulled By: soumith

fbshipit-source-id: 16deee4074c64de877eed7c271d6a359927111b2
2020-08-18 09:10:08 -07:00
c44b1de54e Pin VC++ version to 14.26 (#43184)
Summary:
VC++14.27 fails to compile mkl-dnn, see oneapi-src/oneDNN#812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43184

Reviewed By: glaringlee

Differential Revision: D23181803

Pulled By: malfet

fbshipit-source-id: 9861c6243673c775374d77d2f51b45a42791b475
2020-08-17 22:17:06 -07:00
e8db0425b5 remove dot from TH (#43148)
Summary:
small cleanup of dead code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43148

Reviewed By: mruberry

Differential Revision: D23175571

Pulled By: ngimel

fbshipit-source-id: b1b0ae9864d373c75666b95c589d090a9ca791b2
2020-08-17 21:40:44 -07:00
aef2890a75 Improve zero sized input for addmv (#41824)
Summary:
fixes https://github.com/pytorch/pytorch/issues/41340

Unfortunately, I still cannot get a K80 to verify the fix, but it should be working.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41824

Reviewed By: mruberry

Differential Revision: D23172775

Pulled By: ngimel

fbshipit-source-id: aa6af96fe74e3bb07982c006cb35ecc7f18181bc
2020-08-17 20:05:31 -07:00
3c5e3966f4 [ONNX] Squeeze operator should give an error when trying to apply to a dimension with shape > 1 (#38476)
Summary:
The ONNX spec for the Squeeze operator:

> Remove single-dimensional entries from the shape of a tensor. Takes a parameter axes with a list of axes to squeeze. If axes is not provided, all the single dimensions will be removed from the shape. If an axis is selected with shape entry not equal to one, an error is raised.

Currently, as explained in issue https://github.com/pytorch/pytorch/issues/36796, it is possible to export such a model to ONNX, and this results in an exception from ONNX runtime.

Fixes https://github.com/pytorch/pytorch/issues/36796.
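
A minimal repro sketch of the behavior this guards against (module, shapes, and the exact error type are illustrative, not taken from the PR's test suite):

```python
import io
import torch

class BadSqueeze(torch.nn.Module):
    def forward(self, x):
        # dim 1 has size 2 below, so per the ONNX spec this is an error
        return torch.squeeze(x, dim=1)

x = torch.randn(3, 2, 5)
try:
    torch.onnx.export(BadSqueeze(), x, io.BytesIO())
except RuntimeError as e:  # error type illustrative
    print("export rejected:", e)
```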

Pull Request resolved: https://github.com/pytorch/pytorch/pull/38476

Reviewed By: hl475

Differential Revision: D22158024

Pulled By: houseroad

fbshipit-source-id: bed625f3c626eabcbfb2ea83ec2f992963defa19
2020-08-17 17:41:46 -07:00
cd96dfd44b Delete accidentally committed file errors.txt. (#43164)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43164

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23175392

Pulled By: gchanan

fbshipit-source-id: 0d2d918fdf4a94361cdc3344bf1bc89dd0286ace
2020-08-17 17:37:48 -07:00
57af1ec145 observers: use torch.all to check for valid min and max values (#43151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43151

Using `torch.all` instead of `torch.sum` and a length check.
It's unclear whether the increase in perf (~5% for small inputs) is
real, but it should be a net benefit, especially for larger channel inputs.
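
A sketch of the check in question, with illustrative names (not the exact observer internals):

```python
import torch

min_val = torch.randn(8)
max_val = min_val + torch.rand(8)

# Single reduction over the elementwise comparison, instead of a
# torch.sum plus a length check:
is_valid = torch.all(min_val <= max_val)
```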

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23170426

fbshipit-source-id: ee5c25eb93cee1430661128ac9458a9c525df8e5
2020-08-17 17:08:57 -07:00
3264ba065c observers: use clamp instead of min/max in calculate_qparams (#43150)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43150

The current logic was expensive because it created tensors on CUDA.
Switching to clamp since it can work without needing to create tensors.
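
A sketch of the difference, assuming a `scale` tensor that must stay above `eps` (names illustrative):

```python
import torch

scale = torch.tensor([1e-9, 0.5, 2.0])
eps = torch.finfo(torch.float32).eps

# Elementwise max needs an eps *tensor* on scale's device
# (an extra allocation, costly on CUDA):
old = torch.max(scale, torch.tensor([eps], device=scale.device))

# clamp accepts a plain Python scalar, so no tensor is created:
new = scale.clamp(min=eps)
assert torch.equal(old, new)
```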

Test Plan:
benchmarks

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23170427

fbshipit-source-id: 6fe3a728e737aca9f6c2c4d518c6376738577e21
2020-08-17 17:08:54 -07:00
a5dfba0a6e observers: make eps a buffer (#43149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43149

This value doesn't change, so make it a buffer in order to pay the
cost of creating a tensor only once.
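
A minimal sketch of the pattern (the module name is illustrative):

```python
import torch

class MyObserver(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Create the eps tensor once, at construction time, rather
        # than on every calculate_qparams call:
        self.register_buffer(
            'eps', torch.tensor([torch.finfo(torch.float32).eps]))
```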

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23170428

fbshipit-source-id: 6b963951a573efcc5b5a57649c814590b448dd72
2020-08-17 17:08:51 -07:00
5aa61afbfb quant bench: update observer configs (#42956)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42956

In preparation for observer perf improvement, cleans up the
micro benchmarks:
* disable CUDA for histogram observers (it's too slow)
* add larger shapes for better representation of real workloads

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qobserver_test
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D23093996

fbshipit-source-id: 5dc477c9bd5490d79d85ff8537270cd25aca221a
2020-08-17 17:07:56 -07:00
1f6e6a1166 Remove unused variable vecVecStartIdx (#42257)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42257

Reviewed By: gchanan

Differential Revision: D23109328

Pulled By: ezyang

fbshipit-source-id: dacd438395fedd1050ad3ffb81327bbb746c776c
2020-08-17 15:41:07 -07:00
133e9f96e1 Use c10 threadpool for GPU to CPU distributed autograd continuations. (#42511)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42511

DistEngine currently only has a single thread to execute GPU to CPU
continuations as part of the backward pass. This would be a significant
performance bottleneck in cases where we have such continuations and would like
to execute these using all CPU cores.

To alleviate this in this PR, we have the single thread in DistEngine only
dequeue work from the global queue, but then hand off execution of that work to
the c10 threadpool where we call "execute_graph_task_until_ready_queue_empty".

For more context please see:
https://github.com/pytorch/pytorch/issues/40255#issuecomment-663298062.
ghstack-source-id: 109997718

Test Plan: waitforbuildbot

Reviewed By: albanD

Differential Revision: D22917579

fbshipit-source-id: c634b6c97f3051f071fd7b994333e6ecb8c54155
2020-08-17 15:04:19 -07:00
825ec18eed [jit] better error message (#43093)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43093

without this it's hard to tell which module is going wrong

Test Plan:
```
> TypeError:
> 'numpy.int64' object in attribute 'Linear.in_features' is not a valid constant.
> Valid constants are:
> 1. a nn.ModuleList
> 2. a value of type {bool, float, int, str, NoneType, torch.device, torch.layout, torch.dtype}
> 3. a list or tuple of (2)
```

Reviewed By: eellison

Differential Revision: D23148516

fbshipit-source-id: b86296cdeb7b47c9fd69b5cfa479914c58ef02e6
2020-08-17 14:57:56 -07:00
864f0cfb2d Fix type annotations for torch.sparse, enable in CI (#43108)
Summary:
Closes gh-42982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43108

Reviewed By: malfet

Differential Revision: D23167560

Pulled By: ezyang

fbshipit-source-id: 0d660ca686ada2347bf440c6349551d1539f99ef
2020-08-17 14:40:11 -07:00
6db0b8785d Adds movedim method, fixes movedim docs, fixes view doc links (#43122)
Summary:
This PR:

- Adds a method variant to movedim
- Fixes the movedim docs so it will actually appear in the documentation
- Fixes three view doc links which were broken
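
A quick sketch of the function form and the new method form:

```python
import torch

x = torch.randn(2, 3, 4)
assert torch.movedim(x, 0, -1).shape == (3, 4, 2)  # function form
assert x.movedim(0, -1).shape == (3, 4, 2)         # new method form
```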

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43122

Reviewed By: ngimel

Differential Revision: D23166222

Pulled By: mruberry

fbshipit-source-id: 14971585072bbc04b5366d4cc146574839e79cdb
2020-08-17 14:24:52 -07:00
37252e8f00 Implement batching rules for some unary ops (#43059)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43059

This PR implements batching rules for some unary ops. In particular, it
implements the batching rules for the unary ops that take a single
tensor as input (and nothing else).

The batching rule for a unary op is (sketched in Python after this list):
(1) grab the physical tensor straight out of the BatchedTensor
(2) call the unary op
(3) rewrap the physical tensor in a BatchedTensor
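
A Python sketch of those three steps; the real rules live in C++, and `value`/`bdims` here are stand-in names, not actual PyTorch internals:

```python
def unary_batching_rule(op, batched):
    physical = batched.value                     # (1) grab the physical tensor
    result = op(physical)                        # (2) call the unary op
    return type(batched)(result, batched.bdims)  # (3) rewrap the result
```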

Test Plan: - new tests `pytest test/test_vmap.py -v -k "Operators"`

Reviewed By: ezyang

Differential Revision: D23132277

Pulled By: zou3519

fbshipit-source-id: 24b9d7535338207531d767155cdefd2c373ada77
2020-08-17 13:38:10 -07:00
768c2a8c25 vmap: fixed to work with functools.partial (#43028)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43028

There was a bug where we always tried to grab the `__name__` attribute of
the function passed in by the user. Not all Callables have the
`__name__` attribute, an example being a Callable produced by
functools.partial.

This PR modifies the error-checking code to use `repr` if `__name__` is
not available. Furthermore, it moves the "get the name of this function"
functionality to the actual error sites as an optimization so we don't
spend time trying to compute `__repr__` for the Callable if there is no
error.
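
A sketch of the fallback (the helper name is illustrative):

```python
import functools

def _callable_name(fn):
    # functools.partial objects, among others, have no __name__:
    return fn.__name__ if hasattr(fn, '__name__') else repr(fn)

# Prints something like: functools.partial(<built-in function max>, 0)
print(_callable_name(functools.partial(max, 0)))
```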

Test Plan: - `pytest test/test_vmap.py -v`, added new tests.

Reviewed By: yf225

Differential Revision: D23130235

Pulled By: zou3519

fbshipit-source-id: 937f3640cc4d759bf6fa38b600161f5387a54dcf
2020-08-17 13:36:49 -07:00
9c3f579528 .circleci: Copy LLVM from pre-built image (#43038)
Summary:
LLVM builds took a large amount of time and bogged down docker builds in
general. Since we build it the same for everything let's just copy it
from a pre-built image instead of building it from source every time.

Builds are defined in https://github.com/pytorch/builder/pull/491

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43038

Reviewed By: malfet

Differential Revision: D23119513

Pulled By: seemethere

fbshipit-source-id: f44324439d45d97065246caad07c848e261a1ab6
2020-08-17 11:04:35 -07:00
7cb8d68ae1 Rename XLAPreAutograd to AutogradXLA. (#43047)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43047

Reviewed By: ezyang

Differential Revision: D23134326

Pulled By: ailzhang

fbshipit-source-id: 5fcbc23755daa8a28f9b03af6aeb3ea0603b5c9a
2020-08-17 10:47:43 -07:00
034e6727e7 Set default ATen threading backend to native if USE_OPENMP is false (#43067)
Summary:
Since OpenMP is not available on some platforms, or might be disabled by the user, set the default `ATEN_THREADING` based on the USE_OPENMP and USE_TBB options

Fixes https://github.com/pytorch/pytorch/issues/43036

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43067

Reviewed By: houseroad

Differential Revision: D23138856

Pulled By: malfet

fbshipit-source-id: cc8f9ee59a5559baeb3f19bf461abbc08043b71c
2020-08-17 10:33:31 -07:00
aab66602c4 Add torch.dot for complex tensors (#42745)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42745

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D23056382

Pulled By: anjali411

fbshipit-source-id: c97f15e057095f78069844dbe0299c14104d2fce
2020-08-17 09:05:41 -07:00
472f291375 Fix freeze_module pass for sharedtype (#42457)
Summary:
During the cleanup phase, calling recordReferencedAttrs records
the attributes which are referenced and hence kept.
However, if you have two instances of the same type which are preserved
through the freezing process, as the added test case shows, then while
recording the referenced attributes we iterate through the type
INSTANCES seen so far and record those.
Thus, if we have another instance of the same type, we will just look at
the first instance in the list and record that instance.
This PR fixes that by traversing the getattr chains and getting the
actual instance of the getattr output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42457

Test Plan:
python test/test_jit.py TestFreezing
Fixes #{issue number}

Reviewed By: gchanan

Differential Revision: D23106921

Pulled By: kimishpatel

fbshipit-source-id: ffff52876938f8a1fedc69b8b24a3872ea66103b
2020-08-17 08:27:31 -07:00
269fdb5bb2 prepare to split transformer header file (#43069)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43069

The transformer C++ impl needs to put TransformerEncoderLayer/DecoderLayer and TransformerEncoder/TransformerDecoder in different headers, since TransformerEncoder/Decoder's options class needs TransformerEncoderLayer/DecoderLayer as an input parameter. Split the header files to avoid cyclic inclusion.

Test Plan: Imported from OSS

Reviewed By: yf225

Differential Revision: D23139437

Pulled By: glaringlee

fbshipit-source-id: 3c752ed7702ba18a9742e4d47d049e62d2813de0
2020-08-17 07:54:05 -07:00
248b6a30f4 add training mode to mobile::Module (#42880)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42880

Enable switching between and checking for training and eval mode for torch::jit::mobile::Module using train(), eval(), and is_training(), like exists for torch::jit::Module.

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23063006

Pulled By: ann-ss

fbshipit-source-id: b79002148c46146b6e961cbef8aaf738bbd53cb2
2020-08-17 00:20:03 -07:00
e2eb0cb1a9 Adds arccosh alias for acosh and adds an alias consistency test (#43107)
Summary:
This adds the torch.arccosh alias and updates alias testing to validate the consistency of the aliased and original operations. The alias testing is also updated to run on CPU and CUDA, which revealed a memory leak when tracing (see https://github.com/pytorch/pytorch/issues/43119).
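
A quick usage sketch of the alias:

```python
import torch

x = torch.tensor([1.5, 2.0, 10.0])
assert torch.equal(torch.arccosh(x), torch.acosh(x))  # alias of acosh
```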

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43107

Reviewed By: ngimel

Differential Revision: D23156472

Pulled By: mruberry

fbshipit-source-id: 6155fac7954fcc49b95e7c72ed917c85e0eabfcd
2020-08-16 22:12:25 -07:00
4ae832e106 Optimize SiLU (Swish) op in PyTorch (#42976)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42976

Optimize SiLU (Swish) op in PyTorch.

Some benchmark result

input = torch.rand(1024, 32768, dtype=torch.float, device="cpu")
forward: 221ms -> 133ms
backward: 600ms -> 170ms

input = torch.rand(1024, 32768, dtype=torch.double, device="cpu")
forward: 479ms -> 297ms
backward: 1438ms -> 387ms

input = torch.rand(8192, 32768, dtype=torch.float, device="cuda")
forward: 24.34ms -> 9.83ms
backward: 97.05ms -> 29.03ms

input = torch.rand(4096, 32768, dtype=torch.double, device="cuda")
forward: 44.24ms -> 30.15ms
backward: 126.21ms -> 49.68ms

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "SiLU"

Reviewed By: houseroad

Differential Revision: D23093593

fbshipit-source-id: 1ba7b95d5926c4527216ed211a5ff1cefa3d3bfd
2020-08-16 13:21:57 -07:00
d4c5f561ec Updates torch.clone documentation to be consistent with other functions (#43098)
Summary:
`torch.clone` exists but was undocumented, and the method incorrectly listed `memory_format` as a positional argument. This:

- documents `torch.clone`
- lists `memory_format` as a keyword-only argument
- wordsmiths the documentation
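
A usage sketch of the keyword-only argument:

```python
import torch

x = torch.randn(1, 3, 8, 8).contiguous(memory_format=torch.channels_last)
y = x.clone(memory_format=torch.preserve_format)  # keyword-only
assert y.is_contiguous(memory_format=torch.channels_last)
```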

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43098

Reviewed By: ngimel

Differential Revision: D23153397

Pulled By: mruberry

fbshipit-source-id: c2ea781cdcb8b5ad3f04987c2b3a2f1fe0eaf18b
2020-08-16 04:18:49 -07:00
5bcf9b017a Implement hstack, vstack, dstack (#42799)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349
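
A quick sketch of the new functions:

```python
import torch

a, b = torch.tensor([1, 2, 3]), torch.tensor([4, 5, 6])
torch.hstack((a, b))        # tensor([1, 2, 3, 4, 5, 6])
torch.vstack((a, b)).shape  # torch.Size([2, 3])
torch.dstack((a, b)).shape  # torch.Size([1, 3, 2])
```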

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42799

Reviewed By: izdeby

Differential Revision: D23140704

Pulled By: mruberry

fbshipit-source-id: 6a36363562c50d0abce87021b84b194bb32825fb
2020-08-15 20:39:14 -07:00
8864148823 [jit] DeepAndWide benchmark (#43096)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43096

Add benchmark script for deep and wide model.

Reviewed By: bwasti, yinghai

Differential Revision: D23099925

fbshipit-source-id: aef09d8606eba1eccc0ed674dfea59b890d3648b
2020-08-15 01:27:12 -07:00
91f3114fc1 [JIT] Represent profiled types as a node attribute (#43035)
Summary:
This changes profiled types from being represented as:
`%23 : Float(4:256, 256:1, requires_grad=0, device=cpu) = prim::profile(%0)`
->
`%23 : Tensor = prim::profile[profiled_type=Float(4:256, 256:1, requires_grad=0, device=cpu)](%0)`

Previously, by representing the profiled type in the IR directly it was very easy for optimizations to accidentally use profiled types without inserting the proper guards that would ensure that the specialized type would be seen.

It would be a nice follow up to extend this to prim::Guard as well, however we have short term plans to get rid of prim::Guard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43035

Reviewed By: ZolotukhinM

Differential Revision: D23120226

Pulled By: eellison

fbshipit-source-id: c78d7904edf314dd65d1a343f2c3a947cb721b32
2020-08-14 20:17:46 -07:00
19902f6c0e Document unavailable reduction ops with NCCL backend (#42822)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42822

These ops arent supported with NCCL backend and used to silently error.
We disabled them as part of addressing https://github.com/pytorch/pytorch/issues/41362, so
document that here.
ghstack-source-id: 109957761

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23023046

fbshipit-source-id: 45d69028012e0b6590c827d54b35c66cd17e7270
2020-08-14 19:08:28 -07:00
06aaf8c20d Add set_device_map to TensorPipeOptions to support GPU args (#42637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42637

This commit enables sending non-CPU tensors through RPC using
TensorPipe backend. Users can configure device mappings by calling
set_map_location on `TensorPipeRpcBackendOptions`. Internally,
the `init_rpc` API verifies the correctness of device mappings. It
will shutdown RPC if the check failed, or proceed and pass global
mappings to `TensorPipeAgent` if the check was successful. For serde,
we added a device indices field to TensorPipe read and write buffers,
which should be either empty (all tensors must be on CPU) or match
the tensors in order and number in the RPC message. This commit
does not yet avoid zero-copy, the tensor is always moved to CPU
on the sender and then moved to the specified device on the receiver.
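
A configuration sketch based on the `set_device_map` name from the commit title (the exact signature at this snapshot may differ):

```python
from torch.distributed.rpc import TensorPipeRpcBackendOptions

opts = TensorPipeRpcBackendOptions()
# Tensors sent from this worker's cuda:0 should land on worker1's cuda:1:
opts.set_device_map("worker1", {0: 1})
```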

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D23011572

Pulled By: mrshenli

fbshipit-source-id: 62b617eed91237d4e9926bc8551db78b822a1187
2020-08-14 18:46:55 -07:00
c84f78470b Fix type annotations for a number of torch.utils submodules (#42711)
Summary:
Related issue on `torch.utils` type annotation hiccups: gh-41794

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42711

Reviewed By: mrshenli

Differential Revision: D23005434

Pulled By: malfet

fbshipit-source-id: 151554b1e7582743f032476aeccdfdad7a252095
2020-08-14 18:12:48 -07:00
bcf54f9438 Stop treating ASAN as special case (#43048)
Summary:
Add "asan" node to a `CONFIG_TREE_DATA` rather than hardcoded that non-xla clang-5 is ASAN

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43048

Reviewed By: houseroad

Differential Revision: D23126296

Pulled By: malfet

fbshipit-source-id: 22f02067bb2f5435a0e963a6c722b9c115ccfea4
2020-08-14 17:24:05 -07:00
0cf4a5bccb Add GCC codecoverage flags (#43066)
Summary:
Rename `CLANG_CODE_COVERAGE` option to `CODE_COVERAGE` and add compiler specific flags for GCC and Clang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43066

Reviewed By: scintiller

Differential Revision: D23137488

Pulled By: malfet

fbshipit-source-id: a89570469692f878d84f7da6f9d5dc01df423e80
2020-08-14 17:16:18 -07:00
91b090ceaf Add polygamma where n >= 2 (#42499)
Summary:
https://github.com/pytorch/pytorch/issues/40980

I had a few questions while implementing the polygamma function, so I made this PR before completing it.

1. Some code blocks were brought in from the cephes library (and I did the same):
```
/*
 * The following function comes with the following copyright notice.
 * It has been released under the BSD license.
 *
 * Cephes Math Library Release 2.8:  June, 2000
 * Copyright 1984, 1987, 1992, 2000 by Stephen L. Moshier
 */
```
Is it okay for me to use cephes code with this same copyright notice (already present in the PyTorch codebase)?

2. There is no linting for the internal ATen library (as far as I know; I read https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md).
How do I make sure my code follows the appropriate guidelines of this library?

3. Actually, there are already digamma and trigamma functions.
digamma is needed; however, the trigamma function becomes redundant if a polygamma function is added.
Is it okay for trigamma to stay, or should it be removed?

By the way, the CPU version now works fine with 3rd-order polygamma (which is what we need to play with variational inference on beta/gamma distributions), and I'm going to finish the GPU version soon.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42499

Reviewed By: gchanan

Differential Revision: D23110016

Pulled By: albanD

fbshipit-source-id: 246f4c2b755a99d9e18a15fcd1a24e3df5e0b53e
2020-08-14 17:00:24 -07:00
4011685a8b [fx] split Node into Node/Proxy (#42991)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42991

Having Node be both a record of the operator in the graph and the
way we _build_ the graph made it difficult to keep the IR data structure
separate from the proxying logic in the builder.

Among other issues this means that typos when using nodes would add
things to the graph:
```
    for node in graph.nodes:
        node.grph # does not error, returns a node.Attribute object!
```

This separates the builder into a Proxy object. Graph/Node no longer
need to understand `delegate` objects since they are now just pure IR.
This separates the `symbolic_trace` (proxy.py/symbolic_trace.py) from
the IR (node.py, graph.py).

This also allows us to add `create_arg` to the delegate object,
allowing the customization of how aggregate arguments are handled
when converting to a graph.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23099786

Pulled By: zdevito

fbshipit-source-id: 6f207a8c237e5eb2f326b63b0d702c3ebcb254e4
2020-08-14 16:45:21 -07:00
a1a6e1bc91 Fix warning: dynamic initialization in unreachable code. (#43065)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43065

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D23136883

Pulled By: ZolotukhinM

fbshipit-source-id: 878f6af13ff8df63fef5f34228f7667ee452dd95
2020-08-14 16:08:32 -07:00
66b3382c5b [quant] Add torchbind support for embedding_bag packed weights (#42881)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42881

This enables serialization/de-serialization of embedding packed params using getstate/setstate calls.
Added version number to deal with changes to serialization formats in future.

This can be extended in the future to support 4-bit/2-bit once we add support for that.

Test Plan:
python test/test_quantization.py TestQuantizedEmbeddingBag

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23070634

fbshipit-source-id: 2ca322ab998184c728be6836f9fd12cec98b2660
2020-08-14 16:05:27 -07:00
7632a9b090 [quant] Add embeddingbag_prepack function that works on quantized tensor. (#42762)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42762

Use a prepack function that accepts qtensor as an input. The output is a byte tensor with packed data.
This is currently implemented only for 8-bit. In the future once we add 4-bit support this function will be extended to support that too.

Note -In the following change I will add TorchBind support for this to support serialization of packed weights.

Test Plan:
python test/test_quantization.py TestQuantizedEmbeddingBag

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23070632

fbshipit-source-id: 502aa1302dffec1298cdf52832c9e2e5b69e44a8
2020-08-14 16:02:57 -07:00
450315198a Fix a casting warning (#42451)
Summary:
Fix an annoying casting warning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42451

Reviewed By: yf225

Differential Revision: D22993194

Pulled By: ailzhang

fbshipit-source-id: f317a212d4e768d49d24f50aeff9c003be2fd30a
2020-08-14 15:47:02 -07:00
3d8c144400 Implemented torch::nn::Unflatten in libtorch (#42613)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42613

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D23030302

Pulled By: heitorschueroff

fbshipit-source-id: 954f1cdfcbd3a62a7f0e887fcf5995ef27222a87
2020-08-14 15:32:13 -07:00
33c5fe3c1d Enable test_logit FakeLowP test. (#43073)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43073

Enable test_logit FakeLowP test.

Test Plan: test_op_nnpi_fp16.py

Reviewed By: hyuen

Differential Revision: D23141375

fbshipit-source-id: cb7e7879487e33908b14ef401e1ab05fda193d28
2020-08-14 14:49:29 -07:00
5014cf4a4d Export MergeIdLists Caffe2 Operator to PyTorch
Summary: As titled.

Test Plan: buck test //caffe2/caffe2/python/operator_test:torch_integration_test -- test_merge_id_lists

Reviewed By: yf225

Differential Revision: D23076951

fbshipit-source-id: c37dfd93003590eed70b0d46e0151397a402dde6
2020-08-14 14:46:17 -07:00
c8e789e06e add fake fp16 fusions to net transforms (#42927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42927

added fp16 fusion to net transforms
refactored the transforms as well as glow_transform to get out of opt/custom so that the OSS builds passed

Test Plan: added net runner tests for this

Reviewed By: yinghai

Differential Revision: D23080881

fbshipit-source-id: ee6451811fedfd07c6560c178229854bca29301f
2020-08-14 13:30:27 -07:00
1c6ace87d1 Embed torch.nn typing annotations (#43044)
Summary:
Delete several .pyi files and embed annotations from those files in respective .py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43044

Reviewed By: ezyang

Differential Revision: D23123234

Pulled By: malfet

fbshipit-source-id: 4ba361cc84402352090523924b0035e100ba48b1
2020-08-14 13:24:58 -07:00
fcc10d75e1 [JIT] Add property support to TorchScript classes (#42389)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42389

**Summary**
This commit adds support for properties to TorchScript classes,
specifically for getters and setters. They are implemented essentially
as pointers to the methods that the corresponding decorators decorate,
which are treated like regular class methods. Deleters for properties
are considered to be out of scope (and probably useless for TorchScript
anyway).
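
A minimal sketch of a TorchScript class using a property, per the commit description (class and field names are illustrative):

```python
import torch

@torch.jit.script
class Counter(object):
    def __init__(self):
        self._count = 0

    @property
    def count(self) -> int:       # getter
        return self._count

    @count.setter
    def count(self, value: int):  # setter
        self._count = value
```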

**Test Plan**
This commit adds a unit test for a class with a property that has both
getter and setter and one that has only a getter.

`python test/test_jit.py TestClassType.test_properties`

Test Plan: Imported from OSS

Reviewed By: eellison, ppwwyyxx

Differential Revision: D22880232

Pulled By: SplitInfinity

fbshipit-source-id: 4828640f4234cb3b0d4f3da4872a75fbf519e5b0
2020-08-14 12:56:57 -07:00
64a7684219 Enable typechecking of collect_env.py during CI (#43062)
Summary:
No type annotations can be added to the script, as it still has to be Python-2 compliant.
Make changes to avoid variable type redefinition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43062

Reviewed By: zou3519

Differential Revision: D23132991

Pulled By: malfet

fbshipit-source-id: 360c02e564398f555273e5889a99f834a5467059
2020-08-14 12:46:42 -07:00
1f6d0985d7 fix searchsorted output type (#42933)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41389
Make sure that when searchsorted returns an integer type, the result does not require gradients.
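
A sketch of the fixed behavior:

```python
import torch

boundaries = torch.tensor([1.0, 3.0, 5.0], requires_grad=True)
idx = torch.searchsorted(boundaries, torch.tensor([2.0, 4.0]))
# Integer outputs must not claim to require gradients:
assert idx.dtype == torch.int64 and not idx.requires_grad
```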

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42933

Reviewed By: gchanan

Differential Revision: D23109583

Pulled By: albanD

fbshipit-source-id: 5af300b2f7f3c140d39fd7f7d87799f7b93a79c1
2020-08-14 12:34:51 -07:00
059aa34b12 Clip Binomial results for different endpoints in curand_uniform (#42702)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42153

As [documented](https://docs.nvidia.com/cuda/curand/device-api-overview.html) (search for `curand_uniform` on the page), `curand_uniform` returns "from 0.0 to 1.0, where 1.0 is included and 0.0 is excluded." These endpoints are different than the CPU equivalent, and makes the calculation in the PR fail when the value is 1.0.

The test from the issue is added, it failed for me consistently before the PR even though I cut the number of samples by 10.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42702

Reviewed By: gchanan

Differential Revision: D23107451

Pulled By: ngimel

fbshipit-source-id: 3575d5b8cd5668e74b5edbecd95154b51aa485a1
2020-08-14 12:01:17 -07:00
71bbd5f1d4 Add back Tensor.nonzero type annotation (#43053)
Summary:
Closes gh-42998

The issue is marked for 1.6.1; if there's anything I need to do for a backport, please tell me what that is.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43053

Reviewed By: izdeby

Differential Revision: D23131708

Pulled By: malfet

fbshipit-source-id: 2744bacce6bdf6ae463c17411b672f09707e0887
2020-08-14 11:41:19 -07:00
75dfa5a459 Remove itruediv because it's already defined in torch/tensor.py (#42962)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42955

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42962

Reviewed By: mruberry

Differential Revision: D23111523

Pulled By: malfet

fbshipit-source-id: ecab7a4aae1fe556753b8d6528cae1ae201beff3
2020-08-14 11:36:23 -07:00
1c616c5ab7 Add complex tensor dtypes for the __cuda_array_interface__ spec (#42918)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42860

The `__cuda_array_interface__` tensor specification is missing the appropriate datatypes for the newly merged complex64 and complex128 tensors. This PR addresses this issue by casting:

* `torch.complex64` to 'c8'
* `torch.complex128` to 'c16'
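
A quick check of the exported dtype string; note that the typestr also carries a byte-order prefix (e.g. '<' on little-endian):

```python
import torch

t = torch.zeros(4, dtype=torch.complex64, device='cuda')
print(t.__cuda_array_interface__['typestr'])  # e.g. '<c8'
```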

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42918

Reviewed By: izdeby

Differential Revision: D23130219

Pulled By: anjali411

fbshipit-source-id: 5f8ee8446a71cad2f28811afdeae3a263a31ad11
2020-08-14 10:26:23 -07:00
c3fb152274 Test the type promotion between every two dtypes thoroughly (#42585)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41842

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42585

Reviewed By: izdeby

Differential Revision: D23126759

Pulled By: mruberry

fbshipit-source-id: 8337e02f23a4136c2ba28c368f8bdbd28400de44
2020-08-14 10:05:10 -07:00
ff6a2b0b7a Add inplace option for torch.nn.Hardsigmoid and torch.nn.Hardswish layers (#42346)
Summary:
The **`torch.nn.Hardsigmoid`** and **`torch.nn.Hardswish`** classes currently do not support `inplace` operations, because they use the `torch.nn.functional.hardsigmoid` and `torch.nn.functional.hardswish` functions with their default inplace argument, which is `False`.

So, I added an `inplace` argument to the `torch.nn.Hardsigmoid` and `torch.nn.Hardswish` classes so that the forward operation can also be done in place when using these layers.
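
A usage sketch of the new argument:

```python
import torch

m = torch.nn.Hardswish(inplace=True)
x = torch.randn(4)
y = m(x)
assert y.data_ptr() == x.data_ptr()  # forward ran in place
```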

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42346

Reviewed By: izdeby

Differential Revision: D23108487

Pulled By: albanD

fbshipit-source-id: 0767334fa10e5ecc06fada2d6469f3ee1cacd957
2020-08-14 10:01:31 -07:00
2f9fd8ad29 Build test_e2e_tensorpipe only if Gloo is enabled (#43041)
Summary:
test_e2e_tensorpipe depends on ProcessGroupGloo, and therefore cannot be tested with Gloo disabled.
Otherwise, it re-introduces https://github.com/pytorch/pytorch/issues/42776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43041

Reviewed By: lw

Differential Revision: D23122101

Pulled By: malfet

fbshipit-source-id: a8a088b6522a3bc888238ede5c2d589b83c6ea94
2020-08-14 09:24:47 -07:00
31788ae151 Trim trailing whitespace
Test Plan: CI

Reviewed By: linbinyu

Differential Revision: D23108919

fbshipit-source-id: 913c982351a94080944f350641d7966c6c2cc508
2020-08-14 09:18:40 -07:00
a2b86d95d1 Make Mish support large inputs. (#43037)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43037

In the previous version of mish_op.cc, the output would be 'nan' for large inputs. We rewrote mish_op.cc to solve this problem.

Test Plan:
Unit test
buck test //dper3/dper3/modules/tests:core_modules_test -- test_linear_compress_embedding_with_attention_with_activation_mish
{F284052906}

buck test mode/opt //dper3/dper3_models/ads_ranking/tests:model_paradigm_e2e_tests -- test_sparse_nn_with_mish
{F284224158}

## Workflow
f212113434

{F285281318}

Differential Revision: D23102644

fbshipit-source-id: 98f1ea82f8c8e05b655047b4520c600fc1a826f4
2020-08-14 08:53:16 -07:00
c7d2774d20 Fix typo in collect_env.py (#43050)
Summary:
Minor fix for a typo introduced in yesterday's PR: https://github.com/pytorch/pytorch/pull/42961

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43050

Reviewed By: ezyang, malfet

Differential Revision: D23130936

Pulled By: zou3519

fbshipit-source-id: e8fa2bf155ab6a5988c74e8345278d8d70855894
2020-08-14 08:33:35 -07:00
d60d6d0d7b Automated submodule update: FBGEMM (#42834)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 29d5eb9f3c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42834

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D23040145

fbshipit-source-id: 1d7209ea1910419b7837703122b8a4c76380ca4a
2020-08-14 05:43:20 -07:00
ed242cbec5 Guard TensorPipe agent by USE_TENSORPIPE (#42682)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42682

ghstack-source-id: 109834351

Test Plan: CI

Reviewed By: malfet

Differential Revision: D22978717

fbshipit-source-id: 18b7cbdb532e78ff9259e82f0f92ad279124419d
2020-08-14 02:57:36 -07:00
ccd9f3244b Get, save, and load module information for each operator (#42133)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42133

Test Plan:
We save a module with module debugging information as follows.
```
import torch
m = torch.jit.load('./detect.pt')
# Save module without debug info
m._save_for_lite_interpreter('./detect.bc')
# Save module with debug info
m._save_for_lite_interpreter('./detect.bc', _save_debug_info_in_bytecode=True)
```
Size of the file without module debugging information: 4.508 MB
Size of the file with module debugging information: 4.512 MB

Reviewed By: kimishpatel

Differential Revision: D22803740

Pulled By: taivu1998

fbshipit-source-id: c82ea62498fde36a1cfc5b073e2cea510d3b7edb
2020-08-14 01:25:27 -07:00
e182ec97b3 Fix illegal memory access issue for CUDA version of SplitByLengths operator.
Summary:
1. Fix illegal memory access issue for the SplitByLengths operator in the CUDA context.
2. Add support for scaling lengths vectors in the SplitByLengths operator.
3. Add support for testing the SplitByLengths operator in the CUDA context.

Example for SplitByLengths operator processing scaling lengths vector:
value vector A = [1, 2, 3, 4, 5, 6]
length vector B = [1, 2]
after execution of SplitByLengths operator,
the output should be [1,2] and [3,4,5,6]

Test Plan: buck test mode/dev-nosan caffe2/caffe2/python/operator_test:concat_split_op_test

Reviewed By: kennyhorror

Differential Revision: D23079841

fbshipit-source-id: 3700e7f2ee0a5a2791850071fdc16e5b054f8400
2020-08-14 01:04:08 -07:00
b8102b1550 Implement torch.nextafter (#42580)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349.
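
A quick usage sketch:

```python
import torch

one, two = torch.tensor([1.0]), torch.tensor([2.0])
# The smallest representable float strictly greater than 1.0:
print(torch.nextafter(one, two))
```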

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42580

Reviewed By: smessmer

Differential Revision: D23012260

Pulled By: mruberry

fbshipit-source-id: ce82a63c4ad407ec6ffea795f575ca7c58cd6137
2020-08-14 00:35:30 -07:00
e4373083a2 torch.complex and torch.polar (#39617)
Summary:
For https://github.com/pytorch/pytorch/issues/35312 and https://github.com/pytorch/pytorch/issues/38458#issuecomment-636066256.
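
A quick usage sketch of the two constructors:

```python
import math
import torch

z = torch.complex(torch.tensor([1.0]), torch.tensor([3.0]))  # 1+3j
w = torch.polar(torch.tensor([2.0]), torch.tensor([math.pi]))
# w == abs * (cos(angle) + i*sin(angle)), i.e. roughly -2+0j
```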

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39617

Reviewed By: zhangguanheng66

Differential Revision: D23083926

Pulled By: anjali411

fbshipit-source-id: 1874378001efe2ff286096eaf1e92afe91c55b29
2020-08-14 00:30:11 -07:00
b9a105bcc0 [TensorExpr] Cleanup logic in the TensorExpr fuser pass. (#42938)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42938

1. Structure the logic in a more straightforward way: instead of magic
   tricks with node iterators in a block, we now have a function that
   tries to create a fusion group starting from a given node (and pulls
   everything it can into it).
2. The order in which we're pulling nodes into a fusion group is now
   more apparent.
3. The new pass structure automatically allows us to support fusion
   groups of size=1.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23084409

Pulled By: ZolotukhinM

fbshipit-source-id: d59fc00c06af39a8e1345a4aed8d829494db084c
2020-08-13 23:49:42 -07:00
fc304bec9f [TensorExpr] Remove redundant checks from canHandle in TE fuser. (#42937)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42937

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23084408

Pulled By: ZolotukhinM

fbshipit-source-id: 8e562e25ecc73b4e7b01e30f8b282945b96b4871
2020-08-13 23:49:40 -07:00
48c183af3d [TensorExpr] Wrap fuser in a class. (#42936)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42936

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23084407

Pulled By: ZolotukhinM

fbshipit-source-id: f622874efbcbf8d4e49c8fa519a066161ebe4877
2020-08-13 23:48:16 -07:00
02c8ad70f2 Reconstruct scopes (#41615)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41615

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D22611331

Pulled By: taivu1998

fbshipit-source-id: d4ed4cf6360bc1f72ac9fa24bb4fcf6b7d9e7576
2020-08-13 22:38:16 -07:00
3dc845319f Add more verbose error message about PackedSequence lengths argument (#42891)
Summary:
Add the given tensor's dimensionality, device, and dtype to the error message

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42891

Reviewed By: ezyang

Differential Revision: D23068769

Pulled By: malfet

fbshipit-source-id: e49d0a5d0c10918795c1770b4f4e02494d799c51
2020-08-13 22:33:34 -07:00
b992a927a9 Clearer Semantics and Naming for Customized Quantization Range Initialization in Observer (#42602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42602

In this diff, clearer semantics and naming are introduced by splitting the original `init_dynamic_qrange` into 2 separate `Optional[int]` types, `qmin` and `qmax`, to avoid confusing these parameters with dynamic quantization.

The `qmin` and `qmax` parameters allow customers to specify their own custom quantization range and enable specific use cases for lower-bit quantization.

Test Plan:
To assert the correctness and compatibility of the changes with existing observers, on a devvm, execute the following command to run the unit tests:

`buck test //caffe2/test:quantization -- observer`

Reviewed By: vkuzo, raghuramank100

Differential Revision: D22948334

fbshipit-source-id: 275bc8c9b5db4ba76fc2e79ed938376ea4f5a37c
2020-08-13 21:15:23 -07:00
a55b7e2a6d [reland][quant][fix] Remove activation_post_process in qat modules (#42343) (#43015)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43015

Currently, activation_post_process modules are inserted by default in qat modules, which is not
friendly to automatic quantization tools; this PR removes them.

Test Plan:
Imported from OSS

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23105059

fbshipit-source-id: 3439ac39e718ffb0390468163bcbffd384802b57
2020-08-13 20:44:14 -07:00
8cf01c5c35 Back out "change pt_defs.bzl to python file"
Summary: Original commit changeset: d720fe2e684d

Test Plan: CIs

Reviewed By: linbinyu

Differential Revision: D23114839

fbshipit-source-id: fda570b5e989a51936a6c5bc68f0e60c6f6b4b82
2020-08-13 20:33:12 -07:00
830423b80b Python/C++ API Parity: TransformerDecoderLayer (#42717)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37756

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42717

Reviewed By: zhangguanheng66

Differential Revision: D23095841

Pulled By: glaringlee

fbshipit-source-id: 327a5a23c9a3cca05e422666a6d7d802a7e8c468
2020-08-13 20:31:13 -07:00
85752b989d [quant][doc] Print more info for fake quantize module (#43031)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43031

fixes: https://github.com/pytorch/pytorch/issues/43023

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23116200

fbshipit-source-id: faa90ce8711da0785d635aacd0362c45717cfacc
2020-08-13 20:27:36 -07:00
523b2ce9c6 [jit][static runtime] Simplify the graph and add operator whitelist (#43024)
Summary:
This PR whitelists and simplifies graphs to help with development later on.  Key to note in this PR is the use of both a pattern substitution and the registration of custom operators.  This will likely be one of the main optimization types done in this folder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43024

Reviewed By: hlu1

Differential Revision: D23114262

Pulled By: bwasti

fbshipit-source-id: e25aa3564dcc8a2b48cfd1561b3ee2a4780ae462
2020-08-13 20:19:55 -07:00
89b0b3bc8c Allow RPC to be initialized again after shutdown. (#42723)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42723

This PR is addressing https://github.com/pytorch/pytorch/issues/39340
and allows users to initialize RPC again after shutdown. Major changes in the
PR include:

1. Change to DistAutogradContainer to support this.
2. Ensure PythonRpcHandler is reinitialized appropriately.
3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a
different prefix.
ghstack-source-id: 109805368

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D22993909

fbshipit-source-id: 9f1c1e0a58b58b97125f41090601e967f96f70c6
2020-08-13 20:18:34 -07:00
21823aa680 Nightly checkout tool (#42635)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40829

This is cross-platform but I have only tried it on linux, personally. Also, I am not fully certain of the usage pattern, so if there are any additional features / adjustments / tests that you want me to add, please just let me know!

CC ezyang rgommers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42635

Reviewed By: zhangguanheng66

Differential Revision: D23078663

Pulled By: ezyang

fbshipit-source-id: 5c8c8abebd1d462409c22dc4301afcd8080922bb
2020-08-13 20:07:18 -07:00
a6b69fdd33 Add DDP+RPC tutorial to RPC docs page. (#42828)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42828

ghstack-source-id: 109855425

Test Plan: waitforbuildbot

Reviewed By: jlin27

Differential Revision: D23037016

fbshipit-source-id: 250f322b652b86257839943309b8f0b8ce1bb25b
2020-08-13 19:41:06 -07:00
3544f60f76 make deadline=None for all numerics tests (#43014)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43014

changing this behavior mimics the behavior of the old hypothesis
testing library

Test Plan: ran all tests on devserver

Reviewed By: hl475

Differential Revision: D23085949

fbshipit-source-id: 433fdfbb04b6a609b738eb7c319365049a49579b
2020-08-13 16:48:31 -07:00
8b5642a786 Fix to Learnable Fake Quantization Op Benchmarking (#43018)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43018

In this diff, a fix is added: the original non-learnable fake quantize was provided with trainable scale and zero point, whereas requires_grad for both parameters should be completely disabled.

Test Plan:
Use the following command to execute the benchmark test:

`buck test mode/dev-nosan pt:quantization_test`

Reviewed By: vkuzo

Differential Revision: D23107846

fbshipit-source-id: d2213983295f69121e9e6ae37c84d1f37d78ef39
2020-08-13 16:32:13 -07:00
6753157c5a Enable torch.utils typechecks (#42960)
Summary:
Fix typos in torch.utils/_benchmark/README.md.
Add an empty __init__.py to the examples folder to make the example invocations from README.md correct.
Fixed the uniform distribution generation logic when minval and maxval are None.

Fixes https://github.com/pytorch/pytorch/issues/42984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42960

Reviewed By: seemethere

Differential Revision: D23095399

Pulled By: malfet

fbshipit-source-id: 0546ce7299b157d9a1f8634340024b10c4b7e7de
2020-08-13 15:24:56 -07:00
eb47940c0a Add executor and fuser options to the fastrnn test fixture (#42946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42946

There are 3 options for the executor and fuser and some of them aren't
super interesting so I've combined the options into a single parameter, but
made it fairly easy to expand the set if there are other configs we might care
about.

Test Plan:
Benchmark it

Imported from OSS

Reviewed By: zheng-xq

Differential Revision: D23090177

fbshipit-source-id: bd93a93c3fc64e5a4a847d1ce7f42ce0600a586e
2020-08-13 12:45:37 -07:00
fd5ed4b6d6 Update ort-nightly version to dev202008122 (#43019)
Summary:
Fixes caffe2_onnx_ort1_py3_6_clang7_ubuntu16_04 test failures

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43019

Reviewed By: gchanan

Differential Revision: D23108767

Pulled By: malfet

fbshipit-source-id: 0131cf4ac0bf93d3d93cb0c97a888f1524e87472
2020-08-13 11:40:16 -07:00
816d37b1d8 [quant] Make PerChannel Observer work with float qparams (#42690)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42690

Add implementation for new qscheme per_channel_affine_float_qparams in observer

Test Plan:
python test/test_quantization.py TestObserver.test_per_channel_observers

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23070633

fbshipit-source-id: 84d348b0ad91e9214770131a72f7adfd3970349c
2020-08-13 11:22:19 -07:00
6f8446840e [quant] Create PerRowQuantizer for floating point scale and zero_point (#42612)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42612

Add a new Quantizer that supports an input zero point (bias) that can be float.
The quantization equation in this case is

Xq = (Xf - bias) * inv_scale, where bias is float zero_point value
We start with a per-row implementation and can extend to per-tensor in the future, if necessary.
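
A numeric sketch of the equation above (values illustrative):

```python
import torch

xf = torch.tensor([[0.5, 1.5, 2.5]])
bias = torch.tensor([[0.5]])       # float zero point, per row
inv_scale = torch.tensor([[2.0]])
xq = (xf - bias) * inv_scale       # tensor([[0., 2., 4.]])
```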

Test Plan:
python test/test_quantization.py TestQuantizedTensor

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D22960142

fbshipit-source-id: ca9ab6c5b45115d3dcb1c4358897093594313706
2020-08-13 11:20:53 -07:00
0ff51accd8 collect_env.py: Print CPU architecture after Linux OS name (#42961)
Summary:
Missed this case in https://github.com/pytorch/pytorch/pull/42887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42961

Reviewed By: zou3519

Differential Revision: D23095264

Pulled By: malfet

fbshipit-source-id: ff1fb0eba9ecd29bfa3d8f5e4c3dcbcb11deefcb
2020-08-13 10:49:15 -07:00
ebc7ebc74e Do not ignore torch/__init__.pyi (#42958)
Summary:
Delete the abovementioned entry from .gitignore, as the file is gone since https://github.com/pytorch/pytorch/issues/42908 and should no longer be autogenerated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42958

Reviewed By: seemethere

Differential Revision: D23094391

Pulled By: malfet

fbshipit-source-id: af303477301ae89d6f283e34d7aeddeda7a9260f
2020-08-13 10:29:58 -07:00
6fb5ce5569 [NNC] Fix some bugs in Round+Mod simplification (#42934)
Summary:
When working on the Cuda Codegen, I found that running the IRSimplifier before generating code led to test failures. This was due to a bug in Round+Mod simplification (e.g. (x / y * y) + (x % y) => x) related to the order in which the terms appeared. After fixing it and writing a few tests around those cases, I found another bug in simplification of the same pattern and fixed that too (with some more test coverage).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42934

Reviewed By: zhangguanheng66

Differential Revision: D23085548

Pulled By: nickgg

fbshipit-source-id: e780967dcaa7a5fda9f6d7d19a6b7e7b4e94374b
2020-08-13 09:47:21 -07:00
f03f9ad621 update clone doc (#42931)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42931

Reviewed By: zhangguanheng66

Differential Revision: D23083000

Pulled By: albanD

fbshipit-source-id: d76d90476ca294763f204c185a62ff6484381c67
2020-08-13 08:45:46 -07:00
ba9025bc1a [tensorexpr] Autograd for testing (#42548)
Summary:
A simple differentiable abstraction to allow testing of full training graphs.

Included in this 1st PR is an example of trivial differentiation.

If approved, I can add a full MLP and demonstrate convergence using purely NNC (for performance testing) in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42548

Reviewed By: ZolotukhinM

Differential Revision: D23057920

Pulled By: bwasti

fbshipit-source-id: 4a239852c5479bf6bd20094c6c35f066a81a832e
2020-08-13 07:58:06 -07:00
607e49cc83 Revert D22856816: [quant][fix] Remove activation_post_process in qat modules
Test Plan: revert-hammer

Differential Revision:
D22856816 (8cb42fce17)

Original commit changeset: 988a43bce46a

fbshipit-source-id: eff5b9abdfc15b21c02c61eefbda38d349173436
2020-08-13 07:22:20 -07:00
8493b0d5d6 Enroll TensorPipe agent in C++-only E2E test (#42680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42680

ghstack-source-id: 109544678

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D22978714

fbshipit-source-id: 04d6d190c240c6ead9bd9f3b7f3a5f964d7451e8
2020-08-13 07:07:30 -07:00
c88d3a5e76 Remove Python dependency from TensorPipe RPC agent (#42678)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42678

ghstack-source-id: 109544679

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D22978716

fbshipit-source-id: 31f91d35e9538375b047184cf4a735e4b8809a15
2020-08-13 07:06:10 -07:00
d39cb84f1f [Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D23102075

fbshipit-source-id: afb89e061bb9c290df7cf4c58157fc8d67fe78ad
2020-08-13 05:14:21 -07:00
c9dcc833bc [quant][pyper] Make offsets an optional parameter in the qembedding_bag op (#42924)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42924

offsets is currently an optional parameter in the python module, so we update the operator to follow suit
in order to avoid bad optional access.

Test Plan:
python test/test_quantization.py TestQuantizeDynamicJitOps.test_embedding_bag

Imported from OSS

Reviewed By: radkris-git

Differential Revision: D23081152

fbshipit-source-id: 847b58f826f5a18e8d4978fc4afc6f3a96dc4230
2020-08-12 20:25:44 -07:00
8cb42fce17 [quant][fix] Remove activation_post_process in qat modules (#42343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42343

Currently, activation_post_process modules are inserted by default in qat modules, which is not
friendly to automatic quantization tools; this PR removes them.

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D22856816

fbshipit-source-id: 988a43bce46a992b38fd0d469929f89e5b046131
2020-08-12 20:14:23 -07:00
7a7424bf91 Remove impl_unboxedOnlyKernel (#42841)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42841

There is nothing using those APIs anymore. While we still have ops that require an unboxedOnly implementation (i.e. that aren't c10-full yet), those are all already migrated to the new op registration API and use `.impl_UNBOXED()`.
ghstack-source-id: 109693705

Test Plan: waitforsandcastle

Reviewed By: bhosmer

Differential Revision: D23045335

fbshipit-source-id: d8e15cea1888262135e0d1d94c515d8a01bddc45
2020-08-12 17:35:09 -07:00
20e0e54dbe Allow Tensor& in the unboxing logic (#42712)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42712

Previously, operators taking Tensor& as arguments or returning it couldn't be c10-full because the unboxing logic didn't support it.
This adds temporary support for that. We're planning to remove this again later, but for now we need it to make those ops c10-full.
See https://docs.google.com/document/d/19thMVO10yMZA_dQRoB7H9nTPw_ldLjUADGjpvDmH0TQ for the full plan.

This PR also makes some ops c10-full that now can be.
ghstack-source-id: 109693706

Test Plan: unit tests

Reviewed By: bhosmer

Differential Revision: D22989242

fbshipit-source-id: 1bd97e5fa2b90b0860784da4eb772660ca2db5a3
2020-08-12 17:33:23 -07:00
5d2e9b6ed9 Add missing type annotation for Tensor.ndim (#42909)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42908

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42909

Reviewed By: zhangguanheng66

Differential Revision: D23090364

Pulled By: malfet

fbshipit-source-id: 44457fddc86f6abde635aa671e7611b405780ab9
2020-08-12 17:14:20 -07:00
b8ae563ce6 Add a microbenchmark for LSTM elementwise portion (#42901)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42901

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23079714

Pulled By: bertmaher

fbshipit-source-id: 28f8c3b5019ee898e82e64a0a674da1b4736d252
2020-08-12 17:11:47 -07:00
33d209b5f4 Fix TE microbenchmark harness to use appropriate fuser/executor (#42900)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42900

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23079715

Pulled By: bertmaher

fbshipit-source-id: 6aa2b08a550835b7737e355960a16a7ca83878ea
2020-08-12 17:11:44 -07:00
1adeed2720 Speed up CUDA kernel launch when block/thread extents are statically known (#42899)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42899

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23078708

Pulled By: bertmaher

fbshipit-source-id: 237404b47a31672d7145d70996868a3b9b97924e
2020-08-12 17:10:30 -07:00
f373cda021 Revert D22994446: [pytorch][PR] CUDA reduction: allow outputs to have different strides
Test Plan: revert-hammer

Differential Revision:
D22994446 (7f3f5020e6)

Original commit changeset: cc60beebad2e

fbshipit-source-id: f4635deac386db0c161f910760cace09f15a1ff9
2020-08-12 17:05:04 -07:00
86841f5f61 Update cuda init docstring to improve clarity (#42923)
Summary:
A small clarity improvement to the cuda init docstring

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42923

Reviewed By: zhangguanheng66

Differential Revision: D23080693

Pulled By: mrshenli

fbshipit-source-id: aad5ed9276af3b872c1def76c6175ee30104ccb2
2020-08-12 15:41:28 -07:00
0134deda0f [FX] Add interface to reject nodes (#42865)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42865

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23056584

Pulled By: jamesr66a

fbshipit-source-id: 02db08165ab41be5f3c4b5ff253cbb444eb9a7b8
2020-08-12 14:30:06 -07:00
92885ebe16 Implement hypot (#42291)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349
Closes https://github.com/pytorch/pytorch/issues/22764
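
A quick usage sketch:

```python
import torch

torch.hypot(torch.tensor([3.0]), torch.tensor([4.0]))  # tensor([5.])
```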

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42291

Reviewed By: malfet

Differential Revision: D22951859

Pulled By: mruberry

fbshipit-source-id: d0118f2b6437e5c3f775f699ec46e946a8da50f0
2020-08-12 13:18:26 -07:00
62bd2ddec7 Implemented non-named version of unflatten (#42563)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42563

Moved the logic for the non-named unflatten from the python nn module to aten/native, to be reused by the nn module later. Fixed some inconsistencies between the doc and the code logic.
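
A sketch of the non-named call (shapes illustrative):

```python
import torch

x = torch.randn(2, 12)
assert x.unflatten(1, (3, 4)).shape == (2, 3, 4)
```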

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23030301

Pulled By: heitorschueroff

fbshipit-source-id: 7c804ed0baa5fca960a990211b8994b3efa7c415
2020-08-12 13:14:28 -07:00
7f3f5020e6 CUDA reduction: allow outputs to have different strides (#42649)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42364

Benchmark:
https://github.com/zasdfgbnm/things/blob/master/2020Q3/min-benchmark.ipynb
```python
import torch

print(torch.__version__)
print()

for i in range(100):
    torch.randn(1000, device='cuda')

for e in range(7, 15):
    N = 2 ** e
    input_ = torch.randn(N, N, device='cuda')
    torch.cuda.synchronize()
    %timeit input_.min(dim=0); torch.cuda.synchronize()
    input_ = torch.randn(N, N, device='cuda').t()
    torch.cuda.synchronize()
    %timeit input_.min(dim=0); torch.cuda.synchronize()
    print()
```
Before
```
1.7.0a0+5d7c3f9

21.7 µs ± 1.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.6 µs ± 773 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

22.5 µs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.2 µs ± 250 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

26.4 µs ± 67 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.9 µs ± 316 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

33 µs ± 474 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
21.1 µs ± 218 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

84.2 µs ± 691 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
50.3 µs ± 105 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

181 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
145 µs ± 149 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

542 µs ± 753 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
528 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

2.04 ms ± 9.74 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.01 ms ± 22.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
After
```
1.7.0a0+9911817

21.4 µs ± 695 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.6 µs ± 989 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

22.4 µs ± 153 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.5 µs ± 58.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

26.6 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.9 µs ± 675 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

35.4 µs ± 560 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
21.7 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

86.5 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
52.2 µs ± 1.57 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

195 µs ± 2.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
153 µs ± 4.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

550 µs ± 7.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
527 µs ± 3.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

2.05 ms ± 7.87 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2 ms ± 4.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42649

Reviewed By: ezyang

Differential Revision: D22994446

Pulled By: ngimel

fbshipit-source-id: cc60beebad2e04c26ebf3ca702a6cb05846522c9
2020-08-12 13:09:36 -07:00
ada8404f2d [jit] Scaffold a static runtime (#42753)
Summary:
The premise of this approach is that a small subset of neural networks is well represented by a data flow graph.  The README contains more information.

The name is subject to change, but I thought it was a cute reference to fire.

suo let me know if you'd prefer this in a different spot.  Since it lowers a JIT'd module directly I assumed the JIT folder would be appropriate.  There is no exposed Python interface yet (but is mocked up in `test_accelerant.py`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42753

Reviewed By: zou3519

Differential Revision: D23043771

Pulled By: bwasti

fbshipit-source-id: 5353731e3aae31c08b5b49820815da98113eb551
2020-08-12 13:05:27 -07:00
59f8692350 [pytorch] BUCK build for Vulkan backend
Summary:
Introducing `//xplat/caffe2:aten_vulkan` target which contains pytorch Vulkan backend and its ops.

 `//xplat/caffe2:aten_vulkan` depends on ` //xplat/caffe2:aten_cpu`

Merely including it in the link step registers the Vulkan backend and its ops.

**Code generation:**
1. `VulkanType.h`, `VulkanType.cpp`
Tensor Types for Vulkan backend are generated by `//xplat/caffe2:gen_aten_vulkan` which runs aten code generation (`aten/src/ATen/gen.py`) with `--vulkan` argument.

2. Shaders compilation
`//xplat/caffe2:gen_aten_vulkan_spv`  genrule runs `//xplat/caffe2:gen_aten_vulkan_spv_bin` which is a wrapper on `aten/src/ATen/native/vulkan/gen_spv.py`

GLSL files are listed in `aten/src/ATen/native/vulkan/glsl/*`, and compiling them requires `glslc` (the GLSL compiler).

`glslc` is open source (https://github.com/google/shaderc), but it has a few dependencies on other libraries, so porting its build to BUCK would take a significant amount of time.

To use `glslc` in BUCK, this introduces the dotslash `xplat/caffe2/fb/vulkan/dotslash/glslc`, which points to the latest prebuilt `glslc` binaries from the ANDROID_NDK for Linux, macOS, and Windows, stored on Manifold.

Not using it from the ANDROID_NDK directly allows updating it without depending on the NDK.

Test Plan:
Building aten_vulkan target:
```
buck build //xplat/caffe2:aten_vulkan
```

Building vulkan_test that contains vulkan unittests for android:
```
buck build //xplat/caffe2:pt_vulkan_test_binAndroid#android-armv7
```
And running it on the device with vulkan support.

Reviewed By: iseeyuan

Differential Revision: D22770299

fbshipit-source-id: 843af8df226d4b5395b8e480eb47b233d57201df
2020-08-12 10:34:41 -07:00
ea65a56854 Use `string(APPEND FOO " bar")` instead of `set(FOO "${FOO} bar")` (#42844)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42844

Reviewed By: scintiller

Differential Revision: D23067577

Pulled By: malfet

fbshipit-source-id: e4380ce02fd6aca37c955a7bc24435222c5d8b19
2020-08-12 10:33:11 -07:00
3d3752d716 Revert D22898051: [pytorch][PR] Fix freeze_module pass for sharedtype
Test Plan: revert-hammer

Differential Revision:
D22898051 (4665f3fc8d)

Original commit changeset: 8b1d80f0eb40

fbshipit-source-id: 4dc0ba274282a157509db16df13269eed6cd5be9
2020-08-12 10:28:03 -07:00
bda0007620 Improve calling backward() and grad() inside vmap error messages (#42876)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42876

Previously, the error messages were pretty bad. This PR adds nice
error messages for the following cases:
- user attempts to call .backward() inside vmap for any reason
whatsoever
- user attempts to call autograd.grad(outputs, inputs, grad_outputs),
where outputs or inputs is being vmapped over (so they are
BatchedTensors).

The case we do support is calling autograd.grad(outputs, inputs,
grad_outputs) where `grad_outputs` is being vmapped over. This is the
case for batched gradient support (e.g., user passes in a batched
grad_output).
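
To make the supported case concrete, here is a minimal sketch that computes a Jacobian by looping over grad_outputs rows; the vmap form shown in the comment uses the era's prototype API and is an assumption:

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x.exp()

def vjp(v):
    # autograd.grad with a batched/vmapped grad_outputs `v` is the supported case
    return torch.autograd.grad(y, x, v, retain_graph=True)[0]

basis = torch.eye(3)
jacobian = torch.stack([vjp(v) for v in basis])  # explicit loop version
# Prototype vmap form of the same computation (assumed API):
# jacobian = torch.vmap(vjp)(basis)
print(jacobian)
```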

Test Plan: - new tests: `pytest test/test_vmap.py -v`

Reviewed By: ezyang

Differential Revision: D23059836

Pulled By: zou3519

fbshipit-source-id: 2fd4e3fd93f558e67e2f0941b18f0d00d8ab439f
2020-08-12 10:05:31 -07:00
5c39146c34 Fix get_writable_path (#42895)
Summary:
As the name suggests, this function should always return a writable path.
Calls `mkdtemp` to create a temp folder if the path is not writable.

This fixes `TestNN.test_conv_backcompat` if PyTorch is installed in non-writable location

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42895

Reviewed By: dzhulgakov

Differential Revision: D23070320

Pulled By: malfet

fbshipit-source-id: ed6a681d46346696a0de7e71f0b21cba852a964e
2020-08-12 09:38:24 -07:00
5157afcf59 fix int8 FC (#42691)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42691

Fix quantization of FC bias to match NNPI:
quantize biases to fp16.

Test Plan: improved the unit test to have input tensors in fp32

Reviewed By: tracelogfb

Differential Revision: D22941521

fbshipit-source-id: 00afb70610f8a149110344d52595c39e3fc988ab
2020-08-12 09:30:34 -07:00
686705c98b Optimize LayerNorm performance on CPU both forward and backward (#35750)
Summary:
This PR aims at improving `LayerNorm` performance on CPU for both forward and backward.

Results on Xeon 6248:
1. single socket inference **1.14x** improvement
2. single core inference **1.77x** improvement
3. single socket training **6.27x** improvement

The fine tuning of GPT2 on WikiTest2 dataset time per iteration on dual socket reduced from **4.69s/it** to **3.16s/it**, **1.48x** improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/35750

Reviewed By: zhangguanheng66

Differential Revision: D20810026

Pulled By: glaringlee

fbshipit-source-id: c5801bd76eb944f2e46c2fe4991d9ad4f40495c3
2020-08-12 09:17:20 -07:00
75a15d3d01 Follow-up for pytorch/pytorch#37091. (#42806)
Summary:
This is a follow-up PR for https://github.com/pytorch/pytorch/issues/37091, fixing some of the quirks of that PR as that one was landed early to avoid merge conflicts.

This PR addresses the following action items:

- [x] Use error-handling macros instead of a `try`-`catch`.
- [x] Renamed and added comments to clarify the use of `HANDLED_FUNCTIONS_WRAPPERS` in tests. `HANDLED_FUNCTIONS_NAMESPACES` was already removed in the last PR as we had a way to test for methods.

This PR does NOT address the following action item, as it proved to be difficult:

- [ ] Define `__module__`  for whole API.

Single-line repro-er for why this is hard:

```python
>>> torch.Tensor.grad.__get__.__module__ = "torch.Tensor.grad"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'method-wrapper' object has no attribute '__module__'
```

Explanation: Methods  defined in C/properties don't always have a `__dict__` attribute or a mutable `__module__` slot for us to modify.

The documentation action items were addressed in the following commit, with the additional future task of adding the rendered RFCs to the documentation: 552ba37c05

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42806

Reviewed By: smessmer

Differential Revision: D23031501

Pulled By: ezyang

fbshipit-source-id: b781c97f7840b8838ede50a0017b4327f96bc98a
2020-08-12 09:11:33 -07:00
2878efb35d Use C10_API_ENUM to fix invalid attribute warnings (#42464)
Summary:
Using the macro added in https://github.com/pytorch/pytorch/issues/38988 to fix more attribute warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42464

Reviewed By: malfet

Differential Revision: D22916943

Pulled By: ezyang

fbshipit-source-id: ab9ca8755cd8b89aaf7f8718b4107b4b94d95005
2020-08-12 09:02:49 -07:00
2f1baf6c25 Fix coding style and safety issues in CuBLAS nondeterministic unit test (#42627)
Summary:
Addresses some comments that were left unaddressed after PR https://github.com/pytorch/pytorch/issues/41377 was merged:

* Use `check_output` instead of `Popen` to run each subprocess sequentially
* Use f-strings rather than old python format string style
* Provide environment variables to subprocess through the `env` kwarg
* Check for correct error behavior inside the subprocess, and raise another error if incorrect. Then the main process fails the test if any error is raised

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42627

Reviewed By: malfet

Differential Revision: D22969231

Pulled By: ezyang

fbshipit-source-id: 38d5f3f0d641c1590a93541a5e14d90c2e20acec
2020-08-12 08:54:28 -07:00
77bd4d3426 MAINT: speed up istft by using col2im (the original python code used … (#42826)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42213

The [original python code](https://github.com/pytorch/audio/blob/v0.5.0/torchaudio/functional.py#L178) from `torchaudio` was converted to a native function, but used `eye` to  allocate a Tensor and was much slower.
Using `at::col2im` (which is the equivalent of `torch.nn.functional.fold`) solved the slowdown.
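
To illustrate the idea, a minimal overlap-add sketch using `torch.nn.functional.fold` (the Python-side equivalent of `at::col2im`); names and sizes are illustrative:

```python
import torch
import torch.nn.functional as F

win_length, hop_length, n_frames = 8, 4, 5
frames = torch.randn(n_frames, win_length)          # windowed frames
out_len = (n_frames - 1) * hop_length + win_length  # reconstructed length

# fold expects (N, C * prod(kernel_size), L); treat each frame as one column
cols = frames.t().unsqueeze(0)                      # (1, win_length, n_frames)
signal = F.fold(cols, output_size=(1, out_len),
                kernel_size=(1, win_length), stride=(1, hop_length))
print(signal.reshape(-1).shape)                     # torch.Size([24])
```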

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42826

Reviewed By: smessmer

Differential Revision: D23043673

Pulled By: mthrok

fbshipit-source-id: 3f5d0779a87379b002340ea19c9ae5042a43e94e
2020-08-12 08:39:12 -07:00
4665f3fc8d Fix freeze_module pass for sharedtype (#42457)
Summary:
During the cleanup phase, calling recordReferencedAttrs records
the attributes which are referenced and hence kept.
However, if two instances of the same type are preserved
through the freezing process, as the added test case shows, then while
recording the referenced attributes we iterate through the
type INSTANCES that we have seen so far and record those.
Thus, if we have another instance of the same type, we would just look at
the first instance in the list and record that instance's attributes.
This PR fixes that by traversing the getattr chains and getting the
actual instance of the getattr output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42457

Test Plan:
python test/test_jit.py TestFreezing
Fixes #{issue number}

Reviewed By: zou3519

Differential Revision: D22898051

Pulled By: kimishpatel

fbshipit-source-id: 8b1d80f0eb40ab99244f931d4a1fdb28290a4683
2020-08-12 08:35:05 -07:00
ecb9e790ed Remove excessive logging in plan_executor (#42888)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42888

as title

Test Plan: flow-cli test-locally dper.workflows.evaluation.eval_workflow --parameters-file /mnt/public/ehsanardestani/temp/quant_eval_inputs_all.json

Reviewed By: amylittleyang

Differential Revision: D23066529

fbshipit-source-id: f925afd1734e617e412b0f171e16c781d13272d9
2020-08-11 23:57:17 -07:00
a346e90c49 Update to NNP-I v1.0.0.5 (#4770)
Summary:
Align code to NNP-I v1.0.0.5 (glow tracing changes).

Pull Request resolved: https://github.com/pytorch/glow/pull/4770

Reviewed By: arunm-git

Differential Revision: D22927904

Pulled By: hl475

fbshipit-source-id: 3746a6b07f3fcffc662d80a95513427cfccac7a5
2020-08-11 23:53:23 -07:00
ab0a04dc9c Add torch.nansum (#38628)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/38349
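
A quick usage sketch (NaN entries are treated as zero):

```python
import torch

x = torch.tensor([1.0, float('nan'), 2.0])
print(torch.sum(x))     # tensor(nan)
print(torch.nansum(x))  # tensor(3.)
```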

Pull Request resolved: https://github.com/pytorch/pytorch/pull/38628

Reviewed By: VitalyFedyunin

Differential Revision: D22860549

Pulled By: mruberry

fbshipit-source-id: 87fcbfd096d83fc14b3b5622f2301073729ce710
2020-08-11 22:26:04 -07:00
38c7b9a168 avoid redundant isCustomClassRegistered() checks (#42852)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42852

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D23048381

Pulled By: bhosmer

fbshipit-source-id: 40b71670a84cb6f7e5a03279f58ce227d676aa03
2020-08-11 21:53:19 -07:00
bee174dc3f Adds linalg.det alias, fixes outer alias, updates alias testing (#42802)
Summary:
This PR:

- updates test_op_normalization.py, which verifies that aliases are correctly translated in the JIT
- adds torch.linalg.det as an alias for torch.det
- moves the torch.linalg.outer alias to torch.outer (to be consistent with NumPy)

The torch.linalg.outer alias was erroneously put in the linalg namespace as a placeholder, since it's a "linear algebra op" according to NumPy but actually still lives in the main NumPy namespace.

The updates to test_op_normalization are necessary. Previously it was using method_tests to generate tests, and method_tests assumes test suites using it also use the device generic framework, which test_op_normalization did not. For example, some ops require decorators like `skipCPUIfNoLapack`, which only works in device generic test classes. Moving test_op_normalization to the device generic framework also lets these tests run on CPU and CUDA.

Continued reliance on method_tests() is excessive since the test suite is only interested in testing aliasing, and a simpler and more readable `AliasInfo` class is used for the required information. An example impedance mismatch between method_tests and the new tests, for example, was how to handle ops in namespaces like torch.linalg.det. In the future this information will likely be folded into a common 'OpInfo' registry in the test suite.

The actual tests performed are similar to what they were previously: a scripted and traced version of the op is run and the test verifies that both graphs do not contain the alias name and do contain the aliased name.

The guidance for adding an alias has been updated accordingly.

cc mattip

Note:

ngimel suggests:
- deprecating and then removing the `torch.ger` name
- reviewing the implementation of `torch.outer`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42802

Reviewed By: zou3519

Differential Revision: D23059883

Pulled By: mruberry

fbshipit-source-id: 11321c2a7fb283a6e7c0d8899849ad7476be42d1
2020-08-11 21:48:31 -07:00
cd756ee3d4 Support boolean key in dictionary (#42833)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41449 .
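
A minimal TorchScript sketch of what this enables (illustrative):

```python
import torch

@torch.jit.script
def parity_name(n: int) -> str:
    names = {True: "even", False: "odd"}  # bool keys now supported
    return names[n % 2 == 0]

print(parity_name(4))  # even
```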

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42833

Test Plan: `python test/test_jit.py TestDict`

Reviewed By: zou3519

Differential Revision: D23056250

Pulled By: asuhan

fbshipit-source-id: 90dabe1490c99d3e57a742140a4a2b805f325c12
2020-08-11 21:37:37 -07:00
ac93d45906 [quant] Attach qconfig to all modules (#42576)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42576

Previously we had a qconfig propagation list and only attached qconfig to modules
in that list. This works when everything is quantized in the form of a module,
but now that we are expanding quantization to functional/torch ops, we'll need to attach qconfig
to all modules.

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D22939453

fbshipit-source-id: 7d6a1f73ff9bfe461b3afc75aa266fcc8f7db517
2020-08-11 20:34:34 -07:00
e845b0ab51 [Resending] [ONNX] Add eliminate_unused_items pass (#42743)
Summary:
This PR:

- Adds eliminate_unused_items pass that removes unused inputs and initializers.
- Fixes run_embed_params function so it doesn't export unnecessary parameters.
- Removes test_modifying_params in test_verify since it's no longer needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42743

Reviewed By: hl475

Differential Revision: D23058954

Pulled By: houseroad

fbshipit-source-id: cd1e81463285a0bf4e60766c8c87fc9a350d9c7e
2020-08-11 20:30:50 -07:00
a846ed5ce7 [quant] Reduce number of variants of add/mul (#42769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42769

Some of the quantized add and mul variants can share the same name.

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D23054822

fbshipit-source-id: c1300f3f0f046eaf0cf767d03b957835e22cfb4b
2020-08-11 20:01:06 -07:00
5edd9aa95a Fix manual seed to unpack unsigned long (#42206)
Summary:
`torch.manual_seed` was unpacking its argument as an `int64_t`. This fix changes it to a `uint64_t`.
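
A quick check sketch (illustrative): a seed that fits in a uint64_t but not an int64_t should now be accepted.

```python
import torch

torch.manual_seed(2**64 - 1)  # max uint64 value; previously overflowed the int64_t unpack
```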

Fixes https://github.com/pytorch/pytorch/issues/33546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42206

Reviewed By: ezyang

Differential Revision: D22822098

Pulled By: albanD

fbshipit-source-id: 97c978139c5cb2d5b62cc2c963550c758ee994f7
2020-08-11 18:05:34 -07:00
b0b8340065 Collect more data in collect_env (#42887)
Summary:
Collect Python runtime bitness (32 vs 64 bit)
Collect Mac/Linux OS machine type (x86_64, arm, Power, etc.)
Collect Clang version

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42887

Reviewed By: seemethere

Differential Revision: D23064788

Pulled By: malfet

fbshipit-source-id: df361bdbb79364dc521b8e1ecbed1b4bd08f9742
2020-08-11 18:01:14 -07:00
7a9ae52550 [hypothesis] Deadline followup (#42842)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42842

Test Plan: `buck test`

Reviewed By: thatch

Differential Revision: D23045269

fbshipit-source-id: 8a3f4981869287a0f5fb3f0009e13548b7478086
2020-08-11 15:33:23 -07:00
eeb43ffab9 format for readability (#42851)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42851

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D23048382

Pulled By: bhosmer

fbshipit-source-id: 55d84d5f9c69be089056bf3e3734c1b1581dc127
2020-08-11 14:46:42 -07:00
3bf2978497 remove deadline enforcement for hypothesis (#42871)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42871

The old version of hypothesis.testing was not enforcing deadlines. After the
library got updated, the default deadline is 200ms, but even with 1s or
more, tests are flaky. Changing the deadline to non-enforced, which is the same
behavior as the old version.
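
For reference, a minimal sketch of the equivalent per-test setting in hypothesis (`deadline=None` disables enforcement):

```python
from hypothesis import given, settings, strategies as st

@settings(deadline=None)  # do not enforce a per-example deadline
@given(st.integers())
def test_no_deadline(x):
    assert x == x
```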

Test Plan: tested fakelowp/tests

Reviewed By: hl475

Differential Revision: D23059033

fbshipit-source-id: 79b6aec39a2714ca5d62420c15ca9c2c1e7a8883
2020-08-11 14:28:53 -07:00
0ff0fea42b [FX] fix lint (#42866)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42866

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23056813

Pulled By: jamesr66a

fbshipit-source-id: d30cdffe6f0465223354dec00f15658eb0b08363
2020-08-11 14:01:26 -07:00
43613b4236 Fix incorrect aten::sorted.str return type (#42853)
Summary:
aten::sorted.str output type was incorrectly set to bool[] due to a copy-paste error. This PR fixes it.

Fixes https://fburl.com/0rv8amz7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42853

Reviewed By: yf225

Differential Revision: D23054907

Pulled By: gmagogsfm

fbshipit-source-id: a62968c90f0301d4a5546e6262cb9315401a9729
2020-08-11 14:01:23 -07:00
71dbfc79b3 Export BatchBucketOneHot Caffe2 Operator to PyTorch
Summary: As titled.

Test Plan:
```
buck test caffe2/caffe2/python/operator_test:torch_integration_test -- test_batch_bucket_one_hot_op
```

Reviewed By: yf225

Differential Revision: D23005981

fbshipit-source-id: 1daa8d3e7d6ad75e97e94964db95ccfb58541672
2020-08-11 14:00:19 -07:00
4afbf39737 Add nn.functional.adaptive_avg_pool size empty tests (#42857)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42857

Reviewed By: seemethere

Differential Revision: D23053677

Pulled By: malfet

fbshipit-source-id: b3d0d517cddc96796461332150e74ae94aac8090
2020-08-11 12:59:58 -07:00
9c8f5cb61d Ensure IDEEP transpose operator works correctly
Summary: I found out that, without exporting to public format, the IDEEP transpose operator in the middle of a convolution net produces incorrect results (probably reading some out-of-bounds memory). Exporting to public format might not be the most efficient solution, but at least it ensures correct behavior.

Test Plan: Running ConvFusion followed by transpose should give identical results on CPU and IDEEP

Reviewed By: bwasti

Differential Revision: D22970872

fbshipit-source-id: 1ddca16233e3d7d35a367c93e72d70632d28e1ef
2020-08-11 12:58:31 -07:00
c660d2a9ae Initial quantile operator implementation (#42755)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42755

Attempting to land quantile again after being landed here https://github.com/pytorch/pytorch/pull/39417 and reverted here https://github.com/pytorch/pytorch/pull/41616.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23030338

Pulled By: heitorschueroff

fbshipit-source-id: 124a86eea3aee1fdaa0aad718b04863935be26c7
2020-08-11 12:08:17 -07:00
6471b5dc66 Correct the type of some floating point literals in calc_digamma (#42846)
Summary:
They are double, but they are supposed to be accscalar_t or a faster type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42846

Reviewed By: zou3519

Differential Revision: D23049405

Pulled By: mruberry

fbshipit-source-id: 29bb5d5419dc7556b02768f0ff96dfc28676f257
2020-08-11 11:39:06 -07:00
4bafca1a69 Adds list of operator-related information for testing (#41662)
Summary:
This PR adds:

- an "OpInfo" class in common_method_invocations that can contain useful information about an operator, like what dtypes it supports
- a more specialized "UnaryUfuncInfo" class designed to help test the unary ufuncs
- the `ops` decorator, which can generate test variants from lists of OpInfos
- test_unary_ufuncs.py, a new test suite stub that shows how the `ops` decorator and operator information can be used to improve the thoroughness of our testing

The single test in test_unary_ufuncs.py simply ensures that the dtypes associated with a unary ufunc operator in its OpInfo entry are correct. Writing a test like this previously, however, would have required manually constructing test-specific operator information and writing a custom test generator. The `ops` decorator and a common place to put operator information make writing tests like this easier and allows what would have been test-specific information to be reused.

The `ops` decorator extends and composes with the existing device generic test framework, allowing its decorators to be reused. For example, the `onlyOnCPUAndCUDA` decorator works with the new `ops` decorator. This should keep the tests readable and consistent.

Future PRs will likely:

- continue refactoring the too large test_torch.py into more verticals (unary ufuncs, binary ufuncs, reductions...)
- add more operator information to common_method_invocations.py
- refactor tests for unary ufuncs into test_unary_ufunc

Examples of possible future extensions are [here](616747e50d), where an example unary ufunc test is added, and [here](d0b624f110), where example autograd tests are added. Both tests leverage the operator info in common_method_invocations to simplify testing.
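
To make the pattern concrete, here is a simplified, self-contained sketch of the idea (class and helper names here are hypothetical, not the actual test-suite code):

```python
import torch

class OpInfo:
    """Minimal stand-in: records an operator and the dtypes it claims to support."""
    def __init__(self, name, op, dtypes):
        self.name = name
        self.op = op
        self.dtypes = dtypes

unary_ufuncs = [
    OpInfo('neg', torch.neg, {torch.float32, torch.float64, torch.int64}),
]

def check_claimed_dtypes(info, device='cpu'):
    # The kind of test the `ops` decorator would generate per (op, device, dtype)
    for dtype in info.dtypes:
        x = torch.ones(4, device=device, dtype=dtype)
        info.op(x)  # should not raise for a dtype the OpInfo claims to support

for info in unary_ufuncs:
    check_claimed_dtypes(info)
```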

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41662

Reviewed By: ngimel

Differential Revision: D23048416

Pulled By: mruberry

fbshipit-source-id: ecce279ac8767f742150d45854404921a6855f2c
2020-08-11 11:34:53 -07:00
aabdef51f9 [NNC] Registerizer for GPU [1/x] (#42606)
Summary:
Adds a new optimization pass, the Registerizer, which looks for common Stores and Loads to a single item in a buffer and replaces them with a local temporary scalar which is cheaper to write.

For example it can replace:
```
A[0] = 0;
for (int x = 0; x < 10; x++) {
  A[0] = (A[0]) + x;
}
```

with:
```
int A_ = 0;
for (int x = 0; x < 10; x++) {
  A_ = x + A_;
}
A[0] = A_;
```

This is particularly useful on GPUs when parallelizing, since after replacing loops with metavars we have a lot of accesses like this. Early tests of simple reductions on a V100 indicate this can speed them up by ~5x.

This diff got a bit unwieldy with the integration code so that will come in a follow up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42606

Reviewed By: bertmaher

Differential Revision: D22970969

Pulled By: nickgg

fbshipit-source-id: 831fd213f486968624b9a4899a331ea9aeb40180
2020-08-11 11:17:50 -07:00
57b056b5f2 align qlinear benchmark to linear benchmark (#42767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42767

Same as previous PR, forcing the qlinear benchmark to follow the fp one

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.linear_test
python -m pt.qlinear_test
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23013937

fbshipit-source-id: fffaa7cfbfb63cea41883fd4d70cd3f08120aaf8
2020-08-11 10:35:16 -07:00
a7bdf575cb align qconv benchmark to conv benchmark (#42761)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42761

Makes the qconv benchmark follow the conv benchmark exactly. This way
it will be easy to compare q vs fp with the same settings.

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qconv_test
python -m pt.conv_test
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23012533

fbshipit-source-id: af30ee585389395569a6322f5210828432963077
2020-08-11 10:33:19 -07:00
2c8cbd78bd Fix orgqr input size conditions (#42825)
Summary:
* Adds support for `n > k`
* Throw error if `m >= n >= k` is not true
* Updates existing error messages to match argument names shown in public docs
* Adds error tests

Fixes https://github.com/pytorch/pytorch/issues/41776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42825

Reviewed By: smessmer

Differential Revision: D23038916

Pulled By: albanD

fbshipit-source-id: e9bec7b11557505e10e0568599d0a6cb7e12ab46
2020-08-11 10:17:39 -07:00
575e7497f6 Introduce experimental FX library (#42741)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42741

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Differential Revision: D23006383

Pulled By: jamesr66a

fbshipit-source-id: 6cb6d921981fcae47a07df581ffcf900fb8a7fe8
2020-08-11 10:01:47 -07:00
7524699d58 Modify clang code coverage to CMakeList.txt (for MacOS) (#42837)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42837

Originally we use
```
list(APPEND CMAKE_C_FLAGS  -fprofile-instr-generate -fcoverage-mapping)
list(APPEND CMAKE_CXX_FLAGS  -fprofile-instr-generate -fcoverage-mapping)
```
But when compiling the project on Mac with coverage on, it fails with the error:
`clang: error: no input files
/bin/sh: -fprofile-instr-generate: command not found
/bin/sh: -fcoverage-mapping: command not found`

The reason is that `list(APPEND CMAKE_CXX_FLAGS ...)` adds an additional `;` to the variable: if we do `list(APPEND foo a)` and then `list(APPEND foo b)`, then `foo` becomes `a;b` -- with the additional `;`. Since `CMAKE_CXX_FLAGS` is already defined earlier in the `CMakeLists.txt`, we can only use `set(...)` here.
After changing it to
```
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fprofile-instr-generate -fcoverage-mapping")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fprofile-instr-generate -fcoverage-mapping")
```
Tested successfully on a local Mac machine.

Test Plan: Test locally on mac machine

Reviewed By: malfet

Differential Revision: D23043057

fbshipit-source-id: ff6f4891b35b7f005861ee2f8e4c550c997fe961
2020-08-11 09:57:55 -07:00
42114a0154 Update the documentation for scatter to include streams parameter. (#42814)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41827

![Screenshot from 2020-08-10 13-41-20](https://user-images.githubusercontent.com/46765601/89813181-41041380-db0f-11ea-88c2-a97d7b994ac5.png)

Current:
https://pytorch.org/docs/stable/cuda.html#communication-collectives

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42814

Reviewed By: smessmer

Differential Revision: D23033544

Pulled By: mrshenli

fbshipit-source-id: 88747fbb06e88ef9630c042ea9af07dafd422296
2020-08-11 09:28:14 -07:00
1041bdebb0 Fix a typo in EmbeddingBag.cu (#42742)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42742

Reviewed By: smessmer

Differential Revision: D23011029

Pulled By: mrshenli

fbshipit-source-id: 615f8b876ef1881660af71b6e145fb4ca97d2ebb
2020-08-11 09:24:38 -07:00
916235284c [JIT] Fix typing.Final for python 3.8 (#39568)
Summary:
fixes https://github.com/pytorch/pytorch/issues/39566

`typing.Final` exists since python 3.8, and on python 3.8 `typing_extensions.Final` is an alias of `typing.Final`; therefore `ann.__module__ == 'typing_extensions'` becomes False when using 3.8 with `typing_extensions` installed.

~~I don't know why the test is skipped, seems like due to historical reason when python 2.7 was still a thing?~~ Edit: I know now, the `Final` for `<3.7` doesn't have `__origin__`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39568

Reviewed By: smessmer

Differential Revision: D23043388

Pulled By: malfet

fbshipit-source-id: cc87a9e4e38090d784e9cea630e1c543897a1697
2020-08-11 08:51:46 -07:00
d28639a080 Optimization with Backward Implementation of Learnable Fake Quantize Per Channel Kernel (CPU and GPU) (#42810)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42810

In this diff, the original backward pass implementation is sped up by merging the 3 iterations computing dX, dScale, and dZeroPoint separately. In this case, a native loop is directly used on a byte-wise level (referenced by `strides`). In addition, vectorization is used such that scale and zero point are expanded to share the same shape and the element-wise corresponding values to X along the channel axis.

In the benchmark test on the operators, for an input of shape `3x3x256x256`, we have observed the following improvement in performance:
**Speedup from python operator**: ~10x
**Speedup from original learnable kernel**: ~5.4x
**Speedup from non-backprop kernel**: ~1.8x

Test Plan:
To assert correctness of the new kernel, on a devvm, enter the command

`buck test //caffe2/test:quantization -- learnable_backward_per_channel`

To benchmark the operators, on a devvm, enter the command
1. Set the kernel size to 3x3x256x256 or a reasonable input size.
2. Run `buck test //caffe2/benchmarks/operator_benchmark/pt:quantization_test`
3. The relevant outputs for CPU are as follows:

```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cpu_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 989024.686

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cpu_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 95654.079

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cpu_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 176948.970
```
4. The relevant outputs for GPU are as follows:

**Pre-optimization**:

```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 6795.173

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 4321.351

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 1052.066
```

**Post-optimization**:
```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 6737.106

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 2112.484

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 1078.79

Reviewed By: vkuzo

Differential Revision: D22946853

fbshipit-source-id: 1a01284641480282b3f57907cc7908d68c68decd
2020-08-11 08:41:53 -07:00
42b4a7132e Raise error if at::native::embedding is given 0-D weight (#42550)
Summary:
Previously, `at::native::embedding` implicitly assumed that the `weight` argument would be 1-D or greater. Given a 0-D tensor, it would segfault. This change makes it throw a RuntimeError instead.
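
A repro sketch of the fixed behavior (illustrative):

```python
import torch
import torch.nn.functional as F

weight = torch.tensor(1.0)  # 0-D weight: used to segfault
idx = torch.tensor([0])
try:
    F.embedding(idx, weight)
except RuntimeError as e:
    print("raised as expected:", e)
```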

Fixes https://github.com/pytorch/pytorch/issues/41780

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42550

Reviewed By: smessmer

Differential Revision: D23040744

Pulled By: albanD

fbshipit-source-id: d3d315850a5ee2d2b6fcc0bdb30db2b76ffffb01
2020-08-11 08:26:45 -07:00
d396d135db Added torch::cuda::manual_seed(_all) to mirror torch.cuda.manual_seed(_all) (#42638)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42638

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D23030317

Pulled By: heitorschueroff

fbshipit-source-id: b0d7bdf0bc592a913ae5b1ffc14c3a5067478ce3
2020-08-11 08:22:20 -07:00
e8f4b04d9a vmap: temporarily disable support for random functions (#42617)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42617

While we figure out the random plan, I want to initially disable
support for random operations. This is because there is an ambiguity in
what randomness means. For example,

```
tensor = torch.zeros(B0, 1)
vmap(lambda t: t.normal_())(tensor)
```

in the above example, should tensor[0] and tensor[1] be equal (i.e.,
use the same random seed), or should they be different?

The mechanism for disabling random support is as follows:
- We add a new dispatch key called VmapMode
- Whenever we're inside vmap, we enable VmapMode for all tensors.
This is done via at::VmapMode::increment_nesting and
at::VmapMode::decrement_nesting.
- DispatchKey::VmapMode's fallback kernel is the fallthrough kernel.
- We register kernels that raise errors for all random functions on
DispatchKey::VmapMode. This way, whenever someone calls a random
function on any tensor (not just BatchedTensors) inside of a vmap block,
an error gets thrown.

Test Plan: - pytest test/test_vmap.py -v -k "Operators"

Reviewed By: ezyang

Differential Revision: D22954840

Pulled By: zou3519

fbshipit-source-id: cb8d71062d4087e10cbf408f74b1a9dff81a226d
2020-08-11 07:19:51 -07:00
ffc3da35f4 Don't materialize output grads (#41821)
Summary:
Added a new option in AutogradContext to tell autograd to not materialize output grad tensors, that is, don't expand undefined/None tensors into tensors full of zeros before passing them as input to the backward function.
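
A Python-side sketch of the behavior; the `set_materialize_grads` name matches the eventual public API and should be treated as an assumption for this era:

```python
import torch

class TwoOutputs(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.set_materialize_grads(False)
        return x * 2, x * 3

    @staticmethod
    def backward(ctx, g1, g2):
        # With materialization off, grads of unused outputs arrive as None
        # rather than zero-filled tensors.
        grad = None
        if g1 is not None:
            grad = 2 * g1
        if g2 is not None:
            grad = 3 * g2 if grad is None else grad + 3 * g2
        return grad

x = torch.ones(3, requires_grad=True)
a, b = TwoOutputs.apply(x)
a.sum().backward()  # only `a` is used, so g2 is None in backward
print(x.grad)       # tensor([2., 2., 2.])
```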

This PR is the second part that closes https://github.com/pytorch/pytorch/issues/41359. The first PR is https://github.com/pytorch/pytorch/pull/41490.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41821

Reviewed By: albanD

Differential Revision: D22693163

Pulled By: heitorschueroff

fbshipit-source-id: a8d060405a17ab1280a8506a06a2bbd85cb86461
2020-08-11 04:27:07 -07:00
ddcf3ded3e Revert D23002043: add net transforms for fusion
Test Plan: revert-hammer

Differential Revision:
D23002043 (a4b763bc2c)

Original commit changeset: f0b13d51d68c

fbshipit-source-id: d43602743af35db825e951358992e979283a26f6
2020-08-10 21:22:57 -07:00
59b10f7929 [quant] Sorting the list of dispatches (#42758)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42758

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23011764

Pulled By: z-a-f

fbshipit-source-id: df87acdcf77ae8961a109eaba20521bc4f27ad0e
2020-08-10 21:05:30 -07:00
dedcc30c84 Fix ROCm CI by increasing test timeout (#42827)
Summary:
ROCm is failing to run this test in the allotted time. See, for example, https://app.circleci.com/pipelines/github/pytorch/pytorch/198759/workflows/f6066acf-b289-46c5-aad0-6f4f663ce820/jobs/6618625.

cc jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42827

Reviewed By: pbelevich

Differential Revision: D23042220

Pulled By: mruberry

fbshipit-source-id: 52b426b0733b7b52ac3b311466d5000334864a82
2020-08-10 20:26:20 -07:00
a4b763bc2c add net transforms for fusion (#42763)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42763

Add the fp16 fusions as net transforms:
- layernorm fused with mul+add
- swish int8

Test Plan: added unit test, ran flows

Reviewed By: yinghai

Differential Revision: D23002043

fbshipit-source-id: f0b13d51d68c240b05d2a237a7fb8273e996328b
2020-08-10 20:16:14 -07:00
103887892c Fix "non-negative integer" error messages (#42734)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42662

Use "positive integer" error message for consistency with: 17f76f9a78/torch/optim/lr_scheduler.py (L958-L959)
ad7133d3c1/torch/utils/data/sampler.py (L102-L104)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42734

Reviewed By: zdevito

Differential Revision: D23039575

Pulled By: smessmer

fbshipit-source-id: 1be1e0caa868891540ecdbe6f471a6cd51c40ede
2020-08-10 19:39:37 -07:00
c14a7f6808 adaptive_avg_pool[23]d: check output_size.size() (#42831)
Summary:
Return an error if output_size is unexpected

Fixes https://github.com/pytorch/pytorch/issues/42578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42831

Reviewed By: ezyang

Differential Revision: D23039295

Pulled By: malfet

fbshipit-source-id: d14a5e6dccdf785756635caee2c87151c9634872
2020-08-10 19:27:18 -07:00
c9e825640a [c10d] Template computeLengthsAndOffsets() (#42706)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42706

Different backends accept different length types, e.g. MPI_Alltoallv, ncclSend/Recv(), gloo::alltoallv(), so make computeLengthsAndOffsets() a template.

Test Plan:
Sandcastle
CI
HPC: ./trainer_cmd.sh -p 16 -n 8 -d nccl

Reviewed By: osalpekar

Differential Revision: D22961459

fbshipit-source-id: 45ec271f8271b96f2dba76cd9dce3e678bcfb625
2020-08-10 19:21:46 -07:00
a414bd69de Skip test_c10d.ProcessGroupNCCLTest under TSAN (#42750)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42750

All of these tests fail under TSAN since we fork in a multithreaded
environment.
ghstack-source-id: 109566396

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D23007746

fbshipit-source-id: 65571607522b790280363882d61bfac8a52007a1
2020-08-10 19:13:52 -07:00
a2559652ab Rename some BatchedTensorImpl APIs (#42700)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42700

I was about to use `isBatched` somewhere not in the files used to
implement vmap but then realized how silly that sounds due to
ambiguity. This PR renames some of the BatchedTensor APIs to make a bit
more sense to onlookers.

- isBatched(Tensor) -> isBatchedTensor(Tensor)
- unsafeGetBatched(Tensor) -> unsafeGetBatchedImpl(Tensor)
- maybeGetBatched(Tensor) -> maybeGetBatchedImpl(Tensor)

Test Plan: - build Pytorch, run tests.

Reviewed By: ezyang

Differential Revision: D22985868

Pulled By: zou3519

fbshipit-source-id: b8ed9925aabffe98085bcf5c81d22cd1da026f46
2020-08-10 17:43:20 -07:00
8f67c7a624 BatchedTensor fallback: extended to support ops with multiple Tensor returns (#42628)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42628

This PR extends the BatchedTensor fallback to support operators with
multiple Tensor returns. If an operator has multiple returns, we stack
shards of each return to create the full outputs.

Test Plan:
- `pytest test/test_vmap.py -v`. Added a new test for an operator with
multiple returns (torch.var_mean).

Reviewed By: izdeby

Differential Revision: D22957095

Pulled By: zou3519

fbshipit-source-id: 5c0ec3bf51283cc4493b432bcfed1acf5509e662
2020-08-10 17:42:03 -07:00
64a7939ee5 test_cpp_rpc: Build test_e2e_process_group.cpp only if USE_GLOO is true (#42836)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42836

Reviewed By: seemethere

Differential Revision: D23041274

Pulled By: malfet

fbshipit-source-id: 8605332701271bea6d9b3a52023f548c11d8916f
2020-08-10 16:54:26 -07:00
8718524571 [vulkan] cat op (concatenate) (#41434)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41434

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22754941

Pulled By: IvanKobzarev

fbshipit-source-id: cd03577e1c2f639b2592d4b7393da4657422e23c
2020-08-10 16:24:20 -07:00
3cf2551f2f Fix torch.nn.functional.grid_sample crashes if grid has NaNs (#42703)
Summary:
In `clip_coordinates`, replace the `minimum(maximum(in))` composition with `clamp_max(clamp_min(in))`, and swap the order of the `clamp_min` operands so that NaNs in the grid are clamped to 0.
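
The operand-order subtlety in plain Python (the built-in `max`, like many SIMD max implementations, keeps the current operand when a comparison with NaN is false):

```python
nan = float('nan')
print(max(nan, 0.0))  # nan -- NaN survives
print(max(0.0, nan))  # 0.0 -- NaN is clamped away
```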

Fixes https://github.com/pytorch/pytorch/issues/42616

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42703

Reviewed By: ezyang

Differential Revision: D22987447

Pulled By: malfet

fbshipit-source-id: a8a2d6de8043d6b77c8707326c5412d0250efae6
2020-08-10 16:20:09 -07:00
e06b4be5ae change pt_defs.bzl to python file (#42725)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42725

This diff changes pt_defs.bzl to pt_defs.py so that it can be included as a python source file.

The reason is that if we remove base ops, pt_defs.bzl becomes too big (8k lines) and we cannot pass its content to gen_oplist (a python library). The easy solution is to change it to a python source file so that it can be used in gen_oplist.

Test Plan: sandcastle

Reviewed By: ljk53, iseeyuan

Differential Revision: D22968258

fbshipit-source-id: d720fe2e684d9a2bf5bd6115b6e6f9b812473f12
2020-08-10 16:12:43 -07:00
752f433a24 DDP communication hook: skip dividing grads by world_size if hook registered. (#42400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42400

mcarilli spotted that in the original DDP communication hook design described in [39272](https://github.com/pytorch/pytorch/issues/39272), the hooks receive grads that are already predivided by world size.

It makes sense to skip the divide completely if hook registered. The hook is meant for the user to completely override DDP communication. For example, if the user would like to implement something like GossipGrad, always dividing by the world_size would not be a good idea.

We also included a warning in the register_comm_hook API as:
> GradBucket bucket's tensors will not be predivided by world_size. User is responsible to divide by the world_size in case of operations like allreduce.
ghstack-source-id: 109548696

**Update:** We discovered and fixed a bug with the sparse tensors case. See new unit test called `test_ddp_comm_hook_sparse_gradients` and changes in `reducer.cpp`.

Test Plan: python test/distributed/test_c10d.py and perf benchmark tests.

Reviewed By: ezyang

Differential Revision: D22883905

fbshipit-source-id: 3277323fe9bd7eb6e638b7ef0535cab1fc72f89e
2020-08-10 13:55:42 -07:00
d7aaa3327b .circleci: Only do comparisons when available (#42816)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42816

Comparisons were being done on branches where the '<<
pipeline.git.base_revision >>' didn't exist before so let's just move it
so that comparison / code branch is only run when that variable is
available

Example: https://app.circleci.com/pipelines/github/pytorch/pytorch/198611/workflows/8a316eef-d864-4bb0-863f-1454696b1e8a/jobs/6610393

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23032900

Pulled By: seemethere

fbshipit-source-id: 98a49c78b174d6fde9c6b5bd3d86a6058d0658bd
2020-08-10 12:33:37 -07:00
d83cc92948 [ONNX] Add support for scalar src in torch.scatter ONNX export. (#42765)
Summary:
`torch.scatter` supports two overloads – one where the `src` input tensor is the same size as the `index` input tensor, and a second where `src` is a scalar. Currently, the ONNX exporter only supports the first overload. This PR adds export support for the second overload of `torch.scatter`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42765

Reviewed By: hl475

Differential Revision: D23025189

Pulled By: houseroad

fbshipit-source-id: 5c2a3f3ce3b2d69661a227df8a8e0ed7c1858dbf
2020-08-10 11:45:42 -07:00
e7b5a23607 include missing settings import
Summary: from hypothesis import given, settings

Test Plan: test_op_nnpi_fp16.py

Differential Revision: D23031038

fbshipit-source-id: 751547e6a6e992d8816d4cc2c5a699ba19a97796
2020-08-10 10:45:34 -07:00
77305c1e44 Automated submodule update: FBGEMM (#42781)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42781

This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: fbd813e29f

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42771

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: dskhudia

Differential Revision: D23015890

Pulled By: jspark1105

fbshipit-source-id: f0f62969f8744df96a4e7f5aff2ce95baabb2f76
2020-08-10 10:14:56 -07:00
e5adf45dde Add python unittest target to caffe2/test/TARGETS (#42766)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42766

**Summary**
Some python tests are missing in `caffe2/test/TARGETS`; add them to make the test targets more comprehensive.

According to [run_test.py](https://github.com/pytorch/pytorch/blob/master/test/run_test.py#L125), some tests are slower. Slow tests are added as independent targets and the others are put together into one `others` target. The reason is that we want to reduce overhead, especially for code coverage collection. Tests in one target can be run as a bundle, and coverage can then be collected together. The coverage collection procedure is typically time-expensive, so this helps us save time.

Test Plan:
Run all the new test targets locally in dev server and record the time they cost.
**Statistics**

```
# jit target
real    33m7.694s
user    653m1.181s
sys     58m14.160s

--------- Compare to Initial Jit Target runtime: ----------------

real    32m13.057s
user    613m52.843s
sys     54m58.678s

```

```
# others target
real    9m2.920s
user    164m21.927s
sys     12m54.840s
```

```
# serialization target
real    4m21.090s
user    23m33.501s
sys     1m53.308s

```

```
# tensorexpr
real    11m28.187s
user    33m36.420s
sys     1m15.925s
```

```
# type target
real    3m36.197s
user    51m47.912s
sys     4m14.149s
```

Reviewed By: malfet

Differential Revision: D22979219

fbshipit-source-id: 12a30839bb76a64871359bc024e4bff670c5ca8b
2020-08-10 09:48:59 -07:00
bc779667d6 generalize circleci docker build.sh and add centos support (#41255)
Summary:
Add centos Dockerfile and support to circleci docker builds, and allow generic image names to be parsed by build.sh, so both hardcoded images and custom images can be built.

Currently only adds a ROCm centos Dockerfile.

CC ezyang xw285cornell sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41255

Reviewed By: mrshenli

Differential Revision: D23003218

Pulled By: malfet

fbshipit-source-id: 562c53533e7fb9637dc2e81edb06b2242afff477
2020-08-10 09:42:05 -07:00
05f00532f5 Fix TensorPipe submodule (#42789)
Summary:
Not sure what happened, but possibly I landed a PR on PyTorch which updated the TensorPipe submodule to a commit hash of a *PR* of TensorPipe. Now that the latter PR has been merged, though, that same commit has a different hash. The commit referenced by PyTorch has therefore become orphaned. This is causing some issues.

Hence here I am updating the commit, which however does not change a single line of code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42789

Reviewed By: houseroad

Differential Revision: D23023238

Pulled By: lw

fbshipit-source-id: ca2dcf6b7e07ab64fb37e280a3dd7478479f87fd
2020-08-10 02:15:44 -07:00
55ac240589 [ONNX] Fix scalar type cast for comparison ops (#37787)
Summary:
Always promote type casts for comparison operators, regardless if the input is tensor or scalar. Unlike arithmetic operators, where scalars are implicitly cast to the same type as tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/37787

Reviewed By: hl475

Differential Revision: D21440585

Pulled By: houseroad

fbshipit-source-id: fb5c78933760f1d1388b921e14d73a2cb982b92f
2020-08-09 23:00:57 -07:00
162972e980 Fix op benchmark (#42757)
Summary:
A benchmark relies on abs_ having a functional variant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42757

Reviewed By: ngimel

Differential Revision: D23011037

Pulled By: mruberry

fbshipit-source-id: c04866015fa259e4c544e5cf0c33ca1e11091d92
2020-08-09 17:31:51 -07:00
87970b70a7 Adds 'clip' alias for clamp (#42770)
Summary:
Per title. Also updates our guidance for adding aliases to clarify interned_string and method_test requirements. The alias is tested by extending test_clamp to also test clip.
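
A one-line usage check of the new alias (illustrative):

```python
import torch

x = torch.tensor([-2.0, 0.5, 3.0])
print(torch.clip(x, -1, 1))   # tensor([-1.0000,  0.5000,  1.0000])
print(torch.clamp(x, -1, 1))  # identical result
```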

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42770

Reviewed By: ngimel

Differential Revision: D23020655

Pulled By: mruberry

fbshipit-source-id: f1d8e751de9ac5f21a4f95d241b193730f07b5dc
2020-08-09 02:46:02 -07:00
b6810c1064 Include/ExcludeDispatchKeySetGuard API (#42658)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42658

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22971426

Pulled By: bhosmer

fbshipit-source-id: 4d63e0cb31745e7b662685176ae0126ff04cdece
2020-08-08 16:27:05 -07:00
79b8328aaf optimize_for_mobile: bring packed params to root module (#42740)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42740

Adds a pass to hoist conv packed params to root module.
The benefit is that if there is nothing else in the conv module,
subsequent passes will delete it, which will reduce module size.

For context, freezing does not handle this because conv packed
params is a custom object.

Test Plan:
```
PYTORCH_JIT_LOG_LEVEL=">hoist_conv_packed_params.cpp" python test/test_mobile_optimizer.py TestOptimizer.test_hoist_conv_packed_params
```

Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D23005961

fbshipit-source-id: 31ab1f5c42a627cb74629566483cdc91f3770a94
2020-08-08 15:53:20 -07:00
d8801f590c fix asan failure for module freezing in conv bn folding (#42739)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42739

This is a test case which fails with ASAN on at the module freezing
step.

Test Plan:
```
USE_ASAN=1 USE_CUDA=0 python setup.py develop
LD_PRELOAD=/usr/lib64/libasan.so.4 python test/test_mobile_optimizer.py TestOptimizer.test_optimize_for_mobile_asan

// output tail: https://gist.github.com/vkuzo/7a0018b9e10ffe64dab0ac7381479f23
```

Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D23005962

fbshipit-source-id: b7d4492e989af7c2e22197c16150812bd2dda7cc
2020-08-08 15:51:59 -07:00
5cd0f5e8ec [PyFI] Update hypothesis and switch from tp2 (#41645)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41645

Pull Request resolved: https://github.com/facebookresearch/pytext/pull/1405

Test Plan: buck test

Reviewed By: thatch

Differential Revision: D20323893

fbshipit-source-id: 54665d589568c4198e96a27f0ed8e5b41df7b86b
2020-08-08 12:13:04 -07:00
b7a9bc0802 Revert D22217029: Add fake quantize operator that works in backward pass
Test Plan: revert-hammer

Differential Revision:
D22217029 (48e978ba18)

Original commit changeset: 7055a2cdafcf

fbshipit-source-id: f57a27be412c6fbfd5a5b07a26f758ac36be3b67
2020-08-07 23:04:40 -07:00
18ca999e1a integrate int8 swish with net transformer
Summary:
Add a fuse path for deq->swish->quant.
Update the swish fake op interface to take arguments accordingly.

Test Plan:
net_runner passes
unit tests need to be updated

Reviewed By: venkatacrc

Differential Revision: D22962064

fbshipit-source-id: cef79768db3c8af926fca58193d459d671321f80
2020-08-07 23:01:06 -07:00
c889de7e25 update DispatchKey::toString() (#42619)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42619

Added missing entries to `DispatchKey::toString()` and reordered to match declaration order in `DispatchKey.h`

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22963407

Pulled By: bhosmer

fbshipit-source-id: 34a012135599f497c308ba90ea6e8117e85c74ac
2020-08-07 22:39:23 -07:00
5dd230d6a2 [vulkan] inplace add_, relu_ (#41380)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41380

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22754939

Pulled By: IvanKobzarev

fbshipit-source-id: 19b0bbfc5e1f149f9996b5043b77675421ecb2ed
2020-08-07 21:18:17 -07:00
6755e49cad Set proper return type (#42454)
Summary:
This function was always expecting to return a `size_t` value

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42454

Reviewed By: ezyang

Differential Revision: D22993168

Pulled By: ailzhang

fbshipit-source-id: 044df8ce17983f04681bda8c30cd742920ef7b1e
2020-08-07 19:22:35 -07:00
e95fbaaba3 Adding Peter's Swish Op ULP analysis. (#42573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42573

* Generate the ULP png files for different ranges.

Test Plan: test_op_ulp_error.py

Reviewed By: hyuen

Differential Revision: D22938572

fbshipit-source-id: 6374bef6d44c38e1141030d44029dee99112cd18
2020-08-07 19:13:01 -07:00
0a804be47d [NCCL] DDP communication hook: getFuture() without cudaStreamAddCallback (#42335)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42335

**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version; the default implementation will throw UnsupportedOperation.

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](https://github.com/pytorch/pytorch/pull/41596).

ghstack-source-id: 109461507

Test Plan:
```(pytorch) [sinannasir@devgpu017.ash6 ~/local/pytorch] python test/distributed/test_c10d.py
Couldn't download test skip set, leaving all tests enabled...
..............................s.....................................................s................................
----------------------------------------------------------------------
Ran 117 tests in 298.042s

OK (skipped=2)
```
### Facebook Internal:
2\. HPC PT trainer run to validate no regression. Check the QPS number:
**Master:** QPS after 1000 iters: around ~34100
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER"testvideo_master" --trainers 16 --trainer-version 1c53912
```
```
[0] I0806 142048.682 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963963 0.950479 0.953704], lifetime NE: [0.963963 0.950479 0.953704], loss: [0.243456 0.235225 0.248375], QPS: 34199
```
[detailed logs](https://www.internalfb.com/intern/tupperware/details/task/?handle=priv3_global%2Fmast_hpc%2Fhpc.sinannasirtestvideo_mastwarm.trainer.trainer%2F0&ta_tab=logs)

**getFuture/new design:** QPS after 1000 iters: around ~34030
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER"testvideo_getFutureCyclicFix" --trainers 16 --trainer-version 8553aee
```
```
[0] I0806 160149.197 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963959 0.950477 0.953704], lifetime NE: [0.963959 0.950477 0.953704], loss: [0.243456 0.235225 0.248375], QPS: 34018
```
[detailed logs](https://www.internalfb.com/intern/tupperware/details/task/?handle=priv3_global%2Fmast_hpc%2Fhpc.sinannasirtestvideo_getFutureCyclicFix.trainer.trainer%2F0&ta_tab=logs)
**getFuture/new design Run 2:** QPS after 1000 iters: around ~34200
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER"test2video_getFutureCyclicFix" --trainers 16 --trainer-version 8553aee
```
```
[0] I0806 160444.650 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963963 0.950482 0.953706], lifetime NE: [0.963963 0.950482 0.953706], loss: [0.243456 0.235225 0.248375], QPS: 34201
```
[detailed logs](https://www.internalfb.com/intern/tupperware/details/task/?handle=priv3_global%2Fmast_hpc%2Fhpc.sinannasirtest2video_getFutureCyclicFix.trainer.trainer%2F0&ta_tab=logs)
**getFuture/old design (Regression):** QPS after 1000 iters: around ~31150
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER"testvideo_OLDgetFutureD22583690 (d904ea5972)" --trainers 16 --trainer-version 1cb5cbb
```
```
priv3_global/mast_hpc/hpc.sinannasirtestvideo_OLDgetFutureD22583690 (d904ea5972).trainer.trainer/0 [0] I0805 101320.407 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963964 0.950482 0.953703], lifetime NE: [0.963964 0.950482 0.953703], loss: [0.243456 0.235225 0.248375], QPS: 31159
```
3\. `flow-cli` tests; roberta_base; world_size=4:
**Master:** f210039922
```
total:
  32 GPUs -- 32 GPUs: p25:  0.908    35/s  p50:  1.002    31/s  p75:  1.035    30/s  p90:  1.051    30/s  p95:  1.063    30/s
forward:
  32 GPUs -- 32 GPUs: p25:  0.071   452/s  p50:  0.071   449/s  p75:  0.072   446/s  p90:  0.072   445/s  p95:  0.072   444/s
backward:
  32 GPUs -- 32 GPUs: p25:  0.821    38/s  p50:  0.915    34/s  p75:  0.948    33/s  p90:  0.964    33/s  p95:  0.976    32/s
optimizer:
  32 GPUs -- 32 GPUs: p25:  0.016  2037/s  p50:  0.016  2035/s  p75:  0.016  2027/s  p90:  0.016  2019/s  p95:  0.016  2017/s
```
**getFuture new design:** f210285797
```
total:
  32 GPUs -- 32 GPUs: p25:  0.952    33/s  p50:  1.031    31/s  p75:  1.046    30/s  p90:  1.055    30/s  p95:  1.070    29/s
forward:
  32 GPUs -- 32 GPUs: p25:  0.071   449/s  p50:  0.072   446/s  p75:  0.072   445/s  p90:  0.072   444/s  p95:  0.072   443/s
backward:
  32 GPUs -- 32 GPUs: p25:  0.865    37/s  p50:  0.943    33/s  p75:  0.958    33/s  p90:  0.968    33/s  p95:  0.982    32/s
optimizer:
  32 GPUs -- 32 GPUs: p25:  0.016  2037/s  p50:  0.016  2033/s  p75:  0.016  2022/s  p90:  0.016  2018/s  p95:  0.016  2017/s

```

Reviewed By: ezyang

Differential Revision: D22833298

fbshipit-source-id: 1bb268d3b00335b42ee235c112f93ebe2f25b208
2020-08-07 18:48:35 -07:00
d4a4c62df3 [caffe2] Fix the timeout (stuck) issues of dedup SparseAdagrad C2 kernel
Summary:
Back out D22800959 (f30ac66e79). This change was causing the timeout (machine stuck) issues for the dedup kernels; reverting it makes the unit test pass. Still need to investigate why this is the culprit...

Original commit changeset: 641d52a51070

Test Plan:
```
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_sum_gradient \(caffe2\.caffe2\.fb\.net_transforms\.tests\.fuse_sparse_ops_test\.TestFuseSparseOps\)' --print-passing-details
```

Reviewed By: jspark1105

Differential Revision: D23008389

fbshipit-source-id: 4f1b9a41c78eaa5541d57b9d8aa12401e1d495f2
2020-08-07 18:42:36 -07:00
3fa0581cf2 [fbgemm] use new more general depthwise 3d conv interface (#42697)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42697

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/401

As title

Test Plan: CI

Reviewed By: dskhudia

Differential Revision: D22972233

fbshipit-source-id: a2c8e989dee84b2c0587faccb4f8e3bcb05c797c
2020-08-07 18:30:56 -07:00
13bc542829 Fix lite trainer unit test submodule registration (#42714)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42714

Change two unit tests for the lite trainer to register two instances/objects of the same submodule type instead of the same submodule object twice.

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D22990736

Pulled By: ann-ss

fbshipit-source-id: 2bf56b5cc438b5a5fc3db90d3f30c5c431d3ae77
2020-08-07 18:26:56 -07:00
48e978ba18 Add fake quantize operator that works in backward pass (#40532)
Summary:
This diff adds FakeQuantizeWithBackward. This works the same way as the regular FakeQuantize module, allowing QAT to occur in the forward pass, except it has an additional quantize_backward parameter. When quantize_backward is enabled, the gradients are fake quantized as well (dynamically, using hard-coded values). This allows the user to see whether there would be a significant loss of accuracy if the gradients were quantized in their model.
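A hedged sketch of the idea (names here are illustrative, not the diff's API): gradients can be fake quantized in the backward pass with a custom autograd Function.

```
import torch

class FakeQuantGrad(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Identity in forward; QAT fake quantization of activations/weights
        # happens elsewhere in the model.
        return x

    @staticmethod
    def backward(ctx, grad_out):
        # Dynamically pick a scale from the gradient's range and fake
        # quantize to int8, mirroring the quantize_backward behavior
        # described above.
        scale = grad_out.abs().max().clamp(min=1e-8) / 127.0
        return torch.fake_quantize_per_tensor_affine(
            grad_out, scale.item(), 0, -128, 127)
```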

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40532

Test Plan: The relevant test for this can be run using `python test/test_quantization.py TestQATBackward.test_forward_and_backward`

Reviewed By: supriyar

Differential Revision: D22217029

Pulled By: durumu

fbshipit-source-id: 7055a2cdafcf022f1ea11c3442721ae146d2b3f2
2020-08-07 17:47:01 -07:00
2b04712205 Exposing Percentile Caffe2 Operator in PyTorch
Summary: As titled.

Test Plan:
```
buck test caffe2/caffe2/python/operator_test:torch_integration_test -- test_percentile
```

Reviewed By: yf225

Differential Revision: D22999896

fbshipit-source-id: 2e3686cb893dff1518d533cb3d78c92eb2a6efa5
2020-08-07 16:22:37 -07:00
55b1706775 Skips some complex tests on ROCm (#42759)
Summary:
Fixes ROCm build on OSS master.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42759

Reviewed By: ngimel

Differential Revision: D23011560

Pulled By: mruberry

fbshipit-source-id: 3339ecbd5a0ca47aede6f7c3f84739af1ac820d5
2020-08-07 16:12:32 -07:00
95f4f67552 Restrict conversion to SmallVector (#42694)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42694

The old implementation allowed calling the SmallVector constructor and operator= for any type without restriction,
but then failed with a compiler error when the type wasn't a collection.

Instead, we should only enable them when Container satisfies a container concept, so that the constructor simply doesn't match otherwise.

This fixes an issue kimishpatel was running into.
ghstack-source-id: 109370513

Test Plan: unit tests

Reviewed By: kimishpatel, ezyang

Differential Revision: D22983020

fbshipit-source-id: c31264f5c393762d822f3d64dd2a8e3279d8da44
2020-08-07 15:47:29 -07:00
faca3c43e6 fix celu in quantized benchmark (#42756)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42756

Similar to ELU, CELU was also broken in the quantized benchmark, fixing.

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qactivation_test
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23010863

fbshipit-source-id: 203e63f9cff760af6809f6f345b0d222dc1e9e1b
2020-08-07 15:23:50 -07:00
4eb66b814e Automated submodule update: FBGEMM (#42713)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: a989b99279

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42713

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: amylittleyang

Differential Revision: D22990108

Pulled By: jspark1105

fbshipit-source-id: 3252a0f5ad9546221ef2fe908ce6b896252e1887
2020-08-07 13:41:54 -07:00
02f58bdbd7 [caffe2] add type annotations for caffe2.distributed.python
Summary: Add Python type annotations for the `caffe2.distributed.python` module.

Test Plan: Will check sandcastle results.

Reviewed By: jeffdunn

Differential Revision: D22994012

fbshipit-source-id: 30565cc41dd05b5fbc639ae994dfe2ddd9e56cb1
2020-08-07 13:12:53 -07:00
6ebc0504ca BAND, BOR and BXOR for NCCL (all_)reduce should throw runtime errors (#42669)
Summary:
cc rohan-varma
Fixes https://github.com/pytorch/pytorch/issues/41362 #39708

# Description
NCCL doesn't support `BAND, BOR, BXOR`. Since the [current mapping](0642d17efc/torch/lib/c10d/ProcessGroupNCCL.cpp (L39)) doesn't contain any of the mentioned bitwise operators, a default value of `ncclSum` is used instead.

This PR should provide the expected behaviour where a runtime exception is thrown.

# Notes
- The way I'm throwing exceptions is derived from [ProcessGroupGloo.cpp](0642d17efc/torch/lib/c10d/ProcessGroupGloo.cpp (L101))
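A hedged illustration of the intended behavior (assumes an initialized NCCL process group and a CUDA tensor): bitwise reductions now raise a RuntimeError instead of silently falling back to `ncclSum`.

```
import torch
import torch.distributed as dist

t = torch.ones(4, dtype=torch.int64, device="cuda")
try:
    dist.all_reduce(t, op=dist.ReduceOp.BAND)
except RuntimeError as err:
    print("NCCL rejects bitwise reductions:", err)
```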

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42669

Reviewed By: ezyang

Differential Revision: D22996295

Pulled By: rohan-varma

fbshipit-source-id: 83a9fedf11050d2890f9f05ebcedf53be0fc3516
2020-08-07 13:09:07 -07:00
7332c21f7a Speed up HistogramObserver by vectorizing critical path (#41041)
Summary:
22x speedup over the code this replaces. Tested on ResNet18 on a devvm using CPU only, using default parameters for HistogramObserver (i.e. 2048 bins).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41041

Test Plan:
To run the test against the reference (old) implementation, you can use `python test/test_quantization.py TestRecordHistogramObserver.test_histogram_observer_against_reference`.

To run the benchmark, while in the folder `benchmarks/operator_benchmark`, you can use `python -m benchmark_all_quantized_test --operators HistogramObserverCalculateQparams`.

Benchmark results before speedup:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_affine
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_affine
Forward Execution Time (us) : 185818.566

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_symmetric
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_symmetric
Forward Execution Time (us) : 165325.916
```

Benchmark results after speedup:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_affine
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_affine
Forward Execution Time (us) : 12242.241

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_symmetric
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_symmetric
Forward Execution Time (us) : 12655.354
```

Reviewed By: raghuramank100

Differential Revision: D22400755

Pulled By: durumu

fbshipit-source-id: 639ac796a554710a33c8a930c1feae95a1148718
2020-08-07 12:29:23 -07:00
98de150381 C++ API TransformerEncoderLayer (#42633)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42633

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22994332

Pulled By: glaringlee

fbshipit-source-id: 873abdf887d135fb05bde560d695e2e8c992c946
2020-08-07 11:49:42 -07:00
eba35025e0 [JIT] Exclude staticmethods from TS class compilation (#42611)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42611

**Summary**
This commit modifies the Python frontend to ignore static functions on
TorchScript classes when compiling them. They are currently included
along with methods, which causes the first argument of the
static function to be unconditionally inferred to be of the type of the
class it belongs to (regardless of how it is annotated or whether it is
annotated at all). This can lead to compilation errors depending on
how that argument is used in the body of the function.

Static functions are instead imported and scripted as if they were
standalone functions.
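A hedged sketch of the pattern this fixes (class and method names are made up): before this change, `x` in `make` below would be inferred as `Pair` rather than keeping its `int` annotation.

```
import torch

@torch.jit.script
class Pair(object):
    def __init__(self, a: int, b: int):
        self.a = a
        self.b = b

    @staticmethod
    def make(x: int) -> 'Pair':
        # After this commit, compiled as a standalone function.
        return Pair(x, x)

@torch.jit.script
def build(n: int) -> int:
    p = Pair.make(n)
    return p.a + p.b

print(build(3))  # 6
```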

**Test Plan**
This commit augments the unit test for static methods in `test_class_types.py`
to test that static functions can call each other and the class
constructor.

**Fixes**
This commit fixes #39308.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D22958163

Pulled By: SplitInfinity

fbshipit-source-id: 45c3c372792299e6e5288e1dbb727291e977a2af
2020-08-07 11:22:04 -07:00
9f88bcb5a2 Minor typo fix (#42731)
Summary:
Just fixed a typo in test/test_sparse.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42731

Reviewed By: ezyang

Differential Revision: D22999930

Pulled By: mrshenli

fbshipit-source-id: 1b5b21d7cb274bd172fb541b2761f727ba06302c
2020-08-07 11:17:51 -07:00
04c62d4a06 [vulkan] Fix warnings: static_cast, remove unused (#42195)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42195

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22803035

Pulled By: IvanKobzarev

fbshipit-source-id: d7bf256437eccb5c421a7fd0aa8ec23a8fec0470
2020-08-07 11:12:54 -07:00
586399c03f Remove duplicate definitions of CppTypeToScalarType (#42640)
Summary:
I noticed that `TensorIteratorDynamicCasting.h` defines a helper meta-function `CPPTypeToScalarType` which does exactly the same thing as the `c10::CppTypeToScalarType` meta-function I added in gh-40927. No need for two identical definitions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42640

Reviewed By: malfet

Differential Revision: D22969708

Pulled By: ezyang

fbshipit-source-id: 8303c7f4a75ae248f393a4811ae9d2bcacab44ff
2020-08-07 11:02:42 -07:00
944ac133d0 [NNC] Remove VarBinding and go back to Let stmts (#42634)
Summary:
A while back, when commonizing the Let and LetStmt nodes, I ended up removing both and adding a separate VarBinding section to the Block. At the time I couldn't find a counterexample, but I found one today: dependencies between local Vars and Allocations may go in either direction, so we need to support interleaving of those statements.

So, I've removed all the VarBinding logic and reimplemented Let statements. ZolotukhinM I think you get to say "I told you so". No new tests; existing tests should cover this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42634

Reviewed By: mruberry

Differential Revision: D22969771

Pulled By: nickgg

fbshipit-source-id: a46c5193357902d0f59bf30ab103fe123b1503f1
2020-08-07 10:50:38 -07:00
2971bc23a6 Handle fused scale and bias in fake fp16 layernorm
Summary: Allow passing scale and bias to fake fp16 layernorm.

Test Plan: net_runner. Now matches glow's fused layernorm.

Reviewed By: hyuen

Differential Revision: D22952646

fbshipit-source-id: cf9ad055b14f9d0167016a18a6b6e26449cb4de8
2020-08-07 10:48:33 -07:00
dcee8933fb Fix some linking rules to allow path with whitespaces (#42718)
Summary:
Essentially, replace `-Wl,--whole-archive,$<TARGET_FILE:FOO>` with `-Wl,--whole-archive,\"$<TARGET_FILE:FOO>\"`, as TARGET_FILE might return a path containing whitespace

Fixes https://github.com/pytorch/pytorch/issues/42657

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42718

Reviewed By: ezyang

Differential Revision: D22993568

Pulled By: malfet

fbshipit-source-id: de878b17d20e35b51dd350f20d079c8b879f70b5
2020-08-07 10:23:23 -07:00
9c8021c0b1 Adds torch.linalg namespace (#42664)
Summary:
This PR adds the `torch.linalg` namespace as part of our continued effort to be more compatible with NumPy. The namespace is tested by adding a single function, `torch.linalg.outer`, and testing it in a new test suite, test_linalg.py. It follows the same pattern that https://github.com/pytorch/pytorch/pull/41911, which added the `torch.fft` namespace, did.
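A brief usage sketch of the new namespace as described here (`outer` is the single function this PR adds; `torch.ger` is the legacy equivalent):

```
import torch

a = torch.arange(1., 4.)
b = torch.arange(1., 3.)
print(torch.linalg.outer(a, b))  # 3x2 outer product, same as torch.ger(a, b)
```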

Future PRs will likely:

- add more functions to torch.linalg
- expand the testing done in test_linalg.py, including legacy functions, like torch.ger
- deprecate existing linalg functions outside of `torch.linalg` in preference to the new namespace

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42664

Reviewed By: ngimel

Differential Revision: D22991019

Pulled By: mruberry

fbshipit-source-id: 39258d9b116a916817b3588f160b141f956e5d0b
2020-08-07 10:18:30 -07:00
c9346ad3b8 [CPU] Added torch.bmm for complex tensors (#42383)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42383

Test Plan - Updated existing tests to run for complex dtypes as well.

Also added tests for `torch.addmm`, `torch.badmm`
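A minimal sketch of the newly supported complex batched matmul on CPU:

```
import torch

a = torch.randn(2, 3, 4, dtype=torch.cfloat)
b = torch.randn(2, 4, 5, dtype=torch.cfloat)
out = torch.bmm(a, b)
print(out.shape, out.dtype)  # torch.Size([2, 3, 5]) torch.cfloat
```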

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22960339

Pulled By: anjali411

fbshipit-source-id: 0805f21caaa40f6e671cefb65cef83a980328b7d
2020-08-07 10:04:20 -07:00
31ed468905 Fix cmake warning (#42707)
Summary:
If arguments in set_target_properties are not separated by whitespace, cmake raises a warning:
```
CMake Warning (dev) at cmake/public/cuda.cmake:269:
  Syntax Warning in cmake code at column 54

  Argument not separated from preceding token by whitespace.
```

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42707

Reviewed By: ailzhang

Differential Revision: D22988055

Pulled By: malfet

fbshipit-source-id: c3744f23b383d603788cd36f89a8286a46b6c00f
2020-08-07 09:57:21 -07:00
3c66a3795a [vulkan] Ops registration to TORCH_LIBRARY_IMPL (#42194)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42194

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22803036

Pulled By: IvanKobzarev

fbshipit-source-id: 2f402541aecf887d78f650bf05d758a0e403bc4d
2020-08-07 09:06:22 -07:00
4eb02add51 Blacklist to Blocklist in onnxifi_transformer (#42590)
Summary:
Fixes issues in https://github.com/pytorch/pytorch/issues/41704 and https://github.com/pytorch/pytorch/issues/41705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42590

Reviewed By: ailzhang

Differential Revision: D22977357

Pulled By: malfet

fbshipit-source-id: ab61b964cfdf8bd2b469f4ff8f6486a76bc697de
2020-08-07 08:05:32 -07:00
fb8aa0046c Add use_glow_aot, and include ONNX again as a backend for onnxifiGlow (#4787)
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/4787

Resurrect ONNX as a backend through onnxifiGlow (was killed as part of D16215878). Then look for the `use_glow_aot` argument in the Onnxifi op. If it's there and true, then we override whatever `backend_id` is set and use the ONNX backend.

Reviewed By: yinghai, rdzhabarov

Differential Revision: D22762123

fbshipit-source-id: abb4c3458261f8b7eeae3016dda5359fa85672f0
2020-08-07 04:31:24 -07:00
73642d9425 Updates alias pattern (and torch.absolute to use it) (#42586)
Summary:
This PR canonicalizes our (current) pattern for adding aliases to PyTorch. That pattern is:

- Copy the original function's native_functions.yaml entry, but replace the original function's name with the alias's.
- Implement the corresponding functions and have them redispatch to the original function.
- Add docstrings to the new functions that reference the original function.
- Update the alias_map in torch/csrc/jit/passes/normalize_ops.cpp.
- Update the op_alias_mappings in torch/testing/_internal/jit_utils.py.
- Add a test validating the alias's behavior is the same as the original function's.

An alternative pattern would be to use Python and C++ language features to alias ops directly. For example in Python:

```
torch.absolute = torch.abs
```

Let the pattern in this PR be the "native function" pattern, and the alternative pattern be the "language pattern." There are pros/cons to both approaches:

**Pros of the "Language Pattern"**
- torch.absolute is torch.abs.
- no (or very little) overhead for calling the alias.
- no native_functions.yaml redundancy or possibility of "drift" between the original function's entries and the alias's.

**Cons of the "Language Pattern"**
- requires manually adding doc entries
- requires updating Python alias and C++ alias lists
- requires hand writing alias methods on Tensor (technically this should require a C++ test to validate)
- no single list of all PyTorch ops -- have to check native_functions.yaml and one of the separate alias lists

**Pros of the "Native Function" pattern**

- alias declarations stay in native_functions.yaml
- doc entries are written as normal

**Cons of the "Native Function" pattern**

- aliases redispatch to the original functions
- torch.absolute is not torch.abs (requires writing test to validate behavior)
- possibility of drift between original's and alias's native_functions.yaml entries

While either approach is reasonable, I suggest the "native function" pattern since it preserves "native_functions.yaml" as a source of truth and minimizes the number of alias lists that need to be maintained. In the future, entries in native_functions.yaml may support an "alias" argument and replace whatever pattern we choose now.
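A hedged check of the trade-off noted above: under the native-function pattern the alias computes the same result but is a distinct callable.

```
import torch

t = torch.tensor([-1.5, 2.0])
assert torch.equal(torch.absolute(t), torch.abs(t))
print(torch.absolute is torch.abs)  # False under the native-function pattern
```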

Ops that are likely to use aliasing are:

- div (divide, true_divide)
- mul (multiply)
- bucketize (digitize)
- cat (concatenate)
- clamp (clip)
- conj (conjugate)
- rad2deg (degrees)
- trunc (fix)
- neg (negative)
- deg2rad (radians)
- round (rint)
- acos (arccos)
- acosh (arcosh)
- asin (arcsin)
- asinh (arcsinh)
- atan (arctan)
- atan2 (arctan2)
- atanh (arctanh)
- bartlett_window (bartlett)
- hamming_window (hamming)
- hann_window (hanning)
- bitwise_not (invert)
- gt (greater)
- ge (greater_equal)
- lt (less)
- le (less_equal)
- ne (not_equal)
- ger (outer)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42586

Reviewed By: ngimel

Differential Revision: D22991086

Pulled By: mruberry

fbshipit-source-id: d6ac96512d095b261ed2f304d7dddd38cf45e7b0
2020-08-07 00:24:06 -07:00
cb1ac94069 [blob reorder] Separate user embeddings and ad embeddings in large model loading script
Summary: Put user embeddings before ads embeddings in blobReorder, for flash verification reasons.

Test Plan:
```
buck run mode/opt-clang -c python.package_style=inplace sigrid/predictor/scripts:enable_large_model_loading -- --model_path_src="/home/$USER/models/" --model_path_dst="/home/$USER/models_modified/" --model_file_name="182560549_0.predictor"
```
https://www.internalfb.com/intern/anp/view/?id=320921 to check blobsOrder

Reviewed By: yinghai

Differential Revision: D22964332

fbshipit-source-id: 78b4861476a3c889a5ff62492939f717c307a8d2
2020-08-06 23:54:03 -07:00
9597af01ca Support iterating through an Enum class (#42661)
Summary:
[5/N] Implement Enum JIT support

Implement Enum class iteration
Add aten.ne for EnumType

Supported:
- Enum-typed function arguments
- Using Enum types and comparing them
- Getting name/value attrs of enums
- Using Enum values as constants
- Enum-typed return values
- Iterating through an Enum class (enum value list; see the sketch below)

TODO:
Support serialization and deserialization
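A hedged sketch of the newly supported iteration (the enum and function names are illustrative):

```
from enum import Enum

import torch

class Color(Enum):
    RED = 1
    GREEN = 2

@torch.jit.script
def sum_values() -> int:
    total = 0
    for c in Color:
        total += c.value
    return total

print(sum_values())  # 3
```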

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42661

Reviewed By: SplitInfinity

Differential Revision: D22977364

Pulled By: gmagogsfm

fbshipit-source-id: 1a0216f91d296119e34cc292791f9aef1095b5a8
2020-08-06 22:56:34 -07:00
952526804c Print TE CUDA kernel (#42692)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42692

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D22986112

Pulled By: bertmaher

fbshipit-source-id: 52ec3389535c8b276858bef8c470a59aeba4946f
2020-08-06 20:42:04 -07:00
a6c8730045 [ONNX] Add preprocess pass for onnx export (#41832)
Summary:
In `_jit_pass_onnx`, symbolic functions are called for each node for conversion. However, there are nodes that cannot be converted without additional context. For example, the number of outputs from split (and whether it is static or dynamic) is unknown until the point where it is unpacked by the listUnpack node. This pass does a preprocess and prepares the nodes such that enough context can be received by the symbolic function.
* After preprocessing, `_jit_pass_onnx` should have enough context to produce valid ONNX nodes, instead of half-baked nodes that rely on fixes from later post-passes.
* `_jit_pass_onnx_peephole` should be a pass that does ONNX-specific optimizations instead of ONNX-specific fixes.
* Producing more valid ONNX nodes in `_jit_pass_onnx` enables better utilization of ONNX shape inference (https://github.com/pytorch/pytorch/issues/40628).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41832

Reviewed By: ZolotukhinM

Differential Revision: D22968334

Pulled By: bzinodev

fbshipit-source-id: 8226f03c5b29968e8197d242ca8e620c6e1d42a5
2020-08-06 20:34:12 -07:00
9152f2f73a Optimization of Backward Implementation for Learnable Fake Quantize Per Tensor Kernels (CPU and GPU) (#42384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42384

In this diff, the original backward pass implementation is sped up by merging the 3 separate iterations that compute dX, dScale, and dZeroPoint into a single one. A native loop is used directly at the byte level (indexed via `strides`).

In the benchmark test on the operators, for an input of shape `3x3x256x256`, we have observed the following improvement in performance:
- original python operator: 1021037 microseconds
- original learnable kernel: 407576 microseconds
- optimized learnable kernel: 102584 microseconds
- original non-backprop kernel: 139806 microseconds

**Speedup from python operator**: ~10x
**Speedup from original learnable kernel**: ~4x
**Speedup from non-backprop kernel**: ~1.2x

Test Plan:
To assert correctness of the new kernel, on a devvm, enter the command

`buck test //caffe2/test:quantization -- learnable_backward_per_tensor`

To benchmark the operators, on a devvm, enter the command
1. Set the kernel size to 3x3x256x256 or a reasonable input size.
2. Run `buck test //caffe2/benchmarks/operator_benchmark/pt:quantization_test`
3. The relevant outputs are as follows:

(CPU)
```
# Benchmarking PyTorch: FakeQuantizePerTensorOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerTensorOpBenchmark_N3_C3_H256_W256_nbits4_cpu_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 1021036.957

# Benchmarking PyTorch: FakeQuantizePerTensorOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerTensorOpBenchmark_N3_C3_H256_W256_nbits4_cpu_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 102583.693

# Benchmarking PyTorch: FakeQuantizePerTensorOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerTensorOpBenchmark_N3_C3_H256_W256_nbits4_cpu_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 139806.086
```

(GPU)
```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cuda, op_type: py_module
Backward Execution Time (us) : 6548.350

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cuda, op_type: learnable_kernel
Backward Execution Time (us) : 1340.724

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cuda, op_type: original_kernel
Backward Execution Time (us) : 656.863
```

Reviewed By: vkuzo

Differential Revision: D22875998

fbshipit-source-id: cfcd62c327bb622270a783d2cbe97f00508c4a16
2020-08-06 19:54:17 -07:00
4959981cff [ONNX] Export tensor (#41872)
Summary:
Adding tensor symbolic for opset 9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41872

Reviewed By: houseroad

Differential Revision: D22968426

Pulled By: bzinodev

fbshipit-source-id: 70e1afc7397e38039e2030e550fd72f09bac7c7c
2020-08-06 19:33:11 -07:00
40ac95dd3c [ONNX] Update ONNX export of torch.where to support ByteTensor as input. (#42264)
Summary:
`torch.where` supports `ByteTensor` and `BoolTensor` types for the first input argument (the `condition` predicate). Currently, the ONNX exporter assumes that the first argument is a `BoolTensor`. This PR updates the export of `torch.where` to correctly support export when the first argument is a `ByteTensor`.
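A hedged illustration of the input this PR handles (at the time of this PR, `torch.where` accepted a uint8 condition directly; newer releases may require casting to bool):

```
import torch

cond = torch.tensor([1, 0, 1], dtype=torch.uint8)  # ByteTensor condition
x = torch.tensor([1., 2., 3.])
y = torch.zeros(3)
print(torch.where(cond, x, y))  # tensor([1., 0., 3.])
```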

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42264

Reviewed By: houseroad

Differential Revision: D22968473

Pulled By: bzinodev

fbshipit-source-id: 7306388c8446ef3faeb86dc89d72d1f72c1c2314
2020-08-06 19:16:39 -07:00
f9a6c14364 Fix sequence numbers in profiler output (#42565)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42565

After recent changes to the record function, we record more
ranges in the profiler output and also keep emitting sequence numbers for
all ranges.

Sequence numbers are used by external tools to correlate forward
and autograd ranges and with many ranges having the same sequence number
it becomes impossible to do this.

This PR ensures that we set sequence numbers only for the top-level
ranges and only in case when autograd is enabled.

Test Plan:
nvprof -fo trace.nvvp --profile-from-start off python test_script.py
test_script
https://gist.github.com/ilia-cher/2baffdd98951ee2a5f2da56a04fe15d0
then examining ranges in nvvp

Reviewed By: ngimel

Differential Revision: D22938828

Pulled By: ilia-cher

fbshipit-source-id: 9a5a076706a6043dfa669375da916a1708d12c19
2020-08-06 19:12:05 -07:00
dab9bbfce7 Move jit_profiling tests into test1 on Windows (#42650)
Summary:
The test takes 5 min to finish and 5 min to spin up the environment, so it doesn't make much sense to keep it as a separate config.
Limit those tests to run only when the `USE_CUDA` environment variable is set to true.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42650

Reviewed By: ailzhang

Differential Revision: D22967817

Pulled By: malfet

fbshipit-source-id: c6c26df140059491e7ff53ee9cbbc93433d2f36f
2020-08-06 16:16:40 -07:00
33519e19ab Fix 64-bit indexing in GridSampler (#41923)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41656

For the CPU version, this is a regression introduced in https://github.com/pytorch/pytorch/issues/10980 which vectorized the `grid_sampler_2d` implementation. It uses the AVX2 gather intrinsic which for `float` requires 32-bit indexing to match the number of floats in the AVX register. There is also an `i64gather_ps` variant but this only utilizes half of the vector width so would be expected to give worse performance in the more likely case where 32-bit indexing is acceptable. So, I've left the optimised AVX version as-is and reinstated the old non-vectorized version as a fallback.

For the CUDA version, this operation has never supported 64-bit indexing, so this isn't a regression. I've templated the kernel on index type and added 64-bit variants, although I gather in some places a simple `TORCH_CHECK(canUse32BitIndexMath(...))` is used instead. So, there is a decision to be made here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41923

Reviewed By: glaringlee

Differential Revision: D22925931

Pulled By: zou3519

fbshipit-source-id: 920816107aae26360c5e7f4e9c729fa9057268bb
2020-08-06 16:08:09 -07:00
eaace3e10e Skip CUDA benchmarks on nogpu configs (#42704)
Summary:
Avoids timeouts when the benchmark is launched on nogpu configs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42704

Reviewed By: mruberry

Differential Revision: D22987725

Pulled By: malfet

fbshipit-source-id: aa9aece16557c0af8e05e612277ae1d9e0173a51
2020-08-06 15:47:48 -07:00
6cb0807f88 Fixes ROCm CI (#42701)
Summary:
Per title. ROCm CI doesn't have MKL so this adds a couple missing test annotations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42701

Reviewed By: ngimel

Differential Revision: D22986273

Pulled By: mruberry

fbshipit-source-id: efa717e2e3771562e9e82d1f914e251918e96f64
2020-08-06 15:24:50 -07:00
cc596ac3a8 [JIT] Add debug dumps in between passes in graph executor. (#42688)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42688

Both the profiling executor and the legacy executor have the debug
logging now.

Ideally, if we had a pass manager, this could be done as a part of it,
but since we have none, I had to insert the debug statements manually.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D22981675

Pulled By: ZolotukhinM

fbshipit-source-id: 22b8789e860aa90d5802fc72a4113b22c6fc4da5
2020-08-06 15:16:35 -07:00
cdd7db1ffc Bound shape inferencer: fix int8fc scale and bias
Summary:
Previously, when inferring Int8FC, we failed to carry over the scale and zero point properly.

Also fixed int8 FC weight data type to be int8 instead of uint8 as that's what C2 actually uses.

Test Plan: Use net_runner to lower a single Int8Dequantize op. Previously, scale and bias would always be 1 and 0. Now the proper values are set.

Reviewed By: yinghai

Differential Revision: D22912186

fbshipit-source-id: a6620c3493e492bdda91da73775bfc9117db12d1
2020-08-06 14:40:25 -07:00
b44a10c179 List[index]::toOptionalStringRef (#42263)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42263

Allow a way to get a reference to the stored string in a `List<optional<string>>` without having to copy the string.
This for example improves perf of the map_lookup op by 3x.
ghstack-source-id: 109162026

Test Plan: unit tests

Reviewed By: ezyang

Differential Revision: D22830381

fbshipit-source-id: e6af2bc8cebd6e68794eb18daf183979bc6297ae
2020-08-06 13:44:33 -07:00
f22aa601ce All Gather and gather APIs for Python Objects (#42189)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42189

Rehash of https://github.com/pytorch/pytorch/pull/28811, which was several months old.

As part of addressing https://github.com/pytorch/pytorch/issues/23232, this PR adds support for the following APIs:

`allgather_object` and `gather_object` to support gather/allgather of generic, pickable Python objects. This has been a long-requested feature so PyTorch should provide these helpers built-in.

The methodology is what is proposed in the original issue:
1) Pickle object to ByteTensor using torch.save
2) Comm. tensor sizes
3) Copy local ByteTensor into a tensor of maximal size
4) Call tensor-based collectives on the result of (3)
5) Unpickle back into object using torch.load

Note that the API is designed to match the tensor-based collectives, except that it does not support `async_op`. For now, it is a blocking call. If we see demand to support `async_op`, we will have to make more progress on merging work/future to support this.

If this is a suitable approach, we can support `scatter`, `broadcast` in follow up PRs.
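A minimal sketch of steps (1) and (5) of the methodology above, with illustrative helper names (the diff's actual helpers may differ):

```
import io

import torch

def object_to_tensor(obj):
    # Step 1: pickle the object into a ByteTensor via torch.save.
    buf = io.BytesIO()
    torch.save(obj, buf)
    data = buf.getvalue()
    return torch.tensor(list(data), dtype=torch.uint8), len(data)

def tensor_to_object(tensor, size):
    # Step 5: unpickle back into an object via torch.load.
    raw = bytes(tensor[:size].tolist())
    return torch.load(io.BytesIO(raw))

t, n = object_to_tensor({"rank": 0, "msg": "hello"})
print(tensor_to_object(t, n))  # {'rank': 0, 'msg': 'hello'}
```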
ghstack-source-id: 109322433

Reviewed By: mrshenli

Differential Revision: D22785387

fbshipit-source-id: a265a44ec0aa3aaffc3c6966023400495904c7d8
2020-08-06 13:30:25 -07:00
1f689b6ef9 suppress all Autograd keys in AutoNonVariableTypeMode (#42610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42610

Fix for https://github.com/pytorch/pytorch/issues/42609: `AutoNonVariableTypeMode` should suppress all autograd dispatch keys, not just `Autograd` (e.g. `XLAPreAutograd`, `PrivateUse<N>_PreAutograd`)

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22963408

Pulled By: bhosmer

fbshipit-source-id: 2f3516580ce0c9136aff5e025285d679394f2f18
2020-08-06 13:15:42 -07:00
85a00c4c92 Skips spectral tests to prevent ROCm build from timing out (#42667)
Summary:
Per title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42667

Reviewed By: ailzhang

Differential Revision: D22978531

Pulled By: mruberry

fbshipit-source-id: 0c3ba116836ed6c433e2c6a0e1a0f2e3c94c7803
2020-08-06 12:41:32 -07:00
40b6dacb50 Delete dead is_named_tensor_only (#42672)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42672

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D22978389

Pulled By: ezyang

fbshipit-source-id: ef1302c57fe26a58a46ca1f4a4a7c3e2cdbfdc5d
2020-08-06 12:19:44 -07:00
5ca08b8891 Add benchmark for calculate_qparams (#42138)
Summary:
Adds a benchmark for `HistogramObserver.calculate_qparams` to the quantized op benchmarks. The next diff in this stack adds a ~15x speedup for this benchmark.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42138

Test Plan:
While in the folder `benchmarks/operator_benchmark`, the benchmark can be run using `python -m benchmark_all_quantized_test --operators HistogramObserverCalculateQparams`.

Benchmark results before speedup:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_affine
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_affine
Forward Execution Time (us) : 185818.566

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_symmetric
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_symmetric
Forward Execution Time (us) : 165325.916
```

Benchmark results after speedup:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_affine
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_affine
Forward Execution Time (us) : 12242.241

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_symmetric
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_symmetric
Forward Execution Time (us) : 12655.354
```

Reviewed By: supriyar

Differential Revision: D22779291

Pulled By: durumu

fbshipit-source-id: 1fe17d20eda5dd99e0e2590480142034c3574d4e
2020-08-06 11:10:12 -07:00
79de9c028a Remove VS2017 workaround for autocasting (#42352)
Summary:
Because VS2017 is no longer supported after https://github.com/pytorch/pytorch/pull/42144
cc: mcarilli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42352

Reviewed By: malfet

Differential Revision: D22962809

Pulled By: ngimel

fbshipit-source-id: 0346cde87bf5d617dfc0d7b34c92ac6ec5bbf568
2020-08-06 11:03:34 -07:00
e28a98a904 Turn on non ASCII string literals serialization (#40719)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40719

This is a follow-up patch to turn on this feature in order to handle breaking
forward compatibility.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D22457952

Pulled By: bzinodev

fbshipit-source-id: fac0dfed8b8b5fa2d52d342ee8cf06742959b3c5
2020-08-06 10:47:09 -07:00
57854e7f08 [JIT] Clone runOptimizations and similar functions for profiling executor. (#42656)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42656

This change will allow us to more freely experiment with pass pipelines
in the profiling executor without affecting passes in the legacy
executor. Also, it somewhat helps to keep all passes in one place to be
able to tell what's going on.

Currently this change should not affect any behavior as I copied the
passes exactly as they've been invoked before, but we will probably want
to change these pipelines in the near future.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D22971050

Pulled By: ZolotukhinM

fbshipit-source-id: f5bb60783a553c7b51c5343eec7f8fe40037ff99
2020-08-06 10:43:28 -07:00
a4dbc64800 Add documentation for PYTORCH_JIT_TYPE_VERBOSITY (#42241)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42241

that's it

Test Plan: docs only

Reviewed By: SplitInfinity

Differential Revision: D22818705

fbshipit-source-id: 22cdf4f23c3ed0a15c23f116457fc842d7f7b520
2020-08-06 10:39:39 -07:00
65066d779b Add fastrnns benchmark to CI and upload data to scribe (#42030)
Summary:
Run fastrnns benchmark using pytest-benchmark infra, then parse its json format and upload to scribe.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42030

Reviewed By: malfet

Differential Revision: D22970270

Pulled By: wconstab

fbshipit-source-id: 87da9b7ddf741da14b80d20779771d19123be3c5
2020-08-06 10:30:27 -07:00
a5af2434fe NVMified NE Eval
Summary:
This diff NVMifies the NE Eval Flow.
- It defines a `LoadNVM` operator which either
  - receives a list of nvm blobs, or
  - extracts the blobs that could be NVMified from the model.
- dumps NVMified blobs into NVM
- and deallocates them from DRAM
- NVMifies the Eval net on the dper and C2 backends

Specific NVMOp for SLS is pushed through different diffs.

Test Plan: flow-cli test-locally dper.workflows.evaluation.eval_workflow --parameters-file=/mnt/public/ehsaardestani/temp/small_model.json 2>&1 | tee log

Reviewed By: yinghai, amylittleyang

Differential Revision: D22469973

fbshipit-source-id: ed8379ad404e96d04ac05e580176d3aca984575b
2020-08-06 10:25:31 -07:00
049c1b97be pin numpy version to 1.18.5 (#42670)
Summary:
Using numpy 1.19.x instead of 1.18.x breaks certain unit tests.
Fixes https://github.com/pytorch/pytorch/issues/42561.  Likely also fixes https://github.com/pytorch/pytorch/issues/42583.

CC ezyang xw285cornell sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42670

Reviewed By: ezyang

Differential Revision: D22978369

Pulled By: malfet

fbshipit-source-id: ce1f35c7ba620c2b9dd10613f39354cebee8b87d
2020-08-06 10:01:56 -07:00
bcab2d6848 And type annotations for cpp_extension, utils.data, signal_handling (#42647)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42647

Reviewed By: ezyang

Differential Revision: D22967041

Pulled By: malfet

fbshipit-source-id: 35e124da0be56934faef56834a93b2b400decf66
2020-08-06 09:42:07 -07:00
608f99e4ea Fix cudnn version on build_environment of Windows CI (#42615)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42615

Reviewed By: mrshenli

Differential Revision: D22958660

Pulled By: malfet

fbshipit-source-id: 97a6a0e769143bd161667d0ee081ea0751995775
2020-08-06 09:36:24 -07:00
576aab5084 Bump up NCCL to 2.7.6 (#42645)
Summary:
Because 2.7.3 has a bug on GA100 which is fixed in 2.7.6

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42645

Reviewed By: malfet

Differential Revision: D22977280

Pulled By: mrshenli

fbshipit-source-id: 74779eff90d7d660a988ff33659f3a2237ca7e29
2020-08-06 08:45:59 -07:00
0642d17efc Enable C++ RPC tests (#42636)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42636

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D22967777

Pulled By: mrshenli

fbshipit-source-id: 8816c190a4ead7d7f906c140c8a4e76b992f5502
2020-08-06 07:15:02 -07:00
c30bc6d4d7 Update TensorPipe submodule (#42522)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42522

Main changes:
- Consolidated CMake files to have a single entry point, rather than having a specialized one for PyTorch.
- Changed the way the preprocessor flags are provided, and changed their name.

There were a few instances in PyTorch's CMake files where we were directly adding TensorPipe's source directory as an include path, which however doesn't contain the auto-generated header we now added. We fix that by linking those targets against the `tensorpipe` CMake target instead, so that the include paths defined by TensorPipe, which contain that auto-generated header, are picked up.

I'm turning off SHM and CMA for now because they have never been covered by the CI. I'll enable them in a separate PR so that if they turn out to be flaky we can revert that change without reverting this one.

Test Plan: CI

Reviewed By: malfet

Differential Revision: D22959472

fbshipit-source-id: 1959a41c4a66ef78bf0f3bd5e3964969a2a1bf67
2020-08-06 02:14:58 -07:00
bd458b7d02 Don't reference TensorPipe headers in our headers (#42521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42521

PyTorch's usage of TensorPipe is entirely wrapped within the RPC agent, which means we only need access to TensorPipe within the implementation (the .cpp file) and not in the interface (the .h file). We were however including the TensorPipe headers from the public PyTorch headers, which meant that PyTorch's downstream users had to have the TensorPipe include directories for that to work. By forward-declaring the symbols we need in the PyTorch header, and then including the TensorPipe header in the PyTorch implementation, we avoid "leaking" the dependency on TensorPipe, thus effectively keeping it private.

Test Plan: Imported from OSS

Reviewed By: beauby

Differential Revision: D22944238

Pulled By: lw

fbshipit-source-id: 2b12d59bd5beeaa439e50f9088a792c9d9bae9e8
2020-08-06 02:14:00 -07:00
a53fdaa23f Remove ProfiledType (#42570)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42570

ProfiledType doesn't do anything and is not used at the moment; removing it.

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D22938664

Pulled By: ilia-cher

fbshipit-source-id: 037c512938028f44258b702bbcde3f8c144f4aa0
2020-08-06 01:52:08 -07:00
ccfce9d4a9 Adds fft namespace (#41911)
Summary:
This PR creates a new namespace, torch.fft (torch::fft) and puts a single function, fft, in it. This function is a simplified version of NumPy's [numpy.fft.fft](https://numpy.org/doc/1.18/reference/generated/numpy.fft.fft.html?highlight=fft#numpy.fft.fft) that accepts no optional arguments. It is intended to demonstrate how to add and document functions in the namespace, and is not intended to deprecate the existing torch.fft function.

Adding this namespace was complicated by the existence of the torch.fft function in Python. Creating a torch.fft Python module makes this name ambiguous: does it refer to a function or module? If the JIT didn't exist, a solution to this problem would have been to make torch.fft refer to a callable class that mimicked both the function and module. The JIT, however, cannot understand this pattern. As a workaround it's required to explicitly `import torch.fft` to access the torch.fft.fft function in Python:

```
import torch.fft

t = torch.randn(128, dtype=torch.cdouble)
torch.fft.fft(t)
```

See https://github.com/pytorch/pytorch/issues/42175 for future work. Another possible future PR is to get the JIT to understand torch.fft as a callable class so it need not be imported explicitly to be used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41911

Reviewed By: glaringlee

Differential Revision: D22941894

Pulled By: mruberry

fbshipit-source-id: c8e0b44cbe90d21e998ca3832cf3a533f28dbe8d
2020-08-06 00:20:50 -07:00
644d787cd8 find rccl properly (#42072)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42072

Reviewed By: malfet

Differential Revision: D22969778

Pulled By: ezyang

fbshipit-source-id: 509178775d4d99460bcb147bcfced29f04cabdc4
2020-08-05 21:46:38 -07:00
23607441c2 Create CuBLAS PointerModeGuard (#42639)
Summary:
Adds an RAII guard for `cublasSetPointerMode()`.
Updates `dot_cuda` to use the guard, rather than exception catching.

Addresses this comment: https://github.com/pytorch/pytorch/pull/41377#discussion_r465754082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42639

Reviewed By: malfet

Differential Revision: D22969985

Pulled By: ezyang

fbshipit-source-id: b05c35d1884bb890f8767d6a4ef8b4724a329471
2020-08-05 21:40:42 -07:00
eb9ae7c038 Implement gpu_kernel_multiple_outputs (#37969)
Summary:
This PR introduces a variant of `gpu_kernel` for functions that return multiple values with `thrust::tuple`.
With this I simplified `prelu_cuda_backward_share_weights_kernel`.

### Why use `thrust::tuple`?
Because `std::tuple` does not support `operator=` in device code, which makes the implementation complicated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/37969

Reviewed By: paulshaoyuqiao

Differential Revision: D22868670

Pulled By: ngimel

fbshipit-source-id: eda0a29ac0347ad544b24bf60e3d809a7db1a929
2020-08-05 21:17:08 -07:00
1848b43c4d [NNC] Add loop unroll transformation (#42465)
Summary:
Unroll a loop with constant boundaries, replacing it with multiple
instances of the loop body. For example:

```
for x in 0..3:
  A[x] = x*2
```

becomes:

```
A[0] = 0
A[1] = 2
A[2] = 4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42465

Test Plan: `test_tensorexpr` unit tests.

Reviewed By: agolynski

Differential Revision: D22914418

Pulled By: asuhan

fbshipit-source-id: 72ca10d7c0b1ac7f9a3688ac872bd94a1c53dc51
2020-08-05 20:46:32 -07:00
3d46e02ea1 Add __torch_function__ for methods (#37091)
Summary:
According to pytorch/rfcs#3

From the goals in the RFC:

1. Support subclassing `torch.Tensor` in Python (done here)
2. Preserve `torch.Tensor` subclasses when calling `torch` functions on them (done here)
3. Use the PyTorch API with `torch.Tensor`-like objects that are _not_ `torch.Tensor`
   subclasses (done in https://github.com/pytorch/pytorch/issues/30730)
4. Preserve `torch.Tensor` subclasses when calling `torch.Tensor` methods. (done here)
5. Propagating subclass instances correctly also with operators, using
   views/slices/indexing/etc. (done here)
6. Preserve subclass attributes when using methods or views/slices/indexing. (done here)
7. A way to insert code that operates on both functions and methods uniformly
   (so we can write a single function that overrides all operators). (done here)
8. The ability to give external libraries a way to also define
   functions/methods that follow the `__torch_function__` protocol. (will be addressed in a separate PR)

This PR makes the following changes:

1. Adds the `self` argument to the arg parser.
2. Dispatches on `self` as well if `self` is not `nullptr`.
3. Adds a `torch._C.DisableTorchFunction` context manager to disable `__torch_function__`.
4. Adds a `torch::torch_function_enabled()` and `torch._C._torch_function_enabled()` to check the state of `__torch_function__`.
5. Dispatches all `torch._C.TensorBase` and `torch.Tensor` methods via `__torch_function__`.

TODO:

- [x] Sequence Methods
- [x] Docs
- [x] Tests

Closes https://github.com/pytorch/pytorch/issues/28361

Benchmarks in https://github.com/pytorch/pytorch/pull/37091#issuecomment-633657778
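A hedged sketch of what change (5) enables, using the documented override pattern (the subclass here is made up):

```
import torch

class LoggingTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        if kwargs is None:
            kwargs = {}
        print(f"calling {func.__name__}")
        return super().__torch_function__(func, types, args, kwargs)

t = torch.randn(3).as_subclass(LoggingTensor)
torch.sin(t)   # prints "calling sin"
t.sum()        # methods dispatch through __torch_function__ too
```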

Pull Request resolved: https://github.com/pytorch/pytorch/pull/37091

Reviewed By: ngimel

Differential Revision: D22765678

Pulled By: ezyang

fbshipit-source-id: 53f8aa17ddb8b1108c0997f6a7aa13cb5be73de0
2020-08-05 20:44:13 -07:00
92b7347fd7 Enforce counter value to double type in rowwise_counter
Summary:
Enforce counter value to double type in rowwise_counter.

**Context:**
The existing implementation uses float for the counter value, but due to the precision limit of single-precision floating point [1], we observed in earlier experiments that the counter can't increment beyond 16777216.0 (i.e., the max value is 16777216.0). We decided to enforce double type to avoid this issue.

[1] https://stackoverflow.com/questions/12596695/why-does-a-float-variable-stop-incrementing-at-16777216-in-c
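A quick demonstration of the float32 limit cited above:

```
import torch

c = torch.tensor(16777216.0, dtype=torch.float32)
print(c + 1 == c)  # tensor(True): the increment is lost at 2**24

c64 = torch.tensor(16777216.0, dtype=torch.float64)
print(c64 + 1 == c64)  # tensor(False): double keeps counting
```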

Test Plan:
op test
```
ruixliu@devvm1997:~/fbsource/fbcode/caffe2/caffe2/python/operator_test(f0b0b48c)$ buck test :rowwise_counter_test
Trace available for this run at /tmp/testpilot.20200728-083200.729292.log
TestPilot test runner for Facebook. See https://fburl.com/testpilot for details.
Testpilot build revision cd2638f1f47250eac058b8c36561760027d16add fbpkg f88726c8ebde4ba288e1172a348c7f46 at Mon Jul 27 18:11:43 2020 by twsvcscm from /usr/local/fbprojects/packages/testinfra.testpilot/887/t.par
Discovering tests
Running 1 test
Started new test run: https://our.intern.facebook.com/intern/testinfra/testrun/7881299364977047
      ✓ caffe2/caffe2/python/operator_test:rowwise_counter_test - test_rowwise_counter (caffe2.caffe2.python.operator_test.rowwise_counter_test.TestRowWiseCounter) 0.265 1/1 (passed)
      ✓ caffe2/caffe2/python/operator_test:rowwise_counter_test - main 14.414 (passed)
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/7881299364977047
Summary (total time 18.51s):
  PASS: 2
  FAIL: 0
  SKIP: 0
  FATAL: 0
  TIMEOUT: 0
  OMIT: 0
```

optimizer test
```
ruixliu@devvm1997:~/fbsource/fbcode/caffe2/caffe2/python(7d66fbb9)$ buck test :optimizer_test
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/7036874434841896
Summary (total time 64.87s):
  PASS: 48
  FAIL: 0
  SKIP: 24
    caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestMomentumSgd)
    caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestGFtrl)
    caffe2/caffe2/python:optimizer_test - test_caffe2_cpu_vs_numpy (caffe2.caffe2.python.optimizer_test.TestYellowFin)
    caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestSparseRAdam)
    caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestRowWiseAdagradWithCounter)
    caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestAdagrad)
    caffe2/caffe2/python:optimizer_test - test_caffe2_gpu_vs_numpy (caffe2.caffe2.python.optimizer_test.TestYellowFin)
    caffe2/caffe2/python:optimizer_test - testDense (caffe2.caffe2.python.optimizer_test.TestRowWiseAdagrad)
    caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestFtrl)
    caffe2/caffe2/python:optimizer_test - testSparse (caffe2.caffe2.python.optimizer_test.TestRmsProp)
    ...and 14 more not shown...
  FATAL: 0
  TIMEOUT: 0
  OMIT: 0
```

param download test
```
ruixliu@devvm1997:~/fbsource/fbcode/caffe2/caffe2/fb/net_transforms/tests(7ef20a38)$ sudo buck test :param_download_test
Finished test run: Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/6473924481526935
```

e2e flow:
f208394929
f207991149
f207967273

ANP notebook to check the counter value loaded from the flows
https://fburl.com/anp/5fdcbnoi

screenshot of the loaded counter (note that counter max is larger than 16777216.0)

{F250926501}

Reviewed By: ellie-wen

Differential Revision: D22711514

fbshipit-source-id: 426fed7415270aa3f276dda8141907534734337f
2020-08-05 20:40:51 -07:00
c14fbc36ed Update docs about CUDA stream priority (#41364)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41364

Reviewed By: malfet

Differential Revision: D22962856

Pulled By: ngimel

fbshipit-source-id: 47f65069516cb555579455e8680deb937fc1f544
2020-08-05 20:03:18 -07:00
ddb8849ffc Fix method stub used for fixing mypy issue to work with pylint (#42356)
Summary:
Turn the method stub into a module-level function.

Since _forward_unimplemented is defined within the nn.Module class,
pylint (correctly) complains about not implementing this method in subclasses.

Fixes https://github.com/pytorch/pytorch/issues/42305

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42356

Reviewed By: mruberry

Differential Revision: D22867255

Pulled By: ezyang

fbshipit-source-id: ccf3e45e359d927e010791fadf70b2ef231ddb0b
2020-08-05 19:57:38 -07:00
04d7e1679d [quant] Quantized Average Pool Refactoring (#42009)
Summary:
**cc** z-a-f. Refactor `qavg_pool(2,3)d_nhwc_kernel` as mentioned in https://github.com/pytorch/pytorch/issues/40316.

# Benchmarks
## Python
Before | After
![before_after](https://user-images.githubusercontent.com/37529096/88401550-fea7ba80-ce1d-11ea-81c5-3ae912e81e8f.png)
## C++
![before_after_cpp](https://user-images.githubusercontent.com/37529096/88401845-5ba37080-ce1e-11ea-9bf2-3c95ac2b4b49.png)
## Notes
- For `qint8` and `quint8` the benchmarks show a noticeable ~2x speedup, at least when `channels > 64`.
## Reproduce
### Python
```
import time
import numpy as np
import torch
from termcolor import colored
def time_avg_pool2d(X, kernel, stride, padding, ceil_mode, count_include_pad, divisor_override, iterations):
    X, (scale, zero_point, torch_type) = X
    qX_nchw = torch.quantize_per_tensor(torch.from_numpy(X), scale=scale,
                                    zero_point=zero_point, dtype=torch_type)
    qX_nhwc = qX_nchw.contiguous(memory_format=torch.channels_last)
    assert qX_nhwc.stride() != tuple(sorted(qX_nhwc.stride(), reverse=True))  # not in default (descending) contiguous stride order
    assert(qX_nchw.is_contiguous(memory_format=torch.contiguous_format))
    assert(qX_nhwc.is_contiguous(memory_format=torch.channels_last))
    start = time.time()
    for _ in range(iterations):
        X_hat = torch.nn.quantized.functional.avg_pool2d(qX_nchw, kernel_size=kernel, stride=stride, padding=padding, ceil_mode=ceil_mode,
                count_include_pad=count_include_pad, divisor_override=divisor_override)
    qnchw_end = time.time() - start
    start = time.time()
    for _ in range(iterations):
        X_hat = torch.nn.quantized.functional.avg_pool2d(qX_nhwc, kernel_size=kernel, stride=stride, padding=padding, ceil_mode=ceil_mode,
                count_include_pad=count_include_pad, divisor_override=divisor_override)
    qnhwc_end = time.time() - start
    return qnchw_end*1000/iterations, qnhwc_end*1000/iterations

def time_avg_pool3d(X, kernel, stride, padding, ceil_mode, count_include_pad, divisor_override,  iterations):
    X, (scale, zero_point, torch_type) = X
    qX_ncdhw = torch.quantize_per_tensor(torch.from_numpy(X), scale=scale,
                                    zero_point=zero_point, dtype=torch_type)
    qX_ndhwc = qX_ncdhw.contiguous(memory_format=torch.channels_last_3d)
    assert qX_ndhwc.stride() != tuple(sorted(qX_ndhwc.stride(), reverse=True))  # not in default (descending) contiguous stride order
    assert(qX_ncdhw.is_contiguous(memory_format=torch.contiguous_format))
    assert(qX_ndhwc.is_contiguous(memory_format=torch.channels_last_3d))
    start = time.time()
    for _ in range(iterations):
        X_hat = torch.nn.quantized.functional.avg_pool3d(qX_ncdhw, kernel_size=kernel, stride=stride, padding=padding, ceil_mode=ceil_mode,
                count_include_pad=count_include_pad, divisor_override=divisor_override)
    qncdhw_end = time.time() - start
    start = time.time()
    for _ in range(iterations):
        X_hat = torch.nn.quantized.functional.avg_pool3d(qX_ndhwc, kernel_size=kernel, stride=stride, padding=padding, ceil_mode=ceil_mode,
                count_include_pad=count_include_pad, divisor_override=divisor_override)
    qndhwc_end = time.time() - start
    return qncdhw_end*1000/iterations, qndhwc_end*1000/iterations

iterations = 10000
print("iterations = {}".format(iterations))
print("Benchmark", "Time(ms)", sep="\t\t\t\t\t")
for torch_type in (torch.qint8, torch.quint8, torch.qint32):
    for channel in (4,8,64,256):
        X = np.random.rand(1, channel, 56, 56).astype(np.float32), (0.5, 1, torch_type)
        ts = time_avg_pool2d(X, 4, None, 0, True, True, None, iterations)
        print(colored("avg_pool2d({}, {}, {})".format(str(torch_type), channel, "nchw"), 'green'), colored(ts[0], 'yellow'), sep="\t")
        print(colored("avg_pool2d({}, {}, {})".format(str(torch_type), channel, "nhwc"), 'green'), colored(ts[1], 'yellow'), sep="\t")
for torch_type in (torch.qint8, torch.quint8, torch.qint32):
    for channel in (4,8,64,256):
        X = np.random.rand(1, channel, 56, 56, 4).astype(np.float32), (0.5, 1, torch_type)
        ts = time_avg_pool3d(X, 4, None, 0, True, True, None, iterations)
        print(colored("avg_pool3d({}, {}, {})".format(str(torch_type), channel, "ncdhw"), 'green'), colored(ts[0], 'yellow'), sep="\t")
        print(colored("avg_pool3d({}, {}, {})".format(str(torch_type), channel, "ndhwc"), 'green'), colored(ts[1], 'yellow'), sep="\t")
```
### C++
1. `git clone https://github.com/google/benchmark.git`
2. `git clone https://github.com/google/googletest.git benchmark/googletest`

```
# CMakeLists.txt
cmake_minimum_required(VERSION 3.10 FATAL_ERROR)
project(time_avg_pool VERSION 0.1.0)

find_package(Torch REQUIRED)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")
add_subdirectory(benchmark)

add_executable(time_average_pool time_average_pool.cpp)
target_link_libraries(time_average_pool ${TORCH_LIBRARIES})
set_property(TARGET time_average_pool PROPERTY CXX_STANDARD 14)
target_link_libraries(time_average_pool benchmark::benchmark)
```

```
// time_average_pool.cpp
#include <benchmark/benchmark.h>
#include <torch/torch.h>

torch::Device device(torch::kCPU);

static void BM_TORCH_QAVG_POOL2D_NCHW_SINGLE_THREADED(benchmark::State& state) {
  torch::init_num_threads();
  torch::set_num_threads(1);
  auto x_nchw = torch::rand({1, state.range(0), 56, 56}, device);
  auto qx_nchw = torch::quantize_per_tensor(x_nchw, 0.5, 1, torch::kQUInt8);
  torch::Tensor X_hat;
  for (auto _ : state)
    X_hat = torch::nn::functional::avg_pool2d(
        qx_nchw,
        torch::nn::AvgPool2dOptions({4, 4}).ceil_mode(true).count_include_pad(
            true));
}

static void BM_TORCH_QAVG_POOL2D_NHWC_SINGLE_THREADED(benchmark::State& state) {
  torch::init_num_threads();
  torch::set_num_threads(1);
  auto x_nchw = torch::rand({1, state.range(0), 56, 56}, device);
  auto qx_nchw = torch::quantize_per_tensor(x_nchw, 0.5, 1, torch::kQUInt8);
  auto qx_nhwc = qx_nchw.contiguous(torch::MemoryFormat::ChannelsLast);
  torch::Tensor X_hat;
  for (auto _ : state)
    X_hat = torch::nn::functional::avg_pool2d(
        qx_nhwc,
        torch::nn::AvgPool2dOptions({4, 4}).ceil_mode(true).count_include_pad(
            true));
}

static void BM_TORCH_QAVG_POOL2D_NCHW(benchmark::State& state) {
  auto x_nchw = torch::rand({1, state.range(0), 56, 56}, device);
  auto qx_nchw = torch::quantize_per_tensor(x_nchw, 0.5, 1, torch::kQUInt8);
  torch::Tensor X_hat;
  for (auto _ : state)
    X_hat = torch::nn::functional::avg_pool2d(
        qx_nchw,
        torch::nn::AvgPool2dOptions({4, 4}).ceil_mode(true).count_include_pad(
            true));
}

static void BM_TORCH_QAVG_POOL2D_NHWC(benchmark::State& state) {
  auto x_nchw = torch::rand({1, state.range(0), 56, 56}, device);
  auto qx_nchw = torch::quantize_per_tensor(x_nchw, 0.5, 1, torch::kQUInt8);
  auto qx_nhwc = qx_nchw.contiguous(torch::MemoryFormat::ChannelsLast);
  torch::Tensor X_hat;
  for (auto _ : state)
    X_hat = torch::nn::functional::avg_pool2d(
        qx_nhwc,
        torch::nn::AvgPool2dOptions({4, 4}).ceil_mode(true).count_include_pad(
            true));
}

static void BM_TORCH_QAVG_POOL3D_NCDHW_SINGLE_THREADED(
    benchmark::State& state) {
  torch::init_num_threads();
  torch::set_num_threads(1);
  auto x_ncdhw = torch::rand({1, state.range(0), 56, 56, 4}, device);
  auto qx_ncdhw = torch::quantize_per_tensor(x_ncdhw, 0.5, 1, torch::kQUInt8);
  torch::Tensor X_hat;
  for (auto _ : state)
    X_hat = torch::nn::functional::avg_pool3d(
        qx_ncdhw,
        torch::nn::AvgPool3dOptions({5, 5, 5})
            .ceil_mode(true)
            .count_include_pad(true));
}

static void BM_TORCH_QAVG_POOL3D_NDHWC_SINGLE_THREADED(
    benchmark::State& state) {
  torch::init_num_threads();
  torch::set_num_threads(1);
  auto x_ncdhw = torch::rand({1, state.range(0), 56, 56, 4}, device);
  auto qx_ncdhw = torch::quantize_per_tensor(x_ncdhw, 0.5, 1, torch::kQUInt8);
  auto qx_ndhwc = qx_ncdhw.contiguous(torch::MemoryFormat::ChannelsLast3d);
  torch::Tensor X_hat;
  for (auto _ : state)
    X_hat = torch::nn::functional::avg_pool3d(
        qx_ndhwc,
        torch::nn::AvgPool3dOptions({5, 5, 5})
            .ceil_mode(true)
            .count_include_pad(true));
}

static void BM_TORCH_QAVG_POOL3D_NCDHW(benchmark::State& state) {
  auto x_ncdhw = torch::rand({1, state.range(0), 56, 56, 4}, device);
  auto qx_ncdhw = torch::quantize_per_tensor(x_ncdhw, 0.5, 1, torch::kQUInt8);
  torch::Tensor X_hat;
  for (auto _ : state)
    X_hat = torch::nn::functional::avg_pool3d(
        qx_ncdhw,
        torch::nn::AvgPool3dOptions({5, 5, 5})
            .ceil_mode(true)
            .count_include_pad(true));
}

static void BM_TORCH_QAVG_POOL3D_NDHWC(benchmark::State& state) {
  auto x_ncdhw = torch::rand({1, state.range(0), 56, 56, 4}, device);
  auto qx_ncdhw = torch::quantize_per_tensor(x_ncdhw, 0.5, 1, torch::kQUInt8);
  auto qx_ndhwc = qx_ncdhw.contiguous(torch::MemoryFormat::ChannelsLast3d);
  torch::Tensor X_hat;
  for (auto _ : state)
    X_hat = torch::nn::functional::avg_pool3d(
        qx_ndhwc,
        torch::nn::AvgPool3dOptions({5, 5, 5})
            .ceil_mode(true)
            .count_include_pad(true));
}

BENCHMARK(BM_TORCH_QAVG_POOL2D_NCHW)->RangeMultiplier(8)->Range(4, 256);
BENCHMARK(BM_TORCH_QAVG_POOL2D_NHWC)->RangeMultiplier(8)->Range(4, 256);
BENCHMARK(BM_TORCH_QAVG_POOL3D_NCDHW)->RangeMultiplier(8)->Range(4, 256);
BENCHMARK(BM_TORCH_QAVG_POOL3D_NDHWC)->RangeMultiplier(8)->Range(4, 256);
BENCHMARK(BM_TORCH_QAVG_POOL2D_NCHW_SINGLE_THREADED)
    ->RangeMultiplier(8)
    ->Range(4, 256);
BENCHMARK(BM_TORCH_QAVG_POOL2D_NHWC_SINGLE_THREADED)
    ->RangeMultiplier(8)
    ->Range(4, 256);
BENCHMARK(BM_TORCH_QAVG_POOL3D_NCDHW_SINGLE_THREADED)
    ->RangeMultiplier(8)
    ->Range(4, 256);
BENCHMARK(BM_TORCH_QAVG_POOL3D_NDHWC_SINGLE_THREADED)
    ->RangeMultiplier(8)
    ->Range(4, 256);
BENCHMARK_MAIN();
```

3. `mkdir build && cd build`
4. ```cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_PREFIX_PATH=`python -c 'import torch;print(torch.utils.cmake_prefix_path)'` .. ```
5. `cmake --build . --config Release`
6. `./time_average_pool`

# Further notes
- I've used `istrideB, istrideD, istrideH, strideW, strideC` to match `_qadaptive_avg_pool_kernel` since there's some code duplication there as mentioned in https://github.com/pytorch/pytorch/issues/40316.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42009

Reviewed By: pbelevich

Differential Revision: D22794441

Pulled By: z-a-f

fbshipit-source-id: 16710202811a1fbe1c99ea4d9b45876d6d28a8da
2020-08-05 19:44:42 -07:00
9add11ffc1 Fix IS_SPMM_AVAILABLE macro definition (#42643)
Summary:
This should fix the CUDA 11 on Windows build issue.

`defined` is a preprocessor operator, not a function, and the C/C++ standards leave the behavior undefined when `defined` appears as a result of macro substitution, so it cannot be used reliably that way.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42643

Reviewed By: pbelevich, xw285cornell

Differential Revision: D22963420

Pulled By: malfet

fbshipit-source-id: cccf7db0d03cd62b655beeb154db9e628aa749f0
2020-08-05 18:56:23 -07:00
509fb77b70 Adjust bound_shape_inferencer to take 4 inputs for FCs (#41934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41934

The model exported from the online training workflow with int8 quantization contains FCs with four inputs; the extra input is the quant_param blob. This diff adjusts the bound_shape_inferencer and the int8 op schema to infer shape info for the quant_param input.

Test Plan:
```
buck test caffe2/caffe2/opt:bound_shape_inference_test
```

Reviewed By: yinghai

Differential Revision: D22683554

fbshipit-source-id: 684d1433212a528120aba1c37d27e26b6a31b403
2020-08-05 18:44:48 -07:00
9ea9d1b52e [fbs][2/n] Remove .python3 markers
Test Plan:
`xbgr '\.python3'` shows only one (dead) usage of this file:
https://www.internalfb.com/intern/diffusion/FBS/browse/master/fbcode/python/repo_stats/buck.py?commit=9a8dd3243207819325d520c208218f6ab69e4e49&lines=854

Reviewed By: lisroach

Differential Revision: D22955631

fbshipit-source-id: e686d9157c08c347d0ce4acdd05bd7ab29ff7df5
2020-08-05 18:25:50 -07:00
5d7c3f92b9 Issue warning instead of error when parsing Enum while enum support is not enabled (#42623)
Summary:
Returning None rather than raising an error better matches the previous behavior.

Fixes https://fburl.com/yrrvtes3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42623

Reviewed By: ajaech

Differential Revision: D22957498

Pulled By: gmagogsfm

fbshipit-source-id: 61dabc6d23ad44e75bd35d837768bdb6fe71eece
2020-08-05 17:55:29 -07:00
50f0d2b97d quant: add q_batchnorm_1d op (#42491)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42491

Hooks up quantized batchnorm_1d to the quantized_bn kernel. Eager mode
hookup will be in a future PR, and graph mode should work after this PR.

Note: currently the implementation is ~2x slower on the benchmark than q_batch_norm2d
because we convert back to contiguous memory format at the end, since
channels_last is only defined for rank >= 4. If further optimization is
needed, that can be a separate PR (we will need the NHWC folks to check whether
there is a workaround). Meanwhile, having this is better than having nothing.
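
For illustration, a hedged standalone sketch of the rank constraint mentioned above (not the kernel code; the exact error text may differ): channels_last applies only to rank-4 tensors, which is why a rank-3 batchnorm_1d input needs a layout round-trip.
```
import torch

x3d = torch.randn(2, 4, 8)  # (N, C, L): rank 3
try:
    x3d.contiguous(memory_format=torch.channels_last)
except RuntimeError as e:
    print("rank-3 rejected:", e)  # channels_last requires rank >= 4

# promoting to rank 4 makes the format usable, at the cost of extra copies
x4d = x3d.unsqueeze(-1).contiguous(memory_format=torch.channels_last)
out = x4d.squeeze(-1).contiguous()  # back to the default layout
```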

Context: There have been both internal and external requests for various
quantized BN1d use cases.

Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_batch_norm_1d_2d_3d
python test/test_quantization.py TestQuantizedOps.test_batch_norm_1d_2d_3d_relu
python test/test_quantization.py TestQuantizeJitOps.test_qbatch_norm

// performance:
// https://gist.github.com/vkuzo/73a07c0f24c05f5804990d9ebfaecf5e

```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D22926254

fbshipit-source-id: 2780e6a81cd13a7455f6ab6e5118c22850a97a12
2020-08-05 17:20:18 -07:00
54ffb05eff better error message between C2 and glow (#41603)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41603

Pull Request resolved: https://github.com/pytorch/glow/pull/4704

Previously, in the glow onnxifi path, when an error was encountered we logged it to stderr and just returned ONNXIFI_STATUS_INTERNAL_ERROR to C2; C2 then did CAFFE2_ENFORCE_EQUAL(return_code, ONNXIFI_STATUS_SUCCESS). The error message that eventually reached the user was something like

   [enforce fail at onnxifi_op.cc:545] eventStatus == ONNXIFI_STATUS_SUCCESS. 1030 vs 0

This diff adds plumbing to get human readable error message out of glow into C2.

Test Plan:
Run the ads replayer and overload it with traffic. The error message sent back to the client used to be

  E0707 00:57:45.697196 3709559 Caffe2DisaggAcceleratorTask.cpp:493] During running REMOTE_OTHER net: [enforce fail at onnxifi_op.cc:545] eventStatus == ONNXIFI_STATUS_SUCCESS. 1030 vs 0 (Error from operator:....

Now it's

```
E0707 16:46:48.366263 1532943 Client.cpp:966] Exception when calling caffe2_run_disagg_accelerator on remote predictor for model 190081310_0 : apache::thrift::TApplicationException: c10::Error: [enforce fail at onnxifi_op.cc:556] .
Error code: RUNTIME_REQUEST_REFUSED
Error message: The number of allowed queued requests has been exceeded. queued requests: 100 allowed requests: 100
Error return stack:
glow/glow/lib/Runtime/HostManager/HostManager.cpp:673
glow/glow/lib/Onnxifi/HostMana (Error from operator:...
```

Reviewed By: gcatron, yinghai

Differential Revision: D22416857

fbshipit-source-id: 564bc7644d9666eb660725c2dca5637affae9b73
2020-08-05 16:25:13 -07:00
aa4e91a6dc Fix TestSparse.test_bmm_windows_error when CUDA is not available (#42626)
Summary:
Refactor the common pattern of `(torch.version.cuda and [int(x) for x in torch.version.cuda.split(".")] >= [a, b])` into a `_get_torch_cuda_version()` helper function.
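
A hedged sketch of what such a helper might look like (the real one lives in the test utilities; everything here beyond the function name is illustrative):
```
import torch

def _get_torch_cuda_version():
    # torch.version.cuda is None on CPU-only builds
    if torch.version.cuda is None:
        return (0, 0)
    return tuple(int(x) for x in torch.version.cuda.split("."))

# call sites compare tuples instead of re-parsing the version string
if _get_torch_cuda_version() >= (11, 0):
    pass  # e.g., enable or skip a CUDA-11-specific test
```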

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42626

Reviewed By: seemethere

Differential Revision: D22956149

Pulled By: malfet

fbshipit-source-id: 897c55965e53b477cd20f69e8da15d90489035de
2020-08-05 16:07:35 -07:00
5023995292 fix output size adjustment for onnxifi_op
Summary: This breaks if we cut the net at certain int8 op boundaries.

Test Plan: Used net_runner to lower a single Int8Quantize op; it used to break and now works.

Reviewed By: yinghai

Differential Revision: D22912178

fbshipit-source-id: ca306068c9768df84c1cfa8b34226a1330e19912
2020-08-05 15:55:46 -07:00
102abb877c Reland D22939119: "[TensorExpr] Fix a way we were creating np arrays in tests." (#42608)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42608

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D22952745

Pulled By: ZolotukhinM

fbshipit-source-id: fd6a3efbfcaa876a2f4d27b507fe0ccdcb55a002
2020-08-05 15:14:23 -07:00
2501e2b12d [RPC tests] Run DdpUnderDistAutogradTest and DdpComparisonTest with fork too (#42528)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42528

It seems it was an oversight that they weren't run. This allows us to simplify our auto-generation logic, as now all test suites are run in both modes.
ghstack-source-id: 109229969

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D22922151

fbshipit-source-id: 0766a6970c927efb04eee4894b73d4bcaf60b97f
2020-08-05 15:10:29 -07:00
4da602b004 [RPC tests] Generate test classes automatically (#42527)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42527

ghstack-source-id: 109229468

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D22864698

fbshipit-source-id: 6a55f3201c544f0173493b38699a2c7e95ac1bbc
2020-08-05 15:10:26 -07:00
d7516ccfac [RPC tests] Enroll TensorPipe in missing test suites (#40823)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40823

Summary of the entire stack:
--

This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.

This refactoring aims to address these problems by:
- Avoiding global state, defaults/override, traps, if/elses, ... and having a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite (see the sketch below).
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).

It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...
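
A hedged, self-contained sketch of the mixin design described above (class and method names are illustrative, not the actual PyTorch test fixtures):
```
import abc
import unittest

class RpcAgentFixture(abc.ABC):
    """Abstract fixture: each agent supplies its own backend options."""
    @property
    @abc.abstractmethod
    def rpc_backend_options(self):
        ...

class TensorPipeFixture(RpcAgentFixture):
    @property
    def rpc_backend_options(self):
        return {"backend": "tensorpipe"}  # placeholder options

class GenericRpcTest:
    """Agent-agnostic suite; relies on a fixture being mixed in."""
    def test_options_present(self):
        self.assertIsNotNone(self.rpc_backend_options)

# one small entry-point class per (agent, suite) combination:
class TensorPipeRpcTest(TensorPipeFixture, GenericRpcTest, unittest.TestCase):
    pass
```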

Summary of this commit
--
As it is now easier to spot that the TensorPipe agent wasn't being run on some test suite, we fix that. We keep this change for last so that if those tests turn out to be flaky and must be reverted this won't affect the rest of the stack.
ghstack-source-id: 109229469

Test Plan: Sandcastle and CircleCI

Reviewed By: pritamdamania87

Differential Revision: D22309432

fbshipit-source-id: c433a6a49a7b6737e0df4cd953f3dfde290f20b8
2020-08-05 15:10:23 -07:00
2e7b464c43 [RPC tests] Remove global TEST_CONFIG (#40822)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40822

Summary of the entire stack:
--

(Identical to the stack summary in #40823 above; see that commit for the full rationale behind this refactoring stack.)

Summary of this commit
--
This is the last step of removing TEST_CONFIG. As there was no one left using it, there is really not much to it.
ghstack-source-id: 109229471

Test Plan: Sandcastle and CircleCI

Reviewed By: pritamdamania87

Differential Revision: D22307778

fbshipit-source-id: 0d9498d9367eec671e0a964ce693015f73c5638c
2020-08-05 15:10:20 -07:00
e7c7eaab82 [RPC tests] Move some functions to methods of fixture (#40821)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40821

Summary of the entire stack:
--

(Identical to the stack summary in #40823 above; see that commit for the full rationale behind this refactoring stack.)

Summary of this commit
--
This change continues the work towards removing TEST_CONFIG, by taking a few functions that were accepting the agent name (as obtained from TEST_CONFIG) and then did a bunch of if/elses on it, and replace them by new abstract methods on the fixtures, so that these functions become "decentralized".
ghstack-source-id: 109229472

Test Plan: Sandcastle and CircleCI

Reviewed By: pritamdamania87

Differential Revision: D22307776

fbshipit-source-id: 9e1f6edca79aacf0bcf9d83d50ce9e0d2beec0dd
2020-08-05 15:10:17 -07:00
2acef69ce3 [RPC tests] Make generic fixture an abstract base class (#40820)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40820

Summary of the entire stack:
--

(Identical to the stack summary in #40823 above; see that commit for the full rationale behind this refactoring stack.)

Summary of this commit
--
Now that no one is using the generic fixture anymore (i.e., the fixture that looks up the agent's name in the global TEST_CONFIG) we can make it abstract, i.e., have its methods become no-ops and add decorators that will require all subclasses to provide new implementations of those methods. This is a first step towards removing TEST_CONFIG.
ghstack-source-id: 109229475

Test Plan: Sandcastle and CircleCI

Reviewed By: pritamdamania87

Differential Revision: D22307777

fbshipit-source-id: e52abd915c37894933545eebdfdca3ecb9559926
2020-08-05 15:10:14 -07:00
a94039fce5 [RPC tests] Avoid decorators to skip tests (#40819)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40819

Summary of the entire stack:
--

(Identical to the stack summary in #40823 above; see that commit for the full rationale behind this refactoring stack.)

Summary of this commit
--
This diff removes the two decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which were used to skip tests. They were only used to prevent the TensorPipe agent from running tests that were using the process group agent's options. The converse (preventing the PG agent from using the TP options) is achieved by having those tests live in a `TensorPipeAgentRpcTest` class. So here we're doing the same for process group, by moving those tests to a `ProcessGroupAgentRpcTest` class.
ghstack-source-id: 109229473

Test Plan: Sandcastle and CircleCI

Reviewed By: pritamdamania87

Differential Revision: D22283179

fbshipit-source-id: b9315f9fd67f35e88fe1843faa161fc53a4133c4
2020-08-05 15:10:11 -07:00
935fcc9580 [RPC tests] Merge process group tests into single entry point (#40818)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40818

Summary of the entire stack:
--

(Identical to the stack summary in #40823 above; see that commit for the full rationale behind this refactoring stack.)

Summary of this commit
--
This diff does the changes described above for the process group agent. It defines a fixture for it (instead of using the generic fixture in its default behavior) and then merges all the entry points into a single script. Note that after this change there won't be anymore a "vanilla" RPC test: all test scripts now specify what agent they are using. This puts all agents on equal standing.
ghstack-source-id: 109229474

Test Plan: Sandcastle and CircleCI

Reviewed By: pritamdamania87

Differential Revision: D22283182

fbshipit-source-id: 7e3626bbbf37d88b892077a03725f0598576b370
2020-08-05 15:10:07 -07:00
b93c7c54eb [RPC tests] Merge tests for faulty agent into single script (#40817)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40817

Summary of the entire stack:
--

(Identical to the stack summary in #40823 above; see that commit for the full rationale behind this refactoring stack.)

Summary of this commit
--
This diff does the changes described above for the faulty agent, which is its own strange beast. It merges all the test entry points (i.e., the combinations of agent, suite and fork/spawn) into a single file. It also modifies the test suites that are intended to be run only on the faulty agent, which used to inherit from its fixture, to inherit from the generic fixture, as they will be mixed in with the faulty fixture at the very end, inside the entry point script.
ghstack-source-id: 109229477

Test Plan: Sandcastle and CircleCI

Reviewed By: pritamdamania87

Differential Revision: D22283178

fbshipit-source-id: 72659efe6652dac8450473642a578933030f2c74
2020-08-05 15:10:04 -07:00
edf6c4bc4d [RPC tests] Merge TensorPipe tests into single entry point (#40816)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40816

Summary of the entire stack:
--

(Identical to the stack summary in #40823 above; see that commit for the full rationale behind this refactoring stack.)

Summary of this commit
--
This diff does the changes described above for the TensorPipe agent. It fixes its fixture (making it inherit from the generic fixture) and merges all the entry point scripts into a single one, so that it's easier to have a clear overview of all the test suites which we run on TensorPipe (you'll notice that many are missing: the JIT ones, the remote module one, ...).
ghstack-source-id: 109229476

Test Plan: Sandcastle and CircleCI

Reviewed By: pritamdamania87

Differential Revision: D22283180

fbshipit-source-id: d5e9f9f4e6d4bfd6fbcae7ae56eed63d2567a02f
2020-08-05 15:08:32 -07:00
73351ee91d [TensorExpr] Disallow fallback to JIT interpreter from TensorExprKernel (flip the default). (#42568)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42568

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D22936175

Pulled By: ZolotukhinM

fbshipit-source-id: 62cb505acb77789ed9f483842a8b31eb245697b3
2020-08-05 14:13:49 -07:00
ef50694d44 [TensorExpr] Apply GenericIntrinsicExpander recursively. (#42567)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42567

Before this change we didn't expand arguments, and thus in an expr
`sigmoid(sigmoid(x))` only the outer call was expanded.
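
A self-contained Python sketch of why the expansion must recurse (`Call` and `expand_intrinsic` are illustrative stand-ins, not the TensorExpr API): arguments are expanded before the enclosing call is rewritten.
```
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Call:
    name: str
    args: List[Any]

def expand_intrinsic(name: str, args: List[Any]) -> Any:
    return ("lowered_" + name, args)  # stand-in for lowering one intrinsic

def expand(expr: Any) -> Any:
    if isinstance(expr, Call):
        args = [expand(a) for a in expr.args]  # recurse into arguments first
        return expand_intrinsic(expr.name, args)
    return expr

print(expand(Call("sigmoid", [Call("sigmoid", ["x"])])))
# ('lowered_sigmoid', [('lowered_sigmoid', ['x'])])
```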

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D22936177

Pulled By: ZolotukhinM

fbshipit-source-id: 9c05dc96561225bab9a90a407d7bcf9a89b078a1
2020-08-05 14:13:46 -07:00
ea9053b86d [TensorExpr] Handle constant nodes in shape inference. (#42566)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42566

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D22936176

Pulled By: ZolotukhinM

fbshipit-source-id: 69d0f9907de0e98f1fbd56407df235774cb5b788
2020-08-05 14:13:44 -07:00
b9c49f0e69 [TensorExpr] Support shape inference in TE for aten::cat. (#42387)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42387

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D22879281

Pulled By: ZolotukhinM

fbshipit-source-id: 775e46a4cfd91c63196b378ee587cc4434672c89
2020-08-05 14:11:24 -07:00
feeb515ad5 add Quantizer support to IValue (#42438)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42438

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D22894190

Pulled By: bhosmer

fbshipit-source-id: b2d08abd6f582f29daa6cc7ebf05bb1a99f7514b
2020-08-05 12:56:18 -07:00
24e2a8a171 Revert D22780307: Fix illegal memory access issue for CUDA version of SplitByLengths operator.
Test Plan: revert-hammer

Differential Revision:
D22780307 (76905527fe)

Original commit changeset: c5ca60ae16b2

fbshipit-source-id: f3c99eec5f05121e2bed606fe2ba84a0be0cdf16
2020-08-05 12:47:56 -07:00
df7c059428 Throw error if torch.set_deterministic(True) is called with nondeterministic CuBLAS config (#41377)
Summary:
For CUDA >= 10.2, the `CUBLAS_WORKSPACE_CONFIG` environment variable must be set to either `:4096:8` or `:16:8` to ensure deterministic CUDA stream usage. This PR adds some logic inside `torch.set_deterministic()` to raise an error if this environment variable is not set properly and CUDA >= 10.2.
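
A minimal usage sketch of the behavior this adds (assuming a CUDA >= 10.2 build; in practice the variable should be set before any cuBLAS work happens):
```
import os
import torch

# without a valid CUBLAS_WORKSPACE_CONFIG, the call below now raises on CUDA >= 10.2
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # or ":16:8"
torch.set_deterministic(True)  # passes the new environment check
```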

Issue https://github.com/pytorch/pytorch/issues/15359

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41377

Reviewed By: malfet

Differential Revision: D22758459

Pulled By: ezyang

fbshipit-source-id: 4b96f1e9abf85d94ba79140fd927bbd0c05c4522
2020-08-05 12:42:24 -07:00
7221a3d1aa enable torch.optim.swa_utils.SWALR (#42574)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42435
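
For context, a hedged usage sketch of SWALR per the torch.optim.swa_utils API (hyperparameter values are arbitrary):
```
import torch
from torch.optim.swa_utils import SWALR

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
swa_scheduler = SWALR(optimizer, swa_lr=0.05, anneal_epochs=5, anneal_strategy="cos")

for epoch in range(10):
    optimizer.step()      # training step elided
    swa_scheduler.step()  # anneals the lr toward swa_lr over anneal_epochs
```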

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42574

Reviewed By: zou3519

Differential Revision: D22949369

Pulled By: vincentqb

fbshipit-source-id: f2f319ec94a97e0afe4d4327c866504ae632a986
2020-08-05 12:37:45 -07:00
18a32b807b Add API to collect output_col_minmax_histogram
Summary:
Add an API to collect output_col_minmax_histogram; this is used to implement input_equalization.

Rolled back the revised collect_single_histogram in the new version to make sure it does not affect the product.
The newly added API can collect the activation histogram and the output column min/max histogram at the same time.

Test Plan:
Add a unit test, and pass it.
https://our.intern.facebook.com/intern/testinfra/testrun/2251799847601374
After updating the dump API, it passed the updated unit test
https://our.intern.facebook.com/intern/testinfra/testrun/844425097716401

Integrated output_col_minmax_histogram into collect_single_histogram and made it backward compatible:
https://our.intern.facebook.com/intern/testinfra/testrun/8162774342207893

I added different cases to test the newly added function; it passed the unit test: https://our.intern.facebook.com/intern/testinfra/testrun/4503599658969000

Tested after new revision: https://our.intern.facebook.com/intern/testinfra/testrun/5348024589078557

Reviewed By: hx89

Differential Revision: D22919913

fbshipit-source-id: c9cb05e0cf14af0dfde3d22921abb42f97a61df2
2020-08-05 12:33:10 -07:00
7c33225c72 Add strict mypy type checking and update code_template.py (#42322)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42322

Our current type checking rules are rather lax, and for
example don't force users to make sure they annotate all functions
with types.  For code generation code, it would be better to force
100% typing.  This PR introduces a new mypy configuration
mypy-strict.ini which applies rules from --strict.  We extend
test_type_hints.py to test for this case.  It only covers
code_template.py, which I have made strict clean in this PR.
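
A minimal sketch of what `--strict` enforces (the functions below are hypothetical, not from code_template.py):

```python
from typing import Dict

# Fully annotated: accepted under mypy --strict.
def render(template: str, env: Dict[str, str]) -> str:
    return template.format(**env)

# Missing annotations: rejected under --strict with
# "Function is missing a type annotation".
def render_loose(template, env):
    return template.format(**env)
```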

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D22846120

Pulled By: ezyang

fbshipit-source-id: 8d253829223bfa0d811b6add53b7bc2d3a4356b0
2020-08-05 12:28:15 -07:00
5c5d7a9dca Freeze dynamic (re)quantization ops into standard ones (#42591)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42591

We don't support lowering with 2-input Int8Quantize and 4-input Int8FC. Just do a conversion to absorb the quantization params into the op itself.

Test Plan:
```
buck test caffe2/caffe2/quantization/server:quantize_dnnlowp_op_test
```

Reviewed By: benjibc

Differential Revision: D22942673

fbshipit-source-id: a392ba2afdfa39c05c5adcb6c4dc5f814c95e449
2020-08-05 11:53:09 -07:00
6d1e43c5a6 Release the GIL before invokeOperator (#42341)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41865

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42341

Reviewed By: ezyang

Differential Revision: D22928622

Pulled By: wconstab

fbshipit-source-id: 8fa41277c9465f816342db6ec0e6cd4b30095c5c
2020-08-05 11:51:39 -07:00
76905527fe Fix illegal memory access issue for CUDA version of SplitByLengths operator.
Summary:
1. Fix illegal memory access issue for SplitByLengths operator in the CUDA context.
2. Add support to scaling lengths vector for SplitByLengths operator.
3. Add support to test SplitByLengths operator in the CUDA context.

Example of the SplitByLengths operator processing a scaling lengths vector:
value vector A = [1, 2, 3, 4, 5, 6]
length vector B = [1, 2]
After execution of the SplitByLengths operator,
the output should be [1, 2] and [3, 4, 5, 6].

Test Plan: buck test mode/dev-nosan caffe2/caffe2/python/operator_test:concat_split_op_test

Reviewed By: kennyhorror

Differential Revision: D22780307

fbshipit-source-id: c5ca60ae16b24032cedfa045a421503b713daa6c
2020-08-05 11:46:00 -07:00
06d978a9ad [c10/cuda] Reorganize device_count() and robustly surface ASAN warnings (#42249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42249

Main change is to bring Caffe2's superior error messages for cuda initialization into c10 and use them in all code paths.

Basic logic:

| Case | Call to device_count() | init_cuda, e.g. allocating tensor |
| -- | -- | -- |
| all good | non-zero | just works |
| no gpus | 0, no warning | throw exception with good message |
| driver issues | 0, produce warning | throw exception with good message |
| out of memory with ASAN | 0, produce warning| throw exception with ASAN message |

Previously, the error thrown from init_cuda was very generic and the ASAN warning (if any) was buried in the logs.

Other clean up changes:
* cache device_count() always in a static variable
* move all asan macros in c10

Test Plan:
Hard to unittest because of build modes. Verified manually that the behavior from the table above holds by running the following script in different modes (ASAN/no-ASAN, CUDA_VISIBLE_DEVICES=):

```
print('before import')
import torch
print('after import')
print('devices: ', torch.cuda.device_count())
x = torch.tensor([1,2,3])
print('tensor creation')
x = x.cuda()
print('moved to cuda')
```

Reviewed By: ngimel

Differential Revision: D22824329

fbshipit-source-id: 5314007313a3897fc955b02f8b21b661ae35fdf5
2020-08-05 11:39:31 -07:00
27e8dc78ca [vulkan] VulkanTensor lazy buffer allocation (#42569)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42569

We do not need to allocate buffers for Vulkan tensors if they are not the forward input or output.
allocate_storage() is no longer called by default for the outputs of operations; their image representation will hold the result.
A buffer is allocated only if an operation requests it (for some ops like concatenate or transpose) or on copy to host.

If the buffer was not allocated, `VulkanTensor.image()` just allocates the texture, skipping the copy from buffer to texture.
Since allocate storage was previously done for all operations, we save a buffer allocation and a buffer_to_image call.

MobileNetV2 on my Pixel 4:
```
flame:/data/local/tmp $ ./speed_benchmark_torch  --model=mnfp32-vopt.pt --input_type=float --input_dims=1,3,224,224 --warmup=3 --iter=20 --vulkan=true
Starting benchmark.
Running warmup runs.
Main runs.
Main run finished. Microseconds per iter: 305818. Iters per second: 3.26991
Segmentation fault
```
```
139|flame:/data/local/tmp $ ./speed_benchmark_torch_noas  --model=mnfp32-vopt.pt --input_type=float --input_dims=1,3,224,224 --warmup=3 --iter=20 --vulkan=true
Starting benchmark.
Running warmup runs.
Main runs.
Main run finished. Microseconds per iter: 236768. Iters per second: 4.22355
Segmentation fault
```

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22946552

Pulled By: IvanKobzarev

fbshipit-source-id: ac0743bb316847632a22cf9aafb8938e50b2fb7b
2020-08-05 10:54:41 -07:00
dae94ed022 Keep manual_kernel_registration only effective in aten codegen. (#42386)
Summary:
This PR removes manual registration in the aten/native codebase,
and it separates manual device/catchall kernel registration from manual VariableType kernel registration.
The first one remains as manual_kernel_registration in native_functions.yaml.
The second one is moved to tools/ codegen.

Difference in generated TypeDefault.cpp: https://gist.github.com/ailzhang/897ef9fdf0c834279cd358febba07734
No difference in generated VariableType_X.cpp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42386

Reviewed By: agolynski

Differential Revision: D22915649

Pulled By: ailzhang

fbshipit-source-id: ce93784b9b081234f05f3343e8de3c7a704a5783
2020-08-05 10:31:35 -07:00
b08347fd7b Add CUDA 11 builds for Windows CI (#42420)
Summary:
Stacked on https://github.com/pytorch/pytorch/pull/42410.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42420

Reviewed By: seemethere

Differential Revision: D22917230

Pulled By: malfet

fbshipit-source-id: 6ad394f7f8c430c587e0b0d9c5a5e7b7bcd85bfe
2020-08-05 09:40:33 -07:00
db52cd7322 .circleci: Hardcode rocm image to previous tag (#42603)
Summary:
There were some inconsistencies with the newer docker images, so it'd be
best to stick with something that works without reverting the entire
docker builder PR.

This was made after the previous efforts to disable the tests that were failing:
* https://github.com/pytorch/pytorch/pull/42583
* https://github.com/pytorch/pytorch/pull/42561

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42603

Reviewed By: ezyang

Differential Revision: D22948743

Pulled By: seemethere

fbshipit-source-id: cc8b834e0c8a6a4763f5ba07ce220a9c192ea6eb
2020-08-05 09:23:21 -07:00
eb8a5fed38 Automated submodule update: FBGEMM (#42584)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 4abc34af1a

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42584

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: dskhudia

Differential Revision: D22941475

fbshipit-source-id: 29863cad7f77939edb44d337918693879b35cfaa
2020-08-05 09:19:27 -07:00
924a1dbe9b Revert D22939119: [TensorExpr] Fix a way we were creating np arrays in tests.
Test Plan: revert-hammer

Differential Revision:
D22939119 (882ad117cf)

Original commit changeset: 3388270af8ea

fbshipit-source-id: 7c8d159586ce2c4c21184fd84aa6da5183bc71ea
2020-08-05 08:25:47 -07:00
0cf71eb547 Unconditionally use typing extensions in jit_internal (#42538)
Summary:
Since https://github.com/pytorch/pytorch/issues/38221 is closed now, the `typing_extensions` module should always be available

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42538

Reviewed By: ezyang

Differential Revision: D22942153

Pulled By: malfet

fbshipit-source-id: edabbadde13800a3412d14c19ca55ef206ada5e1
2020-08-05 08:22:59 -07:00
b85216887b [vulkan] max_pool2d (#41379)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41379

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22754944

Pulled By: IvanKobzarev

fbshipit-source-id: 5261337bb731a207a1532e6423c0d33f1307e413
2020-08-05 01:53:52 -07:00
0f358fab6b Hide cudnn symbols in libtorch_cuda.so when statically linking cudnn (#41986)
Summary:
This PR intends to fix https://github.com/pytorch/pytorch/issues/32983.

The initial (one-line) diff causes statically linked cudnn symbols in `libtorch_cuda.so` to have local linkage (such that they shouldn't be visible to external libraries during dynamic linking at load time), at least in my source build on Ubuntu 20.04.

Procedure I used to verify:
```
export USE_STATIC_CUDNN=ON
python3 setup.py install
...
```
then
```
mcarilli@mcarilli-desktop:~/Desktop/mcarilli_github/pytorch/torch/lib$ nm libtorch_cuda.so | grep cudnnCreate
00000000031ff540 t cudnnCreate
00000000031fbe70 t cudnnCreateActivationDescriptor
```
Before the diff they were marked with capital `T`s indicating external linkage.

Caveats:
- The fix is gcc-specific afaik.  I have no idea how to enable it for Windows or other compilers.
- Hiding the cudnn symbols will break external C++ applications that rely on linking `libtorch.so` to supply cudnn symbol definitions.  IMO this is "off menu" usage so I don't think it's a major concern.  Hiding the symbols _won't_ break applications that call cudnn indirectly through torch functions, which IMO is the "on menu" way.
- I know _very little_ about the build system.  The diff's intent is to add a link option that applies to any Pytorch `.so`s that statically link cudnn, and does so on Linux only.  I'm blindly following soumith's recommendation https://github.com/pytorch/pytorch/issues/32983#issuecomment-662056151, and post-checking the built libs (I also added `set(CMAKE_VERBOSE_MAKEFILE ON)` to the top-level CMakeLists.txt at one point to confirm `-Wl,--exclude-libs,libcudnn_static.a` was picked up by the command that linked `libtorch_cuda.so`).
- https://github.com/pytorch/pytorch/issues/32983 (which used a Pytorch 1.4 binary build) complained about `libtorch.so`, not `libtorch_cuda.so`:
    ```
    nvpohanh@ubuntu:~$ nm /usr/local/lib/python3.5/dist-packages/torch/lib/libtorch.so | grep ' cudnnCreate'
    000000000f479c30 T cudnnCreate
    000000000f475ff0 T cudnnCreateActivationDescriptor
    ```
  In my source build, `libtorch.so` ends up small, containing no cudnn symbols (this is true with or without the PR's diff), which contradicts https://github.com/pytorch/pytorch/issues/32983.  Maybe the symbol organization (what goes in `libtorch.so` vs `libtorch_cuda/cpu/whatever.so`) changed since 1.4.  Or maybe the symbol organization is different for source vs binary builds, in which case I have no idea if this PR's diff has the same effect for a binary build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41986

Reviewed By: glaringlee

Differential Revision: D22934926

Pulled By: malfet

fbshipit-source-id: 711475834e0f8148f0e5f2fe28fca5f138ef494b
2020-08-04 22:59:40 -07:00
882ad117cf [TensorExpr] Fix a way we were creating np arrays in tests. (#42575)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42575

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D22939119

Pulled By: ZolotukhinM

fbshipit-source-id: 3388270af8eae9fd4747f06202f366887aaf5f36
2020-08-04 21:24:25 -07:00
3c7fccc1c2 Reenable cusparse SpMM on cuda 10.2 (#42556)
Summary:
This fixes feature regression introduced by https://github.com/pytorch/pytorch/issues/42412 which limited all the use of the API to CUDA-11.0+

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42556

Reviewed By: ngimel

Differential Revision: D22932129

Pulled By: malfet

fbshipit-source-id: 2756e0587456678fa1bc7deaa09d0ea482dfd19f
2020-08-04 19:02:34 -07:00
78f4cff8fe handle multiple returns properly in boxing wrappers (#42437)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42437

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D22894191

Pulled By: bhosmer

fbshipit-source-id: fd4c7bc605a4b20bb3882f71e3b8874150671324
2020-08-04 18:27:25 -07:00
d45e2d3ef9 Reduce the output overhead of OutputColumnMaxHistogramObserver by enabling changing bin_nums; update observer_test.py
Summary: Currently OutputColumnMaxHistogramObserver outputs 2048 bins for each column, so the dumped file is extremely large and the dumping time is quite long, even though in the end we only use the min and max. This diff makes bin_nums configurable by adding an argument, with the default value set to 16 to reduce dumping overhead. When more bins are needed to analyze the results, only this argument has to be changed.

Test Plan:
buck run caffe2/caffe2/quantization/server:observer_test

Reviewed By: hx89

Differential Revision: D22918202

fbshipit-source-id: bda34449355b269b24c55802012450ebaa4d280c
2020-08-04 17:07:25 -07:00
61027a1a59 Install typing_extensions in PyTorch CI (#42551)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42551

Reviewed By: seemethere

Differential Revision: D22929256

Pulled By: malfet

fbshipit-source-id: 9a6f8c56ca1c0fb8a8569614a34a12f2769755f3
2020-08-04 17:03:44 -07:00
29700c0092 [JIT] Fix torch.jit.is_tracing() (#42486)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42486

**Summary**
This commit fixes a small bug in which `torch.jit.is_tracing()` returns
`torch._C.is_tracing`, the function object, instead of calling the
function and returning the result.

**Test Plan**
Continuous integration?

**Fixes**
This commit fixes #42448.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D22911062

Pulled By: SplitInfinity

fbshipit-source-id: b94eca0c1c65ca6f22acc6c5542af397f2dc37f0
2020-08-04 16:57:36 -07:00
afa489dea9 [ONNX] Enable lower_tuple pass for custom layer (#41548)
Summary:
A custom layer created via `torch.autograd.Function` appears in lower_tuple as `prim::PythonOp`. Adding this op type to the allowed list enables the lower_tuple pass, which helps with exporting custom layers with tuple outputs.

E.g.
```python
import torch
class CustomFunction(torch.autograd.Function):
    @staticmethod
    def symbolic(g, input):
        return g.op('CustomNamespace::Custom', input, outputs=2)
    @staticmethod
    def forward(ctx, input):
        return input, input
class Custom(torch.nn.Module):
    def forward(self, input):
        return CustomFunction.apply(input)

model = Custom()
batch = torch.FloatTensor(1, 3)
torch.onnx.export(model, batch, "test.onnx", verbose=True)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41548

Reviewed By: glaringlee

Differential Revision: D22926143

Pulled By: bzinodev

fbshipit-source-id: ce14d1d3c70a920154a8235d635ab31ddf0c46f3
2020-08-04 16:22:39 -07:00
ccc831ae35 test: Disable test_strided_grad_layout on ROCM (#42561)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42561

Regression was introduced as part of 5939d8a3e0, logs: https://app.circleci.com/pipelines/github/pytorch/pytorch/196558/workflows/9a2dd56e-86af-4d0f-9fb9-b205dcd12f93/jobs/6502042

Going to go ahead and disable the test to give rocm folks time to investigate what's going on

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D22932615

Pulled By: seemethere

fbshipit-source-id: 41150f3085f848cce75990716362261fea9391a0
2020-08-04 16:20:44 -07:00
c3e2ee725f Automated submodule update: FBGEMM (#42496)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 87c378172a

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42496

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: dskhudia

Differential Revision: D22911638

fbshipit-source-id: f20c83908b51ff56d8bf1d8b46961f70d023c81a
2020-08-04 16:15:26 -07:00
b9e68e03c4 Fix the bug in THCTensor_(baddbmm) and ATen's addmm_cuda for strided views input (#42425)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42418.

The problem was that the non-contiguous batched matrices were passed to `gemmStridedBatched`.

The following code fails on master and works with the proposed patch:
```python
import torch
x = torch.tensor([[1., 2, 3], [4., 5, 6]], device='cuda:0')
c = torch.as_strided(x, size=[2, 2, 2], stride=[3, 1, 1])
torch.einsum('...ab,...bc->...ac', c, c)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42425

Reviewed By: glaringlee

Differential Revision: D22925266

Pulled By: ngimel

fbshipit-source-id: a72d56d26c7381b7793a047d76bcc5bd45a9602c
2020-08-04 16:11:07 -07:00
317b9d3bfc Implement sort for string in aten (#42398)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42375

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42398

Reviewed By: ailzhang

Differential Revision: D22884849

Pulled By: gmagogsfm

fbshipit-source-id: e53386949f0a5e166f3d1c2aa695294340bd1440
2020-08-04 15:25:35 -07:00
56fc7d0345 Fix doc build (#42559)
Summary:
Add space between double back quotes and left curly bracket

Otherwise doc generation failed with `Inline literal start-string without end-string.`

This regression was introduced by b56db305cf

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42559

Reviewed By: glaringlee

Differential Revision: D22931527

Pulled By: malfet

fbshipit-source-id: 11c04a92dbba48592505f704d77222cf92a81055
2020-08-04 15:15:15 -07:00
e995c3d21e Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar) (#41554)
Summary:
Initial PR for the Tensor List functionality.

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient in cases when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process. We need to reduce the number of kernels that we start.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**In this PR**
- Adding `multi_tensor_apply` mechanism which will help to efficiently apply passed functor on a given list of tensors on CUDA.
- Adding a first private API - `std::vector<Tensor> _foreach_add(TensorList tensors, Scalar scalar)`
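
A minimal usage sketch of the new private API, assuming a CUDA device is available:

```python
import torch

# Ten small tensors that would otherwise require ten kernel launches
# to increment one by one.
tensors = [torch.randn(2, 2, device="cuda") for _ in range(10)]

# multi_tensor_apply processes the whole list with far fewer launches.
results = torch._foreach_add(tensors, 1.0)
```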

**Tests**
Tested via unit tests

**Plan for the next PRs**

1. Cover these ops with `multi_tensor_apply` support
- exponent
- division
- mul_
- add_
- addcmul_
- addcdiv_
- Sqrt

2. Rewrite PyTorch optimizers to use for-each operators in order to get performance gains.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41554

Reviewed By: cpuhrsch

Differential Revision: D22829724

Pulled By: izdeby

fbshipit-source-id: 47febdbf7845cf931958a638567b7428a24782b1
2020-08-04 15:01:09 -07:00
a0695b34cd .circleci: Have python docs always push to site (#42552)
Summary:
Was getting an error when attempting to push to master for
pytorch/pytorch.github.io since the main branch on that repository is
actually site and not master.

Get rid of the loop too, since it wasn't going to work with a
conditional, and conditionals on a two-variable loop just aren't worth the
readability concerns.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42552

Reviewed By: malfet

Differential Revision: D22929503

Pulled By: seemethere

fbshipit-source-id: acdd26b86718304eac9dcfc81761de0b3e609004
2020-08-04 14:44:42 -07:00
91d87292a6 [vulkan][asan] Fix Invalid Memory ops (#41224)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41224

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22754940

Pulled By: IvanKobzarev

fbshipit-source-id: f012b78a57f5f88897b2b6b91713090c8984a0bc
2020-08-04 14:33:49 -07:00
0d1a689764 [vulkan] reshape op (#41223)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41223

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22754942

Pulled By: IvanKobzarev

fbshipit-source-id: 99fc5888803d6afe2a73bb5bbed6651d2ea98313
2020-08-04 14:32:06 -07:00
e97e87368e Clean up CUDA Sleep and Tensor Initialization in ProcessGroupNCCLTest (#42211)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42211

Helper functions for launching CUDA Sleep and Tensor Value Initialization for the collective test functions.

This is more of a code cleanup fix compared to the previous diffs.
ghstack-source-id: 109097243

Test Plan: working on devGPU and devvm

Reviewed By: jiayisuse

Differential Revision: D22782671

fbshipit-source-id: 7d88f568a4e08feae778669affe69c8d638973db
2020-08-04 12:36:27 -07:00
3ca361791f TearDown function for ProcessGroupNCCLTest Initializer (#42209)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42209

This PR adds a TearDown function to the testing superclass to ensure that the NCCL_BLOCKING_WAIT environment variable is reset after each test case.
ghstack-source-id: 109097247

Test Plan: Working on devGPU and devvm.

Reviewed By: jiayisuse

Differential Revision: D22782672

fbshipit-source-id: 8f919a96d7112f9f167e90ce3df59886c88f3514
2020-08-04 12:36:24 -07:00
2b8e7e2f2d Moving ProcessGroupNCCLTest to Gtest (#42208)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42208

ProcessGroupNCCLTest is currently written without any testing framework, and all tests are simply called from the main function and throw exceptions upon failure. As a result, it is hard to debug and pinpoint which tests have succeeded/failed.

This PR moves ProcessGroupNCCLTest to gtest with appropriate setup and skipping functionality in the test superclass.
ghstack-source-id: 109097246

Test Plan: Working Correctly on devGPU and devvm.

Reviewed By: jiayisuse

Differential Revision: D22782673

fbshipit-source-id: 85bd407f4534f3d339ddcdd65ef3d2022aeb7064
2020-08-04 12:34:09 -07:00
b3ffebda7a [TensorExpr] Properly handle all dtypes of the condition in evaluation of IfThenElse exprs. (#42495)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42495

Test Plan: Imported from OSS

Reviewed By: nickgg

Differential Revision: D22910753

Pulled By: ZolotukhinM

fbshipit-source-id: f9ffd3dc4c50fb3fb84ce6d6916c1fbfd3201c8f
2020-08-04 12:25:56 -07:00
c334ebf1aa [TensorExpr] Properly handle all dtypes in evaluation of Intrinsics exprs. (#42494)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42494

Note that we're currently assuming that the dtypes of all the arguments and
the return value are the same.

Test Plan: Imported from OSS

Reviewed By: nickgg

Differential Revision: D22910755

Pulled By: ZolotukhinM

fbshipit-source-id: 7f899692065428fbf2ad05d22b4ca39cab788ae5
2020-08-04 12:25:54 -07:00
38a9984451 [TensorExpr] Properly handle all dtypes in evaluation of CompareSelect exprs. (#42493)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42493

Test Plan: Imported from OSS

Reviewed By: nickgg

Differential Revision: D22910754

Pulled By: ZolotukhinM

fbshipit-source-id: cf7073d6ea792998a9fa3989c7ec486419476de0
2020-08-04 12:24:03 -07:00
5939d8a3e0 Revert "Revert D22360735: .circleci: Build docker images as part of C… (#40950)
Summary:
…I workflow"

This reverts commit 3c6b8a64964b0275884359dd6a5bf484655d8c7c.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40950

Reviewed By: malfet

Differential Revision: D22909883

Pulled By: seemethere

fbshipit-source-id: 93c070400d7fbe1753f88c3291ab5eba4ab237fa
2020-08-04 12:12:17 -07:00
4b42a5b5a1 Remove redundant kernels calling TypeDefault in VariableType codegen. (#42031)
Summary:
We have code snippet like below in VariableType_X.cpp
```
Tensor __and___Scalar(const Tensor & self, Scalar other) {
  auto result = TypeDefault::__and___Scalar(self, other);
  return result;
}
TORCH_LIBRARY_IMPL(aten, Autograd, m) {
  m.impl("__and__.Scalar",
         c10::impl::hacky_wrapper_for_legacy_signatures(TORCH_FN(VariableType::__and___Scalar))
  );
}
```
We already register TypeDefault kernels as catchAll, so they don't need to be wrapped and registered to the Autograd key in VariableType.cpp. This PR removes the wrapper and registration in VariableType.cpp. (The ones in other files like TracedType.cpp remain the same.)
Here's a [diff in generated VariableTypeEverything.cpp](https://gist.github.com/ailzhang/18876edec4dad54e43a1db0c127c5707)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42031

Reviewed By: agolynski

Differential Revision: D22903507

Pulled By: ailzhang

fbshipit-source-id: 04e6672b6c79e079fc0dfd95c409ebca7f9d76fc
2020-08-04 11:56:15 -07:00
94e8676a70 Initialize uninitialized variable (#42419)
Summary:
Fixes internal T70924595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42419

Reviewed By: allwu, Krovatkin

Differential Revision: D22889325

Pulled By: wconstab

fbshipit-source-id: 108b6a6c6bb7c98d77e22bae9974a6c00bc296f0
2020-08-04 11:35:54 -07:00
d2a2ac4eea Fix read/write bulk data (#42504)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42504

Reviewed By: glaringlee

Differential Revision: D22922750

Pulled By: mrshenli

fbshipit-source-id: 9008fa22c00513bd75c3cf88a3081184cd72b0e3
2020-08-04 11:30:53 -07:00
ec898b1ab5 fix discontiguous inputs/outputs for cummin/cummax (#42507)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42363

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42507

Reviewed By: mruberry

Differential Revision: D22917876

Pulled By: ngimel

fbshipit-source-id: 05f3f4a55bcddf6a853552184c9fafcef8d36270
2020-08-04 10:12:07 -07:00
ecb88c5d11 Add NCCL Alltoall to PT NCCL process group (#42514)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42514

Add Alltoall and Alltoallv to PT NCCL process group using NCCL Send/Recv.
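
A minimal sketch of driving the new backend from Python, assuming a `nccl` process group is already initialized and `rank`/`world_size` come from the launcher:

```python
import torch
import torch.distributed as dist

def run_alltoall(rank: int, world_size: int) -> None:
    device = torch.device(f"cuda:{rank}")
    # Rank r sends chunk i to rank i and receives chunk r from each rank.
    inputs = [torch.full((2,), float(rank), device=device) for _ in range(world_size)]
    outputs = [torch.empty(2, device=device) for _ in range(world_size)]
    dist.all_to_all(outputs, inputs)
```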

Reviewed By: mrshenli

Differential Revision: D22917967

fbshipit-source-id: 402f2870915bc237845864a4a27c97df4351d975
2020-08-04 08:39:28 -07:00
b56db305cf Improve the documentation of DistributedDataParallel (#42471)
Summary:
Fixes #{issue number}

The phrase 'gradients from each node are averaged' in the documentation of DistributedDataParallel is not clear on its own. Many people, including me, have had a totally wrong understanding of this part. I add a note to the documentation to make it more straightforward and more user friendly.

Here is some toy code to illustrate my point:

* non-DistributedDataParallel version
    ```python
    import torch
    import torch.nn as nn

    x = torch.tensor([-1, 2, -3, 4], dtype=torch.float).view(-1, 1)
    print("input:", x)

    model = nn.Linear(in_features=1, out_features=1, bias=False)
    model.weight.data.zero_()
    model.weight.data.add_(1.0)

    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()

    y = model(x)

    label = torch.zeros(4, 1, dtype=torch.float)
    loss = torch.sum((y - label)**2)

    loss.backward()
    opti.step()

    print("grad:", model.weight.grad)
    print("updated weight:\n", model.weight)

    # OUTPUT
    # $ python test.py
    # input: tensor([[-1.],
    #         [ 2.],
    #         [-3.],
    #         [ 4.]])
    # grad: tensor([[60.]])
    # updated weight:
    #  Parameter containing:
    # tensor([[0.9400]], requires_grad=True)
    ```

* DistributedDataParallel version
    ```python
    import os
    import torch
    import torch.nn as nn
    import torch.distributed as dist
    from torch.multiprocessing import Process

    def run(rank, size):
        x = torch.tensor([-(1 + 2 * rank), 2 + 2 * rank], dtype=torch.float).view(-1, 1)
        print("input:", x)

        model = nn.Linear(in_features=1, out_features=1, bias=False)
        model.weight.data.zero_()
        model.weight.data.add_(1.0)
        model = torch.nn.parallel.DistributedDataParallel(model)

        opti = torch.optim.SGD(model.parameters(), lr=0.001)
        opti.zero_grad()

        y = model(x)

        label = torch.zeros(2, 1, dtype=torch.float)
        loss = torch.sum((y.view(-1, 1) - label)**2)

        loss.backward()
        opti.step()

        if rank == 0:
            print("grad:", model.module.weight.grad)
            print("updated weight:\n", model.module.weight)

    def init_process(rank, size, fn, backend="gloo"):
        os.environ['MASTER_ADDR'] = '127.0.0.1'
        os.environ['MASTER_PORT'] = '29500'
        dist.init_process_group(backend, rank=rank, world_size=size)
        fn(rank, size)

    if __name__ == "__main__":
        size = 2
        process = []
        for rank in range(size):
            p = Process(target=init_process, args=(rank, size, run))
            p.start()
            process.append(p)

        for p in process:
            p.join()

    # OUTPUT
    # $ python test_d.py
    # input: tensor([[-3.],
    #         [ 4.]])input: tensor([[-1.],
    #         [ 2.]])

    # grad: tensor([[30.]])
    # updated weight:
    #  Parameter containing:
    # tensor([[0.9700]], requires_grad=True)
    ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42471

Reviewed By: glaringlee

Differential Revision: D22923340

Pulled By: mrshenli

fbshipit-source-id: 40b8c8ba63a243f857cd5976badbf7377253ba82
2020-08-04 08:36:42 -07:00
f3e8fff0d2 Batching rules for: chunk, split, unbind (#42480)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42480

These are grouped together because they all return a tuple of multiple
tensors.

This PR implements batching rules for chunk, split, and unbind. It also
updates the testing logic. Previously, reference_vmap was not able to
handle multiple outputs; now it does.

Test Plan: - `pytest test/test_vmap.py -v -k "Operators"`

Reviewed By: ezyang

Differential Revision: D22905401

Pulled By: zou3519

fbshipit-source-id: 9963c943d035e9035c866be74dbdf7ab1989f8c4
2020-08-04 08:33:43 -07:00
f1d7f001b9 Batching rules for: torch.movedim, torch.narrow, Tensor.unfold (#42474)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42474

Test Plan: - `pytest test/test_vmap.py -v -k "Operators"`

Reviewed By: ezyang

Differential Revision: D22903513

Pulled By: zou3519

fbshipit-source-id: 06b3fb0c7d12b9a045c73a5c5a4f4e3207e07b02
2020-08-04 08:33:41 -07:00
01cd613e7e Batching rules for: T, view, view_as, reshape, reshape_as (#42458)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42458

Test Plan: - `pytest test/test_vmap.py -v -k "Operators"`

Reviewed By: ezyang

Differential Revision: D22898715

Pulled By: zou3519

fbshipit-source-id: 47f374962697dcae1d5aec80a41085679d016f92
2020-08-04 08:31:33 -07:00
0c48aa1e07 Add typing annotations to hub.py and _jit_internal.py (#42252)
Summary:
xref: https://github.com/pytorch/pytorch/wiki/Guide-for-adding-type-annotations-to-PyTorch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42252

Reviewed By: malfet

Differential Revision: D22916480

Pulled By: ezyang

fbshipit-source-id: 392ab805b0023640a3b5cdf600f70638b375f84f
2020-08-04 08:20:44 -07:00
d21e345ef0 Fix segfault in THPGenerator_dealloc (take 2) (#42510)
Summary:
A segfault happens when one tries to deallocate an uninitialized generator.
Make `THPGenerator_dealloc` UBSAN-safe by moving implicit cast in the struct definition to reinterpret_cast

Add `TestTorch.test_invalid_generator_raises` that validates that Generator created on invalid device is handled correctly
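
A minimal sketch of the scenario the test covers (the device string is illustrative):

```python
import torch

# Constructing a Generator on a nonexistent device should raise
# cleanly instead of leaving a half-initialized object that
# segfaults on deallocation.
try:
    g = torch.Generator(device="cuda:999")
except RuntimeError as e:
    print("raised as expected:", e)
```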

Fixes https://github.com/pytorch/pytorch/issues/42281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42510

Reviewed By: pbelevich

Differential Revision: D22917469

Pulled By: malfet

fbshipit-source-id: 5eaa68eef10d899ee3e210cb0e1e92f73be75712
2020-08-04 08:06:08 -07:00
8850fd1952 Add python inferface to create OfflineTensor (#42516)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42516

As titled. We need it for some scripts.

Reviewed By: houseroad

Differential Revision: D22918112

fbshipit-source-id: 8a1696ceeeda67a34114bc57cb52c925711cfb4c
2020-08-04 01:31:34 -07:00
ae67f4c8b8 Revert D22845258: [pytorch][PR] [ONNX] Enable scripting tests and update jit passes
Test Plan: revert-hammer

Differential Revision:
D22845258 (04e55d69f9)

Original commit changeset: d57fd4086f27

fbshipit-source-id: 15aa5cdae496a5e8ce2d8739a06dd4a7edc2200c
2020-08-03 23:15:06 -07:00
842759591d [ONNX] Refactor ONNX fixup for Loop and If (#40943)
Summary:
* move both under new file `fixup_onnx_controlflow`
* move the fixup to where the ONNX loop/if node is created, as opposed to running the fixup as a post-pass. This will help with enabling ONNX shape inference later.
* move `fuseSequenceSplitConcat` to `Peephole`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40943

Reviewed By: mrshenli

Differential Revision: D22709999

Pulled By: bzinodev

fbshipit-source-id: 51d316991d25dc4bb4047a6bb46ad1e2401d3d2d
2020-08-03 22:33:17 -07:00
55d2a732cd Skip part of test_figure[_list] if Matplotlib-3.3.0 is installed (#42500)
Summary:
See https://github.com/matplotlib/matplotlib/issues/18163 for more details
Fixes https://github.com/pytorch/pytorch/issues/41680

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42500

Reviewed By: ezyang

Differential Revision: D22915857

Pulled By: malfet

fbshipit-source-id: 4f8858b7b0018c6958a49f908de81a13a29e6046
2020-08-03 21:43:22 -07:00
49e06e305f [ONNX] Updating input node removal in ONNX function_substitution pass. (#42146)
Summary:
ONNX pass `torch._C._jit_pass_onnx_function_substitution(graph)` inlines the function with the compiled torch graph. But while it removes all connections with the compiled function node (e.g. see below - `%6 : Function = prim::Constant[name="f"]()`), it does not remove the function node itself. For example, if the input graph is:
```
graph(%0 : Long(requires_grad=0, device=cpu),
      %1 : Long(requires_grad=0, device=cpu)):
  %6 : Function = prim::Constant[name="f"]()
  %7 : Tensor = prim::CallFunction(%6, %0, %1)
  return (%7)
```
The output graph is:
```
graph(%0 : Long(requires_grad=0, device=cpu),
      %1 : Long(requires_grad=0, device=cpu)):
  %6 : Function = prim::Constant[name="f"]()
  %8 : int = prim::Constant[value=1]()
  %z.1 : Tensor = aten::sub(%0, %1, %8) # test/onnx/test_utility_funs.py:790:20
  %10 : Tensor = aten::add(%0, %z.1, %8) # test/onnx/test_utility_funs.py:791:23
  return (%10)
```
Note that the `%6 : Function = prim::Constant[name="f"]()` has not been removed (though it is not being used).

This PR updates the pass to remove the function node completely. The updated graph looks as follows:
```
graph(%0 : Long(requires_grad=0, device=cpu),
      %1 : Long(requires_grad=0, device=cpu)):
  %8 : int = prim::Constant[value=1]()
  %z.1 : Tensor = aten::sub(%0, %1, %8) # test/onnx/test_utility_funs.py:790:20
  %10 : Tensor = aten::add(%0, %z.1, %8) # test/onnx/test_utility_funs.py:791:23
  return (%10)
```

A test point has also been added for this scenario.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42146

Reviewed By: VitalyFedyunin

Differential Revision: D22845314

Pulled By: bzinodev

fbshipit-source-id: 81fb351f0a36f47204e5327b60b84d7a91d3bcd9
2020-08-03 21:31:19 -07:00
0cb86afd72 Revert D22908795: [pytorch][PR] Fix segfault in THPGenerator_dealloc
Test Plan: revert-hammer

Differential Revision:
D22908795 (d3acfe3ba8)

Original commit changeset: c5b6a35db381

fbshipit-source-id: c7559c382fced23cef683c8c90cff2d6012801ec
2020-08-03 21:03:44 -07:00
dc1f87c254 Add typing_extensions as a dependency. (#42431)
Summary:
Closes gh-38221.

The related pytorch/builder PR: https://github.com/pytorch/builder/pull/475

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42431

Reviewed By: malfet

Differential Revision: D22916499

Pulled By: ezyang

fbshipit-source-id: c8fe9413b62fc7a6b829fc82aaf32531b55994d1
2020-08-03 20:06:16 -07:00
c8cb5e5bcb Relax cusparse windows guard on cuda 11 (#42412)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42406

### cusparse Xcsrmm2 API:

(https://github.com/pytorch/pytorch/issues/37202)

- new: https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spmm
- old (deprecated in cuda 11): https://docs.nvidia.com/cuda/archive/10.2/cusparse/index.html#csrmm2

Before:

|cuda ver | windows | linux |
|--|--|--|
| 10.1 | old api | old api  |
| 10.2 | old api | new api |
| 11    | old api (build error claimed in https://github.com/pytorch/pytorch/issues/42406) | new api |

After:

|cuda ver | windows | linux |
|--|--|--|
| 10.1 | old api | old api  |
| 10.2 | old api | **old api** |
| 11    | **new api** | new api |

### cusparse bmm-sparse-dense API

<details><summary>reverted, will be revisited in the future</summary>
(cc kurtamohler https://github.com/pytorch/pytorch/issues/33430)

- new: https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spmm

Before:

|cuda ver | windows | linux |
|--|--|--|
| 10.1 | not supported | new api  |
| 10.2 | not supported | new api |
| 11    | not supported | new api |

After:

|cuda ver | windows | linux |
|--|--|--|
| 10.1 | not supported | new api  |
| 10.2 | not supported | new api |
| 11    | **new api** | new api |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42412

Reviewed By: agolynski

Differential Revision: D22892032

Pulled By: ezyang

fbshipit-source-id: cded614af970f0efdc79c74e18e1d9ea8a46d012
2020-08-03 19:59:59 -07:00
24199e0768 tuple_map / tuple_concat (#42326)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42326

ghstack-source-id: 108868289

Test Plan: Unit tests

Reviewed By: smessmer

Differential Revision: D22846504

fbshipit-source-id: fa9539d16e21996bbd80db3e3c524b174b22069e
2020-08-03 19:19:47 -07:00
1b18adb7e8 [ONNX] Export static as_strided (#41569)
Summary:
`as_strided` creates a view of an existing tensor with specified `sizes`, `strides`, and `storage_offsets`. This PR supports the export of `as_strided` with static argument `strides`. The following scenarios will not be supported:
* Calling on a tensor of dynamic shape, i.e. where the tensor shape differs between model runs with different model inputs.
* In-place operations, i.e. updates to the original tensor that are expected to reflect in the `as_strided` output, and vice versa.
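
A minimal sketch of the supported static case (module and file names are illustrative):

```python
import torch

class StridedView(torch.nn.Module):
    def forward(self, x):
        # size and stride are constants, so the export is supported;
        # dynamic shapes or in-place updates through the view are not.
        return torch.as_strided(x, size=[2, 2], stride=[2, 1])

torch.onnx.export(StridedView(), torch.randn(6), "strided.onnx")
```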

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41569

Reviewed By: VitalyFedyunin

Differential Revision: D22845295

Pulled By: bzinodev

fbshipit-source-id: 7d1aa88a810e6728688491478dbf029f17ae7201
2020-08-03 18:56:40 -07:00
04e55d69f9 [ONNX] Enable scripting tests and update jit passes (#41413)
Summary:
This PR initiates the process of updating the TorchScript backend interface used by the ONNX exporter.

- Replace jit lower graph pass by freeze module pass

- Enable ScriptModule tests for ONNX operator tests (ORT backend) and model tests by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41413

Reviewed By: VitalyFedyunin

Differential Revision: D22845258

Pulled By: bzinodev

fbshipit-source-id: d57fd4086f27bd0c3bf5f70af7fd0daa39a2814a
2020-08-03 18:51:19 -07:00
c000b890a8 [ONNX] Export torch.eye to ONNX::EyeLike (#41357)
Summary:
Export dynamic torch.eye, i.e. commonly created from another tensor, where the shape for torch.eye is not known at export time.
Static torch.eye, where n and m are constants, is exported as a constant tensor directly.
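
A minimal sketch of the dynamic case (module and file names are illustrative):

```python
import torch

class DynamicEye(torch.nn.Module):
    def forward(self, x):
        # The size comes from the input tensor, so it is unknown at
        # export time and is exported as EyeLike rather than a constant.
        return torch.eye(x.size(1))

torch.onnx.export(DynamicEye(), torch.randn(2, 3), "eye.onnx")
```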

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41357

Reviewed By: VitalyFedyunin

Differential Revision: D22845220

Pulled By: bzinodev

fbshipit-source-id: 6e5c331fa28ca542022ea16f9c88c69995a393b2
2020-08-03 18:51:17 -07:00
fb56299d4a Fix check highlight in filecheck. (#42417)
Summary:
* It originally failed to check for cases where the highlight token appears more than once.
* Now it repeatedly tries to find the highlight token, if one doesn't seem correctly highlighted, until the end of the error message.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42417

Reviewed By: SplitInfinity

Differential Revision: D22889411

Pulled By: gmagogsfm

fbshipit-source-id: 994835db32849f3d7e98ab7f662bd5c6b8a1662e
2020-08-03 18:49:22 -07:00
7a5708832f fix masked_select for discontiguous outputs (#41841)
Summary:
This fixes https://github.com/pytorch/pytorch/issues/41473 for discontiguous input, mask and out. Tests to follow. Reverting https://github.com/pytorch/pytorch/issues/33269 is not a great solution because I'm told masked_select was needed for printing complex tensors.
cc gchanan, zou3519, ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41841

Reviewed By: mruberry

Differential Revision: D22706943

Pulled By: ngimel

fbshipit-source-id: 413d7fd3f3308b184de04fd56b8a9aaabcad22fc
2020-08-03 18:43:45 -07:00
d707d4bf6d Implement a light SGD optimizer (#42137)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42137

This PR implements an SGD optimizer class similar to torch::optim::SGD, but it doesn't inherit from torch::optim::Optimizer, for use on mobile devices (or other lightweight use case).

Adding Martin's comment for visibility: "SGD may be the only optimizer used in near future. If more client optimizers are needed, refactoring the full optim codes and reusing the existing code would be an option."

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D22846514

Pulled By: ann-ss

fbshipit-source-id: f5f46804aa021e7ada7c0cd3f16e24404d10c7eb
2020-08-03 17:27:53 -07:00
934b68f866 ecr_gc: Iterate through all tags, reduce prints (#42492)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42492

There's a potential for multiple tags to be created for the same digest
so we should iterate through all potential tags so that we're not
deleting digests that are associated with tags that we actually want.

Also, reduced the number of prints in this script to only the absolutely
necessary prints. (i.e. only the deleted images)

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22909248

Pulled By: seemethere

fbshipit-source-id: 7f2e540d133485ed6464e413b01ef67aa73df432
2020-08-03 16:59:56 -07:00
d3acfe3ba8 Fix segfault in THPGenerator_dealloc (#42490)
Summary:
A segfault happens when one tries to deallocate an uninitialized generator

Add `TestTorch.test_invalid_generator_raises` that validates that Generator created on invalid device is handled correctly

Fixes https://github.com/pytorch/pytorch/issues/42281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42490

Reviewed By: seemethere

Differential Revision: D22908795

Pulled By: malfet

fbshipit-source-id: c5b6a35db381738c0fc984aa54e5cab5ef2cbb76
2020-08-03 16:28:34 -07:00
dbdd28207c Expose a generic shape info struct for ONNXIFI Python interface (#42421)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42421

Previously, when we do onnxifi from Python, we could only feed shape info with float dtype and batch-based dim type. This diff removes this limitation and uses the TensorBoundShapes protobuf as a generic shape info struct. This will make the onnxifi interface in Python more flexible.

Reviewed By: ChunliF

Differential Revision: D22889781

fbshipit-source-id: 1a89f3a68c215a0409738c425b4e0d0617d58245
2020-08-03 16:10:05 -07:00
f0fd1cc873 Calculate inverse of output scale first. (#41342)
Summary:
This is to unify how the output scale calculation is done between
fbgemm and qnnpack (servers vs. mobile).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41342

Test Plan: Quantization tests.

Reviewed By: vkuzo

Differential Revision: D22506347

Pulled By: kimishpatel

fbshipit-source-id: e14d22f13c6e751cafa3e52617e76ecd9d39dad5
2020-08-03 14:45:08 -07:00
c3236b6649 [quant] Expose register activation post process hook function to user (#42342)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42342

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D22856711

fbshipit-source-id: d6ad080c82b744ae1147a656c321c448ac5e7f10
2020-08-03 12:28:42 -07:00
1b9cd747cf Revert "Conda build (#38796)" (#42472)
Summary:
This reverts commit 9c7ca89ae637a9cea52b4fee0877adc7485f4eb7.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42472

Reviewed By: ezyang, agolynski

Differential Revision: D22903382

Pulled By: seemethere

fbshipit-source-id: e2b01537bcdf6c50d967329833cb6450a75b8247
2020-08-03 12:08:13 -07:00
0eb513beef Set a proper type for a variable (#42453)
Summary:
The `ninputs` variable was always used as a `size_t` but declared as an `int32_t`.

Now some annoying warnings are fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42453

Reviewed By: agolynski

Differential Revision: D22898282

Pulled By: mrshenli

fbshipit-source-id: b62d6b07f0bc3717482906df6010d88762ae0ccd
2020-08-03 11:44:37 -07:00
34025eb826 Vectorize arange (#38697)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38697

Benchmark (gcc 8.3, Debian Buster, turbo off, Release build, Intel(R)
Xeon(R) E-2136, Parallelization using OpenMP):

```python
import timeit
for dtype in ('torch.double', 'torch.float', 'torch.uint8', 'torch.int8', 'torch.int16', 'torch.int32', 'torch.int64'):
    for n, t in [(40_000, 50000),
                (400_000, 5000)]:
        print(f'torch.arange(0, {n}, dtype={dtype}) for {t} times')
        print(timeit.timeit(f'torch.arange(0, {n}, dtype={dtype})', setup=f'import torch', number=t))
```

Before:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
1.587841397995362
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.47885190199303906
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.5519152240012772
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.4733216500026174
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
1.426058754004771
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.43596178699226584
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
1.4289699140063021
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.43451592899509706
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.5714442400058033
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.14837959500437137
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.5964003179979045
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.15676555599202402
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8390555799996946
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23184613398916554
```

After:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
0.6895066159922862
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.16820953000569716
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.3640095089940587
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.39255041000433266
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
0.3422072059911443
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.0605111670010956
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
0.3449254590086639
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.06115841199061833
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.7745441729930462
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.22106765500211623
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.720475220005028
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.20230313099455088
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8144655400101328
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23762561299372464
```

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22291236

Pulled By: VitalyFedyunin

fbshipit-source-id: 134dd08b77b11e631d914b5500ee4285b5d0591e
2020-08-03 11:14:57 -07:00
fa6e900e8c Let TensorIterator::nullary_op support check_mem_overlap option (#38693)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38693

Test Plan: Imported from OSS

Differential Revision: D22291237

Pulled By: VitalyFedyunin

fbshipit-source-id: 5bc96e617ed36ed076da73e3d019699f2efd6e4e
2020-08-03 11:13:04 -07:00
ed44269edc Add missing space after -> for topk.values (#42321)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42321

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D22846520

Pulled By: ezyang

fbshipit-source-id: 7c0ab0b019d05a13309c3b8d770582414795799f
2020-08-03 10:10:20 -07:00
326d777e53 Convert _wait_all_workers to _all_gather (#42276)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42276

This commit converts `_wait_all_workers()` to `_all_gather()` by
allowing each worker to provide its own data object. The `_all_gather()`
function blocks and returns the gathered results. This API can be
converted to `rpc.barrier()` later.

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D22853480

Pulled By: mrshenli

fbshipit-source-id: 9d506813b9fd5b7c144885e2b76a863cbd19466a
2020-08-03 08:48:45 -07:00
ebde590864 Remove debug vestige (#42277)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42277

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D22853481

Pulled By: mrshenli

fbshipit-source-id: 74e58c532d8f872c1dd830573b2a4c4c86410de2
2020-08-03 08:46:38 -07:00
4cdbe5c495 Implement batching rules for some view ops (#42248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42248

Including:
- torch.diagonal
- torch.t
- torch.select
- Tensor.expand_as
- Tensor slicing.

Please let me know in the future if it would be easier to review these
separately (I put five operators into this PR because each
implementation is relatively simple).

Test Plan:
- new tests in `test/test_vmap.py`.
- I would like to have a more structured/automated way of testing but
my previous attempts at making something resulted in something very
complicated.

Reviewed By: ezyang

Differential Revision: D22846273

Pulled By: zou3519

fbshipit-source-id: 8e45ebe11174512110faf1ee0fdc317a25e8b7ac
2020-08-03 08:01:48 -07:00
2f8d5b68fa vmap fallback kernel (#41943)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41943

If an operator doesn't have a batching rule implemented then we fallback
to this implementation. The fallback only works on out-of-place operators
that return only tensors with new memory. (e.g., no in-place operators,
no view operations).

The fallback effectively takes all of the BatchedTensors in `stack`,
slices them, and runs `op` on all of the corresponding slices to produce slices
of the outputs. The output slices then get `torch.stack`ed to create the
final returns.
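
In Python pseudocode, the strategy is roughly the following (a sketch for a unary op over a single batched input; the real fallback is a boxed C++ kernel):

```python
import torch

def fallback(op, batched: torch.Tensor) -> torch.Tensor:
    # Slice along the batch dimension and run the op per example...
    out_slices = [op(example) for example in batched.unbind(0)]
    # ...then stack the results; this stack is the extra copy that
    # makes the fallback slower than a dedicated batching rule.
    return torch.stack(out_slices)

print(fallback(torch.sin, torch.randn(4, 3)))
```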

The performance of the fallback is not very good because it introduces
an extra copy from stacking the sliced outputs. Because of this, we prefer
to write batching rules for operators whenever possible.

In the future, I'd like to disable the fallback kernel for random
functions until we have a better random story for vmap. I will probably
add a blocklist of operators to support that.

Test Plan: - `pytest test/test_vmap.py -v`

Reviewed By: ezyang

Differential Revision: D22764103

Pulled By: zou3519

fbshipit-source-id: b235833f7f27e11fb76a8513357ac3ca286a638b
2020-08-03 07:59:33 -07:00
192487d716 Update MAGMA to 2.5.3 for Windows (#42410)
Summary:
In order to introduce CUDA 11 build jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42410

Reviewed By: malfet

Differential Revision: D22892025

Pulled By: ezyang

fbshipit-source-id: 11bd7507f623d654a589ba00a138f6b947990f4c
2020-08-03 07:43:09 -07:00
ebfff31e19 [distributedhogwild] Introducing new tags for distributed hogwild. (#42381)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42381

Introduce new tag to support distributed hogwild.

Reviewed By: boryiingsu

Differential Revision: D20484099

fbshipit-source-id: 5973495589e0a7ab185d3867b37437aa747f408a
2020-08-03 07:10:44 -07:00
bfa94487b9 Remove register_mobile_autograd.cpp. (#42397)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42397

Since the autograd registration is unified to code-gen, we don't need to keep a manual registration file for mobile.
Remove it to avoid extra maintenance.

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D22883153

Pulled By: iseeyuan

fbshipit-source-id: 6db0bd89369beab9eed6e9a9692dd46f5bd1ff48
2020-08-02 14:14:33 -07:00
91c80d122a torch.gcd: Do not use std::abs() because it does not have an unsigned integer overload (#42254)
Summary:
`abs` doesn't have an unsigned overload across all compilers, so applying abs to a uint8_t can be ambiguous: https://en.cppreference.com/w/cpp/numeric/math/abs

This may cause unexpected issue when the input is uint8 and is greater
than 128. For example, on MSVC, applying `std::abs` on an unsigned char
variable

```c++
#include <cmath>

unsigned char a(unsigned char x) {
    return std::abs(x);
}
```

gives the following warning:

    warning C4244: 'return': conversion from 'int' to 'unsigned char',
    possible loss of data

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42254

Reviewed By: VitalyFedyunin

Differential Revision: D22860505

Pulled By: mruberry

fbshipit-source-id: 0076d327bb6141b2ee94917a1a21c22bd2b7f23a
2020-08-01 23:03:33 -07:00
4cbf18ccc3 Enables integer -> float type promotion in TensorIterator (#42359)
Summary:
Many ufuncs (mostly unary ufuncs) in NumPy promote integer inputs to float. This typically occurs when the results of the function are not representable as integers.

For example:

```
a = np.array([1, 2, 3], dtype=np.int64)
np.sin(a)
: array([0.84147098, 0.90929743, 0.14112001])
```

In PyTorch we only have one function, `torch.true_divide`, which exhibits this behavior today, and it did so by explicitly pre-casting its inputs to the default (float) scalar type where necessary before calling TensorIterator.

This PR lets TensorIterator understand and implement this behavior directly, and it updates `torch.true_divide` to verify the behavior is properly implemented. This will be convenient when implementing more integer->float promotions later (like with `torch.sin`), and also saves copies on CUDA, where the cast from one dtype to another is fused with the computation.
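
For instance, after this change the promotion happens inside TensorIterator itself:

```python
import torch

a = torch.tensor([1, 2, 3], dtype=torch.int64)
# Integer inputs are promoted to the default (float) scalar type,
# so the result is not truncated.
print(torch.true_divide(a, 2))  # tensor([0.5000, 1.0000, 1.5000])
```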

The mechanism for this change is simple. A new flag, `promote_integer_inputs_to_float_`, is added to TensorIteratorConfig. When the new flag is set, after the TensorIterator's "common dtype" (AKA "computation type") is computed it's checked for being an integral (boolean included) type and, if it is, changed to the default (float) scalar type, instead. Only `torch.true_divide` sets this flag (for now).

In the future we'll likely...
- provide helpers (`binary_float_op`, `unary_float_op`) to more easily construct functions that promote int->float instead of requiring they build their own TensorIteratorConfigs.
- update torch.atan2 to use `binary_float_op`
- update many unary ufuncs, like `torch.sin` to use `unary_float_op` and support unary ops having different input and result type (this will also require a small modification to some of the "loops" code)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42359

Reviewed By: ngimel

Differential Revision: D22878394

Pulled By: mruberry

fbshipit-source-id: b8de01e46be859321522da411aed655e2c40e5b9
2020-08-01 22:41:00 -07:00
d403983695 Support List[str].index (#39210) (#40348)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40348

Test Plan: Imported from OSS

Reviewed By: wanchaol

Differential Revision: D22757035

Pulled By: firstprayer

fbshipit-source-id: 4fadf8beabf8d5bdfa5b0a185075f7caf9ba8b02
2020-08-01 13:47:25 -07:00
bdcf320bed Support custom exception message (#41907)
Summary:
Raise and assert used to have a hard-coded error message, "Exception"; the user-provided error message was ignored. This PR adds support for representing the user's error message in TorchScript.

This breaks backward compatibility because now we actually need to script the user's error message, which can potentially contain unscriptable expressions. Such programs can break when scripted, but saved models will continue to work.

Increased an op count in test_mobile_optimizer.py because we now need aten::format to form the actual exception message.

This is built upon a WIP PR: https://github.com/pytorch/pytorch/pull/34112 by driazati
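
A minimal sketch of the now-supported pattern (the function is hypothetical, for illustration only):

```python
import torch

@torch.jit.script
def check_positive(x: int):
    if x <= 0:
        # The custom message is now scripted (str() and concatenation compile),
        # instead of being replaced by a generic "Exception"
        raise ValueError("expected a positive value, got: " + str(x))
```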

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41907

Reviewed By: ngimel

Differential Revision: D22778301

Pulled By: gmagogsfm

fbshipit-source-id: 2b94f0db4ae9fe70c4cd03f4048e519ea96323ad
2020-08-01 13:03:45 -07:00
5769b06ab5 [Caffe2] Remove explicitly divide by zero in SpatialBN training mode (#42380)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42380

[Caffe2] Remove the explicit divide-by-zero in SpatialBN training mode

Test Plan: buck test mode/dev-nosan //caffe2/caffe2/python/operator_test:spatial_bn_op_test

Reviewed By: houseroad

Differential Revision: D22873214

fbshipit-source-id: 70b505391b5db02b45fc46ecd7feb303e50c6280
2020-08-01 11:54:58 -07:00
115d226498 Pin NumPy version on MacOS testers to 1.18.5 (#42409)
Summary:
Otherwise numba linking by clang-9 fails with:
```
ld: in /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/numpy/core/lib/libnpymath.a(npy_math.o), could not parse object file /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/numpy/core/lib/libnpymath.a(npy_math.o): 'Unknown attribute kind (61) (Producer: 'LLVM10.0.0' Reader: 'LLVM APPLE_1_902.0.39.2_0')', using libLTO version 'LLVM version 9.1.0, (clang-902.0.39.2)' for architecture x86_64
```
Because conda's numpy-1.19.1 is compiled with clang-10.
This should fix the MacOS regressions in CircleCI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42409

Reviewed By: xw285cornell

Differential Revision: D22887683

Pulled By: malfet

fbshipit-source-id: d58ee9bf53772b57c59e18f71151916d4f0a3c7d
2020-08-01 09:22:23 -07:00
2912390662 Limits cpu scalar error message to where it's appropriate (#42360)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40986.

TensorIterator's test for a CUDA kernel getting too many CPU scalar inputs was too permissive. This update limits the check to not consider outputs and to only be performed if the kernel can support CPU scalars.

A test is added to verify the appropriate error message is thrown in a case where the old error message was thrown previously.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42360

Reviewed By: ngimel

Differential Revision: D22868536

Pulled By: mruberry

fbshipit-source-id: 2bc8227978f8f6c0a197444ff0c607aeb51b0671
2020-08-01 02:04:30 -07:00
206db5c127 Improve torch.norm functionality, errors, and tests (#41956)
Summary:
**BC-Breaking Note:**
BC-breaking changes apply in the case where keepdim=True. Before this change, when calling `torch.norm` with keepdim=True and p='fro' or p=number, leaving all other optional arguments at their default values, the keepdim argument was ignored. Also, any time `torch.norm` was called with p='nuc', the result had one fewer dimension than the input, and the dimensions could be out of order depending on which dimensions were being reduced. After the change, in each of these cases the result has the same number and order of dimensions as the input.
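
A short sketch of the corrected keepdim behavior (shapes are illustrative):

```python
import torch

a = torch.randn(3, 4)
out = torch.norm(a, p='fro', dim=1, keepdim=True)
print(out.shape)  # torch.Size([3, 1]); previously keepdim was silently ignored here
```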

**PR Summary:**

* Fix keepdim behavior
* Throw descriptive errors for unsupported sparse norm args
* Increase unit test coverage for these cases and for complex inputs

These changes were taken from part of PR https://github.com/pytorch/pytorch/issues/40924. That PR is not going to be merged because it overrides `torch.norm`'s interface, which we want to avoid. But these improvements are still useful.

Issue https://github.com/pytorch/pytorch/issues/24802

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41956

Reviewed By: albanD

Differential Revision: D22837455

Pulled By: mruberry

fbshipit-source-id: 509ecabfa63b93737996f48a58c7188b005b7217
2020-08-01 01:55:12 -07:00
44b018ddeb Convert ProcessGroupNCCLTest.cpp to gtest unittest (#42365)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42365

Converting the test

Reviewed By: malfet

Differential Revision: D22855087

fbshipit-source-id: dc917950dcf99ec7036e48aaa4264d2c455cb19e
2020-07-31 20:34:11 -07:00
f47e00bdc3 [NNC] Bounds Inference: make inferred bounds respect gaps (#42185)
Summary:
A heavy refactor of bounds inference to fix some issues and bugs blocking using it to analyze cross thread interactions:
* We were merging all accesses to a Buf into a single bounds info entry, even if they did not overlap. E.g. if we accessed a[0:2] and a[5:6] we would merge that into a bound of a[0:6]. I've changed this behaviour to merge only overlapping bounds.
* We were not separating bounds of different kinds (e.g. Load vs Store) and would merge a Store bounds into a Load bounds, losing the information about what kind of access it was. E.g. this loop would produce bounds: [{Load, 0, 10}] and now produces bounds [{Load, 0, 9}, {Store, 1, 10}]:
```
for i in 1 to 10...
  x[i] = x[i-1]
```
* Both ComputeAt and Rfactor relied on the overzealous merging and used only a single entry in the bounds list to determine the bounds of the temporary buffers they created, which could result in temporary buffers being allocated smaller than the accesses made to them. I've fixed Rfactor, but *not* ComputeAt - however, all ComputeAt tests still pass (triggering this issue may require loop fusion) - I will come back to it.

Being more precise about bounds is more complex, rather than taking the minimum of starts and maximum of stops we now need to determine if two bounds overlap or are adjacent. There are many edge cases and so I've added a bunch of test coverage of the merging method.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42185

Reviewed By: mruberry

Differential Revision: D22870391

Pulled By: nickgg

fbshipit-source-id: 3ee34fcbf0740a47259defeb44cba783b54d0baa
2020-07-31 20:22:04 -07:00
dcc4d11ffa [TensorExpr] Make tensorOrConstant non-templatized function. (#42202)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42202

Currently we use a template in order to accept both
`std::vector<ExprHandle>` and `std::vector<VarHandle>`. However, the
semantics of this function dictate that the only allowed option should be
the former one: we're specifying indices for the tensor access we want
to generate. While it could be convenient to avoid converting a
vector of vars to a vector of exprs at the callsites, it makes the code
less explicit and thus more difficult to reason about.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D22806429

Pulled By: ZolotukhinM

fbshipit-source-id: 8403af5fe6947c27213050a033e79a09f7075d4c
2020-07-31 20:05:24 -07:00
2decccea2e [TensorExpr] Implement shape inference for TE. (#41451)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41451

Since TE operates on a limited subset of ops with a well-defined
semantics, we can easily infer shapes of intermediate and output tensors
given shapes of the inputs.

There are a couple of ops that are not yet supported in the shape
inference; once we add them we can relax the shape-info requirements
in the TE fuser: currently it requires all values in the fusion group to
have known shapes, and we can change that to only the inputs.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D22543470

Pulled By: ZolotukhinM

fbshipit-source-id: 256bae921028cb6ec3af91977f12bb870c385f40
2020-07-31 20:05:21 -07:00
f41bb1f92b [TensorExpr] Explicitly cast to bool results of comparison ops in kernel.cpp. (#42201)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42201

Previously, we've been using operators <, >, ==, et al. and relied on
the dtype being picked automatically. This led to a wrong dtype being
picked for the result, but that choice was overwritten by the type
explicitly specified in the JIT IR we were lowering. Now we are
moving towards using shape inference instead of relying on all types
being specified in the IR, and that made this issue immediately pop
up.

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D22806428

Pulled By: ZolotukhinM

fbshipit-source-id: 89d2726340efa2bb3da45d1603bedc53955e14b9
2020-07-31 20:05:19 -07:00
f8c5800bb5 [TensorExpr] Add debug dumps to kernel.cpp. (#42196)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42196

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D22803676

Pulled By: ZolotukhinM

fbshipit-source-id: 109372ca45d86478826190b868d005d2fb2c9ba7
2020-07-31 20:02:21 -07:00
655f376460 Implement Enum sugared value and Enum constant support (#42085)
Summary:
[3/N] Implement Enum JIT support

* Add enum value as constant support
* Add sugared value for EnumClass

Supported:
* Enum-typed function arguments
* Using Enum types and comparing them
* Getting name/value attrs of enums
* Using Enum values as constants

TODO:
* Add Python sugared value for Enum
* Support Enum-typed return values
* Support serialization and deserialization
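
A minimal sketch of the supported usage described above (the enum and function are hypothetical, for illustration):

```python
from enum import Enum

import torch

class Color(Enum):
    RED = 1
    GREEN = 2

@torch.jit.script
def is_red(c: Color) -> bool:
    # Enum-typed arguments and comparisons now compile
    return c == Color.RED

print(is_red(Color.RED))  # True
```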

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42085

Reviewed By: eellison

Differential Revision: D22758042

Pulled By: gmagogsfm

fbshipit-source-id: 5c6e571686c0b60d7fbad59503f5f94b3b3cd125
2020-07-31 17:29:55 -07:00
ff91b169c7 Changes to match Fused Op: Dequantize->Swish->Quantize (#42255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42255

Changes to match Fused Op: Dequantize->Swish->Quantize
* Changes to scale handling

Results showing matching intermediate and final Swish_Int8 Op.
P137389801

Test Plan: test case test_deq_swish_quant_nnpi.py

Reviewed By: hyuen

Differential Revision: D22827499

fbshipit-source-id: b469470ca66f6405ccc89696694af372ce6ce89e
2020-07-31 16:54:39 -07:00
1542c41a67 Change C++ frontend to take optional<Tensor> arguments (#41947)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41947

Previously, if an op took an optional `Tensor?` argument, the C++ frontend (i.e. `at::op()` and `Tensor::op()`)
were generated to take `Tensor`. A previous PR (https://github.com/pytorch/pytorch/pull/41610) changed the kernels
to be written with `c10::optional<Tensor>` instead of `Tensor`, but that did not touch the C++ frontend yet.

This PR changes the C++ frontend API to take `c10::optional<Tensor>` instead of `Tensor` as well.
This should be mostly BC-preserving. Since `Tensor` implicitly converts to `c10::optional<Tensor>`, any old code
calling an op with a `Tensor` would still work. There are likely corner cases that get broken though.
For example, C++ only ever does *one* implicit conversion. So if you call an op with a non-tensor object
that gets implicitly converted to a `Tensor`, then that previously worked since the API took a `Tensor` and
C++ allows one implicit conversion. Now it wouldn't work anymore because it would require two implicit conversions
(to `Tensor` and then to `c10::optional<Tensor>`) and C++ doesn't do that.

The main reasons for doing this are
- Make the C++ API more sane. Those arguments are optional and that should be visible from the signature.
- Allow easier integration for XLA and Autocast. Those backends generate code to wrap operators and forward
  operator arguments to calls to at::op(). After https://github.com/pytorch/pytorch/pull/41610, there was
  a mismatch because they had to implement operators with `optional<Tensor>` but call `at::op()` with `Tensor`,
  so they had to manually convert between those. After this PR, they can just forward the `optional<Tensor>`
  in their call to `at::op()`.
ghstack-source-id: 108873705

Test Plan: unit tests

Reviewed By: bhosmer

Differential Revision: D22704832

fbshipit-source-id: f4c00d457b178fbc124be9e884a538a3653aae1f
2020-07-31 16:11:55 -07:00
3a19af2427 Make operators with optional Tensor? arguments c10-full (#41610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41610

Previously, operators that have a `Tensor?` (i.e. optional tensor) in their schema implemented it using `Tensor` in C++ and filled in an undefined tensor for the None case.
The c10 operator library, however, expects `Tensor?` to be represented as `optional<Tensor>`, so those operators couldn't be c10-full yet and still had to use codegenerated unboxing instead of templated unboxing.

This PR changes that. It extends the `hacky_wrapper_for_legacy_signatures` to not only take care of TensorOptions, but now also map between signatures taking `Tensor` and `optional<Tensor>`.
For this, it requires an additional template parameter, the expected signature, and it uses that to go argument-by-argument and unwrap any optionals it finds.
ghstack-source-id: 108873701

Test Plan: waitforsandcastle

Reviewed By: bhosmer

Differential Revision: D22607879

fbshipit-source-id: 57b2fb01a294b804f82cd55cd70f0ef4a478e14f
2020-07-31 16:09:08 -07:00
f502290e91 [JIT] Make create autodiff subgraphs do in place updates to aliasDb (#42141)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42141

Update the alias db in-place instead of constructing it from scratch on each change, which caused O(n^2) behavior.

Description from https://github.com/pytorch/pytorch/pull/37106 holds pretty well:
"""
Recomputing the aliasdb on every fusion iteration + in every subblock
is hugely expensive. Instead, update it in-place when doing fusion.

The graph fuser pass operates by pushing nodes into a fusion group. So
we start with

`x, y = f(a, b, c)`

and end with:
```
x_out, y_out = prim::fusionGroup(a, b, c)
   x_in, y_in = f(a_in, b_in, c_in)
   -> x_in, y_in
```

We destroy the x and y Value*s in the process. This operation is
easy to express as an update to the aliasDb--x_out just takes on all
the aliasing information x used to have. In particular, since we know
f and prim::fusionGroup are purely functional, we don't have to mess
with any write information.
"""

The one difficulty here is that mapping x, y to x_out, y_out is not trivial when merging nodes into the autodiff subgraph node.
There are a few options:
- attempt to make all subgraph utils & ir cloning logic update a map
- mirror the subgraph utils implementation in create_autodiff_subgraph
- uniquely map x, y and x_in, y_in so you can back out the correspondence.

I went with the third option.

This shouldn't affect the results of the pass at all. Let me know if you think there's anything else I should be doing to test; I was thinking about maybe exposing an option to run create autodiff subgraphs without the post processor and checking that the alias db was correctly updated.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D22798377

Pulled By: eellison

fbshipit-source-id: 9a133bcaa3b051c0fb565afb23a3eed56dbe71f9
2020-07-31 15:13:32 -07:00
2285a2fc11 refactor canonical ordering to also be able to do isAfter checks (#42140)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42140

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D22798378

Pulled By: eellison

fbshipit-source-id: d1a549f43b28fe927729597818a46674c58fe81d
2020-07-31 15:11:40 -07:00
4fc525e729 [Dper3] Implementation of squeezed input to DC++
Summary:
This diff provides an option for the DC++ module to use squeezed sparse-feature embeddings to generate attention weights, with the purpose of reducing the network size to achieve QPS gains. There are 3 squeeze options (sum, max, and mean) along the embedding dimension, provided for both the attention-weight and resnet generation.
Example workflow: f208474456

{F257199459}

Test Plan:
1. Test single ops
buck test dper3/dper3/modules/low_level_modules/tests:single_operators_test -- test_reduce_back_mean
buck test dper3/dper3/modules/low_level_modules/tests:single_operators_test -- test_reduce_back_max
2. Test DC++ module
buck test dper3/dper3/modules/tests:core_modules_test -- test_dc_pp_arch_one_layer_compressed_embeddings_only_squeeze_input
buck test dper3/dper3/modules/tests:core_modules_test -- test_dc_pp_arch_shared_input_squeeze_input
buck test dper3/dper3/modules/tests:core_modules_test -- test_dc_pp_input_compress_embeddings_squeeze_input
3. Test Arch
buck test dper3/dper3_models/ads_ranking/model_impl/sparse_nn/tests:sparse_nn_lib_test -- test_dense_sparse_interaction_compress_dot_arch_dot_compress_pp_squeezed_input
4. e2e test
buck test dper3/dper3_models/ads_ranking/tests:model_paradigm_e2e_tests -- test_sparse_nn_compress_dot_attention_fm_max_fc_size_squeeze_input

Reviewed By: taiqing

Differential Revision: D22825069

fbshipit-source-id: 29269ea22cb47d487a1c92a1f6daae1055f54cfc
2020-07-31 14:31:43 -07:00
a01e91e6b2 [pytorch] include all overloads for OSS custom build
Summary:
For mobile custom build, we only generate code for ops that are used by
specific models to reduce binary size.

There are multiple places where we apply the op filtering:
- generated_unboxing_wrappers_*.cpp
- autograd/VariableType*.cpp
- c10 op registration (in aten/gen.py)

For c10 op registration, we filter by the main op name - all overloads
that match the main op name part will be kept.

For generated_unboxing_wrappers_*, we filter by the full op name - only
those having exactly the same overload name will be kept.

This PR changes generated_unboxing_wrappers_* and autograd/VariableType*.cpp
codegen to also filter by the main op name.

The reasons are:
- keeping all overloads gives better backward compatibility;
- generated_unboxing_wrappers_* are relatively small, as they only contain
  thin wrappers for root ops;
- generated_unboxing_wrappers_* will be replaced by c10 op registration
  soon anyway;
- autograd/VariableType*.cpp are not included in the OSS build.

Why does it offer better backward compatibility? #40737 is an example:
it introduced a new `_convolution` overload and renamed the original one
to `_convolution.deprecated`. Before this PR, a model prepared by an
old version of PyTorch wouldn't be able to run on the custom mobile build
generated on that PR, because `_convolution.deprecated` wouldn't be kept in
the custom build under the full-op-name matching policy. By relaxing it to
a partial matching policy, the mobile custom build CI on the PR can pass.

Will test the size impact for FB production build before landing.

Differential Revision: D22809564

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Pulled By: ljk53

fbshipit-source-id: e2fc017da31f38b9430cc2113f33e6d21a0eaf0b
2020-07-31 12:43:31 -07:00
38bf5be24f [quant] Use PlaceholderObserver instead of Fp16Observer and NoopObserver (#42348)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42348

Use the dtype info in PlaceholderObserver to decide what ops to insert in the graph.
In the next PR we can delete NoopObserver.

Test Plan:
python test/test_quantization.py

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D22859457

fbshipit-source-id: a5c618f22315534ebd9a2df77b14a0aece196989
2020-07-31 12:33:56 -07:00
6bd46b583e [quant][graph] Add support for FP16 dynamic quant (#42222)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42222

This change adds the necessary passes to perform FP16 dynamic quantization.
We skip inserting observers for activations based on the dtype (torch.float16) and only insert the Fp16Observer for weights.
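
For context, a sketch of the eager-mode analogue of fp16 dynamic quantization (the graph-mode passes added here are the scripted counterpart; the model is an assumed toy example):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 8))
# Dynamically quantize Linear weights to float16
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.float16)
print(quantized)
```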

Test Plan:
python test/test_quantization.py TestQuantizeJitOps

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D22849220

fbshipit-source-id: 2c53594ecd2485e9e3dd0b380eceaf7c5ab5fc50
2020-07-31 12:33:53 -07:00
8c5bf10264 [quant] Add FP16Observer for fp16 quant support (#42221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42221

Adds a new observer that emits a warning if the range of the tensor is beyond the fp16 range. This will be used later in graph-mode quantization to insert the fp16 cast ops into the graph.

Test Plan:
python test/test_quantization.py TestObserver.test_fp16_observer

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D22849222

fbshipit-source-id: a301281ce38ba4d4e7a009308400d34a08c113d2
2020-07-31 12:33:51 -07:00
a9eebaf693 [quant] Add saturate_to_fp16 op for FP16 quant support (#42147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42147

Op to check the range of a tensor and clamp the values to the fp16 range.
This operator will be inserted into the graph in subsequent diffs.

Test Plan:
python test/test_quantization.py TestQuantizedTensor.test_fp16_saturate_op

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D22849221

fbshipit-source-id: 0da3298e179750f6311e3a09596a7b8070509096
2020-07-31 12:31:07 -07:00
bdd9ef1981 Support RowWiseSparseAdam on GPU (#35404)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35404

Implement RowWiseSparseAdam on CUDA

Reviewed By: xw285cornell

Differential Revision: D20650225

fbshipit-source-id: 5f871e2f259e362b713c9281b4d94534453995cf
2020-07-31 10:47:29 -07:00
a9e7e787f8 [jit] make clone works for interface type (#42121)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42121

This PR changes the Module API to allow registering a module with a module
interface type, and therefore allows Module::clone to work in the case
where a module interface type is shared by two submodules.

The interface type will be shared by the new cloned instance in the same
compilation unit, because it only contains a list of FunctionSchemas and,
unlike a ClassType, does not involve any attributes.

fixes https://github.com/pytorch/pytorch/issues/41882

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D22781205

Pulled By: wanchaol

fbshipit-source-id: f97f4b75970f0b434e38b5a1f778eda2c4e5109b
2020-07-31 10:24:27 -07:00
352e15f1a2 Revert D22812445: Update TensorPipe submodule
Test Plan: revert-hammer

Differential Revision:
D22812445 (2335430086)

Original commit changeset: e6d824bb28f5

fbshipit-source-id: 606632a9aaf2513b5ac949e4d6687aa7563eae5d
2020-07-31 10:16:48 -07:00
832b1659e7 Fix missing attribute when loading model from older version (#42242) (#42290)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42242

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42290

Reviewed By: VitalyFedyunin

Differential Revision: D22844096

Pulled By: albanD

fbshipit-source-id: 707e552e0ed581fbe00f1527ab7426880edaed64
2020-07-31 09:03:07 -07:00
4c6878c97d [gloo] change ProcessGroupGlooAsyncTest to use gtest (#42313)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42313

Changes the tests in `ProcessGroupGlooAsyncTest.cpp` to use the Gtest testing framework.

Reviewed By: malfet

Differential Revision: D22821577

fbshipit-source-id: 326b24a334ae84a16434d0d5ef27d16ba4b90d5d
2020-07-31 08:54:50 -07:00
0adb584376 Make resize_ use normal device dispatch (#42240)
Summary:
`resize_` only requires manual registration to the `Autograd` key, and its device kernels can safely live together with our normal device dispatch in `native_functions.yaml`.
But currently we do manual registration for the `CPU/CUDA` kernels (leaving no dispatch in native_functions.yaml), which makes `resize_` non-overrideable from a backend's point of view. While it indeed should dispatch at the device level, this caused xla to whitelist `resize_` and register a lowering for the XLA key. This PR moves the device dispatch of `resize_` back to `native_functions.yaml` so that it properly shows up as an `abstract` method for downstream extensions.
Note that we also do manual registration for `copy_/detach_/resize_as_/etc.` in ATen, but they are slightly different from `resize_` since for them we only register `catchAll` kernels instead of device kernels. I'll investigate and send a follow-up PR for those ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42240

Reviewed By: VitalyFedyunin

Differential Revision: D22846311

Pulled By: ailzhang

fbshipit-source-id: 10b6cf99c4ed3d62fc4e1571f4a2a463d1b88c81
2020-07-31 02:15:27 -07:00
2f840b1662 Warns when TensorIterator would resize its output (#42079)
Summary:
See https://github.com/pytorch/pytorch/issues/41027.

This adds a helper for resizing output to ATen/native/Resize.* and updates TensorIterator to use it. The helper raises a warning if a tensor with one or more elements needs to be resized. This warning indicates that these resizes will become an error in a future PyTorch release.

 There are many functions in PyTorch that will resize their outputs and don't use TensorIterator. For example,

985fd970aa/aten/src/ATen/native/cuda/NaiveConvolutionTranspose2d.cu (L243)

And these functions will need to be updated to use this helper, too. This PR avoids their inclusion since the work is separable, and this should let us focus on the function and its behavior in review. A TODO appears in the code to reflect this.
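
A short sketch of a call that now triggers the warning (shapes are illustrative):

```python
import torch

a = torch.randn(2, 3)
out = torch.empty(5)       # wrong shape for the result
torch.add(a, a, out=out)   # emits a UserWarning that the non-empty output was resized
```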

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42079

Reviewed By: VitalyFedyunin

Differential Revision: D22846851

Pulled By: mruberry

fbshipit-source-id: d1a413efb97e30853923bce828513ba76e5a495d
2020-07-30 22:39:16 -07:00
e54f268a7a Enables torch.full bool and integer type inference (#41912)
Summary:
After being deprecated in 1.5 and throwing a runtime error in 1.6, we can now enable torch.full inferring its dtype when given bool and integer fill values. This PR enables that inference and updates the tests and docs to reflect this.
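
A quick sketch of the inference this enables:

```python
import torch

print(torch.full((2,), True).dtype)  # torch.bool
print(torch.full((2,), 7).dtype)     # torch.int64
print(torch.full((2,), 1.5).dtype)   # torch.float32 (the default dtype)
```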

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41912

Reviewed By: albanD

Differential Revision: D22836802

Pulled By: mruberry

fbshipit-source-id: 33dfbe4d4067800c418b314b1f60fab8adcab4e7
2020-07-30 22:39:13 -07:00
31d41f987a torch.where : Scalar Support (#40336)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/38349 #9190

TODO
* [x] Add Tests
* [x] Update Docs
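
A minimal sketch of the new scalar support (values are illustrative):

```python
import torch

cond = torch.tensor([True, False, True])
x = torch.tensor([1., 2., 3.])
# A Python scalar can now stand in for a tensor operand
print(torch.where(cond, x, 0.))  # tensor([1., 0., 3.])
```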

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40336

Reviewed By: albanD

Differential Revision: D22813834

Pulled By: mruberry

fbshipit-source-id: 67c1693c059a301b249213afee3c25cea9f64fec
2020-07-30 22:36:53 -07:00
1c8217a7a6 Abstract cuda calls made from torch_python (#42251)
Summary:
* Make c10::cuda functions regular non-inlined functions
* Add driver_version() and device_synchronize() functions

With this change I no longer see direct calls to the CUDA API when looking at Modules.cpp.obj.

FYI malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42251

Reviewed By: malfet

Differential Revision: D22826505

Pulled By: ziab

fbshipit-source-id: 8dc2f3e209d3710e2ce78411982a10e8c727573c
2020-07-30 19:18:33 -07:00
fbb052c2cc BlackList to BlockList (#42279)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41701: renames the blackList convention to the blockList convention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42279

Reviewed By: VitalyFedyunin

Differential Revision: D22843178

Pulled By: malfet

fbshipit-source-id: c9be5a5f084dfd0e46545d4a3d1124ef59277604
2020-07-30 18:06:49 -07:00
27c22b9b3c Modify function to takes dtype as argument
Summary: To avoid repeating to() casts for every argument of the function

Test Plan: CI

Reviewed By: malfet

Differential Revision: D22833521

fbshipit-source-id: ae0a8f70339cd6adfeea2f552d35bbcd48b11cf7
2020-07-30 16:27:55 -07:00
b5fcd89479 Add tests to sigmoid_backward and fmod (#42289)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42289

`sigmoid_backward` and `fmod` are covered neither in `test/cpp/api` nor in `ATen/test`. Add test functions to cover them.

Test Plan:
1. Test locally and check new lines are covered
2. CI

Reviewed By: malfet

Differential Revision: D22804912

fbshipit-source-id: ea50ef0ef3dcf3940ac950d74f6f1cb38d8547a7
2020-07-30 16:26:13 -07:00
7d6c4f62ef Remove 4 unused variables in lp_pool_op.cc (#42329)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42329

Reviewed By: VitalyFedyunin

Differential Revision: D22850894

Pulled By: mrshenli

fbshipit-source-id: 1e91380a432525b83c0bb0bfef0d5067c767cb67
2020-07-30 15:50:17 -07:00
153673c33b fix quantized elu benchmark (#42318)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42318

We forgot to update this benchmark when quantized elu's signature
changed to require observation; this fixes that.

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qactivation_test
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D22845251

fbshipit-source-id: 1443f6f0deac695715b1f2bd47f0f22b96dc72ca
2020-07-30 14:57:12 -07:00
5ff54ff4ff import freeze (#42319)
Summary:
torch.jit.freeze was broken by https://github.com/pytorch/pytorch/pull/41154/files#diff-9084cd464651f7fa1ff030d2edd9eb55R1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42319

Reviewed By: ZolotukhinM

Differential Revision: D22845476

Pulled By: eellison

fbshipit-source-id: bc9e50678d0e0ffca4062854ccc71bbef2e1a97b
2020-07-30 13:00:11 -07:00
344defc973 Let bfloat16 support promotion with other types (#41698)
Summary:
Fix https://github.com/pytorch/pytorch/issues/40580
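
A quick sketch of the promotion this enables (values are illustrative):

```python
import torch

a = torch.tensor([1.0], dtype=torch.bfloat16)
b = torch.tensor([1.0])               # float32
print((a + b).dtype)                  # torch.float32
print((a + torch.tensor([1])).dtype)  # torch.bfloat16 (float kind wins over integer)
```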

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41698

Reviewed By: albanD

Differential Revision: D22824042

Pulled By: mruberry

fbshipit-source-id: 7dad9c12dc51d8f88c3ca963ae9c5f8aa2f72277
2020-07-30 12:28:09 -07:00
c489bbe122 Add typing support to torch._six (#42232)
Summary:
Also add a __prepare__ method to the metaclass created by `with_metaclass`, to conform with PEP 3115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42232

Reviewed By: ezyang

Differential Revision: D22816936

Pulled By: malfet

fbshipit-source-id: a47d054b2f061985846d0db6b407f4e5df97b0d4
2020-07-30 12:12:46 -07:00
26d58503c2 Implementing NumPy-like function torch.signbit() (#41589)
Summary:
- Related to https://github.com/pytorch/pytorch/issues/38349
- Implements the NumPy-like function `torch.signbit()`.
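
A quick sketch of the new function:

```python
import torch

print(torch.signbit(torch.tensor([-2.0, 0.0, 3.0])))
# tensor([ True, False, False])
```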

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41589

Reviewed By: albanD

Differential Revision: D22835249

Pulled By: mruberry

fbshipit-source-id: 7988f7fa8f591ce4b6a23ac884ee7b3aa718bcfd
2020-07-30 11:21:15 -07:00
c35faae10d [pytorch][ci] install nightly instead of stable libtorch for mobile CIs (#42220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42220

Mobile custom build CI jobs need the desktop version of libtorch to prepare
models and dump root ops.

Ideally we should use the libtorch built on the PR so that backward
incompatible changes won't break this script - but it will significantly
slow down mobile CI jobs.

This PR changes it to install the nightly instead of the stable build, so that
we have an option to temporarily skip mobile CI jobs on BC-breaking PRs until
the changes are in the nightly.

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D22810484

Pulled By: ljk53

fbshipit-source-id: eb5f7b762a969d1cfeeac2648816be546bd291b6
2020-07-30 11:07:14 -07:00
ce546328a3 Const-correctness, variable initialization, and error checking. (#42124)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42124

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D22835543

Pulled By: AshkanAliabadi

fbshipit-source-id: 29b7619b7bc6dd346eec91b8a2b6cc6a76769bcf
2020-07-30 11:04:24 -07:00
d0ed1e303f Add missing header guards. (#42272)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42272

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D22835546

Pulled By: AshkanAliabadi

fbshipit-source-id: c880199acaf0ad11c3db4ac9f9f2d000038f98f1
2020-07-30 11:04:21 -07:00
ee2150370e Add Vulkan Test to ATen Mobile Tests. (#42123)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42123

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D22835544

Pulled By: AshkanAliabadi

fbshipit-source-id: 08bce5d94ed8c966d25707f69e51b16d5b45febd
2020-07-30 11:04:19 -07:00
7cd92aaa6b Disable validation layers in non-debug builds. (#42122)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42122

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D22835545

Pulled By: AshkanAliabadi

fbshipit-source-id: b0eee550c8d727c79b5d45a7e1d603379ae3af5c
2020-07-30 11:01:51 -07:00
8e3d1908b6 Fix minor typo in comment (#42184)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42184

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D22809375

Pulled By: ezyang

fbshipit-source-id: 322a4c2059b612a10c6257013bbf2fd207e75df7
2020-07-30 09:48:22 -07:00
86b2faeb53 Automated submodule update: FBGEMM (#42302)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: e04b9ce034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42302

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: efiks

Differential Revision: D22841424

fbshipit-source-id: 211463b0207da986fc5b451242ae99edf32b9f68
2020-07-30 08:56:34 -07:00
f15af2fe4f Remove unused variable "schema" (#42245)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42245

Reviewed By: albanD

Differential Revision: D22835223

Pulled By: mrshenli

fbshipit-source-id: 94f0cbddb36feefc8a136ef38b0a74d22b305680
2020-07-30 08:40:36 -07:00
547bbdac86 Add MSFT Owners to the Windows Maintainership (#42280)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42280

Reviewed By: albanD

Differential Revision: D22836782

Pulled By: soumith

fbshipit-source-id: a38f91e381abc0acf3ab41e05ff70611926091ac
2020-07-30 08:22:13 -07:00
269ec767ca [Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D22838806

fbshipit-source-id: 29039585c82bb214db860d582cc4e269ab990c85
2020-07-30 04:01:20 -07:00
2335430086 Update TensorPipe submodule (#42225)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42225

Main changes:
- Consolidated CMake files to have a single entry point, rather than having a specialized one for PyTorch.
- Changed the way the preprocessor flags are provided, and changed their name.

There were a few instances in PyTorch's CMake files where we were directly adding TensorPipe's source directory as an include path, which however doesn't contain the auto-generated header we now added. We fix that by adding the `tensorpipe` CMake target as a dependency, so that the include paths defined by TensorPipe are used, which contain that auto-generated header.

I'm turning off SHM and CMA for now because they have never been covered by the CI. I'll enable them in a separate PR so that if they turn out to be flaky we can revert that change without reverting this one.

Test Plan: CircleCI is all green.

Reviewed By: beauby

Differential Revision: D22812445

fbshipit-source-id: e6d824bb28f5afe75fd765de0430968174f3531f
2020-07-30 02:32:52 -07:00
4f163df41a [caffe2] Special handling of If/AsyncIf op in RemoveOpsByType (#42286)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42286

One more bug to fix. Operators such as If and AsyncIf need special treatment not just in `onnx::SsaRewrite`, but also in `RemoveOpsByType`. The solution needs two steps:
1) add external inputs/outputs of the subnets of If/AsyncIf op to the inputs/outputs of the op
2) if the inputs/outputs of the If/AsyncIf op need to be renamed as a result, the same inputs/outputs of the subnets need to be renamed as well.

I also added unit tests to cover this corner case.

Test Plan:
```
buck test //caffe2/caffe2/fb/predictor:black_box_predictor_test

mkdir /tmp/models
rm -rf /tmp/$USER/snntest
rm -rf /tmp/snntest
buck run mode/opt admarket/lib/ranking/prediction_replayer/snntest_replayer_test/tools:snntest_replay_test -- --serving_paradigm=USER_AD_PRECOMPUTATION_DSNN
```

Differential Revision: D22834028

fbshipit-source-id: c070707316cac694f452a96e5c80255abf4014bc
2020-07-30 02:02:20 -07:00
f30ac66e79 [caffe2] Fix a performance bug in Dedup SparseAdagrad op (#42287)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42287

We shouldn't use block_size for the thread dimensions in linear_index_weight_offsets_dedup_kernel, since the kernel doesn't iterate over the embedding dimensions.
ghstack-source-id: 108834058

Test Plan:
```
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_sum_gradient \(caffe2\.caffe2\.fb\.net_transforms\.tests\.fuse_sparse_ops_test\.TestFuseSparseOps\)' --print-passing-details
```

Reviewed By: jspark1105

Differential Revision: D22800959

fbshipit-source-id: 641d52a51070715c04f9fd286e7e22ac62001f61
2020-07-30 01:00:59 -07:00
0444bac940 Add test to cross function
Summary: The function `cross_kernel_scalar` is not covered in `ATen/native/cpu/CrossKernel.cpp`; add tests to cover it.

Test Plan:
1. Test locally to check new lines are covered
2. CI

https://pxl.cl/1fZjG

Reviewed By: malfet

Differential Revision: D22834122

fbshipit-source-id: 0d50f3a3e6aee52cb6fdee2b9f5883f542c7b6e2
2020-07-29 22:48:52 -07:00
9ea7476d9c Add test to lerp function (#42266)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42266

The functions `lerp_kernel_scalar` and `lerp_kernel_tensor` are not covered in `ATen/native/cpu/LerpKernel.cpp`; add tests to cover them.

Test Plan:
1. Test locally to check new lines are covered
2. CI

https://pxl.cl/1fXPd

Reviewed By: malfet

Differential Revision: D22832164

fbshipit-source-id: b1eaabbf8bfa08b4dedc1a468abfdfb619a50e3c
2020-07-29 22:47:37 -07:00
7459da268e Add typing annotations to torch.random (#42234)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42234

Reviewed By: ezyang

Differential Revision: D22816933

Pulled By: malfet

fbshipit-source-id: 9e2124ad16fed339abd507f6e474cb63feb7eada
2020-07-29 22:16:08 -07:00
872237c1f2 Output to stderr in distributed tests. (#42139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42139

A bunch of tests were failing with buck since we would output to
stdout, and buck would fail to parse stdout in some cases.

Moving these print statements to stderr fixes the issue.
ghstack-source-id: 108606579

Test Plan: Run the offending unit tests.

Reviewed By: mrshenli

Differential Revision: D22779135

fbshipit-source-id: 789af3b16a03b68a6cb12377ed852e5b5091bbad
2020-07-29 19:23:34 -07:00
fe4f19e164 [CUDA] max_pool2d NCHW performance improvement (#42182)
Summary:
Fix the regression introduced in https://github.com/pytorch/pytorch/issues/38953.

Please see https://github.com/xwang233/code-snippet/blob/master/max-pool2d-nchw-perf/max-pool2d.ipynb for detailed before & after performance comparisons.

Performance improvement for backward max_pool2d before and after this PR (negative value means speed up)

![image](https://user-images.githubusercontent.com/24860335/88712204-363c8e00-d0ce-11ea-8586-057e09b16103.png)

The forward kernel doesn't seem to benefit much from a similar change, so I did not change the forward pass. 1718f0ccfd

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42182

Reviewed By: albanD

Differential Revision: D22829498

Pulled By: ngimel

fbshipit-source-id: 4c81968fe072f4e264e70c70ade4c32d760a3af4
2020-07-29 19:01:31 -07:00
c18223f9ef add Dimname support to IValue (#42054)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42054

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D22750398

Pulled By: bhosmer

fbshipit-source-id: 7028268093f86b33c4117868b0edcb9e1ca6f7ee
2020-07-29 16:30:26 -07:00
6c251f74b2 replace black_list/blacklist with blocklist/block_list (#42089)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41734

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42089

Reviewed By: pbelevich

Differential Revision: D22794556

Pulled By: SplitInfinity

fbshipit-source-id: 4404845b6293b076b3c8cc02b135b20c91397a79
2020-07-29 16:26:02 -07:00
27b03d62de [HT] Clear the device placement tag for the auto gen sum so that we could break the component for FC sharing the same input (#42219)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42219

Introduce a new extra_info tag that is applied to operators in the forward net that share the same input. The effect is that the auto-generated gradient sum for that input will not follow the device tags of the operators in the forward net. This allows more flexible device allocation.

Test Plan:
# unit test
`./buck-out/gen/caffe2/caffe2/python/core_gradients_test#binary.par -r  testMultiUseInputAutoGenSumDevice`

Reviewed By: xianjiec, boryiingsu

Differential Revision: D22609080

fbshipit-source-id: d558145e5eb36295580a70e1ee3a822504dd439a
2020-07-29 15:21:27 -07:00
7cdf786a07 fix typo in GradScaler docstring (#42236)
Summary:
Closes https://github.com/pytorch/pytorch/issues/42226.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42236

Reviewed By: albanD

Differential Revision: D22817980

Pulled By: ngimel

fbshipit-source-id: 4326fe028dba1dbeed454edc4e4d4fffa56f51d6
2020-07-29 13:14:57 -07:00
79cfd85987 grad detach_ only when it has grad_fn in zero_grad call (#41283)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41283

In optimizer.zero_grad(), detach_ is only useful for avoiding a memory leak when the grad has a grad_fn, so add a check that calls grad.detach_ only when the grad has a grad_fn.
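
A simplified sketch of the resulting per-parameter logic (assuming an existing `optimizer`; this follows the commit description, not an exact copy of the source):

```python
for group in optimizer.param_groups:
    for p in group['params']:
        if p.grad is not None:
            if p.grad.grad_fn is not None:
                p.grad.detach_()              # only detach when grad has a grad_fn
            else:
                p.grad.requires_grad_(False)
            p.grad.zero_()
```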
ghstack-source-id: 108702289

Test Plan: unit test

Reviewed By: mrshenli

Differential Revision: D22487315

fbshipit-source-id: 861909b15c8497f1da57f092d8963d4920c85e38
2020-07-29 11:40:13 -07:00
4b6e5f42a4 Creates spectral ops test suite (#42157)
Summary:
In preparation for creating the new torch.fft namespace and NumPy-like fft functions, as well as supporting our goal of refactoring and reducing the size of test_torch.py, this PR creates a test suite for our spectral ops.

The existing spectral op tests from test_torch.py and test_cuda.py are moved to test_spectral_ops.py and updated to run under the device generic test framework.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42157

Reviewed By: albanD

Differential Revision: D22811096

Pulled By: mruberry

fbshipit-source-id: e5c50f0016ea6bb8b093cd6df2dbcef6db9bb6b6
2020-07-29 11:36:18 -07:00
029007c8b6 Improved coverage for unboxed->boxed kernel wrappers (#38999)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38999

Adds boxing for inplace and outplace kernels, itemizes
remaining unsupported cases, and fails compilation when
new unsupported types are introduced in op signatures.

Test Plan: Imported from OSS

Differential Revision: D21718547

Pulled By: bhosmer

fbshipit-source-id: 03295128b21d1843e86789fb474f38411b26a8b6
2020-07-29 11:31:16 -07:00
60f51542dc [Caffe2] Fix spatial_bn bug for computing running_var on CPU or on CUDA without CuDNN (#42151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42151

Previously our Caffe2 SpatialBN op implementation computed running_var incorrectly: it was missing the unbias coefficient. This should have failed the test, because the output differs from CuDNN's output, but our tests were too weak to catch the bug. This diff fixes all of them.

Test Plan: buck test mode/dev-nosan //caffe2/caffe2/python/operator_test:spatial_bn_op_test

Reviewed By: houseroad

Differential Revision: D22786127

fbshipit-source-id: db80becb67d60c44faae180c7e4257cb136a266d
2020-07-29 11:20:03 -07:00
91546a4b0f Environment variable for controlling type verbosity in debug output (#41906)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41906

Fixes #41770

Test Plan:
Example:
```
import torch
def bar():
    def test(a):
        return a
    x = torch.ones(10,10, device='cpu')
    print(torch.jit.trace(test, (x)).graph)
bar()
```

Bash:
```
for i in 0 1 2 3; do
  PYTORCH_JIT_TYPE_VERBOSITY=$i python test.py
done
```

Output:
```
graph(%0):
  return (%0)

graph(%0 : Float(10, 10)):
  return (%0)

graph(%0 : Float(10:10, 10:1)):
  return (%0)

graph(%0 : Float(10:10, 10:1, requires_grad=0, device=cpu)):
  return (%0)
```

Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D22687966

fbshipit-source-id: cd395257d79a4baa35245c778a74a55d1ea2a842
2020-07-29 11:17:24 -07:00
01b794f169 Operator-level Benchmark Test for Per Tensor and Per Channel Fake Quantization (#41974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41974

In this diff, 2 new sets of benchmark tests are added to the `quantization` benchmark suite, where operator-level benchmarking is conducted for the learnable Python operators, the learnable C++ kernels, and the original non-backprop C++ kernels.

Test Plan:
Inside the path `torch/benchmarks/operator_benchmark` (The root directory will be `caffe2` inside `fbcode` if working on a devvm):
- On a devvm, run the command `buck run pt:fake_quantize_learnable_test`
- On a personal laptop, run the command `python3 -m pt.fake_quantize_learnable_test`

Benchmark Results (On devGPU with 0% volatile utilization -- all GPUs are free):
Each sample has dimensions **3x256x256**;

### In **microseconds** (`1e-6` second),

|                           | Python Module | C++ Kernel | Non-backprop C++ Kernel |
|---------------------------|---------------|------------|-------------------------|
| Per Tensor CPU Forward    | 3112.666      | 3270.740   | 3596.864                |
| Per Tensor Cuda Forward   | 797.258       | 258.961    | 133.953                 |
| Per Channel CPU Forward   | 6587.693      | 6931.461   | 6352.417                |
| Per Channel Cuda Forward  | 1579.576      | 555.723    | 479.016                 |
| Per Tensor CPU Backward   | 72278.390     | 22466.648  | 12922.195               |
| Per Tensor Cuda Backward  | 6512.280      | 1546.218   | 652.942                 |
| Per Channel CPU Backward  | 74138.545     | 41212.777  | 14131.576               |
| Per Channel Cuda Backward | 6795.173      | 4321.351   | 1052.066                |

Reviewed By: z-a-f

Differential Revision: D22715683

fbshipit-source-id: 8be528b790663413cbeeabd4f68bbca00be052dd
2020-07-29 11:12:17 -07:00
48acdfd505 add tests to BinaryOpsKernel -- max/min kernel (#42198)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42198

1. add tests to max/min kernel

Test Plan:
1. Run locally to check cover the corresponding code part in BinaryOpsKernel.cpp.
2. CI

Reviewed By: malfet

Differential Revision: D22796019

fbshipit-source-id: 84c8d7df509de453c4ec3c5e38977733b0ef3457
2020-07-29 10:35:40 -07:00
382781221d Extending Learnable Fake Quantize module to support gradient scaling and factory (partial) construction (#41969)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41969

In this diff, the `_LearnableFakeQuantize` module is extended to support gradient scaling, where the gradients for both scale and zero point are multiplied by a constant `g` (which in some cases can help with quicker convergence). In addition, it is augmented with a factory method via `_with_args`, so that a partial constructor of the module can be built.

Test Plan:
For correctness of the fake quantizer operators, on a devvm, enter the following command:
```
buck test //caffe2/torch:quantization -- learnable_py_module
```

Reviewed By: z-a-f

Differential Revision: D22715629

fbshipit-source-id: ff8e5764f81ca7264bf9333789f57e0b0cec7a72
2020-07-29 10:22:26 -07:00
0a64f99162 [JIT] Dont include view ops in autodiff graphs (#42027)
Summary:
View ops as outputs of differentiable subgraphs can cause incorrect differentiation. For now, do not include them in the subgraph. This was observed with our autograd tests for MultiheadAttention and nn.Transformer, which currently fail with the legacy executor. This commit fixes those test failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42027

Reviewed By: pbelevich

Differential Revision: D22798133

Pulled By: eellison

fbshipit-source-id: 2f6c08953317bbe013933c6faaad20100376c039
2020-07-29 10:17:33 -07:00
b45b82b006 Fix type annotation for DistributedDataParallel (#42231)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42231

Reviewed By: albanD

Differential Revision: D22816589

Pulled By: mrshenli

fbshipit-source-id: a355f7e2fa895617bf81ef681b051f074d39ab8c
2020-07-29 10:12:20 -07:00
c8e15842aa Automated submodule update: FBGEMM (#42205)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: cad1c21404

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42205

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: dskhudia

Differential Revision: D22806731

Pulled By: efiks

fbshipit-source-id: 779a9f7f00645e7e65f183e2832dc79117eae5fd
2020-07-29 09:26:18 -07:00
460970483d Revert D22790718: [pytorch][PR] Enables torch.full bool and integer type inference
Test Plan: revert-hammer

Differential Revision:
D22790718 (6b3f335641)

Original commit changeset: 8d1eb01574b1

fbshipit-source-id: c321177cce129a6c83f1a7b26bd5ed94a343ac0f
2020-07-29 07:52:04 -07:00
90074bbfa6 implement numpy-like functionality isposinf, isneginf (#41588)
Summary:
Related https://github.com/pytorch/pytorch/issues/38349

The NumPy-like functions `isposinf` and `isneginf` are implemented.
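
A quick sketch of the new functions:

```python
import torch

t = torch.tensor([float('inf'), float('-inf'), 1.0])
print(torch.isposinf(t))  # tensor([ True, False, False])
print(torch.isneginf(t))  # tensor([False,  True, False])
```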

Test-Plan:
- pytest test/test_torch.py -k "test_isposinf_isneginf"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41588

Reviewed By: ngimel

Differential Revision: D22770732

Pulled By: mruberry

fbshipit-source-id: 7448653e8fb8df6b9cd4604a4739fe18a1135578
2020-07-29 03:29:31 -07:00
1c5c289b62 [pt] Add incude_last_offset option to EmbeddingBag mean and max (#42215)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42215

Specifically on https://github.com/pytorch/pytorch/pull/27477#discussion_r371402079

We would like include_last_offset=True to be supported for the other reduction types, like mean and max, as well. The current gap causes further code fragmentation in DPER (https://www.internalfb.com/intern/diff/D22794469/).

More details: https://www.internalfb.com/intern/diff/D22794469/?dest_fbid=309597093427021&transaction_id=631457624153457
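
A minimal sketch of the include_last_offset semantics being extended to mean/max (sizes are illustrative):

```python
import torch

bag = torch.nn.EmbeddingBag(10, 3, mode='mean', include_last_offset=True)
inp = torch.tensor([0, 1, 2, 3])
# offsets carries a trailing entry equal to len(input):
# two bags here, over indices [0, 1] and [2, 3]
offsets = torch.tensor([0, 2, 4])
print(bag(inp, offsets).shape)  # torch.Size([2, 3])
```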

ghstack-source-id: 108733009

Test Plan:
```
buck test mode/dev-nosan //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu"
```

```
(base) [jianyuhuang@devbig281.ftw3.facebook.com: ~/fbsource/fbcode/caffe2/test] $ TORCH_SHOW_CPP_STACKTRACES=1 buck test mode/dev-nosan //caffe2/test:
nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu" --print-passing-details
Parsing buck files: finished in 1.2 sec
Building: finished in 5.5 sec (100%) 10130/10130 jobs, 2 updated
  Total time: 6.7 sec
More details at https://www.internalfb.com/intern/buck/build/dbdc2063-69d8-45cb-9146-308a9e8505ef
First unknown argument: --print-passing-details.
Falling back to TestPilot classic.
Trace available for this run at /tmp/testpilot.20200728-195414.1422748.log
TestPilot test runner for Facebook. See https://fburl.com/testpilot for details.
Testpilot build revision cd2638f1f47250eac058b8c36561760027d16add fbpkg f88726c8ebde4ba288e1172a348c7f46 at Mon Jul 27 18:11:43 2020 by twsvcscm from /usr/local/fbprojects/packages/testinfra.testpilot/887/t.par
Discovering tests
Running 1 test
Started new test run: https://our.intern.facebook.com/intern/testinfra/testrun/844425097242375
      ✓ caffe2/test:nn - test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu (test_nn.TestNNDeviceTypeCPU) 0.162 1/1 (passed)
Test output:
> /data/users/jianyuhuang/fbsource/fbcode/buck-out/dev/gen/caffe2/test/nn#binary,link-tree/torch/_utils_internal.py:103: DeprecationWarning: This is a NOOP in python >= 3.7, its just too dangerous with how we write code at facebook. Instead we patch os.fork and multiprocessing which can raise exceptions if a deadlock would happen.
>   threadSafeForkRegisterAtFork()
> /usr/local/fbcode/platform007/lib/python3.7/importlib/_bootstrap.py:219: ImportWarning: can't resolve package from __spec__ or __package__, falling back on __name__
and __path__
>   return f(*args, **kwds)
> test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu (test_nn.TestNNDeviceTypeCPU) ... Couldn't download test skip set, leaving all tests enabled...
> ok
>
> ----------------------------------------------------------------------
> Ran 1 test in 0.162s
>
> OK
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/844425097242375
Summary (total time 5.54s):
  PASS: 1
  FAIL: 0
  SKIP: 0
  FATAL: 0
  TIMEOUT: 0
  OMIT: 0
Did _not_ run with tpx. See https://fburl.com/tpx for details.
```

Reviewed By: dzhulgakov

Differential Revision: D22801881

fbshipit-source-id: 80a624465727081bb9bf55c28419695a3d79c6e5
2020-07-29 01:20:00 -07:00
6b3f335641 Enables torch.full bool and integer type inference (#41912)
Summary:
After being deprecated in 1.5 and throwing a runtime error in 1.6, we can now enable torch.full inferring its dtype when given bool and integer fill values. This PR enables that inference and updates the tests and docs to reflect this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41912

Reviewed By: pbelevich

Differential Revision: D22790718

Pulled By: mruberry

fbshipit-source-id: 8d1eb01574b1977f00bc0696974ac38ffdd40d9e
2020-07-28 23:11:08 -07:00
8c653e05ff DOC: fail to build if there are warnings (#41335)
Summary:
Merge after gh-41334 and gh-41321 (EDIT: both are merged).
Closes gh-38011

This is the last in a series of PRs to build documentation without warnings. It adds `-WT --keep-going` to the sphinx build, which will [fail the build if there are warnings](https://www.sphinx-doc.org/en/master/man/sphinx-build.html#cmdoption-sphinx-build-W), print a [traceback on error](https://www.sphinx-doc.org/en/master/man/sphinx-build.html#cmdoption-sphinx-build-T) and [finish the build](https://www.sphinx-doc.org/en/master/man/sphinx-build.html#cmdoption-sphinx-build-keep-going) even when there are warnings.

It should fail now, but pass once the PRs mentioned at the top are merged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41335

Reviewed By: pbelevich

Differential Revision: D22794425

Pulled By: mruberry

fbshipit-source-id: eb2903e50759d1d4f66346ee2ceebeecfac7b094
2020-07-28 22:33:44 -07:00
4b108ca763 refactor save_data as non member function (#42045)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42045

This PR changes the save_data() member function of torch::jit::mobile::Module, which was introduced in #41403, to be the non-member function torch::jit::mobile::_save_parameters() (taking a mobile Module as its first argument).

In addition, this PR:
* adds a getter function _ivalue() for the mobile::Module object
* renames torch::jit::mobile::_load_mobile_data() to torch::jit::mobile::_load_parameters()
* refactors the import.h header file into import.h and import_data.h

Test Plan: Imported from OSS

Reviewed By: kwanmacher, iseeyuan

Differential Revision: D22766781

Pulled By: ann-ss

fbshipit-source-id: 5cabae31927187753a958feede5e9a28d71d9e92
2020-07-28 21:52:32 -07:00
8fc5adc88e Remove dead named_tensors_unsupported_error definitions. (#42171)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42171

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D22794980

Pulled By: ezyang

fbshipit-source-id: 250b6566270e19240361d758db55101d6fcb33e9
2020-07-28 21:40:28 -07:00
8deb4fe809 Fix flaky NCCL error handling tests. (#42149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42149

Some of these tests were flaky since we could kill the process in some
way without cleaning up the ProcessGroup. This resulted in issues where the
FileStore didn't clean up appropriately, causing other processes in the
group to crash.

Fixed this by explicitly deleting the process_group before we bring a process
down forcibly.
ghstack-source-id: 108629057

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D22785042

fbshipit-source-id: c31d0f723badbc23b7258e322f75b57e0a1a42cf
2020-07-28 18:38:26 -07:00
b6a9f42758 Add appropriate error messages for ProcessGroupNCCLTest (#42143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42143

Replaces the original makeshift error messages in ProcessGroupNCCLTest
with more appropriate ones.
ghstack-source-id: 108711579

Test Plan: Ran the tests on DevGPU

Reviewed By: mrshenli

Differential Revision: D22778505

fbshipit-source-id: 27109874f0b474a74b09f588cf6e7528d2069702
2020-07-28 18:31:23 -07:00
e4c3f526c8 Fixed Skipping Logic in ProcessGroupNCCLErrors tests (#42192)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42192

This PR fixes the complicated skipping logic for ProcessGroupNCCLErrors Tests - it correctly logs the reason for skipping tests when GPUs are not available or the NCCL version is too old.

This is part of a broader effort to improve the testing of the ProcessGroup and Collectives tests.
ghstack-source-id: 108620568

Test Plan: Tested on devGPU and devvm. Tests are run correctly on GPU and skipped on CPU as expected.

Reviewed By: mrshenli

Differential Revision: D22782856

fbshipit-source-id: 6071dfdd9743f45e59295e5cee09e89c8eb299c9
2020-07-28 16:59:40 -07:00
b2ef7fa359 Add a flag to enforce fp32 to fp16 conversion for all inputs of the onnxifi net. (#39931)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39931

ATT.

Reviewed By: yinghai, ChunliF

Differential Revision: D21993492

fbshipit-source-id: ff386e6e9b95a783906fc1ae6a62462e6559a20b
2020-07-28 16:48:43 -07:00
8a644f0c13 [Shape Inference] Fix InferFC
Summary: Sometimes the first dim of X in FC is BATCH_OF_FEATURE_MAX instead of BATCH. This caused an issue in f207899183 (where the first dim of X is 64 but is set to 1 in inferFC). Change the check from `!= BATCH` to `== UNKNOWN`.

Test Plan: unit test

Reviewed By: yinghai

Differential Revision: D22784691

fbshipit-source-id: eb66ba361d6fe75672b13edbac2fbd269a7e7a00
2020-07-28 16:43:19 -07:00
30eacb5fb6 [quant][graphmode] Support stack (#42187)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42187

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D22801229

fbshipit-source-id: 7d1758c4fb1c8f742a275c3a631605f0f0d08e44
2020-07-28 16:35:34 -07:00
deac621ae2 Stop building PyTorch for VS2017 (#42144)
Summary:
And since CUDA-9.2 is incompatible with VS2019, disable CUDA-9.2 for Windows as well

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42144

Reviewed By: pbelevich

Differential Revision: D22794475

Pulled By: malfet

fbshipit-source-id: 24fc980e6fc75240664b9de8a4a63b1153f8d8ee
2020-07-28 16:09:21 -07:00
3c084fd358 Dequant => Swish => Quant Test case. (#41976)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41976

Dequant => Swish => Quant Test case.

(Note: this ignores all push blocking failures!)

Test Plan: test_deq_swish_quant_nnpi.py.

Reviewed By: hyuen

Differential Revision: D22718593

fbshipit-source-id: 1cee503a27e339af6d89c819007511b90bb6610c
2020-07-28 16:05:12 -07:00
e2344db886 Use Python3.7 when running OSX builds/tests (#42191)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42191

Reviewed By: seemethere

Differential Revision: D22801091

Pulled By: malfet

fbshipit-source-id: b589343ef1bc6896d3d6d8d863f75aa3a102d985
2020-07-28 16:00:54 -07:00
4c7fb8c2b6 make FusionCallback refer to specified GraphFuser context (#41560)
Summary:
Fixes issue where
 - top level fuser's block_ was captured by callback due to [&] capture,
 - recursive/nested fusers would compare erroneously to top-level block_ instead of own block_

Closes (https://github.com/pytorch/pytorch/issues/39810)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41560

Reviewed By: Krovatkin

Differential Revision: D22583196

Pulled By: wconstab

fbshipit-source-id: 8f543cd9ea00e116cf3e776ab168cdd9fed69632
2020-07-28 15:01:24 -07:00
8ddd2c4e1b [pytorch] fix code analyzer for LLVM 9 & 10 (#42135)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42135

Tested the code analyzer with LLVM 9 & 10 and fixed a couple issues:
- Rename local demangle() which is available as public API since LLVM 9;
- Fix falsely associated op registrations due to the `phi` instruction;

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D22795508

Pulled By: ljk53

fbshipit-source-id: 2d47af088acd3312a7ea5fd9361cdccd48940fe6
2020-07-28 14:57:07 -07:00
fd9205e14b Enable caffe2 tests for RocM jobs (#41604)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41604

Reviewed By: ezyang

Differential Revision: D22603703

Pulled By: malfet

fbshipit-source-id: 789ccf2bb79668a5a68006bb877b2d88fb569809
2020-07-28 14:21:42 -07:00
4d17ecb071 Changed Blacklisted to Blocklisted (#42100)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42100

Reviewed By: ngimel

Differential Revision: D22780380

Pulled By: SplitInfinity

fbshipit-source-id: d465c41f1d4951ab6de55cb827c7ef53975209af
2020-07-28 13:21:26 -07:00
030ab2bda5 Replaced whitelist reference with allowlist (#42071)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41741

Replaced whitelist reference with allowlist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42071

Reviewed By: pbelevich

Differential Revision: D22795176

Pulled By: SplitInfinity

fbshipit-source-id: bcf1b8afe516b9684ce0298bc257ef81152ba20c
2020-07-28 12:29:33 -07:00
64965c4572 Replaced blacklist with blocklist (#42097)
Summary:
Closes https://github.com/pytorch/pytorch/issues/41726

Fixes https://github.com/pytorch/pytorch/issues/41726

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42097

Reviewed By: ngimel

Differential Revision: D22779535

Pulled By: SplitInfinity

fbshipit-source-id: 1d414af22a1b3e856a11d64cff4b4d33160d957b
2020-07-28 12:08:54 -07:00
5ed7cd0025 Allow drop_last option in DistributedSampler (#41171)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41171

DistributedSampler allows data to be split evenly across workers in
DDP, but it has always added additional samples in order for the data to be
evenly split in the case that the # of samples is not evenly divisible by the
number of workers. This can cause issues, such as when computing distributed
validation accuracy, where some samples could be counted twice.

This PR adds a drop_last option where the tail of the data is dropped such that
the effective dataset size is still evenly divisible across the workers. This
ensures that DDP can train fine (there are no uneven inputs) and each replica
gets an equal number of data indices.
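
A usage sketch (assuming the option lands as a `drop_last` keyword on `DistributedSampler`, per this summary):

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(10))  # 10 samples, 4 replicas
sampler = DistributedSampler(dataset, num_replicas=4, rank=0, drop_last=True)
print(len(sampler))  # 2 per replica; the 2 tail samples are dropped
```
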
ghstack-source-id: 108617516

Test Plan: Added unittest

Reviewed By: mrshenli

Differential Revision: D22449974

fbshipit-source-id: e3156b751f5262cc66437b9191818b78aee8ddea
2020-07-28 11:33:08 -07:00
48ae5945de Skip TestExtractPredictorNet if compiled without OpenCV (#42168)
Summary:
Found while trying to get RocM Caffe2 CI green

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42168

Reviewed By: seemethere

Differential Revision: D22791879

Pulled By: malfet

fbshipit-source-id: 8f7ef9711bdc5941b2836e4c8943bb95c72ef8af
2020-07-28 11:26:55 -07:00
f666be7bc1 [vulkan] support add for dim < 4 (#41222)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41222

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22754937

Pulled By: IvanKobzarev

fbshipit-source-id: f8c5e55c965c0a805e75c63b21f410fb0c323515
2020-07-28 11:15:37 -07:00
b3a9e21a29 [vulkan] mm op through addmm (#41221)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41221

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22754938

Pulled By: IvanKobzarev

fbshipit-source-id: f9a0f48d7943a85b7dbb3fc9edf9e214ba07543b
2020-07-28 11:13:48 -07:00
b0424a895c Raise RuntimeError for zero stride pooling (#41819)
Summary:
Close https://github.com/pytorch/pytorch/issues/41767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41819

Reviewed By: mrshenli

Differential Revision: D22780634

Pulled By: ngimel

fbshipit-source-id: 376ce5229ad5bd60804d839340d2c6505cf3288d
2020-07-28 11:07:12 -07:00
5aa2b572ff replace black list with block (#42091)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41729

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42091

Reviewed By: pbelevich

Differential Revision: D22792096

Pulled By: ezyang

fbshipit-source-id: caafa42d12cbad377b67ddbaba8f84a2b8c98066
2020-07-28 10:23:51 -07:00
2f61aca17b Skip DataIO tests relying on LevelDB if compiled without it (#42169)
Summary:
Found while trying to get RocM Caffe2 job green

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42169

Reviewed By: seemethere

Differential Revision: D22791896

Pulled By: malfet

fbshipit-source-id: 9df6233876aec5ead056365499bab970aa7e8bdc
2020-07-28 10:18:26 -07:00
73ff252913 Back out "[NCCL] DDP communication hook: getFuture()" (#42152)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42152

Original commit changeset: 8c059745261d

Test Plan: .

Reviewed By: ajtulloch, jianyuh

Differential Revision: D22786183

fbshipit-source-id: 51155389d37dc82ccb4d2fa20d350f9d14abeaca
2020-07-28 10:05:35 -07:00
2de549518e Make fmod work with zero divisors consistently (#41948)
Summary:
Currently `torch.tensor(1, dtype=torch.int).fmod(0)` crashes (floating point exception).

This PR should fix this issue.
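
A minimal reproducer (the exact post-fix behavior, an error versus a defined value, is whatever the PR specifies; the try/except below covers either outcome):

```python
import torch

t = torch.tensor(1, dtype=torch.int)
try:
    print(t.fmod(0))  # previously crashed the whole process with SIGFPE
except RuntimeError as e:
    print("raised:", e)
```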

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41948

Reviewed By: ngimel

Differential Revision: D22771081

Pulled By: ezyang

fbshipit-source-id: a94dd35d6cd85daa2d51cae8362004e31f97989e
2020-07-28 08:58:39 -07:00
e7ed0b3fae Avoid zero division in _cubic_interpolate (#42093)
Summary:
I encountered a zero division problem when using LBFGS:

```
File "/home/yshen/anaconda3/lib/python3.7/site-packages/torch/optim/lbfgs.py", line 118, in _strong_wolfe
    bracket[1], bracket_f[1], bracket_gtd[1])
File "/home/yshen/anaconda3/lib/python3.7/site-packages/torch/optim/lbfgs.py", line 21, in _cubic_interpolate
    d1 = g1 + g2 - 3 * (f1 - f2) / (x1 - x2)
ZeroDivisionError: float division by zero
```

My solution is to check whether the line-search bracket is too small before calling _cubic_interpolate; a sketch follows below.
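
A sketch of the guarded interpolation (assumed form, mirroring the formula from the traceback; `eps` is a hypothetical tolerance):

```python
import math

def cubic_interpolate_safe(x1, f1, g1, x2, f2, g2, eps=1e-10):
    # If the bracket is degenerate, bisect instead of dividing by (x1 - x2).
    if abs(x1 - x2) < eps:
        return 0.5 * (x1 + x2)
    d1 = g1 + g2 - 3 * (f1 - f2) / (x1 - x2)  # safe: denominator is nonzero
    d2_square = d1 ** 2 - g1 * g2
    if d2_square < 0:
        return 0.5 * (x1 + x2)  # no real cubic minimizer: bisect
    d2 = math.sqrt(d2_square)
    return x2 - (x2 - x1) * ((g2 + d2 - d1) / (g2 - g1 + 2 * d2))

print(cubic_interpolate_safe(0.0, 1.0, -1.0, 0.0, 1.0, -1.0))  # degenerate -> 0.0
```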

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42093

Reviewed By: pbelevich

Differential Revision: D22770667

Pulled By: mrshenli

fbshipit-source-id: f8fdfcbd3fd530235901d255208fef8005bf898c
2020-07-28 08:32:00 -07:00
f0c46878c6 Fix the issue GPU skip message(#41378) (#41973)
Summary:
Related https://github.com/pytorch/pytorch/issues/41378

Fix the GPU skip message issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41973

Reviewed By: pbelevich

Differential Revision: D22753459

Pulled By: mrshenli

fbshipit-source-id: d24b531926e28b860ae90b9ae07e8ca3438d21db
2020-07-28 08:28:31 -07:00
3acd6b7359 Document formatting (#42065)
Summary:
Apply syntax highlighting to the command in `README.md`. This makes `README.md` easier to read.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42065

Reviewed By: pbelevich

Differential Revision: D22753418

Pulled By: mrshenli

fbshipit-source-id: ebfa90fdf60478c34bc8a7284d163e0254cfbe3b
2020-07-28 08:27:42 -07:00
14e75fbdb9 Remove py2 specific code from test_utils.py (#42105)
Summary:
As https://github.com/pytorch/pytorch/issues/23795 mentions, Python 2 support has been dropped. cc albanD
Fixes https://github.com/pytorch/pytorch/issues/31796

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42105

Reviewed By: ngimel

Differential Revision: D22765768

Pulled By: mrshenli

fbshipit-source-id: bae114a21cd5598004c7f92d313938ad826b4a24
2020-07-28 08:25:40 -07:00
86492410bc Don't run tests with custom arguments with pytest (#41397)
Summary:
This patch basically removes the `-m pytest` parameters when `extra_unittest_args` is used (e.g. `--subprocess`)

Fixes https://github.com/pytorch/pytorch/issues/41393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41397

Reviewed By: pbelevich

Differential Revision: D22792133

Pulled By: ezyang

fbshipit-source-id: 29930d703666f4ecc0d727356bbab4a5f7ed4860
2020-07-28 08:17:36 -07:00
672ed3c06b replace onnx producer_version when updating results (#41910)
Summary:
xref gh-39002 which handled the reading but not the writing of the onnx expect files, and the last comment in that PR which points out `XXX` was suboptimal.
xref [this comment](https://github.com/pytorch/pytorch/pull/37091#discussion_r456460168) which pointed out the problem.

This PR:
- replaces `XXX` with `CURRENT_VERSION` in the stored files
- ensures that updating the results with the `--accept` flag will maintain the change

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41910

Reviewed By: pbelevich

Differential Revision: D22758671

Pulled By: ezyang

fbshipit-source-id: 47c345c66740edfc8f0fb9ff358047a41e19b554
2020-07-28 08:15:01 -07:00
b282297559 Replace whitelist with allowlist (#42067)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41757

I've replaced all the whitelist with allowlist for this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42067

Reviewed By: pbelevich

Differential Revision: D22791690

Pulled By: malfet

fbshipit-source-id: 638c13cf49915f5c83bd79c7f4a39b8390cc15b4
2020-07-28 08:01:16 -07:00
1a8269a566 Replace blacklist with blocklist in test/run_test.py file. (#42011)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41716
test/run_test.py file updated with an appropriate replacement for blacklist and whitelist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42011

Reviewed By: pbelevich

Differential Revision: D22791836

Pulled By: malfet

fbshipit-source-id: 8139649c5b70c876b711e25c33f3051ea8461063
2020-07-28 07:56:01 -07:00
e179966248 [caffe2][tpx] log to stderr (#42162)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42162

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D22791440

fbshipit-source-id: 14f16cd7a94a57161c5724177b518527f486232d
2020-07-28 07:50:27 -07:00
0571cfd875 Implement MultiBatchVmapTransform::logicalToPhysical(TensorList) (#41942)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41942

This function:
- permutes all batch dims to the front of the tensors
- aligns all the batch dims to the collective levels of all the tensors
- expands all of the batch dims such that they are present in each of
the result tensors

This function is useful for the next diff up on the stack (which is
implementing a fallback kernel for BatchedTensor). It's also useful in
general for implementing batching rules on operators that take in
multiple batch dimensions at the front of each tensor (but we don't have
too many of those in PyTorch).

Test Plan: - `./build/bin/vmap_test`

Reviewed By: ezyang

Differential Revision: D22764104

Pulled By: zou3519

fbshipit-source-id: d42cc8824a1bcf258687de164b7853af52852f53
2020-07-28 07:45:25 -07:00
1994ab1473 Optimize alignBatchDimsAtFront (#41941)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41941

If we know that the tensor already has the desired aligned size, we
don't need to put in the effort to align it.

Test Plan: - `./build/bin/vmap_test`, `pytest test/test_vmap.py -v`

Reviewed By: albanD

Differential Revision: D22764101

Pulled By: zou3519

fbshipit-source-id: a2ab7ce7b98d405ae905f7fd98db097210bfad65
2020-07-28 07:45:23 -07:00
5124436af4 Fix const correctness for VmapPhysicalView struct methods (#41940)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41940

See title. I marked methods that don't mutate the VmapPhysicalView as
`const`.

Test Plan: - wait for tests

Reviewed By: albanD

Differential Revision: D22764102

Pulled By: zou3519

fbshipit-source-id: 40f957ad61c85f0e5684357562a541a2712b1f38
2020-07-28 07:43:09 -07:00
2bc7dae2fc Use new sccache for RocM builds (#42134)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42134

Reviewed By: seemethere

Differential Revision: D22782146

Pulled By: malfet

fbshipit-source-id: 85ba69a705600e30ae0eddbf654298b3dc6f96ed
2020-07-28 07:15:56 -07:00
6bd88f581a Revert D22790238: [caffe2][tpx] Use logger instead of print
Test Plan: revert-hammer

Differential Revision:
D22790238 (3c6fae6567)

Original commit changeset: c0a801cdf7f0

fbshipit-source-id: cadfbd22f7d3ce656624483c9a19062f7c9a5b61
2020-07-28 06:11:30 -07:00
3c6fae6567 [caffe2][tpx] Use logger instead of print
Test Plan: CI?

Differential Revision: D22790238

fbshipit-source-id: c0a801cdf7f0da489c67708a0eb1b498ff104c64
2020-07-28 04:26:51 -07:00
5336ccc1b2 [BugFix] Fix bug in onnx::SsaRewrite (#42148)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42148

Differential Revision: D22687388

fbshipit-source-id: facf7a186dd48d6f919d0ff5d42f756977c3f9f4
2020-07-28 01:44:47 -07:00
4f723825b4 [vulkan] adaptive_avg_pool2d (#41220)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41220

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22754943

Pulled By: IvanKobzarev

fbshipit-source-id: 91a94f32db005ebb693384f4d27efe66e2c33a14
2020-07-27 23:24:14 -07:00
0a0960126c If we don't collect tracing, always free the trace data (#42118)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42118

We toggle tracing on with a certain probability. In the case of 3 inferences with trace on/off/on, we leak the trace from the first inference. Always cleaning up the trace fixes it.

Test Plan:
predictor

I created a tiny repro here: D22786551

With this fix, this issue is gone.

Reviewed By: gcatron

Differential Revision: D22768382

fbshipit-source-id: 9ee0bbcb2bc5f76107dae385759fe578909a683d
2020-07-27 21:49:30 -07:00
83762844e5 Make run_binary_ops_test function generic and Add tests to add_kernel function (#42101)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42101

1. Add test fixture `atest class` to store global variables
2. Make `run_binary_ops_test` function generic: can dispose different dtypes and different numbers of parameters
3. add test to `add_kernel`

Test Plan:
Run locally to check cover the corresponding code part in `BinaryOpsKernel.cpp`.
CI

Reviewed By: malfet

Differential Revision: D22760015

fbshipit-source-id: 95b47732f661124615c0856efa827445dd714125
2020-07-27 21:03:00 -07:00
c062cdbd90 Log the net if blob doesn't exist when setting output record (#41971)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41971

Reviewed By: wx1988

Differential Revision: D22490309

fbshipit-source-id: d967ee211b610f5523a307b5266b9fcb0277a21c
2020-07-27 19:13:50 -07:00
f805184165 onnxifi: make it work with AsyncIf
Summary:
The onnxifi path didn't handle the input/output name rewrite for SSA correctly for the AsyncIf op. Add support for it.

Also fixed a place where we lose the net type while doing the onnxifi transform.

Test Plan: Load 163357582_593 which is a multi feed model that uses AsyncIf. This used to fail with c2 not finding some blobs in workspace. Now it works.

Reviewed By: dhe95

Differential Revision: D21268230

fbshipit-source-id: ce7ec0e952513d0f251df1bfcfb2b0250f51fd94
2020-07-27 18:27:35 -07:00
c76fada4a8 Let DDP.train() return self to stay consistent with nn.Module (#42131)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42131

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D22775311

Pulled By: mrshenli

fbshipit-source-id: ac9e6cf8b2381036a2b6064bd029dca361a81777
2020-07-27 18:22:13 -07:00
bcd75bd683 [ModelLints] Refine dropout lint message. (#42046)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42046

Refine dropout lint message as we have enabled dropout operator removal in optimize_for_mobile method.
ghstack-source-id: 108607182

Test Plan: buck test ai_infra/ai_mobile_infra/tests:mobile_model_util_tests

Reviewed By: kimishpatel

Differential Revision: D22741132

fbshipit-source-id: 8f87356aae2bd9c89d1cad0d7be7286278bb14ad
2020-07-27 18:15:30 -07:00
d5de616a4a Enable c10d Store tests in CI (#42128)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42128

Reviewed By: pritamdamania87

Differential Revision: D22774445

Pulled By: mrshenli

fbshipit-source-id: 6e5e56f42833414ef375b6cd23fdb3260cb07be9
2020-07-27 18:12:37 -07:00
509c18a096 Documentation for torch.optim.swa_utils (#41228)
Summary:
This PR adds a description of `torch.optim.swa_utils` added in https://github.com/pytorch/pytorch/pull/35032 to the docs at `docs/source/optim.rst`. Please let me know what you think!

vincentqb andrewgordonwilson

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41228

Reviewed By: ngimel

Differential Revision: D22609451

Pulled By: vincentqb

fbshipit-source-id: 8dd98102c865ae4a074a601b047072de8cc5a5e3
2020-07-27 17:52:16 -07:00
646042e0fb Add suggestion to enumerate ModuleDict in error message (#41946)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41946

Reviewed By: ngimel

Differential Revision: D22774243

Pulled By: wconstab

fbshipit-source-id: 5cfbe52b5b1c540f824593e67ae6ba4973458bb5
2020-07-27 16:24:00 -07:00
1df35ba61e Back out "Support aarch32 neon backend for Vec256"
Summary: Original commit changeset: 1c22cf67ec35

Test Plan: sandcastle, testing on Portal

Reviewed By: currybeef

Differential Revision: D22774614

fbshipit-source-id: 8897aec5df32092c4df86c0d54b0d2fe58d66e66
2020-07-27 16:09:05 -07:00
d198fb3efe changed white-allowlisted (#41796)
Summary:
closes https://github.com/pytorch/pytorch/issues/41749

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41796

Reviewed By: gmagogsfm

Differential Revision: D22718991

Pulled By: SplitInfinity

fbshipit-source-id: 6c2d2b0e3b1e79fd515f9bdd395335a32f525a26
2020-07-27 16:01:45 -07:00
cb9c2049cd replace blacklist in aten/src/ATen/native/cudnn/Conv.cpp (#41627)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41700.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41627

Reviewed By: gmagogsfm

Differential Revision: D22678492

Pulled By: SplitInfinity

fbshipit-source-id: 75b82bd10059754d8e6c25fc20e9dde775d54698
2020-07-27 15:56:36 -07:00
6ca5421a8f Enable non-synchronizing cub scan for cum* operations (#42036)
Summary:
This uses cub for cum* operations because, unlike thrust, cub is non-synchronizing.
Cub does not support more than `2**31`-element tensors out of the box (in fact, due to cub bugs the cutoff point is even smaller),
so to support that I split the tensor into `2**30`-element chunks and modify the first value of the second and subsequent chunks to contain the cumsum result of the previous chunks. Since the modification is done in place on the source tensor, if something goes wrong and we error out before the source tensor is reverted to its original state, the source tensor will be corrupted, but in most cases errors will invalidate the full CUDA context anyway.
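
For intuition, a toy CPU sketch of the chunk-carry scheme (tiny chunk size purely for illustration; the real kernel scans `2**30`-element CUDA chunks):

```python
import torch

def chunked_cumsum(x, chunk=4):
    # Scan each chunk, then fold the running total into the first element of
    # the next chunk, mirroring the in-place trick described above.
    out = x.clone()
    carry = torch.tensor(0, dtype=x.dtype)
    for start in range(0, out.numel(), chunk):
        out[start] += carry
        out[start:start + chunk] = torch.cumsum(out[start:start + chunk], dim=0)
        carry = out[min(start + chunk, out.numel()) - 1].clone()
    return out

x = torch.arange(1, 11)
print(torch.equal(chunked_cumsum(x), torch.cumsum(x, dim=0)))  # True
```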

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42036

Reviewed By: ajtulloch

Differential Revision: D22749945

Pulled By: ngimel

fbshipit-source-id: 9fc9b54d466df9c8885e79c4f4f8af81e3f224ef
2020-07-27 15:44:03 -07:00
330a107199 Refactor lite serializer dependencies from full jit (#42127)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42127

This diff renames core_autograd_sources to core_trainer_sources and moves/adds dependencies for the lite trainer in order to build the serializer functionality internally.
ghstack-source-id: 108589416

Test Plan: Manually tested serializer functionality from the internal lite trainer and verified that data is written correctly.

Reviewed By: iseeyuan

Differential Revision: D22738293

fbshipit-source-id: 992beb0c4368b2395f5bd5563fb2bc12ddde39a1
2020-07-27 15:38:54 -07:00
f7d50f50b9 .circleci: Prefer netrc for docs push (#42136)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42136

Expect was giving weird issues so let's just use netrc since it doesn't
rely on janky expect behavior

Another follow up for: https://github.com/pytorch/pytorch/pull/41964

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: yns88

Differential Revision: D22778940

Pulled By: seemethere

fbshipit-source-id: 1bdf879a5cfbf68a7d2d34b6966c20f95bd0a3b5
2020-07-27 15:28:46 -07:00
ed822de0fc change 2 instances of blacklist to blocklist in tools/pyi/gen_pyi.py (#41979)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41979

Reviewed By: ngimel

Differential Revision: D22764112

Pulled By: zou3519

fbshipit-source-id: 3f8580c96cf45078a9df3cd9ca6fdb10d58e143f
2020-07-27 14:12:32 -07:00
5246bc4e87 register parameters correctly in c++ MultiheadAttention (#42037)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42037

This is to fix #41951

Test Plan: Imported from OSS

Reviewed By: yf225

Differential Revision: D22764717

Pulled By: glaringlee

fbshipit-source-id: e6da0aeb05a2356f52446e6d5fad391f2cd1cf6f
2020-07-27 13:58:11 -07:00
e59db43313 Find hip properly (#42064)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41886

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42064

Reviewed By: seemethere

Differential Revision: D22757115

Pulled By: malfet

fbshipit-source-id: 9c8805e6eb0b7d7defe0ecb08c1e45dcc775a237
2020-07-27 13:47:01 -07:00
d6f1346c37 Add a new op for converting the dense feature to sparse representation
Summary: We need this op to avoid splicing a dense tensor and then using the Mergesinglescaler op.

Test Plan: integrated test with dper2

Differential Revision: D22677523

fbshipit-source-id: f4f9a1f06841b0906ec8cbb435482ae0a89e1721
2020-07-27 12:45:37 -07:00
4281240cb5 Raise error for duplicate params in param group #40967 (#41597)
Summary:
This PR fixes an issue in https://github.com/pytorch/pytorch/issues/40967 where duplicate parameters across different parameter groups are not allowed, but duplicates inside the same parameter group are accepted. After this PR, both cases are treated equally and raise `ValueError`.
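
An illustration of the now-rejected case (per the summary above, this raises `ValueError` after this PR):

```python
import torch

p = torch.nn.Parameter(torch.randn(2))
try:
    torch.optim.SGD([{'params': [p, p]}], lr=0.1)  # duplicate within one group
except ValueError as e:
    print("raised:", e)
```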

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41597

Reviewed By: zou3519

Differential Revision: D22608019

Pulled By: vincentqb

fbshipit-source-id: 6df41dac62b80db042cfefa6e53fb021b49f4399
2020-07-27 12:25:52 -07:00
6367a9d2b0 [vulkan] Shaders caching (#39384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39384

Introducing `ComputeUnitFactory`, which is responsible for providing and caching `ComputeUnit`s (shaders),
using shader name (glsl file name) + workGroupSize as the cache key, stored in a plain `std::map<string, std::shared_ptr>`.

The GLSL_SPV macro changed to take a literal name for the cache key as its first argument.

All constructors of ComputeUnit are changed to use `ComputeUnitFactory`.

Ownership model:
ComputeUnitFactory also owns `vkPipelineCache`, Vulkan's internal cache object ( https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VkPipelineCache.html ).

`VContext` (a global object) owns ComputeUnitFactory, which owns the ComputeUnits and vkPipelineCache. Destroying these requires a valid VkDevice, so they must be destructed before `vkDestroyDevice` in `~VContext`. Since class members are only destructed after the destructor body runs, we force destruction of ComputeUnitFactory before `vkDestroyDevice` via `unique_ptr<ComputeUnitFactory>.reset()`.

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D21962430

Pulled By: IvanKobzarev

fbshipit-source-id: effe60538308805f317c11448b31dbcf670487e8
2020-07-27 11:57:07 -07:00
d4735ff490 Avoid refcount bump in IValue::toStringRef() (#42019)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42019

According to benchmarks, this makes IValue::toStringRef() 3-4x as fast.
ghstack-source-id: 108451154

Test Plan: unit tests

Reviewed By: ezyang

Differential Revision: D22731354

fbshipit-source-id: 3ca3822ea7310d8593e38b1d3e6014d6d80963db
2020-07-27 11:44:27 -07:00
5a6d88d503 Updates to Scale and Zero Point Gradient Calculation (#42034)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42034

In this diff, scale and zero point gradient calculations are updated to correctly reflect the actual backpropagation equation (instead of `dScale * dX`, the near-final output should be `dScale * dY`; the same applies to zero point).

Test Plan:
To execute the unit tests for all affected learnable fake quantize modules and kernels, on a devvm, execute the following command:

`buck test //caffe2/test:quantization -- learnable`

To enable the `cuda` tests, execute the following command:

`buck test mode/dev-nosan //caffe2/test:quantization -- learnable`

Reviewed By: jerryzh168

Differential Revision: D22735668

fbshipit-source-id: 45c1e0fd38cbb2d8d5e60be4711e1e989e9743b4
2020-07-27 11:18:49 -07:00
c261a894d1 Updates to Python Module for Calculation of dX and Addition of Unit Tests (#42033)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42033

In this diff, the Python `_LearnableFakeQuantize` module is updated where the gradient with respect to the input `x` is actually computed instead of passed through. Argument naming is also updated for better clarity; and unit tests on the `PerTensor` and `PerChannel` operators are added for asserting correctness.

Test Plan:
On a devvm, execute the command:

`buck test //caffe2/test:quantization -- learnable_py_module`

To include `cuda` tests as well, run:

`buck test mode/dev-nosan //caffe2/test:quantization -- learnable_py_module`

Reviewed By: jerryzh168

Differential Revision: D22735580

fbshipit-source-id: 66bea7e9f8cb6422936e653500f917aa597c86de
2020-07-27 11:18:47 -07:00
e62bf89273 Renaming variables from dX to dY in Learnable Fake Quantize kernels for Better Clarity (#42032)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42032

In this diff, the arguments named `dX` within the C++ kernels are renamed to `dY` for clarity, to avoid confusion since they don't represent the gradient with respect to the input.

Test Plan:
To test all related fake quantize kernel operators, on a devvm, run the command:

`buck test //caffe2/test:quantization -- learnable`

Reviewed By: z-a-f, jerryzh168

Differential Revision: D22735429

fbshipit-source-id: 9d6d967f08b98a720eca39a4d2280ca8109dcdd6
2020-07-27 11:17:26 -07:00
3e121d9688 Amend docstring and add test for Flatten module (#42084)
Summary:
I've noticed when PR https://github.com/pytorch/pytorch/issues/22245 introduced `nn.Flatten`, the docstring had a bug where it wouldn't render properly on the web, and this PR addresses that. Additionally, it adds a unit test for this module.

**Actual**
![image](https://user-images.githubusercontent.com/13088001/88483672-cf896a00-cf3f-11ea-8b1b-a30d152e1368.png)

**Expected**
![image](https://user-images.githubusercontent.com/13088001/88483642-86391a80-cf3f-11ea-8333-0964a027a172.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42084

Reviewed By: mrshenli

Differential Revision: D22756662

Pulled By: ngimel

fbshipit-source-id: 60c58c18c9a68854533196ed6b9e9fb0d4f83520
2020-07-27 11:04:28 -07:00
4290d0be60 Remove settings for the logit test case. (#42114)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42114

Remove settings for the logit test case.

(Note: this ignores all push blocking failures!)

Test Plan: test_op_nnpi_fp16.py test case.

Reviewed By: hyuen

Differential Revision: D22766728

fbshipit-source-id: 2fe8404b103c613524cf1beddf1a0eb9068caf8a
2020-07-27 10:59:23 -07:00
11e5174926 Added support for Huber Loss (#37599)
Summary:
Current losses in PyTorch only include a (partial) implementation of Huber loss through `smooth l1` based on Fast RCNN - which essentially uses a delta value of 1. Changing/Renaming the [`_smooth_l1_loss()`](3e1859959a/torch/nn/functional.py (L2487)) and refactoring to include delta, enables to use the actual function.

Supplementary to this, I have also made a functional and criterion versions for anyone that wants to set the delta explicitly - based on the functional `smooth_l1_loss()` and the criterion `Smooth_L1_Loss()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/37599

Differential Revision: D21559311

Pulled By: vincentqb

fbshipit-source-id: 34b2a5a237462e119920d6f55ba5ab9b8e086a8c
2020-07-27 10:42:30 -07:00
fbdaa555a2 Enable ProcessGroupGlooTest in CI (take 2) (#42086)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42086

Reviewed By: ngimel

Differential Revision: D22765777

Pulled By: malfet

fbshipit-source-id: ebbcd44f448a1e7f9a3d18fa9967461129dd1dcd
2020-07-27 10:21:59 -07:00
96aaa311c0 Grammar Changes (#42076)
Summary:
Small grammatical updates.
![Screenshot (188)](https://user-images.githubusercontent.com/56619747/88471271-02723480-cf25-11ea-8fd1-ae98d5ebcc86.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42076

Reviewed By: mrshenli

Differential Revision: D22756651

Pulled By: ngimel

fbshipit-source-id: e810eb7397a5831d801348c8fff072854658830e
2020-07-26 13:53:41 -07:00
b7bda236d1 DOC: split quantization.rst into smaller pieces (#41321)
Summary:
xref gh-38010 and gh-38011.

After this PR, there should be only two warnings:
```
pytorch/docs/source/index.rst:65: WARNING: toctree contains reference to nonexisting \
      document 'torchvision/index'
WARNING: autodoc: failed to import class 'tensorboard.writer.SummaryWriter' from module \
     'torch.utils'; the following exception was raised:
No module named 'tensorboard'
```

If tensorboard and torchvision are prerequisites to building docs, they should be added to the `requirements.txt`.

As for breaking up quantization into smaller pieces: I split out the list of supported operations and the list of modules to separate documents. I think this makes the page flow better, makes it much "lighter" in terms of page cost, and also removes some warnings since the same class names appear in multiple sub-modules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41321

Reviewed By: ngimel

Differential Revision: D22753099

Pulled By: mruberry

fbshipit-source-id: d504787fcf1104a0b6e3d1c12747ec53450841da
2020-07-25 23:59:40 -07:00
6af659629a DOC: fix two build warnings (#41334)
Summary:
xref gh-38011.

Fixes two warnings when building documentation by
- using the external link to torchvision
- install tensorboard before building documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41334

Reviewed By: ngimel

Differential Revision: D22753083

Pulled By: mruberry

fbshipit-source-id: 876377e9bd09750437fbfab0378664b85701f827
2020-07-25 23:38:33 -07:00
47e6d4b3c8 Revert D22741514: [pytorch][PR] Enable ProcessGroupGlooTest in CI
Test Plan: revert-hammer

Differential Revision:
D22741514 (45e6f2d600)

Original commit changeset: 738d2e27f523

fbshipit-source-id: 0381105ed0ab676b0abd1927f602a35b1b264a6a
2020-07-25 18:19:17 -07:00
b00c05c86c update cub submodule (#42042)
Summary:
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42042

Reviewed By: mruberry

Differential Revision: D22752345

Pulled By: ngimel

fbshipit-source-id: 363735bfe3d49bab12fedef43b68c9dc9e372815
2020-07-25 17:52:45 -07:00
c5b4f60fc2 Move qconfig removal into convert() (#41930)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41930

As title
ghstack-source-id: 108517079

Test Plan: CI

Reviewed By: jerryzh168

Differential Revision: D22698386

fbshipit-source-id: 4f748c9bae4a0b615aa69c7cc8d8e451e5d26863
2020-07-25 13:27:13 -07:00
12cd083fd7 Updates torch.tensor, torch.as_tensor, and sparse ctors to use the device of inputs tensors they're given, by default (#41984)
Summary:
**BC-Breaking Note**

This PR changes the behavior of the torch.tensor, torch.as_tensor, and sparse constructors. When given a tensor as input and a device is not explicitly specified, these constructors now always infer their device from the tensor. Historically, if the optional dtype kwarg was provided then these constructors would not infer their device from tensor inputs. Additionally, for the sparse ctor a runtime error is now thrown if the indices and values tensors are on different devices and the device kwarg is not specified.

**PR Summary**
This PR's functional change is a single line:

```
auto device = device_opt.has_value() ? *device_opt : (type_inference ? var.device() : at::Device(computeDeviceType(dispatch_key)));
```
=>
```
auto device = device_opt.has_value() ? *device_opt : var.device();
```

in `internal_new_from_data`. This line entangled whether the function was performing type inference with whether it inferred its device from an input tensor, and in practice meant that

```
t = torch.tensor((1, 2, 3), device='cuda')
torch.tensor(t, dtype=torch.float64)
```

would return a tensor on the CPU, not the default CUDA device, while

```
t = torch.tensor((1, 2, 3), device='cuda')
torch.tensor(t)
```

would return a tensor on the device of `t`!

This behavior is niche and odd, but came up while aocsa was fixing https://github.com/pytorch/pytorch/issues/40648.

An additional side effect of this change is that the indices and values tensors given to a sparse constructor must be on the same device, or the sparse ctor must specify the device kwarg. The tests in test_sparse.py have been updated to reflect this behavior.
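
A quick sketch of the behavior change (requires a CUDA build; outputs follow the new inference rule described above):

```python
import torch

t = torch.tensor((1, 2, 3), device='cuda')
# Device is now inferred from the input tensor even when dtype is given:
print(torch.tensor(t, dtype=torch.float64).device)  # cuda:0 (was cpu before)
print(torch.tensor(t).device)                       # cuda:0, as before
```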

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41984

Reviewed By: ngimel

Differential Revision: D22721426

Pulled By: mruberry

fbshipit-source-id: 909645124837fcdf3d339d7db539367209eccd48
2020-07-25 02:49:45 -07:00
366c014a77 [Resubmit #41318] NCCL backend support for torch bool (#41959)
Summary:
Resubmit of https://github.com/pytorch/pytorch/issues/41318 pushed to ci-all branch.

Original description:
Closes https://github.com/pytorch/pytorch/issues/24137.
This PR adds support for the torch.bool tensor type to ProcessGroupNCCL. For most types we use the existing mapping, but since bool is not supported as a native ncclDataType_t, we add the following logic:

Map at::kBool to ncclUint8
During reduction (allreduce, for example), if the operation is SUM, we instead override it to a MAX to avoid overflow issues. The rest of the operations work with no changes. In the boolean case, changing sum to max makes no correctness difference since they both function as a bitwise OR.
The reduction logic (for example for reduce/allreduce) is as follows:
sum, max = bitwise or
product, min = bitwise and

Note that this PR doesn't add support for BAND/BOR/BXOR. That is because these reduction ops currently are not supported by NCCL backend, see https://github.com/pytorch/pytorch/issues/41362
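
A toy, NCCL-free illustration of why SUM is rewritten to MAX for bools: a uint8 SUM can wrap past 255 participants, while MAX is a faithful bitwise OR:

```python
flags = [1] * 300        # one bool flag per rank, all True, stored as uint8 0/1
print(sum(flags) % 256)  # 44: a uint8 SUM wraps and could even land on 0
print(max(flags))        # 1: MAX acts as bitwise OR, always correct
```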

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41959

Reviewed By: mrshenli

Differential Revision: D22719665

Pulled By: rohan-varma

fbshipit-source-id: 8bc4194a8d1268589640242277124f277d2ec9f1
2020-07-24 23:44:29 -07:00
38580422bb Allow specifying PYTHON executable to build_android (#41927)
Summary:
build_android.sh should check the PYTHON environment variable before trying to use the default python executable.
Even in that case, it tries to pick python3 over python2 when available.

Closes https://github.com/pytorch/pytorch/issues/41795

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41927

Reviewed By: seemethere

Differential Revision: D22696850

Pulled By: malfet

fbshipit-source-id: be236c2baf54a1cd111e55ee7743cdc93cb6b9d7
2020-07-24 18:34:42 -07:00
8e03c38a4f Add prim::EnumName and prim::EnumValue ops (#41965)
Summary:
[2/N] Implement Enum JIT support

Add prim::EnumName and prim::EnumValue and their lowerings to support getting the `name` and `value` attributes of Python enums.

Supported:
- Enum-typed function arguments
- Using the Enum type and comparing enum values
- Getting name/value attrs of enums

TODO (a usage sketch of the intended end state follows below):
- Add Python sugared value for Enum
- Support Enum-typed return values
- Support enum values of different types in the same Enum class
- Support serialization and deserialization
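
A sketch of that end state in TorchScript (hypothetical usage based on this summary; it may not fully work until the later PRs in the stack land):

```python
from enum import Enum
import torch

class Color(Enum):
    RED = 1
    GREEN = 2

@torch.jit.script
def describe(c: Color) -> str:
    return c.name  # lowered via prim::EnumName; c.value uses prim::EnumValue

print(describe(Color.RED))  # 'RED'
```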

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41965

Reviewed By: eellison

Differential Revision: D22714446

Pulled By: gmagogsfm

fbshipit-source-id: db8c4e26b657e7782dbfc2b58a141add1263f76e
2020-07-24 18:33:18 -07:00
6287f9ed65 Remove AllGatherTestWithTimeout (#41945)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41945

This test previously did a thread sleep before launching the allgather operation, and then waited on the work object. Since the sleep was done before the work object was created, it did not affect the allgather call, and thus, did not test work-level timeouts as intended.

I am removing this test for now. In the future we can add this test back, but would need to somehow inject a `cudaSleep` call before the  allgather (so the collective operation itself is delayed). This may require overriding the `ProcessGroupNCCL::collective`, so it's a bit more heavy-weight.

In the meantime, we can remove this test - work-level timeouts are still thoroughly tested with Gloo.
ghstack-source-id: 108370178

Test Plan: Ran ProcessGroupNCCL tests on devGPU

Reviewed By: jiayisuse

Differential Revision: D22702291

fbshipit-source-id: a36ac3d83abfab6351c0476046a2f3b04a80c44d
2020-07-24 18:17:48 -07:00
45e6f2d600 Enable ProcessGroupGlooTest in CI (#41985)
Summary:
Partially addresses https://github.com/pytorch/pytorch/issues/41143

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41985

Reviewed By: rohan-varma

Differential Revision: D22741514

Pulled By: malfet

fbshipit-source-id: 738d2e27f52334e402b65b724b8ba3b0b41372ee
2020-07-24 17:44:00 -07:00
cf7e7909d5 NCCL must depend on librt (#41978)
Summary:
Since NCCL makes calls to shm_open/shm_close it must depend on librt on Linux

This should fix `DSO missing from command line` error on some platforms

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41978

Reviewed By: colesbury

Differential Revision: D22721430

Pulled By: malfet

fbshipit-source-id: d2ae08ce9da3979daaae599e677d5e4519b080f0
2020-07-24 16:47:19 -07:00
dede71d6e3 Support aarch32 neon backend for Vec256 (#41267)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41267

Due to an LLVM bug and some unsupported intrinsics, we could not directly
use intrinsics to implement the aarch32 NEON backend for Vec256.
Instead we resort to inline assembly.

Test Plan:
vec256_test run on android phone.

Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22482196

fbshipit-source-id: 1c22cf67ec352942c465552031e9329550b27b3e
2020-07-24 15:49:26 -07:00
976e614915 caffe2: add PIPELINE tag (#41482)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41482

This adds a new tag for use with pipeline parallelism.

Test Plan: CI

Reviewed By: heslami

Differential Revision: D22551487

fbshipit-source-id: 90910f458a9bce68f7ef684773322a49aa24494a
2020-07-24 15:25:14 -07:00
0c0864c6be update tests to run back-compat check using new binary (#41949)
Summary:
Instead of exporting schemas using the current binary under test, install the nightly build and export its schemas, then use those in a back-compat test run by the current binary under test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41949

Reviewed By: houseroad

Differential Revision: D22731054

Pulled By: bradleyhd

fbshipit-source-id: 68a7e7637b9be2604c0ffcde2a40dd208057ba72
2020-07-24 15:20:05 -07:00
42a0b51f71 Easier english updated tech docs (#42016)
Summary:
Just added an easier-to-understand wording to the tech docs.

![Screenshot from 2020-07-24 21-48-07](https://user-images.githubusercontent.com/55920093/88412562-6991cb00-cdf7-11ea-9612-5f69146ea233.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42016

Reviewed By: colesbury

Differential Revision: D22735752

Pulled By: mrshenli

fbshipit-source-id: 8e3dfb721f51ee0869b0df66bf856d9949553453
2020-07-24 14:36:17 -07:00
becc1b26dd updated white list/allow list (#41789)
Summary:
closes https://github.com/pytorch/pytorch/issues/41758

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41789

Reviewed By: izdeby

Differential Revision: D22648038

Pulled By: SplitInfinity

fbshipit-source-id: 5abc895789d8803ca542dfc0c62069350c6977c4
2020-07-24 14:26:16 -07:00
7e84913233 .circleci: Make sure to install expect for docs push (#41964)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41964

Since we're not executing this in a docker container we should go ahead
and install expect explicitly

This is a follow up PR to #41871

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D22736738

Pulled By: seemethere

fbshipit-source-id: a56e19c1ee13c2f6e2750c2483202c1eea3b558a
2020-07-24 14:19:23 -07:00
d4736ef95f Add done() API to Future (#42013)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42013

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D22729596

Pulled By: mrshenli

fbshipit-source-id: ed31021a35af6e2c3393b9b14e4572cf51013bc0
2020-07-24 14:13:41 -07:00
890b52e09f Reduce instability in runCleanUpPasses by reordering passes. (#41891)
Summary:
Currently constant pooling runs before const propagation, which can create more constants that need pooling. This can get in the way of serialization/deserialization stability because each time a user serializes and deserializes a module, runCleanUpPasses is called on it. Doing so multiple times would lead to a different saved module each time.

This PR moves constant pooling after const propagation, which may slow down const propagation a little bit, but would otherwise side-step the aforementioned problem.

test_constant_insertion in test_jit.py is also updated because, after fixing the pass ordering, the number of constants is no longer constant, and it is extremely difficult to get the exact number with the current convoluted test structure. So for now, I changed the test to check only that CSE doesn't change the number of "prim::constant" nodes rather than comparing against a known number. Also left a TODO to improve this test.

The ConstantPropagation pass is replaced by ConstantPropagationImmutableTypes because the latter is used in runCleanUpPasses. If not replaced, the former would create new CSE opportunities by folding more constants, which would defeat the purpose of the test case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41891

Reviewed By: colesbury

Differential Revision: D22701540

Pulled By: gmagogsfm

fbshipit-source-id: 8e60dbdcc54a93dac111d81b8d88fb39387224f5
2020-07-24 11:39:20 -07:00
d904ea5972 [NCCL] DDP communication hook: getFuture() (#41596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41596

We've modified the previous design of `convert_dist_work_to_future` API in the GH Issue [#39272](https://github.com/pytorch/pytorch/issues/39272).

1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object.
2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`.
3. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.
4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function.

`cudaStreamAddCallback` is marked as deprecated. An alternative is `cudaLaunchHostFunc`, but it is supported for CUDA > 10 and may not be deprecated until there's a reasonable alternative available according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr).
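
A Python-side sketch of the API shape described above (method and callback names are assumed from the C++ signature; illustrative only, and requires an initialized NCCL process group):

```python
import torch
import torch.distributed as dist

def allreduce_async(pg, t):
    # `pg` is an initialized NCCL ProcessGroup, `t` a CUDA tensor.
    work = pg.allreduce([t])
    fut = work.get_future()  # new: c10d Work -> Future
    return fut.then(lambda f: print("allreduce done:", f.value()))
```
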
ghstack-source-id: 108409748

Test Plan:
Run old  python test/distributed/test_c10d.py.
Some additional tests:
`test_ddp_comm_hook_allreduce_hook_nccl`: This unit test verifies whether a DDP communication hook that just calls allreduce gives the same result as the case of no hook registered. Without the then callback, the future_value in reducer is no longer a PyObject, and this unit test verifies future_value is properly checked.
`test_ddp_comm_hook_allreduce_then_mult_ten_hook_nccl`: This unit test verifies whether a DDP communication hook that calls allreduce and then multiplies the result by ten gives the expected result.

As of v10:
```
........................s.....s.....................................................s...............................
----------------------------------------------------------------------
Ran 116 tests

OK (skipped=3)
```
`flow-cli` performance validation using a stacked diff where `bucket.work` is completely replaced with `bucket.future_work` in `reducer`. See PR [#41840](https://github.com/pytorch/pytorch/pull/41840) [D22660198](https://www.internalfb.com/intern/diff/D22660198/).

Reviewed By: izdeby

Differential Revision: D22583690

fbshipit-source-id: 8c059745261d68d543eaf21a5700e64826e8d94a
2020-07-24 11:22:44 -07:00
2e95b29988 restore at::Half support for caffe2 SumOp (#41952)
Summary:
PR https://github.com/pytorch/pytorch/issues/40379 added long support but removed at::Half support.  Restore at::Half support.

CC ezyang xw285cornell neha26shah

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41952

Reviewed By: colesbury

Differential Revision: D22720656

Pulled By: xw285cornell

fbshipit-source-id: be83ca7fe51fc43d81bc0685a3b658353d42f8ea
2020-07-24 10:49:06 -07:00
e9e6cc8c83 Added Prehook option to prepare method (#41863)
Summary:
Added logic so that if a prehook is passed into the prepare method during quantization, the hook will be added as a prehook to all leaf nodes (and to modules specified in the non_leaf_module_list).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41863

Test Plan:
Small demo, made simple module then called prepare with prehook parameter set to the numeric suite logger, printed the results to verify its what we wanted
{F245156246}

Reviewed By: jerryzh168

Differential Revision: D22671288

Pulled By: edmundw314

fbshipit-source-id: ce65a00830ff03360a82c0a075b3b6d8cbc4362e
2020-07-24 10:26:39 -07:00
1b55e2b043 add prefetch_factor for multiprocessing prefetching process (#41130)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40604
Adds a parameter to DataLoader to configure the per-worker prefetch count.
Before this edit, the prefetching process always prefetched 2 * num_workers data items; this commit makes that configurable, e.g. you can specify prefetching 10 * num_workers data items.
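
A usage sketch (keyword name as added by this PR):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(100))
# Each of the 4 workers keeps 10 batches in flight instead of the default 2.
loader = DataLoader(ds, batch_size=8, num_workers=4, prefetch_factor=10)
```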

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41130

Reviewed By: izdeby

Differential Revision: D22705288

Pulled By: albanD

fbshipit-source-id: 2c483fce409735fef1351eb5aa0b033f8e596561
2020-07-24 08:38:13 -07:00
79cdd84c81 Downloading different sccache binary in case of ROCm build (#41958)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41958

Reviewed By: colesbury

Differential Revision: D22717509

Pulled By: malfet

fbshipit-source-id: 96c94512f12193fa549ec84cd51f17978f221bc6
2020-07-24 08:04:25 -07:00
c0bfa45f9d Enable typechecking for torch.futures (#41675)
Summary:
Add typing declarations for torch._C.Future and torch._C._collect_all

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41675

Reviewed By: izdeby

Differential Revision: D22627539

Pulled By: malfet

fbshipit-source-id: 29b87685d65dd24ee2094bae8a84a0fe3787e7f8
2020-07-23 23:06:45 -07:00
750d9dea49 move min/max tests to TestTorchDeviceType (#41908)
Summary:
so that testing _min_max on the different devices is easier, and min/max operations have better CUDA test coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41908

Reviewed By: mruberry

Differential Revision: D22697032

Pulled By: ngimel

fbshipit-source-id: a796638fdbed8cda90a23f7ff4ee167f45530914
2020-07-23 22:49:30 -07:00
6a8c9f601f Removed whitelist references from test/backward_compatibility/check_b… (#41691)
Summary:
Removed whitelist reference
Fixes https://github.com/pytorch/pytorch/issues/41733.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41691

Reviewed By: houseroad

Differential Revision: D22641467

Pulled By: SplitInfinity

fbshipit-source-id: 72899b7410d4fc8454d87ca0c042f1ede7cf73de
2020-07-23 21:36:14 -07:00
e42eab4b1c Update PULL_REQUEST_TEMPLATE.md (#41812)
Summary:
**Summary**
This commit updates the repository's pull request template to remind contributors to tag the issue that their pull request addresses.

**Fixes**
This commit fixes https://github.com/pytorch/pytorch/issues/35319.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41812

Reviewed By: gmagogsfm

Differential Revision: D22667902

Pulled By: SplitInfinity

fbshipit-source-id: cda5ff7cbbbfeb89c589fd0dfd378bf73a59d77b
2020-07-23 21:30:43 -07:00
2da69081d7 Fix one error message format of torch.dot() (#41963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41963

The error message of dot (CUDA) was copied from dot (CPU); however, both are confusing.

Test Plan: wait for unittests

Reviewed By: ngimel

Differential Revision: D22710822

fbshipit-source-id: 565b51149ff4bee567ef0775e3f8828579565f8a
2020-07-23 20:47:11 -07:00
f00a37dd71 Make setup.py Python-2 syntactically correct (#41960)
Summary:
Import __future__ to make `print(*args)` a syntactically correct statement under Python-2
Otherwise, if one accidentally invokes setup.py using a Python-2 interpreter, they will be greeted by:
```
  File "setup.py", line 229
    print(*args)
          ^
SyntaxError: invalid syntax
```
instead of:
```
Python 2 has reached end-of-life and is no longer supported by PyTorch.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41960

Reviewed By: orionr, seemethere

Differential Revision: D22710174

Pulled By: malfet

fbshipit-source-id: ffde3ddd585707ba1d39e57e0c6bc9c4c53f8004
2020-07-23 19:10:20 -07:00
97ab33d47c Fix memory leak in XNNPACK/MaxPool2D. (#41874)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41874

Test Plan: Imported from OSS

Reviewed By: ann-ss

Differential Revision: D22699598

Pulled By: AshkanAliabadi

fbshipit-source-id: fec59ed3d5d23bd9197349057fcf2ce56a2b278b
2020-07-23 18:59:53 -07:00
36fb14b68b [quant] Add Graph Mode Passes to quantize EmbeddingBag operators (#41612)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41612

This change adds preliminary support to quantize the EmbeddingBag operators. We currently support 4-bit and 8-bit quantization+packing of the weights.

To quantize these operators, specify the operator name in the `custom_op_name` field of the NoopObserver. Based on the op name (4bit or 8bit) we call the corresponding quantization functions.
Refer to the testplan for how to invoke the qconfig for the embedding_bag ops.

Future versions of this will support 4-bit and 2-bit qtensors with native support to observe and quantize it.

NB - This version assumes that the weights in the EmbeddingBag Module reside on the same device.

Test Plan:
python test/test_quantization.py TestQuantizeDynamicJitOps.test_embedding_bag

Imported from OSS

Reviewed By: vkuzo, jerryzh168

Differential Revision: D22609342

fbshipit-source-id: 23e33f44a451c26719e6e283e87fbf09b584c0e6
2020-07-23 18:54:59 -07:00
401ac2dd39 Replaced whitelisted with allowed (#41867)
Summary:
Closes https://github.com/pytorch/pytorch/issues/41746
Closes https://github.com/pytorch/pytorch/issues/41745

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41867

Reviewed By: izdeby

Differential Revision: D22703533

Pulled By: mrshenli

fbshipit-source-id: 915895463a92e18f36db93b8884d9fd432c0997d
2020-07-23 16:53:51 -07:00
a1cfcd4d22 Change whitelist to another context in binary_smoketest.py (#41822)
Summary:
Fix https://github.com/pytorch/pytorch/issues/41740

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41822

Reviewed By: izdeby

Differential Revision: D22703682

Pulled By: mrshenli

fbshipit-source-id: 1df82fd43890142dfd261eb7bf49dbd128295e03
2020-07-23 16:14:54 -07:00
b6690eb29a Might be good for newcomers to read what N means (#41851)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41851

Reviewed By: izdeby

Differential Revision: D22703602

Pulled By: mrshenli

fbshipit-source-id: 44905f43cdf53b38e383347e5002a28c9363a446
2020-07-23 16:10:38 -07:00
7646f3c77f Fix type annotation for CosineAnnealingLR (#41866)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41866

Reviewed By: izdeby

Differential Revision: D22703576

Pulled By: mrshenli

fbshipit-source-id: 10a0f593ffaaae82a2923a42815c36793a9043d5
2020-07-23 15:56:50 -07:00
cyy
c5fdcd85c7 check pruned attributes before deleting (#41913)
Summary:
I copied a pruned model after deleting the derived tensors. In order to be able to re-parameterize the model, we should check the existence of the tensors here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41913

Reviewed By: izdeby

Differential Revision: D22703248

Pulled By: mrshenli

fbshipit-source-id: f5274d2c634a4c9a038100d8a6e837f132eabd34
2020-07-23 15:56:48 -07:00
183b43f323 Clarify Python 3.5 is the minimum supported version in the installation section. (#41937)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41937

Reviewed By: izdeby

Differential Revision: D22702924

Pulled By: mrshenli

fbshipit-source-id: 67306435e80f80236b585f1d5406444daec782d6
2020-07-23 15:54:56 -07:00
a4b831a86a Replace if(NOT ${var}) by if(NOT var) (#41924)
Summary:
As explained in https://github.com/pytorch/pytorch/issues/41922, using `if(NOT ${var})` is usually wrong and can lead to issues like https://github.com/pytorch/pytorch/issues/41922, where the condition is wrongly evaluated to FALSE instead of TRUE. Instead, the unevaluated variable name should be used in all cases; see the CMake documentation for details.

This fixes the `NOT ${var}` cases using a simple regexp replacement. It seems `pybind11_PREFER_third_party` is the only variable really prone to causing an issue, as all the others are set. However, because CMake evaluates unquoted strings in `if` conditions as variable names, I recommend never using an unquoted `${var}` in an `if` condition. A similar regexp-based replacement could be done on the whole codebase, but as that makes a lot of changes I didn't include it now. Also, `if(${var})` will likely lead to a parser error if `var` is unset, instead of a wrong result.

Fixes https://github.com/pytorch/pytorch/issues/41922

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41924

Reviewed By: seemethere

Differential Revision: D22700229

Pulled By: mrshenli

fbshipit-source-id: e2b3466039e4312887543c2e988270547a91c439
2020-07-23 15:49:20 -07:00
dbe6bfbd7e Revert D22496604: NCCL Backend support for torch.bool
Test Plan: revert-hammer

Differential Revision:
D22496604 (3626473105)

Original commit changeset: a1a15381ec41

fbshipit-source-id: 693c2f9fd1df568508cbcf8c734c092cec3b0a72
2020-07-23 15:33:58 -07:00
b898bdd4d3 [JIT] Don't re run CSE on every block (#41479)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41479

Previously we were re-running CSE every time we recursed into a new block, which in turn created a new Alias Db for the whole graph. This was O(# Nodes * # Blocks).

For graphs which don't have any autodiff opportunities, such as Densenet, create_autodiff_subgraphs is now linear in the number of nodes. For Densenet this pass was measured at ~0.1 seconds.

This pass is still non-linear for models which actually do create autodiff subgraphs, because in the
```
      bool any_changed = true;
      while (any_changed) {
        AliasDb aliasDb(graph_);
        any_changed = false;
        for (auto it = workblock.end()->reverseIterator();
             it != workblock.begin()->reverseIterator();) {
          bool changed;
          std::tie(it, changed) = scanNode(*it, aliasDb);
          any_changed |= changed;
        }
      }
```
loop we recreate the AliasDb (which is O(N)) every time we merge something and scanNode returns. I will make that linear in the next PR in the stack.

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D22600606

Pulled By: eellison

fbshipit-source-id: b08abfde2df474f168104c5b477352362e0b7b16
2020-07-23 14:50:04 -07:00
25b6e2e5ee [JIT] optimize autodiff subgraph slicing (#41437)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41437

[copied from commented code]
The IR has many nodes which can never be reordered around, such as a
prim::Bailout. If a node N is surrounded by two nodes A and B which cannot be
reordered, then a differentiable subgraph created from N can only contain
nodes from [A, B]. The nodes from A to B represent one work block for the
subgraph slicer to work on. By creating these up front, we avoid re-traversing
the whole graph block any time scanNode returns, and we can also avoid
attempting to create differentiable subgraphs in work blocks that do not
contain a minimum number of differentiable nodes.

This improved the compilation time of densenet (the model with the slowest compilation time we're tracking) from 56s -> 28s, and for mobilenet from 8s -> 6s.

Test Plan: Imported from OSS

Reviewed By: Krovatkin, ZolotukhinM

Differential Revision: D22600607

Pulled By: eellison

fbshipit-source-id: e5ab6ed87bf6820b4e22c86eabafd9d17bf7cedc
2020-07-23 14:49:57 -07:00
da3ff5e473 [JIT] dont count constants in subgraph size (#41436)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41436

Constants are not executed as instructions, so we should ignore them when counting subgraph size, just as we ignore them when counting block size for loop unrolling.

Test Plan: Imported from OSS

Reviewed By: Krovatkin, ZolotukhinM

Differential Revision: D22600608

Pulled By: eellison

fbshipit-source-id: 9770b21c936144a3d6a1df89cf3be5911095187e
2020-07-23 14:48:25 -07:00
dfe7d27d0e implement lite parameter serializer (#41403)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41403

Test Plan: Imported from OSS

Reviewed By: kwanmacher

Differential Revision: D22611633

Pulled By: ann-ss

fbshipit-source-id: b391e8c96234b2e69f350119a11f688e920c7817
2020-07-23 14:25:44 -07:00
b85df3709a Add __main__ entrypoint to test_futures.py (#41826)
Summary:
Per comment in run_test.py, every test module must have a __main__ entrypoint:
60e2baf5e0/test/run_test.py (L237-L238)
Also disable test_wait_all on Windows, as it fails with an uncaught exception:
```
  test_wait_all (__main__.TestFuture) ... Traceback (most recent call last):
  File "run_test.py", line 744, in <module>
    main()
  File "run_test.py", line 733, in main
    raise RuntimeError(err)
RuntimeError: test_futures failed!
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41826

Reviewed By: seemethere, izdeby

Differential Revision: D22654899

Pulled By: malfet

fbshipit-source-id: ab7fdd7adce3f32c53034762ae37cf35ce08cafc
2020-07-23 12:56:03 -07:00
3626473105 NCCL Backend support for torch.bool (#41318)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41318

Closes https://github.com/pytorch/pytorch/issues/24137.

This PR adds support for the `torch.bool` tensor type to ProcessGroupNCCL. For most types we use the existing mapping, but since `bool` is not supported as a native `ncclDataType_t`, we add the following logic:
1) Map `at::kBool` to `ncclUint8`
2) During reduction (allreduce, for example), if the operation is SUM, we instead override it to a MAX to avoid overflow issues. The rest of the operations work with no changes. In the boolean case, changing sum to max makes no correctness difference, since both function as a bitwise OR.

The reduction logic (for example for reduce/allreduce) is as follows:
sum, max = bitwise or
product, min = bitwise and

Tests are added to ensure that the reductions work as expected.
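
A quick way to see why the SUM-to-MAX override is safe for booleans (a standalone illustration, not the PR's test code): over values in {0, 1}, elementwise max coincides with bitwise OR, and elementwise min with bitwise AND.

```python
import torch

a = torch.tensor([0, 1, 0, 1], dtype=torch.uint8)  # bool is mapped to ncclUint8
b = torch.tensor([0, 0, 1, 1], dtype=torch.uint8)

assert torch.equal(torch.max(a, b), a | b)  # MAX == bitwise OR (saturating SUM)
assert torch.equal(torch.min(a, b), a & b)  # MIN == bitwise AND
assert torch.equal(a * b, a & b)            # PRODUCT == bitwise AND as well
```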
ghstack-source-id: 108315417

Test Plan: Added unittests

Reviewed By: mrshenli

Differential Revision: D22496604

fbshipit-source-id: a1a15381ec41dc59923591885d40d966886ff556
2020-07-23 12:33:39 -07:00
01c406cc22 [pytorch] bump up variable version regardless of differentiability (#41269)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41269

The ultimate goal is to move things that are not gated with `if (compute_requires_grad(...))`
or `if (grad_fn)` out from VariableType so that VariableType kernels can be enabled/disabled
based upon `GradMode`. Then we can merge `AutoNonVariableTypeMode` and `NoGradGuard`.

We've moved profiling / tracing logic out from VariableType. One remaining thing that's
not gated with the if-statement is the `increment_version` call.

However, the `gen_variable_type.py` does use bits from `derivatives.yaml` to determine whether
to emit the `increment_version` call. If an output is never going to be differentiable (based not upon a runtime property of the variable but upon a static property, e.g. it is of integral type), then it would never emit the increment_version call.

Hypothetically, increment_version for a tensor can be orthogonal to its differentiability.

This PR is to make the change and test its impact. Making this logical simplification would
allow us to move this out from VariableType to aten codegen.
ghstack-source-id: 108318746

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D22471643

fbshipit-source-id: 3e3a442c7fd851641eb4a9c4f024d1f5438acdb8
2020-07-23 12:07:32 -07:00
1978188639 Remove two "return"s that return "void" (#41811)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41811

Reviewed By: izdeby

Differential Revision: D22673690

Pulled By: ezyang

fbshipit-source-id: 10d4aff90e2e051116e682fa51fb9494af8482c1
2020-07-23 10:17:29 -07:00
77db93228b Temporary fix for determinant bug on CPU (#35136)
Summary:
Changelog:
- Make diagonal contiguous

Temporarily Fixes https://github.com/pytorch/pytorch/issues/34061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/35136

Reviewed By: izdeby

Differential Revision: D22673153

Pulled By: ezyang

fbshipit-source-id: 850f537483f929fcb43bcdef9d4ec264a7c3d354
2020-07-23 10:12:06 -07:00
17f76f9a78 Verbose param for schedulers that don't have it #38726 (#41580)
Summary:
Verbose param for schedulers that don't have it https://github.com/pytorch/pytorch/issues/38726

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41580

Reviewed By: izdeby

Differential Revision: D22671163

Pulled By: vincentqb

fbshipit-source-id: 53a6c9e929141d411b6846bc25f3fe7f46fdf3be
2020-07-23 09:57:33 -07:00
37e7f0caf6 Fix docstring in Unflatten (#41835)
Summary:
I'd like to amend the docstring introduced in https://github.com/pytorch/pytorch/issues/41564. It's not rendering correctly on the web, and this should fix it.

cc albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41835

Reviewed By: izdeby

Differential Revision: D22672368

Pulled By: albanD

fbshipit-source-id: f0b03c2b2a4c79b790d54f7c8f2ae28ef9d76a75
2020-07-23 09:55:11 -07:00
fab1795577 move benchmark utils into torch namespace (#41506)
Summary:
Move the timing utils to `torch.utils._benchmark`. I couldn't figure out how to get setuptools to pick it up and put it under `torch` unless it is in the `torch` directory. (And I think it has to be for `setup.py develop` anyway.)

I also modified the record function benchmark since `Timer` and `Compare` should always be available now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41506

Reviewed By: ngimel

Differential Revision: D22601460

Pulled By: robieta

fbshipit-source-id: 9cea7ff1dcb0bb6922c15b99dd64833d9631c37b
2020-07-23 09:48:39 -07:00
266657182a Add torch.movedim (#41480)
Summary:
https://github.com/pytorch/pytorch/issues/38349 #36048

TODO:
* [x] Tests
* [x] Docs
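
For newcomers, a tiny standalone usage sketch of the new operator (it mirrors numpy.moveaxis):

```python
import torch

x = torch.zeros(2, 3, 4)
y = torch.movedim(x, 0, -1)  # move dim 0 (size 2) to the last position
assert y.shape == (3, 4, 2)
```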

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41480

Reviewed By: zhangguanheng66

Differential Revision: D22649917

Pulled By: zou3519

fbshipit-source-id: a7f3920a24bae16ecf2ad731698ca65ca3e8c1ce
2020-07-23 09:41:01 -07:00
c0e3839845 fix #36801 (#41607)
Summary:
unittest actually writes the test name to stdout (e.g. `test_accumulate_grad (__main__.TestAutograd) ...`) before the test starts running. Exporting PYTHONUNBUFFERED=1 or running with `python -u` makes this message get recorded. ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41607

Reviewed By: izdeby

Differential Revision: D22673930

Pulled By: ezyang

fbshipit-source-id: 18512b6f5f80485c2b0d812f2ebdecc1fdc4b4ec
2020-07-23 09:32:46 -07:00
272fb3635f Add regression test for ONNX exports of modules that embed an Embedding layer inside a Sequential (#32598)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/19227

This PR adds a regression test for ONNX exports where a module has a sequential that references an Embedding layer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/32598

Reviewed By: izdeby

Differential Revision: D22672790

Pulled By: ezyang

fbshipit-source-id: c88beb29a36b07378c28b0e4546efe887fcbc3be
2020-07-23 09:32:44 -07:00
e831299bae Fix typing error of torch/optim/lr_scheduler.pyi (#41775)
Summary:
* add `_LRScheduler.get_last_lr` type stub.
* remove `CosineAnnealingWarmRestarts.step` because its signature is the same as `_LRScheduler`'s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41775

Reviewed By: izdeby

Differential Revision: D22649350

Pulled By: vincentqb

fbshipit-source-id: 5355dd062a5af437f4fc153244dda793a2382e7e
2020-07-23 09:30:32 -07:00
4b4273a04e Update Adam documentation (#41679)
Summary:
This PR fixes https://github.com/pytorch/pytorch/issues/41477

Adam implementation is doing L2 regularization and not decoupled weight decay. However, the change mentioned in https://github.com/pytorch/pytorch/issues/41477 was motivated by Line 12 of algorithm 2 in [Decoupled Weight Decay Regularization](https://arxiv.org/pdf/1711.05101.pdf) paper.
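
The distinction, condensed into a runnable toy update (my own illustration, not code from the PR): L2 regularization adds the decay to the gradient before Adam's adaptive scaling, whereas decoupled weight decay subtracts it from the weights directly.

```python
import torch

p, g = torch.ones(1), torch.zeros(1)
lr, wd = 0.1, 0.01

# L2 regularization (what torch.optim.Adam implements): the decay enters
# through the gradient and is later rescaled by Adam's adaptive denominator.
g_l2 = g + wd * p

# Decoupled weight decay (AdamW, algorithm 2 line 12): the decay is applied
# to the weights directly and bypasses the adaptive machinery entirely.
p_aw = p - lr * wd * p
```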

Please let me know if you have other suggestions about how to deliver this info in the docs.
cc ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41679

Reviewed By: izdeby

Differential Revision: D22671329

Pulled By: vincentqb

fbshipit-source-id: 2caf60e4f62fe31f29aa35a9532d1c6895a24224
2020-07-23 09:25:41 -07:00
30ce7b3740 Fix bug when compiling with caffe2 (#41868)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41868

Fix bug when compiling with caffe2

Reviewed By: jianyuh

Differential Revision: D22670707

fbshipit-source-id: aa654d7b9004257e0288c8ae8819ca5752eea443
2020-07-23 09:11:05 -07:00
0ec7ba4088 [iOS] Bump up the cocoapods version (#41895)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41895

### Summary

The iOS binary for 1.6.0 has been uploaded to AWS. This PR bumps up the version for cocoapods.

### Test Plan

- Check CI

Test Plan: Imported from OSS

Reviewed By: husthyc

Differential Revision: D22683787

Pulled By: xta0

fbshipit-source-id: bb95b670a7945d823d55e9c65b357765753f295a
2020-07-22 22:03:40 -07:00
2a3ab71f28 [quant][graphmode][fix] Remove useQuantizable check for dynamic quant (#41892)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41892

Currently the input of batch_norm is considered dynamically quantizable, but it shouldn't be; this PR fixes that.

Test Plan:
internal models

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D22681423

fbshipit-source-id: 7f428751de0c4af0a811b9c952e1d01afda42d85
2020-07-22 21:06:48 -07:00
ca3ba1095e Do not chown files inside docker for pytorch-job-tests (#41884)
Summary:
They are already owned by `jenkins` user after the build

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41884

Reviewed By: orionr

Differential Revision: D22682441

Pulled By: malfet

fbshipit-source-id: daf99532d300d30a5de591ad03af4597e145fdfc
2020-07-22 19:53:59 -07:00
586b7f991c Enable skipped tests from test_torch on ROCm (#41611)
Summary:
This pull request enables the following tests from test_torch, previously skipped on ROCm:
test_pow_-2_cuda_float32/float64
test_sum_noncontig_cuda_float64
test_conv_transposed_large

The first two tests experienced precision issues on earlier ROCm versions, whereas the conv_transposed test was hitting a bug in MIOpen which is fixed in the version shipping with ROCm 3.5.

ezyang jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41611

Reviewed By: xw285cornell

Differential Revision: D22672690

Pulled By: ezyang

fbshipit-source-id: 5585387c048f301a483c4c0566eb9665555ef874
2020-07-22 19:49:17 -07:00
7fefa46820 scatter/gather - check that inputs are of the same dimensionality (#41672)
Summary:
As per title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41672

Reviewed By: malfet, ngimel

Differential Revision: D22678302

Pulled By: gchanan

fbshipit-source-id: 95a1bde81e660b8963e5914d5348fd4fbff1338e
2020-07-22 18:51:51 -07:00
b40ef422d3 .circleci: Separate out docs build from push (#41871)
Summary:
Separates out the docs build from the push and limits when the push
actually happens.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41871

Reviewed By: yns88

Differential Revision: D22673716

Pulled By: seemethere

fbshipit-source-id: fff8b35ba8465dc15832214c4c9ef03ce12faa48
2020-07-22 17:01:24 -07:00
4e16be9073 [MemLeak] Fix memory leak from releasing unique ptr (#41883)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41883

Fix memory leak from releasing unique ptr

Test Plan:
Tested serialization with and without the change.

Heap profile without change:
```
Welcome to jeprof!  For help, type 'help'.
(jeprof) top
Total: 7298.4 MB
  4025.2  55.2%  55.2%   4025.2  55.2% c10::alloc_cpu (inline)
  3195.3  43.8%  98.9%   3195.3  43.8% caffe2::SerializeUsingBytesOrInt32
    63.6   0.9%  99.8%     63.6   0.9% __gnu_cxx::new_allocator::allocate (inline)
     5.0   0.1%  99.9%      5.0   0.1% google::protobuf::RepeatedField::Reserve
     2.5   0.0%  99.9%      2.5   0.0% folly::aligned_malloc (inline)
     1.2   0.0%  99.9%      1.2   0.0% caffe2::detail::CopyFromProtoWithCast (inline)
     1.0   0.0%  99.9%      1.0   0.0% __new_exitfn
     1.0   0.0% 100.0%      1.0   0.0% std::_Function_base::_Base_manager::_M_init_functor (inline)
     0.5   0.0% 100.0%      0.5   0.0% folly::HHWheelTimerBase::newTimer (inline)
     0.5   0.0% 100.0%      0.5   0.0% std::__detail::_Hashtable_alloc::_M_allocate_node
```

Heap profile with change:
```
Welcome to jeprof!  For help, type 'help'.
(jeprof) top
Total: 6689.2 MB
  4025.2  60.2%  60.2%   4025.2  60.2% c10::alloc_cpu (inline)
  2560.0  38.3%  98.4%   2560.0  38.3% caffe2::::HugePagesArena::alloc_huge (inline)
    90.9   1.4%  99.8%     90.9   1.4% __gnu_cxx::new_allocator::allocate (inline)
     5.0   0.1%  99.9%      5.0   0.1% google::protobuf::RepeatedField::Reserve
     2.0   0.0%  99.9%      2.0   0.0% prof_backtrace_impl (inline)
     1.0   0.0%  99.9%     20.3   0.3% std::__cxx11::basic_string::_M_construct (inline)
     1.0   0.0%  99.9%      1.0   0.0% std::_Function_base::_Base_manager::_M_init_functor (inline)
     0.5   0.0%  99.9%      0.5   0.0% folly::UnboundedQueue::allocNextSegment (inline)
     0.5   0.0% 100.0%      0.5   0.0% folly::aligned_malloc (inline)
     0.5   0.0% 100.0%      0.5   0.0% __new_exitfn
```

Reviewed By: yinghai

Differential Revision: D22662093

fbshipit-source-id: d0b8ff1ed26c72b14bb02fb1146c51ef11a7e519
2020-07-22 16:54:19 -07:00
dbc6a2904b [quant][graphmode][fix] Remove assert for uses == 1 in remove dequantize pass (#41859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41859

A value can be used multiple times in the same node, so we don't really need to assert that a dequantize has exactly one use.

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D22673525

fbshipit-source-id: 2c4a770e0ddee722ca54e68d310c395e7f418b3b
2020-07-22 15:58:11 -07:00
dfa914a90c Modify lazy_dyndep loading to trigger inside workspace. (#41687)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41687

Specifically, this makes a new library (lazy), which can be used from both core
and workspace.

This allows workspace.CreateNet to trigger lazy loading of dyndep dependencies.

Test Plan: Added a unit test specifically for workspace.CreateNet

Reviewed By: dzhulgakov

Differential Revision: D22441877

fbshipit-source-id: 3a9d1af9962585d08ea2566c9c85bec7377d39f2
2020-07-22 15:36:43 -07:00
af5d0bff00 [ONNX] Add pass that fuses Conv and BatchNormalization (#40547)
Summary:
Add pass that fuses Conv and Batchnormalization nodes into one node Conv.
This pass is only applied in inference mode (training is None or TrainingMode.Eval).
Since this pass needs access to param_dict, it is written outside the peephole file where these kinds of passes (fusing multiple nodes into one) are usually placed.

This PR also adds a wrapper, skipIfNoEmbed, to skip the debug_embed_params test:
the pass that fuses Conv and Batchnorm changes the params of the resnet model, so the parameters of the onnx and pytorch models won't match. Since the parameters don't match, the debug_embed_params test for test_resnet will fail, and that is expected; therefore the debug_embed_params test for test_resnet should be skipped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40547

Reviewed By: gchanan

Differential Revision: D22631687

Pulled By: bzinodev

fbshipit-source-id: fe45812400398a32541e797f727fd8697eb6d8c0
2020-07-22 14:59:27 -07:00
ad7133d3c1 Patch for #40026 RandomSampler generates samples one at a time when replacement=True (#41682)
Summary:
Fix https://github.com/pytorch/pytorch/issues/32530
Fix/Patch https://github.com/pytorch/pytorch/pull/40026

Resubmit this patch and fix the type error.

Force the input type to `manual_seed()` in `sampler.py` to be `int`.
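
A minimal sketch of the kind of coercion described (variable names are illustrative): a seed derived from a tensor must be converted to a Python int before being handed to `manual_seed()`.

```python
import torch

seed_tensor = torch.empty((), dtype=torch.int64).random_()
generator = torch.Generator()
generator.manual_seed(int(seed_tensor.item()))  # force the seed to be an int
```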

ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41682

Reviewed By: izdeby

Differential Revision: D22665477

Pulled By: ezyang

fbshipit-source-id: 1725c8aa742c31e74321f20448f4b6a392afb38d
2020-07-22 13:45:09 -07:00
2d15b39745 [Onnxifi] Support running with quantized int8 inputs (#41820)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41820

Pull Request resolved: https://github.com/pytorch/glow/pull/4721

In order to support int8 quantized tensor as an input to OnnxifiOp, we need to
- Add support to recognize and extract shape meta from an int8 tensor at the input of OnnxifiOp
- Make a copy of the input data and shift it by 128 in Glow if the input is a uint8 quantized tensor, to get the correct result, because Glow uses int8 to represent quantized data regardless.
- Propagate correct quantization parameters through shape info in C2.

This diff implements the above.
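
The shift is value-preserving because the data and the zero point move together (a worked check, not code from the diff):

```python
scale, zp_u8, q_u8 = 0.1, 128, 200     # uint8 quantized value and zero point
q_i8, zp_i8 = q_u8 - 128, zp_u8 - 128  # Glow's int8 view after the shift
assert scale * (q_u8 - zp_u8) == scale * (q_i8 - zp_i8)  # same real value
```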

Test Plan:
```
buck test caffe2/caffe2/contrib/fakelowp/test:test_int8_quantnnpi
```

Reviewed By: jackm321

Differential Revision: D22650584

fbshipit-source-id: 5e867f7ec7ce98bb066ec4128ceb7cad321b3392
2020-07-22 13:42:34 -07:00
47c57e8804 rename TestFuser to TestTEFuser (#41542)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41542

Reviewed By: jamesr66a

Differential Revision: D22579606

Pulled By: Krovatkin

fbshipit-source-id: f65b2cae996b42d55ef864bc0b424d9d43d8a2e2
2020-07-22 13:37:27 -07:00
6ceb65f98c Document default dim for cross being None (#41850)
Summary:
The function torch.cross is a bit confusing, in particular the defaulting of the dim argument.

The default `dim` has been documented as -1, but it is actually `None`. This adds to the confusion in two possible ways, depending on how carefully you read the rest. I also add a warning to the final sentence.

This partially addresses https://github.com/pytorch/pytorch/issues/39310.
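
A short standalone demonstration of the documented behavior (not from the PR):

```python
import torch

a, b = torch.randn(3, 4), torch.randn(3, 4)
# dim=None (the real default) searches for the first dimension of size 3:
torch.cross(a, b)            # OK: uses dim 0
# torch.cross(a, b, dim=-1)  # would raise: the last dimension is not of size 3
```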

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41850

Reviewed By: izdeby

Differential Revision: D22664625

Pulled By: albanD

fbshipit-source-id: b8669e026fd01de9e4ec16da1414b9edfaa76bdd
2020-07-22 13:31:47 -07:00
b80ffd44b0 Revert D20781624: Add NCCL Alltoall to PT NCCL process group
Test Plan: revert-hammer

Differential Revision:
D20781624 (b87f0e5085)

Original commit changeset: 109436583ff6

fbshipit-source-id: 03f6ee4d56baea93a1cf795d26dd92b7d6d1df28
2020-07-22 13:22:17 -07:00
ec683299eb Reland Add non-deterministic alert to CUDA operations that use atomicAdd() (#41538)
Summary:
Reland PR https://github.com/pytorch/pytorch/issues/40056

A new overload of upsample_linear1d_backward_cuda was added in a recent commit, so I had to add the nondeterministic alert to it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41538

Reviewed By: zou3519

Differential Revision: D22608376

Pulled By: ezyang

fbshipit-source-id: 54a2aa127e069197471f1feede6ad8f8dc6a2f82
2020-07-22 13:12:29 -07:00
aa91a65b59 [TensorExpr] Fix propagation of loop options when splitting loops (#40035)
Summary:
Fix a bug in SplitWithTail and SplitWithMask where loop_options such as Cuda block/thread bindings are overwritten by the split. This PR fixes this bug by propagating the loop options to the outer loop, which for axis bindings should be equivalent.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40035

Reviewed By: ZolotukhinM

Differential Revision: D22080263

Pulled By: nickgg

fbshipit-source-id: b8a9583fd90f69319fc4bb4db644e91f6ffa8e67
2020-07-22 11:49:07 -07:00
9c7ca89ae6 Conda build (#38796)
Summary:
closes gh-37584. ~I think I need to do more to generate an image, but the `.circleci/README.md` is vague in the details. The first commit reflows and updates that document a bit, I will continue to update it as the PR progresses :)~ Dropped updating `.circleci/README.md`, will do that in a separate PR once this is merged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/38796

Reviewed By: gchanan

Differential Revision: D22627522

Pulled By: ezyang

fbshipit-source-id: 99d5c19e942f15b9fc10f0de425790474a4242ab
2020-07-22 11:42:39 -07:00
61511aa1d6 Remove zmath_std.h (#39835)
Summary:
std::complex is gone

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39835

Reviewed By: gchanan

Differential Revision: D22639834

Pulled By: anjali411

fbshipit-source-id: 57da43d4e6c82261b1f9e5b876f1bbbdf9ae56ca
2020-07-22 11:08:17 -07:00
ca68dc7fa2 replace std::clamp with shim (#41855)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41855

replace std::clamp with shim

Test Plan: test_op_nnpi_fp16.py covers the testing.

Reviewed By: hyuen

Differential Revision: D22667645

fbshipit-source-id: 5e7c94b499f381bde73f1984a6f0d01fb962a671
2020-07-22 11:06:36 -07:00
b87f0e5085 Add NCCL Alltoall to PT NCCL process group (#39984)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39984

Add Alltoall and Alltoallv to PT NCCL process group using NCCL Send/Recv.

Reviewed By: jiayisuse

Differential Revision: D20781624

fbshipit-source-id: 109436583ff69a3fea089703d32cfc5a75f973e0
2020-07-22 10:55:51 -07:00
2da8c8df08 [quant] Rename from quantized... to ...quantized_cpu in the native_functions.yaml (#41071)
Summary:
Issue https://github.com/pytorch/pytorch/issues/40315

Rename from `quantized...` to `...quantized_cpu` in the native_functions.yaml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41071

Reviewed By: z-a-f

Differential Revision: D22487087

Pulled By: jerryzh168

fbshipit-source-id: f0d12907967739794839c1ffea44e78957f50b9b
2020-07-22 10:45:41 -07:00
f03156f9df replace blacklist in caffe2/python/onnx/frontend.py (#41777)
Summary:
Close https://github.com/pytorch/pytorch/issues/41712

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41777

Reviewed By: izdeby

Differential Revision: D22648532

Pulled By: yinghai

fbshipit-source-id: 7f4c9f313e2887e70bb4eb1ab037aea6b549cec7
2020-07-22 10:02:16 -07:00
5152633258 [ROCm] update hip library name (#41813)
Summary:
With transition to hipclang, the HIP runtime library name was changed.  A symlink was added to ease the transition, but is going to be removed.  Conditionally set library name based on HIP compiler used.  Patch gloo submodule as part of build_amd.py script until its associated fix is available.

CC ezyang xw285cornell sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41813

Reviewed By: zhangguanheng66

Differential Revision: D22660077

Pulled By: xw285cornell

fbshipit-source-id: c538129268d9947535b34523201f655b13c9e0a3
2020-07-22 09:42:45 -07:00
9fbcfe848b Automated submodule update: FBGEMM (#41814)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 139c6f2292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41814

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: dskhudia

Differential Revision: D22648844

fbshipit-source-id: 4cfa8d83585407f870ea2bdee74e1c1f371082eb
2020-07-22 09:38:15 -07:00
71aad6ea66 Revert "port masked_select from TH to ATen and optimize perf on CPU (#33269)" (#41828)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41828

This reverts commit fe66bdb498efe912d8b9c437a14efa4295c04fdd.

This also requires a change to THTensorEvenMoreMath because sumall was removed; see THTensor_wrap.

Test Plan: Imported from OSS

Reviewed By: orionr

Differential Revision: D22657473

Pulled By: malfet

fbshipit-source-id: 95a806cedf1a3f4df91e6a21de1678252b117489
2020-07-22 09:28:04 -07:00
fd62847eb2 cross_layer_equalization (#41685)
Summary:
The goal is to implement cross-layer equalization as described in section 4.1 of this paper: https://arxiv.org/pdf/1906.04721.pdf
Given two adjacent submodules A, B in a trained model, quantization might hurt one of the submodules more than the other. The paper poses the idea that a loss in accuracy from quantizing can be due to a difference in the channel ranges between the two submodules (the output channel range of A can be small, while the input channel range of B can be large). To minimize this source of error, we want to scale the tensors of A and B such that their channel ranges are equal, which eliminates this difference and minimizes this source of error.
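
A hedged sketch of the section 4.1 rescaling for two weight matrices (my own condensation of the paper's formula, not code from this PR): with a positively homogeneous activation between A and B, dividing A's output channels by s and multiplying B's matching input channels by s leaves the composed function unchanged, and s = sqrt(rA/rB) makes both per-channel ranges equal to sqrt(rA*rB).

```python
import torch

def equalize(WA, bA, WB):
    # WA: (out, in) weights of submodule A; WB: (out, in) weights of submodule B
    rA = WA.abs().max(dim=1).values  # per-output-channel range of A
    rB = WB.abs().max(dim=0).values  # per-input-channel range of B
    s = torch.sqrt(rA / rB)          # equalizing scale per shared channel
    return WA / s[:, None], bA / s, WB * s[None, :]
```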

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41685

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D22630219

Pulled By: edmundw314

fbshipit-source-id: ccc91ba12c10b652d7275222da8b85455b8a7cd5
2020-07-22 08:39:23 -07:00
fced54aa67 [RPC tests] Fix test_init_(rpc|pg)_then_(rpc|pg) not shutting down RPC (#41558)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41558

The problem was due to non-deterministic destruction order of two global static variables: the mutexes used by glog and the RPC agent (which was still set because we didn't call `rpc.shutdown()`). When the TensorPipe RPC agent shuts down some callbacks may fire with an error and thus attempt to log something. If the mutexes have already been destroyed this causes a SIGABRT.

Fixes https://github.com/pytorch/pytorch/issues/41474
ghstack-source-id: 108231453

Test Plan: Verified in https://github.com/pytorch/pytorch/issues/41474.

Reviewed By: fmassa

Differential Revision: D22582779

fbshipit-source-id: 63e34d8a020c4af996ef079cfb7041b2474e27c9
2020-07-22 06:33:19 -07:00
e17e55831d [pytorch] disable per-op profiling for internal mobile build (#41825)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41825

Add flag to gate D21374246 (e7a09b4d17) to mitigate mobile size regression.
ghstack-source-id: 108212047

Test Plan: CI

Reviewed By: linbinyu

Differential Revision: D22650708

fbshipit-source-id: ac9318af824ac31f519b7d5b4fe72df892d8d3f9
2020-07-22 03:02:21 -07:00
825a387ea2 Fix bug on the backpropagation of LayerNorm when create_graph=True (#41595)
Summary:
Solves issue https://github.com/pytorch/pytorch/issues/41332.

I found that the bug in https://github.com/pytorch/pytorch/issues/41332 is caused by LayerNorm.

Current implementations of LayerNorm have a disparity between
1. [`create_graph=False` CUDA implementation](dde3d5f4a8/aten/src/ATen/native/cuda/layer_norm_kernel.cu (L145))
2. [`create_graph=True` implementation](dde3d5f4a8/tools/autograd/templates/Functions.cpp (L2536))

With this bug-fix, https://github.com/pytorch/pytorch/issues/41332 is solved.
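
A standalone repro sketch of the code path involved (shapes are arbitrary; this is not the PR's test): `create_graph=True` routes the second-order pass through the analytic backward in Functions.cpp rather than the fused kernel, which is where the two implementations diverged.

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 8, requires_grad=True)
y = F.layer_norm(x, (8,))
g, = torch.autograd.grad(y.sum(), x, create_graph=True)  # differentiable grad
g.sum().backward()  # second-order pass exercises the create_graph formula
```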

Ailing BIT-silence

Signed-off-by: Vinnam Kim <vinnamkim@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41595

Reviewed By: houseroad

Differential Revision: D22598415

Pulled By: BIT-silence

fbshipit-source-id: 63e390724bd935dc8e028b4dfb75d34a80558c3a
2020-07-22 00:19:12 -07:00
5c9918e757 Fix row-wise sparse SparseLengthSum and sparse adagrad fused operator (#41818)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41818

Fix row-wise sparse SparseLengthSum and sparse adagrad fused operator

Reviewed By: jianyuh

Differential Revision: D22345013

fbshipit-source-id: 7c2d6c506b404f15a7aa8f1d0ccadb82e515a4c3
2020-07-21 19:32:16 -07:00
a0f2a5625f [quant][graphmode][fix] Make it work with CallMethod on non-Module objects (#41576)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41576

Previously we were assuming CallMethod only happens on module instances, but it turns out this is not true; this PR fixes the issue.

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D22592789

fbshipit-source-id: 48217626d9ea8e82536f00a296b8f9a471582ebe
2020-07-21 19:03:40 -07:00
ce8c7185de Add unittests to Comparison Operator Kernels in BinaryOpsKernel.cpp (#41809)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41809

Add new unittests to Operator Kernels.
Explicitly annotate the function type in tests because it can't be inferred.

Test Plan: CI

Reviewed By: malfet

Differential Revision: D22647221

fbshipit-source-id: ef2f0e8c847841e90aa26d028753f23c8c53d6b0
2020-07-21 18:26:53 -07:00
302e566205 add max_and_min function and cpu kernel to speed up observers (#41570)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41570

For min/max based quantization observers, calculating min and max of a tensor
takes most of the runtime. Since the calculation of min and max is done
on the same tensor, we can speed this up by only reading the tensor
once, and reducing with two outputs.
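
The read-once idea in miniature (a pure-Python illustration; the PR itself adds a fused C++ CPU kernel, and the helper name below is mine):

```python
import torch

def min_and_max_single_pass(t):
    values = t.flatten().tolist()
    lo = hi = values[0]
    for v in values[1:]:  # one traversal updates both reductions
        if v < lo:
            lo = v
        elif v > hi:
            hi = v
    return lo, hi

x = torch.randn(1000)
assert min_and_max_single_pass(x) == (x.min().item(), x.max().item())
```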

One question I had is whether we should put this into the quantization
namespace, since the use case is pretty specific.

This PR implements the easier CPU path to get an initial validation.
There is some needed additional work in future PRs, which durumu will
take a look at:
* CUDA kernel and tests
* making this work per channel
* benchmarking on observer
* benchmarking impact on QAT overhead

Test Plan:
```
python test/test_torch.py TestTorch.test_min_and_max
```

quick bench (not representative of real world use case):
https://gist.github.com/vkuzo/7fce61c3456dbc488d432430cafd6eca
```
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=1 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.0390) tensor(-5.4485) tensor([-5.4485,  5.0390])
min and max separate 11.90243935585022
min and max combined 6.353186368942261
% decrease 0.466228209277153
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=4 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.5586) tensor(-5.3983) tensor([-5.3983,  5.5586])
min and max separate 3.468616485595703
min and max combined 1.8227086067199707
% decrease 0.4745142294372342
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=8 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.2146) tensor(-5.2858) tensor([-5.2858,  5.2146])
min and max separate 1.5707778930664062
min and max combined 0.8645427227020264
% decrease 0.4496085496757899
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D22589349

fbshipit-source-id: c2e3f1b8b5c75a23372eb6e4c885f842904528ed
2020-07-21 18:16:22 -07:00
9e0c746b15 Augmenting Concrete Observer Constructors to Support Dynamic Quantization Range; Modifying Utility Functions in _LearnableFakeQuantize Module for Better Logging and Baseline Construction. (#41815)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41815

**All are minor changes to enable better simulations.**

The constructors of MinMaxObserver, MovingAverageMinMaxObserver, PerChannelMinMaxObserver, and MovingAveragePerChannelMinMaxObserver are augmented so they can utilize the dynamic quantization range support in the _ObserverBase class.

In addition, minor adjustments are made to the enable_static_observation function, which allows the observer to update parameters but not fake-quantize the output (for constructing baselines).

Test Plan:
To ensure this modification is still backward compatible with past usages, numerics are verified by running the quantization unit test suite, which contains various observer tests. The following command executes the test suite, which also verifies the observer numerics:
```
buck test //caffe2/test:quantization -- observer
```

Reviewed By: z-a-f

Differential Revision: D22649128

fbshipit-source-id: 32393b706f9b69579dc2f644fb4859924d1f3773
2020-07-21 17:59:40 -07:00
60e2baf5e0 [doc] Add LSTM non-deterministic workaround (#40893)
Summary:
Related: https://github.com/pytorch/pytorch/issues/35661

Preview
![image](https://user-images.githubusercontent.com/24860335/86861581-4b4c7100-c07c-11ea-950a-3145bfae9af9.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40893

Reviewed By: vincentqb

Differential Revision: D22535418

Pulled By: ngimel

fbshipit-source-id: f194ddaff8ec6d03a3616c87466e2cbbe7e429a9
2020-07-21 16:20:02 -07:00
941069ca09 [tensorexpr][trivial] Remove debug printing from test (#41806)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41806

Generally a good practice not to have tests spew output.

Test Plan:
`build/bin/test_tensorexpr`

Imported from OSS

Reviewed By: zheng-xq

Differential Revision: D22646833

fbshipit-source-id: 444e883307d058fe77e7550d436fa61b7d91a701
2020-07-21 15:54:31 -07:00
7ffdd765c8 [TensorExpr] more convenient outer Rfactor output (#40050)
Summary:
Automatically fuse the output loops of outer Rfactors, so the result is in a more convenient format for binding GPU axes.

An example:
```
  Tensor* c = Reduce("sum", {}, Sum(), b, {{m, "m"}, {n, "n"}, {k, "k"}});
  LoopNest loop({c});
  std::vector<For*> loops = loop.getLoopStmtsFor(c);
  auto v = loops.at(0)->var();
  loop.rfactor(c->body(), v);
```
Before:
```
{
  Allocate(tmp_buf, float, {m});
  sum[0] = 0.f;
  for (int m_1 = 0; m_1 < m; m_1++) {
    tmp_buf[m_1] = 0.f;
  }
  for (int m_1 = 0; m_1 < m; m_1++) {
    for (int n = 0; n < n_1; n++) {
      for (int k = 0; k < k_1; k++) {
        tmp_buf[m_1] = (tmp_buf[m_1]) + (b[((n_1 * m_1) * k_1 + k) + k_1 * n]);
      }
    }
  }
  for (int m_1 = 0; m_1 < m; m_1++) {
    sum[0] = (sum[0]) + (tmp_buf[m_1]);
  }
  Free(tmp_buf);
}
```

After:
```
{
  sum[0] = 0.f;
  for (int m = 0; m < m_1; m++) {
    Allocate(tmp_buf, float, {m_1});
    tmp_buf[m] = 0.f;
    for (int n = 0; n < n_1; n++) {
      for (int k = 0; k < k_1; k++) {
        tmp_buf[m] = (tmp_buf[m]) + (b[((n_1 * m) * k_1 + k) + k_1 * n]);
      }
    }
    sum[0] = (sum[0]) + (tmp_buf[m]);
    Free(tmp_buf);
  }
}
```

The existing Rfactor tests cover this case, although I did rename a few for clarity. This change broke the LLVMRFactorVectorizedReduction test because it now does what it's intending to (vectorize a loop with a reduction in it) rather than nothing, and since that doesn't work it correctly fails. I've disabled it for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40050

Reviewed By: ZolotukhinM

Differential Revision: D22605639

Pulled By: nickgg

fbshipit-source-id: e359be53ea62d9106901cfbbc42d55d0e300e8e0
2020-07-21 14:44:26 -07:00
dac393fa24 [PT] enforce duplicate op name check on mobile
Summary: Enforce duplicate op name check on mobile

Test Plan: run full/lite predictor

Reviewed By: iseeyuan

Differential Revision: D22639758

fbshipit-source-id: 2993c4bc1b14c833b273183f4f343ffad62121b3
2020-07-21 13:14:17 -07:00
62f4f87914 Removed whitelist reference from tools/clang_format_ci.sh (#41636)
Summary:
Removed whitelist and blacklist references
Fixes https://github.com/pytorch/pytorch/issues/41753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41636

Reviewed By: SplitInfinity

Differential Revision: D22648632

Pulled By: suo

fbshipit-source-id: d22130a7cef96274f3fc73d00b50327dfcae332c
2020-07-21 12:32:14 -07:00
1ad7160a59 fix backward compat (#41810)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41810

Reviewed By: malfet

Differential Revision: D22647763

Pulled By: albanD

fbshipit-source-id: 8ce70ecb706bb98ed24b0b3e7e9ebf3d4c270964
2020-07-21 12:14:55 -07:00
03186a86d9 Add test dependencies to CONTRIBUTING.md (#41799)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41799

Reviewed By: zhangguanheng66

Differential Revision: D22645323

Pulled By: zou3519

fbshipit-source-id: 0a695bffb57b29024461472dd1c8518a9a0d1d3b
2020-07-21 11:29:38 -07:00
341c4045df replaced blacklist with blocklist in test/test_type_hints.py (#41644)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41719.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41644

Reviewed By: zhangguanheng66

Differential Revision: D22645479

Pulled By: zou3519

fbshipit-source-id: 82710acae96ab508b8e9198dadb7d7911cb97235
2020-07-21 11:23:19 -07:00
46808b49a8 Change whitelist to allow in file test_quantized_op.py (#41771)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41771

Reviewed By: zhangguanheng66

Differential Revision: D22641463

Pulled By: SplitInfinity

fbshipit-source-id: 1a60af8d43ccdf1f35dc84dbf4a7bc64965eb44a
2020-07-21 11:08:07 -07:00
72a1146339 Skip warning 4522 with MSVC (#41648)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41648

Reviewed By: zhangguanheng66

Differential Revision: D22644623

Pulled By: malfet

fbshipit-source-id: 7fb86f05b3d8cd6a4c7c0e3fdfd651b70a5094c9
2020-07-21 09:47:30 -07:00
2da2b5c081 update CONTRIBUTING.md for ccache (#41619)
Summary:
ccache now use cmake for building, update installation script.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41619

Reviewed By: zhangguanheng66

Differential Revision: D22644594

Pulled By: malfet

fbshipit-source-id: f894dd408822231f8aab36efbce188f06f004057
2020-07-21 09:43:30 -07:00
523f80e894 .circleci: Remove docker_hub_index_job, wasn't used (#41800)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41800

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: soumith

Differential Revision: D22645363

Pulled By: seemethere

fbshipit-source-id: 35ed43ed5fb4053f71dc9525c4ed62f1c60eacc1
2020-07-21 09:16:02 -07:00
1f11e930d0 [ROCm] skip test_streams on rocm. (#41697)
Summary:
Skipping the test test_streams as it is flaky on rocm.
cc: jeffdaily  sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41697

Reviewed By: zhangguanheng66

Differential Revision: D22644600

Pulled By: malfet

fbshipit-source-id: b1b16d496e58a91c44c40d640851fd62a5d7393d
2020-07-21 08:55:07 -07:00
48569cc330 Reland split (#41567)
Summary:
Take 3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41567

Reviewed By: zou3519

Differential Revision: D22586331

Pulled By: albanD

fbshipit-source-id: ca08199da716d64a335455610edbce752fee224b
2020-07-21 08:06:27 -07:00
c89c294ef9 Add Unflatten Module (#41564)
Summary:
This PR implements a feature extension discussed in https://github.com/pytorch/pytorch/issues/41516.

I followed this other PR https://github.com/pytorch/pytorch/issues/22245 to add this module. While I was at it, I also added the `extra_repr()` method to `Flatten`, which was missing.

I see there are no unit tests for these modules. Should I add those too? If so, where is the best place to put them?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41564

Reviewed By: gchanan

Differential Revision: D22636766

Pulled By: albanD

fbshipit-source-id: f9efdefd3ffe7d9af9482087625344af8f990943
2020-07-21 07:43:02 -07:00
fe415589a9 disable mkl for expm1 (#41654)
Summary:
On some systems/mkl versions it produces expm1(nan)=-1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41654

Reviewed By: mruberry

Differential Revision: D22621333

Pulled By: ngimel

fbshipit-source-id: 84544679fe96aed7de6873dce6f31f488e5e35dd
2020-07-20 23:40:17 -07:00
65bd38127a GLOO process group GPU alltoall (#41690)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41690

Gloo alltoall for GPU

Test Plan: buck test mode/dev-nosan caffe2/torch/lib/c10d:ProcessGroupGlooTest

Reviewed By: osalpekar

Differential Revision: D22631554

fbshipit-source-id: 4b126d9d991a118f3925c005427f399fc60f92f7
2020-07-20 19:01:12 -07:00
5c50cb567c Generalized Learnable Fake Quantizer Module (#41535)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41535

A generalized fake quantization module is built to support lower-bit fake quantization with back propagation on the scale and zero point. The module supports both per tensor and per channel fake quantization.

Test Plan:
Please see diff D22337313 for a related experiment performed on the fake quantizer module.

The `_LearnableFakeQuantize` module supports the following use cases:
- Per Tensor Fake Quantization or Per Channel Fake Quantization
- Static Estimation from Observers or Quantization Parameter Learning through Back Propagation

By default, the module assumes per tensor affine fake quantization. To switch to per channel, during initialization, declare `channel_size` with the appropriate length. To toggle between utilizing static estimation and parameter learning with back propagation, you can invoke the call `enable_param_learning` or `enable_static_estimate`. For more information on the flags that support these operations, please see the doc string of the `_LearnableFakeQuantize` module.

The `_LearnableFakeQuantizer` module relies on 2 operators for its forward and backward paths: `_LearnableFakeQuantizePerTensorOp` and `_LearnableFakeQuantizePerChannelOp`. The backpropagation routine is developed based on the following literature:
- Learned Step Size Quantization: https://openreview.net/pdf?id=rkgO66VKDS
- Trained Quantization Thresholds: https://arxiv.org/pdf/1903.08066.pdf

Reviewed By: z-a-f

Differential Revision: D22573645

fbshipit-source-id: cfd9ece8a959ae31c00d9beb1acf9dfed71a7ea1
2020-07-20 18:24:21 -07:00
3a9a64a4da Add non zero offset test cases for Quantize and Dequantize Ops. (#41693)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41693

Add non zero offset test cases for Quantize and Dequantize Ops.

Test Plan: Added new test case test_int8_non_zero_offset_quantize part of the test_int8_ops_nnpi.py test file.

Reviewed By: hyuen

Differential Revision: D22633796

fbshipit-source-id: be17ee7a0caa6e9bc7b175af539be2e6625ad47a
2020-07-20 16:03:32 -07:00
1039bbf4eb add named parameters to mobile module (#41376)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41376

torch::jit::mobile::Module does not currently support accessing parameters via their attribute names, but torch::jit::Module does. This diff adds equivalent functionality to mobile::Module.

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D22609142

Pulled By: ann-ss

fbshipit-source-id: 1a5272ff336f99a3c0bb6194c6a6384754f47846
2020-07-20 15:57:49 -07:00
30551ea7b2 Update NCCL from 2.4.8 to 2.7.3 (#41608)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41608

Reviewed By: mrshenli, ngimel

Differential Revision: D22604953

Pulled By: malfet

fbshipit-source-id: 28151e2d5b6ea360b79896cb79c761756687d121
2020-07-20 13:21:47 -07:00
f07816003a [2/n][Compute Meta] support analysis for null flag features
Summary:
## TLDR
Support using a NaN default value for missing dense features in RawInputProcessor for DPER2, in preparation for subsequent support for null-flag features in compute meta. For train_eval this is already supported in DPER3, and we do not plan to support it in DPER2 train_eval.

Differential Revision: D22439142

fbshipit-source-id: 99ae9755bd41a5d5f43bf5a9a2819d64f3883005
2020-07-20 13:13:45 -07:00
897cabc081 Add operators for smart keyboard to lite interpreter (#41539)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41539

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D22574746

Pulled By: ann-ss

fbshipit-source-id: 3e2b78385149d7bde2598c975e60845a766ef86a
2020-07-20 12:08:58 -07:00
de400fa5ac [JIT] handle specially mapped ops (#41503)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41503

Fix for https://github.com/pytorch/pytorch/issues/41192

We can map fill_ and zero_ to their functional equivalents full_like and zeros_like
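
The mapping in tensor terms (a standalone illustration):

```python
import torch

x = torch.randn(3)
assert torch.equal(x.clone().zero_(), torch.zeros_like(x))          # zero_ -> zeros_like
assert torch.equal(x.clone().fill_(7.0), torch.full_like(x, 7.0))   # fill_ -> full_like
```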

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D22629269

Pulled By: eellison

fbshipit-source-id: f1c62684dc55682c0b3845022e0461ec77d07179
2020-07-20 12:03:31 -07:00
6161730174 [JIT] move remove mutation to its own test file (#41502)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41502

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D22629270

Pulled By: eellison

fbshipit-source-id: fcec6ae4ff8f108164539d67427ef3d72fa07494
2020-07-20 12:03:28 -07:00
cfcee816f1 .circleci: Prefix docker jobs with docker- (#41689)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41689

It's annoying not to know which jobs are actually related to docker
builds so let's just add the prefix.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22631578

Pulled By: seemethere

fbshipit-source-id: ac0cdd983ccc3bebcc360ba479b378d8f0eaa9c0
2020-07-20 12:00:53 -07:00
cc3c18edbc More LayerNorm Vectorization in calcMeanStd function. (#41618)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41618

More LayerNorm Vectorization in calcMeanStd function.

Test Plan: test covered in test_layernorm_nnpi_fp16.py

Reviewed By: hyuen

Differential Revision: D22606585

fbshipit-source-id: be773e62f0fc479dbc2d6735f60c2e98441916e9
2020-07-20 11:55:54 -07:00
26bbbeaea4 [DOCS] Fix the docs for the inputs arg of trace_module func (#41586)
Summary:
Fix the docs for the `inputs` arg of `trace_module` func.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41586

Reviewed By: ezyang

Differential Revision: D22598453

Pulled By: zou3519

fbshipit-source-id: c2d182238b5a51f6d0a7d0683372d72a239146c5
2020-07-20 10:57:56 -07:00
ce443def01 Grammar patch 1 (.md) (#41599)
Summary:
A minor spell check!
I have gone through a dozen .md files to fix typos.
zou3519 take a look!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41599

Reviewed By: ezyang

Differential Revision: D22601629

Pulled By: zou3519

fbshipit-source-id: 68d8f77ad18edc1e77874f778b7dadee04b393ef
2020-07-20 10:19:08 -07:00
6769b850b2 Remove needless test duplication (#41583)
Summary:
The test loops over `upper` but does not use it, effectively running the same test twice, which increases test time for no gain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41583

Reviewed By: soumith, seemethere, izdeby

Differential Revision: D22598475

Pulled By: zou3519

fbshipit-source-id: d100f20143293a116ff3ba08b0f4eaf0cc5a8099
2020-07-20 10:14:11 -07:00
16dde6e3a0 Augmenting Observers to Support Dynamic Quantization Range (#41113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41113

In this diff, the `ObserverBase` class is augmented with 2 additional optional arguments qmin and qmax. Correspondingly the calculation of qmin and qmax and the related quantization parameters are modified to accommodate this additional flexibility should the number of bits for quantization be lower than 8 (the default value).

Additional logic in the base class's `_calculate_qparams` function has also been modified to support a dynamic quantization range.
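
A hedged sketch of the augmented interface; the `qmin`/`qmax` argument names are taken from this diff's description and should be treated as assumptions about the exact spelling:

```python
import torch
from torch.quantization import MinMaxObserver

obs = MinMaxObserver(qmin=0, qmax=15)  # 4-bit range instead of the 8-bit default
obs(torch.randn(16))                   # observe a batch
scale, zero_point = obs.calculate_qparams()
```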

Test Plan:
To ensure this modification is still backward compatible with past usages, numerics are verified by running the quantization unit test suite, which contains various observer tests. The following command executes the test suite, which also verifies the observer numerics:

`buck test //caffe2/test:quantization -- observer`

This modified observer script can be tested within the experiments for lower bit fake quantization. Please see the following diffs for reference.
- Single Fake Quantizer: D22337447
- Single Conv Layer: D22338532

Reviewed By: z-a-f

Differential Revision: D22427134

fbshipit-source-id: f405e633289322078b0f4a417f54b684adff2549
2020-07-20 08:51:31 -07:00
9600ed9af3 typo fixes (#41632)
Summary:
typo fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41632

Reviewed By: ezyang

Differential Revision: D22617827

Pulled By: mrshenli

fbshipit-source-id: c2bfcb7cc36913a8dd32f13fc9adc3aa0a9b682f
2020-07-20 07:23:00 -07:00
bd42e1a082 Doc language fixes (#41643)
Summary:
Updates doc for abs, acos, and isinf for clarity and consistency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41643

Reviewed By: ngimel

Differential Revision: D22622957

Pulled By: mruberry

fbshipit-source-id: 040f01b4e101153098577bf10dcd569b679aae2c
2020-07-19 21:31:51 -07:00
a69a262810 workaround segfault in deviceGuard construction (#41621)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41621

Per title. In some situations, the deviceGuard constructor in mul_kernel_cuda segfaults, so construct the deviceGuard conditionally, only when the first argument is a scalar.
This does not root-cause why the deviceGuard constructor segfaults, so the issue might come back.

Test Plan: pytorch oss CI

Reviewed By: jianyuh

Differential Revision: D22616460

fbshipit-source-id: b91bbe55c6eb0bbe80b8d6a61c41f09288752658
2020-07-18 23:41:43 -07:00
4a3aad354a [1/N] Implement Enum JIT support (#41390)
Summary:
* Add EnumType and AnyEnumType as first-class jit type
* Add Enum-typed IValue
* Enhanced aten::eq to support Enum

Supported:
Enum-typed function arguments
Using Enum types and comparing them (see the sketch after the TODO list below)

TODO:
Add Python sugared value for Enum
Support getting name/value attrs of enums
Support Enum-typed return values
Support enum values of different types in same Enum class
Support serialization and deserialization
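
A sketch of the supported pattern referenced above (hedged: later PRs in this N-part series complete the feature, so this exact snippet may only fully work once the whole stack lands):

```python
import torch
from enum import Enum

class Color(Enum):
    RED = 1
    GREEN = 2

@torch.jit.script
def is_red(c: Color) -> bool:
    return c == Color.RED  # Enum-typed argument plus aten::eq on Enums
```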

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41390

Reviewed By: eellison

Differential Revision: D22524388

Pulled By: gmagogsfm

fbshipit-source-id: 1627154a64e752d8457cd53270f3d14aea4b1150
2020-07-18 22:15:06 -07:00
46eb8d997c Revert D22533824: [PT] add check for duplicated op names in JIT
Test Plan: revert-hammer

Differential Revision:
D22533824 (d72c9f4200)

Original commit changeset: b36884531d41

fbshipit-source-id: 8bf840a09b4001cc68858a5dc3540505a0e1abdc
2020-07-18 17:26:42 -07:00
c7bcb285f3 Makes elementwise comparison docs more consistent (#41626)
Summary:
- Removes outdated language like "BoolTensor"
- Consistently labels keyword arguments, like out
- Uses a more natural string to describe their return type
- A few bonus fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41626

Reviewed By: ngimel

Differential Revision: D22617322

Pulled By: mruberry

fbshipit-source-id: 03cc3562b78a07ed30bd1dc7936d7a4f4e31f01d
2020-07-18 16:30:59 -07:00
e7a09b4d17 RecordFunction in Dispatcher (#37587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37587

Lifting RecordFunction up into the dispatcher code

Test Plan: Imported from OSS

Differential Revision: D21374246

fbshipit-source-id: 19f9c1719e6fd3990e451c5bbd771121e91128f7
2020-07-17 22:20:05 -07:00
c6d0fdd215 torch.isreal (#41298)
Summary:
https://github.com/pytorch/pytorch/issues/38349

mruberry
Not entirely sure if all the changes are necessary in how functions are added to PyTorch.

Should it throw an error when called with a non-complex tensor? NumPy allows non-complex arrays in its imag() function, which is used in its isreal() function, but PyTorch's imag() throws an error for non-complex tensors.

Where does assertONNX() get its expected output to compare to?
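
A minimal usage sketch, assuming the NumPy-like behavior discussed above (non-complex inputs are trivially all-real):

```python
import torch

z = torch.tensor([1 + 0j, 2 + 1j])
print(torch.isreal(z))                    # tensor([ True, False])
print(torch.isreal(torch.tensor([1.0])))  # tensor([True])
```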

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41298

Reviewed By: ngimel

Differential Revision: D22610500

Pulled By: mruberry

fbshipit-source-id: 817d61f8b1c3670788b81690636bd41335788439
2020-07-17 22:07:24 -07:00
581e9526bb [GradualGating] support better k value change (#41557)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41557

 - add new learning rate functor "slope"
 - use "slope" learning rate in gated_sparse_feature module

Test Plan:
buck test dper3/dper3/modules/tests:core_modules_test -- test_gated_sparse_features_shape_num_warmup_tensor_k
buck test caffe2/caffe2/python/operator_test:learning_rate_op_test -- test_slope_learning_rate_op

Reviewed By: huayuli00

Differential Revision: D22544628

fbshipit-source-id: f2fcae564e79e1d8bcd3a2305d0c11ca7c0d3b3c
2020-07-17 20:44:28 -07:00
d72c9f4200 [PT] add check for duplicated op names in JIT (#41549)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41549

D22467871 (a548c6b18f) was reverted due to double-linking torch_mobile_train.
Re-doing this change now that D22531358 (7a33d8b001) has landed.

Test Plan:
buck install fb4a
Train mnist in Internal Settings.

Reviewed By: iseeyuan

Differential Revision: D22533824

fbshipit-source-id: b36884531d41cea2e76b7fb1a567f21106c612b6
2020-07-17 20:26:48 -07:00
96ac12fdf4 [PT] add overload name for int prim ops (#41578)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41578

A new op aten::gcd(Tensor...) was added while the duplicated-op-name check was disabled. It's not a prim op, but it has the same name as the prim op aten::gcd(int, int).

It will be safer to enforce that all prim ops have an overload name, even though there is no duplicated name right now. People may add tensor ops without overload names in the future.

This diff added the overload name for all ops defined using "DEFINE_INT_OP".

```
aten::__and__
aten::__or__
aten::__xor__
aten::__lshift__
aten::__rshift__
aten::__round_to_zero_floordiv
aten::gcd
```
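
For context, an operator's overload name is the token after the dot in its schema string. A hypothetical sketch of what the int variant of gcd looks like once it carries an overload name (the exact schema strings live in the diff):

```python
# Hypothetical illustration only; see the diff for the real schemas.
int_gcd_schema = "aten::gcd.int(int a, int b) -> int"
```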

Test Plan: run full JIT predictor

Reviewed By: iseeyuan

Differential Revision: D22593689

fbshipit-source-id: b3335d356a774d33450a09d0a43ff947197f9b8a
2020-07-17 18:18:38 -07:00
445e7eb01b Add quantized CELU operator by adding additional parameters to quantized ELU (#39199)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39199

Test Plan: Imported from OSS

Differential Revision: D21771202

Pulled By: durumu

fbshipit-source-id: 910de6202fa3d5780497c5bf85208568a09297dd
2020-07-17 17:56:33 -07:00
1734f24276 Revert D22525217: [pytorch][PR] Initial implementation of quantile operator
Test Plan: revert-hammer

Differential Revision:
D22525217 (c7798ddf7b)

Original commit changeset: 27a8bb23feee

fbshipit-source-id: 3beb3d4f8a4d558e993fbdfe977af12c7153afc8
2020-07-17 17:22:48 -07:00
b774ce54f8 remediation of S205607
fbshipit-source-id: 798decc90db4f13770e97cdce3c0df7d5421b2a3
2020-07-17 17:19:47 -07:00
8fdea489af remediation of S205607
fbshipit-source-id: 5113fe0c527595e4227ff827253b7414abbdf7ac
2020-07-17 17:17:03 -07:00
39b4701d31 [caffe2][redo] Reimplement RemoveOpsByType with SSA (#41606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41606

The previous diff (D22220798 (59294fbbb9) and D22220797) was recently reverted (D22492356 (28291d3cf8), D22492355) because of a bug associated with the op AsyncIf. The AsyncIf op has net_defs as args and the SSA rewriting didn't take that into account. It has a special path for the op If, but not for AsyncIf. Several changes I made to fix the bug:
1) Add op AsyncIf to the special path for If op in SSA rewriting
2) clear inputs/outputs of the netdefs that are args in If/AsyncIf ops because they're no longer valid
3) revert renamed inputs/outputs in the arg netdefs that are in the external_outputs in the parent netdef

2) and 3) are existing bugs in the `SsaRewrite` function that were just never exposed before.

The algorithm for `RemoveOpsByType` is the same as in my previous diff D22220798 (59294fbbb9). The only new changes in this diff are in `onnx::SsaRewrite` and a few newly added unit tests.

(Note: this ignores all push blocking failures!)

Reviewed By: yinghai

Differential Revision: D22588652

fbshipit-source-id: ebb68ecd1662ea2bae14d4be8f61a75cd8b7e3e6
2020-07-17 16:06:43 -07:00
349c40507c Revert "[CircleCI] Delete docker image after testing" (#41601)
Summary:
Per AMD request, this reverts commit 1e64bf4c40ef82d6bc3dcc42b3874353f7632be0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41601

Reviewed By: ezyang

Differential Revision: D22603147

Pulled By: malfet

fbshipit-source-id: f423d406601383f26ea83a51f1de37e60b53810e
2020-07-17 14:42:27 -07:00
92b95e5243 Fix NCCL version check when nccl.h in non-standard location. (#40982)
Summary:
The NCCL discovery process fails to compile detect_nccl_version.cc when nccl.h resides in a non-standard location.
Pass __NCCL_INCLUDE_DIRS__ to _try_run(... detect_nccl_version.cc)_ to fix this.

This can be reproduced with the following Dockerfile:
```Dockerfile
FROM nvidia/cuda:10.2-cudnn7-devel-ubuntu18.04 as build
WORKDIR /stage

# install conda
ARG CONDA_VERSION=4.7.10
ARG CONDA_URL=https://repo.anaconda.com/miniconda/Miniconda3-${CONDA_VERSION}-Linux-x86_64.sh
RUN cd /stage && curl -fSsL --insecure ${CONDA_URL} -o install-conda.sh &&\
    /bin/bash ./install-conda.sh -b -p /opt/conda &&\
    /opt/conda/bin/conda clean -ya
ENV PATH=/opt/conda/bin:${PATH}

# install prerequisites
RUN conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi

# attempt compile
ENV CUDA_HOME="/usr/local/cuda" \
    CUDNN_LIBRARY="/usr/lib/x86_64-linux-gnu" \
    NCCL_INCLUDE_DIR="/usr/local/cuda/include" \
    NCCL_LIB_DIR="/usr/local/cuda/lib64" \
    USE_SYSTEM_NCCL=1
RUN apt-get -y update &&\
    apt-get -y install git &&\
    cd /stage && git clone https://github.com/pytorch/pytorch.git &&\
    cd pytorch &&\
    git submodule update --init --recursive &&\
    python setup.py bdist_wheel
```

This generates the following error:
```
-- Found NCCL: /usr/local/cuda/include
-- Determining NCCL version from /usr/local/cuda/include/nccl.h...
-- Looking for NCCL_VERSION_CODE
-- Looking for NCCL_VERSION_CODE - found
CMake Error at cmake/Modules/FindNCCL.cmake:78 (message):
  Found NCCL header version and library version do not match! (include:
  /usr/local/cuda/include, library: /usr/local/cuda/lib64/libnccl.so) Please
  set NCCL_INCLUDE_DIR and NCCL_LIB_DIR manually.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40982

Reviewed By: zou3519

Differential Revision: D22603911

Pulled By: malfet

fbshipit-source-id: 084870375a270fb9c7daf3c2e731992a03614ad6
2020-07-17 13:54:17 -07:00
cf811d2fb3 retain undefined tensors in backward pass (#41490)
Summary:
Leave undefined tensors / None returned from custom backward functions as undefined/None instead of creating a tensor full of zeros. This change improves performance in some cases.

**This is BC-Breaking:** Custom backward functions that return None will now see it potentially being propagated all the way up to AccumulateGrad nodes. The potential impact is that the .grad field of leaf tensors, as well as the result of autograd.grad, may be undefined/None where it used to be a tensor full of zeros. Also, autograd.grad may raise an error; if so, consider using allow_unused=True ([see doc](https://pytorch.org/docs/stable/autograd.html?highlight=autograd%20grad#torch.autograd.grad)) if it applies to your case.
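
A quick sketch of the allow_unused escape hatch mentioned above:

```python
import torch

x = torch.randn(3, requires_grad=True)
y = torch.randn(3, requires_grad=True)
out = (x * 2).sum()  # y never participates in the graph

# With allow_unused=True, the gradient for the unused input comes back as
# None rather than raising (or being materialized as zeros).
gx, gy = torch.autograd.grad(out, (x, y), allow_unused=True)
print(gy)  # None
```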

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41490

Reviewed By: albanD

Differential Revision: D22578241

Pulled By: heitorschueroff

fbshipit-source-id: f4966f4cb520069294f8c5c1691eeea799cc0abe
2020-07-17 12:42:50 -07:00
a874c1e584 Adds missing abs to lcm (#41552)
Summary:
lcm was missing an abs. This adds it and extends the tests for NumPy compliance. Also includes a few doc fixes.
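
A quick check of the behavior this fixes (a sketch; the real tests are in the PR):

```python
import torch

# lcm is defined via absolute values, so negative inputs yield a
# non-negative result, matching NumPy.
print(torch.lcm(torch.tensor([-4]), torch.tensor([6])))  # tensor([12])
```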

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41552

Reviewed By: ngimel

Differential Revision: D22580997

Pulled By: mruberry

fbshipit-source-id: 5ce1db56f88df4355427e1b682fcf8877458ff4e
2020-07-17 12:29:50 -07:00
0f78e596ba ROCm: Fix linking of custom ops in load_inline (#41257)
Summary:
Previously we did not link against amdhip64 (roughly equivalent to cudart). Apparently, the recent RTLD_GLOBAL fixes prevent the extensions from finding the symbols needed for launching kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41257

Reviewed By: zou3519

Differential Revision: D22573288

Pulled By: ezyang

fbshipit-source-id: 89f9329b2097df26785e2f67e236d60984d40fdd
2020-07-17 12:14:50 -07:00
3c862c80cf Move list size constants for profiler::Event and profiler::ProfilerConfig into (#40474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40474

These constants are unnecessary since there is an enum, and we can add
the size at the end of the enum and it will be equal to the list size. I
believe that this is the typical pattern used to represent enum sizes.
ghstack-source-id: 107969012

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D22147754

fbshipit-source-id: 7064a897a07f9104da5953c2f87b58179df8ea84
2020-07-17 12:00:18 -07:00
fbd960801a [JIT] Replace use of "whitelist" in lower_tuples pass (#41460)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41460

**Test Plan**
Continuous integration.

**Fixes**
This commit partially addresses #41443.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D22544272

Pulled By: SplitInfinity

fbshipit-source-id: b46940d1e24f81756daaace260bad7a1feda1e8f
2020-07-17 11:33:14 -07:00
c2c2c1c106 [JIT] Remove use of "whitelist" in quantization/helper.cpp (#41459)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41459

**Test Plan**
Continuous integration.

**Fixes**
This commit partially addresses #41443.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D22544269

Pulled By: SplitInfinity

fbshipit-source-id: d4bb7c0c9c71e953677a34f0530b66e5119447d0
2020-07-17 11:33:12 -07:00
4f4e3a0f15 [JIT] Replace uses of "whitelist" in jit/_script.py (#41458)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41458

**Test Plan**
Continuous integration.

**Fixes**
This commit partially fixes #41443.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D22544273

Pulled By: SplitInfinity

fbshipit-source-id: 8148e5338f90a5ef19177cf68bf36b56926d5a6c
2020-07-17 11:33:10 -07:00
bf0d0900a7 [JIT] Replace uses of "blacklist" in jit/_recursive.py (#41457)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41457

**Test Plan**
Continuous integration.

**Fixes**
This commit partially addresses #41443.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D22544274

Pulled By: SplitInfinity

fbshipit-source-id: ee74860c48d85d819d46c8b8848960e77bb5013e
2020-07-17 11:33:07 -07:00
758edcd7df [JIT] Replace use of "blacklist" in python/init.cpp (#41456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41456

**Test Plan**
Continuous integration.

**Fixes**
This commit partially addresses #41443.

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D22544270

Pulled By: SplitInfinity

fbshipit-source-id: 649b30e1fcc6516a4def6b148a1da07bc3ce941d
2020-07-17 11:33:05 -07:00
c9bdf474d7 [JIT] Replace use of "blacklist" in xnnpack_rewrite (#41455)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41455

**Test Plan**
Continuous integration.

**Fixes**
This commit partially addresses #41443.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D22544275

Pulled By: SplitInfinity

fbshipit-source-id: 5037b16e6ebc9e3b40dd03d2ce5a0671d7867892
2020-07-17 11:33:03 -07:00
3b7c05b11b [JIT] Replace uses of "blacklist" in gen_unboxing_wrappers.py (#41454)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41454

**Test Plan**
Continuous integration (if this file is still used).

**Fixes**
This commit partially addresses #41443.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D22544271

Pulled By: SplitInfinity

fbshipit-source-id: 84a4d552745fe5163b2e3200103c3b1f2a9ffb2a
2020-07-17 11:33:01 -07:00
f85a27e100 [JIT] Replace "blacklist" in test_jit.py (#41453)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41453

**Test Plan**
`python test/test_jit.py`

**Fixes**
This commit partially addresses #41443.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D22544268

Pulled By: SplitInfinity

fbshipit-source-id: 8b6b94211a626209c3960fda6c860593148dcbf2
2020-07-17 11:30:27 -07:00
43b1923d98 Enable SLS FP32 accumulation SparseLengthsWeightedSumFused8BitRowwiseFakeFP32NNPI Op. (#41577)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41577

* Remove skipping test
* Use fma_avx_emulation
* Increase test examples to 100

(Note: this ignores all push blocking failures!)

Test Plan: Tests are covered in test_sls_8bit_nnpi.py

Reviewed By: hyuen

Differential Revision: D22585742

fbshipit-source-id: e1f62f47eb10b402b11893ffca7a6786e31daa79
2020-07-17 11:19:47 -07:00
319b20b7db [ONNX] Update ORT version (#41372)
Summary:
Update ORT version [1.4 candidate].

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41372

Reviewed By: houseroad

Differential Revision: D22580050

Pulled By: bzinodev

fbshipit-source-id: c66e3bab865b3221d52eea30db48e0870ae5b681
2020-07-17 11:17:17 -07:00
346c69a626 [ONNX] Export embedding_bag (#41234)
Summary:
Enable export of embedding_bag op to ONNX

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41234

Reviewed By: houseroad

Differential Revision: D22567470

Pulled By: bzinodev

fbshipit-source-id: 2fcf74e54f3a9dee4588d7877a4ac9eb6c2a3629
2020-07-17 11:11:43 -07:00
7eb71b4beb Profiler: Do not record zero duration kernel events (#41540)
Summary:
Changes in the ROCm runtime have improved hipEventRecord.  The events no longer take ~4 usec to execute on the gpu stream, instead they appear instantaneous.  If you record two events, with no other activity in between, then they will have the same timestamp and the elapsed duration will be 0.

The profiler uses hip/cuda event pairs to infer gpu execution times.  It wraps functions whether they send work to the gpu or not.  Functions that send no gpu work will show as having zero duration.  Also they will show as running at the same time as neighboring functions.  On a trace, all those functions combine into a 'call stack' that can be tens of functions tall (when indeed they should be sequential).

This patch suppresses recording the zero-duration 'kernel' events, leaving only the CPU execution part. This means functions that do not use the GPU do not get an entry for how long they were using the GPU, which seems reasonable. This fixes the 'stacking' on traces. It also improves the signal-to-noise ratio of the GPU trace beyond what was available previously.

This patch will not affect CUDA or legacy ROCm, as those are not able to 'execute' eventRecord markers instantaneously.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41540

Reviewed By: zou3519

Differential Revision: D22597207

Pulled By: albanD

fbshipit-source-id: 5e89de2b6d53888db4f9dbcb91a94478cde2f525
2020-07-17 11:03:43 -07:00
324c18fcad fix division by low precision scalar (#41446)
Summary:
Before, the reciprocal used for division by a scalar was computed in the precision of the non-scalar operand, which can lead to underflow:
```
>>> x = torch.tensor([3388.]).half().to(0)
>>> scale = 524288.0
>>> x.div(scale)
tensor([0.], device='cuda:0', dtype=torch.float16)
>>> x.mul(1. / scale)
tensor([0.0065], device='cuda:0', dtype=torch.float16)
```
This PR makes results of multiplication by inverse and division the same.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41446

Reviewed By: ezyang

Differential Revision: D22542872

Pulled By: ngimel

fbshipit-source-id: b60e3244809573299c2c3030a006487a117606e9
2020-07-17 10:41:28 -07:00
5d7046522b [JIT] Teach IRPrinter and IRParser to handle 'requires_grad' and 'device' as a part of type info. (#41507)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41507

These fields have always been a part of tensor types, this change just
makes them serializable through IR dumps.

Test Plan: Imported from OSS

Reviewed By: Krovatkin, ngimel

Differential Revision: D22563661

Pulled By: ZolotukhinM

fbshipit-source-id: f01aaa130b7e0005bf1ff21f65827fc24755b360
2020-07-17 10:27:04 -07:00
241bc648c9 Adding missing setting state_.ptr() and hook_.ptr() to nullptr. (#41537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41537

Explicitly setting PyObject* state_ and hook_ to nullptr to prevent py::object's dtor from decref'ing the PyObject again.
Reference PR [#40848](https://github.com/pytorch/pytorch/pull/40848).
ghstack-source-id: 107959254

Test Plan: `python test/distributed/test_c10d.py`

Reviewed By: zou3519

Differential Revision: D22573858

fbshipit-source-id: 84cc5949a370ffdb4ac3ca7a16a6f0f136563c1c
2020-07-17 10:21:03 -07:00
c7798ddf7b Initial implementation of quantile operator (#39417)
Summary:
Implementing the quantile operator, similar to [numpy.quantile](https://numpy.org/devdocs/reference/generated/numpy.quantile.html).

For this implementation I'm reducing it to existing torch operators to get a CUDA implementation for free. It would be more efficient to implement a multiple-quickselect algorithm instead of sorting, but this can be addressed in a future PR.
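
A minimal usage sketch, assuming numpy.quantile's default linear interpolation as described above:

```python
import torch

a = torch.tensor([0., 1., 2., 3.])
print(torch.quantile(a, 0.5))                         # tensor(1.5000)
print(torch.quantile(a, torch.tensor([0.25, 0.75])))  # tensor([0.7500, 2.2500])
```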

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39417

Reviewed By: mruberry

Differential Revision: D22525217

Pulled By: heitorschueroff

fbshipit-source-id: 27a8bb23feee24fab7f8c228119d19edbb6cea33
2020-07-17 10:15:57 -07:00
71fdf748e5 Add torch.atleast_{1d/2d/3d} (#41317)
Summary:
https://github.com/pytorch/pytorch/issues/38349

TODO:
 * [x] Docs
 * [x] Tests
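
A minimal sketch of the NumPy-style semantics these functions follow (scalars are promoted to the requested rank):

```python
import torch

t = torch.tensor(5.)
print(torch.atleast_1d(t).shape)  # torch.Size([1])
print(torch.atleast_2d(t).shape)  # torch.Size([1, 1])
print(torch.atleast_3d(t).shape)  # torch.Size([1, 1, 1])
```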

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41317

Reviewed By: ngimel

Differential Revision: D22575456

Pulled By: mruberry

fbshipit-source-id: cc79f4cd2ca4164108ed731c33cf140a4d1c9dd8
2020-07-17 10:10:41 -07:00
840ad94ef5 Add reference documentation for torch/library.h (#41470)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41470

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D22577426

Pulled By: ezyang

fbshipit-source-id: 4bfe5806061e74181a74d161c868acb7c1ecd1e4
2020-07-17 10:05:16 -07:00
1e230a5c52 rewrite C++ __torch_function__ handling to work with TensorList operands (#41575)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41575

Fixes https://github.com/pytorch/pytorch/issues/34294

This updates the C++ argument parser to correctly handle `TensorList` operands. I've also included a number of updates to the testing infrastructure; this is because we're now doing a much more careful job of testing the signatures of aten kernels, using the type information about the arguments as read in from `Declarations.yaml`. The changes to the tests are required because we're now only checking for `__torch_function__` attributes on `Tensor`, `Optional[Tensor]`, and elements of `TensorList` operands, whereas before we were checking for `__torch_function__` on all operands. The relatively simplistic approach the tests were using before -- assuming all positional arguments might be tensors -- therefore doesn't work anymore. I now think that checking for `__torch_function__` on all operands was a mistake in the original design.

The updates to the signatures of the `lambda` functions are to handle this new, more stringent checking of signatures.

I also added override support for `torch.nn.functional.threshold` and `torch.nn.functional.layer_norm`, which did not yet have python-level support.

Benchmarks are still WIP.
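
A minimal sketch of what this enables, written against the present-day `__torch_function__` protocol shape (a subclass inside a `TensorList` operand such as `torch.cat`'s input now participates in dispatch):

```python
import torch

class Logged(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        print("dispatching", func.__name__)
        return super().__torch_function__(func, types, args, kwargs or {})

t = torch.randn(2).as_subclass(Logged)
# The subclass sits inside a TensorList operand and is still dispatched on.
out = torch.cat([t, torch.randn(2)])
```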

Pull Request resolved: https://github.com/pytorch/pytorch/pull/34725

Reviewed By: mruberry

Differential Revision: D22357738

Pulled By: ezyang

fbshipit-source-id: 0e7f4a58517867b2e3f193a0a8390e2ed294e1f3
2020-07-17 08:54:29 -07:00
cb9029df9d Assert valid inner type for OptionalType creation (#41509)
Summary:
Assert in OptionalType::create that the inner TypePtr is valid, to catch all uses, and also assert in the Python resolver to propagate a slightly more helpful error message.

Closes https://github.com/pytorch/pytorch/issues/40713.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41509

Reviewed By: suo

Differential Revision: D22563710

Pulled By: wconstab

fbshipit-source-id: ee6314b1694a55c1ba7c8251260ea120be148b17
2020-07-17 07:22:41 -07:00
e3e58e20cd enable jit profiling tests on macos (#41550)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41550

Reviewed By: SplitInfinity

Differential Revision: D22579593

Pulled By: Krovatkin

fbshipit-source-id: 3e67bcf418ef266d5416b7fac413e94b1ac1ec7e
2020-07-16 22:55:24 -07:00
eb3bf96f95 During inbatch broadcast, move Tile op after Fused8BitRowwiseQuantizedToFloat if applicable (#41464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41464

If the input is int8 rowwise quantized, we currently cannot lower it to Glow, and previously we hit an error when running with inbatch broadcast. The main issue is that the Tile op doesn't support the uint8_t type, which is very easily added here. However, this results in the non-ideal situation that we leave Tile -> Fused8BitRowwiseQuantizedToFloat on the host side, which probably hurts memory bandwidth a lot. Even if we later add Fused8BitRowwiseQuantizedToFloat support to Glow, it's still not ideal because we would be doing redundant compute on identical columns. So the solution here is to swap the order of Fused8BitRowwiseQuantizedToFloat and Tile to make it Tile -> Fused8BitRowwiseQuantizedToFloat. This resolves the error we saw immediately. For the short term, we can still run Tile on the card, and for the longer term, things run faster on the card.

The optimization is a heuristic: if the net doesn't contain such a pattern, inbatch broadcast works as it did before.

(Note: this ignores all push blocking failures!)

Test Plan:
```
buck test caffe2/caffe2/opt/custom:in_batch_broadcast_test
```

Reviewed By: benjibc

Differential Revision: D22544162

fbshipit-source-id: b6dd36a5925a9c8103b80f034e7730a7a085a6ff
2020-07-16 21:25:18 -07:00
5376785a70 Run NO_AVX jobs on CPU (#41565)
Summary:
Delete "nogpu" job since both "AVX" and "AVX2" jobs already act like one
Fix naming problem when NO_AVX_NO_AVX2 job and NO_AVX2 jobs were semantically identical, due to the following logic in test.sh:
```
if [[ "${BUILD_ENVIRONMENT}" == *-NO_AVX-* ]]; then
  export ATEN_CPU_CAPABILITY=default
elif [[ "${BUILD_ENVIRONMENT}" == *-NO_AVX2-* ]]; then
  export ATEN_CPU_CAPABILITY=avx
fi
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41565

Reviewed By: seemethere

Differential Revision: D22584743

Pulled By: malfet

fbshipit-source-id: 783cce60f35947b5d1e8b93901db36371ef78243
2020-07-16 21:21:48 -07:00
728fd37d92 [JIT] make fastrnns runnable on cpu (#41483)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41483

Reviewed By: gmagogsfm

Differential Revision: D22580275

Pulled By: eellison

fbshipit-source-id: f2805bc7fa8037cfde7862b005d2940add3ac864
2020-07-16 15:53:39 -07:00
b1d4e33c8b Revert D22552377: [pytorch][PR] Reland split unsafe version
Test Plan: revert-hammer

Differential Revision:
D22552377 (5bba973afd)

Original commit changeset: 1d1b713d2429

fbshipit-source-id: 8194458f99bfd5f077b7daa46ca3e81b549adc1b
2020-07-16 15:24:19 -07:00
415ff0bceb Create lazy_dyndeps to avoid caffe2 import costs. (#41343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41343

Currently caffe2.InitOpLibrary does the dll import unilaterally. Instead, if we make a lazy version and use it, then many pieces of code which do not need the caffe2 operators get a lot faster.

On a real test, the import time went from 140s to 68.8s.

This also cleans up the algorithm slightly (although it makes a very minimal
difference), by parsing the list of operators once, rather than every time a
new operator is added, since we defer the RefreshCall until after we've
imported all the operators.

The key way we maintain safety is that as soon as someone does an operation
which requires an operator (or could), we force importing of all available
operators.

Future work could include trying to identify which code is needed for which
operator and only import the needed ones. There may also be wins available by
playing with dlmopen (which opens within a namespace), or seeing if the dl
flags have an impact (I tried this and didn't see an impact, but dlmopen may
make it better).

Note that this was previously landed and reverted. The issue was that if a import failed and raised an exception, the specific library would not be removed from the lazy imports. This caused our tests which had libraries that failed to poison all other tests that ran after it. This has been fixed and a unit test has been added for this case (to help make it obvious what failed).
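
A generic sketch of the scheme described, with hypothetical names (`ctypes.CDLL` stands in for the real dyndep loader); note the entry is removed from the pending list before loading, so a failing library cannot poison later calls:

```python
import ctypes

_pending_libs = []

def lazy_init_op_library(path):
    # Record the library instead of importing it eagerly (hypothetical name).
    _pending_libs.append(path)

def ensure_ops_loaded():
    # Invoked the first time any operation actually needs an operator.
    while _pending_libs:
        path = _pending_libs.pop(0)  # pop before loading, so a failed import
        ctypes.CDLL(path)            # is not retried and cannot poison later
                                     # callers -- the bug behind the revert
```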

Test Plan:
I added a new test a lazy_dyndep_test.py (copied from all_compare_test.py).
I'm a little concerned that I don't see any explicit tests for dyndep, but this
should provide decent coverage.

I've added a specific test to handle the poisoning issues mentioned above, which caused the previous version to get reverted.

Differential Revision: D22506369

fbshipit-source-id: 7395df4778e8eb0220630c570360b99a7d60eb83
2020-07-16 15:17:41 -07:00
9ed825746a Use c10::cuda:: primitives rather than make CUDA runtime calls directly (#41405)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41405

Test Plan:
**Imported from GitHub: all checks have passed**

{F244195355}

**The Intern Builds & Tests have 127 success, 5 no signals, and 1 failure. Double check the failed test log file, the failure is result differences:**
- AssertionError: 0.435608434677124 != 0.4356083869934082
- AssertionError: 0.4393022060394287 != 0.4393021583557129
- AssertionError: 0.44707541465759276 != 0.44707536697387695

These are all very small numerical errors (within 0.0000001).

Reviewed By: malfet

Differential Revision: D22531486

Pulled By: threekindoms

fbshipit-source-id: 21543ec76bb9b502885b5146c8ba5ede719be9ff
2020-07-16 15:11:57 -07:00
a0e58996fb Makes the use of the term "module" consistent through the serialization note (#41563)
Summary:
module -> torch.nn.Module or ScriptModule, as appropriate. + bonus grammar fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41563

Reviewed By: gchanan

Differential Revision: D22584173

Pulled By: mruberry

fbshipit-source-id: 8c90f1f9a194bfdb277c97cf02c9b8c1c6ddc601
2020-07-16 14:59:49 -07:00
454cd3ea2e Fix RocM resource class allocation (#41553)
Summary:
Add a Conf.is_test_stage() method to avoid duplicating `stage in ['test', 'test1', 'test2']` checks throughout the code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41553

Test Plan: Make sure that in modified config.yml ROCM tests jobs are assigned `pytorch/amd-gpu` resource class

Reviewed By: yns88

Differential Revision: D22580471

Pulled By: malfet

fbshipit-source-id: 514555f0c0ac94c807bf837ba209560055335587
2020-07-16 14:13:25 -07:00
e324ea85ea Add tests to logical operation in BinaryOpsKernel.cpp (#41515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41515

Add tests in atest.cpp to cover logical_and_kernel, logical_or_kernel, and logical_xor_kernel in ATen/native/cpu/BinaryOpsKernel.cpp.

https://pxl.cl/1drmV

Test Plan: CI

Reviewed By: malfet

Differential Revision: D22565235

fbshipit-source-id: 7ad9fd8420d7fdd23fd9a703c75da212f72bde2c
2020-07-16 13:21:57 -07:00
f49d97a848 Notes for lcm and gcd, formatting doc fixes (#41526)
Summary:
A small PR fixing some formatting in lcm, gcd, and the serialization note. Adds a note to lcm and gcd explaining behavior that is not always defined.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41526

Reviewed By: ngimel

Differential Revision: D22569341

Pulled By: mruberry

fbshipit-source-id: 5f5ff98c0831f65e82b991ef444a5cee8e3c8b5a
2020-07-16 13:15:29 -07:00
86590f226e Revert D22519869: [pytorch][PR] RandomSampler generates samples one at a time when replacement=True
Test Plan: revert-hammer

Differential Revision:
D22519869 (09647e1287)

Original commit changeset: be6585002586

fbshipit-source-id: 31ca5ceb24dd0b291f46f427a6f30f1037252a5d
2020-07-16 12:59:10 -07:00
ba6b235461 [RocM] Switch to rocm-3.5.1 image (#41273)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41273

Reviewed By: seemethere

Differential Revision: D22575277

Pulled By: malfet

fbshipit-source-id: 6f43654c8c8c33adbc1de928dd43911931244978
2020-07-16 12:52:17 -07:00
09647e1287 RandomSampler generates samples one at a time when replacement=True (#40026)
Summary:
Fix https://github.com/pytorch/pytorch/issues/32530

I used the next() function to generate samples one at a time. To compensate for replacement=False, I added a variable called "sample_list" to RandomSampler that holds a random permutation.
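
An illustrative sketch of the replacement=True path described above (not the exact sampler code):

```python
import torch

def sample_with_replacement(n: int, num_samples: int):
    # Yield one index at a time instead of materializing all of them up front.
    for _ in range(num_samples):
        yield int(torch.randint(high=n, size=(1,)))
```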

cc SsnL

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40026

Reviewed By: zhangguanheng66

Differential Revision: D22519869

Pulled By: ezyang

fbshipit-source-id: be65850025864d659a713b3bc461b25d6d0048a2
2020-07-16 11:42:32 -07:00
6f5f455c54 [Gloo] alltoall to ProcessGroupGloo (#41424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41424

Adding alltoall to Gloo process group

Test Plan:
buck test caffe2/torch/lib/c10d:ProcessGroupGlooTest

Verified on TSC as well D22141532

Reviewed By: osalpekar

Differential Revision: D22451929

fbshipit-source-id: 695c4655c894c85229b16097fa63352ed04523ef
2020-07-16 11:27:26 -07:00
1ac4692489 Remove unnecessary test in rpc_test.py (#41218)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41218

This test doesn't assert anything and was accidentally committed as
part of a larger diff a few months ago.
ghstack-source-id: 107882848

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D22469852

fbshipit-source-id: 0baa23da56b08200e16cf66df514566223dd9b15
2020-07-16 11:23:52 -07:00
b5e32528d0 Fix flaky test_udf_remote_message_delay_timeout_to_self (#41217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41217

Fixes this flaky test. Due to the possibility of callback
finishCreatingOwnerRRef running after request_callback has processed and
created the owner RRef, we could actually end up with 0 owners on the node,
since the callback removes from the owners_ map. In this case, shutdown is fine
since there are no owners. On the other hand, if the callback runs first, there
will be 1 owner which we will delete in shutdown when we detect it has no
forks. So either way, shutdown works fine and we don't need to enforce there to
be 1 owner.
ghstack-source-id: 107883497

Test Plan: Ran the test 500 times with TSAN.

Reviewed By: ezyang

Differential Revision: D22469806

fbshipit-source-id: 02290d6d5922f91a9e2d5ede21d1cf1c4598cb46
2020-07-16 11:20:56 -07:00
94e4248d80 Split ASAN and ROCM tests into test1 and test2 (#41520)
Summary:
This should reduce end-to-end test runtime for the 2 slowest configs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41520

Reviewed By: seemethere

Differential Revision: D22575028

Pulled By: malfet

fbshipit-source-id: a65bfa5932fcda3cf0f4fdd97bcc7ebb3f54c281
2020-07-16 11:15:03 -07:00
81e964904e [Gloo] Tests for Gloo Async Work Wait-level Timeouts (#41265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41265

This PR adds tests for the Async Work wait-level timeouts that were added in the previous PR
ghstack-source-id: 107835732

Test Plan: New tests are in this diff - Running on local machine and Sandcastle

Reviewed By: jiayisuse

Differential Revision: D22470084

fbshipit-source-id: 5552e384d384962e359c5f665e6572df03b6aa63
2020-07-16 10:59:01 -07:00
b979129cba [Gloo] Support work-level timeouts in ProcessGroupGloo (#40948)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40948

Add work-level timeouts to ProcessGroupGloo. This uses the timeout support in `waitSend` and `waitRecv` functions from Gloo's `unbound_buffer` construct.
ghstack-source-id: 107835738

Test Plan: Tests are in the last PR in this stack

Reviewed By: jiayisuse

Differential Revision: D22173763

fbshipit-source-id: e0493231a23033464708ee2bc0e295d2b087a1c9
2020-07-16 10:58:59 -07:00
01dcef2e15 [NCCL] Tests for WorkNCCL::wait with Timeouts (#40947)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40947

This PR adds tests for work-level timeouts in WorkNCCL objects. We kick off an allgather operation that waits for 1000ms before actually starting computation. We wait on completion of this allgather op with a timeout of 250ms, expecting the operation to time out and throw a runtime error.
ghstack-source-id: 107835734

Test Plan: This diff added tests - checking CI/Sandcastle for correctness. These are NCCL tests so they require at least 2 GPUs to run.

Reviewed By: jiayisuse

Differential Revision: D22173101

fbshipit-source-id: 8595e4b67662cef781b20ced0befdcc53d157c39
2020-07-16 10:58:56 -07:00
edf3dc73f2 [NCCL] Support Wait Timeout in ProcessGroupNCCL (#40946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40946

Adds timeout to ProcessGroupNCCL::wait. Currently, WorkNCCL objects already have a timeout set during ProcessGroupNCCL construction. The new wait function will override the existing timeout with the user-defined timeout if one is provided. Timed out operations result in NCCL communicators being aborted and an exception being thrown.
ghstack-source-id: 107835739

Test Plan: Test added to `ProcessGroupNCCLTest` in the next PR in this stack.

Reviewed By: jiayisuse

Differential Revision: D22127898

fbshipit-source-id: 543964855ac5b41e464b2df4bb6c211ef053e73b
2020-07-16 10:58:54 -07:00
9d92fa2679 [NCCL] Add timeout to ProcessGroup Work Wait (#40944)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40944

This stack adds Work-level timeout for blocking wait.

This PR just changes the API to accept a timeout arg with a default value for the wait function in each ProcessGroup backend. The ProcessGroup superclass correctly waits for the given timeout by changing the CV wait to wait_for.

Closes: https://github.com/pytorch/pytorch/issues/37571
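
From the Python side, the per-wait timeout looks roughly like this (a sketch assuming an initialized process group; binding names per the c10d API):

```python
from datetime import timedelta

import torch
import torch.distributed as dist

dist.init_process_group("gloo")  # assumes env:// rendezvous variables are set
t = torch.ones(4)
work = dist.all_reduce(t, async_op=True)
# Overrides the process-group default; raises if not completed within 500 ms.
work.wait(timeout=timedelta(milliseconds=500))
```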
ghstack-source-id: 107835735

Test Plan: Tests in 4th PR in this stack

Reviewed By: jiayisuse

Differential Revision: D22107135

fbshipit-source-id: b38c07cb5e79e6c86c205e580336e7918ed96501
2020-07-16 10:56:58 -07:00
fef30220fd Runs CUDA test_istft_of_sine on CUDA (#41523)
Summary:
The test was always running on the CPU. This actually caused it to throw an error on non-MKL builds, since the CUDA test (which ran on the CPU) tried to execute but the test requires MKL (a requirement only checked for the CPU variant of the test).

Fixes https://github.com/pytorch/pytorch/issues/41402.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41523

Reviewed By: ngimel

Differential Revision: D22569344

Pulled By: mruberry

fbshipit-source-id: e9908c0ed4b5e7b18cc7608879c6213fbf787da2
2020-07-16 10:43:51 -07:00
b2b8af9645 Removes assertAlmostEqual (#41514)
Summary:
This test function is confusing since our `assertEqual` behavior allows for tolerance to be specified, and this is a redundant mechanism.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41514

Reviewed By: ngimel

Differential Revision: D22569348

Pulled By: mruberry

fbshipit-source-id: 2b2ff8aaa9625a51207941dfee8e07786181fe9f
2020-07-16 10:35:12 -07:00
58244a9586 Automated submodule update: FBGEMM (#40332)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 73ea1f5828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40332

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: gchanan, yns88

Differential Revision: D22150737

fbshipit-source-id: fe7e6787adef9e2fedee5d1a0a1e57bc4760b88c
2020-07-16 10:32:39 -07:00
2b14f2d368 [reland][DNNL]:enable max_pool3d and avg_pool3d (#40996)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40996

Test Plan: Imported from OSS

Differential Revision: D22440766

Pulled By: VitalyFedyunin

fbshipit-source-id: 242711612920081eb4a7e5a7e80bc8b2d4c9f978
2020-07-16 10:26:45 -07:00
45c5bac870 [WIP] Fix cpp grad accessor API (#40887)
Summary:
Update the API for accessing grad in C++ to avoid unexpected thread-safety issues.
In particular, with the current API, a check like `t.grad().defined()` is not thread safe.

- This introduces `t.mutable_grad()` that should be used when getting a mutable version of the saved gradient. This function is **not** thread safe.
- The `Tensor& grad()` API is now removed. We could not do a deprecation cycle as most of our call side use non-const Tensors that use the non-const overload. This would lead to most calls hitting the warning. This would be too verbose for all the users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40887

Reviewed By: ezyang

Differential Revision: D22343932

Pulled By: albanD

fbshipit-source-id: d5eb909bb743bc20caaf2098196e18ca4110c5d2
2020-07-16 09:11:12 -07:00
5bba973afd Reland split unsafe version (#41484)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/39299

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41484

Reviewed By: glaringlee

Differential Revision: D22552377

Pulled By: albanD

fbshipit-source-id: 1d1b713d2429ae162e04bda845ef0838c52df789
2020-07-16 09:01:45 -07:00
b9442bb03e Doc note for complex (#41252)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41252

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D22553266

Pulled By: anjali411

fbshipit-source-id: f6dc409da048496d72b29b0976dfd3dd6645bc4d
2020-07-16 08:53:27 -07:00
d80e0c62be fix dequantization to match nnpi (#41505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41505

fix the dequantization to match the fixes from quantization

Test Plan:
The test is not conclusive, since it only compares emulation against a reference collected from Amy's run.

An evaluation workflow is running at the moment.

Reviewed By: venkatacrc

Differential Revision: D22558092

fbshipit-source-id: 3ff00ea15eac76007e194659c3b4949f07ff02a4
2020-07-16 00:40:57 -07:00
26790fb26d fix quantization mechanism to match nnpi (#41494)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41494

revert back to the changes from amylittleyang to make quantization work

Test Plan:
Ran against a dump from ctr_instagram, and verified that:
- nnpi and fakelowp match bitwise
- nnpi differs from fbgemm by at most 1, most likely due to the type of
rounding

Reviewed By: venkatacrc

Differential Revision: D22555276

fbshipit-source-id: 7074521d181f15ef6270985bb71c4b44d25d1c30
2020-07-16 00:40:55 -07:00
e6859ec78f resurrect single quantization op test (#41476)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41476

deleted this test by default, re-adding it in its own file to make it
more explicit

Test Plan: ran the test

Reviewed By: yinghai

Differential Revision: D22550217

fbshipit-source-id: 758e279b2bab3b23452a3d0ce75fb366f7afb7be
2020-07-16 00:37:46 -07:00
04c0f2e3cc enable TE on windows (#41501)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41501

Reviewed By: ZolotukhinM

Differential Revision: D22563872

Pulled By: Krovatkin

fbshipit-source-id: 2b5730017b34af27800cc03f3ba62f1cc8b4f240
2020-07-15 23:00:05 -07:00
b2e52186b9 Rename capacity to nbytes in ShareExternalPointer to avoid confusion in future (#41461)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41461

capacity is misleading, and we have many wrong uses internally. Let's rename it to nbytes to avoid confusion in the future. Ultimately, we could remove this parameter if possible;
so far I haven't seen any case where this capacity is necessary.

Test Plan: oss ci

Differential Revision: D22544189

fbshipit-source-id: f310627f2ab8f4ebb294e0dd5eabc380926991eb
2020-07-15 22:04:18 -07:00
702140758f Move GLOG_ constants into c10 namespace (#41504)
Summary:
Declaring GLOG_ constants in the google namespace causes a conflict in C++ projects that use GLOG and link against LibPyTorch compiled without GLOG.
For example, see https://github.com/facebookresearch/ReAgent/issues/288

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41504

Reviewed By: kaiwenw

Differential Revision: D22564308

Pulled By: malfet

fbshipit-source-id: 2167bd2c6124bd14a67cc0a1360521d3c375e3c2
2020-07-15 21:56:00 -07:00
f27e395a4a [Gloo] update gloo submodule for PyTorch (#41462)
Summary:
To include alltoall

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41462

Test Plan: CI

Reviewed By: osalpekar

Differential Revision: D22544255

Pulled By: jiayisuse

fbshipit-source-id: ad55a50a31e5e5affaf3e14e2401d38f99657dc9
2020-07-15 21:50:08 -07:00
1fb2a7e5a2 onnx export of fake quantize functions (#39738)
Summary:
As discussed in https://github.com/pytorch/pytorch/issues/39502.

This PR adds support for exporting  `fake_quantize_per_tensor_affine` to a pair of `QuantizeLinear` and `DequantizeLinear`.

Exporting `fake_quantize_per_channel_affine` to ONNX depends on https://github.com/onnx/onnx/pull/2772. will file another PR once ONNX merged the change.

It will generate ONNX graph like this:
![image](https://user-images.githubusercontent.com/1697840/84180123-ddd90080-aa3b-11ea-81d5-eaf6f5f26715.png)
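
A minimal sketch of the export path (hypothetical module; assumes an opset that includes QuantizeLinear/DequantizeLinear, i.e. 10+):

```python
import torch

class FQ(torch.nn.Module):
    def forward(self, x):
        # exported as QuantizeLinear -> DequantizeLinear
        return torch.fake_quantize_per_tensor_affine(x, 0.1, 0, 0, 255)

torch.onnx.export(FQ(), torch.randn(4), "fq.onnx", opset_version=10)
```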

jamesr66a

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39738

Reviewed By: hl475

Differential Revision: D22517911

Pulled By: houseroad

fbshipit-source-id: e998b4012e11b0f181b193860ff6960069a91d70
2020-07-15 21:20:23 -07:00
7a33d8b001 [PyTorch Mobile] Modularize the autograd source files shared by mobile and full-jit (#41430)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41430

To avoid duplication at compile time, modularize the common autograd files used by both mobile and full-jit.
ghstack-source-id: 107742889

Test Plan: CI

Reviewed By: kwanmacher

Differential Revision: D22531358

fbshipit-source-id: 554f10be89b7ed59c9bde13387a0e1b08000c116
2020-07-15 21:14:47 -07:00
23174ca71b [reland] Enable TF32 support for cuBLAS (#41498)
Summary:
fix rocm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41498

Reviewed By: mruberry

Differential Revision: D22560572

Pulled By: ngimel

fbshipit-source-id: 5ee79e96cb29e70d9180830d058efb53d1c6c041
2020-07-15 21:00:55 -07:00
200c343184 Implement gcd, lcm (#40651)
Summary:
Resolves https://github.com/pytorch/pytorch/issues/40018.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40651

Reviewed By: ezyang

Differential Revision: D22511828

Pulled By: mruberry

fbshipit-source-id: 3ef251e45da4688b1b64c79f530fb6642feb63ab
2020-07-15 20:56:23 -07:00
e44f460079 [jit] Fix jit not round to even if const is folded (#40897)
Summary:
Fixed https://github.com/pytorch/pytorch/issues/40771

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40897

Reviewed By: Krovatkin

Differential Revision: D22543261

Pulled By: gmagogsfm

fbshipit-source-id: 0bd4b1d910a42d5aa87e120c81acfdfb7ca895fa
2020-07-15 20:13:12 -07:00
1770937c9c Restore the contiguity preprocessing of linspace (#41286)
Summary:
The contiguity preprocessing was mistakenly removed in
cd48fb503088af2c00884f1619db571fffbcdafa. Its absence causes erroneous output
when the output tensor is not contiguous. Here we restore the
preprocessing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41286

Reviewed By: zou3519

Differential Revision: D22550822

Pulled By: ezyang

fbshipit-source-id: ebad4e2ba83d2d808e3f958d4adc9a5513a95bec
2020-07-15 20:02:16 -07:00
d90fb72b5a remove use of the term "blacklist" from docs/cpp/source/Doxyfile (#41450)
Summary:
As requested in https://github.com/pytorch/pytorch/issues/41443

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41450

Reviewed By: ezyang

Differential Revision: D22561782

Pulled By: SplitInfinity

fbshipit-source-id: b38ab5e2725735d1f0c70a4d0012678636e992c3
2020-07-15 19:45:53 -07:00
404799d43f Disable failed caffe2 tests for BoundShapeInference on Windows (#41472)
Summary:
Related:
https://github.com/pytorch/pytorch/issues/40861
https://github.com/pytorch/pytorch/issues/41471

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41472

Reviewed By: yns88

Differential Revision: D22562385

Pulled By: malfet

fbshipit-source-id: aebc600915342b984f4fc47cef0a1e79d8965c10
2020-07-15 19:39:45 -07:00
60f2fa6a84 Updates serialization note to explain versioned symbols and dynamic versioning (#41395)
Summary:
Doc update intended to clarify and expand our current serialization behavior, including explaining the difference between torch.save/torch.load, torch.nn.Module.state_dict/torch.nn.Module.load_state_dict, and torch.jit.save/torch.jit.load. Also explains, for the first time, when historic serialized TorchScript behavior is preserved, and our recommendation for preserving behavior (using the same PyTorch version to consume a model as the one that produced it).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41395

Reviewed By: ngimel

Differential Revision: D22560538

Pulled By: mruberry

fbshipit-source-id: dbc2f1bb92ab61ff2eca4888febc21f7dda76ba1
2020-07-15 19:05:19 -07:00
488ee3790e Support @torch.jit.unused on a @torch.no_grad decorated function (#41496)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41496

use the wrapped function (instead of the wrapper) to obtain argument names
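
A sketch mirroring the repro in the Test Plan below: with the fix, argument names come from the wrapped function, so scripting succeeds:

```python
import torch

class MyMod(torch.nn.Module):
    @torch.jit.unused
    @torch.no_grad()
    def fn(self, x):
        return x + 1

    def forward(self, x):
        return self.fn(x)

# Previously failed with "Non-static method does not have a self argument".
torch.jit.script(MyMod())
```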

Test Plan:
```
buck test mode/dev-nosan //caffe2/test:jit -- 'test_unused_decorator \(test_jit\.TestScript\)'
```

Before:
```
> Traceback (most recent call last):
>   File "/data/users/yuxinwu/fbsource2/fbcode/buck-out/dev/gen/caffe2/test/jit#binary,link-tree/test_jit.py", line 3014, in test_unused_decorator
>     torch.jit.script(MyMod())
>   File "/data/users/yuxinwu/fbsource2/fbcode/buck-out/dev/gen/caffe2/test/jit#binary,link-tree/torch/jit/_script.py", line 888, in script
>     obj, torch.jit._recursive.infer_methods_to_compile
>   File "/data/users/yuxinwu/fbsource2/fbcode/buck-out/dev/gen/caffe2/test/jit#binary,link-tree/torch/jit/_recursive.py", line 317, in create_script_module
>     return create_script_module_impl(nn_module, concrete_type, stubs_fn)
>   File "/data/users/yuxinwu/fbsource2/fbcode/buck-out/dev/gen/caffe2/test/jit#binary,link-tree/torch/jit/_recursive.py", line 376, in create_script_module_impl
>     create_methods_from_stubs(concrete_type, stubs)
>   File "/data/users/yuxinwu/fbsource2/fbcode/buck-out/dev/gen/caffe2/test/jit#binary,link-tree/torch/jit/_recursive.py", line 292, in create_methods_from_stubs
>     concrete_type._create_methods(defs, rcbs, defaults)
> RuntimeError:
> Non-static method does not have a self argument:
>   File "/data/users/yuxinwu/fbsource2/fbcode/buck-out/dev/gen/caffe2/test/jit#binary,link-tree/test_jit.py", line 3012
>             def forward(self, x):
>                 return self.fn(x)
>                        ~~~~~~~ <--- HERE
>
```

Reviewed By: eellison

Differential Revision: D22554479

fbshipit-source-id: 03e432ea92ed973cc57ff044da80ae7a36f6af4c
2020-07-15 16:54:43 -07:00
71c3b397a6 Reduce Image Size (2) (#41301)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41301

Reviewed By: malfet

Differential Revision: D22559626

Pulled By: ssylvain

fbshipit-source-id: 32da88b7efe2e8d134f74b6ff2dff0bffede012c
2020-07-15 16:47:15 -07:00
5bd71259ed remove blacklist reference (#41447)
Summary:
Reference: issue https://github.com/pytorch/pytorch/issues/41443
Removed a blacklist reference.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41447

Reviewed By: ezyang

Differential Revision: D22542428

Pulled By: SplitInfinity

fbshipit-source-id: 09728c7718bb99ff56b16fda6971ebd887a99c97
2020-07-15 16:25:12 -07:00
b7147fe6d7 Learnable Fake Quantizer Benchmark Test (#41429)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41429

This diff contains the benchmark test to evaluate the speed of executing the learnable fake quantization operator, in both the forward and backward paths, for both per-tensor and per-channel usages.

Test Plan:
Inside the path `torch/benchmarks/operator_benchmark` (The root directory will be `caffe2` inside `fbcode` if working on a devvm):
- On a devvm, run the command `buck run pt:fake_quantize_learnable_test`
- On a personal laptop, run the command `python3 -m pt.fake_quantize_learnable_test`

Benchmark Results (Locally on CPU):
Each sample has dimensions **3x256x256**; each batch has 16 samples (`N=16`).
- Per Tensor Forward: 0.023688 sec/sample
- Per Tensor Backward: 0.165926 sec/sample
- Per Channel Forward: 0.040432 sec/sample
- Per Channel Backward: 0.173528 sec/sample

Reviewed By: vkuzo

Differential Revision: D22535252

fbshipit-source-id: e8e953ff2de2107c6f2dde4c8d5627bdea67ef7f
2020-07-15 14:00:20 -07:00
2b8db35c7e [reland][DNNL]:enable batchnorm3d (#40995)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40995

Test Plan: Imported from OSS

Differential Revision: D22440765

Pulled By: VitalyFedyunin

fbshipit-source-id: b4bf427bbb7010ee234a54e81ade371627f9e82c
2020-07-15 13:56:47 -07:00
b48ee175e6 [reland][DNNL]:enable conv3d (#40691)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40691

Test Plan: Imported from OSS

Differential Revision: D22296548

Pulled By: VitalyFedyunin

fbshipit-source-id: 8e2a7cf14e8bdfa2f29b735a89e8c83f6119e68d
2020-07-15 13:54:41 -07:00
ff6e560301 Add C++ end to end test for RPC and distributed autograd. (#36893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36893

Adding an end to end test for running a simple training loop in C++
for the distributed RPC framework.

The goal of this change is to enable LeakSanitizer and potentially catch memory
leaks in the Future. Enabling LSAN with python multiprocessing is tricky and we
haven't found a solution for this. As a result, adding a C++ test that triggers
most of the critical codepaths would be good for now.

As an example, this unit test would've caught the memory leak fixed by:
https://github.com/pytorch/pytorch/pull/31030
ghstack-source-id: 107781167

Test Plan:
1) Verify the test catches memory leaks.
2) waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D21112208

fbshipit-source-id: 4eb2a6b409253108f6b6e14352e593d250c7a64d
2020-07-15 12:59:19 -07:00
8940a4e684 Pull upstream select_compute_arch from cmake for Ampere (#41133)
Summary:
This pulls the following merge requests from CMake upstream:
- https://gitlab.kitware.com/cmake/cmake/-/merge_requests/4979
- https://gitlab.kitware.com/cmake/cmake/-/merge_requests/4991

The above two merge requests improve the Ampere build:
- If `TORCH_CUDA_ARCH_LIST` is not set, it can now automatically pick up 8.0 as part of its default value
- If `TORCH_CUDA_ARCH_LIST=Ampere`, it no longer fails with `Unknown CUDA Architecture Name Ampere in CUDA_SELECT_NVCC_ARCH_FLAGS`

Code related to architectures < 3.5 is manually removed because PyTorch no longer supports them.

cc: ngimel ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41133

Reviewed By: malfet

Differential Revision: D22540547

Pulled By: ezyang

fbshipit-source-id: 6e040f4054ef04f18ebb7513497905886a375632
2020-07-15 12:53:32 -07:00
c62550e3f4 Cuda Support for Learnable Fake Quantize Per Channel (GPU) (#41262)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41262

In this diff, an implementation is provided to support the GPU kernel running the learnable fake quantize per channel kernels.

Test Plan: On a devvm, run `buck test //caffe2/test:quantization -- learnable` to test both the forward and backward for the learnable per channel fake quantize kernels. The test will test the `cuda` version if a GPU is available.

Reviewed By: vkuzo

Differential Revision: D22478832

fbshipit-source-id: 2731bd8b57bc83416790f6d65ef42d450183873c
2020-07-15 12:23:43 -07:00
4367a73399 Cuda Support for Learnable Fake Quantize Per Tensor (GPU) (#41127)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41127

In this diff, an implementation is provided to support the GPU kernel running the learnable fake quantize per tensor kernels.

Test Plan: On a devvm, run `buck test //caffe2/test:quantization -- learnable` to test both the forward and backward for the learnable per tensor fake quantize kernels. The test will test the `cuda` version if a gpu is available.

Reviewed By: z-a-f

Differential Revision: D22435037

fbshipit-source-id: 515afde13dd224d21fd47fb7cb027ee8d704cbdd
2020-07-15 12:21:48 -07:00
225289abc6 Adding epsilon input argument to the Logit Op
Summary: Adding epsilon input argument to the Logit Op

Test Plan: Added test_logit test case.

Reviewed By: hyuen

Differential Revision: D22537133

fbshipit-source-id: d6f89afd1589fda99f09550a9d1b850cfc0b9ee1
2020-07-15 12:16:19 -07:00
954c260061 Revert D22480638: [pytorch][PR] Add non-deterministic alert to CUDA operations that use atomicAdd()
Test Plan: revert-hammer

Differential Revision:
D22480638 (6ff306b8b5)

Original commit changeset: 4cc913cb3ca6

fbshipit-source-id: e47fa14b5085bb2b74a479bd0830efc2d7604eea
2020-07-15 12:10:05 -07:00
008ab27b22 [quant][pyper] Add embedding_bag weight quantize and dequantize ops (#41293)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41293

Add new operators that do quantization and packing for the 8-bit and 4-bit embedding bag operators.
This is an initial change to help unblock testing. It will be followed by graph-mode passes to enable quantization of the embedding_bag module.

Note to reviewers: Future PRs will replace this op with a separate quantize and pack operator and add support for floating point scale and zero point.

Test Plan:
python test/test_quantization.py TestQuantizedEmbeddingBag

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D22506700

fbshipit-source-id: 090cc85a8f56da417e4b7e45818ea987ae97ca8a
2020-07-15 11:34:53 -07:00
d5ae4a07ef DDP Communication Hook Main Structure (#40848)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40848

Sub-tasks 1 and 2 of [39272](https://github.com/pytorch/pytorch/issues/39272)
ghstack-source-id: 107787878

Test Plan:
1\. Perf tests to validate that the new code (`if` conditions before `allreduce`) doesn't slow down today's DDP. Execute the following command with the diff patched/unpatched (with V25):

* **Unpatched Runs:**
```
hg checkout D22514243
flow-cli canary pytorch.benchmark.main.workflow --parameters-json '{"model_arch": "resnet50", "batch_size": 32, "world_size": 1, "use_fp16": false, "print_percentile": true, "backend": "gloo"}' --entitlement pytorch_ftw_gpu --name test_torchelastic_gloo_masterD22514243 --run-as-secure-group pytorch_distributed
```
* **Run 1 (unpatched):** `elastic_gang:benchmark_single.elastic_operator` Ran for 2 mins 59 s
f204539235
```
sum:
8 GPUs: p25:  0.156   205/s  p50:  0.160   200/s  p75:  0.164   194/s  p90:  0.169   189/s  p95:  0.173   185/s
fwds:
8 GPUs: p25:  0.032  1011/s  p50:  0.032  1006/s  p75:  0.032  1000/s  p90:  0.032   992/s  p95:  0.033   984/s
bwds:
8 GPUs: p25:  0.121   265/s  p50:  0.125   256/s  p75:  0.129   248/s  p90:  0.134   239/s  p95:  0.137   232/s
opts:
8 GPUs: p25:  0.003  11840/s  p50:  0.003  11550/s  p75:  0.004  8037/s  p90:  0.006  5633/s  p95:  0.007  4631/s
```
* **Run 2 (unpatched):** `elastic_gang:benchmark_single.elastic_operator` Ran for 3 mins 1 s
f204683840
```
sum:
8 GPUs: p25:  0.145   220/s  p50:  0.147   217/s  p75:  0.150   213/s  p90:  0.154   207/s  p95:  0.157   204/s
fwds:
8 GPUs: p25:  0.032  1015/s  p50:  0.032  1009/s  p75:  0.032  1002/s  p90:  0.032   994/s  p95:  0.032   990/s
bwds:
8 GPUs: p25:  0.107   297/s  p50:  0.111   288/s  p75:  0.115   278/s  p90:  0.119   268/s  p95:  0.122   262/s
opts:
8 GPUs: p25:  0.003  11719/s  p50:  0.004  9026/s  p75:  0.006  5160/s  p90:  0.009  3700/s  p95:  0.010  3184/s
```

* **Patched Runs:**
```
hg checkout D22328310
flow-cli canary pytorch.benchmark.main.workflow --parameters-json '{"model_arch": "resnet50", "batch_size": 32, "world_size": 1, "use_fp16": false, "print_percentile": true, "backend": "gloo"}' --entitlement pytorch_ftw_gpu --name test_torchelastic_gloo_localD22328310 --run-as-secure-group pytorch_distributed
```
* **Run 1 (patched):** `elastic_gang:benchmark_single.elastic_operator` Ran for 3 mins 30 s
f204544541
```
sum:
8 GPUs: p25:  0.148   216/s  p50:  0.152   210/s  p75:  0.156   205/s  p90:  0.160   200/s  p95:  0.163   196/s
fwds:
8 GPUs: p25:  0.032  1011/s  p50:  0.032  1005/s  p75:  0.032   999/s  p90:  0.032   991/s  p95:  0.033   984/s
bwds:
8 GPUs: p25:  0.112   286/s  p50:  0.116   275/s  p75:  0.120   265/s  p90:  0.125   256/s  p95:  0.128   250/s
opts:
8 GPUs: p25:  0.003  11823/s  p50:  0.003  10948/s  p75:  0.004  7225/s  p90:  0.007  4905/s  p95:  0.008  3873/s
```
* **Run 2 (patched):** `elastic_gang:benchmark_single.elastic_operator`
Ran for 3 mins 14 s
f204684520
```
sum:
8 GPUs: p25:  0.146   219/s  p50:  0.147   217/s  p75:  0.150   214/s  p90:  0.152   210/s  p95:  0.153   208/s
fwds:
8 GPUs: p25:  0.032  1013/s  p50:  0.032  1008/s  p75:  0.032  1002/s  p90:  0.032   996/s  p95:  0.032   990/s
bwds:
8 GPUs: p25:  0.107   299/s  p50:  0.110   290/s  p75:  0.114   280/s  p90:  0.117   274/s  p95:  0.119   269/s
opts:
8 GPUs: p25:  0.003  11057/s  p50:  0.005  6490/s  p75:  0.008  4110/s  p90:  0.010  3309/s  p95:  0.010  3103/s
```
* **Run 3 (patched):** `elastic_gang:benchmark_single.elastic_operator` Ran for 2 mins 54 s
f204692872
```
sum:
8 GPUs: p25:  0.145   220/s  p50:  0.147   217/s  p75:  0.150   213/s  p90:  0.154   207/s  p95:  0.156   204/s
fwds:
8 GPUs: p25:  0.032  1001/s  p50:  0.032   995/s  p75:  0.032   988/s  p90:  0.033   980/s  p95:  0.033   973/s
bwds:
8 GPUs: p25:  0.108   295/s  p50:  0.111   287/s  p75:  0.114   280/s  p90:  0.119   269/s  p95:  0.121   264/s
opts:
8 GPUs: p25:  0.003  11706/s  p50:  0.003  9257/s  p75:  0.005  6333/s  p90:  0.008  4242/s  p95:  0.009  3554/s
```

* **Memory:**
   * Unpatched:
```
CUDA Memory Summary After                     first iteration: |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| Active memory         |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    3490 MB |    3490 MB |    3490 MB |       0 B  |
|       from large pool |    3434 MB |    3434 MB |    3434 MB |       0 B  |
|       from small pool |      56 MB |      56 MB |      56 MB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |  315332 KB |  343472 KB |    2295 MB |    1987 MB |
|       from large pool |  311166 KB |  340158 KB |    2239 MB |    1935 MB |
|       from small pool |    4166 KB |    4334 KB |      56 MB |      52 MB |
|---------------------------------------------------------------------------|
| Allocations           |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| Active allocs         |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| GPU reserved segments |     102    |     102    |     102    |       0    |
|       from large pool |      74    |      74    |      74    |       0    |
|       from small pool |      28    |      28    |      28    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      34    |      54    |     430    |     396    |
|       from large pool |      15    |      48    |     208    |     193    |
|       from small pool |      19    |      19    |     222    |     203    |
|===========================================================================|

```
   * Patched:
```
CUDA Memory Summary After                     first iteration: |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| Active memory         |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    3490 MB |    3490 MB |    3490 MB |       0 B  |
|       from large pool |    3434 MB |    3434 MB |    3434 MB |       0 B  |
|       from small pool |      56 MB |      56 MB |      56 MB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |  315332 KB |  343472 KB |    2295 MB |    1987 MB |
|       from large pool |  311166 KB |  340158 KB |    2239 MB |    1935 MB |
|       from small pool |    4166 KB |    4334 KB |      56 MB |      52 MB |
|---------------------------------------------------------------------------|
| Allocations           |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| Active allocs         |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| GPU reserved segments |     102    |     102    |     102    |       0    |
|       from large pool |      74    |      74    |      74    |       0    |
|       from small pool |      28    |      28    |      28    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      34    |      54    |     431    |     397    |
|       from large pool |      15    |      48    |     208    |     193    |
|       from small pool |      19    |      19    |     223    |     204    |
|===========================================================================|

```

2\. As of v18: `python test/distributed/test_c10d.py`
```
....................s.....s.....................................................s................................
----------------------------------------------------------------------
Ran 114 tests in 215.983s

OK (skipped=3)

```

3\. Additional tests in `python test/distributed/test_c10d.py`:
* `test_ddp_comm_hook_future_passing_cpu`: This unit test verifies whether the Future object is passed properly. The callback function creates a Future object and sets a value to it.
* `_test_ddp_comm_hook_future_passing_gpu`: This unit test verifies whether the Future object is passed properly. The callback function creates a Future object and sets a value to it.
* `test_ddp_comm_hook_future_passing_gpu_gloo`: This unit test executes _test_ddp_comm_hook_future_passing_gpu using gloo backend.
* `test_ddp_comm_hook_future_passing_gpu_nccl`: This unit test executes _test_ddp_comm_hook_future_passing_gpu using nccl backend.
* `test_ddp_invalid_comm_hook_init`: This unit test makes sure that register_comm_hook properly checks the format of the hook defined by the user. The Python hook must be callable. This test also checks whether the bucket annotation is checked properly if defined.
* `test_ddp_invalid_comm_hook_return_type`: This test checks whether the return annotation is checked properly if defined. It also checks whether an internal error is thrown if the return type is incorrect and the user hasn't specified a return type annotation.
* `test_ddp_comm_hook_register_just_once`: DDP communication hook can only be registered once. This test validates whether the error is thrown properly when register_comm_hook is called more than once.
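
For reference, a minimal no-op hook in the shape these tests exercise (a sketch: `_register_comm_hook` and `bucket.get_tensors()` are inferred from the test names above, and `ddp_model` is a placeholder for an existing DistributedDataParallel instance):

```
import torch

def noop_hook(state, bucket):
    # hand the bucket's gradients back unchanged via a completed future
    fut = torch.futures.Future()
    fut.set_result(bucket.get_tensors())
    return fut

# ddp_model is assumed to be an existing DistributedDataParallel instance
ddp_model._register_comm_hook(state=None, hook=noop_hook)
```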

Reviewed By: ezyang

Differential Revision: D22328310

fbshipit-source-id: 77a6a71808e7b6e947795cb3fcc68c8c8f024549
2020-07-15 11:25:29 -07:00
c86699d425 [cmake] Use PROJECT_SOURCE_DIR instead of CMAKE_* (#41387)
Summary:
Add support for including pytorch via an add_subdirectory().
This requires using PROJECT_* instead of CMAKE_*, since the CMAKE_*
variables refer to the top-most project that includes pytorch.

TEST=add_subdirectory() into a pytorch checkout and build.
There are still some hardcoded references to TORCH_SRC_DIR; I will
fix them in a follow-on commit. For now you can create a symlink to
<pytorch>/torch/ in your project.

Change-Id: Ic2a8aec3b08f64e2c23d9e79db83f14a0a896abc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41387

Reviewed By: zhangguanheng66

Differential Revision: D22539944

Pulled By: ezyang

fbshipit-source-id: b7e9631021938255f0a6ea897a7abb061759093d
2020-07-15 11:09:05 -07:00
563b60b890 Fix flaky test_stream_event_nogil due to missing event sync (#41398)
Summary:
The test asserts that the stream is "ready" but doesn't wait for the
event to be "executed", which makes it fail on some platforms where the
`query` call occurs "soon enough".
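
A minimal sketch of the fixed pattern (assuming the test's stream/event structure):

```
import torch

s = torch.cuda.Stream()
e = torch.cuda.Event()
with torch.cuda.stream(s):
    torch.randn(1024, 1024, device="cuda") @ torch.randn(1024, 1024, device="cuda")
    e.record(s)
e.synchronize()   # the missing step: block until the event has executed
assert s.query()  # only now is the stream guaranteed to be "ready"
```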

Fixes https://github.com/pytorch/pytorch/issues/38807

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41398

Reviewed By: zhangguanheng66

Differential Revision: D22540012

Pulled By: ezyang

fbshipit-source-id: 6f56d951e48133ce4f6a9a54534298b7d2877c80
2020-07-15 11:03:35 -07:00
6ff306b8b5 Add non-deterministic alert to CUDA operations that use atomicAdd() (#40056)
Summary:
Issue https://github.com/pytorch/pytorch/issues/15359

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40056

Differential Revision: D22480638

Pulled By: ezyang

fbshipit-source-id: 4cc913cb3ca6d4206de80f4665bbc9031aa3ca01
2020-07-15 10:57:32 -07:00
dddac948a3 Add CUDA to pooling benchmark configs (#41438)
Summary:
Related to https://github.com/pytorch/pytorch/issues/41368

These benchmarks already support CUDA, so there is no reason for CUDA not to be in the benchmark config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41438

Reviewed By: zhangguanheng66

Differential Revision: D22540756

Pulled By: ezyang

fbshipit-source-id: 621eceff37377c1ab06ff7483b39fc00dc34bd46
2020-07-15 10:51:43 -07:00
3971777ebb Krovatkin/reenable test tensorexpr (#41445)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41445

Reviewed By: ZolotukhinM

Differential Revision: D22543075

Pulled By: Krovatkin

fbshipit-source-id: fd8c0a94f5b3aff34d2b444dbf551425fdc1df04
2020-07-15 10:42:40 -07:00
04320a47d7 Add optimizer_for_mobile doc into python api root doc (#41211)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41211

Test Plan: Imported from OSS

Reviewed By: xta0

Differential Revision: D22543608

fbshipit-source-id: bf522a6c94313bf2696eca3c5bb5812ea98998d0
2020-07-15 09:57:40 -07:00
3a63a939d4 Revert D22517785: [pytorch][PR] Enable TF32 support for cuBLAS
Test Plan: revert-hammer

Differential Revision:
D22517785 (288ece89e1)

Original commit changeset: 87334c893561

fbshipit-source-id: 0a0674f49c1bcfc98f7f88af5a8c7de93b76e458
2020-07-15 08:15:48 -07:00
8548a21c00 Revert D22543215: Adjust bound_shape_inferencer to take 4 inputs for FCs
Test Plan: revert-hammer

Differential Revision:
D22543215 (86a2bdc35e)

Original commit changeset: 0977fca06630

fbshipit-source-id: b440f9b1eaeb35ec8b08e899890691e7a77a9f6d
2020-07-15 08:10:39 -07:00
f153b35b9b Shape inference for SparseToDense in ExpertCombiner
Summary: Adding shape inference for SparseToDense. The proposed implementation only works when data_to_infer_dim is given; otherwise the SparseToDense output dimension depends on the max value of the input tensor.
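
A small illustration of why the output dimension is data-dependent (a sketch; the scatter-add semantics are an assumption):

```
import numpy as np

indices = np.array([0, 5, 2])
values = np.ones((3, 4), dtype=np.float32)
rows = int(indices.max()) + 1       # data-dependent: 6 here, not derivable from shapes
dense = np.zeros((rows, 4), dtype=np.float32)
np.add.at(dense, indices, values)   # scatter-add, the assumed SparseToDense semantics
```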

Test Plan:
buck test //caffe2/caffe2/python:sparse_to_dense_test
buck test //caffe2/caffe2/python:hypothesis_test -- test_sparse_to_dense

Dper3 Changes:
f204594813
buck test dper3/dper3_models/ads_ranking/model_impl/sparse_nn/tests:sparse_nn_lib_test

Reviewed By: zhongyx12, ChunliF

Differential Revision: D22479511

fbshipit-source-id: 8983a9baea8853deec53ad6f795c874c3fb93de0
2020-07-15 08:04:48 -07:00
86a2bdc35e Adjust bound_shape_inferencer to take 4 inputs for FCs (#41452)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41452

The model exported from online training workflow with int8 quantization contains FCs with 4 inputs. The extra input is the quant_param blob. This diff is to adjust the bound_shape_inferencer to get shape info for the quant_param input.

Test Plan:
```
buck test caffe2/caffe2/opt:bound_shape_inference_test
```

Reviewed By: anurag16

Differential Revision: D22543215

fbshipit-source-id: 0977fca06630e279d47292e6b44f3d8180a767a5
2020-07-15 01:43:39 -07:00
14f19ab833 Port index_select to ATen (CUDA) (#39946)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39946

Reviewed By: ngimel

Differential Revision: D22520160

Pulled By: mruberry

fbshipit-source-id: 7eb3029e3917e793f3c020359acb0989d5deb61e
2020-07-15 01:11:32 -07:00
9552ec787c Revert D22516606: [pytorch][PR] Temporary fix for determinant bug on CPU
Test Plan: revert-hammer

Differential Revision:
D22516606 (fcd6d91045)

Original commit changeset: 7ea8299b9d2c

fbshipit-source-id: 41e19d5e1ba843cd70dce677869892f2e33fac09
2020-07-14 23:44:32 -07:00
921d2a164f SparseAdagrad/RowWiseSparseAdagrad mean fusion on CPU & GPU and dedup version for RowWiseSparse mean fusion on GPU
Summary:
1. Support SparseAdagradFusedWithSparseLengthsMeanGradient and RowWiseSparseAdagradFusedWithSparseLengthsMeanGradient on CPU and GPU
2. Add the dedup implementation of fused RowWiseAdagrad op on GPUs for mean pooling

Reviewed By: xianjiec

Differential Revision: D22165603

fbshipit-source-id: 743fa55ed5893c34bc6406ddfbbbb347b88091d1
2020-07-14 22:36:16 -07:00
44b9306d0a Export replaceAllUsesAfterNodeWith for PythonAPI (#41414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41414

This diff exports replaceAllUsesAfterNodeWith to PythonAPI.

Test Plan: Tested locally. Please let me know if there is a set of unit tests to be passed outside of the default ones triggered by Sandcastle.

Reviewed By: soumith

Differential Revision: D22523211

fbshipit-source-id: 3f075bafa6208ada462abc57d495c15179a6e53d
2020-07-14 22:20:19 -07:00
20f3051f7d [adaptive_]max_pool{1,2,3}d: handle edge case when input is filled with -inf (#40665)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40665

Differential Revision: D22463538

Pulled By: ezyang

fbshipit-source-id: 7e08fd0205926911d45aa150012154637e64a8d4
2020-07-14 21:51:40 -07:00
fcd6d91045 Temporary fix for determinant bug on CPU (#35136)
Summary:
Changelog:
- Make diagonal contiguous

Temporarily Fixes https://github.com/pytorch/pytorch/issues/34061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/35136

Reviewed By: vincentqb

Differential Revision: D22516606

Pulled By: ezyang

fbshipit-source-id: 7ea8299b9d2c1c244995955b333a1dffb0cdff73
2020-07-14 21:20:50 -07:00
f074994a31 vectorize rounding ops (#41439)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41439

use RoundToFloat16 on arrays

Test Plan: layernorm unittest

Reviewed By: venkatacrc

Differential Revision: D22540118

fbshipit-source-id: dc84fd22b5dc6a3bd15ad4ec1eecb9db13d64e97
2020-07-14 20:59:39 -07:00
96f124e623 remove template arguments of layernorm
Summary:
Remove the layernorm templates and make everything float, since that's the only variant in use.
Minor fixes in logging and testing.

Test Plan: ran the test

Reviewed By: venkatacrc

Differential Revision: D22527359

fbshipit-source-id: d6eec362a6e88e1c12fddf820ae629ede13fb2b8
2020-07-14 20:56:23 -07:00
0b73ea0ea2 Change BCELoss size mismatch warning into an error (#41426)
Summary:
BCELoss currently uses different broadcasting semantics than numpy. Since previous versions of PyTorch have thrown a warning in these cases telling the user that input sizes should match, and since the CUDA and CPU results differ when sizes do not match, it makes sense to upgrade the size mismatch warning to an error.

We can consider supporting numpy broadcasting semantics in BCELoss in the future if needed.
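
A short example of the new behavior:

```
import torch

loss = torch.nn.BCELoss()
inp = torch.rand(4, 1, requires_grad=True)
target = torch.rand(4)                  # shape (4,) vs. input shape (4, 1)
# loss(inp, target)                     # previously a warning; now a size-mismatch error
out = loss(inp, target.unsqueeze(1))    # make the shapes match explicitly
```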

Closes https://github.com/pytorch/pytorch/issues/40023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41426

Reviewed By: zou3519

Differential Revision: D22540841

Pulled By: ezyang

fbshipit-source-id: 6c6d94c78fa0ae30ebe385d05a9e3501a42b3652
2020-07-14 20:34:06 -07:00
fd0329029f Fix flaky profiler and test_callback_simple RPC tests (#41287)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41287

Profiler tests that exercise profiling with builtin functions, as well as the `test_callback_simple` test, have been broken for a while. This diff fixes that by preferring c10 ops over non-c10 ops in our operator matching logic.

The result is that these ops go through the c10 dispatch and thus have profiling enabled. For `test_callback_simple` this means we choose `aten::add.Tensor` over `aten::add.Int`, which fixes the type issue.

Test Plan:
Ensured that the tests are no longer flaky by running them a bunch
of times.

Reviewed By: vincentqb

Differential Revision: D22489197

fbshipit-source-id: 8452b93e4d45703453f77d968350c0d32f3f63fe
2020-07-14 19:26:44 -07:00
0d4a110c28 [JIT] Fix dead stores in JIT (#41202)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41202

This commit fixes dead stores in JIT surfaced by the Quality Analyzer.

Test Plan: Continuous integration.

Reviewed By: jerryzh168

Differential Revision: D22461492

fbshipit-source-id: c587328f952054fb9449848e90b7d28a20aed4af
2020-07-14 17:59:50 -07:00
4ddf27ba48 [op-bench] check device attribute in user inputs
Summary: The device attribute in the op benchmark can only be 'cpu' or 'cuda', so this diff adds a check for it.

Test Plan: buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --warmup_iterations 1 --iterations 1

Reviewed By: ngimel

Differential Revision: D22538252

fbshipit-source-id: 3e5af72221fc056b8d867321ad22e35a2557b8c3
2020-07-14 17:17:59 -07:00
a0f110190c clamp Categorical logits from -inf to finfo.min when calculating entropy (#41002)
Summary:
Fixes gh-40553 by clamping logit values when calculating Categorical.entropy
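
The essence of the fix, as a sketch (the actual change lives inside Categorical.entropy):

```
import torch

logits = torch.tensor([0.0, float("-inf")])
safe = logits.clamp(min=torch.finfo(logits.dtype).min)  # -inf -> finfo.min
probs = torch.softmax(safe, dim=-1)
entropy = -(probs * safe).sum(-1)   # no NaN from 0 * -inf anymore
```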

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41002

Reviewed By: mruberry

Differential Revision: D22436432

Pulled By: ngimel

fbshipit-source-id: 08b7c7b0c15ab4e5a56b3a8ec0d0237ad360202e
2020-07-14 16:21:12 -07:00
359cdc20e2 Revert D22432885: [pytorch][PR] unsafe_split, unsafe_split_with_sizes, unsafe_chunk operations
Test Plan: revert-hammer

Differential Revision:
D22432885 (c17670ac50)

Original commit changeset: 324aef091b32

fbshipit-source-id: 6b7c52bde46932e1cf77f61e7035d8a641b0beb6
2020-07-14 16:06:42 -07:00
144f04e7ef Fix qobserver test
Summary: Change the device config in qobserver test to a string to honor --device flag.

Test Plan: buck run caffe2/benchmarks/operator_benchmark/pt:qobserver_test  -- --iterations 1 --device cpu

Reviewed By: ngimel

Differential Revision: D22536379

fbshipit-source-id: 8926b2393be1f52f9183f8205959a3ff18e3ed2a
2020-07-14 15:47:03 -07:00
c68c5ea0e6 Upgrade cpp docs Sphinx/breathe/exhale to latest version (#41312)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41312

I was hoping that exhale had gotten incremental recompilation
in its latest version, but experimentally this does not seem
to have been the case.  Still, I had gotten the whole shebang
to be working on the latest version of these packages, so might
as well land the upgrade.  There was one bug in Optional.h that
I had to fix; see the cited bug report.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D22526349

Pulled By: ezyang

fbshipit-source-id: d4169c2f48ebd8dfd8a593cc8cd232224d008ae9
2020-07-14 15:35:43 -07:00
05207b7371 .circleci: Re-split postnightly into its own thing (#41354)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41354

The nightly pipeline has the potential to be flaky and thus the html
pages have the potential not to be updated.

This should actually be done as an automatic lambda job that runs
whenever the S3 bucket updates, but this is an intermediate step in
order to get there.

Closes https://github.com/pytorch/pytorch/issues/40998

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22530283

Pulled By: seemethere

fbshipit-source-id: 0d80b7751ede83e6dd466690cc0a0ded68f59c5d
2020-07-14 14:49:01 -07:00
c17670ac50 unsafe_split, unsafe_split_with_sizes, unsafe_chunk operations (#39299)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/36403

Copy-paste of the issue description:

* Escape hatch: Introduce unsafe_* versions of the three functions above that keep the current behavior (outputs not tracked as views). The documentation will explain in detail why they are unsafe and when it is safe to use them (basically, only the outputs OR the input can be modified in place, but not both; otherwise, you will get wrong gradients).
* Deprecation: Use the CreationMeta on views to track views created by these three ops and throw a warning when any of the views is modified in place, saying that this is deprecated and will raise an error soon. Users that really need to modify these views in place should look at the doc of the unsafe_* version to make sure their use case is valid:
  * If it is not, then pytorch is computing wrong gradients for their use case and they should not do inplace anymore.
  * If it is, then they can use the unsafe_* version to keep the current behavior.
* Removal: Use the CreationMeta on views to prevent any inplace on these views (like we do for all other views coming from multi-output Nodes). Users will still be able to use the unsafe_* versions if they really need to do this.
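
A usage sketch, assuming the unsafe_split entry point this PR introduces:

```
import torch

x = torch.arange(6.0)
a, b = torch.unsafe_split(x, 3)  # outputs are NOT tracked as views of x
a.add_(1.0)                      # fine on its own, but modifying both the
                                 # outputs AND x in place gives wrong gradients
```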

Note about BC-breaking:
- This PR changes the behavior of the regular function by making them return proper views now. This is a modification that the user will be able to see.
- We skip all the view logic for these views and so the code should behave the same as before (except the change in the `._is_view()` value).
- Even though the view logic is not performed, we do raise deprecation warnings for the cases where doing these ops would throw an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39299

Differential Revision: D22432885

Pulled By: albanD

fbshipit-source-id: 324aef091b32ce69dd067fe9b13a3f17d85d0f12
2020-07-14 14:15:41 -07:00
e2c4c2f102 addmm: Reduce constant time overhead (#41374)
Summary:
Fixes the overhead reported by ngimel in https://github.com/pytorch/pytorch/pull/40927#issuecomment-657709646

As it turns out, `Tensor.size(n)` has more overhead than `Tensor.sizes()[n]`. Since addmm does a lot of introspection of the input matrix sizes and strides, this added up to a noticeable (~1 us) constant time overhead.

With this change, a 1x1 matmul takes 2.85 us on my machine compared to 2.90 us on pytorch 1.5.
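
A quick way to reproduce the constant-overhead measurement (a sketch; absolute numbers are machine-dependent):

```
import torch
from timeit import timeit

a = torch.randn(1, 1)
b = torch.randn(1, 1)
n = 100000
us = timeit(lambda: torch.mm(a, b), number=n) / n * 1e6
print(f"1x1 matmul: {us:.2f} us per call")
```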

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41374

Reviewed By: ailzhang

Differential Revision: D22519924

Pulled By: ngimel

fbshipit-source-id: b29504bee7de79ce42e5e50f91523dde42b073b7
2020-07-14 13:47:16 -07:00
288ece89e1 Enable TF32 support for cuBLAS (#40800)
Summary:
Benchmark on a fully connected network and torchvision models (time in seconds) on GA100:

| model              | batch size | forward(TF32) | forward(FP32) | backward(TF32) | backward(FP32) |
|--------------------|------------|---------------|---------------|----------------|----------------|
| FC 512-128-32-8    | 512        | 0.000211      | 0.000321      | 0.000499       | 0.000532       |
| alexnet            | 512        | 0.0184        | 0.0255        | 0.0486         | 0.0709         |
| densenet161        | 128        | 0.0665        | 0.204         | 0.108          | 0.437          |
| googlenet          | 256        | 0.0925        | 0.110         | 0.269          | 0.326          |
| inception_v3       | 256        | 0.155         | 0.214         | 0.391          | 0.510          |
| mnasnet1_0         | 512        | 0.108         | 0.137         | 0.298          | 0.312          |
| mobilenet_v2       | 512        | 0.114         | 0.294         | 0.133          | 0.303          |
| resnet18           | 512        | 0.0722        | 0.100         | 0.182          | 0.228          |
| resnext50_32x4d    | 256        | 0.170         | 0.237         | 0.373          | 0.479          |
| shufflenet_v2_x1_0 | 512        | 0.0463        | 0.0473        | 0.125          | 0.123          |
| squeezenet1_0      | 512        | 0.0870        | 0.0948        | 0.205          | 0.214          |
| vgg16              | 256        | 0.167         | 0.234         | 0.401          | 0.502          |
| wide_resnet50_2    | 512        | 0.186         | 0.310         | 0.415          | 0.638          |
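
A sketch of toggling the behavior, assuming the `allow_tf32` switch this work exposes:

```
import torch

torch.backends.cuda.matmul.allow_tf32 = True   # TF32 matmuls on Ampere: faster, slightly less precise
# ... run matmul-heavy workloads ...
torch.backends.cuda.matmul.allow_tf32 = False  # back to full FP32 accuracy
```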

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40800

Reviewed By: mruberry

Differential Revision: D22517785

Pulled By: ngimel

fbshipit-source-id: 87334c8935616f72a6af5abbd3ae69f76923dc3e
2020-07-14 13:21:10 -07:00
c528faac7d [ROCm] Skip problematic mgpu tests on ROCm3.5 (#41409)
Summary:
nccl tests and parallelize_bmuf_distributed test are failing on rocm3.5.1. Skipping these tests to upgrade the CI to rocm3.5.1

jeffdaily sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41409

Reviewed By: orionr

Differential Revision: D22528928

Pulled By: seemethere

fbshipit-source-id: 928196b7a62a441d391e69f54b278313ecc75d77
2020-07-14 11:55:43 -07:00
5f146a4125 fix include file path in unary ops
Summary: fix include file path in unary ops

Test Plan: compile

Reviewed By: amylittleyang

Differential Revision: D22527312

fbshipit-source-id: 589efd2231ff8bd3133cb7844738429927ecee68
2020-07-14 11:08:51 -07:00
4972cf06a2 [JIT] Add out-of-source-tree to_backend tests (#41145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41145

**Summary**
This commit adds out-of-source-tree tests for `to_backend`. These tests check
that a Module can be lowered to a backend, exported, loaded (in both
Python and C++) and executed.

**Fixes**
This commit fixes #40067.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D22510076

Pulled By: SplitInfinity

fbshipit-source-id: f65964ef3092a095740f06636ed5b1eb0884492d
2020-07-14 10:57:04 -07:00
0e7b9d4ff8 Fix logit doc (#41384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41384

Fix logit doc

Test Plan: unittest

Reviewed By: houseroad

Differential Revision: D22521730

fbshipit-source-id: 270462008c6ac73cd90aecd77c5de112fc93ea8d
2020-07-14 10:40:52 -07:00
87bf04fe12 AvgPool: Ensure all cells are valid in ceil mode (#41368)
Summary:
Closes https://github.com/pytorch/pytorch/issues/36977

This avoids the division by zero that was causing NaNs to appear in the output. `AvgPool2d` and `AvgPool3d` both had this issue on CPU and CUDA.
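
A repro in the spirit of issue #36977 (a sketch; the exact failing configuration is an assumption):

```
import torch

x = torch.randn(1, 1, 3, 3)
pool = torch.nn.AvgPool2d(kernel_size=2, stride=2, padding=1,
                          ceil_mode=True, count_include_pad=False)
out = pool(x)  # the trailing ceil-mode window used to contain zero valid
               # cells, dividing by zero and emitting NaN
print(torch.isnan(out).any())
```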

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41368

Reviewed By: ailzhang

Differential Revision: D22520013

Pulled By: ezyang

fbshipit-source-id: 3ece7829f858f5bc17c2c1d905266ac510f11194
2020-07-14 09:24:30 -07:00
535e8814a4 Add operators for LiteLMLSTM to Lite Interpreter (#41270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41270

The Smart Keyboard model for Oculus requires operators previously not in the lite interpreter: aten::exp (for floats), aten::ord, aten::lower, aten::__contains__.str_list, aten::slice.str, aten::strip, aten::split.str, and aten::__getitem__.str.

Test Plan:
Verify smart keyboard model can be used:
Check out next diff in stack and follow test instructions there

Reviewed By: iseeyuan

Differential Revision: D22289812

fbshipit-source-id: df574d5af4d4fafb40f0e209b66a93fe02d83020
2020-07-14 09:18:41 -07:00
befb22790f Fix a number of deprecation warnings (#40179)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40179

- Pass -Wno-psabi to suppress GCC's warning "The ABI for passing
  parameters with 64-byte alignment has changed in GCC 4.6"
- Fix use of deprecated data() accessor (and minor optimization: hoist
  accessor out of loop)
- Undeprecate NetDef.num_workers, no one is serious about fixing these
- Suppress warnings about deprecated pthreadpool types

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22234138

Pulled By: ezyang

fbshipit-source-id: 6a1601b6d7551a7e6487a44ae65b19acdcb7b849
2020-07-14 09:11:34 -07:00
13dd53b3d2 [Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D22523334

fbshipit-source-id: e687e26f68a4f923164a51ce0b69ec1d131b9022
2020-07-14 08:42:23 -07:00
e888c3bca1 Update torch.set_default_dtype doc (#41263)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41263

Test Plan: Imported from OSS

Differential Revision: D22482989

Pulled By: anjali411

fbshipit-source-id: 2aadfbb84bbab66f3111970734a37ba74d817ffd
2020-07-14 07:29:49 -07:00
c20426f86d Fix torch.cuda.check_error type errors (#41330)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41330

`torch.cuda.check_error` is annotated as taking an `int` as argument but when running `torch.cuda.check_error(34)` one would get:
```
TypeError: cudaGetErrorString(): incompatible function arguments. The following argument types are supported:
    1. (arg0: torch._C._cudart.cudaError) -> str

Invoked with: 34
```
Even if one explicitly cast the argument, running `torch.cuda.check_error(torch._C._cudart.cudaError(34))` would give:
```
AttributeError: 'str' object has no attribute 'decode'
```

This PR fixes both issues (thus allowing `check_error` to be called with an un-cast int) and adds a test.
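
A hedged usage sketch of the fixed call:

```
import torch

# after the fix, a plain Python int works directly
try:
    torch.cuda.check_error(34)   # any nonzero CUDA error code
except torch.cuda.CudaError as exc:
    print(exc)
```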
ghstack-source-id: 107628709

Test Plan: Unit tests

Reviewed By: ezyang

Differential Revision: D22500549

fbshipit-source-id: 9170c1e466dd554d471e928b26eb472a712da9e1
2020-07-14 00:47:14 -07:00
80d5b3785b Add torch.logit function (#41062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41062

Add torch.logit function
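
A short usage sketch:

```
import torch

x = torch.rand(4)
torch.logit(x)             # log(x / (1 - x))
torch.logit(x, eps=1e-6)   # clamp x into [eps, 1 - eps] before taking the log
```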

Test Plan: buck test mode/dev-nosan //caffe2/test:torch -- "logit"

Reviewed By: hl475

Differential Revision: D22406912

fbshipit-source-id: b303374f4c68850eb7477eb0645546a24b844606
2020-07-13 19:33:20 -07:00
34e11b45c9 Remove thrust casting from static_cast_with_inter_type (#39905)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39905

Reviewed By: ZolotukhinM

Differential Revision: D22510307

Pulled By: ngimel

fbshipit-source-id: 34357753fca4f2a8d5e2b1bbf8de8d642ca9bb20
2020-07-13 19:16:00 -07:00
5f6c6ed157 Fix FC issue (#41198)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41198

https://github.com/pytorch/pytorch/pull/39611 unified the signatures of some ops taking TensorOptions arguments by making them optional.
That has FC implications, but only for models written with a PyTorch version after that change (see the explanation in the description of that PR).

However, it also changed the default from `pin_memory=False` to `pin_memory=None`, which actually breaks FC for preexisting models too if they're re-exported with a newer PyTorch,
because we materialize default values when exporting. This is bad.

This PR reverts that particular part of https://github.com/pytorch/pytorch/pull/39611 to fix the FC breakage.
ghstack-source-id: 107475024

Test Plan: waitforsandcastle

Reviewed By: bhosmer

Differential Revision: D22461661

fbshipit-source-id: ba2776267c3bba97439df66ecb50be7c1971d20d
2020-07-13 18:48:56 -07:00
ca1b8ebbcb move misc implementation out of jit/__init__.py (#41154)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41154

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D22445213

Pulled By: suo

fbshipit-source-id: 200545715c5ef13beb1437f49e01efb21498ddb7
2020-07-13 16:59:55 -07:00
6392713584 add spaces in .md annotation for python indent (#41260)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41260

Reviewed By: ezyang

Differential Revision: D22504634

Pulled By: ailzhang

fbshipit-source-id: 9d2d605dc19b07896ee4b1811fcd34d4dcb9b0c7
2020-07-13 15:11:46 -07:00
b6e1944d35 .circleci: Explicitly remove nvidia apt repos (#41367)
Summary:
The nvidia apt repositories seem to be left over on the amd nodes so
let's just go ahead and remove them explicitly if we're not testing for
CUDA

Example: https://app.circleci.com/pipelines/github/pytorch/pytorch/190222/workflows/8f75b5cd-1afd-43dc-9fa7-f7b058f07b46/jobs/6223743/steps

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41367

Reviewed By: ezyang

Differential Revision: D22513844

Pulled By: seemethere

fbshipit-source-id: 6da4dd8423de5f7ec80c7904187cf80c1b91ab14
2020-07-13 15:05:57 -07:00
d601325de4 update operators in the mapping to fp16 emulation
Summary: add logit and swish to this list

Test Plan: f203925461

Reviewed By: amylittleyang

Differential Revision: D22506814

fbshipit-source-id: b449e4ea16354cb76915adb01cf317cffb494733
2020-07-13 14:08:24 -07:00
4196605776 helper function to print out all DDP-relevant env vars (#41297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41297

GH issue: https://github.com/pytorch/pytorch/issues/40105

Add a helper function to DDP to print out all relevant env vars for debugging

Test Plan:
test through unittest, example output:
 ---
env:RANK=3
env:LOCAL_RANK=N/A
env:WORLD_SIZE=N/A
env:MASTER_PORT=N/A
env:MASTER_ADDR=N/A
env:CUDA_VISIBLE_DEVICES=N/A
env:GLOO_SOCKET_IFNAME=N/A
env:GLOO_DEVICE_TRANSPORT=N/A
env:NCCL_SOCKET_IFNAME=N/A
env:NCCL_BLOCKING_WAIT=N/A
...
 ---

Reviewed By: mrshenli

Differential Revision: D22490486

fbshipit-source-id: 5dc7d2a18111e5a5a12a1b724d90eda5d35acd1c
2020-07-13 14:03:04 -07:00
6e6931e234 fix duplicate extern sdot and missing flags (#41195)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41195

`BLAS_F2C` is set in `THGeneral.h`.
`sdot` is redefined with a double return type when `BLAS_F2C` is set and `BLAS_USE_CBLAS_DOT` is not.

Test Plan: CircleCI green, ovrsource green

Reviewed By: malfet

Differential Revision: D22460253

fbshipit-source-id: 75f17b3e47da0ed33fcadc2843a57ad616f27fb5
2020-07-13 13:43:48 -07:00
0c77bd7c0b Quantization: preserving pre and post forward hooks (#37233)
Summary:
1. During convert(), preserve the module's **pre- and post-forward** hooks
2. During fusion, preserve only the module's **pre-forward** hooks (because after fusion the output is no longer the same)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/37233

Differential Revision: D22425141

Pulled By: jerryzh168

fbshipit-source-id: e69b81821d507dcd110d2ff3594ba94b9593c8da
2020-07-13 12:41:24 -07:00
c451ddaeda Add shape inference functions for int8 quantization related ops (#41215)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41215

To unblock int8 model productization on accelerators, we need the shape and type info for all the blobs after int8 quantization. This diff added shape inference functions for int8 quantization related ops.

Test Plan:
```
buck test caffe2/caffe2/quantization/server:int8_gen_quant_params_test
buck test caffe2/caffe2/quantization/server:fully_connected_dnnlowp_op_test
```

Reviewed By: hx89

Differential Revision: D22467487

fbshipit-source-id: 8298abb0df3457fcb15df81f423f557c1a11f530
2020-07-13 12:02:11 -07:00
7183fd20f8 Add interpolate-style overloads to aten::upsample* ops (#37176)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37176

The non-deprecated user-facing interface to these ops (F.interpolate)
has a good interface: output size and scale are both specified as
a scalar or list, and exactly one must be present.  These aten ops
have an older, clunkier interface where output size is required and
scales are specified as separate optional scalars per dimension.
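
For reference, the interpolate interface being matched:

```
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
up1 = F.interpolate(x, size=(16, 16))     # give an output size...
up2 = F.interpolate(x, scale_factor=2.0)  # ...or a scale, but never both
```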

This change adds new overloads to the aten ops that match the interface
of interpolate.  The plan is to eventually remove the old overloads,
resulting in roughly net-zero code added.  I also believe it is possible
to push this interface down further, eliminating multiple optional<double>
arguments, and simplifying the implementations.

The rollout plan is to land this, wait for a reasonable interval for
forwards-compatibility (maybe 1 week?), land the change that updates
interpolate to call these overloads, wait for a reasonable interval
for backwards-compatibility (maybe 6 months?), then remove the old
overloads.

This diff does not add the `.out` variants of the ops because they
are not currently accessible through any user-facing API.

ghstack-source-id: 106938113

Test Plan:
test_nn covers these ops fairly well, so that should prevent this diff
from breaking anything on its own.

test_nn on the next diff in the stack actually uses these new overloads,
so that should validate that they are actually correct.

Differential Revision: D21209989

fbshipit-source-id: 2b74d230401f071364eb05e138cdaa55279cfe91
2020-07-13 11:53:29 -07:00
fb9e44f8dd Add support for float[]? arguments in native_functions.yaml (#37175)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37175

ghstack-source-id: 106938114

Test Plan: Upcoming diffs use this for upsampling.

Differential Revision: D21209994

fbshipit-source-id: 1a71c07e45e28772a2bbe450b68280dcc0fe2def
2020-07-13 11:51:10 -07:00
d04a2e4dae Back out "Revert D22329069: Self binning histogram" (#41313)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41313

This diff backs out the backout diff. The failure was due to the C++ `or`
keyword not being supported in MSVC; it is now replaced with `||`.

Original commit changeset: fc7f3f8c968d

Test Plan: Existing unit tests, check github CI.

Reviewed By: malfet

Differential Revision: D22494777

fbshipit-source-id: 3271288919dc3a6bfb82508ab9d021edc910ae45
2020-07-13 11:46:34 -07:00
86d803a9da .cirlceci: Setup nvidia runtime for cu as well (#41268)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41268

We also want nvidia runtime packages to get installed when the
BUILD_ENVIRONMENT also includes "*cu*"

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22505885

Pulled By: seemethere

fbshipit-source-id: 4d8e70ed8aed9c6fd1828bc13cf7d5b0f8f50a0a
2020-07-13 10:29:25 -07:00
dea39b596e reduce logging for layernorm (#41305)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41305

Added a warning message when layernorm under/overflows (which is what nnpi does), and reduced the frequency of the logging to one in every 1000 occurrences.

Test Plan: compilation

Reviewed By: yinghai

Differential Revision: D22492726

fbshipit-source-id: 9343beeae6e65bf3846c6b3d2edd2a08dac85ed6
2020-07-13 10:23:46 -07:00
67a4f375cd Pass the number of indices but not embedding size in PyTorch operator (#41315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41315

We should pass the number of indices, not the embedding size, to the fused SparseAdagrad PyTorch operator.

Reviewed By: jianyuh

Differential Revision: D22495422

fbshipit-source-id: ec5d3a5c9547fcd8f95106d912b71888217a5af0
2020-07-12 20:55:40 -07:00
98df9781a7 Impl for ParameterList (#41259)
Summary:
This is a new PR for https://github.com/pytorch/pytorch/issues/40850, https://github.com/pytorch/pytorch/issues/40987 and https://github.com/pytorch/pytorch/issues/41206 (which I unintentionally closed), as I had some rebase issues with that one. Very sorry about that. I have also fixed the tests that failed in that PR.

This diff contains the implementation of C++ API for ParameterList from https://github.com/pytorch/pytorch/issues/25883.
Refer to the Python API: bc9e8af218/torch/nn/modules/container.py (L376)
Not sure about some naming differences between the C++ API and the Python API; for example, should `append` be called `push_back`?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41259

Test Plan: Add unit tests in this diff

Differential Revision: D22495780

Pulled By: glaringlee

fbshipit-source-id: 79ea3592db640f35477d445ecdaeafbdad814bec
2020-07-12 20:50:31 -07:00
fa153184c8 Fake Quantization Per Channel Kernel Core Implementation (CPU) (#41037)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41037

This diff contains the core implementation for the fake quantizer per channel kernel that supports back propagation on the scale and zero point.

Test Plan:
On a devvm, use:
- `buck test //caffe2/test:quantization -- learnable_forward_per_channel`
- `buck test //caffe2/test:quantization -- learnable_backward_per_channel`

Reviewed By: z-a-f

Differential Revision: D22395665

fbshipit-source-id: 280c2405d04adfeda9fb9cfc94d89e8d868e0d41
2020-07-12 12:14:00 -07:00
5e72ebeda3 Fake Quantization Per Tensor Kernel Core Implementation (CPU) (#41029)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41029

This diff contains the core implementation for the fake quantizer per tensor kernel that supports back propagation on the scale and zero point.

Test Plan:
On a devvm, use:
- `buck test //caffe2/test:quantization -- learnable_forward_per_tensor`
- `buck test //caffe2/test:quantization -- learnable_backward_per_tensor`

Reviewed By: z-a-f

Differential Revision: D22394145

fbshipit-source-id: f6748b635b86679aa9174a8065e6be5e20a95d81
2020-07-12 12:11:38 -07:00
402be850a8 [quant] Adding zero point type check for per channel quantization (#40811)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40811

Test Plan: Imported from OSS

Differential Revision: D22319417

Pulled By: z-a-f

fbshipit-source-id: 7be3a511ddd33b5fe749a83166bbc5874d1bd539
2020-07-12 11:40:19 -07:00
4b4184fc69 [quant][graphmode] use RemoveMutation to remove append (#41161)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41161

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D22446714

fbshipit-source-id: 15da28ef773300a141603d67a1c4524f1ec32239
2020-07-11 16:49:56 -07:00
106b0b6a62 Op to create quant scheme blob (#40760)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40760

Add op to create a quant scheme.

Test Plan:
buck test mode/opt caffe2/caffe2/quantization/server:int8_quant_scheme_blob_fill_test

{F241838981}

Reviewed By: csummersea

Differential Revision: D22228154

fbshipit-source-id: 1b7a02c06937c68e2fcccf77eb10a965300ed732
2020-07-11 10:53:10 -07:00
edcf2cdf86 [quant] dequantize support list and tuple of tensors (#41079)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41079

Test Plan: Imported from OSS

Differential Revision: D22420700

fbshipit-source-id: bc4bf0fb47dcf8b94b11fbdc91e8d5a75142b7be
2020-07-11 10:44:19 -07:00
c864158475 Add fp16 support to SparseLengthSum PyTorch operator (#41058)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41058

The SparseLengthSum PyTorch operator previously accepted only float and double types; this diff adds fp16 support to it.

Reviewed By: jianyuh

Differential Revision: D22387253

fbshipit-source-id: 2a7d03ceaadbb7b04077cff72ab77da6457ba989
2020-07-11 07:54:32 -07:00
28291d3cf8 [caffe2] Revert D22220798 (#41302)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41302

Test Plan:
```
buck test //caffe2/caffe2/fb/predictor:black_box_predictor_test
```

Differential Revision: D22492356

fbshipit-source-id: efcbc3c67abda5cb9da47e633804a4800d92f89b
2020-07-11 03:28:29 -07:00
e544bf2924 fix the range of the random weights used in the int8fc test (#41303)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41303

The error came from:
I0710 18:02:48.025024 1780875 NNPIOptions.cpp:49] [NNPI_LOG][D] [KS] convert_base_kernel_ivp.cpp(524): Output Scale 108240.101562 is out of valid range +-(Min 0.000061 Max 65504.000000)!!!

It seems the weights we are using are too small, thus generating scaling factors out of the range of fp16 (>65k). I am tentatively increasing this factor to a higher value (10x bigger) to avoid this.

Also increased max_examples to 100

Test Plan: ran this test

Reviewed By: yinghai

Differential Revision: D22492481

fbshipit-source-id: c0f9e59b0e70895ab787868ef1d87e6e80106554
2020-07-11 00:19:29 -07:00
a1ed6e1eb3 Revert D22467871: add check for duplicated op registration in JIT
Test Plan: revert-hammer

Differential Revision:
D22467871 (a548c6b18f)

Original commit changeset: 9b7a40a217e6

fbshipit-source-id: b594d4d0a079f7e24ef0efb45476ded2838cbef1
2020-07-10 23:39:23 -07:00
095886fa42 [caffe2] Fix the issues when using CUB RadixSort (#41299)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41299

When using `cub::DeviceRadixSort::SortPairs` (https://nvlabs.github.io/cub/structcub_1_1_device_radix_sort.html), the `end_bit` argument, or the most-significant bit index (exclusive) needed for key comparison, should be passed as `int(log2(float(num_rows)) + 1)` instead of `int(log2(float(num_indices)) + 1)`. This is because all the values in the indices array are guaranteed to be less than num_rows (hash_size), not num_indices. Thanks ngimel for pointing this out and thanks malfet for quickly fixing the log2() compilation issues.

Note:
An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.
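
The corrected bound, as a tiny sketch:

```
import math

def end_bit(hash_size):
    # most-significant bit index (exclusive) CUB needs for key comparison;
    # indices are all < hash_size (num_rows), so num_rows bounds the keys
    return int(math.log2(float(hash_size)) + 1)
```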

Test Plan:
```
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_sum_gradient \(caffe2\.caffe2\.fb\.net_transforms\.tests\.fuse_sparse_ops_test\.TestFuseSparseOps\)' --print-passing-details
```

Reviewed By: malfet

Differential Revision: D22491662

fbshipit-source-id: 4fdabe86244c948af6244f9bd91712844bf1dec1
2020-07-10 22:39:43 -07:00
d1f06da9b7 Solve log2(x:int) ambiguity by using log2(float(x)) (#41295)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41295

Differential Revision: D22490995

Pulled By: malfet

fbshipit-source-id: 17037e551ce5986f3162389a61932099563c02a7
2020-07-10 20:12:36 -07:00
1c098ae339 Fix arg type annotations in jit.trace and onnx.export (#41093)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40350

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41093

Differential Revision: D22477950

Pulled By: malfet

fbshipit-source-id: f1141c129b6d9efb373d22291b441df86c529ddd
2020-07-10 20:07:05 -07:00
877a59967f Ampere has CUDA_MAX_THREADS_PER_SM == 2048 (#41138)
Summary:
See: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf
page 44, table 5
![image](https://user-images.githubusercontent.com/1032377/86958633-56051580-c111-11ea-94da-c726a61dc00a.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41138

Differential Revision: D22488904

Pulled By: malfet

fbshipit-source-id: 97bd585d91e1a368f51aa6bd52081bc57d42dbf8
2020-07-10 20:02:20 -07:00
6cbb92494d Better THGeneric.h generation rules in bazel (#41285)
Summary:
Bazel doesn't do a good job of checking BLAS library capabilities, so hardcode the undef of BLAS_F2C.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41285

Differential Revision: D22489781

Pulled By: malfet

fbshipit-source-id: 13a14f31e08d7f9ded49731e4fd23663bac75cd2
2020-07-10 17:40:04 -07:00
67f5d68fdf Revert D22465221: [pytorch][PR] Reducing size of docker Linux image
Test Plan: revert-hammer

Differential Revision:
D22465221 (7c143e5d3e)

Original commit changeset: 487542597294

fbshipit-source-id: f085763a13497bd5ceea0ed6aa7676320c8806bf
2020-07-10 17:12:26 -07:00
ac3542fa59 Define PSIMD_SOURCE_DIR when including FP16 (#41233)
Summary:
Avoids a superfluous redownload when *NNPACK is not used (e.g. on Power).

Example: https://powerci.osuosl.org/job/pytorch-master-nightly-py3-linux-ppc64le/1128/consoleFull
Search for "Downloading PSimd"

See also https://github.com/pytorch/pytorch/issues/41178

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41233

Differential Revision: D22488833

Pulled By: malfet

fbshipit-source-id: 637291419ddd3b2a8dc25e211a4ebbba955e5855
2020-07-10 16:55:10 -07:00
abea7cd561 msvc anonymous namespace bug (#41199)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41199

workaround for: https://developercommunity.visualstudio.com/content/problem/900452/variable-in-anonymous-namespace-has-external-linka.html

Test Plan: CI green, ovrsource green

Reviewed By: malfet

Differential Revision: D22462050

fbshipit-source-id: 11a2fd6a4db1f29ce350699cfc3121dc89ab7ef6
2020-07-10 16:45:14 -07:00
48d6e2adce Disable the mkldnn for conv2d in some special cases (#40610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40610

We have benchmarked several models, which shows that the native implementation of conv2d is faster than the MKLDNN path in some special cases. For group conv, the native implementation does not batch all the groups.

Test Plan:
```
import torch
import torch.nn.functional as F

import numpy as np

from timeit import Timer

num = 50

S = [
#         [1, 1, 100, 40, 16, 3, 3, 1, 1, 1, 1],
#         [1, 2048, 4, 2, 512, 1, 1, 1, 1, 0, 0],
#         [1, 512, 4, 2, 512, 3, 3, 1, 1, 1, 1],
#         [1, 512, 4, 2, 2048, 1, 1, 1, 1, 0, 0],
#         [1, 2048, 4, 2, 512, 1, 1, 1, 1, 0, 0],
#         [1, 512, 4, 2, 512, 3, 3, 1, 1, 1, 1],
#         [1, 512, 4, 2, 2048, 1, 1, 1, 1, 0, 0],
#         [1, 2048, 4, 2, 512, 1, 1, 1, 1, 0, 0],
#         [1, 512, 4, 2, 512, 3, 3, 1, 1, 1, 1],
#         [1, 512, 4, 2, 2048, 1, 1, 1, 1, 0, 0],
#         [1, 2048, 4, 2, 512, 1, 1, 1, 1, 0, 0],
#         [1, 512, 4, 2, 512, 3, 3, 1, 1, 1, 1],
#         [1, 512, 4, 2, 2048, 1, 1, 1, 1, 0, 0],
#         [1, 2048, 4, 2, 512, 1, 1, 1, 1, 0, 0],
#         [1, 512, 4, 2, 512, 3, 3, 1, 1, 1, 1],
#         [1, 512, 4, 2, 2048, 1, 1, 1, 1, 0, 0],
#         [1, 2048, 4, 2, 512, 1, 1, 1, 1, 0, 0],
#         [1, 512, 4, 2, 512, 3, 3, 1, 1, 1, 1],
#         [1, 512, 4, 2, 2048, 1, 1, 1, 1, 0, 0],
[1, 3, 224, 224, 64, 7, 7, 2, 2, 3, 3, 1],
[1, 64, 56, 56, 128, 1, 1, 1, 1, 0, 0, 1],
[1, 128, 56, 56, 128, 3, 3, 1, 1, 1, 1, 32],
[1, 128, 56, 56, 256, 1, 1, 1, 1, 0, 0, 1],
[1, 64, 56, 56, 256, 1, 1, 1, 1, 0, 0, 1],
[1, 256, 56, 56, 128, 1, 1, 1, 1, 0, 0, 1],
[1, 128, 56, 56, 128, 3, 3, 1, 1, 1, 1, 32],
[1, 128, 56, 56, 256, 1, 1, 1, 1, 0, 0, 1],
[1, 256, 56, 56, 128, 1, 1, 1, 1, 0, 0, 1],
[1, 128, 56, 56, 128, 3, 3, 1, 1, 1, 1, 32],
[1, 128, 56, 56, 256, 1, 1, 1, 1, 0, 0, 1],
[1, 256, 56, 56, 256, 1, 1, 1, 1, 0, 0, 1],
[1, 256, 56, 56, 256, 3, 3, 2, 2, 1, 1, 32],
[1, 256, 28, 28, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 256, 56, 56, 512, 1, 1, 2, 2, 0, 0, 1],
[1, 512, 28, 28, 256, 1, 1, 1, 1, 0, 0, 1],
[1, 256, 28, 28, 256, 3, 3, 1, 1, 1, 1, 32],
[1, 256, 28, 28, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 28, 28, 256, 1, 1, 1, 1, 0, 0, 1],
[1, 256, 28, 28, 256, 3, 3, 1, 1, 1, 1, 32],
[1, 256, 28, 28, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 28, 28, 256, 1, 1, 1, 1, 0, 0, 1],
[1, 256, 28, 28, 256, 3, 3, 1, 1, 1, 1, 32],
[1, 256, 28, 28, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 28, 28, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 28, 28, 512, 3, 3, 2, 2, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 28, 28, 1024, 1, 1, 2, 2, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 1024, 3, 3, 2, 2, 1, 1, 32],
[1, 1024, 7, 7, 2048, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 2048, 1, 1, 2, 2, 0, 0, 1],
[1, 2048, 7, 7, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 7, 7, 1024, 3, 3, 1, 1, 1, 1, 32],
[1, 1024, 7, 7, 2048, 1, 1, 1, 1, 0, 0, 1],
[1, 2048, 7, 7, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 7, 7, 1024, 3, 3, 1, 1, 1, 1, 32],
[1, 1024, 7, 7, 2048, 1, 1, 1, 1, 0, 0, 1],
    ]
for x in range(105):
    P = S[x]
    print(P)
    (N, C, H, W) = P[0:4]
    M = P[4]
    (kernel_h, kernel_w) = P[5:7]
    (stride_h, stride_w) = P[7:9]
    (padding_h, padding_w) = P[9:11]

    g = P[11]
    X_np = np.random.randn(N, C, H, W).astype(np.float32)
    # nn.Conv2d initializes its own weights, so no separate weight array is needed here.
    X = torch.from_numpy(X_np)
    conv2d_pt = torch.nn.Conv2d(
        C, M, (kernel_h, kernel_w), stride=(stride_h, stride_w),
        padding=(padding_h, padding_w), groups=g, bias=True)

    class ConvNet(torch.nn.Module):
        def __init__(self):
            super(ConvNet, self).__init__()
            self.conv2d = conv2d_pt

        def forward(self, x):
            return self.conv2d(x)

    model = ConvNet()

    def pt_forward():
        with torch.no_grad():
            model(X)

    torch._C._set_mkldnn_enabled(True)
    t = Timer("pt_forward()", "from __main__ import pt_forward, X")
    print("MKLDNN pt time = {}".format(t.timeit(num) / num * 1000.0))
    torch._C._set_mkldnn_enabled(False)
    t = Timer("pt_forward()", "from __main__ import pt_forward, X")
    print("TH pt time = {}".format(t.timeit(num) / num * 1000.0))

OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 python bm.py
```

output:
```
[1, 3, 224, 224, 64, 7, 7, 2, 2, 3, 3, 1]
MKLDNN pt time = 5.891108009964228
TH pt time = 7.0624795742332935
[1, 64, 56, 56, 128, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 1.4464975893497467
TH pt time = 0.721491202712059
[1, 128, 56, 56, 128, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 1.4036639966070652
TH pt time = 3.299683593213558
[1, 128, 56, 56, 256, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.3908068016171455
TH pt time = 2.227546200156212
[1, 64, 56, 56, 256, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.226586602628231
TH pt time = 1.3865559734404087
[1, 256, 56, 56, 128, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.31307839602232
TH pt time = 2.4284918047487736
[1, 128, 56, 56, 128, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 1.5028003975749016
TH pt time = 3.824346773326397
[1, 128, 56, 56, 256, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.4405963867902756
TH pt time = 2.6227117888629436
[1, 256, 56, 56, 128, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.405764400959015
TH pt time = 2.644723802804947
[1, 128, 56, 56, 128, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 1.5220053866505623
TH pt time = 3.9365867897868156
[1, 128, 56, 56, 256, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.606868200004101
TH pt time = 2.5387956015765667
[1, 256, 56, 56, 256, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 6.0041105933487415
TH pt time = 5.305919591337442
[1, 256, 56, 56, 256, 3, 3, 2, 2, 1, 1, 32]
MKLDNN pt time = 1.4830979891121387
TH pt time = 7.532084975391626
[1, 256, 28, 28, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.025687597692013
TH pt time = 2.2185291908681393
[1, 256, 56, 56, 512, 1, 1, 2, 2, 0, 0, 1]
MKLDNN pt time = 3.5893129743635654
TH pt time = 2.696530409157276
[1, 512, 28, 28, 256, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.8203356079757214
TH pt time = 2.0819314010441303
[1, 256, 28, 28, 256, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.8583215996623039
TH pt time = 2.7761065773665905
[1, 256, 28, 28, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.9077288135886192
TH pt time = 2.045416794717312
[1, 512, 28, 28, 256, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.805021796375513
TH pt time = 2.131381593644619
[1, 256, 28, 28, 256, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.9023251943290234
TH pt time = 2.9028950072824955
[1, 256, 28, 28, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.1174601800739765
TH pt time = 2.275596000254154
[1, 512, 28, 28, 256, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.100480604916811
TH pt time = 2.399571593850851
[1, 256, 28, 28, 256, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.9321337938308716
TH pt time = 2.886691205203533
[1, 256, 28, 28, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.065785188227892
TH pt time = 2.1640316024422646
[1, 512, 28, 28, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 5.891813579946756
TH pt time = 4.2956990003585815
[1, 512, 28, 28, 512, 3, 3, 2, 2, 1, 1, 32]
MKLDNN pt time = 0.9399276040494442
TH pt time = 4.7622935846447945
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.2426914013922215
TH pt time = 2.3699573799967766
[1, 512, 28, 28, 1024, 1, 1, 2, 2, 0, 0, 1]
MKLDNN pt time = 3.0341636016964912
TH pt time = 2.6606030017137527
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.991385366767645
TH pt time = 2.6313263922929764
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7330256141722202
TH pt time = 3.008321188390255
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.880081795156002
TH pt time = 2.289068605750799
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.9583285935223103
TH pt time = 2.6302105747163296
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7322711870074272
TH pt time = 2.8230775892734528
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.8620235808193684
TH pt time = 2.4078205972909927
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.828651014715433
TH pt time = 2.616014201194048
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7084695994853973
TH pt time = 2.8024527989327908
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.7884829975664616
TH pt time = 2.4237345717847347
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.89030060172081
TH pt time = 2.5852439925074577
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.724627785384655
TH pt time = 2.651805803179741
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.249914798885584
TH pt time = 2.0440668053925037
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.722136974334717
TH pt time = 2.531316000968218
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7164162024855614
TH pt time = 2.8521843999624252
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.8891782090067863
TH pt time = 2.436912599951029
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.0049769952893257
TH pt time = 2.649025786668062
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7299130037426949
TH pt time = 2.67714099958539
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.799382768571377
TH pt time = 2.4427592009305954
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.0201382003724575
TH pt time = 2.6285660080611706
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.6983320042490959
TH pt time = 2.9118607938289642
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.8802538104355335
TH pt time = 2.385452575981617
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.9600497893989086
TH pt time = 2.594646792858839
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.5688861943781376
TH pt time = 2.5941073894500732
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.7758505940437317
TH pt time = 2.336081601679325
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.6135251857340336
TH pt time = 2.3902921937406063
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.6303061917424202
TH pt time = 2.6228136010468006
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.8868251852691174
TH pt time = 2.5620524026453495
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.057632204145193
TH pt time = 2.691414188593626
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7316274009644985
TH pt time = 3.14683198928833
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.2674955762922764
TH pt time = 2.602821197360754
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.0993166007101536
TH pt time = 2.609328981488943
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7257938012480736
TH pt time = 2.9255208000540733
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.3086097799241543
TH pt time = 2.544360812753439
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.0537622049450874
TH pt time = 2.6343842037022114
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7194169983267784
TH pt time = 2.9009717889130116
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.6461398042738438
TH pt time = 2.3600555770099163
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.6328082010149956
TH pt time = 2.415131386369467
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.6832938082516193
TH pt time = 2.6299685798585415
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.9594415985047817
TH pt time = 2.509857602417469
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.956229578703642
TH pt time = 2.691046390682459
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7222409918904305
TH pt time = 2.938339803367853
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.9467295855283737
TH pt time = 2.4219116009771824
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.0215882137417793
TH pt time = 2.7782391756772995
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.719242412596941
TH pt time = 2.8529402054846287
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.8062099777162075
TH pt time = 2.9951974004507065
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.1621821969747543
TH pt time = 2.5330167822539806
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.690075010061264
TH pt time = 2.5531245954334736
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.832614816725254
TH pt time = 2.339891381561756
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.7835668064653873
TH pt time = 2.513139396905899
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7026367820799351
TH pt time = 2.796882800757885
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.6479675993323326
TH pt time = 2.4971639923751354
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.9846629686653614
TH pt time = 2.4657804146409035
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.5969022028148174
TH pt time = 2.697007991373539
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.7602720074355602
TH pt time = 2.4498093873262405
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.963611613959074
TH pt time = 2.6310251839458942
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7004458084702492
TH pt time = 2.9164502024650574
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.887732572853565
TH pt time = 2.4575488083064556
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.8350806050002575
TH pt time = 2.23197178915143
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.5626789852976799
TH pt time = 2.704860605299473
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.6168799959123135
TH pt time = 2.2481359727680683
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.5654693879187107
TH pt time = 2.2636358067393303
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.6836861930787563
TH pt time = 2.825192976742983
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.7971909940242767
TH pt time = 2.471243590116501
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.8480279818177223
TH pt time = 2.553586605936289
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7191735878586769
TH pt time = 2.6465672068297863
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.7811027877032757
TH pt time = 2.457349617034197
[1, 1024, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 5.434317365288734
TH pt time = 4.639615211635828
[1, 1024, 14, 14, 1024, 3, 3, 2, 2, 1, 1, 32]
MKLDNN pt time = 0.9400106035172939
TH pt time = 2.9971951991319656
[1, 1024, 7, 7, 2048, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 4.494664408266544
TH pt time = 3.478870000690222
[1, 1024, 14, 14, 2048, 1, 1, 2, 2, 0, 0, 1]
MKLDNN pt time = 4.8432330042123795
TH pt time = 3.6410867795348167
[1, 2048, 7, 7, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 4.779010973870754
TH pt time = 3.4093930013477802
[1, 1024, 7, 7, 1024, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.8385192044079304
TH pt time = 3.0921380035579205
[1, 1024, 7, 7, 2048, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.9088409766554832
TH pt time = 3.130124807357788
[1, 2048, 7, 7, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 4.0072557888925076
TH pt time = 2.977220807224512
[1, 1024, 7, 7, 1024, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.8867520093917847
TH pt time = 3.1505179964005947
[1, 1024, 7, 7, 2048, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 4.118196591734886
TH pt time = 3.46621660515666
```

Reviewed By: dzhulgakov

Differential Revision: D22250817

fbshipit-source-id: c9dc61b633e11a378a05810d711a696effd7f02b
2020-07-10 16:43:29 -07:00
ce3ba3b9bc [JIT] Add support for backend-lowered submodules (#41146)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41146

**Summary**
This commit adds support for using `Modules` that have been lowered as
submodules in `ScriptModules`.
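
For orientation, a minimal sketch of the usage this enables; the backend name, the compile-spec dict, and the exact `torch._C._jit_to_backend` signature are assumptions based on the test suite, not code from this PR:
```python
import torch

class Wrapper(torch.nn.Module):
    def __init__(self, lowered):
        super().__init__()
        # A backend-lowered module held as a submodule, which this PR enables.
        self.lowered = lowered

    def forward(self, x):
        return self.lowered(x)

scripted = torch.jit.script(torch.nn.Linear(2, 2))
# "test_backend" and the method_compile_spec dict are hypothetical placeholders.
lowered = torch._C._jit_to_backend("test_backend", scripted, {"forward": {"": ""}})
wrapper = torch.jit.script(Wrapper(lowered))  # scripting with a lowered submodule
```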

**Test Plan**
This commit adds execution and save/load tests to test_backends.py for
backend-lowered submodules.

**Fixes**
This commit fixes #40069.

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D22459543

Pulled By: SplitInfinity

fbshipit-source-id: 02e0c0ccdce26c671ade30a34aca3e99bcdc5ba7
2020-07-10 16:35:24 -07:00
1f2e91fa4f Implicit casting resulting in internal build failure. (#41272)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41272

Implicit casting from int to float is resulting in vec256_test build failure
internally. This diff fixes that.

Test Plan: Build vec256_test for android and run it on android phone.

Reviewed By: ljk53, paulshaoyuqiao

Differential Revision: D22484635

fbshipit-source-id: ebb9fc2eccb8261ab01d8266150fc3b05166f1e7
2020-07-10 16:29:54 -07:00
7bae5780a2 Revert D22329069: Self binning histogram
Test Plan: revert-hammer

Differential Revision:
D22329069 (16c8146da9)

Original commit changeset: 28406b94e284

fbshipit-source-id: fc7f3f8c968d1ec7d2a1cf7a4d05900f51055d82
2020-07-10 16:22:29 -07:00
dd0c98d82a [ONNX]Add tests for ConvTranspose 1D and 3D (#40703)
Summary:
Add tests for ConvTranspose 1D and 3D

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40703

Reviewed By: hl475

Differential Revision: D22480087

Pulled By: houseroad

fbshipit-source-id: 92846ed7181f543af20669e5ea191bfb5522ea13
2020-07-10 16:10:09 -07:00
9daba76ba1 Change to.dtype_layout to c10-full (#41169)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41169

-
ghstack-source-id: 107537240

Test Plan: waitforsandcastle

Differential Revision: D22289257

fbshipit-source-id: ed3cc06327951fa886eb3b8f1c8bcc014ae2bc41
2020-07-10 16:04:34 -07:00
7c143e5d3e Reducing size of docker Linux image (#41207)
Summary:
# Description
The goal is to reduce the size of the docker image. I checked a few things:
* Docker layer overlaps
* Removing .git folder
* Removing intermediate build artifacts (*.o and *.a)

The only one that gave a satisfying result was the third approach, removing *.o and *.a files. The final image went from 10 GB to 9.7 GB.

I used Dive (https://github.com/wagoodman/dive) to inspect the Docker image manually.

# Test:
* Check the image size was reduced
* No test failures in CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41207

Test Plan:
* Check the image size was reduced
* No test failures in CI

Differential Revision: D22465221

Pulled By: ssylvain

fbshipit-source-id: 48754259729401e3c08447b0fa0630ca7217cb98
2020-07-10 15:59:18 -07:00
0651887eb4 Improve repr for torch.iinfo & torch.finfo (#40488)
Summary:
- fix https://github.com/pytorch/pytorch/issues/39991
- Directly include `min`/`max`/`eps`/`tiny` values in the repr of `torch.iinfo` & `torch.finfo` for inspection
- Use `torch.float16` / `torch.int16` instead of the non-corresponding names `Half` / `Short`
- The improved repr looks like this:
```
>>> torch.iinfo(torch.int8)
iinfo(type=torch.int8, max=127, min=-128)
>>> torch.iinfo(torch.int16)
iinfo(type=torch.int16, max=32767, min=-32768)
>>> torch.iinfo(torch.int32)
iinfo(type=torch.int32, max=2.14748e+09, min=-2.14748e+09)
>>> torch.iinfo(torch.int64)
iinfo(type=torch.int64, max=9.22337e+18, min=-9.22337e+18)
>>> torch.finfo(torch.float16)
finfo(type=torch.float16, eps=0.000976563, max=65504, min=-65504, tiny=6.10352e-05)
>>> torch.finfo(torch.float32)
finfo(type=torch.float32, eps=1.19209e-07, max=3.40282e+38, min=-3.40282e+38, tiny=1.17549e-38)
>>> torch.finfo(torch.float64)
finfo(type=torch.float64, eps=2.22045e-16, max=1.79769e+308, min=-1.79769e+308, tiny=2.22507e-308)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40488

Differential Revision: D22445301

Pulled By: mruberry

fbshipit-source-id: 552af9904c423006084b45d6c4adfb4b5689db54
2020-07-10 15:22:55 -07:00
cb6c3526c6 Migrate addmm, addbmm and THBlas_gemm to ATen (#40927)
Summary:
Resubmit #40927
Closes https://github.com/pytorch/pytorch/issues/24679, closes https://github.com/pytorch/pytorch/issues/24678

`addbmm` depends on `addmm` so needed to be ported at the same time. I also removed `THTensor_(baddbmm)` which I noticed had already been ported so was just dead code.

After having already written this code, I had to fix merge conflicts with https://github.com/pytorch/pytorch/issues/40354 which revealed there was already an established place for cpu blas routines in ATen. However, the version there doesn't make use of ATen's AVX dispatching so thought I'd wait for comment before migrating this into that style.
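
For reference, a quick sanity check of the public semantics the port must preserve (standard documented API, nothing specific to this PR):
```python
import torch

M = torch.randn(3, 5)
b1, b2 = torch.randn(4, 3, 2), torch.randn(4, 2, 5)

# addbmm: beta * M + alpha * (sum over the batch of b1[i] @ b2[i])
out = torch.addbmm(M, b1, b2, beta=0.5, alpha=2.0)
ref = 0.5 * M + 2.0 * sum(b1[i] @ b2[i] for i in range(4))
torch.testing.assert_allclose(out, ref)
```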

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40927

Reviewed By: ezyang

Differential Revision: D22468490

Pulled By: ngimel

fbshipit-source-id: f8a22be3216f67629420939455e31a88af20201d
2020-07-10 14:30:55 -07:00
16c8146da9 Self binning histogram (#40875)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40875

This op uses the given num_bins and a spacing strategy to automatically bin and compute the histogram of given matrices.

Test Plan: Unit tests.

Reviewed By: neha26shah

Differential Revision: D22329069

fbshipit-source-id: 28406b94e284d52d875f73662fc82f93dbc00064
2020-07-10 13:55:42 -07:00
9b0393fcf1 [ONNX]Fix export of flatten (#40418)
Summary:
Shape is passed to _reshape_to_tensor as a Constant, so the shape of the input cannot be inferred when the model is exported with dynamic axes set. Instead of a Constant, pass the output of a Shape-Slice-Concat subgraph to compute the shape for the Reshape node in the _reshape_to_tensor function.
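
A sketch of the kind of export that exercises this fix; the model and axis names are illustrative, not taken from the PR:
```python
import torch

class Flat(torch.nn.Module):
    def forward(self, x):
        return torch.flatten(x, start_dim=1)

x = torch.randn(2, 3, 4)
# With dynamic_axes set, the reshape target cannot be baked in as a Constant.
torch.onnx.export(Flat(), x, "flat.onnx",
                  input_names=["x"], output_names=["y"],
                  dynamic_axes={"x": {0: "batch"}, "y": {0: "batch"}})
```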

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40418

Reviewed By: hl475

Differential Revision: D22480127

Pulled By: houseroad

fbshipit-source-id: 11853adb6e6914936871db1476916699141de435
2020-07-10 13:06:25 -07:00
a548c6b18f add check for duplicated op registration in JIT (#41214)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41214

Same as D21032976, add check for duplicated op name in JIT

Test Plan:
run full JIT predictor
also
buck test pytorch-playground

Reviewed By: smessmer

Differential Revision: D22467871

fbshipit-source-id: 9b7a40a217e6c63cca44cad54f9f657b8b207a45
2020-07-10 12:19:04 -07:00
75b6dd3d49 Wrap Caffe2's SparseLengthsSum into a PyTorch op (#39596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39596

This diff wraps Caffe2's SparseLengthsSum on GPU as a PT op.

Reviewed By: jianyuh

Differential Revision: D21895309

fbshipit-source-id: 38bb156f9be8d28225d2b44f5b4c93d27779aff9
2020-07-10 11:19:13 -07:00
d927aee312 Small clarification of torch.cuda.amp multi-model example (#41203)
Summary:
Some people have been confused by `retain_graph` in the snippet; they thought it was an additional requirement imposed by amp.
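
The pattern in question looks roughly like the sketch below (toy models; requires a CUDA device). `retain_graph=True` is an ordinary autograd requirement here, because the two backward passes traverse a shared part of the graph; amp adds nothing extra:
```python
import torch

model0 = torch.nn.Linear(8, 8).cuda()
model1 = torch.nn.Linear(8, 8).cuda()
opt0 = torch.optim.SGD(model0.parameters(), lr=0.1)
opt1 = torch.optim.SGD(model1.parameters(), lr=0.1)
data = torch.randn(4, 8, device="cuda")
target = torch.randn(4, 8, device="cuda")
loss_fn = torch.nn.MSELoss()

scaler = torch.cuda.amp.GradScaler()
with torch.cuda.amp.autocast():
    out0, out1 = model0(data), model1(data)
    loss0 = loss_fn(0.3 * out0 + 0.7 * out1, target)
    loss1 = loss_fn(0.6 * out0 + 0.4 * out1, target)

# loss0's backward would free the graph shared with loss1 unless retained.
scaler.scale(loss0).backward(retain_graph=True)
scaler.scale(loss1).backward()
scaler.step(opt0)
scaler.step(opt1)
scaler.update()
```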

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41203

Differential Revision: D22463700

Pulled By: ngimel

fbshipit-source-id: e6fc8871be2bf0ecc1794b1c6f5ea99af922bf7e
2020-07-10 11:13:26 -07:00
4a09501fbe LogitOp LUT based fake FP16 Op. (#41258)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41258

LogitOp LUT based fake FP16 Op.

(Note: this ignores all push blocking failures!)

Test Plan: test_op_nnpi_fp16.py covers the test_logit testing.

Reviewed By: hyuen

Differential Revision: D22351963

fbshipit-source-id: e2ed2bd9bfdc58c6f823d7d41557109c08628bd7
2020-07-10 10:53:42 -07:00
33f9fbf8ba Modularize parsing NCCL_BLOCKING_WAIT in ProcessGroupNCCL (#41076)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41076

Modularizes parsing of the NCCL_BLOCKING_WAIT environment variable in the ProcessGroupNCCL constructor.
ghstack-source-id: 107491850
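
For reference, the variable this constructor parses is set before the process group is created; a minimal single-rank sketch (the rendezvous values are illustrative, and NCCL needs a GPU):
```python
import os
import torch.distributed as dist

# "1" enables blocking wait; must be set before ProcessGroupNCCL is constructed.
os.environ["NCCL_BLOCKING_WAIT"] = "1"
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("nccl", rank=0, world_size=1)
```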

Test Plan: Sandcastle/CI

Differential Revision: D22401225

fbshipit-source-id: 79866d3f4f1a617cdcbca70e3bea1ce9dcac3316
2020-07-10 10:47:38 -07:00
db38487ece Autograd Doc for Complex Numbers (#41012)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41012

Test Plan: Imported from OSS

Differential Revision: D22476911

Pulled By: anjali411

fbshipit-source-id: 7da20cb4312a0465272bebe053520d9911475828
2020-07-10 09:57:43 -07:00
e568b3fa2d test nan and inf in TestTorchMathOps (#41225)
Summary:
Per title. `lgamma` produces a different result for `-inf` compared to scipy, so that comparison is skipped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41225

Differential Revision: D22473346

Pulled By: ngimel

fbshipit-source-id: e4ebda1b10e2a061bd4cef38d1d7b5bf0f581790
2020-07-10 09:46:46 -07:00
62e16934cb [caffe2] Add the dedup implementation of fused RowWiseAdagrad op on GPUs (#40282)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40282

Test Plan:
```
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_sum_gradient \(caffe2\.caffe2\.fb\.net_transforms\.tests\.fuse_sparse_ops_test\.TestFuseSparseOps\)' --print-passing-details
```

https://our.intern.facebook.com/intern/testinfra/testrun/4785074632584150

Reviewed By: jspark1105

Differential Revision: D22102737

fbshipit-source-id: fa3fef7cecb1e2cf5c9b6019579dc0f86fd3a3b2
2020-07-10 09:05:24 -07:00
08227072e2 Benchmark RecordFunction overhead on some models (#40952)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40952

Adding a benchmark to measure RecordFunction overhead,
currently on resnet50 and lstm models

Test Plan:
python benchmarks/record_function_benchmark/record_function_bench.py
Benchmarking RecordFunction overhead for lstm_jit
Running warmup... finished
Running 100 iterations with RecordFunction... finished
N = 100, avg. time: 251.970 ms, stddev: 39.348 ms
Running 100 iterations without RecordFunction... finished
N = 100, avg. time: 232.828 ms, stddev: 24.556 ms

Reviewed By: dzhulgakov

Differential Revision: D22368357

Pulled By: ilia-cher

fbshipit-source-id: bff4f4e0e06fb80fdfcf85966c2468e48ed7bc98
2020-07-10 08:46:19 -07:00
8a79eec98a Add add_relu fusion pass to optimize_for_mobile. (#40252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40252

As title says.
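
A minimal sketch of where the new pass now runs; the toy model is illustrative:
```python
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.ReLU()).eval()
scripted = torch.jit.script(model)
# optimize_for_mobile now also applies the add_relu fusion to the graph.
opt_model = optimize_for_mobile(scripted)
```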

Test Plan:
python test/test_mobile_optimizer.py

Imported from OSS

Differential Revision: D22126825

fbshipit-source-id: a1880587ba8db9dee0fa450bc463734e4a8693d9
2020-07-10 08:10:22 -07:00
75a4862f63 Added SiLU activation function (#41034)
Summary:
Implemented the SiLU activation function as discussed in https://github.com/pytorch/pytorch/issues/3169.
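
SiLU is defined as silu(x) = x * sigmoid(x); a quick check against the module this PR introduces:
```python
import torch

m = torch.nn.SiLU()
x = torch.randn(5)
# The module should match the elementwise definition exactly.
torch.testing.assert_allclose(m(x), x * torch.sigmoid(x))
```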

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41034

Reviewed By: glaringlee

Differential Revision: D22465203

Pulled By: heitorschueroff

fbshipit-source-id: b27d064529fc99600c586ad49b594b52b718b0d2
2020-07-10 07:37:30 -07:00
f6eb92a354 Expose private APIs to enable/disable pickling ScriptModules without RPC (#39631)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39631

Background:
Currently, we cannot send ScriptModule over RPC as an argument.
Otherwise, it would hit the following error:

> _pickle.PickleError: ScriptModules cannot be deepcopied using
> copy.deepcopy or saved using torch.save. Mixed serialization of
> script and non-script modules is not supported. For purely
> script modules use my_script_module.save(<filename>) instead.

Failed attempt:
tried to install `torch.jit.ScriptModule` to RPC's
dispatch table, but it does not work as the dispatch table only
matches exact types and using base type `torch.jit.ScriptModule`
does not work for derived typed.

Current solution:
The current solution exposes `_enable_jit_rref_pickle` and
`_disable_jit_rref_pickle` APIs to toggle the `allowJitRRefPickle`
flag. See `test_pickle_script_module_with_rref` as an example.
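
A sketch of the toggle pattern; the exact import path of the two functions and the RPC worker setup (init_rpc etc., omitted here) are assumptions:
```python
import torch
import torch.distributed.rpc as rpc

scripted = torch.jit.script(torch.nn.Linear(2, 2))  # a ScriptModule to share

rpc._enable_jit_rref_pickle()
try:
    rref = rpc.RRef(scripted)
    # ... pass rref through rpc_sync/rpc_async calls that need to pickle it ...
finally:
    rpc._disable_jit_rref_pickle()
```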

Test Plan: Imported from OSS

Differential Revision: D21920870

Pulled By: mrshenli

fbshipit-source-id: 4d58afce5d0b4b81249b383c173488820b1a47d6
2020-07-10 07:27:51 -07:00
df252c059c [ROCm] Skip caffe2 unique op test for rocm3.5 (#41219)
Summary:
A unique op test failure in caffe2 blocks upgrading CI to ROCm 3.5.1. Skipping the test to unblock; will re-enable after root-causing and fixing the issue.
jeffdaily sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41219

Differential Revision: D22471452

Pulled By: xw285cornell

fbshipit-source-id: 9e503c8b37c0a4b92632f77b2f8a90281a9889c3
2020-07-09 20:00:29 -07:00
a79b416847 make Int8 FC bias quantization use round flush to infinity
Summary:
The current quantization rounding function uses fbgemm, which
defaults to round-to-nearest. The hardware implementation uses round
flush to infinity, so this adds an option to switch the rounding mode.

Test Plan: ran against test_fc_int8

Reviewed By: venkatacrc

Differential Revision: D22452306

fbshipit-source-id: d2a1fbfc695612fe07caaf84f52669643507cc9c
2020-07-09 17:25:41 -07:00
7c2c752e6d Revert D22458928: [pytorch][PR] Use explicit templates in CUDALoops kernels
Test Plan: revert-hammer

Differential Revision:
D22458928 (e374280768)

Original commit changeset: cca623bb6e76

fbshipit-source-id: 6dd24f783ec3b781140f314716ffb02f0892c57a
2020-07-09 16:31:50 -07:00
c5dcf056ee JIT pass for add relu fusion. (#39343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39343

Building on top of the previous PR that adds the fused add_relu op, this PR adds
a JIT pass that transforms the input graph to find all fusable instances of add
+ relu and fuses them.
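
A sketch of exercising the pass directly; its exposure as `torch._C._jit_pass_fuse_add_relu` operating on a graph is an assumption inferred from the test plan below:
```python
import torch

@torch.jit.script
def f(a, b):
    return torch.relu(a + b)

# Hypothetical direct invocation; after the pass, the separate add and relu
# nodes should be rewritten into a single fused add_relu node.
torch._C._jit_pass_fuse_add_relu(f.graph)
print(f.graph)
```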

Test Plan:
python test/test_jit.py TestJit.test_add_relu_fusion

Imported from OSS

Differential Revision: D21822396

fbshipit-source-id: 12c7e8db54c6d70a2402b32cc06c7e305ffbb1be
2020-07-09 16:25:13 -07:00
82c9f79e0e Add fused add_relu op. (#39342)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39342

Many networks such as resnet have adds followed by relu. This op is the
first step in enabling this fused implementation.
Once we have the fused add_relu op, a JIT pass will be written to
replace add + relu patterns with add_relu.

Test Plan:
python test/test_nn.py TestAddRelu

Imported from OSS

Differential Revision: D21822397

fbshipit-source-id: 03df83a3e46ddb48a90c5a6f755227a7e361a0e8
2020-07-09 16:25:11 -07:00
d6feb6141f [Vec256][neon] Add neon backend for vec256 (#39341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39341

This PR introduces a NEON backend for the vec256 class for the float datatype.
For now only aarch64 is enabled, due to a few issues with enabling it on
32-bit ARM (aarch32).

Test Plan:
vec256_test

Imported from OSS

Differential Revision: D21822399

fbshipit-source-id: 3851c4336d93d1c359c85b38cf19904f82bc7b8d
2020-07-09 16:25:09 -07:00
bddba1e336 Add benchmark for add op. (#40059)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40059

This benchmark is added specifically for mobile, to see whether the compiler is
auto-vectorizing and thus whether the NEON backend for vec256 offers any
advantage for the add op.

Test Plan:
CI

Imported from OSS

Differential Revision: D22055146

fbshipit-source-id: 43ba6c4ae57c6f05d84887c2750ce21ae1b0f0b5
2020-07-09 16:22:55 -07:00
dde3d5f4a8 [RPC docs] Remove mention of TensorPipe's SHM and CMA backends as they're not built (#41200)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41200

In short, we messed up. The SHM and CMA backends of TensorPipe are Linux-specific and thus they are guarded by an #ifdef in the agent's code. Due to a mishap with CMake (TensorPipe has two CMake files, one for PyTorch and a "standalone" one) we were not correctly propagating some flags, and these #ifdefs were always false. This means that these two backends have always been disabled and have thus never been covered by our OSS CI. It would be irresponsible to enable them now in v1.6, so instead we remove any mention of them from the docs.

Note that this is perhaps not as bad as it sounds. These two backends provided higher performance (lower latency) when the two endpoints were on the same machine. However, I suspect that most RPC users only do transfers across machines, for which SHM and CMA wouldn't have played any role.
ghstack-source-id: 107458630

Test Plan: Docs only

Differential Revision: D22462158

fbshipit-source-id: 0d72fea11bcaab6d662184bbe7270529772a5e9b
2020-07-09 15:33:07 -07:00
a88099ba3e restore old documentation references (#39086)
Summary:
Fixes gh-39007

We replaced actual content with links to generated content in many places to break the documentation into manageable chunks. This caused references like
```
https://pytorch.org/docs/stable/torch.html#torch.flip
```
to become
```
https://pytorch.org/docs/master/generated/torch.flip.html#torch.flip
```
The textual content that was located at the old reference was replaced with a link to the new reference. This PR adds a `<p id="xxx"></p>` anchor next to the link, so that older references from outside tutorials and forums still work: they bring the user to a link they can then follow through to see the actual content.

The way this is done is to monkeypatch the sphinx writer method that produces the link. It is ugly but practical, and in my mind not worse than adding javascript to do the same thing.
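
A rough sketch of that monkeypatching idea; the writer class and method chosen here are assumptions for illustration, not the PR's actual code:
```python
from sphinx.writers.html import HTMLTranslator

_orig_visit_reference = HTMLTranslator.visit_reference

def visit_reference(self, node):
    # Emit an empty anchor next to the link so old-style #torch.flip
    # fragments on the aggregate page still resolve to something.
    parts = node.get("refuri", "").rsplit("#", 1)
    if len(parts) == 2:
        self.body.append('<p id="%s"></p>' % parts[1])
    _orig_visit_reference(self, node)

HTMLTranslator.visit_reference = visit_reference
```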

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39086

Differential Revision: D22462421

Pulled By: jlin27

fbshipit-source-id: b8f913b38c56ebb857c5a07bded6509890900647
2020-07-09 15:20:10 -07:00
b952eaf668 Preserve CUDA gencode flags (#41173)
Summary:
Add `torch._C._cuda_getArchFlags()` that returns the list of architectures `torch_cuda` was compiled with
Add `torch.cuda.get_arch_list()` and `torch.cuda.get_gencode_flags()` methods that return the architecture list and gencode flags PyTorch was compiled with
Print a warning if some of the GPUs are not compatible with any of the CUBINs
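
Usage of the new introspection methods (outputs shown are illustrative):
```python
import torch

print(torch.cuda.get_arch_list())      # e.g. ['sm_37', 'sm_50', 'sm_60', 'sm_70']
print(torch.cuda.get_gencode_flags())  # the -gencode flags used at build time
```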

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41173

Differential Revision: D22459998

Pulled By: malfet

fbshipit-source-id: 65d40ae29e54a0ba0f3f2da11b821fdb4d452d95
2020-07-09 14:59:35 -07:00
e374280768 Use explicit templates in CUDALoops kernels (#41059)
Summary:
Follow up after https://github.com/pytorch/pytorch/pull/40992
Use explicit templates instead of lambdas to reduce binary size without affecting the perf by 100-200Kb per arch per CU, namely:
BinaryMulDivKernel.cu 3.8Mb -> 3.5Mb
CompareEQKernel.cu 1.8Mb -> 1.7Mb
BinaryAddSubKernel.cu 2.0Mb -> 1.8Mb
BinaryBitwiseOpsKernels.cu 2.6Mb -> 2.3Mb

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41059

Differential Revision: D22458928

Pulled By: malfet

fbshipit-source-id: cca623bb6e769cfe372977b08463d98b1a02dd14
2020-07-09 14:55:38 -07:00
1f1351488e Revert D21870844: Create lazy_dyndeps to avoid caffe2 import costs.
Test Plan: revert-hammer

Differential Revision:
D21870844 (07fd5f8ff9)

Original commit changeset: 3f65fedb65bb

fbshipit-source-id: 4f661072d72486a9c14711e368247b3d30e28af9
2020-07-09 14:18:38 -07:00
22f940b7bd add clang code coverage compile flags (#41103)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41103

add a CLANG_CODE_COVERAGE option to CMakeLists. If the option is ON, the compile flags needed for code coverage are added.

Test Plan:
Cloned the PyTorch source locally, applied these changes, and built with `CLANG_CODE_COVERAGE ON` and `BUILD_TESTS ON`. Ran a manual test; code coverage report attached.

{F243609020}

Reviewed By: malfet

Differential Revision: D22422513

fbshipit-source-id: 27a31395c31b5b5f4b72523954722771d8f61080
2020-07-09 14:14:18 -07:00
2cf31fb577 Fix max_pool2d perf regression (#41174)
Summary:
The two pointer variables `ptr_top_diff` and `ptr_top_mask` were introduced in https://github.com/pytorch/pytorch/issues/38953. Some end-to-end testing showed training performance regression due to this change. The performance is restored after removing the two pointer variables, and adding offset directly below in the indexing [ ] calculations.

See PR change https://github.com/pytorch/pytorch/pull/38953/files#diff-8085d370f4e98295074a51b8a1f829e9R187-R188

e4a3c584d5/aten/src/ATen/native/cuda/DilatedMaxPool2d.cu (L186-L195)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41174

Differential Revision: D22451565

Pulled By: ngimel

fbshipit-source-id: 37ed6b9fd785e1be31a027ef5d60794656cc575a
2020-07-09 14:00:05 -07:00
1922f2212a Make IterableDataset dataloader.__len__ warning clearer (#41175)
Summary:
Based on discussion with jlucier (https://github.com/pytorch/pytorch/pull/38925#issuecomment-655859195). The `batch_size` change isn't made because the data loader only has the notion of a `batch_sampler`, not a batch size. If `batch_size`-dependent sharding is needed, users can still access it from their own code.
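
For context, the warning fires in cases like this sketch, where the reported length comes from the dataset and cannot account for batching or sharding:
```python
import torch
from torch.utils.data import DataLoader, IterableDataset

class Stream(IterableDataset):
    def __iter__(self):
        return iter(range(10))

    def __len__(self):
        return 10  # length of the dataset, not the number of batches

loader = DataLoader(Stream(), batch_size=4)
print(len(loader))  # warns: the estimate ignores batching and worker sharding
```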

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41175

Differential Revision: D22456525

Pulled By: zou3519

fbshipit-source-id: 5281fcf14807f219de06e32107d5fe7d5b6a8623
2020-07-09 13:49:29 -07:00
e84ef45dd3 [JIT] Fix JIT triage workflow (#41170)
Summary:
**Summary**
This commit fixes the JIT triage workflow based on testing done in my
own fork.

**Test Plan**
This commit has been tested against my own fork. This commit is
currently at the tip of my master branch, and if you open an issue in my
fork and label it JIT, it will be added to the Triage Review project in
that fork under the Needs triage column.

*Old issue that is labelled JIT later*

<img width="700" alt="Captura de Pantalla 2020-07-08 a la(s) 6 59 42 p  m" src="https://user-images.githubusercontent.com/4392003/86988551-5b805100-c14d-11ea-9de3-072916211f24.png">

*New issue that is opened with the JIT label*
<img width="725" alt="Captura de Pantalla 2020-07-08 a la(s) 6 59 17 p  m" src="https://user-images.githubusercontent.com/4392003/86988560-60dd9b80-c14d-11ea-94f0-fac01a0d239b.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41170

Differential Revision: D22460584

Pulled By: SplitInfinity

fbshipit-source-id: 278483cebbaf3b35e5bdde2a541513835b644464
2020-07-09 12:40:01 -07:00
c1fa74b2d7 [quant][refactor] test_only_eval_fn (#41078)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41078

Test Plan: Imported from OSS

Differential Revision: D22420699

fbshipit-source-id: cf105cd41d83036df65c6bb3147cc14aaf755897
2020-07-09 12:34:05 -07:00
7c29a4e66f Don't add NCCL dependency to gloo if system NCCL is used (#41180)
Summary:
This avoids what is currently only a warning from CMake:
```
The dependency target "nccl_external" of target "gloo_cuda" does not exist.
Call Stack (most recent call first):
  CMakeLists.txt:411 (include)
```

This will become a real problem once policy CMP0046 is set, which will turn this warning into an error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41180

Differential Revision: D22460623

Pulled By: malfet

fbshipit-source-id: 0222b12b435e5e2fdf2bc85752f95abba1e3d4d5
2020-07-09 12:10:29 -07:00
2252188e85 [caffe2] Fix spatial_batch_norm_op dividision-by-zero crash (#40806)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40806

When the input is empty, the operator will crash on "runtime error: division by zero". This has been causing Inference platform server crashes.

Example crash logs:

{P134526683}

Test Plan:
Unit test

See reproducing steps in the Test Plan of D22300135

Reviewed By: houseroad

Differential Revision: D22302089

fbshipit-source-id: aaa5391fddc86483b0f3aba3efa7518e54913635
2020-07-09 12:04:11 -07:00
df1f8a48d8 add null check for c2 tensor conversion (#41096)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41096

The spark spot model had some issues in tensor conversion, see P134598596. It happens when we convert an undefined c10 tensor to a caffe2 tensor.
This diff adds a null check.

Test Plan: spark spot model runs without problem

Reviewed By: smessmer

Differential Revision: D22330705

fbshipit-source-id: dfe0f29a48019b6611cad3fd8f2ae49e8db5427e
2020-07-09 11:44:23 -07:00
a318234eb0 Print raising warnings in Python rather than C++ if other error occurs (#41116)
Summary:
When we return to Python from C++ in PyTorch and have both warnings and an error, we have the problem of what to do when the warnings throw, because we can only throw one error.
Previously, if we had an error, we punted all warnings to the C++ warning handler, which would write them to stderr (i.e. file descriptor 2) or pass them on to glog.

This has drawbacks if an error happened:
- Warnings are not handled through Python even if they don't raise,
- warnings are always printed with no way to suppress this,
- the printing bypasses sys.stderr, so Python modules wanting to
  modify this don't work (with the prominent example being Jupyter).

This patch does the following instead:
- Set the warning using standard Python extension mechanisms,
- if Python decides that this warning is an error and we have a
  PyTorch error, we print the warning through Python and clear
  the error state (from the warning).

This resolves the three drawbacks discussed above; in particular, it fixes https://github.com/pytorch/pytorch/issues/37240 .
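
The practical effect is that Python's warning filters now see these warnings; a sketch that escalates a C++-originated warning into an exception, using torch.range's deprecation warning as a convenient trigger:
```python
import warnings
import torch

with warnings.catch_warnings():
    warnings.simplefilter("error")  # turn warnings into exceptions
    try:
        torch.range(0, 3)  # deprecated; its warning now raises here
    except UserWarning as e:
        print("caught through Python's warning machinery:", e)
```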

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41116

Differential Revision: D22456393

Pulled By: albanD

fbshipit-source-id: c3376735723b092efe67319321a8a993402985c7
2020-07-09 11:38:07 -07:00
07fd5f8ff9 Create lazy_dyndeps to avoid caffe2 import costs. (#39488)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39488

Currently caffe2.InitOpLibrary does the dll import unilaterally. If we instead
make a lazy version and use it, then many pieces of code which do not need the
caffe2 operators get a lot faster.

On a real test, the import time went from 140s to 68s.

This also cleans up the algorithm slightly (although it makes a very minimal
difference), by parsing the list of operators once, rather than every time a
new operator is added, since we defer the RefreshCall until after we've
imported all the operators.

The key way we maintain safety is that as soon as someone does an operation
which requires an operator (or could), we force importing of all available
operators.

Future work could include trying to identify which code is needed for which
operator and only import the needed ones. There may also be wins available by
playing with dlmopen (which opens within a namespace), or seeing if the dl
flags have an impact (I tried this and didn't see an impact, but dlmopen may
make it better).

Test Plan:
I added a new test a lazy_dyndep_test.py (copied from all_compare_test.py).
I'm a little concerned that I don't see any explicit tests for dyndep, but this
should provide decent coverage.

Differential Revision: D21870844

fbshipit-source-id: 3f65fedb65bb48663670349cee5e1d3e22d560ed
2020-07-09 11:34:57 -07:00
f69d6a7ea3 [ONNX] Update Default Value of recompute_scale_factor in Interpolate (#39453)
Summary:
This is a duplicate of https://github.com/pytorch/pytorch/pull/38362

"This PR completes Interpolate's deprecation process for recomputing the scales values, by updating the default value of the parameter recompute_scale_factor as planned for pytorch 1.6.0.
The warning message is also updated accordingly."
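
Callers can sidestep the deprecation path entirely by passing the flag explicitly; a sketch (which default value 1.6 lands on is deliberately not restated here):
```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
# Passing recompute_scale_factor explicitly avoids the deprecation warning.
y = F.interpolate(x, scale_factor=2.0, mode="bilinear",
                  align_corners=False, recompute_scale_factor=False)
```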

I'm recreating this PR as previous one is not being updated.

cc gchanan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39453

Reviewed By: hl475

Differential Revision: D21955284

Pulled By: houseroad

fbshipit-source-id: 911585d39273a9f8de30d47e88f57562216968d8
2020-07-09 11:32:49 -07:00
9b3a212d30 quantizer.cpp: fix cuda memory pinning (#41139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41139

Fixes the test case in https://github.com/pytorch/pytorch/issues/41115
by using PyTorch's CUDA allocator instead of the old Caffe2 one.

Test Plan:
run the test case from the issue:
https://gist.github.com/vkuzo/6d013aa1645cb986d0d4464a931c779b

let's run CI and see what it uncovers

Imported from OSS

Reviewed By: malfet

Differential Revision: D22438787

fbshipit-source-id: 0853b0115d198a99c43e6176aef34ea951bf5c2e
2020-07-09 11:14:58 -07:00
62cee0001e Move async + serialization implementation out of 'jit/__init__.py' (#41018)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41018

See https://github.com/pytorch/pytorch/pull/40807 for context.

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D22393869

Pulled By: suo

fbshipit-source-id: a71cc571a423ccb81cd148444dc2a18d2ee43464
2020-07-09 10:10:01 -07:00
c8deca8ea8 Update pthreadpool to pthreadpool:029c88620802e1361ccf41d1970bd5b07fd6b7bb. (#40524)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40524

Reviewed By: ezyang

Differential Revision: D22215742

Pulled By: AshkanAliabadi

fbshipit-source-id: ef594e0901337a92b21ddd44e554da66c723eb7c
2020-07-09 10:00:36 -07:00
c038f8afcc Do not install nvidia docker for non-NVIDIA configs (#41144)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41144

Differential Revision: D22457124

Pulled By: malfet

fbshipit-source-id: e615199cb78b315aa700efcc7332ebf4299212bf
2020-07-09 09:24:26 -07:00
690946c49d Generalize constant_table from tensor only to ivalue (#40718)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40718

Currently all constants except tensors must be inlined during serialization;
tensors are stored in the constant table. This patch generalizes this capability
to any IValue. This is particularly useful for non-ASCII string literals that
cannot be inlined.

Test Plan: Imported from OSS

Differential Revision: D22298169

Pulled By: bzinodev

fbshipit-source-id: 88cc59af9cc45e426ca8002175593b9e431f4bac
2020-07-09 09:09:40 -07:00
86f72953dd [Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D22452776

fbshipit-source-id: a103da6a5b1db7f1c91ca25490358da268fdfe96
2020-07-09 08:49:32 -07:00
3e26709a4e Remove copy_ warnings for angle and abs for complex tensors (#41152)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41152

fixes https://github.com/pytorch/pytorch/issues/40838

Test Plan: Imported from OSS

Differential Revision: D22444357

Pulled By: anjali411

fbshipit-source-id: 2879d0cffc0a011c624eb8e00c7b64bd33522cc3
2020-07-09 08:05:36 -07:00
7ff7c9738c Revert D22418756: [pytorch][PR] Migrate addmm, addbmm and THBlas_gemm to ATen
Test Plan: revert-hammer

Differential Revision:
D22418756 (6725c034b6)

Original commit changeset: 44e7bb596426

fbshipit-source-id: cbaaf3ad277648901700ef0e47715580e8f8e0dc
2020-07-09 07:47:19 -07:00
bf9cc5c776 Add callback with TLS state API in futures (#40326)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40326

Adds a helper function `addCallbackWithTLSState` to both
torch/csrc/utils/future.h which is used internally by RPC framework and the JIT
future. Uses this helper function to avoid to pass in TLS state where it is needed for rpc and `record_function_ops.cpp`. For example, the following:

```
at::ThreadLocalState tls_state;
fut->addCallback([tls_state = std::move(tls_state)]() {
at::ThreadLocalStateGuard g(tls_state);
some_cb_that_requires_tls_state();
}
```

becomes

```
fut->addCallbackWithTLSState(some_cb_that_requires_tls_state);
```
ghstack-source-id: 107383961

Test Plan: RPC Tests and added a test in test_misc.cpp

Differential Revision: D22147634

fbshipit-source-id: 46c02337b90ee58ca5a0861e932413c40d06ed4c
2020-07-08 23:25:35 -07:00
155fb22e77 Run single-threaded gradgradcheck in testnn (#41147)
Summary:
Reland https://github.com/pytorch/pytorch/issues/40999

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41147

Reviewed By: mruberry

Differential Revision: D22450357

Pulled By: ngimel

fbshipit-source-id: 02b6e020af5e6ef52542266bd9752b9cfbec4159
2020-07-08 22:53:27 -07:00
8e2841781e [easy] Use torch.typename in JIT error messages (#41024)
Summary:
Noticed while trying to script one of the models which happened to have numpy values as constants. Lacking the numpy prefix in the error message was quite confusing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41024

Differential Revision: D22426399

Pulled By: dzhulgakov

fbshipit-source-id: 06158b75355fac6871e4861f82fc637c2420e370
2020-07-08 21:49:37 -07:00
33e26656fa list workaround for CREATE_OBJECT failure (#41129)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41129

Test Plan: Imported from OSS

Differential Revision: D22436064

Pulled By: ann-ss

fbshipit-source-id: 7cfc38eb953410edfe3d21346c6e377c3b3bfc1f
2020-07-08 18:36:04 -07:00
302cf6835e [ROCm][Caffe2] Enable MIOpen 3D Pooling (#38260)
Summary:
This PR contains the following updates:
1. MIOpen 3D pooling enabled in Caffe2.
2. Refactored the MIOpen pooling code in caffe2.
3. Enabled unit test cases for 3D pooling.

CC: ezyang jeffdaily ashishfarmer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/38260

Differential Revision: D21524754

Pulled By: xw285cornell

fbshipit-source-id: ddfe09dc585cd61e42eee22eff8348d326fd0c3b
2020-07-08 17:42:55 -07:00
f71cccc457 test: Add option to continue testing through error (#41136)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41136

Running this within CI seems impossible since this script exits out
after one failed test, so let's just add an option that CI can use to
power through these errors.

Should not affect current functionality.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22441694

Pulled By: seemethere

fbshipit-source-id: 7f152fea15af9d47a964062ad43830818de5a109
2020-07-08 17:26:13 -07:00
04004bf10c Fix a minor typo "forget add" -> "forget to add" (#41131)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41131

Differential Revision: D22441122

Pulled By: gmagogsfm

fbshipit-source-id: 383ef167b7742e2f211d1cae010b6ebb37c6e7a0
2020-07-08 17:00:42 -07:00
c7768e21b1 [JIT] Add GitHub workflow for importing issues to triage project (#41056)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41056

**Summary**
This commit adds a new GitHub workflow that automatically adds a card to
the "Need triage" section of the project board for tracking JIT triage
for each new issue that is opened and labelled "jit".

**Test Plan**
???

Test Plan: Imported from OSS

Differential Revision: D22444262

Pulled By: SplitInfinity

fbshipit-source-id: 4e7d384822bffb978468c303322f3e2c04062644
2020-07-08 17:00:40 -07:00
6725c034b6 Migrate addmm, addbmm and THBlas_gemm to ATen (#40927)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24679, closes https://github.com/pytorch/pytorch/issues/24678

`addbmm` depends on `addmm` so needed to be ported at the same time. I also removed `THTensor_(baddbmm)` which I noticed had already been ported so was just dead code.

After having already written this code, I had to fix merge conflicts with https://github.com/pytorch/pytorch/issues/40354 which revealed there was already an established place for cpu blas routines in ATen. However, the version there doesn't make use of ATen's AVX dispatching so thought I'd wait for comment before migrating this into that style.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40927

Differential Revision: D22418756

Pulled By: ezyang

fbshipit-source-id: 44e7bb5964263d73ae8cc6adc5f6d4e966476ae6
2020-07-08 17:00:37 -07:00
3f32332ee6 [JIT][Easy]move remove mutation to own file (#41137)
Summary:
This should be in its own file...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41137

Reviewed By: jamesr66a

Differential Revision: D22437922

Pulled By: eellison

fbshipit-source-id: 1b62dde1a4ebac673b5c60aea4f398f734d62501
2020-07-08 17:00:35 -07:00
b8d2ccf009 Unify TensorOptions signatures (#39611)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39611

A few ops have been taking non-optional ScalarType, Device and Layout. That isn't supported by the hacky wrapper that makes those
kernels work with the c10 operator library. This PR unifies the signatures and makes those ops c10-full.
ghstack-source-id: 107330186

Test Plan: waitforsandcastle

Differential Revision: D21915788

fbshipit-source-id: 39f0e114f2766a3b27b80f93f2c1a95fa23c78d4
2020-07-08 17:00:33 -07:00
10caf58a52 [typing] tensor._version is int (#41125)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41125

Differential Revision: D22440717

Pulled By: ezyang

fbshipit-source-id: f4849c6e13f01cf247b2f64f68a621b055c8bc17
2020-07-08 17:00:30 -07:00
97052c5fa8 Extend SparseAdagrad fusion with stochastic rounding FP16 (#41107)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41107

Extend row wise sparse Adagrad fusion op to FP16 (stochastic rounding) for PyTorch.

Differential Revision: D22195408

fbshipit-source-id: e9903ca7ca3b542fb56f36580e69bb2a39b554f6
2020-07-08 16:58:53 -07:00
af2680e9ce Update ShipIt sync
fbshipit-source-id: ceb761e28fe8c53bc53f3b82b304ea8ab0e98183
2020-07-08 16:52:13 -07:00
0edbe6b063 Add a link in RPC doc page to point to PT Distributed overview (#41108)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41108

Test Plan: Imported from OSS

Differential Revision: D22440751

Pulled By: mrshenli

fbshipit-source-id: 9e7b002091a3161ae385fdfcc26484ae8fc243bb
2020-07-08 14:00:05 -07:00
9d1138afec Remove unnecessary atomic ops in DispatchStub (#40930)
Summary:
I noticed this very unusual use of atomics in `at::native::DispatchStub`. The comment asserts that `choose_cpu_impl()` will always return the same value on every thread, yet for some reason it uses a CAS loop to exchange the value instead of a simple store? That makes no sense considering it doesn't even read the exchanged value.

This replaces the CAS loop with a simple store and also improves the non-initializing case to a single atomic load instead of two.

For reference, the `compare_exchange` was added in https://github.com/pytorch/pytorch/issues/32148 and the while loop added in https://github.com/pytorch/pytorch/issues/35794.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40930

Differential Revision: D22438224

Pulled By: ezyang

fbshipit-source-id: d56028ce18c8c5dbabdf366379a0b6aaa41aa391
2020-07-08 13:55:11 -07:00
ec58d739c6 .circleci: Remove pynightly jobs
These jobs didn't really fulfill the purpose they once had, since the
Travis Python versions were basically locked to 3.7.

Going to go ahead and remove these along with its docker jobs as well
since we don't actively need them anymore.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

ghstack-source-id: cdfc4fc2ae15a0c86d322cc706d383d6bc189fbc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41134
2020-07-08 13:46:42 -07:00
dfd21ec00d Revert D22418716: [JIT] Add support for backend-lowered submodules
Test Plan: revert-hammer

Differential Revision:
D22418716 (6777ea19fe)

Original commit changeset: d2b2c6d5d2cf

fbshipit-source-id: 5ce177e13cab0be60020f8979f9b6c520cc8654e
2020-07-08 13:14:21 -07:00
2bc9ee97d1 Revert D22418731: [JIT] Add out-of-source-tree to_backend tests
Test Plan: revert-hammer

Differential Revision:
D22418731 (e2a291b396)

Original commit changeset: 621ba4efc1b1

fbshipit-source-id: 475ae24c5b612fe285035e5ebb92ffc66780a468
2020-07-08 13:11:45 -07:00
131a0ea277 Add version number to bytecode. (#36439)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36439

A proposal of versioning in bytecode, as suggested by dzhulgakov in the internal post: https://fb.workplace.com/groups/pytorch.mobile.work/permalink/590192431851054/

kProducedBytecodeVersion is added. If the model version is not the same as the number in the code, an error will be thrown.

The updated bytecode would look like below. It's a tuple of elements, where the first element is the version number.
```
(3,
 ('__torch__.m.forward',
  (('instructions',
    (('STOREN', 1, 2),
     ('DROPR', 1, 0),
     ('MOVE', 2, 0),
     ('OP', 0, 0),
     ('RET', 0, 0))),
   ('operators', (('aten::Int', 'Tensor'),)),
   ('constants', ()),
   ('types', ()),
   ('register_size', 2))))
```

Test Plan: Imported from OSS

Differential Revision: D22433532

Pulled By: iseeyuan

fbshipit-source-id: 6d62e4abe679cf91a8e18793268ad8c1d94ce746
2020-07-08 12:30:58 -07:00
58d7d91f88 Return atomic (#41028)
Summary:
Per title. This is not used currently in the PyTorch codebase, but it is a legitimate use case, and we have extensions that want to do that and are forced to roll their own atomic implementations for non-standard types. Whether an atomic op returns the old value or not should not affect performance; the compiler is able to generate correct code depending on whether the return value is used. https://godbolt.org/z/DBU_UW.
Atomic operations for non-standard integer types (1,2 and 8 byte-width) are left as is, with void return.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41028

Differential Revision: D22425008

Pulled By: ngimel

fbshipit-source-id: ca064edb768a6b290041a599e5b50620bdab7168
2020-07-08 11:54:24 -07:00
351407dd75 Disables unary op casting to output dtype (#41097)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41047.

Some CPU kernel implementations don't call `cast_outputs()`, so when CPU temporaries were created to hold their outputs they weren't copied back to the out parameters correctly. Instead of fixing that issue, for simplicity this PR disables the behavior. The corresponding test in test_type_promotion.py is expanded with more operations to verify that unary ops can no longer have out arguments with different dtypes than their inputs (except in special cases like torch.abs which maps complex inputs to float outputs and torch.deg2rad which is secretly torch.mul).
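
Concretely, a mismatched out dtype on a unary op is now rejected instead of silently producing a wrong or stale result; a sketch (the exact error type is assumed to be RuntimeError):
```python
import torch

x = torch.randn(3)
out = torch.empty(3, dtype=torch.float64)
# The input is float32 but out is float64; this now errors rather than cast.
torch.neg(x, out=out)  # expected: RuntimeError
```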

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41097

Differential Revision: D22422352

Pulled By: mruberry

fbshipit-source-id: 8e61d34ef1c9608790b35cf035302fd226fd9421
2020-07-08 11:48:40 -07:00
c93e96fbd9 [jit] move script-related implementation out of torch/jit/__init__.py (#40902)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40902

See the bottom of this stack for context.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D22360210

Pulled By: suo

fbshipit-source-id: 4275127173a36982ce9ad357aa344435b98e1faf
2020-07-08 11:38:34 -07:00
6c9b869930 [ROCm] Skip Conv2d, Conv3d transpose fp16 test for ROCm3.5 (#41088)
Summary:
There's a regression in MIOpen in ROCm 3.5 that causes the autocast tests to fail. Skipping the tests for now; they will be re-enabled once the fixes are in MIOpen.

ezyang jeffdaily sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41088

Differential Revision: D22419823

Pulled By: xw285cornell

fbshipit-source-id: 347fb9a03368172fe0b263d14d27ee0c3efbf4f6
2020-07-08 11:13:49 -07:00
dde18041a6 [quant][graphmode] Refactor quantization patterns (#40894)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40894

Test Plan:
python test/test_quantization.py

Imported from OSS

Differential Revision: D22403901

fbshipit-source-id: e0bcf8a628c6a1acfe6fa10a52912360a619bc62
2020-07-08 10:36:25 -07:00
03eec07956 Move error messages in-line in _vmap_internals.py (#41077)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41077

This PR is a refactor that moves error messages into their callsites in
`_vmap_internals.py`. Furthermore, because a little bird told me we've
dropped python 3.5 support, this PR adopts f-string syntax to clean up
the string replace logic. Together these changes make the error messages
read better IMO.

Test Plan:
- `python test/test_vmap.py -v`. There exists tests that invoke each of the
error messages.

Differential Revision: D22420473

Pulled By: zou3519

fbshipit-source-id: cfd46b2141ac96f0a62864928a95f8eaa3052f4e
2020-07-08 08:42:56 -07:00
de4fc23381 clean up duplicated op names (#41092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41092

Added overload names for some full-JIT operators and removed some duplicated op registrations.

Test Plan:
apply D21032976, then buck run fbsource//xplat/caffe2/fb/pytorch_predictor:predictor
make sure there's no runtime error in operator registration

Reviewed By: iseeyuan

Differential Revision: D22419922

fbshipit-source-id: f651898e75b5bdb8dc03fc00b136689536c51707
2020-07-08 06:39:39 -07:00
e4fbcaa2bc [Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D22429730

fbshipit-source-id: 585d8df36d7fa18a9c2d3fa54c1d333bf94464d0
2020-07-08 05:02:26 -07:00
3d3fd13e04 [quant][graphmode][fix] filter for list append change (#41020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41020

Only support quantization of list append for List[Tensor]

Test Plan: Imported from OSS

Differential Revision: D22420698

fbshipit-source-id: 179677892037e136d90d16230a301620c3111063
2020-07-08 03:44:44 -07:00
e0e8b98c43 Export logit op to pytorch
Summary: Export the logit op to PyTorch for better preproc perf

Test Plan:
unit test
Also tested with model re-generation

Reviewed By: houseroad

Differential Revision: D22324611

fbshipit-source-id: 86accb6b4528e5c818d2c3f8c67926f279d158d6
2020-07-08 02:27:09 -07:00
6ef94590fa match int8 quantization of nnpi (#41094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41094

mimic nnpi's quantization operations

removed redundant int8 test

Test Plan: ran FC with sizes up to 5; running bigger sizes now.

Reviewed By: venkatacrc

Differential Revision: D22420537

fbshipit-source-id: 91211c8a6e4d3d3bec2617b758913b44aa44b1b1
2020-07-08 00:07:42 -07:00
e2a291b396 [JIT] Add out-of-source-tree to_backend tests (#40842)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40842

**Summary**
This commit adds out-of-source-tree tests for `to_backend`. These tests check
that a Module can be lowered to a backend, exported, loaded (in both
Python and C++) and executed.

**Fixes**
This commit fixes #40067.

Test Plan: Imported from OSS

Differential Revision: D22418731

Pulled By: SplitInfinity

fbshipit-source-id: 621ba4efc1b121fa76c9c7ca377792ac7440d250
2020-07-07 21:00:43 -07:00
6777ea19fe [JIT] Add support for backend-lowered submodules (#40841)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40841

**Summary**
This commit adds support for using `Modules` that have been lowered as
submodules in `ScriptModules`.

**Test Plan**
This commit adds execution and save/load tests to test_backends.py for
backend-lowered submodules.

**Fixes**
This commit fixes #40069.

Test Plan: Imported from OSS

Differential Revision: D22418716

Pulled By: SplitInfinity

fbshipit-source-id: d2b2c6d5d2cf3042a620b3bde7d494f1abe28dc1
2020-07-07 21:00:40 -07:00
5a4c45f8d1 [JIT] Move TestBackend to test directory (#40840)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40840

**Summary**
This commit moves the TestBackend used for the JIT backend
extension to the tests directory. It was temporarily placed
in the source directory while figuring out some details of
the user experience for this feature.

**Test Plan**
`python test/test_jit.py TestBackends`

**Fixes**
This commit fixes #40067.

Test Plan: Imported from OSS

Differential Revision: D22418682

Pulled By: SplitInfinity

fbshipit-source-id: 9356af1341ec4d552a41c2a8929b327bc8b56057
2020-07-07 21:00:38 -07:00
3e01931e49 [JIT] Separate to_backend API into libtorch and libtorch_python (#40839)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40839

**Summary**
This commit splits the to_backend API properly into
`libtorch` and `libtorch_python`. The backend interface and all
of the code needed to run a graph on a backend is in
libtorch, and all of the code related to creating a Python binding
for the lowering process is in `libtorch_python`.

**Test Plan**
`python test/test_jit.py TestBackends`

**Fixes**
This commit fixes #40072.

Test Plan: Imported from OSS

Differential Revision: D22418664

Pulled By: SplitInfinity

fbshipit-source-id: b96e0c34ab84e45dff0df68b8409ded57a55ab25
2020-07-07 20:58:42 -07:00
0911c1e71a Added index_put to promotelist (#41035)
Summary:
[index_put](https://pytorch.org/docs/master/tensors.html#torch.Tensor.index_put) requires the src and dst tensors to be the same dtype, so IMO it belongs on the promote list when autocast is active (the output should be the widest dtype among the input dtypes).

I also put some other registrations in alphabetical order.
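A sketch of the resulting behavior under autocast (shapes and values are illustrative):

```python
import torch

a = torch.randn(4, 4, device="cuda")                    # float32
v = torch.randn(2, device="cuda", dtype=torch.float16)  # float16
idx = (torch.tensor([0, 1], device="cuda"),
       torch.tensor([2, 3], device="cuda"))

with torch.cuda.amp.autocast():
    # with index_put on the promote list, autocast casts the inputs to the
    # widest participating dtype (float32 here) before running the op
    out = a.index_put(idx, v)
```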

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41035

Differential Revision: D22418305

Pulled By: ngimel

fbshipit-source-id: b467cb16ac6c2ba1f9e43531f69a144b17f00b87
2020-07-07 20:36:55 -07:00
c55d8a6f62 Remove std::complex from c10::Scalar (#39831)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39831

Differential Revision: D22018505

Pulled By: ezyang

fbshipit-source-id: 4719c0f1673077598c5866dafc7391d9e074f4eb
2020-07-07 20:31:42 -07:00
3615e344a3 Unit test case for the Int8FC to cover quantization scale errors. (#41100)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41100

Unit test case for the Int8FC to cover quantization scale errors.

Test Plan: test_int8_ops_nnpi.py test case test_int8_small_input.

Reviewed By: hyuen

Differential Revision: D22422353

fbshipit-source-id: b1c1baadc32751cd7e98e0beca8f0c314d9e5f10
2020-07-07 20:04:17 -07:00
bacca663ff Fix Broken Link in CONTRIBUTING.md (#41066)
Summary:
Spotted a broken link, and while I was at it, fixed a few little language and formatting nits.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41066

Reviewed By: mruberry

Differential Revision: D22415371

Pulled By: dongreenberg

fbshipit-source-id: 7d11c13235b28a01886063c11a4c5ccb333c0c02
2020-07-07 20:02:47 -07:00
445128d0f2 Add PyTorch Glossary (#40639)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40639

Differential Revision: D22421207

Pulled By: gmagogsfm

fbshipit-source-id: 7df8bfc85e28bcf1fb08892a3671e7a9cb0dee9c
2020-07-07 19:53:44 -07:00
bce75a2536 add first implementation of swish (#41085)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41085

add the first LUT implementation of swish

Test Plan:
Compared against swish lowered as x*sigmoid(x); had to increase the error threshold, but it looks generally right.

Reviewed By: venkatacrc

Differential Revision: D22418117

fbshipit-source-id: c75fa496aa7a5356ddc87f1d61650f432e389457
2020-07-07 19:48:34 -07:00
a8bc7545d5 use PYTORCH_ROCM_ARCH to set GLOO_ROCM_ARCH (#40170)
Summary:
Previously it used the default arch set, which may or may not coincide with the user's.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40170

Differential Revision: D22400866

Pulled By: xw285cornell

fbshipit-source-id: 222ba684782024fa68f37bf7d4fdab9a2389bdea
2020-07-07 19:41:02 -07:00
054e5d8943 .circleci: Fix job-specs-custom docker tag (#41111)
Summary:
Should resolve master breakages

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41111

Differential Revision: D22426863

Pulled By: seemethere

fbshipit-source-id: 561eaaa0d97a6fe13c75c1a73e4324b92d94afed
2020-07-07 19:32:23 -07:00
cc29c192a6 add "aten::add.str" op and remove two duplicated ops
Summary: add "aten::add.str" op and remove two duplicated ops

Test Plan:
```
buck run //xplat/caffe2/fb/pytorch_predictor:converter /mnt/vol/gfsfblearner-altoona/flow/data/2020-06-29/1ca8a85f-dbd5-4181-b5fc-63d24465c1fc/201084299/2068673333/model.pt1 ~/model_f201084299.bc

buck run xplat/assistant/model_benchmark_tool/mobile/binary/:lite_predictor -- --model ~/model_f201084299.bc --input_file /tmp/gc_model_input.txt --model_input_args src_tokens,dict_feat,contextual_token_embedding --warmup 1 --iter 2
```

Reviewed By: pengtxiafb

Differential Revision: D22395604

fbshipit-source-id: 0ce21e8b8ae989d125f2f3739523e3c486590b9f
2020-07-07 19:07:35 -07:00
a4fd4905c8 bump docker version to more recent tag (#41105)
Summary:
The tag was originally introduced in https://github.com/pytorch/pytorch/pull/40385

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41105

Reviewed By: malfet

Differential Revision: D22423910

Pulled By: seemethere

fbshipit-source-id: 336fc7ef5243a5863c59762efd182ed7ea6dfc2c
2020-07-07 18:28:24 -07:00
eea535742f Add bfloat16 support for nccl path (#38515)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38515

Differential Revision: D22420896

Pulled By: ezyang

fbshipit-source-id: 80d2d0c2052c91c9035e1e025ebb14e210cb0100
2020-07-07 18:07:06 -07:00
38b465db27 ROCm 3.5.1 image (#40385)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40385

Differential Revision: D22421426

Pulled By: ezyang

fbshipit-source-id: 1a131cdb1a0d5ad7ccd55dc1db17cae982cc286b
2020-07-07 15:37:23 -07:00
5e03a1e926 Add support for int[]? arguments in native_functions.yaml (#37174)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37174

ghstack-source-id: 106938112

Test Plan: Upcoming diffs use this for upsampling.

Differential Revision: D21210002

fbshipit-source-id: d6a55ab6420c05a92873a569221b613149aa0daa
2020-07-07 13:52:20 -07:00
4dad829ea3 In interpolate, inline the call to _interp_output_size (#37173)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37173

This function is only called in one place, so inline it.  This eliminates
boilerplate related to overloads and allows for further simplification
of shared logic in later diffs.

All shared local variables have the same names (from closed_over_args),
and no local variables accidentally collide.
ghstack-source-id: 106938108

Test Plan: Existing tests for interpolate.

Differential Revision: D21209995

fbshipit-source-id: acfadf31936296b2aac0833f704764669194b06f
2020-07-07 13:52:18 -07:00
3c1c74c366 In interpolate, move exceptional cases to the bottom (#37172)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37172

This improves readability by keeping cases with similar behavior close
together.  It should also have a very tiny positive impact on perf.
ghstack-source-id: 106938109

Test Plan: Existing tests for interpolate.

Differential Revision: D21209996

fbshipit-source-id: c813e56aa6ba7370b89a2784fcb62cc146005258
2020-07-07 13:52:16 -07:00
8f0e254790 In interpolate, use if instead of elif (#37171)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37171

Every one of these branches returns or raises, so there's no need for elif.
This makes it a little easier to reorder and move conditions.
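A toy illustration of the pattern (not the actual interpolate code):

```python
def classify(x):
    if x < 0:
        raise ValueError("negative")
    if x == 0:         # no elif needed: the branch above already raised
        return "zero"
    return "positive"  # every path returns or raises, so ordering stays explicit
```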
ghstack-source-id: 106938110

Test Plan: Existing test for interpolate.

Differential Revision: D21209992

fbshipit-source-id: 5c517e61ced91464b713f7ccf53349b05e27461c
2020-07-07 13:49:53 -07:00
93778f3b24 Expose certain methods in OpaqueTensorImpl. (#41060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41060

Exposes a const ref to the opaque handle and makes copy_tensor_metadata a
protected method. This helps with reusing code in subclasses of OpaqueTensorImpl.

Test Plan: waitforbuildbot

Reviewed By: dzhulgakov

Differential Revision: D22406602

fbshipit-source-id: e3b8338099f257da7f6bbff679f1fdb71e5f335a
2020-07-07 13:36:32 -07:00
8d570bc708 Decouple DataParallel/DistributedDataParallel from CUDA (#38454)
Summary:
Decouple DataParallel/DistributedDataParallel from CUDA to support more device types.
- Move torch/cuda/comm.py to torch/nn/parallel/comm.py, with minor changes for common device support. torch.cuda.comm is kept as is for backward compatibility (see the sketch below).
- Provide common APIs for arbitrary device types without changing the existing CUDA APIs in the torch.cuda namespace.
- Replace the torch.cuda calls in DataParallel/DistributedDataParallel with the new APIs.

Related RFC: [https://github.com/pytorch/pytorch/issues/36160](https://github.com/pytorch/pytorch/issues/36160)
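A usage sketch of the moved API (guarded so it only runs with multiple GPUs; the old `torch.cuda.comm` path keeps working):

```python
import torch
import torch.nn.parallel.comm as comm  # new home of the comm helpers

if torch.cuda.is_available() and torch.cuda.device_count() >= 2:
    t = torch.arange(4.0, device="cuda:0")
    # broadcast the tensor to two devices; previously torch.cuda.comm.broadcast
    copies = comm.broadcast(t, devices=[0, 1])
    print([c.device for c in copies])
```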

Pull Request resolved: https://github.com/pytorch/pytorch/pull/38454

Differential Revision: D22051557

Pulled By: mrshenli

fbshipit-source-id: 7842dad0e5d3ca0f6fb760bda49182dcf6653af8
2020-07-07 12:48:16 -07:00
75155df8b4 Doc warnings (#41068)
Summary:
Solves most of gh-38011 in the framework of solving gh-32703.

These should only be formatting fixes; I did not try to fix grammar or syntax.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41068

Differential Revision: D22411919

Pulled By: zou3519

fbshipit-source-id: 25780316b6da2cfb4028ea8a6f649bb18b746440
2020-07-07 11:43:21 -07:00
ff3ba25b8e .circleci: Output binary sizes, store binaries (#41074)
Summary:
We need an easy way to visually grep binary sizes from builds,
and then a way to quickly test out those binaries.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41074

Differential Revision: D22415667

Pulled By: seemethere

fbshipit-source-id: 86386e5390dce6aae26e952a47f9e2a2221d30b5
2020-07-07 11:36:49 -07:00
0e6b750288 Insert parentheses around kernel name argument to hipLaunchKernelGGL (#41022)
Summary:
This works around an issue in hipclang with templated kernel name arguments to hipLaunchKernelGGL.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41022

Differential Revision: D22404183

Pulled By: ngimel

fbshipit-source-id: 63135ccb9e087f4c8e8663ed383979f7e2c1ba06
2020-07-07 11:31:45 -07:00
630e7ed9cc Splitting embedding_bag to embedding_bag_forward_only and embedding_bag (#40557)
Summary:
Currently embedding_bag's CPU kernel queries whether weight.requires_grad() is true. This violates the layering of autograd and op kernels, causing issues in third-party backends like XLA. See this [issue](https://github.com/pytorch/xla/issues/2215) for more details.

This PR hoists the weight.requires_grad() query to the Python layer and splits embedding_bag into two separate ops, one for each value of weight.requires_grad(), as sketched below.
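A minimal sketch of the split (hypothetical names and simplified signatures; the real ops take more arguments such as mode and sparse flags):

```python
import torch

def _embedding_bag(weight, input, offsets):
    ...  # grad-capable kernel (placeholder)

def _embedding_bag_forward_only(weight, input, offsets):
    ...  # inference-only kernel (placeholder)

def embedding_bag(weight, input, offsets):
    # the requires_grad query now happens in Python, so backend kernels
    # (e.g. XLA) never need to inspect autograd state themselves
    if weight.requires_grad:
        return _embedding_bag(weight, input, offsets)
    return _embedding_bag_forward_only(weight, input, offsets)
```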

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40557

Reviewed By: ailzhang

Differential Revision: D22327476

Pulled By: gmagogsfm

fbshipit-source-id: c815b3690d676a43098e12164517c5debec90fdc
2020-07-07 11:24:29 -07:00
00ee54d2a4 Fix link to PyTorch organization (from Governance) (#40984)
Summary:
PR fixes https://github.com/pytorch/pytorch/issues/40666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40984

Differential Revision: D22404543

Pulled By: ngimel

fbshipit-source-id: 0d39e8f4d701517cce9c31fddaaad46be3d4844b
2020-07-07 11:22:57 -07:00
452d5e191b Grammatically updated the tech docs (#41031)
Summary:
Small grammatical update to the torch tech docs

![image](https://user-images.githubusercontent.com/26879385/86633690-e126c400-bfc8-11ea-8892-23cdc037daa9.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41031

Differential Revision: D22404342

Pulled By: ngimel

fbshipit-source-id: 1c723119cfb050c4ef53de7971fe6e0acf3e91a9
2020-07-07 11:17:17 -07:00
22c7d183f7 If ninja is being used, force build_ext to run. (#40837)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40837

As ninja has accurate dependency tracking, if there is nothing to do,
we will very quickly no-op. But this is important for correctness:
if a change was made to a header that is not listed explicitly in
the distutils Extension, distutils will come to the wrong conclusion
about whether recompilation is needed (but ninja will work it out).

This caused https://github.com/pytorch/vision/issues/2367

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D22340930

Pulled By: ezyang

fbshipit-source-id: 481b74f6e2cc78159d2a74d413751cf7cf16f592
2020-07-07 09:49:31 -07:00
733b8c23c4 Fix several quantization documentation typos (#40567)
Summary:
This PR fixes several typos I noticed in the docs here: https://pytorch.org/docs/master/quantization.html. In one case there was a misspelled module [torch.nn.instrinsic.qat](https://pytorch.org/docs/master/quantization.html#torch-nn-instrinsic-qat) which I corrected and am including screenshots of below just in case.

<img width="1094" alt="before" src="https://user-images.githubusercontent.com/54918401/85766765-5cdd6280-b6e5-11ea-93e6-4944cf820b71.png">

<img width="1093" alt="after" src="https://user-images.githubusercontent.com/54918401/85766769-5d75f900-b6e5-11ea-8850-0d1f5ed67b16.png">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40567

Differential Revision: D22311291

Pulled By: ezyang

fbshipit-source-id: 65d1f3dd043357e38a584d9e30f31634a5b0995c
2020-07-07 09:45:23 -07:00
2d98f8170e Add option to warn if elements in a Compare table are suspect (#41011)
Summary:
This PR adds a `.highlight_warnings()` method to `Compare`, which will include a `(! XX%)` next to measurements with high variance to highlight that fact. For example:
```
[------------- Record function overhead ------------]
                      |    lstm_jit   |  resnet50_jit
1 threads: ------------------------------------------
      with_rec_fn     |   650         |  8600
      without_rec_fn  |   660         |  8000
2 threads: ------------------------------------------
      with_rec_fn     |   360         |  4200
      without_rec_fn  |   350         |  4000
4 threads: ------------------------------------------
      with_rec_fn     |   250         |  2100
      without_rec_fn  |   260         |  2000
8 threads: ------------------------------------------
      with_rec_fn     |   200 (! 6%)  |  1200
      without_rec_fn  |   210 (! 6%)  |  1100
16 threads: -----------------------------------------
      with_rec_fn     |   220 (! 8%)  |   900 (! 5%)
      without_rec_fn  |   200 (! 5%)  |  1000 (! 7%)
32 threads: -----------------------------------------
      with_rec_fn     |  1000 (! 7%)  |   920
      without_rec_fn  |  1000 (! 6%)  |   900 (! 6%)

Times are in milliseconds (ms).
(! XX%) Measurement has high variance, where XX is the median / IQR * 100.
```
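A hedged usage sketch (module path assumed; at the time of this commit the benchmark utilities were still in flux, and `Compare` lives under `torch.utils.benchmark` in current builds):

```python
from torch.utils.benchmark import Timer, Compare

results = [
    Timer("x * x", setup="import torch; x = torch.ones(1024)",
          label="mul", description=f"run {i}").blocked_autorange()
    for i in range(3)
]
cmp = Compare(results)
cmp.highlight_warnings()  # annotate noisy measurements with (! XX%)
cmp.print()
```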

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41011

Differential Revision: D22412905

Pulled By: robieta

fbshipit-source-id: 2c90e719d9a5a1c0267ed113dd1b1b1738fa8269
2020-07-07 09:39:22 -07:00
a04af4dccb Revert D22396896: [pytorch][PR] run single-threaded gradgradcheck in test_nn
Test Plan: revert-hammer

Differential Revision:
D22396896 (dac63a13cb)

Original commit changeset: 3b247caceb65

fbshipit-source-id: 90bbd71ca5128a7f07fe2907c061ee0922d16edf
2020-07-07 07:43:39 -07:00
0e09511af9 type annotations for dataloader, dataset, sampler (#39392)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38913
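A small sketch of what the annotations enable (assuming `Dataset` and `DataLoader` are now generic over the sample type):

```python
from torch.utils.data import Dataset, DataLoader

class Squares(Dataset[int]):
    def __len__(self) -> int:
        return 10

    def __getitem__(self, i: int) -> int:
        return i * i

loader: DataLoader[int] = DataLoader(Squares(), batch_size=4)
```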

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39392

Reviewed By: anjali411

Differential Revision: D22102489

Pulled By: zou3519

fbshipit-source-id: acb68d9521145f0b047214d62b5bdc5a0d1b9be4
2020-07-07 07:16:18 -07:00
a6b703cc89 Make torch_cpu compileable when USE_TENSORPIPE is not set. (#40846)
Summary:
Forward-declare `tensorpipe::Message` class in utils.h
Guard TensorPipe specific methods in utils.cpp with `#ifdef USE_TENSORPIPE`
Pass `USE_TENSORPIPE` as private flag to `torch_cpu` library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40846

Differential Revision: D22338864

Pulled By: malfet

fbshipit-source-id: 2ea2aea84527ae7480e353afb55951a068b3b980
2020-07-07 07:02:57 -07:00
12b5bdc601 Remove unused Logger in get_matching_activations (#41023)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41023

Remove Logger in get_matching_activations since it's not used.
ghstack-source-id: 107237046

Test Plan:
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_submodule_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_functional_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_functional_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_dynamic'

Differential Revision: D22394957

fbshipit-source-id: 7d59e0f35e9f4c304b8487460d48236ee6e5a872
2020-07-07 00:33:07 -07:00
4aa543ed2e Fix unordered-map-over-enum for GCC 5.4 (#41063)
Summary:
Forgot to add this to https://github.com/pytorch/pytorch/pull/41055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41063

Differential Revision: D22407451

Pulled By: malfet

fbshipit-source-id: 6f06653b165cc4817d134657f87caf643182832a
2020-07-06 23:26:31 -07:00
50df097599 Fix CUDA jit codegen compilation with gcc-5.4 (#41055)
Summary:
It's a known gcc-5.4 bug that an enum class is not hashable by default, so `std::unordered_map` needs an explicit third template parameter to compute the hash of the key type.

Should fix regression caused by https://github.com/pytorch/pytorch/pull/40864

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41055

Differential Revision: D22405478

Pulled By: malfet

fbshipit-source-id: f4bd36bebdc1ad0251ebd1e6cefba866e6605fe6
2020-07-06 21:09:17 -07:00
56396ad024 ONNX: support view_as operator (#40496)
Summary:
This PR adds support for the torch `view_as` operator.
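A minimal export sketch (file name and shapes are illustrative):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x, y):
        return x.view_as(y)  # exportable to ONNX after this PR

torch.onnx.export(M(), (torch.randn(2, 3), torch.randn(3, 2)), "view_as.onnx")
```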

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40496

Reviewed By: hl475

Differential Revision: D22398318

Pulled By: houseroad

fbshipit-source-id: f92057f9067a201b707aa9b8fc4ad34643dd5fa3
2020-07-06 20:38:46 -07:00
b2cc8a2617 [ONNX]Fix export of full_like (#40063)
Summary:
Fix export of full_like when fill_value is of type torch._C.Value.

This PR fixes a bug when exporting GPT2DoubleHeadsModel https://github.com/huggingface/transformers/issues/4950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40063

Reviewed By: hl475

Differential Revision: D22398353

Pulled By: houseroad

fbshipit-source-id: 6980a61211fe571c2e4a57716970f474851d811e
2020-07-06 20:36:09 -07:00
6e4f501f1a Improve error message for Pad operator (#39651)
Summary:
In issue https://github.com/pytorch/pytorch/issues/36997 the user encountered an unhelpful error message when trying to export the model to ONNX. The Pad operator in opset 9 requires the list of paddings to be constant. This PR improves the error message given to the user when this is not the case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39651

Reviewed By: hl475

Differential Revision: D21992262

Pulled By: houseroad

fbshipit-source-id: b817111c2a40deba85e4c6cdb874c1713312dba1
2020-07-06 20:26:02 -07:00
6b50874cb7 Fix HTTP links in documentation to HTTPS (#40878)
Summary:
I ran `make linkcheck` using `sphinx.builders.linkcheck` on the documentation and noticed a few links weren't using HTTPS so I quickly updated them all.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40878

Differential Revision: D22404647

Pulled By: ngimel

fbshipit-source-id: 9c9756db59197304023fddc28f252314f6cf4af3
2020-07-06 20:05:21 -07:00
63ef706979 [ATen] Add native_cuda_h list to CMakeLists.txt (#41038)
Summary:
Closes https://github.com/pytorch/pytorch/issues/40784

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41038

Differential Revision: D22404273

Pulled By: malfet

fbshipit-source-id: 8df05f948f069ac95591d523222faa1327429e71
2020-07-06 19:58:36 -07:00
5d1d8a58b8 Enable in_dims for vmap frontend api (#40717)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40717

`in_dims` specifies which dimension of the input tensors should be
vmapped over. One can also specify `None` as an `in_dim` for a particular
input to indicate that we do not map over said input.

We implement `in_dims` by creating a BatchedTensor with BatchDim equal
to said `in_dim`. Most of this PR is error checking. `in_dims` must
satisfy the following:
- `in_dim` can be either an int or a Tuple[Optional[int]]. If it is an
int, we use it to mean the `in_dim` for every input.
- If `in_dims` is not-None at some index `idx`, then the input at index
`idx` MUST be a tensor (vmap can only map over tensors).

jax supports something more generalized: their `in_dims` can match the
structure of the `inputs` to the function (i.e., it is a nested python
data structure matching the data structure of `inputs` specifying where
in `inputs` the Tensors to be mapped are and what their map dims should
be). We don't have the infrastructure for that yet, so we only support `int` or a
flat tuple for `in_dims`.
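A usage sketch of `in_dims` (imported from the private `torch._vmap_internals` module where this prototype lived at the time):

```python
import torch
from torch._vmap_internals import vmap  # prototype location at the time

x = torch.randn(3, 5)
w = torch.randn(5)

# map over dim 0 of x; in_dim=None means w is broadcast, not mapped
out = vmap(torch.mul, in_dims=(0, None))(x, w)
assert out.shape == (3, 5)
```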

Test Plan: - `pytest test/test_vmap.py -v`

Differential Revision: D22397914

Pulled By: zou3519

fbshipit-source-id: 56d2e14be8b6024e4cde2729eff384da305b4ea3
2020-07-06 19:14:43 -07:00
dac63a13cb run single-threaded gradgradcheck in test_nn (#40999)
Summary:
The most time-consuming tests in test_nn (taking about half the total time) were gradgradchecks on Conv3d. Reduce their sizes and, most importantly, run gradgradcheck single-threaded, because that cuts the time of the conv3d tests by an order of magnitude while barely affecting other tests.
These changes bring test_nn time down from 1200 s to ~550 s on my machine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40999

Differential Revision: D22396896

Pulled By: ngimel

fbshipit-source-id: 3b247caceb65d64be54499de1a55de377fdf9506
2020-07-06 17:21:25 -07:00
37a572f33e fix grad thrashing of shape analysis (#40939)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40939

Previously, when we did shape analysis by running the op with representative inputs, we would always set the grad property to false. This led to incorrect static analysis when we created differentiable subgraphs, propagated shapes without also propagating requires_grad, and then uninlined them.

Test Plan: Imported from OSS

Differential Revision: D22394676

Pulled By: eellison

fbshipit-source-id: 254e6e9f964b40d160befe0e125abe1b7aa2bd5e
2020-07-06 17:12:13 -07:00
4af8424377 shape analysis fix for default dtype (#40938)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40938

already accepted in https://github.com/pytorch/pytorch/pull/40645

Test Plan: Imported from OSS

Reviewed By: jamesr66a, Krovatkin

Differential Revision: D22394675

Pulled By: eellison

fbshipit-source-id: 1e9dbb24a4cb564d9a68280d2166329ca9fb0425
2020-07-06 17:10:01 -07:00
078669f6c3 Back out "[2/n][Compute Meta] support analysis for null flag features"
Summary:
Original commit changeset: 46c59d849fa8

The original commit is breaking DPER3 release pipeline with the following failures:
https://www.internalfb.com/intern/chronos/jobinstance?jobinstanceid=9007207344413239&smc=chronos_gp_admin_client&offset=0
```
Child workflow f 202599639  failed with error: c10::Error: [enforce fail at operator.cc:76] blob != nullptr. op Save: Encountered a non-existing input blob: feature_preproc/feature_sparse_to_dense/default_float_value
```
https://www.internalfb.com/intern/chronos/jobinstance?jobinstanceid=9007207344855973&smc=chronos_gp_admin_client&offset=0
```
Child workflow f 202629391  failed with error: c10::Error: [enforce fail at operator.cc:76] blob != nullptr. op Save: Encountered a non-existing input blob: tum_preproc/inductive/feature_sparse_to_dense/default_float_value
```

Related UBN tasks: T69529846, T68986110

Test Plan: Build a DPER3 package on top of this commit, and check that DPER3 release test `model_deliverability_test` is passing.

Differential Revision: D22396317

fbshipit-source-id: 92d5b30cc146c005d6159a8d5bfe8973e2c546dd
2020-07-06 16:29:03 -07:00
a78024476b Port equal from THC to ATen (CUDA) (#36483)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24557

ASV benchmark:

```
import torch

sizes = [
    (10**6,),
    (1000, 1000),
    (10, 10),
    (1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
]

class EqualTrue:
    params = range(len(sizes))

    def setup(self, n):
        dims = sizes[n]
        self.a = torch.rand(dims, device='cuda')
        self.b = self.a.clone()

    def time_equal(self, n):
        torch.equal(self.a, self.b)

class EqualFalse:
    params = range(len(sizes))

    def setup(self, n):
        dims = sizes[n]
        self.a = torch.rand(dims, device='cuda')
        self.b = torch.rand(dims, device='cuda')

    def time_equal(self, n):
        torch.equal(self.a, self.b)
```

Old results:
```
[ 75.00%] ··· equal.EqualFalse.time_equal
[ 75.00%] ··· ======== ============
               param1
              -------- ------------
                 0       67.7±7μs
                 1       74.0±2μs
                 2      24.4±0.1μs
                 3      135±0.2μs
              ======== ============

[100.00%] ··· equal.EqualTrue.time_equal
[100.00%] ··· ======== ============
               param1
              -------- ------------
                 0      59.8±0.2μs
                 1      59.9±0.3μs
                 2      25.0±0.5μs
                 3      136±0.2μs
              ======== ============
```

New results:
```
[ 75.00%] ··· equal.EqualFalse.time_equal
[ 75.00%] ··· ======== ============
               param1
              -------- ------------
                 0      44.4±0.2μs
                 1      44.5±0.4μs
                 2      31.3±0.3μs
                 3      96.6±0.5μs
              ======== ============

[100.00%] ··· equal.EqualTrue.time_equal
[100.00%] ··· ======== ============
               param1
              -------- ------------
                 0      44.2±0.2μs
                 1      44.6±0.2μs
                 2      30.8±0.3μs
                 3      97.3±0.2μs
              ======== ============
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/36483

Differential Revision: D21451829

Pulled By: VitalyFedyunin

fbshipit-source-id: 033e8060192c54f139310aeafe8ba784bab94ded
2020-07-06 16:00:16 -07:00
c0f9bf9bea s/torch::jit::class_/torch::class_/ (#40795)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40795

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D22314215

Pulled By: jamesr66a

fbshipit-source-id: a2fb5c6804d4014f8e437c6858a7be8cd3efb380
2020-07-06 15:53:33 -07:00
cbe52d762c Mish Activation Function (#40856)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40856

Add a new activation function - Mish: A Self Regularized Non-Monotonic Neural Activation Function https://arxiv.org/abs/1908.08681
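For reference, the paper's definition as a one-line Python sketch:

```python
import torch
import torch.nn.functional as F

def mish(x):
    # Mish(x) = x * tanh(softplus(x)), per the referenced paper
    return x * torch.tanh(F.softplus(x))
```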

Test Plan:
buck test //caffe2/caffe2/python/operator_test:elementwise_ops_test -- 'test_mish'

{F242275183}

Differential Revision: D22158035

fbshipit-source-id: 459c1dd0ac5b515913fc09b5f4cd13dcf095af31
2020-07-06 15:51:23 -07:00
87f9b55aa5 Use explicit templates in gpu_kernel_with_scalars (#40992)
Summary:
This trick should have no effect on performance, but it reduces the size of kernels using the template by 10%.
For example, the size of BinaryMulDivKernel.cu.o compiled by the CUDA 10.1 toolchain for sm_75 was 4.2 MB before the change and 3.8 MB after.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40992

Differential Revision: D22398733

Pulled By: malfet

fbshipit-source-id: 6576f4da00dc5fc2575b2313577f52c6571d5e6f
2020-07-06 15:46:28 -07:00
945ae5bd7b Update the documentation of the scatter_ method with support for reduction methods. (#40962)
Summary:
Follow up to https://github.com/pytorch/pytorch/pull/36447 . Update for https://github.com/pytorch/pytorch/issues/33389.

Also removes unused `unordered_map` include from the CPP file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40962

Differential Revision: D22376253

Pulled By: ngimel

fbshipit-source-id: 4e7432190e9a847321aec6d6f6634056fa69bdb8
2020-07-06 15:27:16 -07:00
35bd2b3c8b DOC: Clarify that CrossEntropyLoss mean is weighted (#40991)
Summary:
Closes https://github.com/pytorch/pytorch/issues/40560

This adds the equation for the weighted mean to `CrossEntropyLoss`'s docs, and the docs for the `reduction` argument of `CrossEntropyLoss` and `NLLLoss` no longer describe a non-weighted mean of the outputs.
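The weighted mean in question has this form (a sketch consistent with the NLLLoss notation, where $w_{y_n}$ is the class weight of target $y_n$ and $l_n$ the per-sample loss):

$$\ell(x, y) = \frac{\sum_{n=1}^{N} l_n}{\sum_{n=1}^{N} w_{y_n}}, \qquad l_n = -w_{y_n} \log \frac{\exp(x_{n, y_n})}{\sum_{c} \exp(x_{n, c})}$$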

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40991

Differential Revision: D22395805

Pulled By: ezyang

fbshipit-source-id: a623b6dd2aab17220fe0bf706bd9b62d6ba531fd
2020-07-06 15:05:31 -07:00
b9b4f05abf [nvFuser] Working towards reductions, codegen improvements (#40864)
Summary:
Basic reduction fusion is working, and the code generator has been improved to approach the performance of eager-mode reductions. Coming soon are pointwise-reduction fusions, implemented in a way that should prevent the possibility of hitting regressions. Also in progress are performant softmax kernels in the code generator, which may be our next fusion target.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40864

Reviewed By: ngimel

Differential Revision: D22392877

Pulled By: soumith

fbshipit-source-id: 457448a807d628b1035f6d90bc0abe8a87bf8447
2020-07-06 14:52:49 -07:00
e026d91506 [JIT] Remove dead store in unpickler.cpp (#40625)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40625

Test Plan: Continuous integration.

Reviewed By: suo

Differential Revision: D22259289

fbshipit-source-id: 76cb097dd06a636004fc780b17cb20f27d3821de
2020-07-06 14:48:03 -07:00
d753f1c2e1 Fixes formatting of vander, count_nonzero, DistributedSampler documentation (#41025)
Summary:
Bundle of small edits to fix formatting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41025

Differential Revision: D22398364

Pulled By: mruberry

fbshipit-source-id: 8d484cb52a1cf4a8eb1f64914574250c9fd5043d
2020-07-06 14:26:13 -07:00
0fbd42b20f [pytorch] deprecate PYTORCH_DISABLE_TRACING macro (#41004)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41004

Tracing has been moved into separate files. Now we can disable it by not compiling those source files for the xplat mobile build.
ghstack-source-id: 107158627

Test Plan: CI + build size bot

Reviewed By: iseeyuan

Differential Revision: D22372615

fbshipit-source-id: bf2e2249e401295ff63020a292df119b188fb966
2020-07-06 14:22:59 -07:00
7f60642bae [pytorch] add manual registration for trace type (#40903)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40903

This PR continues the work of #38467 - decoupling Autograd and Trace for manually registered ops.
ghstack-source-id: 107158638

Test Plan: CI

Differential Revision: D22354804

fbshipit-source-id: f5ea45ade2850296c62707a2a4449d7d67a9f5b5
2020-07-06 14:20:37 -07:00
e173278348 Update quantization.rst (#40896)
Summary:
Add documentation for dynamic quantized modules

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40896

Differential Revision: D22395955

Pulled By: z-a-f

fbshipit-source-id: cdc956d1509a0901bc24b73b6ca68a1b65e00cc2
2020-07-06 13:47:39 -07:00
e75f12ac15 Check statstical diff rather than exact match for test_dropout_cuda. (#40883)
Summary:
There is a TODO tracked in https://github.com/pytorch/pytorch/issues/40882
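A sketch of the statistical check named in the title (tolerance and tensor size are illustrative):

```python
import torch

p = 0.2
x = torch.ones(1_000_000, device="cuda")
out = torch.nn.functional.dropout(x, p=p, training=True)
# compare the observed drop rate to p instead of matching exact values
drop_rate = (out == 0).float().mean().item()
assert abs(drop_rate - p) < 0.01
```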

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40883

Reviewed By: pbelevich

Differential Revision: D22346087

Pulled By: ailzhang

fbshipit-source-id: b4789ca3a10f6a72c6e77276bde45633eb6cf545
2020-07-06 13:11:48 -07:00
c38a5cba0d Remove duplicate assignment in collate.py (#40655)
Summary:
Duplicated assignment
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40655

Reviewed By: ezyang

Differential Revision: D22308827

Pulled By: colesbury

fbshipit-source-id: 48361da8994b3ca00ef29e9afd3ec2672266f00a
2020-07-06 12:37:59 -07:00
c935712d58 Use unbind for tensor.__iter__ (#40884)
Summary:
Unbind, which has a special backward with cat, is arguably better than multiple selects, whose backward creates and adds a bunch of tensors as big as `self`.
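A sketch of the change's effect (user-visible behavior is unchanged; only the backward structure differs):

```python
import torch

x = torch.randn(3, 4, requires_grad=True)
rows = list(x)               # iteration now yields x.unbind(0) results
loss = sum(r.sum() for r in rows)
loss.backward()              # unbind's backward assembles grads with one cat
print(x.grad.shape)          # torch.Size([3, 4])
```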

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40884

Reviewed By: pbelevich

Differential Revision: D22363376

Pulled By: zou3519

fbshipit-source-id: 0911cdbb36f9a35d1b95f315d0a2f412424e056d
2020-07-06 10:53:15 -07:00
f6f3c0094a Revert D22369579: add eq.str, ne.str, and add.str ops
Test Plan: revert-hammer

Differential Revision:
D22369579 (0deb2560b8)

Original commit changeset: 7ac9a184d437

fbshipit-source-id: 9c861b9f6bf32fe51fa0ea516cf09a3d09d78a7c
2020-07-06 09:52:59 -07:00
9c82b570bf Fix delegating to jit.load from torch.load (#40937)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40937

Test Plan: Imported from OSS

Differential Revision: D22363816

Pulled By: jamesr66a

fbshipit-source-id: 50fc318869407fe8b215368026eaceb129b68a46
2020-07-06 09:00:13 -07:00
73c5a78f43 Test test_int8_ops_nnpi.py case typo fix. (#41008)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41008

Test test_int8_ops_nnpi.py case typo fix.

Test Plan: test_int8_ops_nnpi.py case typo fix.

Reviewed By: hl475

Differential Revision: D22390331

fbshipit-source-id: 8d257c72114ce890720219eb519b9cb43b2ca49b
2020-07-06 08:44:08 -07:00
46f5cf1e31 Improve error reporting of AVX instruction in CI job (#40681)
Summary:
Close https://github.com/pytorch/pytorch/issues/40320

Leverage `qemu` and `gdbserver` to print backtraces and instructions, helping developers better understand the causes of failed tests.

Signed-off-by: Xiong Wei <xiongw.fnst@cn.fujitsu.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40681

Differential Revision: D22391512

Pulled By: malfet

fbshipit-source-id: 19f125cf6c0e5a51814aff2b1d4d3c81298e3cb6
2020-07-06 08:31:01 -07:00
e1afa9daff fix cmake bug (#39930)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39930

Differential Revision: D22391207

Pulled By: ezyang

fbshipit-source-id: bde19a112846e124d4e5316ba947f48d4dccf361
2020-07-06 08:02:30 -07:00
0b9717b86a When linking libtorch_cpu.so, put AVX sources last in the input list (#40449)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39600
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40449

Reviewed By: VitalyFedyunin

Differential Revision: D22312501

Pulled By: colesbury

fbshipit-source-id: 4c09adb0173749046f20b84241d6c940b339ad77
2020-07-06 07:56:12 -07:00
063d5b0d3f Remove get_fail_msg in test_dataloader.test_proper_exit (#40745)
Summary:
Close https://github.com/pytorch/pytorch/issues/40744
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40745

Reviewed By: ezyang

Differential Revision: D22308972

Pulled By: colesbury

fbshipit-source-id: 4b4847e6b926b2614c8b14f17a9db3b0376baabe
2020-07-06 07:48:32 -07:00
450ba49653 Add the missing resource_class key in the update_s3_htmls job (#41000)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40998.

Actually I don't know why it is needed. But without it, the build won't start. See my rerun of the update_s3_html3 job: https://app.circleci.com/pipelines/github/pytorch/pytorch/187926/workflows/432dbe98-ca2f-484d-acc7-0482cb3fd01f/jobs/6121551/steps.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41000

Differential Revision: D22390654

Pulled By: malfet

fbshipit-source-id: 0f296c8a82fa92d5382f883bca951e6576f75b15
2020-07-06 07:02:11 -07:00
54d7a1e3f4 Fix module dict key ordering (#40905)
Summary:
Fix https://github.com/pytorch/pytorch/issues/40227
Removed the sorting operation in the ModuleDict class and updated the docstring.
Also removed a sort operation in the corresponding unit test, which would otherwise cause the test to fail.

BC Note: from Python 3.6 onwards, a plain dict preserves the insertion order of its keys.
Example: a Python 3.6+ user who initializes a ModuleDict from the plain python dict
{
"b": torch.nn.MaxPool2d(3),
"a": torch.nn.MaxPool2d(3)
}
gets a ModuleDict that preserves that order:
ModuleDict(
(b): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
(a): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
)

For a Python 3.5 user, the same input could instead produce:
ModuleDict(
(a): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
(b): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40905

Differential Revision: D22357480

Pulled By: albanD

fbshipit-source-id: 0e2502769647bb64f404978243ca1ebe5346d573
2020-07-06 06:40:48 -07:00
0deb2560b8 add eq.str, ne.str, and add.str ops (#40958)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40958

add 3 str operators to lite interpreter
eq.str
ne.str
add.str

Test Plan:
```
buck run //xplat/caffe2/fb/pytorch_predictor:converter /mnt/vol/gfsfblearner-altoona/flow/data/2020-06-29/1ca8a85f-dbd5-4181-b5fc-63d24465c1fc/201084299/2068673333/model.pt1 ~/model_f201084299.bc

buck run xplat/assistant/model_benchmark_tool/mobile/binary/:lite_predictor -- --model ~/model_f201084299.bc --input_file /tmp/gc_model_input.txt --model_input_args src_tokens,dict_feat,contextual_token_embedding --warmup 1 --iter 2

```

Reviewed By: pengtxiafb

Differential Revision: D22369579

fbshipit-source-id: 7ac9a184d437c875edfb584221edd706bffb16e1
2020-07-06 01:01:15 -07:00
300a3aaaad [jit] move private implementation out of jit/__init__.py (#40807)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40807

We pack a lot of logic into `jit/__init__.py`, making it unclear to
developers and users which parts of our API are public vs. internal. This
is one in a series of PRs intended to pull implementation out into
separate files, and leave `__init__.py` as a place to register the
public API.

This PR moves all the tracing-related stuff out, and fixes other spots up
as necessary. Followups will move other core APIs out.

The desired end-state is that we conform to the relevant rules in [PEP 8](https://www.python.org/dev/peps/pep-0008/#public-and-internal-interfaces). In particular:
- Internal implementation goes in modules prefixed by `_`.
- `__init__.py` exposes a public API from these private modules, and nothing more.
- We set `__all__` appropriately to declare our public API.
- All use of JIT-internal functionality outside the JIT are removed (in particular, ONNX is relying on a number internal APIs). Since they will need to be imported explicitly, it will be easier to catch new uses of internal APIs in review.
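As a sketch of the target shape (module and function names assumed from the tracing move described above):

```python
# torch/jit/__init__.py (sketch): private modules implement, the package
# re-exports, and __all__ declares the public surface
from torch.jit._trace import trace, trace_module

__all__ = ["trace", "trace_module"]
```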

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D22320645

Pulled By: suo

fbshipit-source-id: 0720ea9976240e09837d76695207e89afcc58270
2020-07-05 22:01:11 -07:00
1e64bf4c40 [CircleCI] Delete docker image after testing (#40917)
Summary:
Needed maintenance step to avoid running out of disk space on RocM testers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40917

Differential Revision: D22385844

Pulled By: malfet

fbshipit-source-id: b6dc9ba888a2e34c311e9bf3c8b7b98fa1ec5435
2020-07-05 13:21:00 -07:00
8ecd4f36aa fix __len__, __contains__, getitem inherited from interface class derived from nn container (closes #40603) (#40789)
Summary:
Define static script implementations of __len__ and __contains__ on any subclass derived from a container type such as ModuleList, Sequential, or ModuleDict. Implement __getitem__ for classes derived from ModuleDict.
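A usage sketch (module structure is illustrative):

```python
import torch
import torch.nn as nn

class Stack(nn.Sequential):  # user subclass of a container type
    pass

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = Stack(nn.ReLU(), nn.Tanh())

    def forward(self, x):
        # len() on a Sequential-derived attribute now compiles in script
        if len(self.layers) > 0:
            x = self.layers(x)
        return x

scripted = torch.jit.script(M())
print(scripted(torch.randn(2)))
```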
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40789

Reviewed By: eellison

Differential Revision: D22325159

Pulled By: wconstab

fbshipit-source-id: fc1562c29640fe800e13b5a1dd48e595c2c7239b
2020-07-04 15:45:18 -07:00
8223858cc1 shape inference of undefined for prim::grad (#40866)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40866

Reviewed By: pbelevich

Differential Revision: D22358988

Pulled By: Krovatkin

fbshipit-source-id: 7118d7f8d4eaf056cfb71dc0d588d38b1dfb0fc7
2020-07-04 14:10:22 -07:00
88c0d886e3 update requires_gard on loop inputs correctly (master) (#40926)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40926

Reviewed By: eellison

Differential Revision: D22359471

Pulled By: Krovatkin

fbshipit-source-id: 823e87674e2d2917f075255ec926e0485972f4e2
2020-07-04 13:58:29 -07:00
0790d11a18 typing for tensor.T/grad_fn torch.Size (#40879)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40658

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40879

Reviewed By: pbelevich

Differential Revision: D22339146

Pulled By: ezyang

fbshipit-source-id: 6b4695e102591e7a2c391eb337c154414bacf67c
2020-07-04 11:58:29 -07:00
0fc0a9308a fix autodoc for torch.distributed.launch (#40963)
Summary:
The doc for `torch.distributed.launch` has been missing since v1.2.0 (see issue https://github.com/pytorch/pytorch/issues/36386) because PR https://github.com/pytorch/pytorch/issues/22501 added some imports at the first line.
542ac74987/torch/distributed/launch.py (L1-L5)
I moved the imports below the docstring to make Sphinx autodoc work normally, as sketched below.
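The fix, as a sketch (abridged docstring):

```python
# torch/distributed/launch.py (sketch): the module docstring must be the
# first statement for Sphinx autodoc to pick it up; imports come after
r"""
`torch.distributed.launch` is a module that spawns multiple distributed
training processes on each of the training nodes.
"""
import sys
```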

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40963

Differential Revision: D22380816

Pulled By: mrshenli

fbshipit-source-id: ee8406785b9a198bbf3fc65e589854379179496f
2020-07-04 08:59:41 -07:00
480851ad2c Docstring changes for dynamic quantized classes (#40931)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40931

Fix docstrings for dynamic quantized Linear/LSTM and associated classes
ghstack-source-id: 107064446

Test Plan: Docs show up correctly

Differential Revision: D22360787

fbshipit-source-id: 8e357e081dc59ee42fd7f12ea5079ce5d0cc9df2
2020-07-03 21:04:12 -07:00
3b7df2388e [RFC] Profile rpc_async call from JIT (#40652)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40652

Resolves https://github.com/pytorch/pytorch/issues/40304, but looking for
feedback on whether there is a better approach for this.

In order to profile `rpc_async` calls made within a torchscript function, we
add the profiling logic to `rpcTorchscript` which is the point where the RPC is
dispatched and is called by the jit `rpc_async` operator. We take a somewhat
similar approach to how this is done in the python API. If profiling is
enabled, we call `record_function_enter` which creates a `RecordFunction`
object and runs its starting callbacks. Then, we schedule end callbacks for
this `RecordFunction` to be run when the jit future completes.

One caveat is that `rpcTorchscript` can also be called by rpc_async from a
non-JIT function, in which case the profiling logic lives in Python. We add a
check to ensure that we don't double profile in this case.
ghstack-source-id: 107109485

Test Plan: Added relevant unittests.

Differential Revision: D22270608

fbshipit-source-id: 9f62d1a2a27f9e05772d0bfba47842229f0c24e1
2020-07-03 15:17:16 -07:00
f3f113f103 [quant][graphmode][fix] Print the node in error message (#40889)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40889

Test Plan: Imported from OSS

Differential Revision: D22348266

fbshipit-source-id: eed2ece5c94fcfaf187d6770bed4a7109f0c0b4a
2020-07-03 10:01:55 -07:00
f083cea227 [RPC tests] Fix file descriptor leak (#40913)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40913

Summary of the entire stack:
--

This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.

This refactoring aims to address these problems by:
- Avoiding global state, defaults/override, traps, if/elses, ... and have a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).

It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...

Summary of this commit
--
Once we start merging multiple test suites in a single file (which we'll happen in the next diffs in the stack) the OSX tests on CircleCI start failing due to "too many open files". This indicates a file descriptor leak. I then managed to repro it on Linux too by lowering the limit on open file descriptors (`ulimit -n 500`). Each test method that unittest runs is run on a new instance of the Testcase class. With our multiprocessing wrappers, this instance contains a list of child processes. Even after these processes are terminated, it appears they still hold some open file descriptor (for example a pipe to communicate with the subprocess). It also appears unittest is keeping these Testcase instances alive until the entire suite completes, which I suspect is what leads to this "leak" of file descriptors. Based on that guess, in this diff I am resetting the list of subprocesses during shutdown, and this seems to fix the problem.
ghstack-source-id: 107045908

Test Plan: Sandcastle and CircleCI

Differential Revision: D22356784

fbshipit-source-id: c93bb9db60fde72cae0b0c735a50c17e427580a6
2020-07-03 06:22:40 -07:00
f9a71d3de4 [RPC tests] Align ddp_under_dist_autograd test with others (#40815)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40815

Summary of the entire stack:
--

This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.

This refactoring aims to address these problems by:
- Avoiding global state, defaults/overrides, traps, if/elses, ... and having a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).

It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...

Summary of this commit
--
This prepares the stack by aligning the `ddp_under_dist_autograd` test to the other ones, so that later changes will be more consistent and thus easier to follow. It does so by moving the `skipIf` decorators and the `setUp` methods from the base test suite to the entry point scripts.
ghstack-source-id: 107045911

Test Plan: Sandcastle and CircleCI

Differential Revision: D22287535

fbshipit-source-id: ab0c9eb774b21d81e0ebd3078df958dbb4bfa0c7
2020-07-03 06:20:29 -07:00
d0f2079b5e [RPC tests] Remove world_size and init_method from TensorPipe fixture (#40814)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40814

Summary of the entire stack:
--

This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.

This refactoring aims to address these problems by:
- Avoiding global state, defaults/overrides, traps, if/elses, ... and having a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).

It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...

Summary of this commit
--
This prepares the stack by simplifying the TensorPipe fixture. A comment says that the TensorPipe fixture cannot subclass the generic fixture class, as that would lead to a diamond class hierarchy, which Python supposedly doesn't support (whereas in fact it does), and therefore it copies over two properties that are defined on the generic fixture. However, each class that uses the TensorPipe fixture also inherits from the generic fixture, so there's no need to redefine those properties. In fact, by not redefining them we save ourselves some trouble in cases where the TensorPipe fixture's copy would end up overriding another override.
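
For context, Python does resolve diamond class hierarchies via its method resolution order; a minimal sketch (class names are illustrative, not taken from this diff):
```
class Fixture:
    @property
    def world_size(self):
        return 4

class TensorPipeFixture(Fixture):
    pass

class GenericTests(Fixture):
    pass

# Diamond over Fixture: perfectly legal, resolved via Python's MRO.
class TensorPipeTests(TensorPipeFixture, GenericTests):
    pass

print(TensorPipeTests().world_size)                    # 4
print([c.__name__ for c in TensorPipeTests.__mro__])   # linearized lookup order
```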
ghstack-source-id: 107045914

Test Plan: Sandcastle and CircleCI

Differential Revision: D22287533

fbshipit-source-id: 254c38b36ba51c9d852562b166027abacbbd60ef
2020-07-03 02:52:14 -07:00
3890550940 [RPC tests] Fix @_skip_if_tensorpipe always skipping for all agents (#40860)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40860

It turns out that the `@_skip_if_tensorpipe_agent` decorator was written in such a way that it accidentally caused the test to become a no-op (and thus always succeed) for all agents. This means that all tests wrapped by that decorator were never actually run, for any agent.

My understanding of the root cause is that the following code:
```
@_skip_if_tensorpipe_agent
def test_foo(self):
    self.assertEqual(2 + 2, 4)
```
ended up behaving somewhat like this:
```
def test_foo(self):
    def original_test_func(self):
        self.assertEqual(2 + 2, 4)
    return unittest.skipIf(self.agent == "TENSORPIPE")(original_test_func)
```
which means that the test body of the decorated method was not actually calling the original test method.

This issue probably came from `@_skip_if_tensorpipe_agent` being copy-pasted from `requires_process_group_agent` (which, however, is not a decorator but rather a decorator *factory*). An unfortunate choice of name (calling `decorator` what was in fact the wrapped method) then hindered readability and hid the issue.
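
For reference, a skip decorator with the intended semantics would look roughly like this sketch (the `agent` attribute name is a hypothetical stand-in):
```
import functools
import unittest

def skip_if_tensorpipe_agent(func):
    # Return a wrapper that decides at call time whether to skip; the broken
    # version instead returned the decorated function from inside the test
    # body, so the original test body never executed.
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if self.agent == "TENSORPIPE":  # hypothetical attribute
            raise unittest.SkipTest("not run on the TensorPipe agent")
        return func(self, *args, **kwargs)
    return wrapper
```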

Note that a couple of tests had become legitimately broken in the meantime and no one had noticed. The breakages were introduced in #39909 (a.k.a., D22011868 (145df306ae)).
ghstack-source-id: 107045916

Test Plan: Discovered this as part of my refactoring, in D22332611. After fixing the decorator two tests started breaking (for real reasons). After fixing them all is passing.

Differential Revision: D22332611

fbshipit-source-id: f88ca5574675fdb3cd09a9f6da12bf1e25203a14
2020-07-03 02:50:11 -07:00
cab7d94d47 [PyTorch Numeric Suite] Remove unnecessary Logger in input arguments (#40890)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40890

Remove unnecessary Logger in input arguments and simplify the API.
ghstack-source-id: 107110487

Test Plan:
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_submodule_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_functional_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_functional_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_dynamic'

Differential Revision: D22345477

fbshipit-source-id: d8b4eb3d6cb3049aa3296dead8ba29bf5467bd1c
2020-07-03 02:45:46 -07:00
542ac74987 [quant][graphmode][fix] Fold conv bn (#40865)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40865

1. applied a filter for the module types
2. removed the assumption that the conv and bn modules are immediate children of the parent module

Test Plan:
python test/test_quantization.py TestQuantizeJitPasses

Imported from OSS

Differential Revision: D22338074

fbshipit-source-id: 64739a5e56c0a74249a1dbc2c8454b88ec32aa9e
2020-07-03 00:01:04 -07:00
824ab19941 [quant][graphmode] Support quantization for aten::append (#40743)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40743

`aten::append` modifies its input in place and its output is ignored. Such ops are not
supported right now, so we'll need to first make `aten::append` non-inplace
by changing
```
ignored = aten::append(list, x)
```
to
```
x_list = aten::ListConstruct(x)
result = aten::add(list, x_list)
```
and then quantize the aten::add instead.
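
In eager Python terms, the rewrite corresponds roughly to replacing the in-place call with an out-of-place one (illustrative sketch):
```
xs = [1, 2]
x = 3
xs.append(x)    # aten::append: mutates xs in place, return value ignored

ys = [1, 2]
ys = ys + [x]   # aten::ListConstruct([x]) followed by aten::add
assert xs == ys
```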

Test Plan:
TestQuantizeJitOps.test_general_shape_ops

Imported from OSS

Differential Revision: D22302151

fbshipit-source-id: 931000388e7501e9dd17bec2fad8a96b71a5efc5
2020-07-02 22:26:52 -07:00
ff17b83fd8 [pytorch][ci] add custom selective build flow for android build (#40199)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40199

Mobile custom selective build has already been covered by `test/mobile/custom_build/build.sh`.
It builds a CLI binary with the host toolchain and runs it on the host
machine to check the correctness of the result.

But that custom build test doesn't cover the android/gradle build part,
and we cannot use it to measure and track the in-APK size of the custom
build library.

So this PR adds selective build test coverage for the android NDK build.
It also integrates with the CI to upload the custom build size to scuba.

TODO:
Ideally it should build android/test_app and measure the in-APK size.
But the test_app hasn't been covered by any CI yet and is currently
broken, so we build and measure the AAR instead (which can be inaccurate,
as we plan to pack C++ header files into the AAR soon).

Sample result: https://fburl.com/scuba/pytorch_binary_size/skxwb1gh
```

+---------------------+-------------+-------------------+-----------+----------+
|     build_mode      |    arch     |        lib        | Build Num |   Size   |
+---------------------+-------------+-------------------+-----------+----------+
| custom-build-single | armeabi-v7a | libpytorch_jni.so |   5901579 | 3.68 MiB |
| prebuild            | armeabi-v7a | libpytorch_jni.so |   5901014 | 6.23 MiB |
| prebuild            | x86_64      | libpytorch_jni.so |   5901014 | 7.67 MiB |
+---------------------+-------------+-------------------+-----------+----------+
```

Test Plan: Imported from OSS

Differential Revision: D22111115

Pulled By: ljk53

fbshipit-source-id: 11d24efbc49a85f851ecd0e481d14123f405b3a9
2020-07-02 21:11:01 -07:00
28e1d241cd [pytorch] factor out binary size upload command (#40188)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40188

Create a custom command for this task to avoid copy/paste for new build jobs.

Test Plan: Imported from OSS

Differential Revision: D22111114

Pulled By: ljk53

fbshipit-source-id: a7d4d6bbd61ba6b6cbaa137ec7f884736957dc39
2020-07-02 21:08:17 -07:00
3c22c7aadc infer tensor properties based on an input tensor rather than defaults for xxx_like ctors (#40895)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40895
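
A quick sketch of the behavior the title refers to, assuming current `torch` semantics for `*_like` constructors:
```
import torch

x = torch.ones(2, 3, dtype=torch.float64)
y = torch.zeros_like(x)
# dtype and device are taken from the input tensor,
# not from the global defaults (float32 here)
assert y.dtype == torch.float64
```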

Reviewed By: eellison

Differential Revision: D22358878

Pulled By: Krovatkin

fbshipit-source-id: 2db2429aa89c180d8e52a6bb1265308483da46a2
2020-07-02 20:56:35 -07:00
6095808d22 fix pca_lowrank memory consumption (#40853)
Summary:
Per title, fixes https://github.com/pytorch/pytorch/issues/40768
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40853

Reviewed By: pbelevich

Differential Revision: D22363906

Pulled By: ngimel

fbshipit-source-id: 966a4b230d351f7632c5cfae4a3b7c9a787bc9a5
2020-07-02 17:52:41 -07:00
3ca5849f0a Add serializer and deserializer for Int8QuantSchemeBlob and Int8QuantParamsBlob (#40661)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40661

Add ser-de to support int8 quantization during online training

Test Plan:
```
buck test caffe2/caffe2/fb/fbgemm:int8_serializer_test
```

Reviewed By: hx89

Differential Revision: D22273292

fbshipit-source-id: 3b1e9c820243acf41044270afce72a262ef92bd4
2020-07-02 17:17:05 -07:00
f8d4878b3c check for unsupported instructions when exporting mobile models (#40791)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40791

Test Plan: Imported from OSS

Differential Revision: D22311469

Pulled By: ann-ss

fbshipit-source-id: 7a6abb3f2477e8553f8c71f4aa0442df4f712fb5
2020-07-02 16:24:11 -07:00
3c6b8a6496 Revert D22360735: .circleci: Build docker images as part of CI workflow
Test Plan: revert-hammer

Differential Revision:
D22360735 (af5bcba217)

Original commit changeset: 4ffbde563fdc

fbshipit-source-id: 4ae2288f466703754c9e329d34d344269c70db83
2020-07-02 16:16:31 -07:00
a1c234e372 Revert D22330340: [C2] Fixed a bug in normalization operator
Test Plan: revert-hammer

Differential Revision:
D22330340 (ce63f70981)

Original commit changeset: 0bccf925bb76

fbshipit-source-id: e27d70dee0fbe9e708b0cf3be81dbd33c4015026
2020-07-02 16:05:23 -07:00
9cc73966b3 [TVM] Fix build and sync with caffe2/caffe2/python/dlpack.h (#40888)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40888

Reviewed By: yinghai

Differential Revision: D22326379

fbshipit-source-id: 96ffcff5738973312c49368f53f35bf410e4c0c9
2020-07-02 15:37:45 -07:00
b7517a76ba rshift use default >> operator (#40545)
Summary:
Fix https://github.com/pytorch/pytorch/issues/40032
Also see https://github.com/pytorch/pytorch/pull/35339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40545

Reviewed By: pbelevich

Differential Revision: D22362816

Pulled By: ngimel

fbshipit-source-id: 4bbf9212b21a4158badbfee8146b3b67e94d5a33
2020-07-02 15:13:12 -07:00
dec3f918a0 Migrate 'torch.dot' from TH to Aten (CUDA) (#40646)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40646

Supports double, float, and at::Half.
Avoids creating the output result on the CPU.

Both tensors must be on the GPU.
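
A minimal sketch of the supported usage (illustrative):
```
import torch

a = torch.randn(1000, device="cuda", dtype=torch.half)
b = torch.randn(1000, device="cuda", dtype=torch.half)
out = torch.dot(a, b)   # both inputs (and the result) live on the GPU
```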

Reviewed By: ngimel

Differential Revision: D22258840

fbshipit-source-id: 95f4747477f09b40b1d682cd1f76e4c2ba28c452
2020-07-02 14:48:59 -07:00
81aebf380e pytorch | Fix linking of qnnpack params on windows. (#40920)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40920

Pytorch depends on this from both C and C++ source files, so unify linking so it's fully fixed.

Test Plan: Build it on Windows

Reviewed By: dreiss, supriyar

Differential Revision: D22348247

fbshipit-source-id: 2933b4804f4725ab1742914656fa367527f8f7e1
2020-07-02 13:46:20 -07:00
a7e09b8727 pytorch | Namespace init_win symbol in qnnpack.
Summary: Namespacing the symbol, since it clashes with "the real thing" otherwise.

Test Plan: Sandcastle + build it on windows

Reviewed By: dreiss

Differential Revision: D22348240

fbshipit-source-id: f9c9a7abc97626ba327605cb4749fc5c38a24d35
2020-07-02 13:37:40 -07:00
e1428cf41b [JIT] fix unfold shape analysis (#40749)
Summary:
unfold on a 0-dimensional tensor returns a 1-dimensional tensor
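
An illustrative check of the runtime behavior the shape analysis must now model:
```
import torch

t = torch.tensor(3.0)     # 0-dim tensor
u = t.unfold(0, 1, 1)
print(u.shape)            # torch.Size([1]): a 1-dim result
```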
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40749

Differential Revision: D22361481

Pulled By: eellison

fbshipit-source-id: 621597e5f97f6e39953eb86f8b85bb4142527a9f
2020-07-02 13:32:37 -07:00
ce63f70981 [C2] Fixed a bug in normalization operator (#40925)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40925

The normalization operator does not handle empty tensors correctly. This is a fix.

Test Plan: unit tests

Differential Revision: D22330340

fbshipit-source-id: 0bccf925bb768ebb997ed0c88130c5556308087f
2020-07-02 13:24:56 -07:00
af5bcba217 .circleci: Build docker images as part of CI workflow (#40827)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40827

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22360735

Pulled By: seemethere

fbshipit-source-id: 4ffbde563fdc3c49fdd14794ed3c2e881030361d
2020-07-02 13:00:39 -07:00
9f14e48834 Override shape hints with real weight shape extracted from workspace (#40872)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40872

Shape hints, as the name suggests, are only hints. We should use the real shapes from the workspace for the weights.

Reviewed By: ChunliF

Differential Revision: D22337680

fbshipit-source-id: e7a6101fb613ccb332c3e34b1c2cb8c6c47ce79b
2020-07-02 12:55:29 -07:00
db39542509 [2/n][Compute Meta] support analysis for null flag features
Summary:
## TLDR
Support using a NaN default value for missing dense features in RawInputProcessor for DPER2, in preparation for subsequent support for null flag features in compute meta. For train_eval this is already supported in DPER3, and we do not plan to support it in DPER2 train_eval.
## Overview
This is part of an intern project to support adding dense flags for missing feature values instead of replacing them with zero.
## Project plan :
https://docs.google.com/document/d/1OsPUTjpJycwxWLCue3Tnb1mx0uDC_2KKWvC1Rwpo2NI/edit?usp=sharing
## Code paths:
See https://fb.quip.com/eFXUA0tbDmNw for the call stack for all affected code paths.

Test Plan:
## fblearner flow test

1. `flow-cli clone f197867430 --run-as-secure-group ads_personalization_systems --force-build` to build a ephemeral package and start a fblearner flow run (may fail)
2. Clone the new run and change the secure_group to `XXXX` and entitlement to `default` in the UI
3. Adds explicit_null_min_coverage flag
4. Optionally reduce `max_examples` since we only test pass/fail instead of quality.
5. Submit the run to test the change

Example:
f198538878

## compare output coverages to daiquery runs

1. Randomly select null flag features from compute meta workflow output
2. Look up the feature id in feature metadata using feature name
3. Check against a daiquery sample of coverage to see if the coverage falls within guidelines.
https://www.internalfb.com/intern/daiquery/workspace/275342740223489/192619942076136/

## Sampled features:
GFF_C66_ADS_USER_SUM_84_PAGE_TYPE_RATIO_EVENT_LIKE_IMPRESSION: 15694257
- original feature compute meta coverage: 0.999992
- daiquery feature coverage (10k rows): 0.69588
- null flag compute meta coverage: 0.293409

GFF_R1303_ADS_USER_SUM_7_PAGE_TYPE_COUNTER_CONVERSION: 16051183
- original feature compute meta coverage: 0.949868
- daiquery feature coverage: 0.82241
- null flag compute meta coverage: 0.151687

## Unit tests:

`buck test  fblearner/flow/projects/dper/tests/workflows:ads_test`

https://www.internalfb.com/intern/testinfra/testconsole/testrun/6192449504303863/

Differential Revision: D22026450

fbshipit-source-id: 46c59d849fa89253f14dc2b035c4c677cd6e3a4c
2020-07-02 12:44:41 -07:00
b678666a04 Add module.training to docs (#40923)
Summary:
A lot of people ask https://discuss.pytorch.org/t/check-if-model-is-eval-or-train/9395/3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40923
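
For reference, the attribute being documented can be checked directly:
```
import torch.nn as nn

m = nn.Linear(2, 2)
print(m.training)   # True: modules start in training mode
m.eval()
print(m.training)   # False after switching to eval mode
```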

Reviewed By: pbelevich

Differential Revision: D22358799

Pulled By: zou3519

fbshipit-source-id: b5465ffedb691fb4811e097c4dbd7bbc405be09c
2020-07-02 12:36:59 -07:00
6ae3cd0d9d Configure RPC metrics handlers and pass them into Thrift RPC Agent (#40602)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40602

Reviewed By: pritamdamania87

Differential Revision: D22250592

fbshipit-source-id: d38131f30939fc26af241b40e057a9dc1109e950
2020-07-02 11:41:21 -07:00
6aabd12390 fix issue #31759 (allow valid ASCII python identifiers as dimnames) (#40871)
Summary:
Fixes issue https://github.com/pytorch/pytorch/issues/31759:
- Changes is_valid_identifier check on named tensor dimensions to allow digits if they are not at the beginning of the name (this allows exactly the ASCII subset of [valid python identifiers](https://docs.python.org/3/reference/lexical_analysis.html#identifiers)).
- Updates error message for illegal dimension names.
- Updates and adds relevant tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40871
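
A short sketch of what the relaxed check accepts (named tensor API; names are illustrative):
```
import torch

# Digits are now allowed anywhere except the first character
x = torch.zeros(2, 3, names=("batch1", "dim_2"))
print(x.names)   # ('batch1', 'dim_2')

# names=("1batch",) would still raise: a name may not start with a digit
```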

Reviewed By: pbelevich

Differential Revision: D22357314

Pulled By: zou3519

fbshipit-source-id: 9550a1136dd0673dd30a5cd5ade28069ba4c9086
2020-07-02 11:35:54 -07:00
5db5a0f2bb Re-enable Caffe2 test RoiAlignTest.CheckCPUGPUEqual (#40901)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35547.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40901

Differential Revision: D22357760

Pulled By: malfet

fbshipit-source-id: 43f7dc13a905416288a9a317ae31a4dc78276ce4
2020-07-02 11:22:23 -07:00
1a74bb84f2 Remove Int8FC diff restriction.
Summary: Remove Int8FC diff restriction.

Test Plan: test_int8_ops_nnpi.py

Reviewed By: hyuen

Differential Revision: D22353200

fbshipit-source-id: c6c80c9dda3245c02da8343ecd5689994baf0143
2020-07-02 08:15:31 -07:00
591fffc524 Type-annotate serialization.py (#40862)
Summary:
Move the Storage class from __init__.pyi.in to types.py and make it a protocol, since it is not a real class.
Expose the `PyTorchFileReader` and `PyTorchFileWriter` native classes.

Ignore function attributes, as there is not yet a good way to type-annotate those; see https://github.com/python/mypy/issues/2087
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40862

Differential Revision: D22344743

Pulled By: malfet

fbshipit-source-id: 95cdb6f980ee79383960f306223e170c63df3232
2020-07-02 07:10:55 -07:00
9fa1f27968 [jit] Fix value association with dictionaries in the tracer (#40885)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40885

`TracingState::setValue` associates a concrete IValue in the traced
program with a symbolic `Value*`. Previously, the logic for how
GenericDicts worked was special-cased to handle only very simple cases
and to silently swallow the rest.

This PR generalizes the logic to reflect the same behavior as using
dictionaries on input: whenever we encounter a dictionary in the system,
we completely "burn in" all the keys into the graph, and then
recursively call `setValue` on the associated value.

This has the effect of requiring that any dictionary structure you are
creating in a traced program be of fixed structure, similar to how any
dictionary used as input must be static as well.
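
Illustratively, a traced function may build a dict freely as long as its key structure stays fixed (a sketch, not code from this diff):
```
import torch

def f(x):
    d = {"a": x + 1, "b": x * 2}   # keys are "burned in" at trace time
    return d["a"] + d["b"]

traced = torch.jit.trace(f, torch.randn(3))
print(traced(torch.randn(3)))
```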

Test Plan: Imported from OSS

Differential Revision: D22342490

Pulled By: suo

fbshipit-source-id: 93e610a4895d61d9b8b19c8d2aa4e6d57777eaf6
2020-07-02 04:09:35 -07:00
59294fbbb9 [caffe2] Reimplement RemoveOpsByType with SSA (#40649)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40649

The original implementation of RemoveOpsByType is pretty buggy and does not remove all instances of the ops that should be removed. It's also quite complicated and hard to modify. I reimplemented it by first converting the graph to its SSA form. The algorithm is quite simple once the graph is in SSA form: it's very similar to constant propagation with a few modifications. The hardest part is dealing with the removal of an op whose output is also an output of the predict net, because that output has to be preserved.

(Note: this ignores all push blocking failures!)

Reviewed By: yinghai, dzhulgakov

Differential Revision: D22220798

fbshipit-source-id: faf6ed5242f1e2f310125d964738c608c6c55c94
2020-07-02 02:45:36 -07:00
ea03f954ad [ONNX] Add warning in ONNX export when constant folding is on in training-amenable mode (#40546)
Summary:
This PR introduces a warning when a user tries to export a model to ONNX in training-amenable mode while constant folding is turned on. We want to warn against any unintentional use, because constant folding may fold some parameters that are intended to remain trainable in the exported model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40546

Reviewed By: hl475

Differential Revision: D22310917

Pulled By: houseroad

fbshipit-source-id: ba83b8e63af7c458b5ecca8ff2ee1c77e2064f90
2020-07-01 21:40:38 -07:00
73f11dc3d1 torch._six.PY37 should be true for Python-3.8 as well (#40868)
Summary:
Right now it is used to check whether `math.remainder` exists, which is the case for both Python-3.7 and 3.8
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40868

Differential Revision: D22343454

Pulled By: malfet

fbshipit-source-id: 6b6d4869705b64c4b952309120f92c04ac7e39fd
2020-07-01 19:49:37 -07:00
8f6e50d013 Make some more ops c10-full (#40747)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40747

-
ghstack-source-id: 106833603

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D22299161

fbshipit-source-id: 6e34999b5f8244d9582e4978754039d340720ca8
2020-07-01 19:39:32 -07:00
d7c9f96e43 Optimize perf for calling ops with custom classes (#38257)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38257

It seems we're doing a runtime type check for custom classes on each operator call if the operator has custom class arguments.
This has no effect on operators without custom class arguments, but it is a problem for operators that do have them,
for example operators taking an at::native::xnnpack::Conv2dOpContext argument.

The long term solution would be to move those checks to op registration time instead of doing them at call time,
but as an intermediate fix, we can at least make the check fast by

- Using ska::flat_hash_map instead of std::unordered_map
- Using std::type_index instead of std::string (i.e. avoid calling std::hash on a std::string)
ghstack-source-id: 106805209

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D21507226

fbshipit-source-id: bd120d5574734be843c197673ea4222599fee7cb
2020-07-01 19:28:29 -07:00
2f47e953f7 Fixes #40158 (#40617)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40158

Description
- docs update: removed incorrect statements
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40617

Reviewed By: ezyang

Differential Revision: D22308802

Pulled By: yns88

fbshipit-source-id: e33084af320f249c0c9ba04bdbe2191d1b954d17
2020-07-01 18:05:44 -07:00
04b6e4273e clang format reducer.cpp (#40876)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40876

clang format reducer.cpp
ghstack-source-id: 106980050

Test Plan: unit test

Differential Revision: D22321422

fbshipit-source-id: 54afdff206504c7bbdf2e408928cc32068e15cdc
2020-07-01 17:24:37 -07:00
ad30d465d5 Move install_torchvision to common.sh so that it can be sourced. (#40828)
Summary:
Moving this to a file that can be source by downstream pytorch/xla.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40828

Reviewed By: malfet

Differential Revision: D22339513

Pulled By: ailzhang

fbshipit-source-id: c43b18fa2b7e1e8bb6810a6a43bb7dccd4756238
2020-07-01 16:40:43 -07:00
49e12d888a [NCCL - reland] Explicitly abort NCCL Communicators on Process Group Destruction (#40585)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40585

This PR aborts incomplete NCCL Communicators in the ProcessGroupNCCL
destructor. This should prevent pending NCCL communicators from blocking other CUDA ops.
ghstack-source-id: 106988073

Test Plan: Sandcastle/ OSS CI

Differential Revision: D22244873

fbshipit-source-id: 4b4fe65e1bd875a50151870f8120498193d7535e
2020-07-01 16:21:16 -07:00
af34f2f63b Added missing generator argument in type annotation(pytorch#40803) (#40873)
Summary:
Added missing generator argument in type annotation (pytorch#40803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40873

Differential Revision: D22344217

Pulled By: malfet

fbshipit-source-id: 9871401b97c96fa20c70e3f66334259ead1f8429
2020-07-01 16:05:18 -07:00
c73255801f Fix the autograd codegen for repeat function (#40766)
Summary:
Fix https://github.com/pytorch/pytorch/issues/40701

A new special case is added to let `dim()` save an int instead of self.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40766

Differential Revision: D22308354

Pulled By: albanD

fbshipit-source-id: 69008230d7398b9e06b8e074a549ae921c2bf603
2020-07-01 15:43:28 -07:00
26543e6caf [quant][graphmode] FP16 quant support - Operator Fusion (#40710)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40710

Test Plan:
python test/test_quantization.py

Imported from OSS

Differential Revision: D22335975

fbshipit-source-id: 5c176bb6b9c300e1beb83df972149dd5a400b854
2020-07-01 14:15:53 -07:00
55b5ab14d3 [quant][graphmode] FP16 quant support - Insert cast operators (#40709)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40709

Cast to kHalf and back to kFloat before the linear operator to mimic FP16 quant support
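
In tensor terms, the inserted casts amount to a half-precision round-trip around the linear op (illustrative sketch):
```
import torch
import torch.nn.functional as F

x = torch.randn(2, 3)
w = torch.randn(4, 3)
# Emulate FP16 dynamic quantization of the weight: a round-trip
# through half precision before the float linear op.
w_fp16 = w.to(torch.half).to(torch.float)
y = F.linear(x, w_fp16)
```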

Test Plan:
python test/test_quantization.py test_convert_dynamic_fp16

Imported from OSS

Differential Revision: D22335977

fbshipit-source-id: f964128ec733469672a1ed4cb0d757d0a6c22c3a
2020-07-01 14:15:51 -07:00
6aebd2c412 [quant][graphmode] Add FP16 quant support - Insert Noop Observers (#40708)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40708

Insert NoopObservers for activations and weight tensors for FP16

Test Plan:
python test/test_quantization.py test_prepare_dynamic

Imported from OSS

Differential Revision: D22335976

fbshipit-source-id: b19e8035c7db3b0b065ec09c9ad6d913eb434f3e
2020-07-01 14:13:31 -07:00
d1352192e2 Move OperatorBase::AddRelatedBlobInfo implementation to .cc file (#40844)
Summary:
If a virtual function is implemented in a header file, its implementation will be included as a weak symbol in every shared library that includes this header, along with all of its dependencies.

This was one of the reasons why the size of libcaffe2_module_test_dynamic.so was 500Kb (the AddRelatedBlobInfo implementation pulled in a quarter of libprotobuf.a with it)

Combination of this and https://github.com/pytorch/pytorch/issues/40845 reduces size of `libcaffe2_module_test_dynamic.so` from 500kb to 50Kb.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40844

Differential Revision: D22334725

Pulled By: malfet

fbshipit-source-id: 836a4cbb9f344355ddd2512667e77472546616c0
2020-07-01 11:48:15 -07:00
cbdf399fc6 Move OperatorSchema default inference function implementations to .cc… (#40845)
Summary:
… file

This prevents the implementations of those functions (defined as lambdas) from being embedded as weak symbols into every shared library that includes this header.

Combination of this and https://github.com/pytorch/pytorch/pull/40844 reduces size of `libcaffe2_module_test_dynamic.so` from 500kb to 50Kb.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40845

Differential Revision: D22334779

Pulled By: malfet

fbshipit-source-id: 64706918fc2947350a58c0877f294b1b8b085455
2020-07-01 11:42:52 -07:00
c71ec1c717 Fix zip serialization for file > 2GiB for Windows (#40783)
Summary:
`long long == int64_t != long` in MSVC
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40783

Differential Revision: D22328757

Pulled By: ezyang

fbshipit-source-id: bc7301d6b0e7e00ee6d7ca8637e3fce7810b15e2
2020-07-01 08:15:27 -07:00
a0569ad8f8 [android][readme] Aar native linking add fbjni (#40578)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40578

Test Plan: Imported from OSS

Differential Revision: D22239286

Pulled By: IvanKobzarev

fbshipit-source-id: 7a4160b621af8cfcc3b3d9e6da1a75c8afefba27
2020-07-01 08:09:17 -07:00
fcadca1bda serialization: validate sparse tensors after loading (#34059)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33439

This introduces torch._sparse_coo_tensor_unsafe(...) and
torch._validate_sparse_coo_tensor_args(...)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34059
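
Illustratively, the kind of check involved can be seen via the public constructor, which validates its arguments:
```
import torch

i = torch.tensor([[0, 3]])   # index 3 is out of range for size (2,)
v = torch.tensor([1.0, 2.0])
try:
    torch.sparse_coo_tensor(i, v, (2,))
except RuntimeError as err:
    print("invalid sparse args caught:", err)
```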

Differential Revision: D22161254

Pulled By: ezyang

fbshipit-source-id: 994efc9b0e30abbc23ddd7b2ec987e6ba08a8ef0
2020-06-30 22:31:21 -07:00
5f9e7240f5 Fix bug where explicitly providing a namespace never worked. (#40830)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40830

Fixes #40725

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22323886

Pulled By: ezyang

fbshipit-source-id: b8a61496923d9f086d4c201024748505ba783238
2020-06-30 22:20:05 -07:00
2cf9fe2d92 Remove more error-exposing tests in exp that cannot be reliably reproduced (#40825)
Summary:
Continuing https://github.com/pytorch/pytorch/issues/40824

All CIs have been enabled (on a branch that starts with `ci-all/`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40825

Differential Revision: D22328732

Pulled By: ezyang

fbshipit-source-id: 3e517d01a9183d95df0687b328fb268947ea5fb0
2020-06-30 22:14:32 -07:00
f13653db29 [Update transforms.py]use build-in atanh in TanhTransform (#40160)
Summary:
Since `torch.atanh` was recently implemented in https://github.com/pytorch/pytorch/issues/38388, we should simply use it for `TanhTransform`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40160
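
Illustratively, the inverse of the transform reduces to the built-in op:
```
import torch

y = torch.tanh(torch.tensor(0.5))
x = torch.atanh(y)   # built-in inverse of tanh
print(torch.allclose(x, torch.tensor(0.5)))   # True
```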

Differential Revision: D22208039

Pulled By: ezyang

fbshipit-source-id: 34dfbc91eb9383461e16d3452e3ebe295f39df26
2020-06-30 21:38:22 -07:00
fbcf419173 Respect user set thread count. (#40707)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40707

Test Plan: Imported from OSS

Differential Revision: D22318197

Pulled By: AshkanAliabadi

fbshipit-source-id: f11b7302a6e91d11d750df100d2a3d8d96b5d1db
2020-06-30 20:14:49 -07:00
0203d70c63 [nit] fix some typo within documentation (#40692)
Summary:
Apologies if this seems trivial, but I'd like to fix these typos that I noticed while reading some of the source code. Thanks!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40692

Differential Revision: D22284651

Pulled By: mrshenli

fbshipit-source-id: 4259d1808aa4d15a02cfd486cfb44dd75fdc58f8
2020-06-30 19:24:44 -07:00
8e0714a60d [rfc] Reduce number of coin flips in RecordFunction (#40758)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40758

Currently we flip a coin for each sampled callback each time
we run RecordFunction. This PR is an attempt to skip most of the coin
flips (for the low-probability observers) while keeping the distribution
close to the original one.
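
One standard way to avoid a per-call coin flip (a sketch of the general technique; the exact scheme in this PR may differ) is to draw the gap until the next sampled call from a geometric distribution:
```
import math
import random

def draw_gap(p):
    # Number of events to skip before the next sampled one. Sampling the
    # gap from a geometric distribution keeps each event sampled with
    # probability p without a per-event coin flip.
    return int(math.log(1.0 - random.random()) / math.log(1.0 - p))

p = 0.001
remaining = draw_gap(p)
for event in range(100_000):
    if remaining == 0:
        # run the sampled callback here
        remaining = draw_gap(p)
    else:
        remaining -= 1
```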

Test Plan:
CI and record_function_benchmark
```
(python_venv) iliacher@devgpu151:~/local/pytorch  (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 30108 us.
Time per iteration (1x1): 1496.78 us.
Time per iteration (16x16): 2142.46 us.
Pure RecordFunction runtime of 10000000 iterations 687929 us, number of callback invocations: 978
(python_venv) iliacher@devgpu151:~/local/pytorch  (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 19051 us.
Time per iteration (1x1): 1581.89 us.
Time per iteration (16x16): 2195.67 us.
Pure RecordFunction runtime of 10000000 iterations 682402 us, number of callback invocations: 1023
(python_venv) iliacher@devgpu151:~/local/pytorch  (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 18715 us.
Time per iteration (1x1): 1566.11 us.
Time per iteration (16x16): 2131.17 us.
Pure RecordFunction runtime of 10000000 iterations 693571 us, number of callback invocations: 963
(python_venv) iliacher@devgpu151:~/local/pytorch  (reduce_coin_flops)$

(python_venv) iliacher@devgpu151:~/local/pytorch  (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 18814 us.
Time per iteration (1x1): 1536.2 us.
Time per iteration (16x16): 1985.82 us.
Pure RecordFunction runtime of 10000000 iterations 944959 us, number of callback invocations: 1015
(python_venv) iliacher@devgpu151:~/local/pytorch  (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 18278 us.
Time per iteration (1x1): 1526.32 us.
Time per iteration (16x16): 2093.77 us.
Pure RecordFunction runtime of 10000000 iterations 985307 us, number of callback invocations: 1013
(python_venv) iliacher@devgpu151:~/local/pytorch  (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 18545 us.
Time per iteration (1x1): 1524.65 us.
Time per iteration (16x16): 2080 us.
Pure RecordFunction runtime of 10000000 iterations 952835 us, number of callback invocations: 1048
```

Reviewed By: dzhulgakov

Differential Revision: D22320879

Pulled By: ilia-cher

fbshipit-source-id: 2193f07d2f7625814fe7bc3cc85ba4092fe036bc
2020-06-30 17:23:00 -07:00
179dbd4f25 [jit] preserve keys on dictionary input tracing (#40792)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40792

Fixes https://github.com/pytorch/pytorch/issues/40529.

One followup should be to produce a better error message when a new
dictionary has different keys than the traced input. Right now it
presents as a fairly opaque `KeyError`.

Test Plan: Imported from OSS

Differential Revision: D22311731

Pulled By: suo

fbshipit-source-id: c9fbe0b54cf69daed2f11a191d988568521a3932
2020-06-30 16:50:36 -07:00
0ddaaf6a92 [codemod][caffe2] Run clang-format - 5/7
Summary:
This directory is opted-in to clang-format but is not format-clean. This blocks continuous formatting from being enabled on fbcode, and causes hassle for other codemods that leave inconsistent formatting. This diff runs clang-format, which is widely used and considered safe.

If you are unhappy with the formatting of a particular block, please *accept this diff* and then in a stacked commit undo the change and wrap that code in `// clang-format off` and `// clang-format on`, or `/* clang-format off */` and `/* clang-format on */`.

drop-conflicts

Test Plan: sandcastleit

Reviewed By: jerryzh168

Differential Revision: D22311706

fbshipit-source-id: 1ca59a82e96156a4a5dfad70ba3e64d44c5e762a
2020-06-30 15:45:11 -07:00
29aef8f460 Skip some error-producing exp tests that cannot be reliably reproduced (#40824)
Summary:
This is to take care of additional master CI tests for https://github.com/pytorch/pytorch/issues/39087
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40824

Differential Revision: D22321429

Pulled By: ezyang

fbshipit-source-id: 607e284688b3e4ce24d803a030e31991e4e32fd7
2020-06-30 15:39:09 -07:00
0a75234934 Allow np.memmap objects (numpy arrays based on files) to be processed… (#39847)
Summary:
Allow np.memmap objects to be processed by default_collate

np.memmap objects have the same behavior as numpy arrays; the only difference is that they are stored in a binary file on disk. However, the default_collate function used by the PyTorch DataLoader only accepts np.array and rejects np.memmap by type checking. This commit allows np.memmap objects to be processed by default_collate. In this way, users can use large on-disk arrays with the PyTorch DataLoader.
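
A minimal sketch of the now-supported pattern (the file name is illustrative):
```
import numpy as np
import torch

arr = np.memmap("data.bin", dtype="float32", mode="w+", shape=(8, 4))
loader = torch.utils.data.DataLoader(arr, batch_size=2)
batch = next(iter(loader))   # rows are collated into a tensor, as for np.ndarray
print(batch.shape)           # torch.Size([2, 4])
```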
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39847

Reviewed By: ezyang

Differential Revision: D22284650

Pulled By: zou3519

fbshipit-source-id: 003e3208a2afd1afc2e4640df14b3446201e00b4
2020-06-30 15:00:20 -07:00
9d8dc0318b [pruning] add rowwise counter to sparse adagrad
Summary: Use the newly added counter op in sparse adagrad

Reviewed By: chocjy, ellie-wen

Differential Revision: D19221100

fbshipit-source-id: d939d83e3b5b3179f57194be2e8864d0fbbee2c1
2020-06-30 14:40:02 -07:00
40e79bb1d3 Update the version of ninja and scipy (#40677)
Summary:
Update scipy to 1.15 and ninja to 1.10.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40677

Differential Revision: D22311602

Pulled By: ezyang

fbshipit-source-id: ddc852b3b8c3091409d1b3bd579dd144b58e5d47
2020-06-30 14:29:40 -07:00
e762ce8ecf Avoid initializing new_group in test_backward_no_ddp. (#40727)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40727

This unit test doesn't need a new group, so we avoid initializing a
process group altogether.

#Closes: https://github.com/pytorch/pytorch/issues/40292
ghstack-source-id: 106817362

Test Plan: waitforbuildbot

Differential Revision: D22295131

fbshipit-source-id: 5a60e91e4beeb61cc204d24c564106d0215090a6
2020-06-30 14:01:05 -07:00
5a4911834d Add CUDA11 build and test (#40452)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40452

Differential Revision: D22316007

Pulled By: malfet

fbshipit-source-id: 94f4b4ba2a46ff3d3042ba842a615f8392cdc350
2020-06-30 13:50:44 -07:00
1571dd8692 Refactor duplicated string literals (#40788)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40788

Avoid repeating the same `:gencode[foo/bar]` over and over again

Test Plan: CI

Reviewed By: EscapeZero

Differential Revision: D22271151

fbshipit-source-id: f8db57db4ee0948bcca0c8945fdf30380ba81cae
2020-06-30 13:45:02 -07:00
6e4f99b063 Fix wrong MSVC version constraint for CUDA 9.2 (#40794)
Summary:
Tested with https://github.com/pytorch/pytorch/pull/40782.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40794

Differential Revision: D22318045

Pulled By: malfet

fbshipit-source-id: a737ffd7cb8a6a9efb62b84378318f4c3800ad8f
2020-06-30 13:02:45 -07:00
9ac0febb1f Pin torchvision version for doc_push (#40802)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40802

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22317343

Pulled By: ezyang

fbshipit-source-id: 8a982dd93a28d102dfd63163cd44704e899922e0
2020-06-30 12:52:13 -07:00
f3949794a3 Prototype benchmarking util (#38338)
Summary:
This is the prototype for the modular utils that we've been discussing. It is admittedly a large PR, but a good fraction of that is documentation and examples. I've trimmed a bit on the edges since we last discussed this design (for instance Timer is no longer Fuzzer aware), but it's mostly the same.

In addition to the library and hermetic examples, I've included `examples.end_to_end` which tests https://github.com/pytorch/pytorch/pull/38061 over a variety of shapes, dtypes, degrees of broadcasting, and layouts. (CC crcrpar)  I only did CPU as I'm not set up on a GPU machine yet. [Results from my devserver](https://gist.github.com/robieta/d1a8e1980556dc3f4f021c9f7c3738e2)

Key takeaways:
  1) For contiguous Tensors, larger dtypes (fp32 and fp64) and lots of reuse of the mask due to broadcasting, improvements are significant. (Presumably due to better vectorization?)
  2) There is an extra ~1.5 us overhead, which dominates small kernels.
  3) Cases with lower write intensity (int8, lower mask fraction, etc.) or non-contiguous layouts seem to suffer.

Hopefully this demonstrates the proof-of-concept for how this tooling can be used to tune kernels and assess PRs. Looking forward to thoughts and feedback.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38338

Differential Revision: D21551048

Pulled By: robieta

fbshipit-source-id: 6c50e5439a04eac98b8a2355ef731852ba0500db
2020-06-30 11:31:27 -07:00
c648cd372f Fix complex printing for sci_mode=True (#40513)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40513

This PR makes the following changes:
1. Complex printing now uses print formatting for its real and imaginary values, which are joined at the end.
2. Change 1 naturally fixes the printing of complex tensors in sci_mode=True

```
>>> torch.tensor(float('inf')+float('inf')*1j)
tensor(nan+infj)
>>> torch.randn(2000, dtype=torch.cfloat)
tensor([ 0.3015-0.2502j, -1.1102+1.2218j, -0.6324+0.0640j,  ...,
        -1.0200-0.2302j,  0.6511-0.1889j, -0.1069+0.1702j])
>>> torch.tensor([1e-3, 3+4j, 1e-5j, 1e-2+3j, 5+1e-6j])
tensor([1.0000e-03+0.0000e+00j, 3.0000e+00+4.0000e+00j, 0.0000e+00+1.0000e-05j,
        1.0000e-02+3.0000e+00j, 5.0000e+00+1.0000e-06j])
>>> torch.randn(3, dtype=torch.cfloat)
tensor([ 1.0992-0.4459j,  1.1073+0.1202j, -0.2177-0.6342j])
>>> x = torch.tensor([1e2, 1e-2])
>>> torch.set_printoptions(sci_mode=False)
>>> x
tensor([  100.0000,     0.0100])
>>> x = torch.tensor([1e2, 1e-2j])
>>> x
tensor([100.+0.0000j,   0.+0.0100j])
```

Test Plan: Imported from OSS

Differential Revision: D22309294

Pulled By: anjali411

fbshipit-source-id: 20edf9e28063725aeff39f3a246a2d7f348ff1e8
2020-06-30 11:13:42 -07:00
871bfaaba1 [JIT] Fix shape analysis for aten::masked_select. (#40753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40753

The reference says that this op always returns a 1-D tensor, even if
the input and the mask are 0-D.
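
An illustrative check of the behavior being modeled:
```
import torch

x = torch.tensor(5.0)        # 0-dim input
mask = torch.tensor(True)    # 0-dim mask
out = torch.masked_select(x, mask)
print(out.shape)             # torch.Size([1]): always 1-D
```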

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D22300354

Pulled By: ZolotukhinM

fbshipit-source-id: f6952989c8facf87d73d00505bf6d41573eff2d6
2020-06-30 11:04:50 -07:00
50d55b9f2b [JIT] Update type of the unsqueeze's output in shape analysis. (#40733)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40733

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D22298537

Pulled By: ZolotukhinM

fbshipit-source-id: a5d4597ed10bcf14d1b28e914bf898d0cae5b4c0
2020-06-30 11:01:45 -07:00
c3237c7a87 Print hostname of RoCM tester (#40755)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40755

Differential Revision: D22311699

Pulled By: malfet

fbshipit-source-id: 057702800fec84fae787b7837f39348273c80cec
2020-06-30 10:56:31 -07:00
a303fd2ea6 Let exp support complex types on CUDA and enable device/dtype in complex tests (#39087)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39087

Differential Revision: D22169697

Pulled By: anjali411

fbshipit-source-id: 4866b7be6742508cc40540ed1ac811f005531d8b
2020-06-30 10:50:40 -07:00
ef5a314597 [typing] fix register_buffer/parameter (#40669)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40669

Differential Revision: D22286130

Pulled By: ezyang

fbshipit-source-id: c0cc173279678978726895a0830343d5234e474e
2020-06-30 10:39:32 -07:00
5923a802fa Back out "[pytorch][PR] [ONNX] Add eliminate_unused_items pass"
Summary:
Original commit changeset: 30e1a6e8823a

it causes an issue with fusing BN

Test Plan: revert

Reviewed By: houseroad

Differential Revision: D22296958

fbshipit-source-id: 62664cc77baa8811ad6ecce9d0520a2ab7f89868
2020-06-30 10:26:35 -07:00
3ecae99dd9 Support Pathlike for zipfile serialization (#40723)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40723

Test Plan: Imported from OSS

Differential Revision: D22294575

Pulled By: jamesr66a

fbshipit-source-id: b157fa0ab02c4eb22cb99ac870942aeab352b0c5
2020-06-30 10:07:23 -07:00
c56255499a Reverts running clang-tidy on ATen (#40764)
Summary:
Reverts https://github.com/pytorch/pytorch/pull/39713.

We are seeing CUDA-related clang-tidy failures on multiple PRs after the above change. The cause of these failures is unclear. Example error message:

```
2020-06-26T18:45:10.9763273Z + python tools/clang_tidy.py --verbose --paths torch/csrc/ aten/src/ATen/ --diff 5036c94a6e868963e0354fc04c92e204d8d77677 -g-torch/csrc/jit/serialization/export.cpp -g-torch/csrc/jit/serialization/import.cpp -g-torch/csrc/jit/serialization/import_legacy.cpp -g-torch/csrc/onnx/init.cpp '-g-torch/csrc/cuda/nccl.*' -g-torch/csrc/cuda/python_nccl.cpp
2020-06-26T18:45:11.1990578Z Error while processing /home/runner/work/pytorch/pytorch/aten/src/ATen/native/cuda/UnaryOpsKernel.cu.
2020-06-26T18:45:11.1992832Z Found compiler error(s).
2020-06-26T18:45:11.2286995Z Traceback (most recent call last):
2020-06-26T18:45:11.2288334Z   File "tools/clang_tidy.py", line 55, in run_shell_command
2020-06-26T18:45:11.2288607Z     output = subprocess.check_output(arguments).decode().strip()
2020-06-26T18:45:11.2289053Z   File "/opt/hostedtoolcache/Python/3.8.3/x64/lib/python3.8/subprocess.py", line 411, in check_output
2020-06-26T18:45:11.2289337Z     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
2020-06-26T18:45:11.2289786Z   File "/opt/hostedtoolcache/Python/3.8.3/x64/lib/python3.8/subprocess.py", line 512, in run
2020-06-26T18:45:11.2290038Z     raise CalledProcessError(retcode, process.args,
2020-06-26T18:45:11.2292206Z subprocess.CalledProcessError: Command '['clang-tidy', '-p', 'build', '-config', '{"Checks": "-*, bugprone-*, -bugprone-forward-declaration-namespace, -bugprone-macro-parentheses, -bugprone-lambda-function-name, cppcoreguidelines-*, -cppcoreguidelines-interfaces-global-init, -cppcoreguidelines-owning-memory, -cppcoreguidelines-pro-bounds-array-to-pointer-decay, -cppcoreguidelines-pro-bounds-constant-array-index, -cppcoreguidelines-pro-bounds-pointer-arithmetic, -cppcoreguidelines-pro-type-cstyle-cast, -cppcoreguidelines-pro-type-reinterpret-cast, -cppcoreguidelines-pro-type-static-cast-downcast, -cppcoreguidelines-pro-type-union-access, -cppcoreguidelines-pro-type-vararg, -cppcoreguidelines-special-member-functions, hicpp-exception-baseclass, hicpp-avoid-goto, modernize-*, -modernize-return-braced-init-list, -modernize-use-auto, -modernize-use-default-member-init, -modernize-use-using, -modernize-use-trailing-return-type, performance-*, -performance-noexcept-move-constructor, ", "HeaderFilterRegex": "torch/csrc/.*", "AnalyzeTemporaryDtors": false, "CheckOptions": null}', '-line-filter', '[{"name": "aten/src/ATen/native/cuda/UnaryOpsKernel.cu", "lines": [[10, 11], [29, 30]]}]', 'aten/src/ATen/native/cuda/UnaryOpsKernel.cu']' returned non-zero exit status 1.
2020-06-26T18:45:11.2292551Z
2020-06-26T18:45:11.2292684Z During handling of the above exception, another exception occurred:
2020-06-26T18:45:11.2292775Z
2020-06-26T18:45:11.2292894Z Traceback (most recent call last):
2020-06-26T18:45:11.2293208Z   File "tools/clang_tidy.py", line 306, in <module>
2020-06-26T18:45:11.2293364Z     main()
2020-06-26T18:45:11.2293817Z   File "tools/clang_tidy.py", line 298, in main
2020-06-26T18:45:11.2293980Z     clang_tidy_output = run_clang_tidy(options, line_filters, files)
2020-06-26T18:45:11.2294282Z   File "tools/clang_tidy.py", line 191, in run_clang_tidy
2020-06-26T18:45:11.2294439Z     output = run_shell_command(command)
2020-06-26T18:45:11.2294703Z   File "tools/clang_tidy.py", line 59, in run_shell_command
2020-06-26T18:45:11.2294931Z     raise RuntimeError("Error executing {}: {}".format(" ".join(arguments), error_output))
2020-06-26T18:45:11.2296875Z RuntimeError: Error executing clang-tidy -p build -config {"Checks": "-*, bugprone-*, -bugprone-forward-declaration-namespace, -bugprone-macro-parentheses, -bugprone-lambda-function-name, cppcoreguidelines-*, -cppcoreguidelines-interfaces-global-init, -cppcoreguidelines-owning-memory, -cppcoreguidelines-pro-bounds-array-to-pointer-decay, -cppcoreguidelines-pro-bounds-constant-array-index, -cppcoreguidelines-pro-bounds-pointer-arithmetic, -cppcoreguidelines-pro-type-cstyle-cast, -cppcoreguidelines-pro-type-reinterpret-cast, -cppcoreguidelines-pro-type-static-cast-downcast, -cppcoreguidelines-pro-type-union-access, -cppcoreguidelines-pro-type-vararg, -cppcoreguidelines-special-member-functions, hicpp-exception-baseclass, hicpp-avoid-goto, modernize-*, -modernize-return-braced-init-list, -modernize-use-auto, -modernize-use-default-member-init, -modernize-use-using, -modernize-use-trailing-return-type, performance-*, -performance-noexcept-move-constructor, ", "HeaderFilterRegex": "torch/csrc/.*", "AnalyzeTemporaryDtors": false, "CheckOptions": null} -line-filter [{"name": "aten/src/ATen/native/cuda/UnaryOpsKernel.cu", "lines": [[10, 11], [29, 30]]}] aten/src/ATen/native/cuda/UnaryOpsKernel.cu: error: cannot find libdevice for sm_20. Provide path to different CUDA installation via --cuda-path, or pass -nocudalib to build without linking with libdevice. [clang-diagnostic-error]
2020-06-26T18:45:11.2313329Z error: unable to handle compilation, expected exactly one compiler job in ' "/usr/bin/c++" "-cc1" "-triple" "x86_64-pc-linux-gnu" "-aux-triple" "nvptx64-nvidia-cuda" "-fsyntax-only" "-disable-free" "-disable-llvm-verifier" "-discard-value-names" "-main-file-name" "UnaryOpsKernel.cu" "-mrelocation-model" "pic" "-pic-level" "2" "-mthread-model" "posix" "-fno-trapping-math" "-masm-verbose" "-mconstructor-aliases" "-munwind-tables" "-fuse-init-array" "-target-cpu" "x86-64" "-dwarf-column-info" "-debugger-tuning=gdb" "-momit-leaf-frame-pointer" "-resource-dir" "/usr/lib/llvm-8/bin/../lib/clang/8.0.1" "-internal-isystem" "/usr/lib/llvm-8/bin/../lib/clang/8.0.1/include/cuda_wrappers" "-internal-isystem" "/usr/local/cuda/include" "-include" "__clang_cuda_runtime_wrapper.h" "-isystem" "/home/runner/work/pytorch/pytorch/build/third_party/gloo" "-isystem" "/home/runner/work/pytorch/pytorch/cmake/../third_party/gloo" "-isystem"
```

My guess is that our clang-tidy build is improperly configured to handle CUDA code. Until that issue is resolved this stops running clang-tidy on ATen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40764

Differential Revision: D22310032

Pulled By: mruberry

fbshipit-source-id: 035067e1017f0097026cee9866bba424dd4668b4
2020-06-30 09:35:55 -07:00
3cc18d7139 .circleci: Remove executor from windows uploads (#40742)
Summary:
This wasn't needed and broke nightly builds

Fixes some issues introduced in https://github.com/pytorch/pytorch/pull/40592/files

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40742

Differential Revision: D22310055

Pulled By: seemethere

fbshipit-source-id: 095be3be06a730138d860ca6b73eaf22c24cf08f
2020-06-30 09:29:29 -07:00
a6a31bcd47 Enable out_dims for vmap frontend API (#40576)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40576

`out_dims` specifies where in the output tensors the vmapped dimension
should appear. We implement this by simply creating a view with the
batch dimension moved to the desired position.

`out_dims` must either:
- be an int (use the same value for all outputs)
- be a Tuple[int] (so the user specifies one out_dim per output).
(See the vmap docstring for what we advertise out_dims to do).
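
A sketch of the semantics using today's public `torch.vmap` (the prototype API in this PR may have differed slightly):
```
import torch

x = torch.randn(3, 5)
# Map over dim 0 of x and place the mapped dimension at
# position 1 of the output.
f = torch.vmap(lambda v: v * 2, in_dims=0, out_dims=1)
print(f(x).shape)   # torch.Size([5, 3])
```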

I also renamed `TestVmap` to `TestVmapAPI` to make it clearer that we
are testing the API here and not specific operators (which will go into
their own test class).

Test Plan: - `pytest test/test_vmap.py -v`

Differential Revision: D22288086

Pulled By: zou3519

fbshipit-source-id: c8666cb1a0e22c54473d8045477e14c2089167cf
2020-06-30 08:20:39 -07:00
2f94b7f95c Initial vmap docstring (#40575)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40575

This provides some more context for the next ~2 PRs that will implement
the `out_dims` and `in_dims` functionality. I will probably add more to
it later (things I think we should add: examples (maybe in a dedicated
docs page), specific examples of things vmap cannot handle).

Test Plan:
- Code reading for now. When we are ready to add vmap to master documentation,
I'll build the docs and fix any formatting problems.

Differential Revision: D22288085

Pulled By: zou3519

fbshipit-source-id: 6e28d7bd524242395160c20270159b4b121d6789
2020-06-30 08:18:20 -07:00
4a235b87be pop warning message for cuda module when asan is built in (#35088)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35088

Test Plan: Imported from OSS

Differential Revision: D20552708

Pulled By: glaringlee

fbshipit-source-id: 0b809712378596ccf83211bf8ae39cd71c27dbba
2020-06-30 08:00:37 -07:00
4104ab8b18 Add torch.count_nonzero (#39992)
Summary:
Reference https://github.com/pytorch/pytorch/issues/38349

TODO:

* [x] Add tests
* [x] Add docs (pending add to docs.rst)
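
A quick illustration of the new operator:

```python
import torch

x = torch.tensor([[0, 1, 2],
                  [0, 0, 3]])
torch.count_nonzero(x)         # tensor(3): total nonzero elements
torch.count_nonzero(x, dim=0)  # tensor([0, 1, 2]): per-column counts
```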
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39992

Reviewed By: ezyang

Differential Revision: D22236738

Pulled By: mruberry

fbshipit-source-id: 8520068b086b5ffc4de9e4939e746ff889293987
2020-06-30 06:39:13 -07:00
31de10a392 Int8FC dequantize fix (#40608)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40608

Changes to fix uint8_t to fp16 dequantization error.
Enabled test_int8_quantize

(Note: this ignores all push blocking failures!)

Test Plan: Verified with test_int8_ops_nnpi.py

Reviewed By: hyuen

Differential Revision: D22252860

fbshipit-source-id: bb44673327f0c8f44974cef2ab773aa0d89f4dc7
2020-06-30 06:20:09 -07:00
b9cca4b186 fix range of results for pairwise operations (#40728)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40728

there are two reasons the test is failing:
1) div by 0
2) result is bigger than fp16 max

for 1) make the divisor a safe number like 1e-3
for 2) when a combination of random numbers produces a result bigger than 65e3, clip it

multiplication is fine because the range of the random numbers is 0-100 -> the result is 0-10000

Test Plan: ran test_div test

Reviewed By: hl475

Differential Revision: D22295934

fbshipit-source-id: 173f3f2187137d6c1c4d4a505411a27f1c059f1a
2020-06-29 23:49:08 -07:00
a371652bc8 Allow to get string references to strings inside torch::List (#39763)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39763

This is an ask from fluent. For performance reasons, they need a way to get read access to the std::string inside of a torch::List<std::string> without having to copy that string.

Instead of special casing std::string, we decided to give access to the underlying value. The API now looks like:

```cpp
torch::List<std::string> list = ...;
const std::string& str = list[2].toIValueRef().toStringRef();
```
ghstack-source-id: 106806840

Test Plan: unit tests

Reviewed By: ezyang

Differential Revision: D21966183

fbshipit-source-id: 8b80b0244d10215c36b524d1d80844832cf8b69a
2020-06-29 20:52:32 -07:00
fabd60ec1a Add comment with UNBOXEDONLY explanation to codegen (#40117)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40117

ghstack-source-id: 106804731

Test Plan: just comments

Reviewed By: ezyang

Differential Revision: D22075103

fbshipit-source-id: 76677dc337196b71c50075f2845a1899451a705f
2020-06-29 20:50:45 -07:00
01e2099bb8 [TB] Add support for hparam domain_discrete (#40720)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40720

Add support for populating domain_discrete field in TensorBoard add_hparams API
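
A hedged usage sketch (the keyword name `hparam_domain_discrete` is assumed from the final API):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
writer.add_hparams(
    {"lr": 0.1, "optimizer": "sgd"},
    {"accuracy": 0.9},
    # Declares the discrete set of values each hparam can take, so the
    # TensorBoard hparams UI can render a proper dropdown filter.
    hparam_domain_discrete={"optimizer": ["sgd", "adam"]},
)
```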

Test Plan: Unit test test_hparams_domain_discrete

Reviewed By: edward-io

Differential Revision: D22291347

fbshipit-source-id: 78db9f62661c9fe36cd08d563db0e7021c01428d
2020-06-29 19:33:57 -07:00
53af9df557 Unify boxed function signature between jit and c10 (#37034)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37034

c10 takes a Stack* in boxed functions while JIT took Stack&.
c10 doesn't return anything while JIT returns an int which is always zero.

This changes JIT to follow the c10 behavior.
ghstack-source-id: 106834069

Test Plan: unit tests

Differential Revision: D20567950

fbshipit-source-id: 1a7aea291023afc52ae706957e9a5ca576fbb53b
2020-06-29 19:24:26 -07:00
320164f878 Fix zip serialization for file > 2GiB (#40722)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40722

Test Plan: Imported from OSS

Differential Revision: D22294016

Pulled By: jamesr66a

fbshipit-source-id: 0288882873d4b59bdef37d018c030519c4be7f03
2020-06-29 19:17:06 -07:00
9393ac011a [CUDA] addmm for complex (#40431)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40431

Test Plan: Imported from OSS

Differential Revision: D22285916

Pulled By: anjali411

fbshipit-source-id: 5863c713bdaa8e5b4f3d2b41fa59108502145a23
2020-06-29 17:41:46 -07:00
d7cd16858f Add documentation about storage sharing is preserved and serialized f… (#40412)
Summary:
…ile size.
fixes https://github.com/pytorch/pytorch/issues/40157
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40412

Reviewed By: ezyang

Differential Revision: D22265639

Pulled By: ailzhang

fbshipit-source-id: 16b0301f16038bd784e7e92f63253fedc7820adc
2020-06-29 17:23:29 -07:00
8f5b28674c [JIT] Remove dead store in quantization_patterns.h (#40724)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40724

Test Plan: Continuous integration.

Differential Revision: D22294600

Pulled By: SplitInfinity

fbshipit-source-id: 04546579273d8864d91c3c74a654aa75ba34ee45
2020-06-29 16:55:15 -07:00
0235676f8a [pytorch][ci] run mobile code analysis on PR (#40247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40247

This CI job was bypassed on PR because most part of it has already been
covered by mobile-custom-build-dynamic job that runs on every PR.

However, it can still fail independently because it builds and analyzes
a small test project, e.g.: if people forget to update the registration API
used in the test project.

So this PR changed it to only build and analyze the test project and run
the job on every PR.

Test Plan: Imported from OSS

Differential Revision: D22126044

Pulled By: ljk53

fbshipit-source-id: 6699a200208a65b249bd3a4e43ad72bc07388ce3
2020-06-29 16:44:45 -07:00
6e1cf000b3 [jit][oacr] Add some operators for Assistant NLU joint lite model (#40126)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40126

These are needed for benchmarking / running our model, following Step 7 in the [Lite interpreter wiki](https://www.internalfb.com/intern/wiki/PyTorch/PyTorchDev/Mobile/Lite_Interpreter/#make-your-model-work-wit) and [this thread](https://www.internalfb.com/intern/qa/56293/atenemptymemory_format-missing-on-fb4a).

Test Plan: Sandcastle

Reviewed By: iseeyuan

Differential Revision: D22073611

fbshipit-source-id: daa46a39c386806be8d5d589740663e85451757e
2020-06-29 16:41:04 -07:00
21de450fcb Fix batch size zero for QNNPACK linear_dynamic (#40588)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40588

Two bugs were preventing this from working. One was a divide by zero
when multithreading was enabled, fixed similarly to the fix for static
quantized linear in the previous commit. The other was the computation of
min and max to determine qparams. FBGEMM uses [0,0] for [min,max] of
empty input; we now do the same.
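
A minimal repro sketch of the now-working case (hedged):

```python
import torch

lin = torch.nn.quantized.dynamic.Linear(4, 8)
x = torch.zeros(0, 4)   # batch size zero
y = lin(x)              # previously hit divide-by-zero / bad qparams;
                        # now returns an empty result of shape (0, 8)
```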

Test Plan: Added a unit test.

Differential Revision: D22264415

Pulled By: dreiss

fbshipit-source-id: 6ca9cf48107dd998ef4834e5540279a8826bc754
2020-06-29 16:31:11 -07:00
14145f9775 Fix and reenable threaded QNNPACK linear (#40587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40587

Previously, this was causing divide-by-zero only in the multithreaded
empty-batch case, while calculating tiling parameters for the threads.
In my opinion, the bug here is using a value that is allowed to be zero
(batch size) for an argument that should not be zero (tile size), so I
fixed the bug by bailing out right before the call to
pthreadpool_compute_4d_tiled.

Test Plan: TestQuantizedOps.test_empty_batch

Differential Revision: D22264414

Pulled By: dreiss

fbshipit-source-id: 9446d5231ff65ef19003686f3989e62f04cf18c9
2020-06-29 16:29:29 -07:00
9ca4a46bf8 Implement parallel scatter reductions for CPU (#36447)
Summary:
This PR implements gh-33389.

As a result of this PR, users can now specify various reduction modes for scatter operations. Currently, `add`, `subtract`, `multiply` and `divide` have been implemented, and adding new ones is not hard.

While we now allow dynamic runtime selection of reduction modes, the performance is the same as was the case for the `scatter_add_` method in the master branch. Proof can be seen in the graph below, which compares `scatter_add_` in the master branch (blue) and `scatter_(reduce="add")` from this PR (orange).
![scatter-regression py csv](https://user-images.githubusercontent.com/2629909/82671491-e5e22380-9c79-11ea-95d6-6344760c8578.png)

The script used for benchmarking is as follows:
``` python
import os
import sys
import torch
import time
import numpy
from IPython import get_ipython

Ms=256
Ns=512
dim = 0
top_power = 2
ipython = get_ipython()

plot_name = os.path.basename(__file__)
branch = sys.argv[1]
fname = open(plot_name + ".csv", "a+")

for pM in range(top_power):
    M = Ms * (2 ** pM)
    for pN in range(top_power):
        N = Ns * (2 ** pN)
        input_one = torch.rand(M, N)
        index = torch.tensor(numpy.random.randint(0, M, (M, N)))
        res = torch.randn(M, N)

        test_case = f"{M}x{N}"
        print(test_case)
        tobj = ipython.magic("timeit -o res.scatter_(dim, index, input_one, reduce=\"add\")")

        fname.write(f"{test_case},{branch},{tobj.average},{tobj.stdev}\n")

fname.close()
```

Additionally, one can see that various reduction modes take almost the same time to execute:
```
op: add
70.6 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26.1 µs ± 26.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
op: subtract
71 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26.4 µs ± 34.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
op: multiply
70.9 µs ± 31.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
27.4 µs ± 29.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
op: divide
164 µs ± 48.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
52.3 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
Script:
``` python
import torch
import time
import numpy
from IPython import get_ipython

ipython = get_ipython()

nrows = 3000
ncols = 10000
dims = [nrows, ncols]

res = torch.randint(5, 10, dims)
idx1 = torch.randint(dims[0], (1, dims[1])).long()
src1 = torch.randint(5, 10, (1, dims[1]))
idx2 = torch.randint(dims[1], (dims[0], 1)).long()
src2 = torch.randint(5, 10, (dims[0], 1))

for op in ["add", "subtract", "multiply", "divide"]:
    print(f"op: {op}")
    ipython.magic("timeit res.scatter_(0, idx1, src1, reduce=op)")
    ipython.magic("timeit res.scatter_(1, idx2, src2, reduce=op)")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36447

Differential Revision: D22272631

Pulled By: ngimel

fbshipit-source-id: 3cdb46510f9bb0e135a5c03d6d4aa5de9402ee90
2020-06-29 15:52:11 -07:00
11a74a58c8 Setter for real and imag tensor attributes (#39860)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39860
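
For context, a brief illustration of the new setters (hedged sketch):

```python
import torch

z = torch.tensor([1 + 2j, 3 + 4j])
z.real = torch.tensor([5., 6.])   # assigns into the real parts in place
z.imag = torch.tensor([0., 0.])   # assigns into the imaginary parts
print(z)                          # tensor([5.+0.j, 6.+0.j])
```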

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D22163234

Pulled By: anjali411

fbshipit-source-id: 35b4aa16499341edff1a4be4076539ac7c74f5be
2020-06-29 15:44:55 -07:00
fd90e4b309 [CircleCI] Add RocM build/test jobs (#39760)
Summary:
Set PYTORCH_ROCM_ARCH to `gfx900;gfx906` if the `CIRCLECI` environment variable is defined.
Add RocM build and test jobs and schedule them on the `xlarge` and `amd-gpu` resource classes respectively.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39760

Differential Revision: D22290335

Pulled By: malfet

fbshipit-source-id: 7462f97b262abcacac3e515086ac6236a45626d2
2020-06-29 14:15:44 -07:00
63e5a53b8c DNNL: fix build error when DNNL using TBB threading pool (#40699)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40699

Differential Revision: D22286334

Pulled By: albanD

fbshipit-source-id: 0635a0a5e4bf80d44d90c86945d92e98e26ef480
2020-06-29 13:53:18 -07:00
ed83b9a4be Change function parameter self to input in torch.__init__.pyi (#40235)
Summary:
Fix https://github.com/pytorch/pytorch/issues/40223: Incorrect "self" keyword arguments in `torch.__init__.pyi` type hints
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40235

Differential Revision: D22285816

Pulled By: ezyang

fbshipit-source-id: ebc35290c0c625916289f1a46abc6ff2197f4bcf
2020-06-29 13:49:13 -07:00
d2e16dd888 Remove constexpr for NVCC on Windows (#40675)
Summary:
They are not well supported. Fixes https://github.com/pytorch/pytorch/issues/40393 and https://github.com/pytorch/pytorch/issues/39394.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40675

Differential Revision: D22286031

Pulled By: ezyang

fbshipit-source-id: 7e309916ae21cd3909ee6466952ba89847c74d71
2020-06-29 10:58:42 -07:00
4a174c83ca Add option to preserve certain methods during optimize_for_mobile. (#40629)
Summary:
By default, the freeze_module pass, invoked from optimize_for_mobile,
preserves only the forward method. There is an option to specify a list of
methods that can be preserved during freeze_module. This PR exposes that
option to the optimize_for_mobile pass.
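A minimal sketch of the new option (hedged; the kwarg name `preserved_methods` is assumed from this PR):

```python
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

    @torch.jit.export
    def preprocess(self, x):
        return x * 2

scripted = torch.jit.script(M())
# Without the option, freezing keeps only `forward`; listing the method
# keeps `preprocess` callable on the optimized module as well.
optimized = optimize_for_mobile(scripted, preserved_methods=["preprocess"])
```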
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40629

Test Plan: python test/test_mobile_optimizer.py

Reviewed By: dreiss

Differential Revision: D22260972

Pulled By: kimishpatel

fbshipit-source-id: 452c653269da8bb865acfb58da2d28c23c66e326
2020-06-29 09:32:53 -07:00
4121d34036 Python/C++ API Parity: Add impl and tests for ParameterDict (#40654)
Summary:
This diff contains the implementation of the C++ API for ParameterDict from https://github.com/pytorch/pytorch/issues/25883; refer to https://github.com/pytorch/pytorch/issues/36904 and https://github.com/pytorch/pytorch/issues/28652
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40654

Test Plan: Add unit test in this diff

Differential Revision: D22273265

Pulled By: glaringlee

fbshipit-source-id: 9134a92c95eacdd53d5b24470d5f7edbeb40a488
2020-06-29 08:50:44 -07:00
b35cdc5200 [Fix] torch_common target shared by lite-interpreter and full-jit" and turn on query-based selective build (#40673)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40673

As title. We planned to have lite-interpreter and full-jit co-exist in the short term. To avoid duplicated symbols and operator registrations during dynamic lib loading, we put the common files in a separate component.

The original source file list names are reserved.
ghstack-source-id: 106757184

Test Plan: CI

Reviewed By: kwanmacher

Differential Revision: D22276185

fbshipit-source-id: 328a8ba9c3d88437da0d30c6e6791087d0df5e2e
2020-06-28 16:38:52 -07:00
b4db529352 Fix wrong link in docs/source/notes/ddp.rst (#40484)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40484

Differential Revision: D22259834

Pulled By: mrshenli

fbshipit-source-id: 4ec912c600c81010bdb2778c35cbb0321480199f
2020-06-28 13:55:56 -07:00
502ec8f7f7 Revert D22227939: [TB] Add support for hparam domain_discrete
Test Plan: revert-hammer

Differential Revision:
D22227939 (4c25428c8c)

Original commit changeset: d2f0cd8e5632

fbshipit-source-id: c4329fcead69cb0f3d368a254d8756fb04be742d
2020-06-27 22:20:31 -07:00
5377827b3e Revert D22275201: [Fix] torch_common target shared by lite-interpreter and full-jit
Test Plan: revert-hammer

Differential Revision:
D22275201 (1399655a98)

Original commit changeset: dafd3ad36bb3

fbshipit-source-id: a89c8b1fbb55eb7c116dd6ca9dad04bb90727c0a
2020-06-27 22:00:19 -07:00
521722751f Add examples and tests for combining static/class method with async execution (#40619)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40619

Test Plan: Imported from OSS

Differential Revision: D22258407

Pulled By: mrshenli

fbshipit-source-id: 036d85a2affc4505efd2df197fc513dba010e359
2020-06-27 20:42:23 -07:00
1399655a98 [Fix] torch_common target shared by lite-interpreter and full-jit
Summary:
Pull the shared source files to "torch_common" to avoid duplicated symbols and operator registrations.

(Note: this ignores all push blocking failures!)

Test Plan:
CI
buck install -c fbandroid.force_native_library_merge_map=true -c pt.build_from_deps_query=1 -c pt.selective_build=0 -c pt.static_dispatch=0 -r fb4a

Reviewed By: kwanmacher

Differential Revision: D22275201

fbshipit-source-id: dafd3ad36bb33e3ec33f4accfdc5af1d5f8ab775
2020-06-27 17:48:32 -07:00
21991b63f5 Migrate dot from the TH to Aten (CPU) (#40354)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24692
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40354

Reviewed By: ezyang

Differential Revision: D22214203

Pulled By: ngimel

fbshipit-source-id: 500e60d1c02b3b39db19b518f2af43cd69f2e984
2020-06-27 17:11:10 -07:00
4c25428c8c [TB] Add support for hparam domain_discrete
Summary: Add support for populating domain_discrete field in TensorBoard add_hparams API

Test Plan: Unit test test_hparams_domain_discrete

Reviewed By: edward-io

Differential Revision: D22227939

fbshipit-source-id: d2f0cd8e5632cbcc578466ff3cd587ee74f847af
2020-06-27 14:07:24 -07:00
2456e078d3 [TB] Support custom run_name in add_hparams (#40660)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40660

Support custom run_name since using timestamp as run_name can be confusing to people

Test Plan:
hp = {"lr": 0.1, "bool_var": True, "string_var": "hi"}
  mt = {"accuracy": 0.1}
  writer.add_hparams(hp, mt, run_name="run1")
  writer.flush()

Reviewed By: edward-io

Differential Revision: D22157749

fbshipit-source-id: 3d4974381e3be3298f3e4c40e3d4bf20e49dfb07
2020-06-27 14:05:20 -07:00
15be823455 caffe2 | Revert range loop analysis fix
Summary: This reverts a change that was made to fix a range-loop-analysis warning.

Test Plan: CI

Reviewed By: nlutsenko

Differential Revision: D22274461

fbshipit-source-id: dedc3fcaa6e32259460380163758d6c9c9b73211
2020-06-27 13:02:23 -07:00
68042c7466 Skip mypy on pynightly if numpy-1.20.0-dev0... is used (#40656)
Summary:
Also modernize the test script itself by using `mypy.api.run` rather than `subprocess.call`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40656

Differential Revision: D22274421

Pulled By: malfet

fbshipit-source-id: 59232d4d37ee01cda56375b84ac1476d16686bfe
2020-06-27 09:08:50 -07:00
ac8c8b028d [ROCm] restore jit tests (#40447)
Summary:
Remove `skipIfRocm` from most jit tests and enable `RUN_CUDA_HALF` tests for ROCm.

These changes passed more than three rounds of CI testing against the ROCm CI.

CC ezyang xw285cornell sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40447

Differential Revision: D22190711

Pulled By: xw285cornell

fbshipit-source-id: bac44825a2675d247b3abe2ec2f80420a95348a3
2020-06-27 01:03:59 -07:00
411bc2b8d5 [quant][graphmode][fix] remove unsupported ops in the list (#40653)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40653

(Note: this ignores all push blocking failures!)

Test Plan: Imported from OSS

Differential Revision: D22271413

fbshipit-source-id: a01611b5d90849ac673fa5a310f910c858e907a3
2020-06-27 00:07:57 -07:00
61a8de77cf [quant] aten::repeat work for quantized tensor (#40644)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40644

Test Plan: Imported from OSS

Differential Revision: D22268558

fbshipit-source-id: 3bc9a129bece1b547c519772ecc6b980780fb904
2020-06-26 22:54:19 -07:00
0309f6a4bb [quant][graphmode][fix] cloning schema in insert_observers (#40624)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40624

Previously we didn't clone the schema, so the default schema was used; this was
causing issues for some models

Test Plan: Imported from OSS

Differential Revision: D22259519

fbshipit-source-id: e2a393a54cb18f55da0c7152a74ddc22079ac350
2020-06-26 20:19:09 -07:00
0a19534dd2 [JIT] Remove dead store in quantization_patterns.h (#40623)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40623

Test Plan: Continuous integration.

Reviewed By: jerryzh168

Differential Revision: D22259209

fbshipit-source-id: 90c9e79e039100f2961195504bb81230bba5c5fe
2020-06-26 19:43:43 -07:00
e368b11226 [JIT] Remove dead stores in loopnest.cpp (#40626)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40626

Test Plan: Continuous integration.

Reviewed By: ZolotukhinM

Differential Revision: D22259586

fbshipit-source-id: 447accb5b94392f0b5e4c27956a34403bb0d1ea8
2020-06-26 19:28:03 -07:00
15864d1703 Skip allreducing local_used_maps_dev_ when find_unused_param=False
Summary:
1. In reducer.cpp, we have a new boolean `find_unused_param_`, and its value is set in `Reducer::prepare_for_backward`.
If `!find_unused_param_`, then it avoids `allreduce(local_used_maps_dev_)` (see the sketch after this list).
2. Solves issue [38942](https://github.com/pytorch/pytorch/issues/38942).
3. Fixes incorrect `find_unused_parameters_` passing like checking `outputs.empty()` or `unused_parameters_.empty()`.
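
The user-visible switch is the DDP constructor flag; a minimal sketch, assuming an already-initialized process group and a module `net` on this process's GPU:

```python
import torch

# With find_unused_parameters=False (the default), the reducer now skips
# the extra allreduce of local_used_maps_dev_ during the backward pass.
ddp_model = torch.nn.parallel.DistributedDataParallel(
    net,                          # assumed: an already-constructed nn.Module
    device_ids=[local_rank],      # assumed: this process's GPU index
    find_unused_parameters=False,
)
```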

ghstack-source-id: 106693089

Test Plan:
1. Run `test/distributed/test_c10d.py` and make sure all tests pass.
2. A new test case `test_find_unused_parameters_when_unused_parameters_empty` is included. Old `reducer.cpp` was failing in that unit test because it was checking `find_unused_parameters_` by `unused_parameters_.empty()`. Current `reducer.cpp` passes this unit test.
3. Two test cases were failing `test_forward_backward_unused_parameters` and `test_forward_backward_optimizer` , because `find_unused_parameter_` of their `reducer` object was not set properly. I fixed that as well.

Imported from OSS

**Output of version 14:**
```
................s.....s...............................................test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  tensor = torch.full([100, 100], self.rank)
test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  tensor = torch.full([100, 100], self.rank)
test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  tensor = torch.full([100, 100], self.rank)
test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  tensor = torch.full([100, 100], self.rank)
.test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  self.assertEqual(torch.full([10, 10], self.world_size), tensor)
test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  self.assertEqual(torch.full([10, 10], self.world_size), tensor)
test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  self.assertEqual(torch.full([10, 10], self.world_size), tensor)
test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  self.assertEqual(torch.full([10, 10], self.world_size), tensor)
.....s...............................
----------------------------------------------------------------------
Ran 108 tests in 214.210s

OK (skipped=3)
```

Differential Revision: D22176231

fbshipit-source-id: b5d15f034e13a0915a474737779cc5aa8e068836
2020-06-26 19:20:59 -07:00
4102fbdf08 [1/n] Allow dense NaN value in dper raw input processor output
Summary:
## TLDR
Support using a NaN default value for missing dense features in RawInputProcessor for *DPER2*, in preparation for subsequent support for null-flag features in *compute meta*. For train_eval this is already supported in DPER3, and we do not plan to support it in DPER2 train_eval.
## Overview
Intern project plan to support adding dense flags for missing feature values instead of replacing with zero.

Project plan :
https://docs.google.com/document/d/1OsPUTjpJycwxWLCue3Tnb1mx0uDC_2KKWvC1Rwpo2NI/edit?usp=sharing

## Code paths:
See https://fb.quip.com/eFXUA0tbDmNw for the call stack for all affected code paths.

Test Plan:
# A. DPER3 blob value inspection
## 1. Build local bento kernel in fbcode folder
`buck build mode/dev-nosan //bento/kernels:bento_kernel_ads_ranking`

## 2. Use kernel `ads_ranking (local)` to print dense feature blob values
n280239

## 2.1 Try `default_dense_value = "0.0"` (default)
```
preproc_6/feature_preproc_6/dper_feature_processor_7/raw_input_proc_7/float_feature_sparse_to_dense_7/float_features [[0.       ]
 [0.       ]
 [0.       ]
 [0.       ]
 [0.       ]
 [0.       ]
 [0.       ]
 [1.       ]
 [1.7857143]
 [1.7777778]
 [1.       ]
 [0.       ]
 [0.5625   ]
 [0.       ]
 [0.       ]
 [0.8      ]
 [0.       ]
 [1.       ]
 [0.56     ]
 [0.       ]]
```
## 2.2 Try `default_dense_value = "123"`
```
preproc_2/feature_preproc_2/dper_feature_processor_3/raw_input_proc_3/float_feature_sparse_to_dense_3/float_features [[123.       ]
 [123.       ]
 [123.       ]
 [123.       ]
 [123.       ]
 [123.       ]
 [123.       ]
 [  1.       ]
 [  1.7857143]
 [  1.7777778]
 [  1.       ]
 [123.       ]
 [  0.5625   ]
 [123.       ]
 [123.       ]
 [  0.8      ]
 [123.       ]
 [  1.       ]
 [  0.56     ]
 [123.       ]]
```
## 2.3 Try `default_dense_value = float("nan")`
```
RuntimeError: [enforce fail at enforce_finite_op.h:40] std::isfinite(input_data[i]). Index 0 is not finite (e.g., NaN, Inf): -nan (Error from operator:
input: "unary_4/logistic_regression_loss_4/average_loss_4/average_loss" name: "" type: "EnforceFinite" device_option { random_seed: 54 })
```
which is expected due to nan input.

# B. Unit test
`buck test  fblearner/flow/projects/dper/tests/preprocs:raw_feature_extractor_test`

https://www.internalfb.com/intern/testinfra/testconsole/testrun/5348024586274923/

{F241336814}

Differential Revision: D21961595

fbshipit-source-id: 3dcb153b3c7f42f391584f5e7f52f3d9c76de31f
2020-06-26 16:54:14 -07:00
897e610c82 FP16 rounding-to-nearest for row-wise SparseAdagrad fusion (#40466)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40466

Extend row wise sparse Adagrad fusion op to FP16 (rounding-to-nearest) for PyTorch.

Reviewed By: jianyuh

Differential Revision: D22003571

fbshipit-source-id: e97e01745679a9f6e7b0f81ce5a6ebf4d4a1df41
2020-06-26 16:14:59 -07:00
47c72be3d7 Port /test/cpp_extensions/rng_extension.cpp to new operator registration API (#39459)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39459

Update to this PR: this code isn't going to fully solve https://github.com/pytorch/pytorch/issues/37010. The changes required for 37010 are more than this PR initially planned. Instead, this PR switches the op registration of rng-related tests to use the new API (similar to what was done in #36925)

Test Plan:
1) unit tests

Imported from OSS

Reviewed By: ezyang

Differential Revision: D22264889

fbshipit-source-id: 82488ac6e3b762a756818434e22c2a0f9cb9dd47
2020-06-26 16:12:54 -07:00
24a8614cac [Reland][doc] Add overflow notice for cuFFT on half precision (#40551)
Summary:
Reland of https://github.com/pytorch/pytorch/issues/35594
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40551

Reviewed By: ezyang

Differential Revision: D22249831

Pulled By: ngimel

fbshipit-source-id: b221b3c0a490ccaaabba50aa698a2490536e0917
2020-06-26 15:40:19 -07:00
6debc28964 Ignore error code from apt-get purge (#40631)
Summary:
This replicates the pattern of other "do for luck" commands.
Prep change to add RocM to CircleCI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40631

Differential Revision: D22261707

Pulled By: malfet

fbshipit-source-id: 3dadfa434deab866a8800715f3197e84169cf43e
2020-06-26 13:34:07 -07:00
375cd852fa Add a utility function for bundling large input tensors (#37055)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37055

Sometimes it's okay to bundle a large example input tensor with a model.
Add a utility function to make it easy for users to do that *on purpose*.
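
A hedged sketch of how this might be used (`bundle_large_tensor` is the helper name assumed here; the bundling machinery lives in `torch.utils.bundled_inputs`):

```python
import torch
import torch.utils.bundled_inputs as bundled_inputs

model = torch.jit.script(torch.nn.ReLU())
big = torch.randn(1, 3, 224, 224)

# Wrapping the tensor marks the large input as intentional, so the
# bundling machinery does not reject it for being too big to inline.
bundled_inputs.augment_model_with_bundled_inputs(
    model,
    [(bundled_inputs.bundle_large_tensor(big),)],
)
```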

Test Plan: Unit test.

Differential Revision: D22264239

Pulled By: dreiss

fbshipit-source-id: 05c6422be1aa926cca850f994ff1ae83c0399119
2020-06-26 13:34:02 -07:00
41ea7f2d86 Add channels-last support to bundled_inputs (#36764)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36764

This allows bundling inputs that are large uniform buffers in
channels-last memory format.

Test Plan: Unit test.

Differential Revision: D21142660

Pulled By: dreiss

fbshipit-source-id: 31bbea6586d07c1fd0bcad4cb36ed2b8bb88a7e4
2020-06-26 13:31:17 -07:00
edac323378 Add special rules to launch docker image with RocM (#40632)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40632

Differential Revision: D22262316

Pulled By: malfet

fbshipit-source-id: 3d525767bfbfc8e2497541849d85cabf0379a43b
2020-06-26 13:28:36 -07:00
0494e0ad70 Back out "Revert D21581908: Move TensorOptions ops to c10" (#40595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40595

ghstack-source-id: 106691774

Test Plan: waitforsandcastle

Differential Revision: D22247729

fbshipit-source-id: 14745588cae267c1e0cc51cd9541a9b8abb830e5
2020-06-26 12:57:09 -07:00
b8f4f6868d [JIT] Remove dead store in exit_transforms.cpp (#40611)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40611

This commit removes a dead store in `transformWith` of exit_transforms.cpp.

Test Plan: Continuous integration.

Reviewed By: suo

Differential Revision: D22254136

fbshipit-source-id: f68c4625f7be8ae29b3500303211b2299ce5d6f6
2020-06-26 12:35:58 -07:00
a62f8805e7 Update TensorPipe submodule (#40614)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40614

This update pulls in a oneliner fix, which sets the TCP_NODELAY option on the TCP sockets of the UV transport. This leads to exceptional performance gains in terms of latency, with about a 25x improvement in one simple benchmark. This thus resolves a regression that TensorPipe had compared to the ProcessGroup agent and, in fact, ends up beating it by 2x.

The benchmark I ran is this, with the two endpoints pinned to different cores of the same machine:
```
@torch.jit.script
def remote_fn(t: int):
    return t

@torch.jit.script
def local_fn():
    for _ in range(1_000_000):
        fut = rpc.rpc_async("rhs", remote_fn, (42,))
        fut.wait()
```

And the average round-trip time (one iteration) is:
- TensorPipe with SHM: 97.2 us
- TensorPipe with UV _after the fix_: 205us
- Gloo: 440us
- TensorPipe with UV _before the fix_: 5ms

Test Plan: Ran PyTorch RPC test suite

Differential Revision: D22255393

fbshipit-source-id: 3f6825d03317d10313704c05a9280b3043920507
2020-06-26 11:45:51 -07:00
5036c94a6e properly skip legacy tests regardless of the default executor (#40381)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40381

Differential Revision: D22173938

Pulled By: Krovatkin

fbshipit-source-id: 305fc4484977e828cc4cee6e053a1e1ab9f0d6c7
2020-06-26 11:13:50 -07:00
7676682584 Fix illegal opcode bug in caffe2 (#40584)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40584

Also patch [this github issue](https://github.com/pytorch/pytorch/issues/33124)
involving an illegal assembly instruction in 8x8-dq-aarch64-neon.S.

Test Plan:
Build binaries, copy to shaker, run executables. Also run all
existing caffe tests.

Reviewed By: kimishpatel

Differential Revision: D22240670

fbshipit-source-id: 51960266ce58699fe6830bcf75632b92a122f638
2020-06-26 11:11:54 -07:00
fb5d784fb4 Further reduce windows build/test matrix (#40592)
Summary:
Switch windows CPU testers from `windows.xlarge` to `windows.medium` class.
Remove VS 14.16 CUDA build
Only do smoke force-on-cpu tests using VS2019+CUDA10.1 config.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40592

Differential Revision: D22259351

Pulled By: malfet

fbshipit-source-id: f934ff774dfc7d47f12c3da836ca314c12d92208
2020-06-26 10:18:46 -07:00
10822116c5 build docker image for CUDA11 (#40534)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40534

Differential Revision: D22258874

Pulled By: seemethere

fbshipit-source-id: 1954a22ed52e1a65caf89725ab1db9f40ff917b8
2020-06-26 10:07:53 -07:00
fc8bca094c skip_if_rocm test_rnn in test_c10d_spawn.py (#40577)
Summary:
Test was added a few months back in https://github.com/pytorch/pytorch/issues/36503 but recently became flaky for ROCm.

CC ezyang xw285cornell sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40577

Differential Revision: D22258196

Pulled By: ezyang

fbshipit-source-id: 8a22b0c17b536b3d42d0382f7737df0f8823ba08
2020-06-26 09:45:45 -07:00
67c79bb045 update schema to reflect aliasing behavior (#39794)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/38555

I did an audit of `native_functions.yaml` and found several functions in addition to `reshape` which were not reporting that they could alias:

```
@torch.jit.script
def foo(t: torch.Tensor):
    new_value = torch.tensor(1, dtype=t.dtype, device=t.device)

    t.flatten()[0] = new_value
    t.reshape(-1)[1] = new_value
    t.view_as(t)[2] = new_value
    t.expand_as(t)[3] = new_value
    t.reshape_as(t)[4] = new_value
    t.contiguous()[5] = new_value
    t.detach()[6] = new_value

    return t
```

Currently none of the values are assigned after dead code elimination, after this PR all are. (And the JIT output matches that of eager.)

I don't think this needs to be unit tested; presumably the generic machinery already is and this just brings these ops under the same umbrella.

**BC-breaking note**: This updates the native operator schema and the aliasing rules for autograd. JIT passes will no longer incorrectly optimize mutations on graphs containing these ops, and inplace ops on the result of `flatten` will now properly be tracked in Autograd and the proper backward graph will be created.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39794

Differential Revision: D22008358

Pulled By: robieta

fbshipit-source-id: 9d3ff536e58543211e08254a75c6110f2a3b4992
2020-06-26 09:25:27 -07:00
a0ba7fb43e Precompute entries in dispatch tables (#40512)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40512

Fixes https://github.com/pytorch/pytorch/issues/32454

The heart of this diff is changing this:

```
inline const KernelFunction& Dispatcher::dispatch_(const DispatchTable& dispatchTable, DispatchKey dispatchKey) c
nst {
  const KernelFunction* backendKernel = dispatchTable.lookup(dispatchKey);

  if (nullptr != backendKernel) {
    return *backendKernel;
  }

  const auto& backendFallbackKernel = backendFallbackKernels_[dispatchKey];
  if (backendFallbackKernel.isValid()) {
    return backendFallbackKernel;
  }

  const KernelFunction* catchallKernel = dispatchTable.lookupCatchallKernel();
  if (C10_LIKELY(nullptr != catchallKernel)) {
    return *catchallKernel;
  }

  reportError(dispatchTable, dispatchKey);
}
```

to this:

```
const KernelFunction& OperatorEntry::lookup(DispatchKey k) const {
  const auto& kernel = dispatchTable_[static_cast<uint8_t>(k)];
  if (C10_UNLIKELY(!kernel.isValid())) {
    reportError(k);
  }
  return kernel;
}
```

The difference is that instead of checking a bunch of places to find the
right kernel to use for an operator, all of the operators are
precomputed into dispatchTable_ itself (so you don't have to consult
anything else at runtime.)  OperatorEntry::computeDispatchTableEntry
contains that computation (which is exactly the same as it was before.)
By doing this, we are able to substantially simplify many runtime
components of dispatch.

The diff is fairly large, as there are also some refactors interspersed
with the substantive change:

- I deleted the DispatchTable abstraction, folding it directly into
  OperatorEntry.  It might make sense to have some sort of DispatchTable
  abstraction (if only to let you do operator[] on DispatchKey without
  having to cast it to integers first), but I killed DispatchTable to
  avoid having to design a new abstraction; the old abstraction wasn't
  appropriate for the new algorithm.

- I renamed OperatorEntry::KernelEntry to AnnotatedKernel, and use it
  to store backend fallbacks as well as regular kernel registrations
  (this improves error messages when you incorrectly register a backend
  fallback twice).

- I moved schema_ and debug_ into an AnnotatedSchema type, to make the
  invariant clearer that these are set together, or not at all.

- I moved catch-all kernels out of kernels_ into its own property
  (undoing a refactor I did before).  The main reason I did this was
  because our intended future state is to not have a single catch-all,
  but rather possibly multiple catch-alls which fill-in different
  portions of the dispatch table.  This may change some more in
  the future: if we allow registrations for multiple types of
  catch alls, we will need a NEW data type (representing bundles
  of dispatch keys) which can represent this case, or perhaps
  overload DispatchKey to also record these types.

The key changes for precomputation:

- OperatorEntry::updateDispatchTable_ is now updated to fill in the
  entry at a DispatchKey, considering both kernels (what it did
  before) as well as catch-all and backend fallback.  There is also
  OperatorEntry::updateDispatchTableFull_ which will update the
  entire dispatch table (which is necessary when someone sets a
  catch-all kernel).  OperatorEntry::computeDispatchTableEntry
  holds the canonical algorithm specifying how we decide what
  function will handle a dispatch key for the operator.

- Because dispatch table entry computation requires knowledge of
  what backend fallbacks are (which is recorded in Dispatcher,
  not OperatorEntry), several functions on OperatorEntry now
  take Dispatcher as an argument so they can query this information.

- I modified the manual boxing wrapper invariant: previously, kernels
  stored in kernels_ did NOT have manual boxing wrappers and this
  was maintained by DispatchTable.  Now, we just ALWAYS maintain
  manual boxing wrappers for all KernelFunctions we store.

- DispatchKeyExtractor is greatly simplified: we only need to maintain
  a single per-operator bitmask of what entries are fallthrough
  (we don't need the global bitmask anymore).

- Introduced a new debugging 'dumpComputedTable' method, which prints
  out the computed dispatch table, and how we computed it to be some way.
  This was helpful for debugging cases when the dispatch table and
  the canonical metadata were not in sync.

Things that I didn't do but would be worth doing at some point:

- I really wanted to get rid of the C10_UNLIKELY branch for
  whether or not the KernelFunction is valid, but it looks like
  I cannot easily do this while maintaining good error messages.
  In principle, I could always populate a KernelFunction which
  errors, but the KernelFunction needs to know what the dispatch
  key that is missing is (this is not passed in from the
  calling convention).  Actually, it might be possible to do
  something with functors, but I didn't do it here.

- If we are going to get serious about catchalls for subsets of
  operators, we will need to design a new API for them.  This diff
  is agnostic to this question; we don't change public API at all.

- Precomputation opens up the possibility of subsuming DispatchStub
  by querying CPU capability when filling in the dispatch table.
  This is not implemented yet. (There is also a mild blocker here,
  which is that DispatchStub is also used to share TensorIterator
  configuration, and this cannot be directly supported by the
  regular Dispatcher.)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22236352

Pulled By: ezyang

fbshipit-source-id: d6d90f267078451816b1899afc3f79737b4e128c
2020-06-26 09:03:39 -07:00
a4cabd1a3c Generalize Python dispatcher testing API; disallow overwriting fallback (#40469)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40469

- The old testing interface C._dispatch_import was based off the old
  c10::import variation, which meant the API lined up in a strange
  way with the actual torch/library.h.  This diff reduces the
  differences by letting you program the Library constructor directly.

- Using this newfound flexibility, we add a test for backend fallbacks
  from Python; specifically testing that we disallow registering a
  backend fallback twice.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22236351

Pulled By: ezyang

fbshipit-source-id: f8365e3033e9410c7e6eaf9f78aa32e1f7d55833
2020-06-26 09:01:28 -07:00
44bf822084 Add C++ standard version check to top level headers (#40510)
Summary:
Remove `-std=c++14` flag from `utils.cmake`, since PyTorch C++ API can be invoked by any compiler compliant with C++14 standard or later
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40510

Differential Revision: D22253313

Pulled By: malfet

fbshipit-source-id: ff731525868b251c27928fc98b0724080ead9be2
2020-06-26 08:44:04 -07:00
dfc7e71d13 [Selective Build] Apply query-based on instrumentation_tests
Summary:
1. Modularize some bzl files to break circular buck load
2. Use query-based on instrumentation_tests

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: kwanmacher

Differential Revision: D22188728

fbshipit-source-id: affbabd333c51c8b1549af6602c6bb79fabb7236
2020-06-26 08:05:53 -07:00
f1406c43fc [papaya][aten] Fix compiler error: loop variable 'tensor' is always a copy because the range of type 'c10::List<at::Tensor>' does not return a reference. (#40599)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40599

.

Test Plan: CI

Reviewed By: smessmer

Differential Revision: D22246106

fbshipit-source-id: a5d0535e627b9f493fca7234dcfc15c521b0ed7f
2020-06-26 02:43:25 -07:00
eebd492dcf [doc] fix autograd doc subsubsection display issue (#40582)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40582

There's a misuse of "~~~~" under `requires_grad`: "~~~~" is not an official section marker, so change it to "^^^^" to denote subsubsections. Also fix the other places where we should use the subsection marker "-----" instead of the subsubsection marker "^^^^".

see https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#sections

Before:
<img width="712" alt="rst_before" src="https://user-images.githubusercontent.com/9443650/85789835-2226fa80-b6e4-11ea-97b6-2b19fdf324a4.png">
After:
<img width="922" alt="rst_after" src="https://user-images.githubusercontent.com/9443650/85789856-281cdb80-b6e4-11ea-925f-cb3f4ebaa2bf.png">

Test Plan: Imported from OSS

Differential Revision: D22245747

Pulled By: wanchaol

fbshipit-source-id: 11548ed42f627706863bb74d4269827d1b3450d4
2020-06-25 23:28:33 -07:00
3ab60ff696 Remove cpu vec256 for std::complex (#39830)
Summary:
std::complex is gone. We are now using c10::complex
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39830

Differential Revision: D22252066

Pulled By: malfet

fbshipit-source-id: cdd5bb03ec66825d82177d609cbcf0738922dba0
2020-06-25 23:25:58 -07:00
fab412a8f3 Bump nightlies to 1.7.0 (#40519)
Summary:
edit: apparently we hardcode a lot more versions than I would've anticipated.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40519

Differential Revision: D22221280

Pulled By: seemethere

fbshipit-source-id: ba15a910a6755ec08c10f7783ed72b1e06e6b570
2020-06-25 22:36:33 -07:00
e3a97688cc [quant][graphmode][fix] dequantize propagation for {add/mul}_scalar (#40596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40596

Previously the fusion patterns for {add/mul}_scalar were inconsistent, since the op pattern
produces a non-quantized tensor while the op replacement graph produces a quantized tensor

Test Plan: Imported from OSS

Differential Revision: D22251072

fbshipit-source-id: e16eb92cf6611578cca1ed8ebde961f8d0610137
2020-06-25 22:17:08 -07:00
547ea787ff [ONNX] Add eliminate_unused_items pass (#38812)
Summary:
This PR:

- Adds eliminate_unused_items pass that removes unused inputs and initializers.
- Fixes run_embed_params function so it doesn't export unnecessary parameters.
- Removes  test_modifying_params in test_verify since it's no longer needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38812

Reviewed By: ezyang

Differential Revision: D22236416

Pulled By: houseroad

fbshipit-source-id: 30e1a6e8823a7e36b51ae1823cc90476a53cd5bb
2020-06-25 22:00:26 -07:00
5466231187 Fixes lint (#40606)
Summary:
'= ' => '='
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40606

Differential Revision: D22252511

Pulled By: mruberry

fbshipit-source-id: 5f90233891be58a742371e4416166a267aee4669
2020-06-25 21:53:00 -07:00
ac79c874ce [PyTorch Operator] [2/n] Adding python test
Summary: Adding a python test file with image files, with the input image being p.jpg. Tests for the quality difference between the raw image and the decoded image

Test Plan:
Parsing buck files: finished in 1.5 sec
Building: finished in 6.4 sec (100%) 10241/10241 jobs, 2 updated
  Total time: 8.0 sec
More details at https://www.internalfb.com/intern/buck/build/387cb1c1-2902-4f90-ae9f-83fb6d473487
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 93e6ef88-ec68-41cb-9de7-7868a14e6d65
Trace available for this run at /tmp/tpx-20200623-055836.283269/trace.log
Started reporting to test run: https://our.intern.facebook.com/intern/testinfra/testrun/4222124679431330
    ✓ ListingSuccess: caffe2/test:test_bundled_images - main (18.865)
    ✓ Pass: caffe2/test:test_bundled_images - test_single_tensors (test_bundled_images.TestBundledInputs) (18.060)
    ✓ Pass: caffe2/test:test_bundled_images - main (18.060)
Summary
  Pass: 2
  ListingSuccess: 1
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/4222124679431330

Reviewed By: dreiss

Differential Revision: D22046611

fbshipit-source-id: fabc604269a5a4d8a37135ce776200da2794a252
2020-06-25 18:36:44 -07:00
c790476384 Back out "Revert D22072830: [wip] Upgrade msvc to 14.13" (#40594)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40594

Original commit changeset: 901de185e607
ghstack-source-id: 106642590

Test Plan: oss ci

Differential Revision: D22247269

fbshipit-source-id: be0c64d1a579f8aa3999cb84a9d20488095a81bd
2020-06-25 17:19:33 -07:00
b05c34259b relax size check in flatten_for_scatter_gather (#40573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40573

Per title, to work around the apex sbn bug.

Test Plan: Covered by existing tests

Reviewed By: blefaudeux

Differential Revision: D22236942

fbshipit-source-id: ddb164ee347a7d472a206087e4dbd16aa9d72387
2020-06-25 15:16:37 -07:00
e180ca652f Add __all__ to torch/_C/_VariableFunctions.pyi (#40499)
Summary:
Related to https://github.com/pytorch/pytorch/issues/40397

Inspired by ezyang's comment at https://github.com/pytorch/pytorch/issues/40397#issuecomment-648233001, this PR attempts to leverage using `__all__` to explicitly export private functions from `_VariableFunctions.pyi` in order to make `mypy` aware of them after:

```
if False:
    from torch._C._VariableFunctions import *
```

The generation of the `__all__` template variable excludes some items from `unsorted_function_hints`, as it seems that those without hints end up not being explicitly included in the `.pyi` file: I leaned on the side of caution and opted for having `__all__` consistent with the definitions inside the file. Additionally, added some pretty-printing to avoid having an extremely long line.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40499

Differential Revision: D22240716

Pulled By: ezyang

fbshipit-source-id: 77718752577a82b1e8715e666a8a2118a9d3a1cf
2020-06-25 14:10:07 -07:00
c6e0c67449 [PyTorch Error Logging][2/N] Adding Error Logging for Loading Model (#40537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40537

Adding error logging when loading a model; adding the event "MOBILE_MODULE_LOAD"
ghstack-source-id: 106615128

Test Plan: {F241028136}

Reviewed By: iseeyuan

Differential Revision: D22098818

fbshipit-source-id: 4de7df4432c7c6c297a9dc173e5cafa13fe2833c
2020-06-25 14:05:43 -07:00
e231405ef6 [jit] Fix type annotations in select assignments (#40528)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40528

Previously, an assignment like `self.foo : List[int] = []` would ignore
the type hint.
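
A small repro sketch (hedged):

```python
from typing import List
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # The annotation is now honored; previously TorchScript inferred
        # List[Tensor] for the empty list, so appending ints failed to compile.
        self.foo: List[int] = []

    def forward(self) -> List[int]:
        self.foo.append(1)
        return self.foo

m = torch.jit.script(M())
```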

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D22222927

Pulled By: suo

fbshipit-source-id: b0af19b87c6fbe0670d06b55f2002a783d00549d
2020-06-25 13:08:03 -07:00
dfbf0164c9 Revert D22103662: [NCCL] Explicitly Abort NCCL Communicators on Process Group Destruction
Test Plan: revert-hammer

Differential Revision:
D22103662 (527ab13436)

Original commit changeset: 1f6f88b56bd7

fbshipit-source-id: d0944462c021ec73c7f883f98609fc4a3408efd9
2020-06-25 12:27:24 -07:00
4d40ec1480 [PyTorch Error Logging][1/N] Adding Error Logging for Run_Method (#40535)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40535

Adding error logging for run_method.
Adding CANCEL (the method cannot be found) and FAIL (an error occurred while running the method) statuses
ghstack-source-id: 106604786

Test Plan: {F240891059}

Reviewed By: xcheng16

Differential Revision: D22097857

fbshipit-source-id: 4bdc8e3993e40cb1ba51e4706be6637e3afd40b4
2020-06-25 12:25:34 -07:00
f41173b975 [PyPer][quant] Add quantized embedding operators to OSS. (#40076)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40076

Pull Request resolved: https://github.com/pytorch/glow/pull/4606

[PyPer][quant] Add quantized embedding operators to OSS.

This is the first step in supporting Graph Mode Quantization for EmbeddingBag.

At a high level, the next steps would be
a) Implementation of Embedding prepack/unpack operators,
b) Implementation of torch.nn.quantized.dynamic.EmbeddingBag Module,
c) Implementation of torch.nn.quantized.EmbeddingBag Module,
d) Implementation (modification) of IR passes to support graph quantization of EmbeddingBag module.

More in-depth details regarding each step will be in the follow up diffs. Consider this as an initial diff that moves operators to respective places that's required for us to proceed.

Test Plan: ```buck test mode/no-gpu caffe2/test:quantization -- --stress-runs 100  test_embedding_bag```

Reviewed By: supriyar

Differential Revision: D21949828

fbshipit-source-id: cad5ed0a855db7583bddb1d93e2da398c128024a
2020-06-25 12:01:49 -07:00
461014d54b Unify libtorch_python_cuda_core_sources filelists between CMakeList, fbcode and bazel (#40554)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40554

Get a sublist of `libtorch_python_cuda_sources` named `libtorch_python_cuda_core_sources`. Use it to replace the list which has the same content in `CMakeList.txt`.
This is a change to keep CMakeLists and bazel consistent.

Test Plan: CI

Reviewed By: malfet

Differential Revision: D22223207

fbshipit-source-id: 2bde3c42a0b2d60d689581561075df4ef52ab694
2020-06-25 11:02:33 -07:00
7369dc8d1f Use CPU Allocator for reading from zip container
Summary:
This code path is used to read tensor bodies, so we need it to respect
alignment and padding requirements.

Test Plan: Ran an internal test that was failing.

Reviewed By: zdevito

Differential Revision: D22225622

fbshipit-source-id: f2126727f96616366850642045ab9704f3885824
2020-06-25 10:51:49 -07:00
c362138f43 Disallow passing functions that don't return Tensors to vmap (#40518)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40518

I overlooked this in the initial vmap frontend api PR. Right now we
want to restrict vmap to taking in functions that only return Tensors.
A function that only return tensors can look like one of the following:
```
def fn1(x):
    ...
    return y

def fn2(x):
    ...
    return y, z
```
fn1 returns a Tensor, while fn2 returns a tuple of Tensors. So we add a
check that the output of the function passed to vmap returns either a
single tensor or a tuple of tensors.

NB: These checks allow passing a function that returns a tuple with a
single-element tensor from vmap. That seems OK to me.

Test Plan: - `python test/test_vmap.py -v`

Differential Revision: D22216166

Pulled By: zou3519

fbshipit-source-id: a92215e9c26f6138db6b10ba81ab0c2c2c030929
2020-06-25 08:54:05 -07:00
43757ea913 Add batching rule for Tensor.permute (#40517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40517

This is necessary for implementing the vmap frontend API's out_dims
functionality.

Test Plan:
- `./build/bin/vmap_test`. The vmap python API can't accept inputs that
aren't integers right now. There are workarounds around that (use a
lambda) but that doesn't look too nice. In the future we'll test all
batching rules in Python.

Differential Revision: D22216168

Pulled By: zou3519

fbshipit-source-id: b6ef552f116fddc433e242c1594059b9d2fe1ce4
2020-06-25 08:54:01 -07:00
7038579c03 Add batching rule for unsqueeze, squeeze, and transpose (#40455)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40455

These don't need to be implemented right now but are useful later down
the line. I thought I would use these in implementing vmap's `out_dims`
functionality, but it turns out they weren't necessary. Since the code
exists and is useful anyways, I am leaving this PR here.

Test Plan:
- `./build/bin/vmap_test`. We could test this using the vmap frontend API,
but there is the catch that vmap cannot directly take integers right
now (all inputs passed to vmap must be Tensors at the moment). It's
possible to hack around that by declaring lambdas that take in a single
tensor argument, but those don't look nice.

Differential Revision: D22216167

Pulled By: zou3519

fbshipit-source-id: 1a010f5d7784845cca19339d37d6467f5b987c32
2020-06-25 08:51:27 -07:00
88ea51c061 doc string fix for torch.cuda.set_rng_state_all (#40544)
Summary:
Fix https://github.com/pytorch/pytorch/issues/40239
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40544

Differential Revision: D22233989

Pulled By: ezyang

fbshipit-source-id: b5098357a3e0c50037f95ba0d701523d5dce2628
2020-06-25 08:37:14 -07:00
e440c370c5 [quant] Fix fuse linear pass (#40549)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40549

Previously we didn't check whether %weight_t was produced by `aten::t`; this would fuse some `matmul`/`addmm` that are
not 2d into `aten::linear`, which is incorrect.

Test Plan: Imported from OSS

Differential Revision: D22225921

fbshipit-source-id: 9723e82fdbac6d8e1a7ade22f3a9791321ab12b6
2020-06-25 07:10:09 -07:00
eae1ed99a3 caffe2 | Fix building with -Wrange-loop-analysis on
Summary: `-Wrange-loop-analysis` is turned on by default for clang 10 (see https://reviews.llvm.org/D73834). This fixes a warning found with that flag.

Test Plan: Build with clang 10 and check there are no `range-loop-analysis` warnings.

Reviewed By: yinghai

Differential Revision: D22207072

fbshipit-source-id: 858ba8a36c653071eab961cb891ce945faf0fa87
2020-06-24 23:42:33 -07:00
cf8a9b50ca Allow ReflectionPad to accept 0-dim batch sizes. (#39231)
Summary:
Allows ReflectionPad 1D and 2D to accept 0-dim batch sizes.

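For illustration, the newly supported case looks like this (a sketch; the shapes are arbitrary):

```python
import torch

pad = torch.nn.ReflectionPad2d(1)
x = torch.empty(0, 3, 8, 8)  # batch dimension of size 0
y = pad(x)                   # previously errored; now returns shape (0, 3, 10, 10)
```
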
Related to issues:

* https://github.com/pytorch/pytorch/issues/38115
* https://github.com/pytorch/pytorch/issues/12013
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39231

Reviewed By: ezyang

Differential Revision: D22205717

Pulled By: mruberry

fbshipit-source-id: 6744661002fcbeb4aaafd8693fb550ed53f3e00f
2020-06-24 22:24:05 -07:00
82e9318a16 Adjust CUDA memory leak test (#40504)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40504

Make the CUDA mem leak test not flaky

Test Plan: python test/test_profiler.py

Differential Revision: D22215527

Pulled By: ilia-cher

fbshipit-source-id: 5f1051896342ac50cd3a21ea86ce7487b5f82a19
2020-06-24 18:22:46 -07:00
85b87df5ba Revert D22208758: [pytorch][PR] Report error when ATEN_THREADING is OMP and USE_OPENMP is turned off.
Test Plan: revert-hammer

Differential Revision:
D22208758 (3ed96e465c)

Original commit changeset: 0866c9bb9b3b

fbshipit-source-id: 9e2b469469e274292b2559c02aa0256425fd355e
2020-06-24 18:20:28 -07:00
06debf6373 move __range_length and __derive_index to lite interpreter (#40533)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40533

These ops are required by the demucs denoiser model

Test Plan: build

Reviewed By: kaustubh-kp, linbinyu

Differential Revision: D22216217

fbshipit-source-id: f300ac246fe3a7a6566a70bb89858770af68a90c
2020-06-24 18:14:51 -07:00
adcd755e69 Fix backup solution (#40515)
Summary:
These were changes that had to be made in the `release/1.6` branch in order to get backups to work.

They should be brought to the master branch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40515

Differential Revision: D22221308

Pulled By: seemethere

fbshipit-source-id: 24e2a0196a8e775fe324a383c8f0c681118b741b
2020-06-24 17:21:38 -07:00
e12f73ee12 Add missing file to BUILD.bazel (#40536)
Summary:
Add `int8_gen_quant_params.cc` added by
https://github.com/pytorch/pytorch/pull/40494/ to bazel build rules
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40536

Reviewed By: mruberry

Differential Revision: D22219595

Pulled By: malfet

fbshipit-source-id: 2875a0b9c55bad2b052a898661b96eab490f6451
2020-06-24 17:16:26 -07:00
3dcc329746 Use tree-based sum for floats to avoid numerical instability (#39516)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38716, fixes https://github.com/pytorch/pytorch/issues/37234

This algorithm does the summation along a single axis with multiple "levels" of accumulators, each of which is designed to hold the sum of an order of magnitude more values than the previous one.

e.g. if there are 2^16 elements, the first level will hold the sum of 2^4 elements, and so on in increasing powers of 2: 2^4, 2^8, 2^12 and finally 2^16.

This limits the differences in magnitude of the partial results being added together, and so we don't lose accuracy as the axis length increases.

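A minimal sketch of the idea in Python (the name `cascade_sum` and the leaf size are illustrative, not the actual ATen kernel):

```python
def cascade_sum(xs, leaf=16):
    # Sum at most `leaf` values directly; otherwise sum the partial
    # results of the sub-ranges. Each addition then combines values of
    # comparable magnitude, which bounds the rounding error as the
    # axis length grows.
    if len(xs) <= leaf:
        total = 0.0
        for x in xs:
            total += x
        return total
    partials = [cascade_sum(xs[i:i + leaf], leaf)
                for i in range(0, len(xs), leaf)]
    return cascade_sum(partials, leaf)
```
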
WIP to write a vectorized version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39516

Reviewed By: ezyang

Differential Revision: D22106251

Pulled By: ngimel

fbshipit-source-id: b56de4773292439dbda62b91f44ff37715850ae9
2020-06-24 17:06:38 -07:00
ea06db9466 Release GIL during DDP construction. (#40495)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40495

As part of debugging flaky ddp_under_dist_autograd tests, I realized
we were running into the following deadlock.

1) Rank 0 would go into DDP construction, hold GIL and wait for broadcast in
DDP construction.
2) Rank 3 is a little slower and performs an RRef fetch call before the DDP
construction.
3) The RRef fetch call is done on Rank 0 and tries to acquire GIL.
4) We now have a deadlock since Rank 0 is waiting for Rank 3 to enter the
collective and Rank 3 is waiting for Rank 0 to release GIL.
ghstack-source-id: 106534442

Test Plan:
1) Ran ddp_under_dist_autograd 500 times.
2) waitforbuildbot

Differential Revision: D22205180

fbshipit-source-id: 6afd55342e801b9edb9591ff25158a244a8ea66a
2020-06-24 16:58:42 -07:00
71edd7f175 Update FP16 to FP16:4dfe081cf6bcd15db339cf2680b9281b8451eeb3. (#40526)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40526

Differential Revision: D22215600

Pulled By: AshkanAliabadi

fbshipit-source-id: 6ff0c17d17f118b64ae34c0007b705c7127f07ef
2020-06-24 16:58:40 -07:00
16f276cef9 Add C++-only int dim overloads to std-related operations (#40451)
Summary:
Fixes gh-40287

The `int -> bool` conversion takes higher precedence than `int -> IntArrayRef`. So, calling `std(0)` in C++ would select the `std(unbiased=False)` overload instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40451

Differential Revision: D22217926

Pulled By: ezyang

fbshipit-source-id: 7520792fab5ab6665bddd03b6f57444c6c729af4
2020-06-24 16:56:55 -07:00
a208a272cb Update cpuinfo to cpuinfo:63b254577ed77a8004a9be6ac707f3dccc4e1fd9. (#40516)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40516

Differential Revision: D22215554

Pulled By: AshkanAliabadi

fbshipit-source-id: f779cf6e08cf344b87071c2ffc9b3f7cf4659085
2020-06-24 16:47:24 -07:00
c120fdc05b Unify torch/csrc/cuda/shared/cudnn.cpp include path (#40525)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40525

Move `USE_CUDNN` define under `USE_CUDA` guard, add `cuda/shared/cudnn.cpp` to filelist if either USE_ROCM or USE_CUDNN is set.
This is a prep change for PyTorch CUDA src filelist unification change.

Test Plan: CI

Differential Revision: D22214899

fbshipit-source-id: b71b32fc603783b41cdef0e7fab2cc9cbe750a4e
2020-06-24 16:40:11 -07:00
cef35e339f Update FXdiv to FXdiv:b408327ac2a15ec3e43352421954f5b1967701d1. (#40520)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40520

Differential Revision: D22215614

Pulled By: AshkanAliabadi

fbshipit-source-id: 5e41a3a69522cbfe1cc4ac76a0d1f3e90a58528d
2020-06-24 16:31:25 -07:00
4a0ba62ded Update psimd to psimd:072586a71b55b7f8c584153d223e95687148a900. (#40522)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40522

Differential Revision: D22215685

Pulled By: AshkanAliabadi

fbshipit-source-id: 78c103c4f7ad21e78069dc86a8ee47aebc9aa73e
2020-06-24 16:21:25 -07:00
3e09268c0a [jit] allow dict to be mixed between tracing and scripting (#39601)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39601

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D22202689

Pulled By: wanchaol

fbshipit-source-id: 5271eb3d8fdcda3d730a085aa555b43c35d14876
2020-06-24 16:14:13 -07:00
787e1c4c7d [jit] fix dictConstruct order issue (#40424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40424

dictConstruct should preserve the inputs order

Test Plan: Imported from OSS

Differential Revision: D22202690

Pulled By: wanchaol

fbshipit-source-id: c313b531b7fa49e6f3486396d61bfc5d6400cd01
2020-06-24 16:12:32 -07:00
2e6e8d557c Update docs feature classifications (#39966)
Summary:
Update the following feature classifications in docs to align with the changes:
1. [High Level Autograd APIs](https://pytorch.org/docs/stable/autograd.html#functional-higher-level-api): Beta (was experimental)
2. [Eager Mode Quantization](https://pytorch.org/docs/stable/quantization.html): Beta (was experimental)
3. [Named Tensors](https://pytorch.org/docs/stable/named_tensor.html): Prototype (was experimental)
4. [TorchScript/RPC](https://pytorch.org/docs/stable/rpc.html#rpc): Prototype (was experimental)
5. [Channels Last Memory Layout](https://pytorch.org/docs/stable/tensor_attributes.html#torch-memory-format): Beta (was experimental)
6. [Custom C++ Classes](https://pytorch.org/docs/stable/cpp_index.html): Beta (was experimental)
7. [Torch.Sparse](https://pytorch.org/docs/stable/sparse.html): Beta (was experimental)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39966

Differential Revision: D22213217

Pulled By: jlin27

fbshipit-source-id: dc49337cbc7026ed8dcac506fc60029dc3add854
2020-06-24 15:35:59 -07:00
72f2c479e3 Migrate equal from the TH to Aten (CPU) (#33286)
Summary:
https://github.com/pytorch/pytorch/issues/24697
VitalyFedyunin
glaringlee

Test script:
```Python
import timeit

setup_ones = """
import torch
a = torch.ones(({n}, {n}), dtype={dtype})
b = torch.ones(({n}, {n}), dtype={dtype})
"""

for n, t in [(1000, 10000), (2000, 10000)]:
  for dtype in ('torch.bool', 'torch.int', 'torch.long', 'torch.bfloat16', 'torch.float', 'torch.double'):
  #for dtype in ('torch.bool', 'torch.int', 'torch.long', 'torch.float', 'torch.double'):
    print('torch.ones(({n}, {n})) equal for {t} times {dtype}'.format(n=n, t=t, dtype=dtype))
    print(timeit.timeit(stmt='torch.equal(a, b)', setup=setup_ones.format(n=n, dtype=dtype), number=t))

setup_rand = """
import torch
a = torch.rand(({n}, {n}), dtype={dtype})
b = a.clone()
"""
for n, t in [(1000, 10000), (2000, 10000)]:
  for dtype in ('torch.float', 'torch.double'):
    print('torch.rand(({n}, {n})) for {t} times {dtype}'.format(n=n, t=t, dtype=dtype))
    print(timeit.timeit(stmt='torch.equal(a, b)', setup=setup_rand.format(n=n, dtype=dtype), number=t))

setup_non_contiguous = """
import torch
a = torch.rand(({n}, {n}), dtype={dtype})
a2 = a[:, 500:]
a3 = a2.clone()
torch.equal(a2, a3)
"""
for n, t in [(1000, 10000), (2000, 10000)]:
  for dtype in ('torch.float', 'torch.double'):
    print('non_contiguous torch.rand(({n}, {n})) for {t} times {dtype}'.format(n=n, t=t, dtype=dtype))
    print(timeit.timeit(stmt='torch.equal(a2, a3)', setup=setup_non_contiguous.format(n=n, dtype=dtype), number=t))

setup_not_equal = """
import torch
a = torch.rand(({n}, {n}), dtype={dtype})
b = torch.rand(({n}, {n}), dtype={dtype})
torch.equal(a, b)
"""
for n, t in [(1000, 10000), (2000, 10000)]:
  for dtype in ('torch.float', 'torch.double'):
    print('not equal torch.rand(({n}, {n})) for {t} times {dtype}'.format(n=n, t=t, dtype=dtype))
    print(timeit.timeit(stmt='torch.equal(a, b)', setup=setup_not_equal.format(n=n, dtype=dtype), number=t))
```

TH
```
torch.ones((1000, 1000)) equal for 10000 times torch.bool
1.8391206220258027
torch.ones((1000, 1000)) equal for 10000 times torch.int
1.8877864250680432
torch.ones((1000, 1000)) equal for 10000 times torch.long
1.938108820002526
torch.ones((1000, 1000)) equal for 10000 times torch.bfloat16
3.184849138953723
torch.ones((1000, 1000)) equal for 10000 times torch.float
1.8825413499725983
torch.ones((1000, 1000)) equal for 10000 times torch.double
2.7266416549682617
torch.ones((2000, 2000)) equal for 10000 times torch.bool
7.227149627986364
torch.ones((2000, 2000)) equal for 10000 times torch.int
7.76215292501729
torch.ones((2000, 2000)) equal for 10000 times torch.long
9.631909006042406
torch.ones((2000, 2000)) equal for 10000 times torch.bfloat16
8.097328286035918
torch.ones((2000, 2000)) equal for 10000 times torch.float
5.5739822529722005
torch.ones((2000, 2000)) equal for 10000 times torch.double
8.444009944912978
torch.rand((1000, 1000)) for 10000 times torch.float
1.168096570065245
torch.rand((1000, 1000)) for 10000 times torch.double
1.6577326939441264
torch.rand((2000, 2000)) for 10000 times torch.float
5.49395391496364
torch.rand((2000, 2000)) for 10000 times torch.double
8.507486199960113
non_contiguous torch.rand((1000, 1000)) for 10000 times torch.float
6.074504268006422
non_contiguous torch.rand((1000, 1000)) for 10000 times torch.double
6.1426916810451075
non_contiguous torch.rand((2000, 2000)) for 10000 times torch.float
37.501055537955835
non_contiguous torch.rand((2000, 2000)) for 10000 times torch.double
44.6880351039581
not equal torch.rand((1000, 1000)) for 10000 times torch.float
0.029356416082009673
not equal torch.rand((1000, 1000)) for 10000 times torch.double
0.025421109050512314
not equal torch.rand((2000, 2000)) for 10000 times torch.float
0.026333761983551085
not equal torch.rand((2000, 2000)) for 10000 times torch.double
0.02748022007290274
```

ATen
```
torch.ones((1000, 1000)) equal for 10000 times torch.bool
0.7961567062884569
torch.ones((1000, 1000)) equal for 10000 times torch.int
0.49172434909269214
torch.ones((1000, 1000)) equal for 10000 times torch.long
0.9459248608909547
torch.ones((1000, 1000)) equal for 10000 times torch.bfloat16
2.0877483217045665
torch.ones((1000, 1000)) equal for 10000 times torch.float
0.606857153121382
torch.ones((1000, 1000)) equal for 10000 times torch.double
1.1388208279386163
torch.ones((2000, 2000)) equal for 10000 times torch.bool
2.0329296849668026
torch.ones((2000, 2000)) equal for 10000 times torch.int
3.534358019940555
torch.ones((2000, 2000)) equal for 10000 times torch.long
8.19841272290796
torch.ones((2000, 2000)) equal for 10000 times torch.bfloat16
6.595649406313896
torch.ones((2000, 2000)) equal for 10000 times torch.float
4.193911510054022
torch.ones((2000, 2000)) equal for 10000 times torch.double
7.931309659034014
torch.rand((1000, 1000)) for 10000 times torch.float
0.8877940969541669
torch.rand((1000, 1000)) for 10000 times torch.double
1.4142901846207678
torch.rand((2000, 2000)) for 10000 times torch.float
4.010025603231043
torch.rand((2000, 2000)) for 10000 times torch.double
8.126411964651197
non_contiguous torch.rand((1000, 1000)) for 10000 times torch.float
0.602473056409508
non_contiguous torch.rand((1000, 1000)) for 10000 times torch.double
0.6784545010887086
non_contiguous torch.rand((2000, 2000)) for 10000 times torch.float
3.0991827426478267
non_contiguous torch.rand((2000, 2000)) for 10000 times torch.double
5.719010795000941
not equal torch.rand((1000, 1000)) for 10000 times torch.float
0.046060710679739714
not equal torch.rand((1000, 1000)) for 10000 times torch.double
0.036034489050507545
not equal torch.rand((2000, 2000)) for 10000 times torch.float
0.03686975734308362
not equal torch.rand((2000, 2000)) for 10000 times torch.double
0.04189508780837059
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33286

Differential Revision: D22211962

Pulled By: glaringlee

fbshipit-source-id: a5c48f328432c1996f28e19bc75cb495fb689f6b
2020-06-24 15:08:06 -07:00
4d549077a2 Skip test_mem_leak on Windows (#40486)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/40485.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40486

Differential Revision: D22217493

Pulled By: malfet

fbshipit-source-id: 6654c3b53e8af063b508f91728e58262ffbab053
2020-06-24 14:49:14 -07:00
0c923eea0a Add finishAndThrow function to ProcessGroup::Work, and use with Gloo (#40405)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40405

This adds a finishAndThrow function that completes the work object,
sets an exception if one is provided by the user, and throws an exception (if
it is already set or passed by the caller). This is now done by grabbing the
lock just once and simplifies the wait functions in ProcessGroupGloo.
ghstack-source-id: 106516114

Test Plan: CI

Differential Revision: D22174890

fbshipit-source-id: ea74702216c4328187c8d193bf39e1fea43847f6
2020-06-24 14:46:25 -07:00
3e2d2fc856 [NCCL Docs] Adding Comments for Work-level Finish in ProcessGroup (#40404)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40404

Adds docs to the finish function in ProcessGroup::Work. It's better to have some documentation around these functions since we have some PRs with API changes/optimizations for these work-level functions here and in the subclasses.
ghstack-source-id: 106381736

Test Plan: CI (Docs change only)

Differential Revision: D22174891

fbshipit-source-id: 7901ea3b35caf6f69f37178ca574104d3412de28
2020-06-24 14:44:18 -07:00
527ab13436 [NCCL] Explicitly Abort NCCL Communicators on Process Group Destruction (#40241)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40241

We abort incomplete NCCL communicators in the ProcessGroupNCCL
destructor; otherwise pending NCCL communicators may block other CUDA ops.

Closes: https://github.com/pytorch/pytorch/issues/32231
ghstack-source-id: 106469423

Test Plan: CI/Sandcastle

Reviewed By: jiayisuse

Differential Revision: D22103662

fbshipit-source-id: 1f6f88b56bd7a5e9ca5a41698995a76e60e8ad9f
2020-06-24 14:34:00 -07:00
fe18dcd692 Use GLOG logging prefixes (#40491)
Summary:
PyTorch should stop polluting the global namespace with symbols such as `ERROR`, `WARNING` and `INFO`.
Since `logging_is_not_google_glog.h` is a C++ header, define the severity levels in a namespace and add a `GLOG_` prefix to match the unshortened glog severity levels.
Change the `LOG` and `LOG_IF` macros to use the prefixed + namespaced severity levels.

Closes https://github.com/pytorch/pytorch/issues/40083
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40491

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D22210925

Pulled By: malfet

fbshipit-source-id: 0ec1181a53baa8bca2f526f245e398582304aeab
2020-06-24 14:07:00 -07:00
fc4824aa4a enable mkldnn dilation conv (#40483)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40483

Reviewed By: ezyang

Differential Revision: D22213696

Pulled By: ngimel

fbshipit-source-id: 0321eee8fcaf144b20a5182aa76f98d505c65400
2020-06-24 13:28:05 -07:00
de7ac60cf4 Add out= variants for cuda.comm.broadcast/gather/scatter (#39681)
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/38911
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39681

Differential Revision: D22161342

Pulled By: mrshenli

fbshipit-source-id: 60295077159b02087823e93bb6ebac9d70adea0a
2020-06-24 12:58:19 -07:00
e66445878d Adds dynamic versioning pattern (#40279)
Summary:
BC NOTE:

This change makes it so modules saved with torch.jit.save in PyTorch 1.6 can be loaded by previous versions of PyTorch unless they use torch.div or (soon) torch.full. It also lets tensors saved using torch.save be loaded by previous versions. So this is the opposite of BC-breaking, but I'm using that label to highlight this issue since we don't have a "BC-improving" label.

PR NOTE:
When an operator's semantics change in PyTorch we want to do two things:

1) Preserve the semantics of older serialized Torchscript programs that use the operator
2) Ensure the new semantics are respected

Historically, this meant writing a Versioned Symbol that would remap older versions of the operator into current PyTorch code (1), and bumping the produced file format version (2). Unfortunately, bumping the produced file format version is a nuclear option for ensuring semantics are respected, since it also prevents older versions of PyTorch from loading anything (even tensors!) from newer versions.

Dynamic versioning addresses the nuclear consequences of bumping the produced file format version by only bumping it when necessary. That is, when an operator with changed semantics is detected in the serialized Torchscript. This will prevent Torchscript programs that use the changed operator from loading on earlier versions of PyTorch, as desired, but will have no impact on programs that don't use the changed operator.

Note that this change is only applicable when using torch.jit.save and torch.jit.load. torch.save pickles the given object using pickle (by default), which saves a function's Python directly.

No new tests for this behavior are added since the existing tests for versioned division in test_save_load already validate that models with div are loaded correctly at version 4.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40279

Reviewed By: dzhulgakov

Differential Revision: D22168291

Pulled By: mruberry

fbshipit-source-id: e71d6380e727e25123c7eedf6d80e5d7f1fe9f95
2020-06-24 12:52:50 -07:00
a2e1a948a4 Increase number of iterations in DDP SPMD tests (#40506)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40506

Test Plan: Imported from OSS

Differential Revision: D22208965

Pulled By: mrshenli

fbshipit-source-id: 7d27b60e2c09e641b4eeb1c89d9f9917c4e72e52
2020-06-24 12:48:04 -07:00
9a3e16c773 Add guard for non-default stream in DDP's autograd engine callback (#40115)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40115

Closes https://github.com/pytorch/pytorch/issues/37790
Closes https://github.com/pytorch/pytorch/issues/37944

A user may wish to run DDP's forward + backward step under a non-default CUDA stream such as those created by `with torch.cuda.stream(stream)`. In this case, the user is responsible for synchronizing events on this stream with other streams used in the program (per the documentation at https://pytorch.org/docs/stable/notes/cuda.html#cuda-semantics), but currently DDP has a bug which causes it to fail under non-default streams.

If a user does the following:
```
model = DDP(...)
loss = model(input).sum()
loss.backward()
grad = model.module.weight.grad
average = dist.all_reduce(grad)
```

There is a chance that `average` and `grad` will not be equal. This is because the CUDA kernels corresponding to the  `all_reduce` call may run before `loss.backward()`'s kernels are finished. Specifically, in DDP we copy the allreduced gradients back to the model parameter gradients in an autograd engine callback, but this callback runs on the default stream. Note that this can also be fixed by the application synchronizing on the current stream, although this should not be expected, since the application is not using the current stream at all.

This PR fixes the issue by passing the current stream into DDP's callback.

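For reference, the documented user-side pattern for side streams looks roughly like this (a sketch, not part of this diff):

```python
s = torch.cuda.Stream()
with torch.cuda.stream(s):
    loss = model(input).sum()
    loss.backward()
# join the side stream before consuming the grads on the current stream
torch.cuda.current_stream().wait_stream(s)
```
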
Tested by adding a UT `test_DistributedDataParallel_non_default_stream` that fails without this PR
ghstack-source-id: 106481208

Differential Revision: D22073353

fbshipit-source-id: 70da9b44e5f546ff8b6d8c42022ecc846dff033e
2020-06-24 11:26:51 -07:00
597cb04b2f Use Int8QuantParamsBlob to pass the scale and zeropoint params (#40494)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40494

Resubmit the diff because D22124313 (1ec4337b7d) was reverted due to CI test failures
Added the int8_gen_quant_params.cc to CMakeList.txt to fix the CI failures

Test Plan: buck test caffe2/caffe2/quantization/server:

Reviewed By: hx89

Differential Revision: D22204244

fbshipit-source-id: a2c8b668f199cc5b0c5894086f554f7c459b1ad7
2020-06-24 10:20:16 -07:00
3ed96e465c Report error when ATEN_THREADING is OMP and USE_OPENMP is turned off. (#40146)
Summary:
Currently, even if USE_OPENMP is turned off, ATEN_THREADING can still use OpenMP. This commit fixes that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40146

Reviewed By: ezyang

Differential Revision: D22208758

Pulled By: pbelevich

fbshipit-source-id: 0866c9bb9b3b5b99d586aed176eb0fbe177efa4a
2020-06-24 09:55:10 -07:00
b4ccdef090 Allow torch.cuda.amp.GradScaler to support sparse gradients (#36786)
Summary:
Should close https://github.com/pytorch/pytorch/issues/35810.

I decided to keep sparse handling on the Python side for clarity, although it could be moved to the C++ side (into `_amp_non_finite_check_and_unscale_`) without much trouble.

For non-fp16 sparse grads the logic is simple (call `_amp_non_finite_check_and_unscale_` on `grad._values()` instead of `grad` itself). At least I hope it's that easy.

For fp16 sparse grads, it's trickier.  Sparse tensors can be uncoalesced.  From the [Note](https://pytorch.org/docs/master/sparse.html#torch.sparse.FloatTensor):
> Our sparse tensor format permits uncoalesced sparse tensors, where there may be duplicate coordinates in the indices; in this case, the interpretation is that the value at that index is the sum of all duplicate value entries.

An uncoalesced scaled fp16 grad may have values at duplicate coordinates that are all finite but large, such that adding them to make the coalesced version WOULD cause overflows.**  If I checked `_values()` on the uncoalesced version, it might not report overflows, but I think it should.

So, if the grad is sparse, fp16, and uncoalesced, I still call `_amp_non_finite_check_and_unscale_` to unscale `grad._values()` in-place, but I also double-check the coalesced version by calling a second `_amp_non_finite_check_and_unscale_` on `grad.coalesce()._values()`.  `coalesce()` is out-of-place, so this call doesn't redundantly affect `grad._values()`, but it does have the power to populate the same `found_inf` tensor.  The `is_coalesced()` check and `coalesce()` probably aren't great for performance, but if someone needs a giant embedding table in FP16, they're better than nothing and memorywise, they'll only create a copy of nnz gradient values+indices, which is still way better than changing the whole table to FP32.

An `unscale` variant with liberty to create unscaled grads out-of-place, and replace `param.grad` instead of writing through it, could get away with just one `_amp_non_finite_check_and_unscale_`.  It could say `coalesced = grad.coalesced()`, do only the stronger `_amp_non_finite_check_and_unscale_` on `coalesced._values()`, and set `param.grad = coalesced`.  I could even avoid replacing `param.grad` itself by going one level deeper and setting `param.grad`'s indices and values to `coalesced`'s, but that seems brittle and still isn't truly "in place".

** you could whiteboard an uncoalesced fp32 grad with the same property, but fp32's range is big enough that I don't think it's realistic.
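
Putting the above together, an outline of the unscale logic in Python (the wrapper name `_unscale_grad` is assumed for illustration; this is a sketch of the described approach, not the exact diff):

```python
import torch

def _unscale_grad(grad, found_inf, inv_scale):
    if grad.is_sparse:
        if grad.dtype is torch.float16 and not grad.is_coalesced():
            # Duplicate coordinates may overflow only once summed, so
            # also check the coalesced values. coalesce() is
            # out-of-place, but it can populate the same found_inf.
            torch._amp_non_finite_check_and_unscale_(
                grad.coalesce()._values(), found_inf, inv_scale)
        torch._amp_non_finite_check_and_unscale_(
            grad._values(), found_inf, inv_scale)
    else:
        torch._amp_non_finite_check_and_unscale_(grad, found_inf, inv_scale)
```
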
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36786

Reviewed By: ezyang

Differential Revision: D22202832

Pulled By: ngimel

fbshipit-source-id: b70961a4b6fc3a4c1882f65e7f34874066435735
2020-06-24 09:10:49 -07:00
d855528186 wconstab/38034-sliced-sequential (#40445)
Summary:
Partial support for slicing of Sequential containers.

- works around missing Sequential slice functionality by converting to tuple
- only supports iteration of the resulting tuple values, not direct call() on the sliced sequential
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40445

Differential Revision: D22192469

Pulled By: wconstab

fbshipit-source-id: 61c85deda2d58f6e3bea2f1fa1d5d5dde568b9b5
2020-06-24 09:05:51 -07:00
727463a727 Initial vmap frontend API (#40172)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40172

This PR introduces the initial vmap frontend API. It has the following
limitations that we can resolve in the future:
- the inputs must be a flat list of tensors
- the outputs must be a flat list of tensors
- in_dims = 0 (so we always vmap over dim 0 of input tensors)
- out_dims = 0 (so the returned tensors have their vmap dim appear at
dim 0)
- Coverage limited to operations that have batching rules implemented
(torch.mul, torch.sum, torch.expand).

There are some other semantic limitations (like not being able to handle
mutation, aside from pytorch operations that perform mutation) that will
be documented in the future.

I wanted to introduce the API before adding a slow fallback for the
coverage so that we can test future batching rules (and coverage) via
the python API to avoid verbosity in C++-land.

The way vmap works is that `vmap(func)(inputs)` wraps all Tensor inputs
to be batched in BatchedTensors, sends those into func, and then unwraps
the output BatchedTensors. Operations on BatchedTensors perform the batched
operations that the user is asking for. When performing nested vmaps,
each nested vmap adds a batch dimension upon entry and removes a batch
dimension on exit.

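Under those restrictions, usage looks roughly like this (a sketch assuming the entry point is exposed as `torch.vmap`; `per_example` is an illustrative name):

```python
import torch

def per_example(x):
    return torch.mul(x, x)  # torch.mul has a batching rule implemented

xs = torch.randn(5, 3)            # 5 examples, each of shape (3,)
ys = torch.vmap(per_example)(xs)  # maps over dim 0; ys has shape (5, 3)
```
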
Coming up in the near future:
- Support for non-zero in_dims and out_dims
- docstring for vmap
- slow fallback for operators that do not have a batching rule
implemented.

Test Plan: - `pytest test/test_vmap.py -v`

Differential Revision: D22102076

Pulled By: zou3519

fbshipit-source-id: b119f0a8a3a3b1717c92dbbd180dfb1618295563
2020-06-24 08:14:24 -07:00
43ab9c677b Add invariants check to BatchedTensorImpl (#40171)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40171

It checks that all of the bdims in BatchedTensorImpl are sorted in
order of ascending `level`.

Test Plan: - Check that nothing breaks in `./build/bin/vmap_test`

Differential Revision: D22102077

Pulled By: zou3519

fbshipit-source-id: 094b7abc6c65208437f0f51a0d0083091912decc
2020-06-24 08:12:16 -07:00
e490352dc4 Simplify complex case for tanh backward (#39997)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39997

Differential Revision: D22195797

Pulled By: anjali411

fbshipit-source-id: 21eb91bcbd3bfc67acd322a1579fe737b0c02e6e
2020-06-24 07:51:34 -07:00
4975be80f8 fix typo "normal" -> "Cauchy" (#40334)
Summary:
just looks like a real simple typo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40334

Reviewed By: ezyang

Differential Revision: D22195107

Pulled By: zou3519

fbshipit-source-id: 6c43842d22cbc15db2307976381f6dc1536b5047
2020-06-24 07:45:35 -07:00
ecd9a64712 fix torch.jit.trace_module documentation (#40248)
Summary:
This should fix https://github.com/pytorch/pytorch/issues/39328

Before:

![image](https://user-images.githubusercontent.com/24580222/85076992-4720e800-b18f-11ea-9c6e-19bcf3f1cb7d.png)

After:

![image](https://user-images.githubusercontent.com/24580222/85077064-6ddf1e80-b18f-11ea-9274-e8cee6909baa.png)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40248

Reviewed By: ezyang

Differential Revision: D22195038

Pulled By: zou3519

fbshipit-source-id: c4bff6579a422a56ed28b644f5558b20d901c94e
2020-06-24 07:31:31 -07:00
a4dec0674c [doc] fix typo in formula of MarginRankingLoss (#40285)
Summary:
This is just a minor doc fix:

the `MarginRankingLoss` takes 2 input samples `x1` and `x2`, not just `x`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40285

Reviewed By: ezyang

Differential Revision: D22195069

Pulled By: zou3519

fbshipit-source-id: 909f491c94dca329a37216524f4088e9096e0bc6
2020-06-24 07:24:51 -07:00
e439cf738a Fix examples Adaptive avg pooling typo (#40217)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40217

Reviewed By: ezyang

Differential Revision: D22193711

Pulled By: zou3519

fbshipit-source-id: f96f71e025aa1c81b232e78b1d5b3a3bbd8f331f
2020-06-24 07:22:46 -07:00
72e8690b78 Fix typo. in error message (#39958)
Summary:
Changed sould to should
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39958

Reviewed By: ezyang

Differential Revision: D22193674

Pulled By: zou3519

fbshipit-source-id: ad7bc0aa3ee1f31f5e7965ae36c1903b28509095
2020-06-24 07:17:10 -07:00
b4eb82cd29 Temporary commit at 6/17/2020, 6:49:44 PM
Summary: [WIP] Logit Fake16 Op

Test Plan: [WIP] Tests will be enabled in test_op_nnpi_fp16.py file.

Reviewed By: hyuen

Differential Revision: D22109329

fbshipit-source-id: fd73850c3ec61375ff5bbf0ef5460868a874fbf3
2020-06-24 06:51:48 -07:00
0ecea2d64d [JIT x RPC] Consolidate Future type class and Future impl class (#40406)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40406

Same motivation as https://github.com/pytorch/pytorch/issues/35110.

`Future` and `RRef` are two important types for the `rpc` module and should be easy for users to work with.

Reference, https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-autoclass

Follow https://github.com/pytorch/pytorch/pull/35694.
ghstack-source-id: 106484664

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_rref_local_value
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/tensorpipe:rpc_fork_tensorpipe
```

pyre -l caffe2/torch/fb/training_toolkit
pyre -l caffe2/torch/fb/distributed
pyre -l aiplatform

Differential Revision: D7722176

fbshipit-source-id: f3b9ccd7bccb233b2b33ad59dd65e178ba34d67f
2020-06-24 01:44:49 -07:00
f035f73d53 Fix the issue that run clang-tidy on the aten folder (#39713)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39713

Differential Revision: D22203850

Pulled By: mruberry

fbshipit-source-id: 43f690e748b7a3c123ad20f6d640d6dae25c641c
2020-06-24 01:27:54 -07:00
46b9e519aa Remove print (#40475)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40475

As title
ghstack-source-id: 106474870

Test Plan: CI

Differential Revision: D22200640

fbshipit-source-id: 1f4c7bbf54be8c4187c9338fefdf14b501597d98
2020-06-24 00:42:25 -07:00
7b0f867c48 Perf improvement of Conv2d and Conv3d (#40324)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40324

1) avoid the use of `item()`; 2) bypass im2col for 1x1 conv

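The im2col bypass works because a 1x1 conv with stride 1 and no padding is just a matrix multiply over the channel dimension; a sketch of the equivalence (illustrative, not the diff):

```python
import numpy as np

N, C, H, W, M = 1, 512, 4, 4, 512
X = np.random.randn(N, C, H, W).astype(np.float32)
K = np.random.randn(M, C).astype(np.float32)  # the 1x1 kernels, flattened

# conv2d with a 1x1 kernel reduces to a matmul over channels
Y = np.einsum('mc,nchw->nmhw', K, X)
assert Y.shape == (N, M, H, W)
```
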
Test Plan:
unit test and perf benchmark to show improvement
```
import numpy as np
import torch
from timeit import Timer

num = 50

N = 1
C = 512
H = 4
W = 4

M = 512
kernel_h = 1
kernel_w = 1
stride_h = 1
stride_w = 1
padding_h = 0
padding_w = 0

X_np = np.random.randn(N, C, H, W).astype(np.float32)
W_np = np.random.randn(M, C, kernel_h, kernel_w).astype(np.float32)
X = torch.from_numpy(X_np)

conv2d_pt = torch.nn.Conv2d(
    C, M, (kernel_h, kernel_w), stride=(stride_h, stride_w),
    padding=(padding_h, padding_w), groups=1, bias=True)

class ConvNet(torch.nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        self.conv2d = conv2d_pt

    def forward(self, x):
        return self.conv2d(x)

model = ConvNet()

def pt_forward():
    # with torch.autograd.profiler.profile(record_shapes=True) as prof:
    model(X)
    # print(prof.key_averages().table(sort_by="self_cpu_time_total"))

torch._C._set_mkldnn_enabled(False)

t = Timer("pt_forward()", "from __main__ import pt_forward, X")
print(t.timeit(num))  # run the benchmark
```
Before the optimization:
pt time = 5.841153813526034
After the optimization:
pt time = 4.513134760782123

Differential Revision: D22149067

fbshipit-source-id: 538d9eea5b729e6c3da79444bde1784bde828876
2020-06-23 23:39:05 -07:00
cb26661fe4 Throws runtime error when torch.full would infer a float dtype from a bool or integral fill value (#40364)
Summary:
BC-breaking NOTE:

In PyTorch 1.6, bool and integral fill values given to torch.full must set the dtype or out keyword arguments. In prior versions of PyTorch these fill values would return float tensors by default, but in PyTorch 1.7 they will return a bool or long tensor, respectively. The documentation for torch.full has been updated to reflect this.

PR NOTE:

This PR causes torch.full to throw a runtime error when it would have inferred a float dtype by being given a boolean or integer value. A versioned symbol for torch.full is added to preserve the behavior of already serialized Torchscript programs. Existing tests for this behavior being deprecated have been updated to reflect it now being unsupported, and a couple new tests have been added to validate the versioned symbol behavior. The documentation of torch.full has also been updated to reflect this change.
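
Concretely, the 1.6 behavior described above (a sketch; the exact error message may differ):

```python
import torch

torch.full((2, 2), 1.0)                  # float fill value: unchanged
torch.full((2, 2), 1, dtype=torch.long)  # explicit dtype: unchanged
try:
    torch.full((2, 2), 1)                # would have inferred a float dtype
except RuntimeError as e:
    print(e)                             # raises in 1.6; returns long in 1.7
```
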
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40364

Differential Revision: D22176640

Pulled By: mruberry

fbshipit-source-id: b20158ebbcb4f6bf269d05a688bcf4f6c853a965
2020-06-23 23:27:22 -07:00
a2d4d9eca6 Improve Dynamic Library for Windows (#40365)
Summary:
1. Use LoadLibraryEx if available
2. Print more info on error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40365

Differential Revision: D22194974

Pulled By: malfet

fbshipit-source-id: e8309f39d78fd4681de5aa032288882910dff928
2020-06-23 20:29:48 -07:00
e2201e2ed8 Fixes caffe2 loading issues on Windows (#39513)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/27840#issuecomment-638715422.
Contains a bunch of fixes (https://github.com/pytorch/pytorch/pull/39376 + https://github.com/pytorch/pytorch/pull/39334 + https://github.com/pytorch/pytorch/pull/38302 + https://github.com/pytorch/pytorch/pull/35362)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39513

Differential Revision: D22190761

Pulled By: malfet

fbshipit-source-id: b2d52f6cb16c233d16071e9c0670dfff7da2710e
2020-06-23 20:11:24 -07:00
7c07c39845 [torch.distributed.rpc] Install method docstrings from PyRRef to RRef (#40461)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40461

It turned out `:inherited-members:` (see [doc](https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-autoclass)) is not really usable.

This is because pybind11 generates a docstring that writes `self` as the parent-class type, `rpc.PyRRef`.

As a workaround, I am pulling the docstrings on the parent class, `PyRRef`, into the subclass, `RRef`, and doing surgery on the docstring generated by pybind11.

{F241283111}

ghstack-source-id: 106472496

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par \
-r test_rref_str

buck build mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par \
-r test_return_local_rrefs

buck test mode/dev-nosan //caffe2/torch/fb/distributed/model_parallel/tests:test_elastic_averaging -- 'test_elastic_averaging_center \(caffe2\.torch\.fb\.distributed\.model_parallel\.tests\.test_elastic_averaging\.TestElasticAveragingCenter\)'

P134031188

Differential Revision: D7933834

fbshipit-source-id: c03a8a4c9d98888b64492a8caba1591595bfe247
2020-06-23 19:58:36 -07:00
7c737eab59 Remove table of contents at the top of rpc.rst (#40205)
Summary:
mattip - Can we remove the table of contents created by the `.. contents:: :local: :depth: 2` since this page isn't one of the large documentation pages (https://github.com/pytorch/pytorch/issues/38010) and is simply a landing page for the Distributed RPC Framework?

Changes made in this original PR: f10fbcc820 (diff-250b9b23fd6f1a5c15aecdb72afb9d7d)

cc mrshenli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40205

Differential Revision: D22194943

Pulled By: jlin27

fbshipit-source-id: 4e42845daf2784a17ad81645fe3b838385656bba
2020-06-23 19:45:11 -07:00
b7e044f0e5 Re-apply PyTorch pthreadpool changes
Summary:
This re-applies D21232894 (b9d3869df3) and D22162524, plus updates jni_deps in a few places
to avoid breaking host JNI tests.

Test Plan: `buck test @//fbandroid/mode/server //fbandroid/instrumentation_tests/com/facebook/caffe2:host-test`

Reviewed By: xcheng16

Differential Revision: D22199952

fbshipit-source-id: df13eef39c01738637ae8cf7f581d6ccc88d37d5
2020-06-23 19:26:21 -07:00
bdc00196d1 Enable XNNPACK ops on iOS and macOS.
Test Plan: buck run aibench:run_bench -- -b aibench/specifications/models/pytorch/pytext/pytext_mobile_inference.json --platform ios --framework pytorch --remote --devices D221AP-12.0.1

Reviewed By: xta0

Differential Revision: D21886736

fbshipit-source-id: ac482619dc1b41a110a3c4c79cc0339e5555edeb
2020-06-23 18:50:36 -07:00
c314e0deb5 [quant] Quantized adaptive_avg_pool3d (#40271)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40271

Closes #40244

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D22134318

Pulled By: z-a-f

fbshipit-source-id: 0489b6c083a3cbc21a1d81d8bfcc499372308088
2020-06-23 18:13:48 -07:00
6468bc4637 [JIT] script if tracing fix (#40468)
Summary:
Currently, torchvision annotates `batched_nms` with `torch.jit.script` so that the function gets compiled when it is traced and ONNX export will work. Unfortunately, this means we are eagerly compiling batched_nms, which fails if torchvision isn't built with `torchvision.ops.nms`. As a result, torchvision doesn't work on torch hub right now.

`_script_if_tracing` could solve our problem here, but right now it does not correctly interact with recursive compilation. This PR fixes that bug.
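
The intended pattern, sketched (the decorator is the one named in this PR; the `batched_nms` signature here is illustrative):

```python
import torch

@torch.jit._script_if_tracing
def batched_nms(boxes, scores, idxs, iou_threshold: float):
    # compiled lazily, only when called under tracing, rather than
    # eagerly at import time
    ...
```
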
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40468

Reviewed By: jamesr66a

Differential Revision: D22195771

Pulled By: eellison

fbshipit-source-id: 83022ca0bab6d389a48a478aec03052c9282d2b7
2020-06-23 17:14:28 -07:00
92d3182c11 Revert D21232894: Unify PyTorch mobile's threadpool usage.
Test Plan: revert-hammer

Differential Revision:
D21232894 (b9d3869df3)

Original commit changeset: 8b3de86247fb

fbshipit-source-id: e6517cfec08f7dd0f4f8877dab62acf1d65afacd
2020-06-23 17:09:14 -07:00
ddb8565b25 Revert D22162469: [pytorch][PR] Migrate var & std to ATen
Test Plan: revert-hammer

Differential Revision:
D22162469 (7a3c223bbb)

Original commit changeset: 8d901c779767

fbshipit-source-id: 9e0fa439732478349c0ac6c7baafba063edfac5d
2020-06-23 17:04:15 -07:00
7e32e6048d Fix linspace step computation for large integral types (#40132)
Summary:
Convert start and end to `step_t` before computing the difference
Should fix `torch.linspace(-2147483647, 2147483647, 10, dtype=torch.int32)`

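For reference, the arithmetic that went wrong (illustrative; the kernel does this in C++, where the values below wrap in int32):

```python
start, end, steps = -2147483647, 2147483647, 10
span = end - start         # 4294967294 > 2**31 - 1: wraps in int32
step = span / (steps - 1)  # hence the fix: cast start/end to the wider
                           # step type before taking the difference
```
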
Closes https://github.com/pytorch/pytorch/issues/40118
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40132

Differential Revision: D22190095

Pulled By: malfet

fbshipit-source-id: 01cb158a30c505191df663d021804d411b697871
2020-06-23 16:59:59 -07:00
883e4c44b2 Raise exception when trying to build PyTorch on 32-bit Windows system (#40321)
Summary:
Makes errors in cases described in https://github.com/pytorch/pytorch/issues/27815 more obvious
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40321

Differential Revision: D22198352

Pulled By: malfet

fbshipit-source-id: 327d81103c066048dcf5f900fd9083b09942af0e
2020-06-23 16:54:20 -07:00
a6a2dd14ea Fix typo in warning message (#39854)
Summary:
Fix typo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39854

Reviewed By: ezyang

Differential Revision: D22193544

Pulled By: zou3519

fbshipit-source-id: 04b9f59da7b6ba0649fc6d315adcf20685e10930
2020-06-23 16:47:35 -07:00
0e26a03ef9 [quant][graphmode] Enable inplace option for top level API (#40414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40414

Now that `_reconstruct` is supported in RecursiveScriptModule (https://github.com/pytorch/pytorch/pull/39979),
we can support the inplace option in the quantization API.

Test Plan: Imported from OSS

Differential Revision: D22178326

fbshipit-source-id: c78bc2bcf2c42b06280c12262bb31aebcadc6c32
2020-06-23 16:42:48 -07:00
2e6da36298 [android][ci] Fix CI packaging headers to aar (#40442)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40442

Problem:
Nightly builds do not include libtorch headers, unlike local builds.
The reason is that the path on docker images differs from the local path when building with `scripts/build_pytorch_android.sh`.

Solution:
Introduce a gradle property to specify the path, and set it in the gradle build job and the snapshots publishing job, which run on the same docker image.

Test:
ci-all jobs check: https://github.com/pytorch/pytorch/pull/40443
Checked that the gradle build results in headers inside the aar.

Test Plan: Imported from OSS

Differential Revision: D22190955

Pulled By: IvanKobzarev

fbshipit-source-id: 9379458d8ab024ee991ca205a573c21d649e5f8a
2020-06-23 16:41:12 -07:00
b9d3869df3 Unify PyTorch mobile's threadpool usage. (#37243)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37243

*** Why ***

As it stands, we have two thread pool solutions concurrently in use in PyTorch mobile: (1) the open source pthreadpool library under third_party, and (2) Caffe2's implementation of pthreadpool under caffe2/utils/threadpool.  Since the primary use-case of the latter has been to act as a drop-in replacement for the third party version so as to enable integration and usage from within NNPACK and QNNPACK, Caffe2's implementation is intentionally written to the exact same interface as the third party version.

The original argument in favor of C2's implementation has been improved performance as a result of using spin locks, as opposed to relinquishing the thread's time slot and putting it to sleep - a less expensive operation up to a point.  That seems to have given C2's implementation the upper hand in performance, hence justifying the added maintenance complexity, until the third party version improved in parallel surpassing the efficiency of C2's implementation as I have verified in benchmarks.  With that advantage gone, there is no reason to continue using C2's implementation in PyTorch mobile either from the perspective of performance or code hygiene.  As a matter of fact, there is considerable performance benefit to be had as a result of using the third party version as it currently stands.

This is a tricky change though. Out of an abundance of caution, and to avoid potential performance regressions in production C2 use cases (of which I have witnessed none), we have decided to continue using C2's internal implementation whenever building for Caffe2, even if doing so results in reduced performance as far as I can tell.

So to summarize, today, and as it currently stands, we are using C2's implementation for (1) NNPACK, (2) PyTorch QNNPACK, and (3) ATen parallel_for on mobile builds, while using the third party version of pthreadpool for XNNPACK as XNNPACK does not provide any build options to link against an external implementation unlike NNPACK and QNNPACK do.

The goal of this PR then, is to unify all usage on mobile to the third party implementation both for improved performance and better code hygiene.  This applies to PyTorch's use of NNPACK, QNNPACK, XNNPACK, and mobile's implementation of ATen parallel_for, all getting routed to the
exact same third party implementation in this PR.

Considering that NNPACK, QNNPACK, and XNNPACK are not mobile specific, these benefits carry over to non-mobile builds of PyTorch (but not Caffe2) as well.  The implementation of ATen parallel_for on non-mobile builds remains unchanged.

*** How ***

This is where things get tricky.

A good deal of the build system complexity in this PR arises from our desire to maintain C2's implementation intact for C2's use.

pthreadpool is a C library with no concept of namespaces, which means two copies of the library cannot exist in the same binary or symbol collisions will occur, violating the ODR.  This means that somehow, and based on some condition, we must decide on the choice of a pthreadpool implementation.  In practice, this has become more complicated as a result of all the possible combinations that USE_NNPACK, USE_QNNPACK, USE_PYTORCH_QNNPACK, USE_XNNPACK, USE_SYSTEM_XNNPACK, USE_SYSTEM_PTHREADPOOL and other variables can result in.  Having said that, I have done my best in this PR to surgically cut through this complexity in a way that minimizes the side effects, considering the significance of the performance we are leaving on the table; yet, as a result of the combinatorial explosion explained above, I cannot guarantee that every single combination will work as expected on the first try.  I am heavily relying on CI to find any issues, as local testing can only go so far.

Having said that, this PR provides a simple non mobile-specific C++ thread pool implementation on top of pthreadpool, namely caffe2::PThreadPool that automatically routes to C2's implementation or the third party version depending on the build configuration.  This simplifies the logic at the cost of pushing the complexity to the build scripts.  From there on, this thread pool is used in aten parallel_for, and NNPACK and family, again, routing all usage of threading to C2 or third party pthreadpool depending on the build configuration.

When it is all said and done, the layering will look like this:

a) aten::parallel_for, uses
b) caffe2::PThreadPool, which uses
c) pthreadpool C API, which delegates to
    c-1) third_party implementation of pthreadpool if that's what the build has requested, and the rabbit hole ends here.
    c-2) C2's implementation of pthreadpool if that's what the build has requested, which itself delegates to
    c-2-1) caffe2::ThreadPool, and the rabbit hole ends here.

NNPACK, and (PyTorch) QNNPACK directly hook into (c). They never go through (b).

Differential Revision: D21232894

Test Plan: Imported from OSS

Reviewed By: dreiss

Pulled By: AshkanAliabadi

fbshipit-source-id: 8b3de86247fbc3a327e811983e082f9d40081354
2020-06-23 16:34:51 -07:00
c7d79f35e3 Header rename complex_type.h -> complex.h (#39885)
Summary:
This file should have been renamed as `complex.h`, but unfortunately, it was named as `complex_type.h` due to a name clash with FBCode. Is this still the case and is it easy to resolve the name clash? Maybe related to the comment at https://github.com/pytorch/pytorch/pull/39834#issuecomment-642950012
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39885

Differential Revision: D22018575

Pulled By: ezyang

fbshipit-source-id: e237ccedbe2b30c31aca028a5b4c8c063087a30f
2020-06-23 16:27:09 -07:00
111b399c91 Delete requires_tensor (#40184)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40184

Whenever requires_tensor is True, it is also the case that abstract
is true.  Thus, it is not necessary to specify requires_tensor.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22187353

Pulled By: ezyang

fbshipit-source-id: d665bb69cffe491bd989495020e1ae32340aa9da
2020-06-23 16:18:28 -07:00
cc9075c5d4 Add some syntax sugar for when backends use the same function. (#40182)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40182

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22187354

Pulled By: ezyang

fbshipit-source-id: 875a6a7837981b60830bd7b1c35d2a3802ed7dd7
2020-06-23 16:16:42 -07:00
d8ec19bc03 Revert D22072830: [wip] Upgrade msvc to 14.13
Test Plan: revert-hammer

Differential Revision:
D22072830

Original commit changeset: 6fa03725f3fe

fbshipit-source-id: 901de185e607810cb3871c2e4d23816848c97f4b
2020-06-23 16:13:03 -07:00
581ad48806 Revert D21581908: Move TensorOptions ops to c10
Test Plan: revert-hammer

Differential Revision:
D21581908

Original commit changeset: 6d4a9f526fd7

fbshipit-source-id: fe1e6368a09120ea40dea405e8409983541e3cb5
2020-06-23 16:10:07 -07:00
cbd53bfee8 [jit] Remove unnecessary clone APIs for script::Module and RecursiveScriptModule (#40297)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40297

Test Plan: Imported from OSS

Differential Revision: D22191660

fbshipit-source-id: 4b338ca82caaca04784bffe01fdae3d180c192f4
2020-06-23 16:03:22 -07:00
8c20fb6481 [JIT] freeze doc (#40409)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40409

Reviewed By: ezyang

Differential Revision: D22192709

Pulled By: eellison

fbshipit-source-id: 68cdb2e5040d31957fbd64690fdc03c058d13f9a
2020-06-23 15:44:03 -07:00
09285070a7 Doc fix for complex views (#40450)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40450

Test Plan: Imported from OSS

Differential Revision: D22190911

Pulled By: anjali411

fbshipit-source-id: eb13559c7a2f62d63344601c750b5715686e95c3
2020-06-23 15:03:22 -07:00
5fce7137a9 [WIP][JIT] Add ScriptModule._reconstruct (#39979)
Summary:
**Summary**
This commit adds an instance method `_reconstruct` that permits users
to reconstruct a `ScriptModule` from a given C++ `Module` instance.

**Testing**
This commit adds a unit test for `_reconstruct`.

**Fixes**
This pull request fixes https://github.com/pytorch/pytorch/issues/33912.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39979

Differential Revision: D22172323

Pulled By: SplitInfinity

fbshipit-source-id: 9aa6551c422a5a324b822a09cd8d7c660f99ca5c
2020-06-23 14:42:27 -07:00
5ad885b823 [Caffe2][Pruning] Make the caffe2 Sum operator support long types (#40379)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40379

The current Sum operator doesn't support Long types, hence this change to the code.

Test Plan: Write a test case

Reviewed By: jspark1105, yinghai

Differential Revision: D21917365

fbshipit-source-id: b37d2c100c70d17d2f89c309e40360ddfab584ee
2020-06-23 14:18:29 -07:00
b623bdeabb Move TensorOptions ops to c10 (#39492)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39492

This PR adds use_c10_dispatcher: full to ops taking TensorOptions. To allow this, since the c10 operator library doesn't know about TensorOptions, we need to register the operator kernels as optional<ScalarType>, optional<Device>, optional<Layout>, optional<bool> instead, and also call them this way.

Changes:

* Add use_c10_dispatcher: full to those ops.
* Write hacky_wrapper_for_legacy_signatures, which takes an old-style kernel (i.e. one written to take TensorOptions) and creates a wrapper kernel for it that takes the scattered optional<ScalarType>, optional<Device>, optional<Layout>, optional<bool> instead.
* Change codegen so that all op registrations are wrapped into hacky_wrapper_for_legacy_signatures. This is added to all ops but is a no-op if the op doesn't take TensorOptions. This allows us in the future to just change a kernel signature from TensorOptions to the scattered version and have it work without having to touch codegen.
* Change codegen so that the frontend calls those operators with expanded arguments instead of with a TensorOptions object. This is required because the kernels are now written this way.

This PR does not remove TensorOptions special cases from codegen; instead it separates kernels from the codegen/frontend issues. After this, kernels can be worked on separately without having to touch codegen, and codegen can be worked on without having to touch kernels.

Codegen diff: P133121032

ghstack-source-id: 106426630

Test Plan: waitforsandcastle

Differential Revision: D21581908

fbshipit-source-id: 6d4a9f526fd70fae40581bf26f3ccf794ce6a89e
2020-06-23 14:13:34 -07:00
f6b9848c25 Use chain.from_iterable in optimizer.py (#40156)
Summary:
This is a faster and more idiomatic way of using `itertools.chain`. Instead of computing all the items in the iterable and storing them in memory, they are computed one-by-one and never stored as a huge list. This can save on both runtime and memory space.
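
For illustration (a sketch, not the actual optimizer.py change):

```python
from itertools import chain

groups = [[1, 2], [3, 4], [5]]
# chain(*groups) unpacks the outer list eagerly into arguments;
# chain.from_iterable(groups) walks it lazily, one sub-iterable at a time.
flat = list(chain.from_iterable(groups))  # [1, 2, 3, 4, 5]
```
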
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40156

Reviewed By: ezyang

Differential Revision: D22189038

Pulled By: vincentqb

fbshipit-source-id: 160b2c27f442686821a6ea541e1f48f4a846c186
2020-06-23 14:07:05 -07:00
0e074074f3 Disable inlining an opaque tensor into a constant (#40367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40367

If the tensor has no storage then do not inline it as a constant. This
situation arises when Mkldnn tensors are used.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D22158240

Pulled By: bzinodev

fbshipit-source-id: 8d2879044f2429004983a1242d837367b75a9f2a
2020-06-23 13:28:31 -07:00
f000b44d89 Fork/Join Inline Docs (relanding) (#40438)
Summary:
Added fork/wait to docs/source/jit.rst; hopefully that will fix the test error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40438

Differential Revision: D22188152

Pulled By: eellison

fbshipit-source-id: c19277284455fb6e7c0138b0c1423d90b147d18e
2020-06-23 13:25:51 -07:00
d21ee2de66 [wip] Upgrade msvc to 14.13 (#40109)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40109

ghstack-source-id: 106426627

Test Plan: oss CI

Differential Revision: D22072830

fbshipit-source-id: 6fa03725f3fe272795553c9c4acf46130b8c6039
2020-06-23 13:05:36 -07:00
4632 changed files with 529495 additions and 129713 deletions


@ -31,7 +31,7 @@ Usage
1. Make changes to these scripts.
2. Run the `regenerate.sh` script in this directory and commit the script changes and the resulting change to `config.yml`.
You'll see a build failure on TravisCI if the scripts don't agree with the checked-in version.
You'll see a build failure on GitHub if the scripts don't agree with the checked-in version.
Motivation
@ -55,7 +55,7 @@ Future direction
See comment [here](https://github.com/pytorch/pytorch/pull/17323#pullrequestreview-206945747):
In contrast with a full recursive tree traversal of configuration dimensions,
> in the future future I think we actually want to decrease our matrix somewhat and have only a few mostly-orthogonal builds that taste as many different features as possible on PRs, plus a more complete suite on every PR and maybe an almost full suite nightly/weekly (we don't have this yet). Specifying PR jobs in the future might be easier to read with an explicit list when we come to this.
> in the future I think we actually want to decrease our matrix somewhat and have only a few mostly-orthogonal builds that taste as many different features as possible on PRs, plus a more complete suite on every PR and maybe an almost full suite nightly/weekly (we don't have this yet). Specifying PR jobs in the future might be easier to read with an explicit list when we come to this.
----------------
----------------
@ -90,7 +90,7 @@ The binaries are built in CircleCI. There are nightly binaries built every night
We have 3 types of binary packages
* pip packages - nightlies are stored on s3 (pip install -f <a s3 url>). releases are stored in a pip repo (pip install torch) (ask Soumith about this)
* pip packages - nightlies are stored on s3 (pip install -f \<a s3 url\>). releases are stored in a pip repo (pip install torch) (ask Soumith about this)
* conda packages - nightlies and releases are both stored in a conda repo. Nightly packages have a '_nightly' suffix
* libtorch packages - these are zips of all the c++ libraries, header files, and sometimes dependencies. These are c++ only
* shared with dependencies (the only supported option for Windows)
@ -104,16 +104,16 @@ All binaries are built in CircleCI workflows except Windows. There are checked-i
Some quick vocab:
* A\**workflow** is a CircleCI concept; it is a DAG of '**jobs**'. ctrl-f 'workflows' on\https://github.com/pytorch/pytorch/blob/master/.circleci/config.yml to see the workflows.
* A \**workflow** is a CircleCI concept; it is a DAG of '**jobs**'. ctrl-f 'workflows' on https://github.com/pytorch/pytorch/blob/master/.circleci/config.yml to see the workflows.
* **jobs** are a sequence of '**steps**'
* **steps** are usually just a bash script or a builtin CircleCI command.* All steps run in new environments, environment variables declared in one script DO NOT persist to following steps*
* **steps** are usually just a bash script or a builtin CircleCI command. *All steps run in new environments, environment variables declared in one script DO NOT persist to following steps*
* CircleCI has a **workspace**, which is essentially a cache between steps of the *same job* in which you can store artifacts between steps.
## How are the workflows structured?
The nightly binaries have 3 workflows. We have one job (actually 3 jobs: build, test, and upload) per binary configuration
1. binarybuilds
1. binary_builds
1. every day midnight EST
2. linux: https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/linux-binary-build-defaults.yml
3. macos: https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/macos-binary-build-defaults.yml
@ -144,7 +144,7 @@ The nightly binaries have 3 workflows. We have one job (actually 3 jobs: build,
## How are the jobs structured?
The jobs are in https://github.com/pytorch/pytorch/tree/master/.circleci/verbatim-sources . Jobs are made of multiple steps. There are some shared steps used by all the binaries/smokes. Steps of these jobs are all delegated to scripts in https://github.com/pytorch/pytorch/tree/master/.circleci/scripts .
The jobs are in https://github.com/pytorch/pytorch/tree/master/.circleci/verbatim-sources. Jobs are made of multiple steps. There are some shared steps used by all the binaries/smokes. Steps of these jobs are all delegated to scripts in https://github.com/pytorch/pytorch/tree/master/.circleci/scripts .
* Linux jobs: https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/linux-binary-build-defaults.yml
* binary_linux_build.sh
@ -178,8 +178,7 @@ CircleCI creates a final yaml file by inlining every <<* segment, so if we were
So, CircleCI has several executor types: macos, machine, and docker are the ones we use. The 'machine' executor gives you two cores on some linux vm. The 'docker' executor gives you considerably more cores (nproc was 32 instead of 2 back when I tried in February). Since the dockers are faster, we try to run everything that we can in dockers. Thus
* linux build jobs use the docker executor. Running them on the docker executor was at least 2x faster than running them on the machine executor
* linux test jobs use the machine executor and spin up their own docker. Why this nonsense? It's cause we run nvidia-docker for our GPU tests; any code that calls into the CUDA runtime needs to be run on nvidia-docker. To run a nvidia-docker you need to install some nvidia packages on the host machine and then call docker with the '—runtime nvidia' argument. CircleCI doesn't support this, so we have to do it ourself.
* This is not just a mere inconvenience. **This blocks all of our linux tests from using more than 2 cores.** But there is nothing that we can do about it, but wait for a fix on circleci's side. Right now, we only run some smoke tests (some simple imports) on the binaries, but this also affects non-binary test jobs.
* linux test jobs use the machine executor in order for them to properly interface with GPUs since docker executors cannot execute with attached GPUs
* linux upload jobs use the machine executor. The upload jobs are so short that it doesn't really matter what they use
* linux smoke test jobs use the machine executor for the same reason as the linux test jobs
@ -205,7 +204,7 @@ TODO: fill in stuff
## Overview
The code that runs the binaries lives in two places, in the normal [github.com/pytorch/pytorch](http://github.com/pytorch/pytorch), but also in [github.com/pytorch/builder](http://github.com/pytorch/builder) , which is a repo that defines how all the binaries are built. The relevant code is
The code that runs the binaries lives in two places, in the normal [github.com/pytorch/pytorch](http://github.com/pytorch/pytorch), but also in [github.com/pytorch/builder](http://github.com/pytorch/builder), which is a repo that defines how all the binaries are built. The relevant code is
```
@ -261,7 +260,7 @@ Linux, MacOS and Windows use the same code flow for the conda builds.
Conda packages are built with conda-build, see https://conda.io/projects/conda-build/en/latest/resources/commands/conda-build.html
Basically, you pass `conda build` a build folder (pytorch-nightly/ above) that contains a build script and a meta.yaml. The meta.yaml specifies what python environment to build the package in and what dependencies the resulting package should have, and the build script gets called in that env to build the thing.
tldr; on conda-build is
tl;dr on conda-build is
1. Creates a brand new conda environment, based off of deps in the meta.yaml
1. Note that environment variables do not get passed into this build env unless they are specified in the meta.yaml
@ -271,7 +270,7 @@ tldr; on conda-build is
4. Runs some simple import tests (if specified in the meta.yaml)
5. Saves the finished package as a tarball
The build.sh we use is essentially a wrapper around ```python setup.py build``` , but it also manually copies in some of our dependent libraries into the resulting tarball and messes with some rpaths.
The build.sh we use is essentially a wrapper around `python setup.py build`, but it also manually copies in some of our dependent libraries into the resulting tarball and messes with some rpaths.
The entrypoint file `builder/conda/build_conda.sh` is complicated because
@ -356,15 +355,15 @@ The Dockerfiles are available in pytorch/builder, but there is no circleci job o
# How to manually rebuild the binaries
tldr; make a PR that looks like https://github.com/pytorch/pytorch/pull/21159
tl;dr make a PR that looks like https://github.com/pytorch/pytorch/pull/21159
Sometimes we want to push a change to master and then rebuild all of today's binaries after that change. As of May 30, 2019 there isn't a way to manually run a workflow in the UI. You can manually re-run a workflow, but it will use the exact same git commits as the first run and will not include any changes. So we have to make a PR and then force circleci to run the binary workflow instead of the normal tests. The above PR is an example of how to do this; essentially you copy-paste the binarybuilds workflow steps into the default workflow steps. If you need to point the builder repo to a different commit then you'd need to change https://github.com/pytorch/pytorch/blob/master/.circleci/scripts/binary_checkout.sh#L42-L45 to checkout what you want.
## How to test changes to the binaries via .circleci
Writing PRs that test the binaries is annoying, since the default circleci jobs that run on PRs are not the jobs that you want to run. Likely, changes to the binaries will touch something under .circleci/ and require that .circleci/config.yml be regenerated (.circleci/config.yml controls all .circleci behavior, and is generated using ```.circleci/regenerate.sh``` in python 3.7). But you also need to manually hardcode the binary jobs that you want to test into the .circleci/config.yml workflow, so you should actually make at least two commits, one for your changes and one to temporarily hardcode jobs. See https://github.com/pytorch/pytorch/pull/22928 as an example of how to do this.
Writing PRs that test the binaries is annoying, since the default circleci jobs that run on PRs are not the jobs that you want to run. Likely, changes to the binaries will touch something under .circleci/ and require that .circleci/config.yml be regenerated (.circleci/config.yml controls all .circleci behavior, and is generated using `.circleci/regenerate.sh` in python 3.7). But you also need to manually hardcode the binary jobs that you want to test into the .circleci/config.yml workflow, so you should actually make at least two commits, one for your changes and one to temporarily hardcode jobs. See https://github.com/pytorch/pytorch/pull/22928 as an example of how to do this.
```
```sh
# Make your changes
touch .circleci/verbatim-sources/nightly-binary-build-defaults.yml
@ -409,7 +408,7 @@ The advantage of this flow is that you can make new changes to the base commit a
You can easily build Linux binaries locally using docker.
```
```sh
# Run the docker
# Use the correct docker image, pytorch/conda-cuda used here as an example
#
@ -419,8 +418,6 @@ You can build Linux binaries locally easily using docker.
# in the docker container then you will see path/to/foo/baz on your local
# machine. You could also clone the pytorch and builder repos in the docker.
#
# If you're building a CUDA binary then use `nvidia-docker run` instead, see below.
#
# If you know how, add ccache as a volume too and speed up everything
docker run \
-v your/pytorch/repo:/pytorch \
@ -444,9 +441,7 @@ export DESIRED_CUDA=cpu
**Building CUDA binaries on docker**
To build a CUDA binary you need to use `nvidia-docker run` instead of just `docker run` (or you can manually pass `--runtime=nvidia`). This adds some needed libraries and things to build CUDA stuff.
You can build CUDA binaries on CPU only machines, but you can only run CUDA binaries on CUDA machines. This means that you can build a CUDA binary on a docker on your laptop if you so choose (though its gonna take a loong time).
You can build CUDA binaries on CPU only machines, but you can only run CUDA binaries on CUDA machines. This means that you can build a CUDA binary on a docker on your laptop if you so choose (though its gonna take a long time).
For Facebook employees, ask about beefy machines that have docker support and use those instead of your laptop; it will be 5x as fast.
@ -456,7 +451,7 @@ Theres no easy way to generate reproducible hermetic MacOS environments. If y
But if you want to try, then I'd recommend
```
```sh
# Create a new terminal
# Clear your LD_LIBRARY_PATH and trim as much out of your PATH as you
# know how to do


@ -25,15 +25,17 @@ DEPS_INCLUSION_DIMENSIONS = [
]
def get_processor_arch_name(cuda_version):
return "cpu" if not cuda_version else "cu" + cuda_version
def get_processor_arch_name(gpu_version):
return "cpu" if not gpu_version else (
"cu" + gpu_version.strip("cuda") if gpu_version.startswith("cuda") else gpu_version
)
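
A quick sanity check of the new mapping (a sketch; the labels match the GPU_VERSIONS list defined later in this diff):

```python
assert get_processor_arch_name(None) == "cpu"              # CPU-only build
assert get_processor_arch_name("cuda102") == "cu102"
assert get_processor_arch_name("rocm3.10") == "rocm3.10"   # ROCm labels pass through

# Note: str.strip removes *characters*, not a prefix; this works here only
# because the version digits after "cuda" are not in the set {c, u, d, a}.
```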
LINUX_PACKAGE_VARIANTS = OrderedDict(
manywheel=[
"3.6m",
"3.7m",
"3.8m",
"3.9m"
],
conda=dimensions.STANDARD_PYTHON_VERSIONS,
libtorch=[
@ -42,7 +44,7 @@ LINUX_PACKAGE_VARIANTS = OrderedDict(
)
CONFIG_TREE_DATA = OrderedDict(
linux=(dimensions.CUDA_VERSIONS, LINUX_PACKAGE_VARIANTS),
linux=(dimensions.GPU_VERSIONS, LINUX_PACKAGE_VARIANTS),
macos=([None], OrderedDict(
wheel=dimensions.STANDARD_PYTHON_VERSIONS,
conda=dimensions.STANDARD_PYTHON_VERSIONS,
@ -50,13 +52,25 @@ CONFIG_TREE_DATA = OrderedDict(
"3.7",
],
)),
windows=(dimensions.CUDA_VERSIONS, OrderedDict(
wheel=dimensions.STANDARD_PYTHON_VERSIONS,
conda=dimensions.STANDARD_PYTHON_VERSIONS,
libtorch=[
"3.7",
macos_arm64=([None], OrderedDict(
wheel=[
"3.8",
],
conda=[
"3.8",
],
)),
# Skip CUDA-9.2 builds on Windows
windows=(
[v for v in dimensions.GPU_VERSIONS if v not in ['cuda92'] + dimensions.ROCM_VERSION_LABELS],
OrderedDict(
wheel=dimensions.STANDARD_PYTHON_VERSIONS,
conda=dimensions.STANDARD_PYTHON_VERSIONS,
libtorch=[
"3.7",
],
)
),
)
# GCC config variants:
@ -93,12 +107,12 @@ class TopLevelNode(ConfigNode):
class OSConfigNode(ConfigNode):
def __init__(self, parent, os_name, cuda_versions, py_tree):
def __init__(self, parent, os_name, gpu_versions, py_tree):
super(OSConfigNode, self).__init__(parent, os_name)
self.py_tree = py_tree
self.props["os_name"] = os_name
self.props["cuda_versions"] = cuda_versions
self.props["gpu_versions"] = gpu_versions
def get_children(self):
return [PackageFormatConfigNode(self, k, v) for k, v in self.py_tree.items()]
@ -117,7 +131,7 @@ class PackageFormatConfigNode(ConfigNode):
elif self.find_prop("os_name") == "windows" and self.find_prop("package_format") == "libtorch":
return [WindowsLibtorchConfigNode(self, v) for v in WINDOWS_LIBTORCH_CONFIG_VARIANTS]
else:
return [ArchConfigNode(self, v) for v in self.find_prop("cuda_versions")]
return [ArchConfigNode(self, v) for v in self.find_prop("gpu_versions")]
class LinuxGccConfigNode(ConfigNode):
@ -127,14 +141,22 @@ class LinuxGccConfigNode(ConfigNode):
self.props["gcc_config_variant"] = gcc_config_variant
def get_children(self):
cuda_versions = self.find_prop("cuda_versions")
gpu_versions = self.find_prop("gpu_versions")
# XXX devtoolset7 on CUDA 9.0 is temporarily disabled
# see https://github.com/pytorch/pytorch/issues/20066
if self.find_prop("gcc_config_variant") == 'devtoolset7':
cuda_versions = filter(lambda x: x != "90", cuda_versions)
gpu_versions = filter(lambda x: x != "cuda_90", gpu_versions)
return [ArchConfigNode(self, v) for v in cuda_versions]
# XXX disabling conda rocm build since docker images are not there
if self.find_prop("package_format") == 'conda':
gpu_versions = filter(lambda x: x not in dimensions.ROCM_VERSION_LABELS, gpu_versions)
# XXX libtorch rocm build is temporarily disabled
if self.find_prop("package_format") == 'libtorch':
gpu_versions = filter(lambda x: x not in dimensions.ROCM_VERSION_LABELS, gpu_versions)
return [ArchConfigNode(self, v) for v in gpu_versions]
class WindowsLibtorchConfigNode(ConfigNode):
@ -144,14 +166,14 @@ class WindowsLibtorchConfigNode(ConfigNode):
self.props["libtorch_config_variant"] = libtorch_config_variant
def get_children(self):
return [ArchConfigNode(self, v) for v in self.find_prop("cuda_versions")]
return [ArchConfigNode(self, v) for v in self.find_prop("gpu_versions")]
class ArchConfigNode(ConfigNode):
def __init__(self, parent, cu):
super(ArchConfigNode, self).__init__(parent, get_processor_arch_name(cu))
def __init__(self, parent, gpu):
super(ArchConfigNode, self).__init__(parent, get_processor_arch_name(gpu))
self.props["cu"] = cu
self.props["gpu"] = gpu
def get_children(self):
return [PyVersionConfigNode(self, v) for v in self.find_prop("python_versions")]


@ -6,10 +6,10 @@ import cimodel.lib.conf_tree as conf_tree
import cimodel.lib.miniutils as miniutils
class Conf(object):
def __init__(self, os, cuda_version, pydistro, parms, smoke, libtorch_variant, gcc_config_variant, libtorch_config_variant):
def __init__(self, os, gpu_version, pydistro, parms, smoke, libtorch_variant, gcc_config_variant, libtorch_config_variant):
self.os = os
self.cuda_version = cuda_version
self.gpu_version = gpu_version
self.pydistro = pydistro
self.parms = parms
self.smoke = smoke
@ -18,7 +18,7 @@ class Conf(object):
self.libtorch_config_variant = libtorch_config_variant
def gen_build_env_parms(self):
elems = [self.pydistro] + self.parms + [binary_build_data.get_processor_arch_name(self.cuda_version)]
elems = [self.pydistro] + self.parms + [binary_build_data.get_processor_arch_name(self.gpu_version)]
if self.gcc_config_variant is not None:
elems.append(str(self.gcc_config_variant))
if self.libtorch_config_variant is not None:
@ -37,9 +37,12 @@ class Conf(object):
docker_distro_prefix = miniutils.override(self.pydistro, docker_word_substitution)
# The cpu nightlies are built on the pytorch/manylinux-cuda102 docker image
alt_docker_suffix = self.cuda_version or "102"
docker_distro_suffix = "" if self.pydistro == "conda" else alt_docker_suffix
return miniutils.quote("pytorch/" + docker_distro_prefix + "-cuda" + docker_distro_suffix)
# TODO cuda images should consolidate into tag-base images similar to rocm
alt_docker_suffix = "cuda102" if not self.gpu_version else (
"rocm:" + self.gpu_version.strip("rocm") if self.gpu_version.startswith("rocm") else self.gpu_version)
docker_distro_suffix = alt_docker_suffix if self.pydistro != "conda" else (
"cuda" if alt_docker_suffix.startswith("cuda") else "rocm")
return miniutils.quote("pytorch/" + docker_distro_prefix + "-" + docker_distro_suffix)
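
Traced by hand, the new suffix logic yields image names like these (a sketch: `docker_word_substitution` lies outside this hunk, so the manywheel-to-manylinux mapping is assumed, and the `miniutils.quote` wrapping is omitted):

```python
def sketch_docker_image(pydistro, gpu_version):
    # Hypothetical restatement of Conf.gen_docker_image above.
    prefix = {"manywheel": "manylinux"}.get(pydistro, pydistro)  # assumed substitution
    suffix = "cuda102" if not gpu_version else (
        "rocm:" + gpu_version.strip("rocm") if gpu_version.startswith("rocm")
        else gpu_version)
    if pydistro == "conda":
        suffix = "cuda" if suffix.startswith("cuda") else "rocm"
    return "pytorch/" + prefix + "-" + suffix

assert sketch_docker_image("manywheel", None) == "pytorch/manylinux-cuda102"
assert sketch_docker_image("manywheel", "rocm3.10") == "pytorch/manylinux-rocm:3.10"
assert sketch_docker_image("conda", "cuda102") == "pytorch/conda-cuda"
```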
def get_name_prefix(self):
return "smoke" if self.smoke else "binary"
@ -69,14 +72,10 @@ class Conf(object):
"update_s3_htmls",
]
job_def["filters"] = branch_filters.gen_filter_dict(
branches_list=["nightly"],
tags_list=[branch_filters.RC_PATTERN],
branches_list=["postnightly"],
)
else:
if phase in ["upload"]:
filter_branch = "nightly"
else:
filter_branch = r"/.*/"
filter_branch = r"/.*/"
job_def["filters"] = branch_filters.gen_filter_dict(
branches_list=[filter_branch],
tags_list=[branch_filters.RC_PATTERN],
@ -89,28 +88,61 @@ class Conf(object):
if not (self.smoke and self.os == "macos") and self.os != "windows":
job_def["docker_image"] = self.gen_docker_image()
if self.os != "windows" and self.cuda_version:
# fix this. only works on cuda not rocm
if self.os != "windows" and self.gpu_version:
job_def["use_cuda_docker_runtime"] = miniutils.quote("1")
else:
if self.os == "linux" and phase != "upload":
job_def["docker_image"] = self.gen_docker_image()
if phase == "test":
if self.cuda_version:
if self.gpu_version:
if self.os == "windows":
job_def["executor"] = "windows-with-nvidia-gpu"
else:
job_def["resource_class"] = "gpu.medium"
if phase == "upload":
job_def["context"] = "org-member"
job_def["requires"] = [
self.gen_build_name(upload_phase_dependency, nightly)
]
os_name = miniutils.override(self.os, {"macos": "mac"})
job_name = "_".join([self.get_name_prefix(), os_name, phase])
return {job_name : job_def}
def gen_upload_job(self, phase, requires_dependency):
"""Generate binary_upload job for configuration
Output looks similar to:
- binary_upload:
name: binary_linux_manywheel_3_7m_cu92_devtoolset7_nightly_upload
context: org-member
requires: binary_linux_manywheel_3_7m_cu92_devtoolset7_nightly_test
filters:
branches:
only:
- nightly
tags:
only: /v[0-9]+(\\.[0-9]+)*-rc[0-9]+/
package_type: manywheel
upload_subfolder: cu92
"""
return {
"binary_upload": OrderedDict({
"name": self.gen_build_name(phase, nightly=True),
"context": "org-member",
"requires": [self.gen_build_name(
requires_dependency,
nightly=True
)],
"filters": branch_filters.gen_filter_dict(
branches_list=["nightly"],
tags_list=[branch_filters.RC_PATTERN],
),
"package_type": self.pydistro,
"upload_subfolder": binary_build_data.get_processor_arch_name(
self.gpu_version,
),
})
}
def get_root(smoke, name):
return binary_build_data.TopLevelNode(
@ -129,10 +161,10 @@ def gen_build_env_list(smoke):
for c in config_list:
conf = Conf(
c.find_prop("os_name"),
c.find_prop("cu"),
c.find_prop("gpu"),
c.find_prop("package_format"),
[c.find_prop("pyver")],
c.find_prop("smoke"),
c.find_prop("smoke") and not (c.find_prop("os_name") == "macos_arm64"), # don't test arm64
c.find_prop("libtorch_variant"),
c.find_prop("gcc_config_variant"),
c.find_prop("libtorch_config_variant"),
@ -149,32 +181,19 @@ def get_nightly_uploads():
mylist = []
for conf in configs:
phase_dependency = "test" if predicate_exclude_macos(conf) else "build"
mylist.append(conf.gen_workflow_job("upload", phase_dependency, nightly=True))
mylist.append(conf.gen_upload_job("upload", phase_dependency))
return mylist
def get_post_upload_jobs():
"""Generate jobs to update HTML indices and report binary sizes"""
configs = gen_build_env_list(False)
common_job_def = {
"context": "org-member",
"filters": branch_filters.gen_filter_dict(
branches_list=["nightly"],
tags_list=[branch_filters.RC_PATTERN],
),
"requires": [],
}
for conf in configs:
upload_job_name = conf.gen_build_name(
build_or_test="upload",
nightly=True
)
common_job_def["requires"].append(upload_job_name)
return [
{
"update_s3_htmls": {
"name": "update_s3_htmls",
**common_job_def,
"context": "org-member",
"filters": branch_filters.gen_filter_dict(
branches_list=["postnightly"],
),
},
},
]
@ -197,7 +216,9 @@ def get_jobs(toplevel_key, smoke):
configs = gen_build_env_list(smoke)
phase = "build" if toplevel_key == "binarybuilds" else "test"
for build_config in configs:
jobs_list.append(build_config.gen_workflow_job(phase, nightly=True))
# don't test for macos_arm64 as it's cross compiled
if phase != "test" or build_config.os != "macos_arm64":
jobs_list.append(build_config.gen_workflow_job(phase, nightly=True))
return jobs_list


@ -1,91 +0,0 @@
from cimodel.lib.conf_tree import ConfigNode, XImportant
from cimodel.lib.conf_tree import Ver
CONFIG_TREE_DATA = [
(Ver("ubuntu", "16.04"), [
([Ver("clang", "7")], [XImportant("onnx_main_py3.6"),
XImportant("onnx_ort1_py3.6"),
XImportant("onnx_ort2_py3.6")]),
]),
]
class TreeConfigNode(ConfigNode):
def __init__(self, parent, node_name, subtree):
super(TreeConfigNode, self).__init__(parent, self.modify_label(node_name))
self.subtree = subtree
self.init2(node_name)
# noinspection PyMethodMayBeStatic
def modify_label(self, label):
return str(label)
def init2(self, node_name):
pass
def get_children(self):
return [self.child_constructor()(self, k, v) for (k, v) in self.subtree]
def is_build_only(self):
if str(self.find_prop("language_version")) == "onnx_main_py3.6" or \
str(self.find_prop("language_version")) == "onnx_ort1_py3.6" or \
str(self.find_prop("language_version")) == "onnx_ort2_py3.6":
return False
return set(str(c) for c in self.find_prop("compiler_version")).intersection({
"clang3.8",
"clang3.9",
"clang7",
"android",
}) or self.find_prop("distro_version").name == "macos"
def is_test_only(self):
if str(self.find_prop("language_version")) == "onnx_ort1_py3.6" or \
str(self.find_prop("language_version")) == "onnx_ort2_py3.6":
return True
return False
class TopLevelNode(TreeConfigNode):
def __init__(self, node_name, subtree):
super(TopLevelNode, self).__init__(None, node_name, subtree)
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return DistroConfigNode
class DistroConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["distro_version"] = node_name
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return CompilerConfigNode
class CompilerConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["compiler_version"] = node_name
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return LanguageConfigNode
class LanguageConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["language_version"] = node_name
self.props["build_only"] = self.is_build_only()
self.props["test_only"] = self.is_test_only()
def child_constructor(self):
return ImportantConfigNode
class ImportantConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["important"] = True
def get_children(self):
return []


@ -1,174 +0,0 @@
from collections import OrderedDict
import cimodel.data.dimensions as dimensions
import cimodel.lib.conf_tree as conf_tree
from cimodel.lib.conf_tree import Ver
import cimodel.lib.miniutils as miniutils
from cimodel.data.caffe2_build_data import CONFIG_TREE_DATA, TopLevelNode
from cimodel.data.simple.util.branch_filters import gen_filter_dict
from dataclasses import dataclass
DOCKER_IMAGE_PATH_BASE = "308535385114.dkr.ecr.us-east-1.amazonaws.com/caffe2/"
DOCKER_IMAGE_VERSION = "376"
@dataclass
class Conf:
language: str
distro: Ver
# There could be multiple compiler versions configured (e.g. nvcc
# for gpu files and host compiler (gcc/clang) for cpu files)
compilers: [Ver]
build_only: bool
test_only: bool
is_important: bool
@property
def compiler_names(self):
return [c.name for c in self.compilers]
# TODO: Eventually we can probably just remove the cudnn7 everywhere.
def get_cudnn_insertion(self):
omit = self.language == "onnx_main_py3.6" \
or self.language == "onnx_ort1_py3.6" \
or self.language == "onnx_ort2_py3.6" \
or set(self.compiler_names).intersection({"android", "mkl", "clang"}) \
or str(self.distro) in ["ubuntu14.04", "macos10.13"]
return [] if omit else ["cudnn7"]
def get_build_name_root_parts(self):
return [
"caffe2",
self.language,
] + self.get_build_name_middle_parts()
def get_build_name_middle_parts(self):
return [str(c) for c in self.compilers] + self.get_cudnn_insertion() + [str(self.distro)]
def construct_phase_name(self, phase):
root_parts = self.get_build_name_root_parts()
build_name_substitutions = {
"onnx_ort1_py3.6": "onnx_main_py3.6",
"onnx_ort2_py3.6": "onnx_main_py3.6",
}
if phase == "build":
root_parts = [miniutils.override(r, build_name_substitutions) for r in root_parts]
return "_".join(root_parts + [phase]).replace(".", "_")
def get_platform(self):
platform = self.distro.name
if self.distro.name != "macos":
platform = "linux"
return platform
def gen_docker_image(self):
lang_substitutions = {
"onnx_main_py3.6": "py3.6",
"onnx_ort1_py3.6": "py3.6",
"onnx_ort2_py3.6": "py3.6",
"cmake": "py3",
}
lang = miniutils.override(self.language, lang_substitutions)
parts = [lang] + self.get_build_name_middle_parts()
return miniutils.quote(DOCKER_IMAGE_PATH_BASE + "-".join(parts) + ":" + str(DOCKER_IMAGE_VERSION))
def gen_workflow_params(self, phase):
parameters = OrderedDict()
lang_substitutions = {
"onnx_py3": "onnx-py3",
"onnx_main_py3.6": "onnx-main-py3.6",
"onnx_ort1_py3.6": "onnx-ort1-py3.6",
"onnx_ort2_py3.6": "onnx-ort2-py3.6",
}
lang = miniutils.override(self.language, lang_substitutions)
parts = [
"caffe2",
lang,
] + self.get_build_name_middle_parts() + [phase]
build_env_name = "-".join(parts)
parameters["build_environment"] = miniutils.quote(build_env_name)
if "ios" in self.compiler_names:
parameters["build_ios"] = miniutils.quote("1")
if phase == "test":
# TODO cuda should not be considered a compiler
if "cuda" in self.compiler_names:
parameters["use_cuda_docker_runtime"] = miniutils.quote("1")
if self.distro.name != "macos":
parameters["docker_image"] = self.gen_docker_image()
if self.build_only:
parameters["build_only"] = miniutils.quote("1")
if phase == "test":
resource_class = "large" if "cuda" not in self.compiler_names else "gpu.medium"
parameters["resource_class"] = resource_class
return parameters
def gen_workflow_job(self, phase):
job_def = OrderedDict()
job_def["name"] = self.construct_phase_name(phase)
if phase == "test":
job_def["requires"] = [self.construct_phase_name("build")]
job_name = "caffe2_" + self.get_platform() + "_test"
else:
job_name = "caffe2_" + self.get_platform() + "_build"
if not self.is_important:
job_def["filters"] = gen_filter_dict()
job_def.update(self.gen_workflow_params(phase))
return {job_name : job_def}
def get_root():
return TopLevelNode("Caffe2 Builds", CONFIG_TREE_DATA)
def instantiate_configs():
config_list = []
root = get_root()
found_configs = conf_tree.dfs(root)
for fc in found_configs:
c = Conf(
language=fc.find_prop("language_version"),
distro=fc.find_prop("distro_version"),
compilers=fc.find_prop("compiler_version"),
build_only=fc.find_prop("build_only"),
test_only=fc.find_prop("test_only"),
is_important=fc.find_prop("important"),
)
config_list.append(c)
return config_list
def get_workflow_jobs():
configs = instantiate_configs()
x = []
for conf_options in configs:
phases = ["build"]
if not conf_options.build_only:
phases = dimensions.PHASES
if conf_options.test_only:
phases = ["test"]
for phase in phases:
x.append(conf_options.gen_workflow_job(phase))
return x


@ -1,14 +1,23 @@
PHASES = ["build", "test"]
CUDA_VERSIONS = [
None, # cpu build
"92",
"101",
"102",
"111",
]
ROCM_VERSIONS = [
"3.10",
"4.0.1",
]
ROCM_VERSION_LABELS = ["rocm" + v for v in ROCM_VERSIONS]
GPU_VERSIONS = [None] + ["cuda" + v for v in CUDA_VERSIONS] + ROCM_VERSION_LABELS
STANDARD_PYTHON_VERSIONS = [
"3.6",
"3.7",
"3.8"
"3.8",
"3.9"
]
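
Evaluated, the new dimensions expand to a single flat list (the None CPU entry now lives in GPU_VERSIONS rather than CUDA_VERSIONS, since "cuda" + None would fail in the comprehension):

```python
CUDA_VERSIONS = ["92", "101", "102", "111"]
ROCM_VERSIONS = ["3.10", "4.0.1"]
ROCM_VERSION_LABELS = ["rocm" + v for v in ROCM_VERSIONS]
GPU_VERSIONS = [None] + ["cuda" + v for v in CUDA_VERSIONS] + ROCM_VERSION_LABELS

assert GPU_VERSIONS == [
    None, "cuda92", "cuda101", "cuda102", "cuda111", "rocm3.10", "rocm4.0.1",
]
```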


@ -3,15 +3,13 @@ from cimodel.lib.conf_tree import ConfigNode, X, XImportant
CONFIG_TREE_DATA = [
("xenial", [
(None, [
X("nightly"),
]),
("gcc", [
("5.4", [ # All this subtree rebases to master and then build
XImportant("3.6"),
("3.6", [
("important", [X(True)]),
("parallel_tbb", [X(True)]),
("parallel_native", [X(True)]),
("pure_torch", [X(True)]),
]),
]),
# TODO: bring back libtorch test
@ -19,21 +17,54 @@ CONFIG_TREE_DATA = [
]),
("clang", [
("5", [
XImportant("3.6"), # This is actually the ASAN build
("3.6", [
("asan", [
(True, [
("shard_test", [XImportant(True)]),
]),
]),
]),
]),
("7", [
("3.6", [
("onnx", [XImportant(True)]),
]),
]),
]),
("cuda", [
("9.2", [
X("3.6"),
("3.6", [
("cuda_gcc_override", [X("gcc5.4")])
X(True),
("cuda_gcc_override", [
("gcc5.4", [
('build_only', [XImportant(True)]),
]),
]),
])
]),
("10.1", [X("3.6")]),
("10.2", [
XImportant("3.6"),
("10.1", [
("3.6", [
("libtorch", [XImportant(True)])
('build_only', [X(True)]),
]),
]),
("10.2", [
("3.6", [
("shard_test", [XImportant(True)]),
("libtorch", [
(True, [
('build_only', [X(True)]),
]),
]),
]),
]),
("11.1", [
("3.8", [
X(True),
("libtorch", [
(True, [
('build_only', [XImportant(True)]),
]),
]),
]),
]),
]),
@ -46,11 +77,27 @@ CONFIG_TREE_DATA = [
("9", [
("3.6", [
("xla", [XImportant(True)]),
("vulkan", [XImportant(True)]),
]),
]),
]),
("gcc", [
("9", [XImportant("3.8")]),
("9", [
("3.8", [
("coverage", [
(True, [
("shard_test", [XImportant(True)]),
]),
]),
]),
]),
]),
("rocm", [
("3.9", [
("3.6", [
('build_only', [XImportant(True)]),
]),
]),
]),
]),
]
@ -118,17 +165,34 @@ class ExperimentalFeatureConfigNode(TreeConfigNode):
experimental_feature = self.find_prop("experimental_feature")
next_nodes = {
"asan": AsanConfigNode,
"xla": XlaConfigNode,
"vulkan": VulkanConfigNode,
"parallel_tbb": ParallelTBBConfigNode,
"parallel_native": ParallelNativeConfigNode,
"onnx": ONNXConfigNode,
"libtorch": LibTorchConfigNode,
"important": ImportantConfigNode,
"build_only": BuildOnlyConfigNode,
"cuda_gcc_override": CudaGccOverrideConfigNode
"shard_test": ShardTestConfigNode,
"cuda_gcc_override": CudaGccOverrideConfigNode,
"coverage": CoverageConfigNode,
"pure_torch": PureTorchConfigNode,
}
return next_nodes[experimental_feature]
class PureTorchConfigNode(TreeConfigNode):
def modify_label(self, label):
return "PURE_TORCH=" + str(label)
def init2(self, node_name):
self.props["is_pure_torch"] = node_name
def child_constructor(self):
return ImportantConfigNode
class XlaConfigNode(TreeConfigNode):
def modify_label(self, label):
return "XLA=" + str(label)
@ -140,6 +204,39 @@ class XlaConfigNode(TreeConfigNode):
return ImportantConfigNode
class AsanConfigNode(TreeConfigNode):
def modify_label(self, label):
return "Asan=" + str(label)
def init2(self, node_name):
self.props["is_asan"] = node_name
def child_constructor(self):
return ExperimentalFeatureConfigNode
class ONNXConfigNode(TreeConfigNode):
def modify_label(self, label):
return "Onnx=" + str(label)
def init2(self, node_name):
self.props["is_onnx"] = node_name
def child_constructor(self):
return ImportantConfigNode
class VulkanConfigNode(TreeConfigNode):
def modify_label(self, label):
return "Vulkan=" + str(label)
def init2(self, node_name):
self.props["is_vulkan"] = node_name
def child_constructor(self):
return ImportantConfigNode
class ParallelTBBConfigNode(TreeConfigNode):
def modify_label(self, label):
return "PARALLELTBB=" + str(label)
@ -170,7 +267,7 @@ class LibTorchConfigNode(TreeConfigNode):
self.props["is_libtorch"] = node_name
def child_constructor(self):
return ImportantConfigNode
return ExperimentalFeatureConfigNode
class CudaGccOverrideConfigNode(TreeConfigNode):
@ -178,17 +275,33 @@ class CudaGccOverrideConfigNode(TreeConfigNode):
self.props["cuda_gcc_override"] = node_name
def child_constructor(self):
return ImportantConfigNode
return ExperimentalFeatureConfigNode
class BuildOnlyConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["build_only"] = node_name
def child_constructor(self):
return ExperimentalFeatureConfigNode
class ShardTestConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["shard_test"] = node_name
def child_constructor(self):
return ImportantConfigNode
class CoverageConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["is_coverage"] = node_name
def child_constructor(self):
return ExperimentalFeatureConfigNode
class ImportantConfigNode(TreeConfigNode):
def modify_label(self, label):
return "IMPORTANT=" + str(label)
@ -201,7 +314,6 @@ class ImportantConfigNode(TreeConfigNode):
class XenialCompilerConfigNode(TreeConfigNode):
def modify_label(self, label):
return label or "<unspecified>"
@ -215,7 +327,6 @@ class XenialCompilerConfigNode(TreeConfigNode):
class BionicCompilerConfigNode(TreeConfigNode):
def modify_label(self, label):
return label or "<unspecified>"


@ -1,14 +1,13 @@
from collections import OrderedDict
from dataclasses import dataclass, field
from typing import List, Optional
from cimodel.data.pytorch_build_data import TopLevelNode, CONFIG_TREE_DATA
import cimodel.data.dimensions as dimensions
import cimodel.lib.conf_tree as conf_tree
import cimodel.lib.miniutils as miniutils
from cimodel.data.simple.util.branch_filters import gen_filter_dict
from cimodel.data.simple.util.docker_constants import gen_docker_image_path
from dataclasses import dataclass, field
from typing import List, Optional
from cimodel.data.pytorch_build_data import CONFIG_TREE_DATA, TopLevelNode
from cimodel.data.simple.util.branch_filters import gen_filter_dict, RC_PATTERN
from cimodel.data.simple.util.docker_constants import gen_docker_image
@dataclass
@ -18,19 +17,25 @@ class Conf:
parms_list_ignored_for_docker_image: Optional[List[str]] = None
pyver: Optional[str] = None
cuda_version: Optional[str] = None
rocm_version: Optional[str] = None
# TODO expand this to cover all the USE_* that we want to test for
# tensorrt, leveldb, lmdb, redis, opencv, mkldnn, ideep, etc.
# (from https://github.com/pytorch/pytorch/pull/17323#discussion_r259453608)
is_xla: bool = False
vulkan: bool = False
is_vulkan: bool = False
is_pure_torch: bool = False
restrict_phases: Optional[List[str]] = None
gpu_resource: Optional[str] = None
dependent_tests: List = field(default_factory=list)
parent_build: Optional['Conf'] = None
parent_build: Optional["Conf"] = None
is_libtorch: bool = False
is_important: bool = False
parallel_backend: Optional[str] = None
@staticmethod
def is_test_phase(phase):
return "test" in phase
# TODO: Eliminate the special casing for docker paths
# In the short term, we *will* need to support special casing as docker images are merged for caffe2 and pytorch
def get_parms(self, for_docker):
@ -42,31 +47,47 @@ class Conf:
leading.append("pytorch")
if self.is_xla and not for_docker:
leading.append("xla")
if self.is_vulkan and not for_docker:
leading.append("vulkan")
if self.is_libtorch and not for_docker:
leading.append("libtorch")
if self.is_pure_torch and not for_docker:
leading.append("pure_torch")
if self.parallel_backend is not None and not for_docker:
leading.append(self.parallel_backend)
cuda_parms = []
if self.cuda_version:
cuda_parms.extend(["cuda" + self.cuda_version, "cudnn7"])
cudnn = "cudnn8" if self.cuda_version.startswith("11.") else "cudnn7"
cuda_parms.extend(["cuda" + self.cuda_version, cudnn])
if self.rocm_version:
cuda_parms.extend([f"rocm{self.rocm_version}"])
result = leading + ["linux", self.distro] + cuda_parms + self.parms
if not for_docker and self.parms_list_ignored_for_docker_image is not None:
result = result + self.parms_list_ignored_for_docker_image
return result
def gen_docker_image_path(self):
parms_source = self.parent_build or self
base_build_env_name = "-".join(parms_source.get_parms(True))
image_name, _ = gen_docker_image(base_build_env_name)
return miniutils.quote(image_name)
return miniutils.quote(gen_docker_image_path(base_build_env_name))
def gen_docker_image_requires(self):
parms_source = self.parent_build or self
base_build_env_name = "-".join(parms_source.get_parms(True))
_, requires = gen_docker_image(base_build_env_name)
return miniutils.quote(requires)
def get_build_job_name_pieces(self, build_or_test):
return self.get_parms(False) + [build_or_test]
def gen_build_name(self, build_or_test):
return ("_".join(map(str, self.get_build_job_name_pieces(build_or_test)))).replace(".", "_").replace("-", "_")
return (
("_".join(map(str, self.get_build_job_name_pieces(build_or_test))))
.replace(".", "_")
.replace("-", "_")
)
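
For example, the name munging above turns a hypothetical pieces list into the underscore-only job names seen throughout config.yml:

```python
pieces = ["pytorch", "linux", "xenial", "cuda10.2", "cudnn7", "py3", "build"]
name = "_".join(map(str, pieces)).replace(".", "_").replace("-", "_")
assert name == "pytorch_linux_xenial_cuda10_2_cudnn7_py3_build"
```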
def get_dependents(self):
return self.dependent_tests or []
@ -78,20 +99,26 @@ class Conf:
build_env_name = "-".join(map(str, build_job_name_pieces))
parameters["build_environment"] = miniutils.quote(build_env_name)
parameters["docker_image"] = self.gen_docker_image_path()
if phase == "test" and self.gpu_resource:
if Conf.is_test_phase(phase) and self.gpu_resource:
parameters["use_cuda_docker_runtime"] = miniutils.quote("1")
if phase == "test":
if Conf.is_test_phase(phase):
resource_class = "large"
if self.gpu_resource:
resource_class = "gpu." + self.gpu_resource
if self.rocm_version is not None:
resource_class = "pytorch/amd-gpu"
parameters["resource_class"] = resource_class
if phase == "build" and self.rocm_version is not None:
parameters["resource_class"] = "xlarge"
if hasattr(self, 'filters'):
parameters['filters'] = self.filters
return parameters
def gen_workflow_job(self, phase):
job_def = OrderedDict()
job_def["name"] = self.gen_build_name(phase)
if phase == "test":
if Conf.is_test_phase(phase):
# TODO When merging the caffe2 and pytorch jobs, it might be convenient for a while to make a
# caffe2 test job dependent on a pytorch build job. This way we could quickly dedup the repeated
@ -103,36 +130,59 @@ class Conf:
job_name = "pytorch_linux_test"
else:
job_name = "pytorch_linux_build"
job_def["requires"] = [self.gen_docker_image_requires()]
if not self.is_important:
job_def["filters"] = gen_filter_dict()
job_def.update(self.gen_workflow_params(phase))
return {job_name : job_def}
return {job_name: job_def}
# TODO This is a hack to special case some configs just for the workflow list
class HiddenConf(object):
def __init__(self, name, parent_build=None):
def __init__(self, name, parent_build=None, filters=None):
self.name = name
self.parent_build = parent_build
self.filters = filters
def gen_workflow_job(self, phase):
return {self.gen_build_name(phase): {"requires": [self.parent_build.gen_build_name("build")]}}
return {
self.gen_build_name(phase): {
"requires": [self.parent_build.gen_build_name("build")],
"filters": self.filters,
}
}
def gen_build_name(self, _):
return self.name
class DocPushConf(object):
def __init__(self, name, parent_build=None, branch="master"):
self.name = name
self.parent_build = parent_build
self.branch = branch
def gen_workflow_job(self, phase):
return {
"pytorch_doc_push": {
"name": self.name,
"branch": self.branch,
"requires": [self.parent_build],
"context": "org-member",
"filters": gen_filter_dict(branches_list=["nightly"],
tags_list=RC_PATTERN)
}
}
# TODO Convert these to graph nodes
def gen_dependent_configs(xenial_parent_config):
extra_parms = [
(["multigpu"], "large"),
(["NO_AVX2"], "medium"),
(["NO_AVX", "NO_AVX2"], "medium"),
(["nogpu", "NO_AVX2"], None),
(["nogpu", "NO_AVX"], None),
(["slow"], "medium"),
(["nogpu"], None),
]
configs = []
@ -141,12 +191,12 @@ def gen_dependent_configs(xenial_parent_config):
c = Conf(
xenial_parent_config.distro,
["py3"] + parms,
pyver="3.6",
pyver=xenial_parent_config.pyver,
cuda_version=xenial_parent_config.cuda_version,
restrict_phases=["test"],
gpu_resource=gpu,
parent_build=xenial_parent_config,
is_important=xenial_parent_config.is_important,
is_important=False,
)
configs.append(c)
@ -157,9 +207,44 @@ def gen_dependent_configs(xenial_parent_config):
def gen_docs_configs(xenial_parent_config):
configs = []
for x in ["pytorch_python_doc_push", "pytorch_cpp_doc_push", "pytorch_doc_test"]:
configs.append(HiddenConf(x, parent_build=xenial_parent_config))
configs.append(
HiddenConf(
"pytorch_python_doc_build",
parent_build=xenial_parent_config,
filters=gen_filter_dict(branches_list=r"/.*/",
tags_list=RC_PATTERN),
)
)
configs.append(
DocPushConf(
"pytorch_python_doc_push",
parent_build="pytorch_python_doc_build",
branch="site",
)
)
configs.append(
HiddenConf(
"pytorch_cpp_doc_build",
parent_build=xenial_parent_config,
filters=gen_filter_dict(branches_list=r"/.*/",
tags_list=RC_PATTERN),
)
)
configs.append(
DocPushConf(
"pytorch_cpp_doc_push",
parent_build="pytorch_cpp_doc_build",
branch="master",
)
)
configs.append(
HiddenConf(
"pytorch_doc_test",
parent_build=xenial_parent_config
)
)
return configs
@ -186,12 +271,13 @@ def instantiate_configs():
compiler_name = fc.find_prop("compiler_name")
compiler_version = fc.find_prop("compiler_version")
is_xla = fc.find_prop("is_xla") or False
is_asan = fc.find_prop("is_asan") or False
is_coverage = fc.find_prop("is_coverage") or False
is_onnx = fc.find_prop("is_onnx") or False
is_pure_torch = fc.find_prop("is_pure_torch") or False
is_vulkan = fc.find_prop("is_vulkan") or False
parms_list_ignored_for_docker_image = []
vulkan = fc.find_prop("vulkan") or False
if vulkan:
parms_list_ignored_for_docker_image.append("vulkan")
python_version = None
if compiler_name == "cuda" or compiler_name == "android":
python_version = fc.find_prop("pyver")
@ -200,9 +286,14 @@ def instantiate_configs():
parms_list = ["py" + fc.find_prop("pyver")]
cuda_version = None
rocm_version = None
if compiler_name == "cuda":
cuda_version = fc.find_prop("compiler_version")
elif compiler_name == "rocm":
rocm_version = fc.find_prop("compiler_version")
restrict_phases = ["build", "test1", "test2", "caffe2_test"]
elif compiler_name == "android":
android_ndk_version = fc.find_prop("compiler_version")
# TODO: do we need clang to compile host binaries like protoc?
@ -216,14 +307,22 @@ def instantiate_configs():
gcc_version = compiler_name + (fc.find_prop("compiler_version") or "")
parms_list.append(gcc_version)
# TODO: This is a nasty special case
if gcc_version == 'clang5' and not is_xla:
parms_list.append("asan")
python_version = fc.find_prop("pyver")
parms_list[0] = fc.find_prop("abbreviated_pyver")
if is_asan:
parms_list.append("asan")
python_version = fc.find_prop("pyver")
parms_list[0] = fc.find_prop("abbreviated_pyver")
if cuda_version in ["9.2", "10", "10.1", "10.2"]:
# TODO The gcc version is orthogonal to CUDA version?
if is_coverage:
parms_list_ignored_for_docker_image.append("coverage")
python_version = fc.find_prop("pyver")
if is_onnx:
parms_list.append("onnx")
python_version = fc.find_prop("pyver")
parms_list[0] = fc.find_prop("abbreviated_pyver")
restrict_phases = ["build", "ort_test1", "ort_test2"]
if cuda_version:
cuda_gcc_version = fc.find_prop("cuda_gcc_override") or "gcc7"
parms_list.append(cuda_gcc_version)
@ -231,7 +330,12 @@ def instantiate_configs():
is_important = fc.find_prop("is_important") or False
parallel_backend = fc.find_prop("parallel_backend") or None
build_only = fc.find_prop("build_only") or False
if build_only and restrict_phases is None:
shard_test = fc.find_prop("shard_test") or False
# TODO: fix pure_torch python test packaging issue.
if shard_test:
restrict_phases = ["build"] if restrict_phases is None else restrict_phases
restrict_phases.extend(["test1", "test2"])
if build_only or is_pure_torch:
restrict_phases = ["build"]
gpu_resource = None
@ -244,8 +348,10 @@ def instantiate_configs():
parms_list_ignored_for_docker_image,
python_version,
cuda_version,
rocm_version,
is_xla,
vulkan,
is_vulkan,
is_pure_torch,
restrict_phases,
gpu_resource,
is_libtorch=is_libtorch,
@ -255,20 +361,33 @@ def instantiate_configs():
# run docs builds on "pytorch-linux-xenial-py3.6-gcc5.4". Docs builds
# should run on a CPU-only build that runs on all PRs.
if distro_name == 'xenial' and fc.find_prop("pyver") == '3.6' \
and cuda_version is None \
and parallel_backend is None \
and compiler_name == 'gcc' \
and fc.find_prop('compiler_version') == '5.4':
# XXX should this be updated to a more modern build? Projects are
# beginning to drop python3.6
if (
distro_name == "xenial"
and fc.find_prop("pyver") == "3.6"
and cuda_version is None
and parallel_backend is None
and not is_vulkan
and not is_pure_torch
and compiler_name == "gcc"
and fc.find_prop("compiler_version") == "5.4"
):
c.filters = gen_filter_dict(branches_list=r"/.*/",
tags_list=RC_PATTERN)
c.dependent_tests = gen_docs_configs(c)
if cuda_version == "10.1" and python_version == "3.6" and not is_libtorch:
if cuda_version == "10.2" and python_version == "3.6" and not is_libtorch:
c.dependent_tests = gen_dependent_configs(c)
if (compiler_name == "gcc"
and compiler_version == "5.4"
and not is_libtorch
and parallel_backend is None):
if (
compiler_name == "gcc"
and compiler_version == "5.4"
and not is_libtorch
and not is_vulkan
and not is_pure_torch
and parallel_backend is None
):
bc_breaking_check = Conf(
"backward-compatibility-check",
[],
@ -297,7 +416,7 @@ def get_workflow_jobs():
for phase in phases:
# TODO why does this not have a test?
if phase == "test" and conf_options.cuda_version == "10":
if Conf.is_test_phase(phase) and conf_options.cuda_version == "10":
continue
x.append(conf_options.gen_workflow_job(phase))


@ -0,0 +1,28 @@
from collections import OrderedDict
from cimodel.data.simple.util.branch_filters import gen_filter_dict
from cimodel.lib.miniutils import quote
CHANNELS_TO_PRUNE = ["pytorch-nightly", "pytorch-test"]
PACKAGES_TO_PRUNE = "pytorch torchvision torchaudio torchtext ignite torchcsprng"
def gen_workflow_job(channel: str):
return OrderedDict(
{
"anaconda_prune": OrderedDict(
{
"name": f"anaconda-prune-{channel}",
"context": quote("org-member"),
"packages": quote(PACKAGES_TO_PRUNE),
"channel": channel,
"filters": gen_filter_dict(branches_list=["postnightly"]),
}
)
}
)
def get_workflow_jobs():
return [gen_workflow_job(channel) for channel in CHANNELS_TO_PRUNE]
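
For reference, the first generated entry looks roughly like this (a sketch: the `quote` wrapping and the exact `gen_filter_dict` shape are assumptions, since both helpers live in other modules):

```python
# Sketch of get_workflow_jobs()[0]; quoting and filters shape assumed.
{
    "anaconda_prune": {
        "name": "anaconda-prune-pytorch-nightly",
        "context": '"org-member"',
        "packages": '"pytorch torchvision torchaudio torchtext ignite torchcsprng"',
        "channel": "pytorch-nightly",
        "filters": {"branches": {"only": ["postnightly"]}},
    }
}
```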


@ -1,5 +1,7 @@
import cimodel.data.simple.util.branch_filters
from cimodel.data.simple.util.docker_constants import DOCKER_IMAGE_NDK
import cimodel.data.simple.util.branch_filters as branch_filters
from cimodel.data.simple.util.docker_constants import (
DOCKER_IMAGE_NDK, DOCKER_REQUIREMENT_NDK
)
class AndroidJob:
@ -34,10 +36,11 @@ class AndroidJob:
"name": full_job_name,
"build_environment": "\"{}\"".format(build_env_name),
"docker_image": "\"{}\"".format(DOCKER_IMAGE_NDK),
"requires": [DOCKER_REQUIREMENT_NDK]
}
if self.is_master_only:
props_dict["filters"] = cimodel.data.simple.util.branch_filters.gen_filter_dict()
props_dict["filters"] = branch_filters.gen_filter_dict(branch_filters.NON_PR_BRANCH_LIST)
return [{self.template_name: props_dict}]
@ -47,12 +50,14 @@ class AndroidGradleJob:
job_name,
template_name,
dependencies,
is_master_only=True):
is_master_only=True,
is_pr_only=False):
self.job_name = job_name
self.template_name = template_name
self.dependencies = dependencies
self.is_master_only = is_master_only
self.is_pr_only = is_pr_only
def gen_tree(self):
@ -62,7 +67,9 @@ class AndroidGradleJob:
}
if self.is_master_only:
props_dict["filters"] = cimodel.data.simple.util.branch_filters.gen_filter_dict()
props_dict["filters"] = branch_filters.gen_filter_dict(branch_filters.NON_PR_BRANCH_LIST)
elif self.is_pr_only:
props_dict["filters"] = branch_filters.gen_filter_dict(branch_filters.PR_BRANCH_LIST)
return [{self.template_name: props_dict}]
@ -72,12 +79,18 @@ WORKFLOW_DATA = [
AndroidJob(["x86_64"], "pytorch_linux_build"),
AndroidJob(["arm", "v7a"], "pytorch_linux_build"),
AndroidJob(["arm", "v8a"], "pytorch_linux_build"),
AndroidJob(["vulkan", "x86_32"], "pytorch_linux_build", is_master_only=False),
AndroidGradleJob(
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-x86_32",
"pytorch_android_gradle_build-x86_32",
["pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build"],
is_master_only=False),
is_master_only=False,
is_pr_only=True),
AndroidGradleJob(
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single",
"pytorch_android_gradle_custom_build_single",
[DOCKER_REQUIREMENT_NDK],
is_master_only=False,
is_pr_only=True),
AndroidGradleJob(
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build",
"pytorch_android_gradle_build",


@ -1,4 +1,7 @@
from cimodel.data.simple.util.docker_constants import DOCKER_IMAGE_GCC7
from cimodel.data.simple.util.docker_constants import (
DOCKER_IMAGE_GCC7,
DOCKER_REQUIREMENT_GCC7
)
def gen_job_name(phase):
@ -38,7 +41,10 @@ class BazelJob:
full_job_name = gen_job_name(self.phase)
build_env_name = "-".join(build_env_parts)
extra_requires = [gen_job_name("build")] if self.phase == "test" else []
extra_requires = (
[gen_job_name("build")] if self.phase == "test" else
[DOCKER_REQUIREMENT_GCC7]
)
props_dict = {
"build_environment": build_env_name,


@ -5,7 +5,7 @@ TODO: Refactor circleci/cimodel/data/binary_build_data.py to generate this file
NB: If you modify this file, you need to also modify
the binary_and_smoke_tests_on_pr variable in
pytorch-ci-hud to adjust the list of whitelisted builds
pytorch-ci-hud to adjust the allowed build list
at https://github.com/ezyang/pytorch-ci-hud/blob/master/src/BuildHistoryDisplay.js
Note:


@ -1,10 +1,13 @@
from collections import OrderedDict
from cimodel.lib.miniutils import quote
from cimodel.data.simple.util.branch_filters import gen_filter_dict, RC_PATTERN
# TODO: make this generated from a matrix rather than just a static list
IMAGE_NAMES = [
"pytorch-linux-bionic-cuda11.1-cudnn8-py3.6-gcc9",
"pytorch-linux-bionic-cuda11.1-cudnn8-py3.8-gcc9",
"pytorch-linux-bionic-cuda11.0-cudnn8-py3.6-gcc9",
"pytorch-linux-bionic-cuda11.0-cudnn8-py3.8-gcc9",
"pytorch-linux-bionic-cuda10.2-cudnn7-py3.8-gcc9",
@ -15,30 +18,38 @@ IMAGE_NAMES = [
"pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7",
"pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7",
"pytorch-linux-xenial-cuda11.0-cudnn8-py3-gcc7",
"pytorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7",
"pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc5.4",
"pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc7",
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c",
"pytorch-linux-xenial-py3-clang5-asan",
"pytorch-linux-xenial-py3-clang7-onnx",
"pytorch-linux-xenial-py3.8",
"pytorch-linux-xenial-py3.6-clang7",
"pytorch-linux-xenial-py3.6-gcc4.8",
"pytorch-linux-xenial-py3.6-gcc5.4",
"pytorch-linux-xenial-py3.6-gcc5.4", # this one is used in doc builds
"pytorch-linux-xenial-py3.6-gcc7.2",
"pytorch-linux-xenial-py3.6-gcc7",
"pytorch-linux-xenial-pynightly",
"pytorch-linux-xenial-rocm3.3-py3.6",
"pytorch-linux-bionic-rocm3.9-py3.6",
"pytorch-linux-bionic-rocm3.10-py3.6",
]
def get_workflow_jobs():
"""Generates a list of docker image build definitions"""
return [
OrderedDict(
ret = []
for image_name in IMAGE_NAMES:
parameters = OrderedDict({
"name": quote(f"docker-{image_name}"),
"image_name": quote(image_name),
})
if image_name == "pytorch-linux-xenial-py3.6-gcc5.4":
# pushing documentation on tags requires CircleCI to also
# build all the dependencies on tags, including this docker image
parameters['filters'] = gen_filter_dict(branches_list=r"/.*/",
tags_list=RC_PATTERN)
ret.append(OrderedDict(
{
"docker_build_job": OrderedDict(
{"name": quote(image_name), "image_name": quote(image_name)}
)
"docker_build_job": parameters
}
)
for image_name in IMAGE_NAMES
]
))
return ret
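
Concretely, each plain image yields a two-key entry, while the gcc5.4 image additionally carries the tag filters (a sketch, assuming `miniutils.quote` wraps its argument in double quotes):

```python
from collections import OrderedDict

def quote(s):
    return f'"{s}"'  # assumed behavior of cimodel.lib.miniutils.quote

image_name = "pytorch-linux-xenial-py3-clang5-asan"
job = OrderedDict({"docker_build_job": OrderedDict({
    "name": quote(f"docker-{image_name}"),
    "image_name": quote(image_name),
})})
assert job["docker_build_job"]["name"] == '"docker-pytorch-linux-xenial-py3-clang5-asan"'
```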


@ -61,41 +61,16 @@ WORKFLOW_DATA = [
MultiPartVersion([3, 6], "py"),
MultiPartVersion([5, 4], "gcc"),
None,
["ge_config_legacy", "test"],
["jit_legacy", "test"],
["pytorch_linux_xenial_py3_6_gcc5_4_build"]),
GeConfigTestJob(
MultiPartVersion([3, 6], "py"),
MultiPartVersion([5, 4], "gcc"),
None,
["ge_config_profiling", "test"],
["pytorch_linux_xenial_py3_6_gcc5_4_build"]),
GeConfigTestJob(
MultiPartVersion([3, 6], "py"),
MultiPartVersion([5, 4], "gcc"),
None,
["ge_config_simple", "test"],
["pytorch_linux_xenial_py3_6_gcc5_4_build"],
CudaVersion(10, 2),
["cudnn7", "py3", "jit_legacy", "test"],
["pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_build"],
use_cuda_docker=True,
),
GeConfigTestJob(
None,
None,
CudaVersion(10, 2),
["cudnn7", "py3", "ge_config_legacy", "test"],
["pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_build"],
use_cuda_docker=True,
# TODO Why does the build environment specify cuda10.1, while the
# job name is cuda10_2?
build_env_override="pytorch-linux-xenial-cuda10.1-cudnn7-ge_config_legacy-test"),
GeConfigTestJob(
None,
None,
CudaVersion(10, 2),
["cudnn7", "py3", "ge_config_profiling", "test"],
["pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_build"],
use_cuda_docker=True,
# TODO Why does the build environment specify cuda10.1, while the
# job name is cuda10_2?
build_env_override="pytorch-linux-xenial-cuda10.1-cudnn7-ge_config_profiling-test"),
]


@ -1,16 +1,16 @@
from cimodel.data.simple.util.versions import MultiPartVersion
import cimodel.lib.miniutils as miniutils
IOS_VERSION = MultiPartVersion([11, 2, 1])
XCODE_VERSION = MultiPartVersion([12, 0, 0])
class ArchVariant:
def __init__(self, name, is_custom=False):
def __init__(self, name, custom_build_name=""):
self.name = name
self.is_custom = is_custom
self.custom_build_name = custom_build_name
def render(self):
extra_parts = ["custom"] if self.is_custom else []
extra_parts = [self.custom_build_name] if len(self.custom_build_name) > 0 else []
return "_".join([self.name] + extra_parts)
@ -19,15 +19,15 @@ def get_platform(arch_variant_name):
class IOSJob:
def __init__(self, ios_version, arch_variant, is_org_member_context=True, extra_props=None):
self.ios_version = ios_version
def __init__(self, xcode_version, arch_variant, is_org_member_context=True, extra_props=None):
self.xcode_version = xcode_version
self.arch_variant = arch_variant
self.is_org_member_context = is_org_member_context
self.extra_props = extra_props
def gen_name_parts(self, with_version_dots):
version_parts = self.ios_version.render_dots_or_parts(with_version_dots)
version_parts = self.xcode_version.render_dots_or_parts(with_version_dots)
build_variant_suffix = "_".join([self.arch_variant.render(), "build"])
return [
@ -61,9 +61,10 @@ class IOSJob:
WORKFLOW_DATA = [
IOSJob(IOS_VERSION, ArchVariant("x86_64"), is_org_member_context=False),
IOSJob(IOS_VERSION, ArchVariant("arm64")),
IOSJob(IOS_VERSION, ArchVariant("arm64", True), extra_props={"op_list": "mobilenetv2.yaml"}),
IOSJob(XCODE_VERSION, ArchVariant("x86_64"), is_org_member_context=False),
IOSJob(XCODE_VERSION, ArchVariant("arm64")),
IOSJob(XCODE_VERSION, ArchVariant("arm64", "metal"), extra_props={"use_metal": miniutils.quote(str(int(True)))}),
IOSJob(XCODE_VERSION, ArchVariant("arm64", "custom"), extra_props={"op_list": "mobilenetv2.yaml"}),
]


@ -4,12 +4,23 @@ PyTorch Mobile PR builds (use linux host toolchain + mobile build options)
import cimodel.lib.miniutils as miniutils
import cimodel.data.simple.util.branch_filters
from cimodel.data.simple.util.docker_constants import DOCKER_IMAGE_ASAN, DOCKER_IMAGE_NDK
from cimodel.data.simple.util.docker_constants import (
DOCKER_IMAGE_ASAN,
DOCKER_REQUIREMENT_ASAN,
DOCKER_IMAGE_NDK,
DOCKER_REQUIREMENT_NDK
)
class MobileJob:
def __init__(self, docker_image, variant_parts, is_master_only=False):
def __init__(
self,
docker_image,
docker_requires,
variant_parts,
is_master_only=False):
self.docker_image = docker_image
self.docker_requires = docker_requires
self.variant_parts = variant_parts
self.is_master_only = is_master_only
@ -30,6 +41,7 @@ class MobileJob:
"build_environment": build_env_name,
"build_only": miniutils.quote(str(int(True))),
"docker_image": self.docker_image,
"requires": self.docker_requires,
"name": full_job_name,
}
@ -40,15 +52,27 @@ class MobileJob:
WORKFLOW_DATA = [
MobileJob(DOCKER_IMAGE_ASAN, ["build"]),
MobileJob(DOCKER_IMAGE_ASAN, ["custom", "build", "static"]),
MobileJob(
DOCKER_IMAGE_ASAN,
[DOCKER_REQUIREMENT_ASAN],
["build"]
),
# Use LLVM-DEV toolchain in android-ndk-r19c docker image
MobileJob(DOCKER_IMAGE_NDK, ["custom", "build", "dynamic"]),
MobileJob(
DOCKER_IMAGE_NDK,
[DOCKER_REQUIREMENT_NDK],
["custom", "build", "dynamic"]
),
# Use LLVM-DEV toolchain in android-ndk-r19c docker image
# Most of this CI is already covered by "mobile-custom-build-dynamic" job
MobileJob(DOCKER_IMAGE_NDK, ["code", "analysis"], True),
MobileJob(
DOCKER_IMAGE_NDK,
[DOCKER_REQUIREMENT_NDK],
["code", "analysis"],
True
),
]
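The new docker_requires parameter surfaces as a "requires" edge in the rendered CircleCI job, so each mobile build is scheduled after the workflow job that produces its Docker image instead of assuming a pre-pushed tag. A sketch of the resulting props dict; the keys come from the diff above, while the concrete names are illustrative assumptions:

```python
# Approximate shape of the dict a MobileJob now renders. Keys are from the
# diff above; the string values here are illustrative, not verbatim output.
props_dict = {
    "build_environment": "pytorch-linux-xenial-py3-clang5-asan-mobile-build",  # assumed
    "build_only": '"1"',  # miniutils.quote(str(int(True)))
    "docker_image": "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-asan",
    "requires": ["docker-pytorch-linux-xenial-py3-clang5-asan"],  # from docker_requires
    "name": "pytorch_linux_xenial_py3_clang5_asan_mobile_build",  # assumed
}
print(props_dict["requires"])
```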

View File

@ -1,4 +1,7 @@
from cimodel.data.simple.util.docker_constants import DOCKER_IMAGE_NDK
from cimodel.data.simple.util.docker_constants import (
DOCKER_IMAGE_NDK,
DOCKER_REQUIREMENT_NDK
)
class AndroidNightlyJob:
@ -48,12 +51,13 @@ class AndroidNightlyJob:
return [{self.template_name: props_dict}]
BASE_REQUIRES = [DOCKER_REQUIREMENT_NDK]
WORKFLOW_DATA = [
AndroidNightlyJob(["x86_32"], "pytorch_linux_build"),
AndroidNightlyJob(["x86_64"], "pytorch_linux_build"),
AndroidNightlyJob(["arm", "v7a"], "pytorch_linux_build"),
AndroidNightlyJob(["arm", "v8a"], "pytorch_linux_build"),
AndroidNightlyJob(["x86_32"], "pytorch_linux_build", requires=BASE_REQUIRES),
AndroidNightlyJob(["x86_64"], "pytorch_linux_build", requires=BASE_REQUIRES),
AndroidNightlyJob(["arm", "v7a"], "pytorch_linux_build", requires=BASE_REQUIRES),
AndroidNightlyJob(["arm", "v8a"], "pytorch_linux_build", requires=BASE_REQUIRES),
AndroidNightlyJob(["android_gradle"], "pytorch_android_gradle_build",
with_docker=False,
requires=[

View File

@ -18,7 +18,7 @@ class IOSNightlyJob:
common_name_pieces = [
"ios",
] + ios_definitions.IOS_VERSION.render_dots_or_parts(with_version_dots) + [
] + ios_definitions.XCODE_VERSION.render_dots_or_parts(with_version_dots) + [
"nightly",
self.variant,
"build",

View File

@ -4,6 +4,11 @@ NON_PR_BRANCH_LIST = [
r"/release\/.*/",
]
PR_BRANCH_LIST = [
r"/gh\/.*\/head/",
r"/pull\/.*/",
]
RC_PATTERN = r"/v[0-9]+(\.[0-9]+)*-rc[0-9]+/"
def gen_filter_dict(

View File

@ -1,30 +1,33 @@
AWS_DOCKER_HOST = "308535385114.dkr.ecr.us-east-1.amazonaws.com"
# ARE YOU EDITING THIS NUMBER? MAKE SURE YOU READ THE GUIDANCE AT THE
# TOP OF .circleci/config.yml
DOCKER_IMAGE_TAG = "209062ef-ab58-422a-b295-36c4eed6e906"
def gen_docker_image(container_type):
return (
"/".join([AWS_DOCKER_HOST, "pytorch", container_type]),
f"docker-{container_type}",
)
def gen_docker_image_requires(image_name):
return [f"docker-{image_name}"]
def gen_docker_image_path(container_type):
return "/".join([
AWS_DOCKER_HOST,
"pytorch",
container_type + ":" + DOCKER_IMAGE_TAG,
])
DOCKER_IMAGE_BASIC, DOCKER_REQUIREMENT_BASE = gen_docker_image(
"pytorch-linux-xenial-py3.6-gcc5.4"
)
DOCKER_IMAGE_CUDA_10_2, DOCKER_REQUIREMENT_CUDA_10_2 = gen_docker_image(
"pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
)
DOCKER_IMAGE_GCC7, DOCKER_REQUIREMENT_GCC7 = gen_docker_image(
"pytorch-linux-xenial-py3.6-gcc7"
)
DOCKER_IMAGE_BASIC = gen_docker_image_path("pytorch-linux-xenial-py3.6-gcc5.4")
DOCKER_IMAGE_CUDA_10_2 = gen_docker_image_path("pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7")
DOCKER_IMAGE_GCC7 = gen_docker_image_path("pytorch-linux-xenial-py3.6-gcc7")
def gen_mobile_docker_name(specifier):
def gen_mobile_docker(specifier):
container_type = "pytorch-linux-xenial-py3-clang5-" + specifier
return gen_docker_image_path(container_type)
return gen_docker_image(container_type)
DOCKER_IMAGE_ASAN = gen_mobile_docker_name("asan")
DOCKER_IMAGE_ASAN, DOCKER_REQUIREMENT_ASAN = gen_mobile_docker("asan")
DOCKER_IMAGE_NDK = gen_mobile_docker_name("android-ndk-r19c")
DOCKER_IMAGE_NDK, DOCKER_REQUIREMENT_NDK = gen_mobile_docker("android-ndk-r19c")
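The old gen_docker_image_path baked the hard-coded DOCKER_IMAGE_TAG into the image string; the new gen_docker_image instead returns an untagged image path together with the name of the workflow job that builds it, so every call site gets both values in one unpacking. Restated from the diff:

```python
# Restatement of the new helper, runnable standalone.
AWS_DOCKER_HOST = "308535385114.dkr.ecr.us-east-1.amazonaws.com"

def gen_docker_image(container_type):
    # (image path without a tag, name of the job that builds this image)
    return (
        "/".join([AWS_DOCKER_HOST, "pytorch", container_type]),
        f"docker-{container_type}",
    )

image, requirement = gen_docker_image("pytorch-linux-xenial-py3-clang5-asan")
print(image)        # .../pytorch/pytorch-linux-xenial-py3-clang5-asan
print(requirement)  # docker-pytorch-linux-xenial-py3-clang5-asan
```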

View File

@ -9,7 +9,7 @@ class MultiPartVersion:
with the prefix string.
"""
if self.parts:
return [self.prefix + str(self.parts[0])] + list(map(str, self.parts[1:]))
return [self.prefix + str(self.parts[0])] + [str(part) for part in self.parts[1:]]
else:
return [self.prefix]
@ -29,3 +29,6 @@ class CudaVersion(MultiPartVersion):
self.minor = minor
super().__init__([self.major, self.minor], "cuda")
def __str__(self):
return f"{self.major}.{self.minor}"

View File

@ -43,8 +43,11 @@ class WindowsJob:
if base_phase == "test":
prerequisite_jobs.append("_".join(base_name_parts + ["build"]))
if self.cuda_version:
self.cudnn_version = 8 if self.cuda_version.major == 11 else 7
arch_env_elements = (
["cuda" + str(self.cuda_version.major), "cudnn7"]
["cuda" + str(self.cuda_version.major), "cudnn" + str(self.cudnn_version)]
if self.cuda_version
else ["cpu"]
)
@ -83,21 +86,25 @@ class WindowsJob:
props_dict["executor"] = "windows-with-nvidia-gpu"
props_dict["cuda_version"] = (
miniutils.quote(str(self.cuda_version.major))
miniutils.quote(str(self.cuda_version))
if self.cuda_version
else "cpu"
)
props_dict["name"] = "_".join(name_parts)
return [{key_name: props_dict}]
class VcSpec:
def __init__(self, year, version_elements=None):
def __init__(self, year, version_elements=None, hide_version=False):
self.year = year
self.version_elements = version_elements or []
self.hide_version = hide_version
def get_elements(self):
if self.hide_version:
return [self.prefixed_year()]
return [self.prefixed_year()] + self.version_elements
def get_product(self):
@ -110,7 +117,7 @@ class VcSpec:
return "vs" + str(self.year)
def render(self):
return "_".join(filter(None, [self.prefixed_year(), self.dotted_version()]))
return "_".join(self.get_elements())
def FalsePred(_):
return False
@ -118,23 +125,22 @@ def FalsePred(_):
def TruePred(_):
return True
_VC2019 = VcSpec(2019)
WORKFLOW_DATA = [
# VS2017 CUDA-10.1
WindowsJob(None, VcSpec(2017, ["14", "11"]), CudaVersion(10, 1), master_only_pred=FalsePred),
WindowsJob(1, VcSpec(2017, ["14", "11"]), CudaVersion(10, 1)),
# VS2017 no-CUDA (builds only)
WindowsJob(None, VcSpec(2017, ["14", "16"]), CudaVersion(10, 1)),
WindowsJob(None, VcSpec(2017, ["14", "16"]), None),
# VS2019 CUDA-10.1
WindowsJob(None, VcSpec(2019), CudaVersion(10, 1)),
WindowsJob(1, VcSpec(2019), CudaVersion(10, 1)),
WindowsJob(2, VcSpec(2019), CudaVersion(10, 1)),
WindowsJob(None, _VC2019, CudaVersion(10, 1)),
WindowsJob(1, _VC2019, CudaVersion(10, 1)),
WindowsJob(2, _VC2019, CudaVersion(10, 1)),
# VS2019 CUDA-11.1
WindowsJob(None, _VC2019, CudaVersion(11, 1)),
WindowsJob(1, _VC2019, CudaVersion(11, 1), master_only_pred=TruePred),
WindowsJob(2, _VC2019, CudaVersion(11, 1), master_only_pred=TruePred),
# VS2019 CPU-only
WindowsJob(None, VcSpec(2019), None),
WindowsJob(1, VcSpec(2019), None),
WindowsJob(2, VcSpec(2019), None, master_only_pred=TruePred),
WindowsJob(1, VcSpec(2019), CudaVersion(10, 1), force_on_cpu=True),
WindowsJob(2, VcSpec(2019), CudaVersion(10, 1), force_on_cpu=True, master_only_pred=TruePred),
WindowsJob(None, _VC2019, None),
WindowsJob(1, _VC2019, None, master_only_pred=TruePred),
WindowsJob(2, _VC2019, None, master_only_pred=TruePred),
WindowsJob(1, _VC2019, CudaVersion(10, 1), force_on_cpu=True, master_only_pred=TruePred),
]
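Two threads run through this hunk: the cuDNN major version is now derived from the CUDA major version (cuDNN 8 pairs with the CUDA 11 toolchains, older CUDA keeps cuDNN 7), and str(self.cuda_version) now lands in the job properties as a full "11.1"-style string. A small sketch of the derivation, restated from the diff:

```python
class CudaVersion:
    def __init__(self, major, minor):
        self.major, self.minor = major, minor

def arch_env_elements(cuda_version):
    # cuDNN 8 pairs with CUDA 11; everything older keeps cuDNN 7 (see diff above).
    if cuda_version:
        cudnn_version = 8 if cuda_version.major == 11 else 7
        return ["cuda" + str(cuda_version.major), "cudnn" + str(cudnn_version)]
    return ["cpu"]

print(arch_env_elements(CudaVersion(11, 1)))  # ['cuda11', 'cudnn8']
print(arch_env_elements(CudaVersion(10, 1)))  # ['cuda10', 'cudnn7']
print(arch_env_elements(None))                # ['cpu']
```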

File diff suppressed because it is too large

View File

@ -10,18 +10,37 @@ if [ -z "${image}" ]; then
exit 1
fi
# TODO: Generalize
OS="ubuntu"
DOCKERFILE="${OS}/Dockerfile"
if [[ "$image" == *-cuda* ]]; then
DOCKERFILE="${OS}-cuda/Dockerfile"
elif [[ "$image" == *-rocm* ]]; then
DOCKERFILE="${OS}-rocm/Dockerfile"
fi
function extract_version_from_image_name() {
eval export $2=$(echo "${image}" | perl -n -e"/$1(\d+(\.\d+)?(\.\d+)?)/ && print \$1")
if [ "x${!2}" = x ]; then
echo "variable '$2' not correctly parsed from image='$image'"
exit 1
fi
}
if [[ "$image" == *-trusty* ]]; then
UBUNTU_VERSION=14.04
elif [[ "$image" == *-xenial* ]]; then
function extract_all_from_image_name() {
# parse $image into an array, splitting on '-'
keep_IFS="$IFS"
IFS="-"
declare -a parts=($image)
IFS="$keep_IFS"
unset keep_IFS
for part in "${parts[@]}"; do
name=$(echo "${part}" | perl -n -e"/([a-zA-Z]+)\d+(\.\d+)?(\.\d+)?/ && print \$1")
vername="${name^^}_VERSION"
# "py" is the odd one out, needs this special case
if [ "x${name}" = xpy ]; then
vername=ANACONDA_PYTHON_VERSION
fi
# skip non-conforming fields such as "pytorch", "linux" or "xenial" without version string
if [ -n "${name}" ]; then
extract_version_from_image_name "${name}" "${vername}"
fi
done
}
if [[ "$image" == *-xenial* ]]; then
UBUNTU_VERSION=16.04
elif [[ "$image" == *-artful* ]]; then
UBUNTU_VERSION=17.10
@ -29,6 +48,26 @@ elif [[ "$image" == *-bionic* ]]; then
UBUNTU_VERSION=18.04
elif [[ "$image" == *-focal* ]]; then
UBUNTU_VERSION=20.04
elif [[ "$image" == *ubuntu* ]]; then
extract_version_from_image_name ubuntu UBUNTU_VERSION
elif [[ "$image" == *centos* ]]; then
extract_version_from_image_name centos CENTOS_VERSION
fi
if [ -n "${UBUNTU_VERSION}" ]; then
OS="ubuntu"
elif [ -n "${CENTOS_VERSION}" ]; then
OS="centos"
else
echo "Unable to derive operating system base..."
exit 1
fi
DOCKERFILE="${OS}/Dockerfile"
if [[ "$image" == *cuda* ]]; then
DOCKERFILE="${OS}-cuda/Dockerfile"
elif [[ "$image" == *rocm* ]]; then
DOCKERFILE="${OS}-rocm/Dockerfile"
fi
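The build script no longer hardcodes every image: extract_version_from_image_name pulls up to three dot-separated numeric groups out of the image name with a perl one-liner. A Python transcription of that regex, for illustration only (the script itself stays in bash):

```python
import re

def extract_version_from_image_name(image, component):
    # Mirrors the perl pattern above: the component name followed by up to
    # three dot-separated numeric groups.
    m = re.search(re.escape(component) + r"(\d+(\.\d+)?(\.\d+)?)", image)
    if not m:
        raise SystemExit(f"'{component}' not correctly parsed from image='{image}'")
    return m.group(1)

image = "pytorch-linux-bionic-cuda11.1-cudnn8-py3.8-gcc9"
print(extract_version_from_image_name(image, "cuda"))   # 11.1
print(extract_version_from_image_name(image, "cudnn"))  # 8
print(extract_version_from_image_name(image, "py"))     # 3.8
print(extract_version_from_image_name(image, "gcc"))    # 9
```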
TRAVIS_DL_URL_PREFIX="https://s3.amazonaws.com/travis-python-archives/binaries/ubuntu/14.04/x86_64"
@ -38,19 +77,10 @@ TRAVIS_DL_URL_PREFIX="https://s3.amazonaws.com/travis-python-archives/binaries/u
# from scratch
case "$image" in
pytorch-linux-xenial-py3.8)
# TODO: This is a hack, get rid of this as soon as you get rid of the travis downloads
TRAVIS_DL_URL_PREFIX="https://s3.amazonaws.com/travis-python-archives/binaries/ubuntu/16.04/x86_64"
TRAVIS_PYTHON_VERSION=3.8
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=7
# Do not install PROTOBUF, DB, and VISION as a test
;;
pytorch-linux-xenial-py3.6-gcc4.8)
ANACONDA_PYTHON_VERSION=3.6
GCC_VERSION=4.8
PROTOBUF=yes
DB=yes
VISION=yes
;;
pytorch-linux-xenial-py3.6-gcc5.4)
ANACONDA_PYTHON_VERSION=3.6
GCC_VERSION=5
@ -71,13 +101,6 @@ case "$image" in
DB=yes
VISION=yes
;;
pytorch-linux-xenial-pynightly)
TRAVIS_PYTHON_VERSION=nightly
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
;;
pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc5.4)
CUDA_VERSION=9.2
CUDNN_VERSION=7
@ -126,7 +149,6 @@ case "$image" in
KATEX=yes
;;
pytorch-linux-xenial-cuda11.0-cudnn8-py3-gcc7)
UBUNTU_VERSION=16.04-rc
CUDA_VERSION=11.0
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.6
@ -136,6 +158,16 @@ case "$image" in
VISION=yes
KATEX=yes
;;
pytorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7)
CUDA_VERSION=11.1
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.6
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
;;
pytorch-linux-xenial-py3-clang5-asan)
ANACONDA_PYTHON_VERSION=3.6
CLANG_VERSION=5.0
@ -143,6 +175,13 @@ case "$image" in
DB=yes
VISION=yes
;;
pytorch-linux-xenial-py3-clang7-onnx)
ANACONDA_PYTHON_VERSION=3.6
CLANG_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
;;
pytorch-linux-xenial-py3-clang5-android-ndk-r19c)
ANACONDA_PYTHON_VERSION=3.6
CLANG_VERSION=5.0
@ -167,6 +206,8 @@ case "$image" in
PROTOBUF=yes
DB=yes
VISION=yes
VULKAN_SDK_VERSION=1.2.148.0
SWIFTSHADER=yes
;;
pytorch-linux-bionic-py3.8-gcc9)
ANACONDA_PYTHON_VERSION=3.8
@ -194,7 +235,6 @@ case "$image" in
VISION=yes
;;
pytorch-linux-bionic-cuda11.0-cudnn8-py3.6-gcc9)
UBUNTU_VERSION=18.04-rc
CUDA_VERSION=11.0
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.6
@ -205,7 +245,6 @@ case "$image" in
KATEX=yes
;;
pytorch-linux-bionic-cuda11.0-cudnn8-py3.8-gcc9)
UBUNTU_VERSION=18.04-rc
CUDA_VERSION=11.0
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.8
@ -215,22 +254,72 @@ case "$image" in
VISION=yes
KATEX=yes
;;
pytorch-linux-xenial-rocm3.3-py3.6)
pytorch-linux-bionic-cuda11.1-cudnn8-py3.6-gcc9)
CUDA_VERSION=11.1
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.6
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
;;
pytorch-linux-bionic-cuda11.1-cudnn8-py3.8-gcc9)
CUDA_VERSION=11.1
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
;;
pytorch-linux-bionic-rocm3.9-py3.6)
ANACONDA_PYTHON_VERSION=3.6
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=3.3
# newer cmake version required
CMAKE_VERSION=3.6.3
ROCM_VERSION=3.9
;;
pytorch-linux-bionic-rocm3.3-py3.6)
pytorch-linux-bionic-rocm3.10-py3.6)
ANACONDA_PYTHON_VERSION=3.6
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=3.3
ROCM_VERSION=3.10
;;
*)
# Catch-all for builds that are not hardcoded.
PROTOBUF=yes
DB=yes
VISION=yes
echo "image '$image' did not match an existing build configuration"
if [[ "$image" == *py* ]]; then
extract_version_from_image_name py ANACONDA_PYTHON_VERSION
fi
if [[ "$image" == *cuda* ]]; then
extract_version_from_image_name cuda CUDA_VERSION
extract_version_from_image_name cudnn CUDNN_VERSION
fi
if [[ "$image" == *rocm* ]]; then
extract_version_from_image_name rocm ROCM_VERSION
fi
if [[ "$image" == *gcc* ]]; then
extract_version_from_image_name gcc GCC_VERSION
fi
if [[ "$image" == *clang* ]]; then
extract_version_from_image_name clang CLANG_VERSION
fi
if [[ "$image" == *devtoolset* ]]; then
extract_version_from_image_name devtoolset DEVTOOLSET_VERSION
fi
if [[ "$image" == *glibc* ]]; then
extract_version_from_image_name glibc GLIBC_VERSION
fi
if [[ "$image" == *cmake* ]]; then
extract_version_from_image_name cmake CMAKE_VERSION
fi
;;
esac
# Set Jenkins UID and GID if running Jenkins
@ -259,15 +348,19 @@ docker build \
--build-arg "JENKINS_UID=${JENKINS_UID:-}" \
--build-arg "JENKINS_GID=${JENKINS_GID:-}" \
--build-arg "UBUNTU_VERSION=${UBUNTU_VERSION}" \
--build-arg "CENTOS_VERSION=${CENTOS_VERSION}" \
--build-arg "DEVTOOLSET_VERSION=${DEVTOOLSET_VERSION}" \
--build-arg "GLIBC_VERSION=${GLIBC_VERSION}" \
--build-arg "CLANG_VERSION=${CLANG_VERSION}" \
--build-arg "ANACONDA_PYTHON_VERSION=${ANACONDA_PYTHON_VERSION}" \
--build-arg "TRAVIS_PYTHON_VERSION=${TRAVIS_PYTHON_VERSION}" \
--build-arg "GCC_VERSION=${GCC_VERSION}" \
--build-arg "CUDA_VERSION=${CUDA_VERSION}" \
--build-arg "CUDNN_VERSION=${CUDNN_VERSION}" \
--build-arg "ANDROID=${ANDROID}" \
--build-arg "ANDROID_NDK=${ANDROID_NDK_VERSION}" \
--build-arg "GRADLE_VERSION=${GRADLE_VERSION}" \
--build-arg "VULKAN_SDK_VERSION=${VULKAN_SDK_VERSION}" \
--build-arg "SWIFTSHADER=${SWIFTSHADER}" \
--build-arg "CMAKE_VERSION=${CMAKE_VERSION:-}" \
--build-arg "NINJA_VERSION=${NINJA_VERSION:-}" \
--build-arg "KATEX=${KATEX:-}" \
@ -277,6 +370,14 @@ docker build \
"$@" \
.
# NVIDIA dockers for RC releases use tag names like `11.0-cudnn8-devel-ubuntu18.04-rc`,
# for this case we will set UBUNTU_VERSION to `18.04-rc` so that the Dockerfile could
# find the correct image. As a result, here we have to replace the
# "$UBUNTU_VERSION" == "18.04-rc"
# with
# "$UBUNTU_VERSION" == "18.04"
UBUNTU_VERSION=$(echo ${UBUNTU_VERSION} | sed 's/-rc$//')
function drun() {
docker run --rm "$tmp_tag" $*
}
@ -294,19 +395,6 @@ if [[ "$OS" == "ubuntu" ]]; then
fi
fi
if [ -n "$TRAVIS_PYTHON_VERSION" ]; then
if [[ "$TRAVIS_PYTHON_VERSION" != nightly ]]; then
if !(drun python --version 2>&1 | grep -qF "Python $TRAVIS_PYTHON_VERSION"); then
echo "TRAVIS_PYTHON_VERSION=$TRAVIS_PYTHON_VERSION, but:"
drun python --version
exit 1
fi
else
echo "Please manually check nightly is OK:"
drun python --version
fi
fi
if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
if !(drun python --version 2>&1 | grep -qF "Python $ANACONDA_PYTHON_VERSION"); then
echo "ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION, but:"

View File

@ -13,7 +13,7 @@ retry () {
#until we find a way to reliably reuse previous build, this last_tag is not in use
# last_tag="$(( CIRCLE_BUILD_NUM - 1 ))"
tag="${CIRCLE_WORKFLOW_ID}"
tag="${DOCKER_TAG}"
registry="308535385114.dkr.ecr.us-east-1.amazonaws.com"
@ -45,9 +45,5 @@ trap "docker logout ${registry}" EXIT
docker push "${image}:${tag}"
# TODO: Get rid of duplicate tagging once ${DOCKER_TAG} becomes the default
docker tag "${image}:${tag}" "${image}:${DOCKER_TAG}"
docker push "${image}:${DOCKER_TAG}"
docker save -o "${IMAGE_NAME}:${tag}.tar" "${image}:${tag}"
aws s3 cp "${IMAGE_NAME}:${tag}.tar" "s3://ossci-linux-build/pytorch/base/${IMAGE_NAME}:${tag}.tar" --acl public-read

View File

@ -0,0 +1,92 @@
ARG CENTOS_VERSION
FROM centos:${CENTOS_VERSION}
ARG CENTOS_VERSION
# Install required packages to build Caffe2
# Install common dependencies (so that this step can be cached separately)
ARG EC2
ADD ./common/install_base.sh install_base.sh
RUN bash ./install_base.sh && rm install_base.sh
# Install devtoolset
ARG DEVTOOLSET_VERSION
ADD ./common/install_devtoolset.sh install_devtoolset.sh
RUN bash ./install_devtoolset.sh && rm install_devtoolset.sh
ENV BASH_ENV "/etc/profile"
# (optional) Install non-default glibc version
ARG GLIBC_VERSION
ADD ./common/install_glibc.sh install_glibc.sh
RUN if [ -n "${GLIBC_VERSION}" ]; then bash ./install_glibc.sh; fi
RUN rm install_glibc.sh
# Install user
ADD ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
# Install conda and other packages (e.g., numpy, coverage, pytest)
ENV PATH /opt/conda/bin:$PATH
ARG ANACONDA_PYTHON_VERSION
ADD ./common/install_conda.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
# (optional) Install protobuf for ONNX
ARG PROTOBUF
ADD ./common/install_protobuf.sh install_protobuf.sh
RUN if [ -n "${PROTOBUF}" ]; then bash ./install_protobuf.sh; fi
RUN rm install_protobuf.sh
ENV INSTALLED_PROTOBUF ${PROTOBUF}
# (optional) Install database packages like LMDB and LevelDB
ARG DB
ADD ./common/install_db.sh install_db.sh
RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
RUN rm install_db.sh
ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV and ffmpeg
ARG VISION
ADD ./common/install_vision.sh install_vision.sh
RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
RUN rm install_vision.sh
ENV INSTALLED_VISION ${VISION}
# Install rocm
ARG ROCM_VERSION
ADD ./common/install_rocm.sh install_rocm.sh
RUN bash ./install_rocm.sh
RUN rm install_rocm.sh
ENV PATH /opt/rocm/bin:$PATH
ENV PATH /opt/rocm/hcc/bin:$PATH
ENV PATH /opt/rocm/hip/bin:$PATH
ENV PATH /opt/rocm/opencl/bin:$PATH
ENV PATH /opt/rocm/llvm/bin:$PATH
ENV LANG en_US.utf8
ENV LC_ALL en_US.utf8
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
ADD ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
# (optional) Install non-default Ninja version
ARG NINJA_VERSION
ADD ./common/install_ninja.sh install_ninja.sh
RUN if [ -n "${NINJA_VERSION}" ]; then bash ./install_ninja.sh; fi
RUN rm install_ninja.sh
# Install ccache/sccache (do this last, so we get priority in PATH)
ADD ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH
RUN bash ./install_cache.sh && rm install_cache.sh
# Include BUILD_ENVIRONMENT environment variable in image
ARG BUILD_ENVIRONMENT
ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
USER jenkins
CMD ["bash"]

View File

@ -4,13 +4,15 @@ set -ex
[ -n "${ANDROID_NDK}" ]
_https_amazon_aws=https://ossci-android.s3.amazonaws.com
apt-get update
apt-get install -y --no-install-recommends autotools-dev autoconf unzip
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
pushd /tmp
curl -Os --retry 3 https://dl.google.com/android/repository/android-ndk-${ANDROID_NDK}-linux-x86_64.zip
curl -Os --retry 3 $_https_amazon_aws/android-ndk-${ANDROID_NDK}-linux-x86_64.zip
popd
_ndk_dir=/opt/ndk
mkdir -p "$_ndk_dir"
@ -45,43 +47,22 @@ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
# Installing android sdk
# https://github.com/circleci/circleci-images/blob/staging/android/Dockerfile.m4
_sdk_version=sdk-tools-linux-3859397.zip
_tmp_sdk_zip=/tmp/android-sdk-linux.zip
_android_home=/opt/android/sdk
rm -rf $_android_home
sudo mkdir -p $_android_home
curl --silent --show-error --location --fail --retry 3 --output /tmp/$_sdk_version https://dl.google.com/android/repository/$_sdk_version
sudo unzip -q /tmp/$_sdk_version -d $_android_home
rm /tmp/$_sdk_version
curl --silent --show-error --location --fail --retry 3 --output /tmp/android-sdk-linux.zip $_https_amazon_aws/android-sdk-linux-tools3859397-build-tools2803-2902-platforms28-29.zip
sudo unzip -q $_tmp_sdk_zip -d $_android_home
rm $_tmp_sdk_zip
sudo chmod -R 777 $_android_home
export ANDROID_HOME=$_android_home
export ADB_INSTALL_TIMEOUT=120
export PATH="${ANDROID_HOME}/emulator:${ANDROID_HOME}/tools:${ANDROID_HOME}/tools/bin:${ANDROID_HOME}/platform-tools:${PATH}"
export PATH="${ANDROID_HOME}/tools:${ANDROID_HOME}/tools/bin:${ANDROID_HOME}/platform-tools:${PATH}"
echo "PATH:${PATH}"
alias sdkmanager="$ANDROID_HOME/tools/bin/sdkmanager"
sudo mkdir ~/.android && sudo echo '### User Sources for Android SDK Manager' > ~/.android/repositories.cfg
sudo chmod -R 777 ~/.android
yes | sdkmanager --licenses
yes | sdkmanager --update
sdkmanager \
"tools" \
"platform-tools" \
"emulator"
sdkmanager \
"build-tools;28.0.3" \
"build-tools;29.0.2"
sdkmanager \
"platforms;android-28" \
"platforms;android-29"
sdkmanager --list
# Installing Gradle
echo "GRADLE_VERSION:${GRADLE_VERSION}"
@ -89,8 +70,7 @@ _gradle_home=/opt/gradle
sudo rm -rf $_gradle_home
sudo mkdir -p $_gradle_home
wget --no-verbose --output-document=/tmp/gradle.zip \
"https://services.gradle.org/distributions/gradle-${GRADLE_VERSION}-bin.zip"
curl --silent --output /tmp/gradle.zip --retry 3 $_https_amazon_aws/gradle-${GRADLE_VERSION}-bin.zip
sudo unzip -q /tmp/gradle.zip -d $_gradle_home
rm /tmp/gradle.zip

View File

@ -2,55 +2,112 @@
set -ex
# NVIDIA dockers for RC releases use tag names like `11.0-cudnn8-devel-ubuntu18.04-rc`,
# for this case we will set UBUNTU_VERSION to `18.04-rc` so that the Dockerfile could
# find the correct image. As a result, here we have to check for
# "$UBUNTU_VERSION" == "18.04"*
# instead of
# "$UBUNTU_VERSION" == "18.04"
if [[ "$UBUNTU_VERSION" == "18.04"* ]]; then
cmake3="cmake=3.10*"
else
cmake3="cmake=3.5*"
fi
install_ubuntu() {
# NVIDIA dockers for RC releases use tag names like `11.0-cudnn8-devel-ubuntu18.04-rc`,
# for this case we will set UBUNTU_VERSION to `18.04-rc` so that the Dockerfile could
# find the correct image. As a result, here we have to check for
# "$UBUNTU_VERSION" == "18.04"*
# instead of
# "$UBUNTU_VERSION" == "18.04"
if [[ "$UBUNTU_VERSION" == "18.04"* ]]; then
cmake3="cmake=3.10*"
else
cmake3="cmake=3.5*"
fi
# Install common dependencies
apt-get update
# TODO: Some of these may not be necessary
# TODO: libiomp also gets installed by conda, aka there's a conflict
ccache_deps="asciidoc docbook-xml docbook-xsl xsltproc"
numpy_deps="gfortran"
apt-get install -y --no-install-recommends \
$ccache_deps \
$numpy_deps \
${cmake3} \
apt-transport-https \
autoconf \
automake \
build-essential \
ca-certificates \
curl \
git \
libatlas-base-dev \
libc6-dbg \
libiomp-dev \
libyaml-dev \
libz-dev \
libjpeg-dev \
libasound2-dev \
libsndfile-dev \
python \
python-dev \
python-setuptools \
python-wheel \
software-properties-common \
sudo \
wget \
vim
# Install common dependencies
apt-get update
# TODO: Some of these may not be necessary
ccache_deps="asciidoc docbook-xml docbook-xsl xsltproc"
numpy_deps="gfortran"
apt-get install -y --no-install-recommends \
$ccache_deps \
$numpy_deps \
${cmake3} \
apt-transport-https \
autoconf \
automake \
build-essential \
ca-certificates \
curl \
git \
libatlas-base-dev \
libc6-dbg \
libiomp-dev \
libyaml-dev \
libz-dev \
libjpeg-dev \
libasound2-dev \
libsndfile-dev \
software-properties-common \
sudo \
wget \
vim
# Cleanup package manager
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
}
install_centos() {
# Need EPEL for many packages we depend on.
# See http://fedoraproject.org/wiki/EPEL
yum --enablerepo=extras install -y epel-release
ccache_deps="asciidoc docbook-dtds docbook-style-xsl libxslt"
numpy_deps="gcc-gfortran"
# Note: protobuf-c-{compiler,devel} on CentOS are too old to be used
# for Caffe2. That said, we still install them to make sure the build
# system opts to build/use protoc and libprotobuf from third-party.
yum install -y \
$ccache_deps \
$numpy_deps \
autoconf \
automake \
bzip2 \
cmake \
cmake3 \
curl \
gcc \
gcc-c++ \
gflags-devel \
git \
glibc-devel \
glibc-headers \
glog-devel \
hiredis-devel \
libstdc++-devel \
make \
opencv-devel \
sudo \
wget \
vim
# Cleanup
yum clean all
rm -rf /var/cache/yum
rm -rf /var/lib/yum/yumdb
rm -rf /var/lib/yum/history
}
# Install base packages depending on the base OS
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
install_ubuntu
;;
centos)
install_centos
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac
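This install_ubuntu/install_centos split, dispatched on the ID field of /etc/os-release, is the pattern the other install scripts below adopt as well. A Python sketch of the detection for illustration (the scripts do it with grep -oP and tr):

```python
# Illustrative Python version of: ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
def os_release_id(path="/etc/os-release"):
    with open(path) as f:
        for line in f:
            if line.startswith("ID="):
                return line.partition("=")[2].strip().strip('"')
    return None

installers = {"ubuntu": "install_ubuntu", "centos": "install_centos"}
os_id = os_release_id()
if os_id not in installers:
    raise SystemExit("Unable to determine OS...")
print(f"would dispatch to {installers[os_id]}()")
```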
# Install Valgrind separately since the apt-get version is too old.
mkdir valgrind_build && cd valgrind_build
VALGRIND_VERSION=3.15.0
VALGRIND_VERSION=3.16.1
if ! wget http://valgrind.org/downloads/valgrind-${VALGRIND_VERSION}.tar.bz2
then
wget https://sourceware.org/ftp/valgrind/valgrind-${VALGRIND_VERSION}.tar.bz2
@ -63,13 +120,3 @@ sudo make install
cd ../../
rm -rf valgrind_build
alias valgrind="/usr/local/bin/valgrind"
# TODO: THIS IS A HACK!!!
# distributed nccl(2) tests are a bit busted, see https://github.com/pytorch/pytorch/issues/5877
if dpkg -s libnccl-dev; then
apt-get remove -y libnccl-dev libnccl2 --allow-change-held-packages
fi
# Cleanup package manager
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

View File

@ -2,17 +2,51 @@
set -ex
install_ubuntu() {
echo "Preparing to build sccache from source"
apt-get update
apt-get install -y cargo pkg-config libssl-dev
echo "Checking out sccache repo"
git clone https://github.com/pytorch/sccache
cd sccache
echo "Building sccache"
cargo build --release
cp target/release/sccache /opt/cache/bin
echo "Cleaning up"
cd ..
rm -rf sccache
apt-get remove -y cargo rustc
apt-get autoclean && apt-get clean
}
install_binary() {
echo "Downloading sccache binary from S3 repo"
curl --retry 3 https://s3.amazonaws.com/ossci-linux/sccache -o /opt/cache/bin/sccache
}
mkdir -p /opt/cache/bin
mkdir -p /opt/cache/lib
sed -e 's|PATH="\(.*\)"|PATH="/opt/cache/bin:\1"|g' -i /etc/environment
export PATH="/opt/cache/bin:$PATH"
# Setup compiler cache
curl --retry 3 https://s3.amazonaws.com/ossci-linux/sccache -o /opt/cache/bin/sccache
if [ -n "$ROCM_VERSION" ]; then
curl --retry 3 http://repo.radeon.com/misc/.sccache_amd/sccache -o /opt/cache/bin/sccache
else
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
install_ubuntu
;;
*)
install_binary
;;
esac
fi
chmod a+x /opt/cache/bin/sccache
function write_sccache_stub() {
printf "#!/bin/sh\nexec sccache $(which $1) \$*" > "/opt/cache/bin/$1"
printf "#!/bin/sh\nif [ \$(ps -p \$PPID -o comm=) != sccache ]; then\n exec sccache $(which $1) \"\$@\"\nelse\n exec $(which $1) \"\$@\"\nfi" > "/opt/cache/bin/$1"
chmod a+x "/opt/cache/bin/$1"
}
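The old stub re-entered sccache unconditionally; the new one inspects the parent process name, so when sccache itself execs the wrapped compiler the stub falls through to the real binary instead of recursing. A Python transcription of the generator, to spell out the guard (the script writes the same wrapper with printf):

```python
import os, shutil, stat

def write_sccache_stub(compiler, stub_dir="/opt/cache/bin"):
    # Wrap `compiler`: normal invocations go through sccache, but the nested
    # invocation made by sccache itself (parent comm == "sccache") does not,
    # which would otherwise recurse forever.
    real = shutil.which(compiler)
    stub = os.path.join(stub_dir, compiler)
    with open(stub, "w") as f:
        f.write('#!/bin/sh\n'
                'if [ $(ps -p $PPID -o comm=) != sccache ]; then\n'
                f'  exec sccache {real} "$@"\n'
                'else\n'
                f'  exec {real} "$@"\n'
                'fi\n')
    os.chmod(stub, os.stat(stub).st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```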
@ -20,8 +54,12 @@ write_sccache_stub cc
write_sccache_stub c++
write_sccache_stub gcc
write_sccache_stub g++
write_sccache_stub clang
write_sccache_stub clang++
# NOTE: See specific ROCM_VERSION case below.
if [ "x$ROCM_VERSION" = x ]; then
write_sccache_stub clang
write_sccache_stub clang++
fi
if [ -n "$CUDA_VERSION" ]; then
# TODO: This is a workaround for the fact that PyTorch's FindCUDA
@ -30,6 +68,50 @@ if [ -n "$CUDA_VERSION" ]; then
# where CUDA is installed. Instead, we install an nvcc symlink outside
# of the PATH, and set CUDA_NVCC_EXECUTABLE so that we make use of it.
printf "#!/bin/sh\nexec sccache $(which nvcc) \"\$@\"" > /opt/cache/lib/nvcc
chmod a+x /opt/cache/lib/nvcc
write_sccache_stub nvcc
mv /opt/cache/bin/nvcc /opt/cache/lib/
fi
if [ -n "$ROCM_VERSION" ]; then
# ROCm compiler is hcc or clang. However, it is commonly invoked via hipcc wrapper.
# hipcc will call either hcc or clang using an absolute path starting with /opt/rocm,
# causing the /opt/cache/bin to be skipped. We must create the sccache wrappers
# directly under /opt/rocm while also preserving the original compiler names.
# Note symlinks will chain as follows: [hcc or clang++] -> clang -> clang-??
# Final link in symlink chain must point back to original directory.
# Original compiler is moved one directory deeper. Wrapper replaces it.
function write_sccache_stub_rocm() {
OLDCOMP=$1
COMPNAME=$(basename $OLDCOMP)
TOPDIR=$(dirname $OLDCOMP)
WRAPPED="$TOPDIR/original/$COMPNAME"
mv "$OLDCOMP" "$WRAPPED"
printf "#!/bin/sh\nexec sccache $WRAPPED \"\$@\"" > "$OLDCOMP"
chmod a+x "$OLDCOMP"
}
if [[ -e "/opt/rocm/hcc/bin/hcc" ]]; then
# ROCm 3.3 or earlier.
mkdir /opt/rocm/hcc/bin/original
write_sccache_stub_rocm /opt/rocm/hcc/bin/hcc
write_sccache_stub_rocm /opt/rocm/hcc/bin/clang
write_sccache_stub_rocm /opt/rocm/hcc/bin/clang++
# Fix last link in symlink chain, clang points to versioned clang in prior dir
pushd /opt/rocm/hcc/bin/original
ln -s ../$(readlink clang)
popd
elif [[ -e "/opt/rocm/llvm/bin/clang" ]]; then
# ROCm 3.5 and beyond.
mkdir /opt/rocm/llvm/bin/original
write_sccache_stub_rocm /opt/rocm/llvm/bin/clang
write_sccache_stub_rocm /opt/rocm/llvm/bin/clang++
# Fix last link in symlink chain, clang points to versioned clang in prior dir
pushd /opt/rocm/llvm/bin/original
ln -s ../$(readlink clang)
popd
else
echo "Cannot find ROCm compiler."
exit 1
fi
fi

View File

@ -24,13 +24,20 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
mkdir /opt/conda
chown jenkins:jenkins /opt/conda
# Work around bug where devtoolset replaces sudo and breaks it.
if [ -n "$DEVTOOLSET_VERSION" ]; then
SUDO=/bin/sudo
else
SUDO=sudo
fi
as_jenkins() {
# NB: unsetting the environment variables works around a conda bug
# https://github.com/conda/conda/issues/6576
# NB: Pass on PATH and LD_LIBRARY_PATH to sudo invocation
# NB: This must be run from a directory that jenkins has access to,
# works around https://github.com/conda/conda-package-handling/pull/34
sudo -H -u jenkins env -u SUDO_UID -u SUDO_GID -u SUDO_COMMAND -u SUDO_USER env "PATH=$PATH" "LD_LIBRARY_PATH=$LD_LIBRARY_PATH" $*
$SUDO -H -u jenkins env -u SUDO_UID -u SUDO_GID -u SUDO_COMMAND -u SUDO_USER env "PATH=$PATH" "LD_LIBRARY_PATH=$LD_LIBRARY_PATH" $*
}
pushd /tmp
@ -49,10 +56,10 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
pushd /opt/conda
# Track latest conda update
as_jenkins conda update -n base conda
as_jenkins conda update -y -n base conda
# Install correct Python version
as_jenkins conda install python="$ANACONDA_PYTHON_VERSION"
as_jenkins conda install -y python="$ANACONDA_PYTHON_VERSION"
conda_install() {
# Ensure that the install command doesn't upgrade/downgrade Python
@ -65,11 +72,13 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# DO NOT install cmake here as it would install a version newer than 3.5, but
# we want to pin to version 3.5.
if [ "$ANACONDA_PYTHON_VERSION" = "3.8" ]; then
# DO NOT install typing if installing python-3.8, since it's part of python-3.8 core packages
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
conda_install numpy pyyaml mkl mkl-include setuptools cffi future six llvmdev=8.0.0
conda_install numpy=1.18.5 pyyaml mkl mkl-include setuptools cffi future six llvmdev=8.0.0
elif [ "$ANACONDA_PYTHON_VERSION" = "3.7" ]; then
# DO NOT install dataclasses if installing python-3.7, since it's part of python-3.7 core packages
conda_install numpy=1.18.5 pyyaml mkl mkl-include setuptools cffi future six typing_extensions
else
conda_install numpy pyyaml mkl mkl-include setuptools cffi typing future six
conda_install numpy=1.18.5 pyyaml mkl mkl-include setuptools cffi future six dataclasses typing_extensions
fi
if [[ "$CUDA_VERSION" == 9.2* ]]; then
conda_install magma-cuda92 -c pytorch
@ -79,18 +88,42 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
conda_install magma-cuda101 -c pytorch
elif [[ "$CUDA_VERSION" == 10.2* ]]; then
conda_install magma-cuda102 -c pytorch
elif [[ "$CUDA_VERSION" == 11.0* ]]; then
conda_install magma-cuda110 -c pytorch
elif [[ "$CUDA_VERSION" == 11.1* ]]; then
conda_install magma-cuda111 -c pytorch
elif [[ "$CUDA_VERSION" == 11.2* ]]; then
conda_install magma-cuda112 -c pytorch
fi
# TODO: This isn't working atm
conda_install nnpack -c killeent
# Install some other packages
# Install some other packages, including those needed for Python test reporting
# TODO: Why is scipy pinned
# numba & llvmlite is pinned because of https://github.com/numba/numba/issues/4368
# scikit-learn is pinned because of
# https://github.com/scikit-learn/scikit-learn/issues/14485 (affects gcc 5.5
# only)
as_jenkins pip install --progress-bar off pytest scipy==1.1.0 scikit-learn==0.20.3 scikit-image librosa>=0.6.2 psutil numba==0.46.0 llvmlite==0.30.0
# Pin MyPy version because new errors are likely to appear with each release
# Pin hypothesis to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136
as_jenkins pip install --progress-bar off pytest \
scipy==1.1.0 \
scikit-image \
librosa>=0.6.2 \
psutil \
numba \
llvmlite \
unittest-xml-reporting \
boto3==1.16.34 \
coverage \
hypothesis==4.53.2 \
mypy==0.770 \
tb-nightly
# Update scikit-learn to a python-3.8 compatible version
if [[ $(python -c "import sys; print(int(sys.version_info >= (3, 8)))") == "1" ]]; then
as_jenkins pip install --progress-bar off -U scikit-learn
else
# Pinned scikit-learn due to https://github.com/scikit-learn/scikit-learn/issues/14485 (affects gcc 5.5 only)
as_jenkins pip install --progress-bar off scikit-learn==0.20.3
fi
popd
fi
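The growing elif chain above maps each CUDA minor series to a matching prebuilt magma package on the pytorch conda channel; the rule is simply the major.minor digits with the dot dropped. A hedged one-function restatement (this helper does not exist in the script, it just summarizes the chain):

```python
def magma_package(cuda_version):
    # "10.2" -> "magma-cuda102", "11.1" -> "magma-cuda111"; None for CPU-only
    # images. Summarizes the if/elif chain in install_conda.sh above.
    if not cuda_version:
        return None
    major, minor = cuda_version.split(".")[:2]
    return f"magma-cuda{major}{minor}"

assert magma_package("9.2") == "magma-cuda92"
assert magma_package("10.2") == "magma-cuda102"
assert magma_package("11.2") == "magma-cuda112"
# then: conda_install(magma_package(v), "-c", "pytorch")
```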

View File

@ -51,11 +51,16 @@ install_centos() {
}
# Install base packages depending on the base OS
if [ -f /etc/lsb-release ]; then
install_ubuntu
elif [ -f /etc/os-release ]; then
install_centos
else
echo "Unable to determine OS..."
exit 1
fi
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
install_ubuntu
;;
centos)
install_centos
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac

View File

@ -0,0 +1,10 @@
#!/bin/bash
set -ex
[ -n "$DEVTOOLSET_VERSION" ]
yum install -y centos-release-scl
yum install -y devtoolset-$DEVTOOLSET_VERSION
echo "source scl_source enable devtoolset-$DEVTOOLSET_VERSION" > "/etc/profile.d/devtoolset-$DEVTOOLSET_VERSION.sh"

View File

@ -15,6 +15,7 @@ if [ -n "$GCC_VERSION" ]; then
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-"$GCC_VERSION" 50
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-"$GCC_VERSION" 50
update-alternatives --install /usr/bin/gcov gcov /usr/bin/gcov-"$GCC_VERSION" 50
# Cleanup package manager
apt-get autoclean && apt-get clean

View File

@ -0,0 +1,34 @@
#!/bin/bash
set -ex
[ -n "$GLIBC_VERSION" ]
if [[ -n "$CENTOS_VERSION" ]]; then
[ -n "$DEVTOOLSET_VERSION" ]
fi
yum install -y wget sed
mkdir -p /packages && cd /packages
wget -q http://ftp.gnu.org/gnu/glibc/glibc-$GLIBC_VERSION.tar.gz
tar xzf glibc-$GLIBC_VERSION.tar.gz
if [[ "$GLIBC_VERSION" == "2.26" ]]; then
cd glibc-$GLIBC_VERSION
sed -i 's/$name ne "nss_test1"/$name ne "nss_test1" \&\& $name ne "nss_test2"/' scripts/test-installation.pl
cd ..
fi
mkdir -p glibc-$GLIBC_VERSION-build && cd glibc-$GLIBC_VERSION-build
if [[ -n "$CENTOS_VERSION" ]]; then
export PATH=/opt/rh/devtoolset-$DEVTOOLSET_VERSION/root/usr/bin:$PATH
fi
../glibc-$GLIBC_VERSION/configure --prefix=/usr CFLAGS='-Wno-stringop-truncation -Wno-format-overflow -Wno-restrict -Wno-format-truncation -g -O2'
make -j$(nproc)
make install
# Cleanup
rm -rf /packages
rm -rf /var/cache/yum/*
rm -rf /var/lib/rpm/__db.*
yum clean all

View File

@ -0,0 +1,8 @@
#!/bin/bash
set -ex
git clone --branch v1.15 https://github.com/linux-test-project/lcov.git
pushd lcov
sudo make install # will be installed in /usr/local/bin/lcov
popd

View File

@ -1,30 +0,0 @@
#!/bin/bash
set -ex
llvm_url="https://github.com/llvm/llvm-project/releases/download/llvmorg-9.0.1/llvm-9.0.1.src.tar.xz"
mkdir /opt/llvm
pushd /tmp
wget --no-verbose --output-document=llvm.tar.xz "$llvm_url"
mkdir llvm
tar -xf llvm.tar.xz -C llvm --strip-components 1
rm -f llvm.tar.xz
cd llvm
mkdir build
cd build
cmake -G "Unix Makefiles" \
-DCMAKE_BUILD_TYPE=MinSizeRel \
-DLLVM_ENABLE_ASSERTIONS=ON \
-DCMAKE_INSTALL_PREFIX=/opt/llvm \
-DLLVM_TARGETS_TO_BUILD="host" \
-DLLVM_BUILD_TOOLS=OFF \
-DLLVM_BUILD_UTILS=OFF \
-DLLVM_TEMPORARILY_ALLOW_OLD_TOOLCHAIN=ON \
../
make -j4
sudo make install
popd

View File

@ -0,0 +1,4 @@
#!/bin/bash
sudo apt-get -qq update
sudo apt-get -qq install --allow-downgrades --allow-change-held-packages libnccl-dev=2.5.6-1+cuda10.1 libnccl2=2.5.6-1+cuda10.1

View File

@ -0,0 +1,4 @@
#!/bin/bash
sudo apt-get update
sudo apt-get install -y --allow-downgrades --allow-change-held-packages openmpi-bin libopenmpi-dev

View File

@ -46,11 +46,16 @@ install_centos() {
}
# Install base packages depending on the base OS
if [ -f /etc/lsb-release ]; then
install_ubuntu
elif [ -f /etc/os-release ]; then
install_centos
else
echo "Unable to determine OS..."
exit 1
fi
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
install_ubuntu
;;
centos)
install_centos
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac

View File

@ -2,47 +2,68 @@
set -ex
install_magma() {
# "install" hipMAGMA into /opt/rocm/magma by copying after build
git clone https://bitbucket.org/icl/magma.git -b hipMAGMA
pushd magma
cp make.inc-examples/make.inc.hip-mkl-gcc make.inc
echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc
echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib' >> make.inc
echo 'DEVCCFLAGS += --amdgpu-target=gfx803 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908' >> make.inc
export PATH="${PATH}:/opt/rocm/bin"
make -f make.gen.hipMAGMA -j $(nproc)
make lib/libmagma.so -j $(nproc) MKLROOT=/opt/conda
make testing/testing_dgemm -j $(nproc) MKLROOT=/opt/conda
popd
mv magma /opt/rocm
}
install_ubuntu() {
apt-get update
if [[ $UBUNTU_VERSION == 18.04 ]]; then
# gpg-agent is not available by default on 18.04
apt-get install -y --no-install-recommends gpg-agent
fi
apt-get install -y kmod
apt-get install -y wget
apt-get install -y libopenblas-dev
# Need the libc++1 and libc++abi1 libraries to allow torch._C to load at runtime
apt-get install -y libc++1
apt-get install -y libc++abi1
DEB_ROCM_REPO=http://repo.radeon.com/rocm/apt/${ROCM_VERSION}
# Add rocm repository
wget -qO - $DEB_ROCM_REPO/rocm.gpg.key | apt-key add -
echo "deb [arch=amd64] $DEB_ROCM_REPO xenial main" > /etc/apt/sources.list.d/rocm.list
wget -qO - http://repo.radeon.com/rocm/rocm.gpg.key | apt-key add -
echo "deb [arch=amd64] http://repo.radeon.com/rocm/apt/${ROCM_VERSION} xenial main" > /etc/apt/sources.list.d/rocm.list
apt-get update --allow-insecure-repositories
DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated \
rocm-dev \
rocm-utils \
rocfft \
miopen-hip \
rocblas \
hipsparse \
rocrand \
hipcub \
rocthrust \
rocm-libs \
rccl \
rocprofiler-dev \
roctracer-dev
# Cleanup
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# precompiled miopen kernels added in ROCm 3.5; search for all unversioned packages
# if the search fails it would abort this script (set -e); use '|| true' to avoid that
MIOPENKERNELS=$(apt-cache search --names-only miopenkernels | awk '{print $1}' | grep -F -v . || true)
if [[ "x${MIOPENKERNELS}" = x ]]; then
echo "miopenkernels package not available"
else
DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated ${MIOPENKERNELS}
fi
install_magma
# Cleanup
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
}
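The miopenkernels lookup above keeps only package names without a literal dot (grep -F -v .), i.e. the unversioned metapackages that appeared alongside the ROCm 3.5 precompiled kernels. Illustrated in Python with made-up package names:

```python
# Hypothetical apt-cache results; only the dot-free (unversioned) names survive,
# matching `... | awk '{print $1}' | grep -F -v . || true` in the script above.
search_results = [
    "miopenkernels-gfx900-56kdb",       # illustrative unversioned name
    "miopenkernels-gfx906-60kdb",       # illustrative unversioned name
    "miopenkernels-gfx906-60kdb1.2.3",  # illustrative versioned name, dropped
]
unversioned = [name for name in search_results if "." not in name]
print(unversioned if unversioned else "miopenkernels package not available")
```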
install_centos() {
yum update -y
yum install -y kmod
yum install -y wget
yum install -y openblas-devel
@ -51,7 +72,7 @@ install_centos() {
echo "[ROCm]" > /etc/yum.repos.d/rocm.repo
echo "name=ROCm" >> /etc/yum.repos.d/rocm.repo
echo "baseurl=http://repo.radeon.com/rocm/yum/rpm/" >> /etc/yum.repos.d/rocm.repo
echo "baseurl=http://repo.radeon.com/rocm/yum/${ROCM_VERSION}" >> /etc/yum.repos.d/rocm.repo
echo "enabled=1" >> /etc/yum.repos.d/rocm.repo
echo "gpgcheck=0" >> /etc/yum.repos.d/rocm.repo
@ -60,17 +81,13 @@ install_centos() {
yum install -y \
rocm-dev \
rocm-utils \
rocfft \
miopen-hip \
rocblas \
hipsparse \
rocrand \
rocm-libs \
rccl \
hipcub \
rocthrust \
rocprofiler-dev \
roctracer-dev
install_magma
# Cleanup
yum clean all
rm -rf /var/cache/yum
@ -79,11 +96,16 @@ install_centos() {
}
# Install Python packages depending on the base OS
if [ -f /etc/lsb-release ]; then
install_ubuntu
elif [ -f /etc/os-release ]; then
install_centos
else
echo "Unable to determine OS..."
exit 1
fi
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
install_ubuntu
;;
centos)
install_centos
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac

View File

@ -0,0 +1,24 @@
#!/bin/bash
set -ex
[ -n "${SWIFTSHADER}" ]
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
_https_amazon_aws=https://ossci-android.s3.amazonaws.com
# SwiftShader
_swiftshader_dir=/var/lib/jenkins/swiftshader
_swiftshader_file_targz=swiftshader-abe07b943-prebuilt.tar.gz
mkdir -p $_swiftshader_dir
_tmp_swiftshader_targz="/tmp/${_swiftshader_file_targz}"
curl --silent --show-error --location --fail --retry 3 \
--output "${_tmp_swiftshader_targz}" "$_https_amazon_aws/${_swiftshader_file_targz}"
tar -C "${_swiftshader_dir}" -xzf "${_tmp_swiftshader_targz}"
export VK_ICD_FILENAMES="${_swiftshader_dir}/build/Linux/vk_swiftshader_icd.json"
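The retry helper above is a fixed exponential backoff: one attempt, then retries after 1, 2, 4 and 8 seconds. A Python equivalent for illustration:

```python
import subprocess, time

def retry(cmd, delays=(1, 2, 4, 8)):
    # One initial attempt plus len(delays) retries with growing sleeps,
    # mirroring `$* || (sleep 1 && $*) || ...` in the script above.
    for attempt, delay in enumerate((0,) + tuple(delays)):
        time.sleep(delay)
        try:
            subprocess.run(cmd, check=True)
            return
        except subprocess.CalledProcessError:
            if attempt == len(delays):
                raise

retry(["curl", "--silent", "--show-error", "--fail", "-O",
       "https://ossci-android.s3.amazonaws.com/swiftshader-abe07b943-prebuilt.tar.gz"])
```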

View File

@ -1,97 +0,0 @@
#!/bin/bash
set -ex
as_jenkins() {
# NB: Preserve PATH and LD_LIBRARY_PATH changes
sudo -H -u jenkins env "PATH=$PATH" "LD_LIBRARY_PATH=$LD_LIBRARY_PATH" $*
}
if [ -n "$TRAVIS_PYTHON_VERSION" ]; then
mkdir -p /opt/python
chown jenkins:jenkins /opt/python
# Download Python binary from Travis
pushd tmp
as_jenkins wget --quiet ${TRAVIS_DL_URL_PREFIX}/python-$TRAVIS_PYTHON_VERSION.tar.bz2
# NB: The tarball also comes with /home/travis virtualenv that we
# don't care about. (Maybe we should, but we've worked around the
# "how do I install to python" issue by making this entire directory
# user-writable "lol")
# NB: Relative ordering of opt/python and flags matters
as_jenkins tar xjf python-$TRAVIS_PYTHON_VERSION.tar.bz2 --strip-components=2 --directory /opt/python opt/python
popd
echo "/opt/python/$TRAVIS_PYTHON_VERSION/lib" > /etc/ld.so.conf.d/travis-python.conf
ldconfig
sed -e 's|PATH="\(.*\)"|PATH="/opt/python/'"$TRAVIS_PYTHON_VERSION"'/bin:\1"|g' -i /etc/environment
export PATH="/opt/python/$TRAVIS_PYTHON_VERSION/bin:$PATH"
python --version
pip --version
# Install pip from source.
# The python-pip package on Ubuntu Trusty is old
# and upon install numpy doesn't use the binary
# distribution, and fails to compile it from source.
pushd tmp
as_jenkins curl -L -O https://pypi.python.org/packages/11/b6/abcb525026a4be042b486df43905d6893fb04f05aac21c32c638e939e447/pip-9.0.1.tar.gz
as_jenkins tar zxf pip-9.0.1.tar.gz
pushd pip-9.0.1
as_jenkins python setup.py install
popd
rm -rf pip-9.0.1*
popd
# Install pip packages
as_jenkins pip install --upgrade pip
pip --version
if [[ "$TRAVIS_PYTHON_VERSION" == nightly ]]; then
# These two packages have broken Cythonizations uploaded
# to PyPi, see:
#
# - https://github.com/numpy/numpy/issues/10500
# - https://github.com/yaml/pyyaml/issues/117
#
# Furthermore, the released version of Cython does not
# have these issues fixed.
#
# While we are waiting on fixes for these, we build
# from Git for now. Feel free to delete this conditional
# branch if things start working again (you may need
# to do this if these packages regress on Git HEAD.)
as_jenkins pip install git+https://github.com/cython/cython.git
as_jenkins pip install git+https://github.com/numpy/numpy.git
as_jenkins pip install git+https://github.com/yaml/pyyaml.git
else
as_jenkins pip install numpy pyyaml
fi
as_jenkins pip install \
future \
hypothesis \
protobuf \
pytest \
pillow \
typing
as_jenkins pip install mkl mkl-devel
# SciPy does not support Python 3.7 or Python 2.7.9
if [[ "$TRAVIS_PYTHON_VERSION" != nightly ]] && [[ "$TRAVIS_PYTHON_VERSION" != "2.7.9" ]]; then
as_jenkins pip install scipy==1.1.0 scikit-image librosa>=0.6.2
fi
# Install psutil for dataloader tests
as_jenkins pip install psutil
# Install dill for serialization tests
as_jenkins pip install "dill>=0.3.1"
# Cleanup package manager
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
fi

View File

@ -47,11 +47,16 @@ install_centos() {
}
# Install base packages depending on the base OS
if [ -f /etc/lsb-release ]; then
install_ubuntu
elif [ -f /etc/os-release ]; then
install_centos
else
echo "Unable to determine OS..."
exit 1
fi
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
install_ubuntu
;;
centos)
install_centos
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac

View File

@ -0,0 +1,23 @@
#!/bin/bash
set -ex
[ -n "${VULKAN_SDK_VERSION}" ]
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
_https_amazon_aws=https://ossci-android.s3.amazonaws.com
_vulkansdk_dir=/var/lib/jenkins/vulkansdk
mkdir -p $_vulkansdk_dir
_tmp_vulkansdk_targz=/tmp/vulkansdk.tar.gz
curl --silent --show-error --location --fail --retry 3 \
--output "$_tmp_vulkansdk_targz" "$_https_amazon_aws/vulkansdk-linux-x86_64-${VULKAN_SDK_VERSION}.tar.gz"
tar -C "$_vulkansdk_dir" -xzf "$_tmp_vulkansdk_targz" --strip-components 1
export VULKAN_SDK="$_vulkansdk_dir/"
rm "$_tmp_vulkansdk_targz"

View File

@ -24,7 +24,7 @@ ARG KATEX
ADD ./common/install_katex.sh install_katex.sh
RUN bash ./install_katex.sh && rm install_katex.sh
# Install conda
# Install conda and other packages (e.g., numpy, coverage, pytest)
ENV PATH /opt/conda/bin:$PATH
ARG ANACONDA_PYTHON_VERSION
ADD ./common/install_conda.sh install_conda.sh
@ -40,12 +40,6 @@ ARG CLANG_VERSION
ADD ./common/install_clang.sh install_clang.sh
RUN bash ./install_clang.sh && rm install_clang.sh
# Install non-standard Python versions (via Travis binaries)
ARG TRAVIS_PYTHON_VERSION
ENV PATH /opt/python/$TRAVIS_PYTHON_VERSION/bin:$PATH
ADD ./common/install_travis_python.sh install_travis_python.sh
RUN bash ./install_travis_python.sh && rm install_travis_python.sh
# (optional) Install protobuf for ONNX
ARG PROTOBUF
ADD ./common/install_protobuf.sh install_protobuf.sh
@ -78,6 +72,16 @@ ADD ./common/install_jni.sh install_jni.sh
ADD ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
# Install NCCL for when CUDA is version 10.1
ADD ./common/install_nccl.sh install_nccl.sh
RUN if [ "${CUDA_VERSION}" = 10.1 ]; then bash ./install_nccl.sh; fi
RUN rm install_nccl.sh
# Install Open MPI for CUDA
ADD ./common/install_openmpi.sh install_openmpi.sh
RUN if [ -n "${CUDA_VERSION}" ]; then bash install_openmpi.sh; fi
RUN rm install_openmpi.sh
# Include BUILD_ENVIRONMENT environment variable in image
ARG BUILD_ENVIRONMENT
ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
@ -86,9 +90,8 @@ ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
ENV TORCH_CUDA_ARCH_LIST Maxwell
ENV TORCH_NVCC_FLAGS "-Xfatbin -compress-all"
# Install LLVM dev version
ADD ./common/install_llvm.sh install_llvm.sh
RUN bash ./install_llvm.sh
# Install LLVM dev version (Defined in the pytorch/builder github repository)
COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
USER jenkins
CMD ["bash"]

View File

@ -21,7 +21,7 @@ RUN bash ./install_clang.sh && rm install_clang.sh
ADD ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
# Install conda
# Install conda and other packages (e.g., numpy, coverage, pytest)
ENV PATH /opt/conda/bin:$PATH
ARG ANACONDA_PYTHON_VERSION
ADD ./common/install_conda.sh install_conda.sh
@ -57,7 +57,8 @@ ENV PATH /opt/rocm/bin:$PATH
ENV PATH /opt/rocm/hcc/bin:$PATH
ENV PATH /opt/rocm/hip/bin:$PATH
ENV PATH /opt/rocm/opencl/bin:$PATH
ENV HIP_PLATFORM hcc
ENV PATH /opt/rocm/llvm/bin:$PATH
ENV MAGMA_HOME /opt/rocm/magma
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8

View File

@ -33,7 +33,7 @@ ARG KATEX
ADD ./common/install_katex.sh install_katex.sh
RUN bash ./install_katex.sh && rm install_katex.sh
# Install conda
# Install conda and other packages (e.g., numpy, coverage, pytest)
ENV PATH /opt/conda/bin:$PATH
ARG ANACONDA_PYTHON_VERSION
ADD ./common/install_conda.sh install_conda.sh
@ -44,12 +44,9 @@ ARG GCC_VERSION
ADD ./common/install_gcc.sh install_gcc.sh
RUN bash ./install_gcc.sh && rm install_gcc.sh
# Install non-standard Python versions (via Travis binaries)
ARG TRAVIS_PYTHON_VERSION
ARG TRAVIS_DL_URL_PREFIX
ENV PATH /opt/python/$TRAVIS_PYTHON_VERSION/bin:$PATH
ADD ./common/install_travis_python.sh install_travis_python.sh
RUN bash ./install_travis_python.sh && rm install_travis_python.sh
# Install lcov for C++ code coverage
ADD ./common/install_lcov.sh install_lcov.sh
RUN bash ./install_lcov.sh && rm install_lcov.sh
# (optional) Install protobuf for ONNX
ARG PROTOBUF
@ -85,6 +82,18 @@ RUN rm AndroidManifest.xml
RUN rm build.gradle
ENV INSTALLED_ANDROID ${ANDROID}
# (optional) Install Vulkan SDK
ARG VULKAN_SDK_VERSION
ADD ./common/install_vulkan_sdk.sh install_vulkan_sdk.sh
RUN if [ -n "${VULKAN_SDK_VERSION}" ]; then bash ./install_vulkan_sdk.sh; fi
RUN rm install_vulkan_sdk.sh
# (optional) Install swiftshader
ARG SWIFTSHADER
ADD ./common/install_swiftshader.sh install_swiftshader.sh
RUN if [ -n "${SWIFTSHADER}" ]; then bash ./install_swiftshader.sh; fi
RUN rm install_swiftshader.sh
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
ADD ./common/install_cmake.sh install_cmake.sh
@ -111,9 +120,8 @@ RUN bash ./install_jni.sh && rm install_jni.sh
ARG BUILD_ENVIRONMENT
ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
# Install LLVM dev version
ADD ./common/install_llvm.sh install_llvm.sh
RUN bash ./install_llvm.sh
# Install LLVM dev version (Defined in the pytorch/builder github repository)
COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
USER jenkins
CMD ["bash"]

View File

@ -88,6 +88,9 @@ parser = argparse.ArgumentParser(description="Delete old Docker tags from regist
parser.add_argument(
"--dry-run", action="store_true", help="Dry run; print tags that would be deleted"
)
parser.add_argument(
"--debug", action="store_true", help="Debug, print ignored / saved tags"
)
parser.add_argument(
"--keep-stable-days",
type=int,
@ -164,51 +167,48 @@ for repo in repos(client):
# Keep list of image digests to delete for this repository
digest_to_delete = []
print(repositoryName)
for image in images(client, repo):
tags = image.get("imageTags")
if not isinstance(tags, (list,)) or len(tags) == 0:
continue
tag = tags[0]
created = image["imagePushedAt"]
age = now - created
if any([
looks_like_git_sha(tag),
tag.isdigit(),
tag.count("-") == 4, # TODO: Remove, this no longer applies as tags are now built using a SHA1
tag in ignore_tags]):
window = stable_window
if tag in ignore_tags:
stable_window_tags.append((repositoryName, tag, "", age, created))
elif age < window:
stable_window_tags.append((repositoryName, tag, window, age, created))
else:
window = unstable_window
for tag in tags:
if any([
looks_like_git_sha(tag),
tag.isdigit(),
tag.count("-") == 4, # TODO: Remove, this no longer applies as tags are now built using a SHA1
tag in ignore_tags]):
window = stable_window
if tag in ignore_tags:
stable_window_tags.append((repositoryName, tag, "", age, created))
elif age < window:
stable_window_tags.append((repositoryName, tag, window, age, created))
else:
window = unstable_window
if tag in ignore_tags:
print("Ignoring tag {}:{} (age: {})".format(repositoryName, tag, age))
continue
if age < window:
print("Not deleting manifest for tag {}:{} (age: {})".format(repositoryName, tag, age))
continue
if args.dry_run:
print("(dry run) Deleting manifest for tag {}:{} (age: {})".format(repositoryName, tag, age))
if tag in ignore_tags or age < window:
if args.debug:
print("Ignoring {}:{} (age: {})".format(repositoryName, tag, age))
break
else:
print("Deleting manifest for tag{}:{} (age: {})".format(repositoryName, tag, age))
for tag in tags:
print("{}Deleting {}:{} (age: {})".format("(dry run) " if args.dry_run else "", repositoryName, tag, age))
digest_to_delete.append(image["imageDigest"])
if args.dry_run:
if args.debug:
print("Skipping actual deletion, moving on...")
else:
# Issue batch delete for all images to delete for this repository
# Note that as of 2018-07-25, the maximum number of images you can
# delete in a single batch is 100, so chunk our list into batches of
# 100
for c in chunks(digest_to_delete, 100):
client.batch_delete_image(
registryId="308535385114",
repositoryName=repositoryName,
imageIds=[{"imageDigest": digest} for digest in c],
)
# Issue batch delete for all images to delete for this repository
# Note that as of 2018-07-25, the maximum number of images you can
# delete in a single batch is 100, so chunk our list into batches of
# 100
for c in chunks(digest_to_delete, 100):
client.batch_delete_image(
registryId="308535385114",
repositoryName=repositoryName,
imageIds=[{"imageDigest": digest} for digest in c],
)
save_to_s3(args.filter_prefix, stable_window_tags)
save_to_s3(args.filter_prefix, stable_window_tags)
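The chunks helper referenced above exists because ECR's BatchDeleteImage accepts at most 100 image IDs per call (the comment in the script dates that limit to 2018-07-25). A standard implementation, assuming the script's own helper looks roughly like this:

```python
def chunks(lst, n):
    # Yield successive n-sized slices of lst.
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

digests = [f"sha256:{i:064x}" for i in range(250)]      # illustrative digests
print([len(batch) for batch in chunks(digests, 100)])   # [100, 100, 50]
```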

View File

@ -8,10 +8,9 @@ Please see README.md in this directory for details.
import os
import shutil
import sys
from collections import OrderedDict, namedtuple
from collections import namedtuple
import cimodel.data.binary_build_definitions as binary_build_definitions
import cimodel.data.caffe2_build_definitions as caffe2_build_definitions
import cimodel.data.pytorch_build_definitions as pytorch_build_definitions
import cimodel.data.simple.android_definitions
import cimodel.data.simple.bazel_definitions
@ -23,6 +22,7 @@ import cimodel.data.simple.macos_definitions
import cimodel.data.simple.mobile_definitions
import cimodel.data.simple.nightly_android
import cimodel.data.simple.nightly_ios
import cimodel.data.simple.anaconda_prune_defintions
import cimodel.data.windows_build_definitions as windows_build_definitions
import cimodel.lib.miniutils as miniutils
import cimodel.lib.miniyaml as miniyaml
@ -83,6 +83,7 @@ class Header(object):
def gen_build_workflows_tree():
build_workflows_functions = [
cimodel.data.simple.docker_definitions.get_workflow_jobs,
pytorch_build_definitions.get_workflow_jobs,
cimodel.data.simple.macos_definitions.get_workflow_jobs,
cimodel.data.simple.android_definitions.get_workflow_jobs,
@ -90,23 +91,19 @@ def gen_build_workflows_tree():
cimodel.data.simple.mobile_definitions.get_workflow_jobs,
cimodel.data.simple.ge_config_tests.get_workflow_jobs,
cimodel.data.simple.bazel_definitions.get_workflow_jobs,
caffe2_build_definitions.get_workflow_jobs,
cimodel.data.simple.binary_smoketest.get_workflow_jobs,
cimodel.data.simple.nightly_ios.get_workflow_jobs,
cimodel.data.simple.nightly_android.get_workflow_jobs,
cimodel.data.simple.anaconda_prune_defintions.get_workflow_jobs,
windows_build_definitions.get_windows_workflows,
binary_build_definitions.get_post_upload_jobs,
binary_build_definitions.get_binary_smoke_test_jobs,
]
binary_build_functions = [
binary_build_definitions.get_binary_build_jobs,
binary_build_definitions.get_nightly_tests,
binary_build_definitions.get_nightly_uploads,
binary_build_definitions.get_post_upload_jobs,
binary_build_definitions.get_binary_smoke_test_jobs,
]
docker_builder_functions = [
cimodel.data.simple.docker_definitions.get_workflow_jobs
]
return {
@ -115,20 +112,10 @@ def gen_build_workflows_tree():
"when": r"<< pipeline.parameters.run_binary_tests >>",
"jobs": [f() for f in binary_build_functions],
},
"docker_build": OrderedDict(
{
"triggers": [
{
"schedule": {
"cron": miniutils.quote("0 15 * * 0"),
"filters": {"branches": {"only": ["master"]}},
}
}
],
"jobs": [f() for f in docker_builder_functions],
}
),
"build": {"jobs": [f() for f in build_workflows_functions]},
"build": {
"when": r"<< pipeline.parameters.run_build >>",
"jobs": [f() for f in build_workflows_functions]
},
}
}
@ -140,12 +127,10 @@ YAML_SOURCES = [
File("nightly-binary-build-defaults.yml"),
Header("Build parameters"),
File("build-parameters/pytorch-build-params.yml"),
File("build-parameters/caffe2-build-params.yml"),
File("build-parameters/binary-build-params.yml"),
File("build-parameters/promote-build-params.yml"),
Header("Job specs"),
File("job-specs/pytorch-job-specs.yml"),
File("job-specs/caffe2-job-specs.yml"),
File("job-specs/binary-job-specs.yml"),
File("job-specs/job-specs-custom.yml"),
File("job-specs/job-specs-promote.yml"),


@@ -33,6 +33,11 @@ else
export BUILDER_ROOT="$workdir/builder"
fi
# Try to extract PR number from branch if not already set
if [[ -z "${CIRCLE_PR_NUMBER:-}" ]]; then
CIRCLE_PR_NUMBER="$(echo ${CIRCLE_BRANCH} | sed -E -n 's/pull\/([0-9]*).*/\1/p')"
fi
# Clone the Pytorch branch
retry git clone https://github.com/pytorch/pytorch.git "$PYTORCH_ROOT"
pushd "$PYTORCH_ROOT"


@@ -15,7 +15,8 @@ export PATH="~/anaconda/bin:${PATH}"
source ~/anaconda/bin/activate
# Install dependencies
conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing requests --yes
conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi requests --yes
conda install -c conda-forge valgrind --yes
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
# sync submodules


@@ -13,7 +13,7 @@ base64 --decode cert.txt -o Certificates.p12
rm cert.txt
bundle exec fastlane install_cert
# install the provisioning profile
PROFILE=TestApp_CI.mobileprovision
PROFILE=PyTorch_CI_2021.mobileprovision
PROVISIONING_PROFILES=~/Library/MobileDevice/Provisioning\ Profiles
mkdir -pv "${PROVISIONING_PROFILES}"
cd "${PROVISIONING_PROFILES}"
@@ -25,5 +25,5 @@ if ! [ -x "$(command -v xcodebuild)" ]; then
echo 'Error: xcodebuild is not installed.'
exit 1
fi
PROFILE=TestApp_CI
PROFILE=PyTorch_CI_2021
ruby ${PROJ_ROOT}/scripts/xcode_build.rb -i ${PROJ_ROOT}/build_ios/install -x ${PROJ_ROOT}/ios/TestApp/TestApp.xcodeproj -p ${IOS_PLATFORM} -c ${PROFILE} -t ${IOS_DEV_TEAM_ID}


@@ -14,7 +14,7 @@ mkdir -p ${ZIP_DIR}/src
cp -R ${ARTIFACTS_DIR}/arm64/include ${ZIP_DIR}/install/
# build a FAT binary
cd ${ZIP_DIR}/install/lib
target_libs=(libc10.a libclog.a libcpuinfo.a libeigen_blas.a libpytorch_qnnpack.a libtorch_cpu.a libtorch.a libXNNPACK.a)
target_libs=(libc10.a libclog.a libcpuinfo.a libeigen_blas.a libpthreadpool.a libpytorch_qnnpack.a libtorch_cpu.a libtorch.a libXNNPACK.a)
for lib in ${target_libs[*]}
do
if [ -f "${ARTIFACTS_DIR}/x86_64/lib/${lib}" ] && [ -f "${ARTIFACTS_DIR}/arm64/lib/${lib}" ]; then
@@ -34,7 +34,13 @@ touch version.txt
echo $(date +%s) > version.txt
zip -r ${ZIPFILE} install src version.txt LICENSE
# upload to aws
brew install awscli
# Install conda then 'conda install' awscli
curl --retry 3 -o ~/conda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
chmod +x ~/conda.sh
/bin/bash ~/conda.sh -b -p ~/anaconda
export PATH="~/anaconda/bin:${PATH}"
source ~/anaconda/bin/activate
conda install -c conda-forge awscli --yes
set +x
export AWS_ACCESS_KEY_ID=${AWS_S3_ACCESS_KEY_FOR_PYTORCH_BINARY_UPLOAD}
export AWS_SECRET_ACCESS_KEY=${AWS_S3_ACCESS_SECRET_FOR_PYTORCH_BINARY_UPLOAD}


@@ -5,26 +5,22 @@ set -eux -o pipefail
source /env
# Defaults here so they can be changed in one place
export MAX_JOBS=12
export MAX_JOBS=${MAX_JOBS:-$(( $(nproc) - 2 ))}
if [[ "${DESIRED_CUDA}" == "cu111" ]]; then
export BUILD_SPLIT_CUDA="ON"
fi
# Parse the parameters
if [[ "$PACKAGE_TYPE" == 'conda' ]]; then
build_script='conda/build_pytorch.sh'
elif [[ "$DESIRED_CUDA" == cpu ]]; then
build_script='manywheel/build_cpu.sh'
elif [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
build_script='manywheel/build_rocm.sh'
else
build_script='manywheel/build.sh'
fi
# We want to call unbuffer, which calls tclsh which finds the expect
# package. The expect was installed by yum into /usr/bin so we want to
# find /usr/bin/tclsh, but this is shadowed by /opt/conda/bin/tclsh in
# the conda docker images, so we prepend it to the path here.
if [[ "$PACKAGE_TYPE" == 'conda' ]]; then
mkdir /just_tclsh_bin
ln -s /usr/bin/tclsh /just_tclsh_bin/tclsh
export PATH=/just_tclsh_bin:$PATH
fi
# Build the package
SKIP_ALL_TESTS=1 unbuffer "/builder/$build_script" | ts
SKIP_ALL_TESTS=1 "/builder/$build_script"


@@ -5,12 +5,17 @@ cat >/home/circleci/project/ci_test_script.sh <<EOL
# =================== The following code will be executed inside Docker container ===================
set -eux -o pipefail
python_nodot="\$(echo $DESIRED_PYTHON | tr -d m.u)"
# Set up Python
if [[ "$PACKAGE_TYPE" == conda ]]; then
# There was a bug that was introduced in conda-package-handling >= 1.6.1 that makes archives
# above a certain size fail out when attempting to extract
# see: https://github.com/conda/conda-package-handling/issues/71
conda install -y conda-package-handling=1.6.0
retry conda create -qyn testenv python="$DESIRED_PYTHON"
source activate testenv >/dev/null
elif [[ "$PACKAGE_TYPE" != libtorch ]]; then
python_nodot="\$(echo $DESIRED_PYTHON | tr -d m.u)"
python_path="/opt/python/cp\$python_nodot-cp\${python_nodot}"
# Prior to Python 3.8 paths were suffixed with an 'm'
if [[ -d "\${python_path}/bin" ]]; then
@@ -20,6 +25,19 @@ elif [[ "$PACKAGE_TYPE" != libtorch ]]; then
fi
fi
EXTRA_CONDA_FLAGS=""
NUMPY_PIN=""
if [[ "\$python_nodot" = *39* ]]; then
EXTRA_CONDA_FLAGS="-c=conda-forge"
# There's an issue with conda channel priority where it'll randomly pick 1.19 over 1.20
# we set a lower boundary here just to be safe
NUMPY_PIN=">=1.20"
fi
if [[ "$DESIRED_CUDA" == "cu112" ]]; then
EXTRA_CONDA_FLAGS="-c=conda-forge"
fi
# Install the package
# These network calls should not have 'retry's because they are installing
# locally and aren't actually network calls
@@ -28,23 +46,37 @@ fi
# conda build scripts themselves. These should really be consolidated
pkg="/final_pkgs/\$(ls /final_pkgs)"
if [[ "$PACKAGE_TYPE" == conda ]]; then
conda install -y "\$pkg" --offline
if [[ "$DESIRED_CUDA" == 'cpu' ]]; then
retry conda install -y cpuonly -c pytorch
fi
retry conda install -yq future numpy protobuf six
if [[ "$DESIRED_CUDA" != 'cpu' ]]; then
# DESIRED_CUDA is in format cu90 or cu102
if [[ "${#DESIRED_CUDA}" == 4 ]]; then
cu_ver="${DESIRED_CUDA:2:1}.${DESIRED_CUDA:3}"
(
# For some reason conda likes to re-activate the conda environment when attempting this install
# which means that a deactivate is run and some variables might not exist when that happens,
# namely CONDA_MKL_INTERFACE_LAYER_BACKUP from libblas so let's just ignore unbound variables when
# it comes to the conda installation commands
set +u
retry conda install \${EXTRA_CONDA_FLAGS} -yq \
"numpy\${NUMPY_PIN}" \
future \
mkl>=2018 \
ninja \
dataclasses \
typing-extensions \
defaults::protobuf \
six
if [[ "$DESIRED_CUDA" == 'cpu' ]]; then
retry conda install -c pytorch -y cpuonly
else
cu_ver="${DESIRED_CUDA:2:2}.${DESIRED_CUDA:4}"
# DESIRED_CUDA is in format cu90 or cu102
if [[ "${#DESIRED_CUDA}" == 4 ]]; then
cu_ver="${DESIRED_CUDA:2:1}.${DESIRED_CUDA:3}"
else
cu_ver="${DESIRED_CUDA:2:2}.${DESIRED_CUDA:4}"
fi
retry conda install \${EXTRA_CONDA_FLAGS} -yq -c nvidia -c pytorch "cudatoolkit=\${cu_ver}"
fi
retry conda install -yq -c pytorch "cudatoolkit=\${cu_ver}"
fi
conda install \${EXTRA_CONDA_FLAGS} -y "\$pkg" --offline
)
elif [[ "$PACKAGE_TYPE" != libtorch ]]; then
pip install "\$pkg"
retry pip install -q future numpy protobuf six
retry pip install -q future numpy protobuf typing-extensions six
fi
if [[ "$PACKAGE_TYPE" == libtorch ]]; then
pkg="\$(ls /final_pkgs/*-latest.zip)"


@@ -1,49 +0,0 @@
#!/bin/bash
# Do NOT set -x
source /home/circleci/project/env
set -eu -o pipefail
set +x
declare -x "AWS_ACCESS_KEY_ID=${PYTORCH_BINARY_AWS_ACCESS_KEY_ID}"
declare -x "AWS_SECRET_ACCESS_KEY=${PYTORCH_BINARY_AWS_SECRET_ACCESS_KEY}"
#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!
# DO NOT TURN -x ON BEFORE THIS LINE
#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!
set -eux -o pipefail
export PATH="$MINICONDA_ROOT/bin:$PATH"
# This gets set in binary_populate_env.sh, but lets have a sane default just in case
PIP_UPLOAD_FOLDER=${PIP_UPLOAD_FOLDER:-nightly}
# TODO: Combine CONDA_UPLOAD_CHANNEL and PIP_UPLOAD_FOLDER into one variable
# The only difference is the trailing slash
# Strip trailing slashes if there
CONDA_UPLOAD_CHANNEL=$(echo "${PIP_UPLOAD_FOLDER}" | sed 's:/*$::')
BACKUP_BUCKET="s3://pytorch-backup"
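The `sed 's:/*$::'` normalization above strips any run of trailing slashes, which is the only difference between the pip folder and the conda channel name. A quick illustration:

```bash
echo "nightly/" | sed 's:/*$::'   # -> nightly
echo "nightly"  | sed 's:/*$::'   # -> nightly (already clean, unchanged)
```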
# Upload the package to the final location
pushd /home/circleci/project/final_pkgs
if [[ "$PACKAGE_TYPE" == conda ]]; then
retry conda install -yq anaconda-client
retry anaconda -t "${CONDA_PYTORCHBOT_TOKEN}" upload "$(ls)" -u "pytorch-${CONDA_UPLOAD_CHANNEL}" --label main --no-progress --force
# Fetch platform (eg. win-64, linux-64, etc.) from index file
# Because there's no actual conda command to read this
subdir=$(tar -xOf ./*.bz2 info/index.json | grep subdir | cut -d ':' -f2 | sed -e 's/[[:space:]]//' -e 's/"//g' -e 's/,//')
BACKUP_DIR="conda/${subdir}"
elif [[ "$PACKAGE_TYPE" == libtorch ]]; then
retry pip install -q awscli
s3_dir="s3://pytorch/libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
for pkg in $(ls); do
retry aws s3 cp "$pkg" "$s3_dir" --acl public-read
done
BACKUP_DIR="libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
else
retry pip install -q awscli
s3_dir="s3://pytorch/whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
retry aws s3 cp "$(ls)" "$s3_dir" --acl public-read
BACKUP_DIR="whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
fi
if [[ -n "${CIRCLE_TAG:-}" ]]; then
s3_dir="${BACKUP_BUCKET}/${CIRCLE_TAG}/${BACKUP_DIR}"
retry aws s3 cp . "$s3_dir"
fi
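A note on the `subdir` lookup in this (now-removed) upload script: conda archives record their target platform in `info/index.json`, and `tar -xOf` streams that single member to stdout for the grep/cut/sed chain to pick apart. A hedged sketch (the archive name is illustrative):

```bash
# Prints e.g. "linux-64" for a Linux conda package.
tar -xOf pytorch-1.8.0-py3.8_cuda10.2_cudnn7.6.5_0.tar.bz2 info/index.json \
  | grep subdir \
  | cut -d ':' -f2 \
  | sed -e 's/[[:space:]]//' -e 's/"//g' -e 's/,//'
```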


@@ -20,9 +20,9 @@ if [[ "$PACKAGE_TYPE" == libtorch ]]; then
unzip "$pkg" -d /tmp
cd /tmp/libtorch
elif [[ "$PACKAGE_TYPE" == conda ]]; then
conda install -y "$pkg" --offline
conda install -y "$pkg"
else
pip install "$pkg" --no-index --no-dependencies -v
pip install "$pkg" -v
fi
# Test


@@ -1,49 +0,0 @@
#!/bin/bash
# Do NOT set -x
set -eu -o pipefail
set +x
export AWS_ACCESS_KEY_ID="${PYTORCH_BINARY_AWS_ACCESS_KEY_ID}"
export AWS_SECRET_ACCESS_KEY="${PYTORCH_BINARY_AWS_SECRET_ACCESS_KEY}"
#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!
# DO NOT TURN -x ON BEFORE THIS LINE
#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!
set -eux -o pipefail
source "/Users/distiller/project/env"
export "PATH=$workdir/miniconda/bin:$PATH"
# This gets set in binary_populate_env.sh, but lets have a sane default just in case
PIP_UPLOAD_FOLDER=${PIP_UPLOAD_FOLDER:-nightly}
# TODO: Combine CONDA_UPLOAD_CHANNEL and PIP_UPLOAD_FOLDER into one variable
# The only difference is the trailing slash
# Strip trailing slashes if there
CONDA_UPLOAD_CHANNEL=$(echo "${PIP_UPLOAD_FOLDER}" | sed 's:/*$::')
BACKUP_BUCKET="s3://pytorch-backup"
pushd "$workdir/final_pkgs"
if [[ "$PACKAGE_TYPE" == conda ]]; then
retry conda install -yq anaconda-client
retry anaconda -t "${CONDA_PYTORCHBOT_TOKEN}" upload "$(ls)" -u "pytorch-${CONDA_UPLOAD_CHANNEL}" --label main --no-progress --force
# Fetch platform (eg. win-64, linux-64, etc.) from index file
# Because there's no actual conda command to read this
subdir=$(tar -xOf ./*.bz2 info/index.json | grep subdir | cut -d ':' -f2 | sed -e 's/[[:space:]]//' -e 's/"//g' -e 's/,//')
BACKUP_DIR="conda/${subdir}"
elif [[ "$PACKAGE_TYPE" == libtorch ]]; then
retry pip install -q awscli
s3_dir="s3://pytorch/libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
for pkg in $(ls); do
retry aws s3 cp "$pkg" "$s3_dir" --acl public-read
done
BACKUP_DIR="libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
else
retry pip install -q awscli
s3_dir="s3://pytorch/whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
retry aws s3 cp "$(ls)" "$s3_dir" --acl public-read
BACKUP_DIR="whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
fi
if [[ -n "${CIRCLE_TAG:-}" ]]; then
s3_dir="${BACKUP_BUCKET}/${CIRCLE_TAG}/${BACKUP_DIR}"
retry aws s3 cp . "$s3_dir"
fi


@@ -73,7 +73,7 @@ PIP_UPLOAD_FOLDER='nightly/'
# We put this here so that OVERRIDE_PACKAGE_VERSION below can read from it
export DATE="$(date -u +%Y%m%d)"
#TODO: We should be pulling semver version from the base version.txt
BASE_BUILD_VERSION="1.6.0.dev$DATE"
BASE_BUILD_VERSION="1.8.0.dev$DATE"
# Change BASE_BUILD_VERSION to git tag when on a git tag
# Use 'git -C' to make doubly sure we're in the correct directory for checking
# the git tag
@@ -85,7 +85,7 @@ if tagged_version >/dev/null; then
# Turns tag v1.6.0-rc1 -> v1.6.0
BASE_BUILD_VERSION="$(tagged_version | sed -e 's/^v//' -e 's/-.*$//')"
fi
if [[ "$(uname)" == 'Darwin' ]] || [[ "$DESIRED_CUDA" == "cu102" ]] || [[ "$PACKAGE_TYPE" == conda ]]; then
if [[ "$(uname)" == 'Darwin' ]] || [[ "$PACKAGE_TYPE" == conda ]]; then
export PYTORCH_BUILD_VERSION="${BASE_BUILD_VERSION}"
else
export PYTORCH_BUILD_VERSION="${BASE_BUILD_VERSION}+$DESIRED_CUDA"
@@ -100,8 +100,14 @@ if [[ "$PACKAGE_TYPE" == libtorch ]]; then
POSSIBLE_JAVA_HOMES+=(/usr/local)
POSSIBLE_JAVA_HOMES+=(/usr/lib/jvm/java-8-openjdk-amd64)
POSSIBLE_JAVA_HOMES+=(/Library/Java/JavaVirtualMachines/*.jdk/Contents/Home)
# Add the Windows-specific JNI path
POSSIBLE_JAVA_HOMES+=("$PWD/.circleci/windows-jni/")
for JH in "${POSSIBLE_JAVA_HOMES[@]}" ; do
if [[ -e "$JH/include/jni.h" ]] ; then
# Skip if we're not on Windows but haven't found a JAVA_HOME
if [[ "$JH" == "$PWD/.circleci/windows-jni/" && "$OSTYPE" != "msys" ]] ; then
break
fi
echo "Found jni.h under $JH"
JAVA_HOME="$JH"
BUILD_JNI=ON
@@ -130,7 +136,7 @@ if [[ "${BUILD_FOR_SYSTEM:-}" == "windows" ]]; then
fi
export DATE="$DATE"
export NIGHTLIES_DATE_PREAMBLE=1.6.0.dev
export NIGHTLIES_DATE_PREAMBLE=1.8.0.dev
export PYTORCH_BUILD_VERSION="$PYTORCH_BUILD_VERSION"
export PYTORCH_BUILD_NUMBER="$PYTORCH_BUILD_NUMBER"
export OVERRIDE_PACKAGE_VERSION="$PYTORCH_BUILD_VERSION"
@@ -161,6 +167,7 @@ export CIRCLE_TAG="${CIRCLE_TAG:-}"
export CIRCLE_SHA1="$CIRCLE_SHA1"
export CIRCLE_PR_NUMBER="${CIRCLE_PR_NUMBER:-}"
export CIRCLE_BRANCH="$CIRCLE_BRANCH"
export CIRCLE_WORKFLOW_ID="$CIRCLE_WORKFLOW_ID"
# =================== The above code will be executed inside Docker container ===================
EOL


@@ -19,7 +19,7 @@ chmod +x /home/circleci/project/ci_test_script.sh
VOLUME_MOUNTS="-v /home/circleci/project/:/circleci_stuff -v /home/circleci/project/final_pkgs:/final_pkgs -v ${PYTORCH_ROOT}:/pytorch -v ${BUILDER_ROOT}:/builder"
# Run the docker
if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --runtime=nvidia ${VOLUME_MOUNTS} -t -d "${DOCKER_IMAGE}")
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --gpus all ${VOLUME_MOUNTS} -t -d "${DOCKER_IMAGE}")
else
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined ${VOLUME_MOUNTS} -t -d "${DOCKER_IMAGE}")
fi
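The switch from `--runtime=nvidia` to `--gpus all` tracks the native GPU support added in Docker 19.03, which needs only `nvidia-container-toolkit` on the host rather than the full `nvidia-docker2` runtime. A minimal sketch (the image name is illustrative):

```bash
# Should print the host GPU table if the NVIDIA container toolkit is installed.
docker run --rm --gpus all nvidia/cuda:11.1-base nvidia-smi
```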


@@ -0,0 +1,98 @@
#!/usr/bin/env bash
set -euo pipefail
PACKAGE_TYPE=${PACKAGE_TYPE:-conda}
PKG_DIR=${PKG_DIR:-/tmp/workspace/final_pkgs}
# Designates whether to submit as a release candidate or a nightly build
# Value should be `test` when uploading release candidates
# currently set within `designate_upload_channel`
UPLOAD_CHANNEL=${UPLOAD_CHANNEL:-nightly}
# Designates what subfolder to put packages into
UPLOAD_SUBFOLDER=${UPLOAD_SUBFOLDER:-cpu}
UPLOAD_BUCKET="s3://pytorch"
BACKUP_BUCKET="s3://pytorch-backup"
DRY_RUN=${DRY_RUN:-enabled}
# Don't actually do work unless explicit
ANACONDA="true anaconda"
AWS_S3_CP="aws s3 cp --dryrun"
if [[ "${DRY_RUN}" = "disabled" ]]; then
ANACONDA="anaconda"
AWS_S3_CP="aws s3 cp"
fi
do_backup() {
local backup_dir
backup_dir=$1
(
pushd /tmp/workspace
set -x
${AWS_S3_CP} --recursive . "${BACKUP_BUCKET}/${CIRCLE_TAG}/${backup_dir}/"
)
}
conda_upload() {
(
set -x
${ANACONDA} \
upload \
${PKG_DIR}/*.tar.bz2 \
-u "pytorch-${UPLOAD_CHANNEL}" \
--label main \
--no-progress \
--force
)
}
s3_upload() {
local extension
local pkg_type
extension="$1"
pkg_type="$2"
s3_dir="${UPLOAD_BUCKET}/${pkg_type}/${UPLOAD_CHANNEL}/${UPLOAD_SUBFOLDER}/"
(
for pkg in ${PKG_DIR}/*.${extension}; do
(
set -x
${AWS_S3_CP} --no-progress --acl public-read "${pkg}" "${s3_dir}"
)
done
)
}
case "${PACKAGE_TYPE}" in
conda)
conda_upload
# Fetch platform (eg. win-64, linux-64, etc.) from index file
# Because there's no actual conda command to read this
subdir=$(\
tar -xOf ${PKG_DIR}/*.bz2 info/index.json \
| grep subdir \
| cut -d ':' -f2 \
| sed -e 's/[[:space:]]//' -e 's/"//g' -e 's/,//' \
)
BACKUP_DIR="conda/${subdir}"
;;
libtorch)
s3_upload "zip" "libtorch"
BACKUP_DIR="libtorch/${UPLOAD_CHANNEL}/${UPLOAD_SUBFOLDER}"
;;
# wheel can either refer to wheel/manywheel
*wheel)
s3_upload "whl" "whl"
BACKUP_DIR="whl/${UPLOAD_CHANNEL}/${UPLOAD_SUBFOLDER}"
;;
*)
echo "ERROR: unknown package type: ${PACKAGE_TYPE}"
exit 1
;;
esac
# CIRCLE_TAG is defined by upstream circleci,
# this can be changed to recognize tagged versions
if [[ -n "${CIRCLE_TAG:-}" ]]; then
do_backup "${BACKUP_DIR}"
fi
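One detail of the new upload script worth spelling out: under the default `DRY_RUN=enabled`, `ANACONDA` expands to `true anaconda`, so `${ANACONDA} upload ...` runs `true` with a pile of ignored arguments, stubbing the command out without touching any call sites. A minimal sketch (the `UPLOADER` name and package path are illustrative):

```bash
#!/bin/bash
DRY_RUN=${DRY_RUN:-enabled}
UPLOADER="true anaconda"            # `true` swallows every argument: a no-op
if [[ "${DRY_RUN}" = "disabled" ]]; then
  UPLOADER="anaconda"               # the real client
fi
${UPLOADER} upload ./pkg.tar.bz2    # safe to run in either mode
```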


@@ -15,6 +15,10 @@ else
export VC_YEAR=2019
fi
if [[ "${DESIRED_CUDA}" == "cu111" ]]; then
export BUILD_SPLIT_CUDA="ON"
fi
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}


@@ -1,48 +0,0 @@
#!/bin/bash
set -eu -o pipefail
set +x
declare -x "AWS_ACCESS_KEY_ID=${PYTORCH_BINARY_AWS_ACCESS_KEY_ID}"
declare -x "AWS_SECRET_ACCESS_KEY=${PYTORCH_BINARY_AWS_SECRET_ACCESS_KEY}"
#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!
# DO NOT TURN -x ON BEFORE THIS LINE
#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!
set -eux -o pipefail
source "/env"
# This gets set in binary_populate_env.sh, but lets have a sane default just in case
PIP_UPLOAD_FOLDER=${PIP_UPLOAD_FOLDER:-nightly/}
# TODO: Combine CONDA_UPLOAD_CHANNEL and PIP_UPLOAD_FOLDER into one variable
# The only difference is the trailing slash
# Strip trailing slashes if there
CONDA_UPLOAD_CHANNEL=$(echo "${PIP_UPLOAD_FOLDER}" | sed 's:/*$::')
BACKUP_BUCKET="s3://pytorch-backup"
pushd /root/workspace/final_pkgs
# Upload the package to the final location
if [[ "$PACKAGE_TYPE" == conda ]]; then
retry conda install -yq anaconda-client
retry anaconda -t "${CONDA_PYTORCHBOT_TOKEN}" upload "$(ls)" -u "pytorch-${CONDA_UPLOAD_CHANNEL}" --label main --no-progress --force
# Fetch platform (eg. win-64, linux-64, etc.) from index file
# Because there's no actual conda command to read this
subdir=$(tar -xOf ./*.bz2 info/index.json | grep subdir | cut -d ':' -f2 | sed -e 's/[[:space:]]//' -e 's/"//g' -e 's/,//')
BACKUP_DIR="conda/${subdir}"
elif [[ "$PACKAGE_TYPE" == libtorch ]]; then
retry conda install -c conda-forge -yq awscli
s3_dir="s3://pytorch/libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
for pkg in $(ls); do
retry aws s3 cp "$pkg" "$s3_dir" --acl public-read
done
BACKUP_DIR="libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
else
retry conda install -c conda-forge -yq awscli
s3_dir="s3://pytorch/whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
retry aws s3 cp "$(ls)" "$s3_dir" --acl public-read
BACKUP_DIR="whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
fi
if [[ -n "${CIRCLE_TAG:-}" ]]; then
s3_dir="${BACKUP_BUCKET}/${CIRCLE_TAG}/${BACKUP_DIR}"
retry aws s3 cp . "$s3_dir"
fi


@@ -1,7 +1,11 @@
#!/usr/bin/env bash
set -eux -o pipefail
env
echo "BUILD_ENVIRONMENT:$BUILD_ENVIRONMENT"
export ANDROID_NDK_HOME=/opt/ndk
export ANDROID_NDK=/opt/ndk
export ANDROID_HOME=/opt/android/sdk
# Must be in sync with GRADLE_VERSION in docker image for android
@@ -10,6 +14,31 @@ export GRADLE_VERSION=4.10.3
export GRADLE_HOME=/opt/gradle/gradle-$GRADLE_VERSION
export GRADLE_PATH=$GRADLE_HOME/bin/gradle
# touch gradle cache files to prevent expiration
while IFS= read -r -d '' file
do
touch "$file" || true
done < <(find /var/lib/jenkins/.gradle -type f -print0)
export GRADLE_LOCAL_PROPERTIES=~/workspace/android/local.properties
rm -f $GRADLE_LOCAL_PROPERTIES
echo "sdk.dir=/opt/android/sdk" >> $GRADLE_LOCAL_PROPERTIES
echo "ndk.dir=/opt/ndk" >> $GRADLE_LOCAL_PROPERTIES
echo "cmake.dir=/usr/local" >> $GRADLE_LOCAL_PROPERTIES
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
# Run custom build script
if [[ "${BUILD_ENVIRONMENT}" == *-gradle-custom-build* ]]; then
# Install torch & torchvision - used to download & dump used ops from test model.
retry pip install torch torchvision --progress-bar off
exec "$(dirname "${BASH_SOURCE[0]}")/../../android/build_test_app_custom.sh" armeabi-v7a
fi
# Run default build
BUILD_ANDROID_INCLUDE_DIR_x86=~/workspace/build_android/install/include
BUILD_ANDROID_LIB_DIR_x86=~/workspace/build_android/install/lib
@@ -44,9 +73,6 @@ ln -s ${BUILD_ANDROID_INCLUDE_DIR_arm_v8a} ${JNI_INCLUDE_DIR}/arm64-v8a
ln -s ${BUILD_ANDROID_LIB_DIR_arm_v8a} ${JNI_LIBS_DIR}/arm64-v8a
fi
env
echo "BUILD_ENVIRONMENT:$BUILD_ENVIRONMENT"
GRADLE_PARAMS="-p android assembleRelease --debug --stacktrace"
if [[ "${BUILD_ENVIRONMENT}" == *-gradle-build-only-x86_32* ]]; then
GRADLE_PARAMS+=" -PABI_FILTERS=x86"
@@ -56,20 +82,6 @@ if [ -n "${GRADLE_OFFLINE:-}" ]; then
GRADLE_PARAMS+=" --offline"
fi
# touch gradle cache files to prevent expiration
while IFS= read -r -d '' file
do
touch "$file" || true
done < <(find /var/lib/jenkins/.gradle -type f -print0)
env
export GRADLE_LOCAL_PROPERTIES=~/workspace/android/local.properties
rm -f $GRADLE_LOCAL_PROPERTIES
echo "sdk.dir=/opt/android/sdk" >> $GRADLE_LOCAL_PROPERTIES
echo "ndk.dir=/opt/ndk" >> $GRADLE_LOCAL_PROPERTIES
echo "cmake.dir=/usr/local" >> $GRADLE_LOCAL_PROPERTIES
$GRADLE_PATH $GRADLE_PARAMS
find . -type f -name "*.a" -exec ls -lh {} \;
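The gradle-cache refresh that moved toward the top of this script pairs `find -print0` with `read -r -d ''` so file names containing spaces or newlines survive the round trip. A standalone sketch of the idiom (the directory is illustrative):

```bash
#!/bin/bash
# Touch every file under the cache dir, tolerating odd names and failures.
while IFS= read -r -d '' file; do
  touch "$file" || true
done < <(find "${HOME}/.gradle" -type f -print0)
```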


@@ -30,13 +30,7 @@ if [ "$version" == "master" ]; then
is_master_doc=true
fi
# Argument 3: (optional) If present, we will NOT do any pushing. Used for testing.
dry_run=false
if [ "$3" != "" ]; then
dry_run=true
fi
echo "install_path: $install_path version: $version dry_run: $dry_run"
echo "install_path: $install_path version: $version"
# ======================== Building PyTorch C++ API Docs ========================
@@ -53,31 +47,22 @@ sudo apt-get -y install doxygen
# Generate ATen files
pushd "${pt_checkout}"
pip install -r requirements.txt
time python aten/src/ATen/gen.py \
time python -m tools.codegen.gen \
-s aten/src/ATen \
-d build/aten/src/ATen \
aten/src/ATen/Declarations.cwrap \
aten/src/THCUNN/generic/THCUNN.h \
aten/src/ATen/nn.yaml \
aten/src/ATen/native/native_functions.yaml
-d build/aten/src/ATen
# Copy some required files
cp aten/src/ATen/common_with_cwrap.py tools/shared/cwrap_common.py
cp torch/_utils_internal.py tools/shared
# Generate PyTorch files
time python tools/setup_helpers/generate_code.py \
--declarations-path build/aten/src/ATen/Declarations.yaml \
--native-functions-path aten/src/ATen/native/native_functions.yaml \
--nn-path aten/src/
# Build the docs
pushd docs/cpp
pip install breathe==4.13.0 bs4 lxml six
pip install --no-cache-dir -e "git+https://github.com/pytorch/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme"
pip install exhale>=0.2.1
pip install sphinx==2.4.4
# Uncomment once it is fixed
# pip install -r requirements.txt
pip install -r requirements.txt
time make VERBOSE=1 html -j
popd
@@ -103,24 +88,8 @@ git status
git config user.email "soumith+bot@pytorch.org"
git config user.name "pytorchbot"
# If there aren't changes, don't make a commit; push is no-op
git commit -m "Automatic sync on $(date)" || true
git commit -m "Generate C++ docs from pytorch/pytorch@$CIRCLE_SHA1" || true
git status
if [ "$dry_run" = false ]; then
echo "Pushing to https://github.com/pytorch/cppdocs"
set +x
/usr/bin/expect <<DONE
spawn git push -u origin master
expect "Username*"
send "pytorchbot\n"
expect "Password*"
send "$::env(GITHUB_PYTORCHBOT_TOKEN)\n"
expect eof
DONE
set -x
else
echo "Skipping push due to dry_run"
fi
popd
# =================== The above code **should** be executed inside Docker container ===================


@@ -0,0 +1,8 @@
set "DRIVER_DOWNLOAD_LINK=https://s3.amazonaws.com/ossci-windows/452.39-data-center-tesla-desktop-win10-64bit-international.exe"
curl --retry 3 -kL %DRIVER_DOWNLOAD_LINK% --output 452.39-data-center-tesla-desktop-win10-64bit-international.exe
if errorlevel 1 exit /b 1
start /wait 452.39-data-center-tesla-desktop-win10-64bit-international.exe -s -noreboot
if errorlevel 1 exit /b 1
del 452.39-data-center-tesla-desktop-win10-64bit-international.exe || ver > NUL


@@ -7,6 +7,8 @@ sudo apt-get -y install expect-dev
# This is where the local pytorch install in the docker image is located
pt_checkout="/var/lib/jenkins/workspace"
source "$pt_checkout/.jenkins/pytorch/common_utils.sh"
echo "python_doc_push_script.sh: Invoked with $*"
set -ex
@@ -38,15 +40,30 @@ echo "error: python_doc_push_script.sh: branch (arg3) not specified"
exit 1
fi
# Argument 4: (optional) If present, we will NOT do any pushing. Used for testing.
dry_run=false
if [ "$4" != "" ]; then
dry_run=true
fi
echo "install_path: $install_path version: $version"
echo "install_path: $install_path version: $version dry_run: $dry_run"
git clone https://github.com/pytorch/pytorch.github.io -b $branch
build_docs () {
set +e
set -o pipefail
make $1 2>&1 | tee /tmp/docs_build.txt
code=$?
if [ $code -ne 0 ]; then
set +x
echo =========================
grep "WARNING:" /tmp/docs_build.txt
echo =========================
echo Docs build failed. If the failure is not clear, scan back in the log
echo for any WARNINGS or for the line "build finished with problems"
echo "(tried to echo the WARNINGS above the ==== line)"
echo =========================
fi
set -ex
return $code
}
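The new `build_docs` helper leans on `set -o pipefail` so that `$?` reports the `make` failure rather than the exit status of `tee`, which would otherwise mask it. A minimal sketch of the pattern (the target name is illustrative):

```bash
#!/bin/bash
set -o pipefail
make html 2>&1 | tee /tmp/docs_build.txt   # log to a file and to the console
code=$?                                    # make's status, not tee's
echo "make exited with ${code}"
```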
git clone https://github.com/pytorch/pytorch.github.io -b $branch --depth 1
pushd pytorch.github.io
export LC_ALL=C
@@ -54,26 +71,15 @@ export PATH=/opt/conda/bin:$PATH
rm -rf pytorch || true
# Install TensorBoard in python 3 so torch.utils.tensorboard classes render
pip install -q https://s3.amazonaws.com/ossci-linux/wheels/tensorboard-1.14.0a0-py3-none-any.whl
# Get all the documentation sources, put them in one place
pushd "$pt_checkout"
git clone https://github.com/pytorch/vision
pushd vision
conda install -q pillow
time python setup.py install
popd
pushd docs
rm -rf source/torchvision
cp -a ../vision/docs/source source/torchvision
# Build the docs
pip -q install -r requirements.txt || true
pip -q install -r requirements.txt
if [ "$is_master_doc" = true ]; then
# TODO: fix gh-38011 then enable this which changes warnings into errors
# export SPHINXOPTS="-WT --keep-going"
make html
build_docs html
[ $? -eq 0 ] || exit $?
make coverage
# Now we have the coverage report, we need to make sure it is empty.
# Count the number of lines in the file and turn that number into a variable
@@ -94,8 +100,9 @@ if [ "$is_master_doc" = true ]; then
exit 1
fi
else
# Don't fail the build on coverage problems
make html-stable
# skip coverage, format for stable or tags
build_docs html-stable
[ $? -eq 0 ] || exit $?
fi
# Move them into the docs repo
@@ -104,14 +111,6 @@ popd
git rm -rf "$install_path" || true
mv "$pt_checkout/docs/build/html" "$install_path"
# Add the version handler by search and replace.
# XXX: Consider moving this to the docs Makefile or site build
if [ "$is_master_doc" = true ]; then
find "$install_path" -name "*.html" -print0 | xargs -0 perl -pi -w -e "s@master\s+\((\d\.\d\.[A-Fa-f0-9]+\+[A-Fa-f0-9]+)\s+\)@<a href='http://pytorch.org/docs/versions.html'>\1 \&#x25BC</a>@g"
else
find "$install_path" -name "*.html" -print0 | xargs -0 perl -pi -w -e "s@master\s+\((\d\.\d\.[A-Fa-f0-9]+\+[A-Fa-f0-9]+)\s+\)@<a href='http://pytorch.org/docs/versions.html'>$version \&#x25BC</a>@g"
fi
# Prevent Google from indexing $install_path/_modules. This folder contains
# generated source files.
# NB: the following only works on gnu sed. The sed shipped with mac os is different.
@@ -123,24 +122,8 @@ git status
git config user.email "soumith+bot@pytorch.org"
git config user.name "pytorchbot"
# If there aren't changes, don't make a commit; push is no-op
git commit -m "auto-generating sphinx docs" || true
git commit -m "Generate Python docs from pytorch/pytorch@$CIRCLE_SHA1" || true
git status
if [ "$dry_run" = false ]; then
echo "Pushing to pytorch.github.io:$branch"
set +x
/usr/bin/expect <<DONE
spawn git push origin $branch
expect "Username*"
send "pytorchbot\n"
expect "Password*"
send "$::env(GITHUB_PYTORCHBOT_TOKEN)\n"
expect eof
DONE
set -x
else
echo "Skipping push due to dry_run"
fi
popd
# =================== The above code **should** be executed inside Docker container ===================


@@ -1,12 +1,6 @@
#!/usr/bin/env bash
set -ex -o pipefail
# Set up NVIDIA docker repo
curl -s -L --retry 3 https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
echo "deb https://nvidia.github.io/libnvidia-container/ubuntu16.04/amd64 /" | sudo tee -a /etc/apt/sources.list.d/nvidia-docker.list
echo "deb https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/amd64 /" | sudo tee -a /etc/apt/sources.list.d/nvidia-docker.list
echo "deb https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64 /" | sudo tee -a /etc/apt/sources.list.d/nvidia-docker.list
# Remove unnecessary sources
sudo rm -f /etc/apt/sources.list.d/google-chrome.list
sudo rm -f /etc/apt/heroku.list
@@ -14,7 +8,7 @@ sudo rm -f /etc/apt/openjdk-r-ubuntu-ppa-xenial.list
sudo rm -f /etc/apt/partner.list
retry () {
$* || $* || $* || $* || $*
}
# Method adapted from here: https://askubuntu.com/questions/875213/apt-get-to-retry-downloading
@@ -22,70 +16,75 @@ retry () {
# This is better than retrying the whole apt-get command
echo "APT::Acquire::Retries \"3\";" | sudo tee /etc/apt/apt.conf.d/80-retries
sudo apt-get -y update
sudo apt-get -y remove linux-image-generic linux-headers-generic linux-generic docker-ce
# WARNING: Docker version is hardcoded here; you must update the
# version number below for docker-ce and nvidia-docker2 to get newer
# versions of Docker. We hardcode these numbers because we kept
# getting broken CI when Docker would update their docker version,
# and nvidia-docker2 would be out of date for a day until they
# released a newer version of their package.
#
# How to figure out what the correct versions of these packages are?
# My preferred method is to start a Docker instance of the correct
# Ubuntu version (e.g., docker run -it ubuntu:16.04) and then ask
# apt what the packages you need are. Note that the CircleCI image
# comes with Docker.
#
# Using 'retry' here as belt-and-suspenders even though we are
# presumably retrying at the single-package level via the
# apt.conf.d/80-retries technique.
retry sudo apt-get update -qq
retry sudo apt-get -y install \
linux-headers-$(uname -r) \
linux-image-generic \
moreutils \
docker-ce=5:18.09.4~3-0~ubuntu-xenial \
nvidia-container-runtime=2.0.0+docker18.09.4-1 \
nvidia-docker2=2.0.3+docker18.09.4-1 \
expect-dev
sudo pkill -SIGHUP dockerd
echo "== DOCKER VERSION =="
docker version
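To make the version-pinning comment above concrete, one way to discover pin-able versions is to ask apt inside a throwaway container of the matching Ubuntu release; this sketch assumes the docker-ce apt repository has already been configured inside that container:

```bash
# List the docker-ce versions apt can see on Ubuntu 16.04.
docker run --rm ubuntu:16.04 bash -c \
  "apt-get update -qq >/dev/null && apt-cache policy docker-ce"
```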
retry sudo pip -q install awscli==1.16.35
if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then
DRIVER_FN="NVIDIA-Linux-x86_64-440.59.run"
DRIVER_FN="NVIDIA-Linux-x86_64-460.39.run"
wget "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"
sudo /bin/bash "$DRIVER_FN" -s --no-drm || (sudo cat /var/log/nvidia-installer.log && false)
nvidia-smi
# Taken directly from https://github.com/NVIDIA/nvidia-docker
# Add the package repositories
distribution=$(. /etc/os-release;echo "$ID$VERSION_ID")
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L "https://nvidia.github.io/nvidia-docker/${distribution}/nvidia-docker.list" | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update -qq
# Necessary to get the `--gpus` flag to function within docker
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
else
# Explicitly remove nvidia docker apt repositories if not building for cuda
sudo rm -rf /etc/apt/sources.list.d/nvidia-docker.list
fi
add_to_env_file() {
local content
content=$1
# BASH_ENV should be set by CircleCI
echo "${content}" >> "${BASH_ENV:-/tmp/env}"
}
add_to_env_file "IN_CI=1"
add_to_env_file "COMMIT_SOURCE=${CIRCLE_BRANCH:-}"
add_to_env_file "BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}"
add_to_env_file "CIRCLE_PULL_REQUEST=${CIRCLE_PULL_REQUEST}"
if [[ "${BUILD_ENVIRONMENT}" == *-build ]]; then
echo "declare -x IN_CIRCLECI=1" > /home/circleci/project/env
echo "declare -x COMMIT_SOURCE=${CIRCLE_BRANCH:-}" >> /home/circleci/project/env
echo "declare -x SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> /home/circleci/project/env
add_to_env_file "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2"
SCCACHE_MAX_JOBS=$(( $(nproc) - 1 ))
MEMORY_LIMIT_MAX_JOBS=8 # the "large" resource class on CircleCI has 32 CPU cores, if we use all of them we'll OOM
MAX_JOBS=$(( ${SCCACHE_MAX_JOBS} > ${MEMORY_LIMIT_MAX_JOBS} ? ${MEMORY_LIMIT_MAX_JOBS} : ${SCCACHE_MAX_JOBS} ))
add_to_env_file "MAX_JOBS=${MAX_JOBS}"
if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then
echo "declare -x TORCH_CUDA_ARCH_LIST=5.2" >> /home/circleci/project/env
add_to_env_file "TORCH_CUDA_ARCH_LIST=5.2"
fi
export SCCACHE_MAX_JOBS=`expr $(nproc) - 1`
export MEMORY_LIMIT_MAX_JOBS=8 # the "large" resource class on CircleCI has 32 CPU cores, if we use all of them we'll OOM
export MAX_JOBS=$(( ${SCCACHE_MAX_JOBS} > ${MEMORY_LIMIT_MAX_JOBS} ? ${MEMORY_LIMIT_MAX_JOBS} : ${SCCACHE_MAX_JOBS} ))
echo "declare -x MAX_JOBS=${MAX_JOBS}" >> /home/circleci/project/env
if [[ "${BUILD_ENVIRONMENT}" == *xla* ]]; then
# This IAM user allows write access to S3 bucket for sccache & bazels3cache
set +x
echo "declare -x XLA_CLANG_CACHE_S3_BUCKET_NAME=${XLA_CLANG_CACHE_S3_BUCKET_NAME:-}" >> /home/circleci/project/env
echo "declare -x AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_AND_XLA_BAZEL_S3_BUCKET_V2:-}" >> /home/circleci/project/env
echo "declare -x AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_AND_XLA_BAZEL_S3_BUCKET_V2:-}" >> /home/circleci/project/env
add_to_env_file "XLA_CLANG_CACHE_S3_BUCKET_NAME=${XLA_CLANG_CACHE_S3_BUCKET_NAME:-}"
add_to_env_file "AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_AND_XLA_BAZEL_S3_BUCKET_V2:-}"
add_to_env_file "AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_AND_XLA_BAZEL_S3_BUCKET_V2:-}"
set -x
else
# This IAM user allows write access to S3 bucket for sccache
set +x
echo "declare -x XLA_CLANG_CACHE_S3_BUCKET_NAME=${XLA_CLANG_CACHE_S3_BUCKET_NAME:-}" >> /home/circleci/project/env
echo "declare -x AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}" >> /home/circleci/project/env
echo "declare -x AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}" >> /home/circleci/project/env
add_to_env_file "XLA_CLANG_CACHE_S3_BUCKET_NAME=${XLA_CLANG_CACHE_S3_BUCKET_NAME:-}"
add_to_env_file "AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}"
add_to_env_file "AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}"
set -x
fi
fi
@@ -94,5 +93,5 @@ fi
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_ECR_READ_WRITE_V4:-}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_ECR_READ_WRITE_V4:-}
eval $(aws ecr get-login --region us-east-1 --no-include-email)
eval "$(aws ecr get-login --region us-east-1 --no-include-email)"
set -x


@@ -33,7 +33,7 @@ systemctl list-units --all | cat
sudo pkill apt-get || true
# For even better luck, purge unattended-upgrades
sudo apt-get purge -y unattended-upgrades
sudo apt-get purge -y unattended-upgrades || true
cat /etc/apt/sources.list


@@ -41,11 +41,13 @@ def build_message(size):
"build_num": os.environ.get("CIRCLE_BUILD_NUM"),
"sha1": os.environ.get("CIRCLE_SHA1"),
"branch": os.environ.get("CIRCLE_BRANCH"),
"workflow_id": os.environ.get("CIRCLE_WORKFLOW_ID"),
},
"int": {
"time": int(time.time()),
"size": size,
"commit_time": int(os.environ.get("COMMIT_TIME", "0")),
"run_duration": int(time.time() - os.path.getmtime(os.path.realpath(__file__))),
},
}
@@ -114,10 +116,12 @@ def report_android_sizes(file_dir):
"build_num": os.environ.get("CIRCLE_BUILD_NUM"),
"sha1": os.environ.get("CIRCLE_SHA1"),
"branch": os.environ.get("CIRCLE_BRANCH"),
"workflow_id": os.environ.get("CIRCLE_WORKFLOW_ID"),
},
"int": {
"time": int(time.time()),
"commit_time": int(os.environ.get("COMMIT_TIME", "0")),
"run_duration": int(time.time() - os.path.getmtime(os.path.realpath(__file__))),
"size": comp_size,
"raw_size": uncomp_size,
},


@@ -1,7 +1,7 @@
$VS_DOWNLOAD_LINK = "https://aka.ms/vs/15/release/vs_buildtools.exe"
$COLLECT_DOWNLOAD_LINK = "https://aka.ms/vscollect.exe"
$VS_INSTALL_ARGS = @("--nocache","--quiet","--wait", "--add Microsoft.VisualStudio.Workload.VCTools",
"--add Microsoft.VisualStudio.Component.VC.Tools.14.11",
"--add Microsoft.VisualStudio.Component.VC.Tools.14.13",
"--add Microsoft.Component.MSBuild",
"--add Microsoft.VisualStudio.Component.Roslyn.Compiler",
"--add Microsoft.VisualStudio.Component.TextTemplating",


@@ -0,0 +1,5 @@
$CMATH_DOWNLOAD_LINK = "https://raw.githubusercontent.com/microsoft/STL/12c684bba78f9b032050526abdebf14f58ca26a3/stl/inc/cmath"
$VC14_28_INSTALL_PATH="C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29910\include"
curl.exe --retry 3 -kL $CMATH_DOWNLOAD_LINK --output "$home\cmath"
Move-Item -Path "$home\cmath" -Destination "$VC14_28_INSTALL_PATH" -Force


@@ -1,30 +1,54 @@
#!/bin/bash
set -eux -o pipefail
curl --retry 3 -kLO https://ossci-windows.s3.amazonaws.com/cuda_10.1.243_426.00_win10.exe
7z x cuda_10.1.243_426.00_win10.exe -ocuda_10.1.243_426.00_win10
cd cuda_10.1.243_426.00_win10
cuda_major_version=${CUDA_VERSION%.*}
if [[ "$cuda_major_version" == "10" ]]; then
cuda_installer_name="cuda_10.1.243_426.00_win10"
msbuild_project_dir="CUDAVisualStudioIntegration/extras/visual_studio_integration/MSBuildExtensions"
cuda_install_packages="nvcc_10.1 cuobjdump_10.1 nvprune_10.1 cupti_10.1 cublas_10.1 cublas_dev_10.1 cudart_10.1 cufft_10.1 cufft_dev_10.1 curand_10.1 curand_dev_10.1 cusolver_10.1 cusolver_dev_10.1 cusparse_10.1 cusparse_dev_10.1 nvgraph_10.1 nvgraph_dev_10.1 npp_10.1 npp_dev_10.1 nvrtc_10.1 nvrtc_dev_10.1 nvml_dev_10.1"
elif [[ "$cuda_major_version" == "11" ]]; then
cuda_installer_name="cuda_11.1.0_456.43_win10"
msbuild_project_dir="visual_studio_integration/CUDAVisualStudioIntegration/extras/visual_studio_integration/MSBuildExtensions"
cuda_install_packages="nvcc_11.1 cuobjdump_11.1 nvprune_11.1 nvprof_11.1 cupti_11.1 cublas_11.1 cublas_dev_11.1 cudart_11.1 cufft_11.1 cufft_dev_11.1 curand_11.1 curand_dev_11.1 cusolver_11.1 cusolver_dev_11.1 cusparse_11.1 cusparse_dev_11.1 npp_11.1 npp_dev_11.1 nvrtc_11.1 nvrtc_dev_11.1 nvml_dev_11.1"
else
echo "CUDA_VERSION $CUDA_VERSION is not supported yet"
exit 1
fi
if [[ "$cuda_major_version" == "11" && "${JOB_EXECUTOR}" == "windows-with-nvidia-gpu" ]]; then
cuda_install_packages="${cuda_install_packages} Display.Driver"
fi
cuda_installer_link="https://ossci-windows.s3.amazonaws.com/${cuda_installer_name}.exe"
curl --retry 3 -kLO $cuda_installer_link
7z x ${cuda_installer_name}.exe -o${cuda_installer_name}
cd ${cuda_installer_name}
mkdir cuda_install_logs
set +e
./setup.exe -s nvcc_10.1 cuobjdump_10.1 nvprune_10.1 cupti_10.1 cublas_10.1 cublas_dev_10.1 cudart_10.1 cufft_10.1 cufft_dev_10.1 curand_10.1 curand_dev_10.1 cusolver_10.1 cusolver_dev_10.1 cusparse_10.1 cusparse_dev_10.1 nvgraph_10.1 nvgraph_dev_10.1 npp_10.1 npp_dev_10.1 nvrtc_10.1 nvrtc_dev_10.1 nvml_dev_10.1 -loglevel:6 -log:"$(pwd -W)/cuda_install_logs"
./setup.exe -s ${cuda_install_packages} -loglevel:6 -log:"$(pwd -W)/cuda_install_logs"
set -e
if [[ "${VC_YEAR}" == "2017" ]]; then
cp -r CUDAVisualStudioIntegration/extras/visual_studio_integration/MSBuildExtensions/* "C:/Program Files (x86)/Microsoft Visual Studio/2017/${VC_PRODUCT}/Common7/IDE/VC/VCTargets/BuildCustomizations/"
cp -r ${msbuild_project_dir}/* "C:/Program Files (x86)/Microsoft Visual Studio/2017/${VC_PRODUCT}/Common7/IDE/VC/VCTargets/BuildCustomizations/"
else
cp -r CUDAVisualStudioIntegration/extras/visual_studio_integration/MSBuildExtensions/* "C:/Program Files (x86)/Microsoft Visual Studio/2019/${VC_PRODUCT}/MSBuild/Microsoft/VC/v160/BuildCustomizations/"
cp -r ${msbuild_project_dir}/* "C:/Program Files (x86)/Microsoft Visual Studio/2019/${VC_PRODUCT}/MSBuild/Microsoft/VC/v160/BuildCustomizations/"
fi
curl --retry 3 -kLO https://ossci-windows.s3.amazonaws.com/NvToolsExt.7z
7z x NvToolsExt.7z -oNvToolsExt
mkdir -p "C:/Program Files/NVIDIA Corporation/NvToolsExt"
cp -r NvToolsExt/* "C:/Program Files/NVIDIA Corporation/NvToolsExt/"
export NVTOOLSEXT_PATH="C:\\Program Files\\NVIDIA Corporation\\NvToolsExt\\"
if ! ls "/c/Program Files/NVIDIA Corporation/NvToolsExt/bin/x64/nvToolsExt64_1.dll"
then
curl --retry 3 -kLO https://ossci-windows.s3.amazonaws.com/NvToolsExt.7z
7z x NvToolsExt.7z -oNvToolsExt
mkdir -p "C:/Program Files/NVIDIA Corporation/NvToolsExt"
cp -r NvToolsExt/* "C:/Program Files/NVIDIA Corporation/NvToolsExt/"
export NVTOOLSEXT_PATH="C:\\Program Files\\NVIDIA Corporation\\NvToolsExt\\"
fi
if ! ls "/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.1/bin/nvcc.exe"
if ! ls "/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v${CUDA_VERSION}/bin/nvcc.exe"
then
echo "CUDA installation failed"
mkdir -p /c/w/build-results
@@ -33,5 +57,5 @@ then
fi
cd ..
rm -rf ./cuda_10.1.243_426.00_win10
rm -f ./cuda_10.1.243_426.00_win10.exe
rm -rf ./${cuda_installer_name}
rm -f ./${cuda_installer_name}.exe
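Both this installer and the cuDNN script below key off `cuda_major_version=${CUDA_VERSION%.*}`, where `%.*` deletes the shortest suffix matching `.*`. A quick illustration:

```bash
CUDA_VERSION=11.1
echo "${CUDA_VERSION%.*}"   # -> 11 (drops the trailing ".1")
CUDA_VERSION=10.1
echo "${CUDA_VERSION%.*}"   # -> 10
```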


@@ -0,0 +1,21 @@
#!/bin/bash
set -eux -o pipefail
cuda_major_version=${CUDA_VERSION%.*}
if [[ "$cuda_major_version" == "10" ]]; then
cudnn_installer_name="cudnn-${CUDA_VERSION}-windows10-x64-v7.6.4.38"
elif [[ "$cuda_major_version" == "11" ]]; then
cudnn_installer_name="cudnn-${CUDA_VERSION}-windows-x64-v8.0.5.39"
else
echo "CUDNN for CUDA_VERSION $CUDA_VERSION is not supported yet"
exit 1
fi
cudnn_installer_link="https://ossci-windows.s3.amazonaws.com/${cudnn_installer_name}.zip"
curl --retry 3 -O $cudnn_installer_link
7z x ${cudnn_installer_name}.zip -ocudnn
cp -r cudnn/cuda/* "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v${CUDA_VERSION}/"
rm -rf cudnn
rm -f ${cudnn_installer_name}.zip


@@ -1,45 +0,0 @@
#!/usr/bin/env python3
import cimodel.data.caffe2_build_definitions as caffe2_build_definitions
import cimodel.data.simple.util.docker_constants as pytorch_docker_constants
from yaml import load
try:
from yaml import CLoader as Loader
except ImportError:
from yaml import Loader
def load_config(filename=".circleci/config.yml"):
with open(filename, "r") as fh:
return load("".join(fh.readlines()), Loader)
def load_tags_for_projects(workflow_config):
return {
v["ecr_gc_job"]["project"]: v["ecr_gc_job"]["tags_to_keep"]
for v in workflow_config["workflows"]["ecr_gc"]["jobs"]
if isinstance(v, dict) and "ecr_gc_job" in v
}
def check_version(job, tags, expected_version):
valid_versions = tags[job].split(",")
if expected_version not in valid_versions:
raise RuntimeError(
"We configured {} to use Docker version {}; but this "
"version is not configured in job ecr_gc_job_for_{}. Non-deployed versions will be "
"garbage collected two weeks after they are created. DO NOT LAND "
"THIS TO MASTER without also updating ossci-job-dsl with this version."
"\n\nDeployed versions: {}".format(job, expected_version, job, tags[job])
)
def validate_docker_version():
tags = load_tags_for_projects(load_config())
check_version("pytorch", tags, pytorch_docker_constants.DOCKER_IMAGE_TAG)
check_version("caffe2", tags, caffe2_build_definitions.DOCKER_IMAGE_VERSION)
if __name__ == "__main__":
validate_docker_version()


@@ -59,7 +59,7 @@ binary_windows_params: &binary_windows_params
default: ""
executor:
type: string
default: "windows-cpu-with-nvidia-cuda"
default: "windows-xlarge-cpu-with-nvidia-cuda"
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
BUILD_FOR_SYSTEM: windows


@@ -1,27 +0,0 @@
caffe2_params: &caffe2_params
parameters:
build_environment:
type: string
default: ""
build_ios:
type: string
default: ""
docker_image:
type: string
default: ""
use_cuda_docker_runtime:
type: string
default: ""
build_only:
type: string
default: ""
resource_class:
type: string
default: "large"
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
BUILD_IOS: << parameters.build_ios >>
USE_CUDA_DOCKER_RUNTIME: << parameters.use_cuda_docker_runtime >>
DOCKER_IMAGE: << parameters.docker_image >>
BUILD_ONLY: << parameters.build_only >>
resource_class: << parameters.resource_class >>


@@ -36,17 +36,21 @@ pytorch_ios_params: &pytorch_ios_params
op_list:
type: string
default: ""
use_metal:
type: string
default: "0"
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
IOS_ARCH: << parameters.ios_arch >>
IOS_PLATFORM: << parameters.ios_platform >>
SELECTED_OP_LIST: << parameters.op_list >>
USE_PYTORCH_METAL: << parameters.use_metal >>
pytorch_windows_params: &pytorch_windows_params
parameters:
executor:
type: string
default: "windows-cpu-with-nvidia-cuda"
default: "windows-xlarge-cpu-with-nvidia-cuda"
build_environment:
type: string
default: ""
@@ -55,16 +59,16 @@ pytorch_windows_params: &pytorch_windows_params
default: ""
cuda_version:
type: string
default: "10"
default: "10.1"
python_version:
type: string
default: "3.6"
vc_version:
type: string
default: "14.11"
default: "14.16"
vc_year:
type: string
default: "2017"
default: "2019"
vc_product:
type: string
default: "BuildTools"


@@ -1,23 +1,26 @@
commands:
# Must be run after attaching workspace from previous steps
load_shared_env:
description: "Loads .circleci/shared/env_file into ${BASH_ENV}"
parameters:
# For some weird reason we decide to reattach our workspace to ~/workspace so
# in the vein of making it simple let's assume our shared env_file is here
root:
type: string
default: "~/workspace"
calculate_docker_image_tag:
description: "Calculates the docker image tag"
steps:
- run:
name: "Load .circleci/shared/env_file into ${BASH_ENV}"
name: "Calculate docker image hash"
command: |
if [[ -f "<< parameters.root >>/.circleci/shared/env_file" ]]; then
cat << parameters.root >>/.circleci/shared/env_file >> ${BASH_ENV}
else
echo "We didn't have a shared env file, that's weird"
DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker)
echo "DOCKER_TAG=${DOCKER_TAG}" >> "${BASH_ENV}"
designate_upload_channel:
description: "inserts the correct upload channel into ${BASH_ENV}"
steps:
- run:
name: adding UPLOAD_CHANNEL to BASH_ENV
command: |
our_upload_channel=nightly
# On tags upload to test instead
if [[ -n "${CIRCLE_TAG}" ]]; then
our_upload_channel=test
fi
echo "export UPLOAD_CHANNEL=${our_upload_channel}" >> ${BASH_ENV}
# This system setup script is meant to run before the CI-related scripts, e.g.,
# installing Git client, checking out code, setting up CI env, and
@@ -100,7 +103,7 @@ commands:
name: (Optional) Merge target branch
no_output_timeout: "10m"
command: |
if [ -n "$CIRCLE_PULL_REQUEST" ]; then
if [[ -n "$CIRCLE_PULL_REQUEST" && "$CIRCLE_BRANCH" != "nightly" ]]; then
PR_NUM=$(basename $CIRCLE_PULL_REQUEST)
CIRCLE_PR_BASE_BRANCH=$(curl -s https://api.github.com/repos/$CIRCLE_PROJECT_USERNAME/$CIRCLE_PROJECT_REPONAME/pulls/$PR_NUM | jq -r '.base.ref')
if [[ "${BUILD_ENVIRONMENT}" == *"xla"* || "${BUILD_ENVIRONMENT}" == *"gcc5"* ]] ; then
@@ -108,11 +111,11 @@ commands:
git config --global user.email "circleci.ossci@gmail.com"
git config --global user.name "CircleCI"
git config remote.origin.url https://github.com/pytorch/pytorch.git
git config --add remote.origin.fetch +refs/heads/master:refs/remotes/origin/master
git fetch --tags --progress https://github.com/pytorch/pytorch.git +refs/heads/master:refs/remotes/origin/master --depth=100 --quiet
git config --add remote.origin.fetch +refs/heads/release/1.8:refs/remotes/origin/release/1.8
git fetch --tags --progress https://github.com/pytorch/pytorch.git +refs/heads/release/1.8:refs/remotes/origin/release/1.8 --depth=100 --quiet
# PRs generated from ghstack has format CIRCLE_PR_BASE_BRANCH=gh/xxx/1234/base
if [[ "${CIRCLE_PR_BASE_BRANCH}" == "gh/"* ]]; then
CIRCLE_PR_BASE_BRANCH=master
CIRCLE_PR_BASE_BRANCH=release/1.8
fi
export GIT_MERGE_TARGET=`git log -n 1 --pretty=format:"%H" origin/$CIRCLE_PR_BASE_BRANCH`
echo "GIT_MERGE_TARGET: " ${GIT_MERGE_TARGET}
@@ -130,4 +133,42 @@ commands:
echo "This is not a pull request, skipping..."
fi
upload_binary_size_for_android_build:
description: "Upload binary size data for Android build"
parameters:
build_type:
type: string
default: ""
artifacts:
type: string
default: ""
steps:
- run:
name: "Binary Size - Install Dependencies"
no_output_timeout: "5m"
command: |
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
retry pip3 install requests
- run:
name: "Binary Size - Untar Artifacts"
no_output_timeout: "5m"
command: |
# The artifact file is created inside docker container, which contains the result binaries.
# Now unpackage it into the project folder. The subsequent script will scan project folder
# to locate result binaries and report their sizes.
# If artifact file is not provided it assumes that the project folder has been mounted in
# the docker during build and already contains the result binaries, so this step can be skipped.
export ARTIFACTS="<< parameters.artifacts >>"
if [ -n "${ARTIFACTS}" ]; then
tar xf "${ARTIFACTS}" -C ~/project
fi
- run:
name: "Binary Size - Upload << parameters.build_type >>"
no_output_timeout: "5m"
command: |
cd ~/project
export ANDROID_BUILD_TYPE="<< parameters.build_type >>"
export COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0)
python3 .circleci/scripts/upload_binary_size_to_scuba.py android


@@ -11,6 +11,9 @@ parameters:
run_binary_tests:
type: boolean
default: false
run_build:
type: boolean
default: true
docker_config_defaults: &docker_config_defaults
user: jenkins
@@ -26,9 +29,14 @@ executors:
image: windows-server-2019-nvidia:stable
shell: bash.exe
windows-cpu-with-nvidia-cuda:
windows-xlarge-cpu-with-nvidia-cuda:
machine:
# we will change to CPU host when it's ready
resource_class: windows.xlarge
image: windows-server-2019-vs2019:stable
shell: bash.exe
windows-medium-cpu-with-nvidia-cuda:
machine:
resource_class: windows.medium
image: windows-server-2019-vs2019:stable
shell: bash.exe


@@ -1,60 +1,42 @@
binary_linux_build:
<<: *binary_linux_build_params
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- calculate_docker_image_tag
- run:
<<: *binary_checkout
- run:
<<: *binary_populate_env
- run:
name: Install unbuffer and ts
command: |
set -eux -o pipefail
source /env
OS_NAME=`awk -F= '/^NAME/{print $2}' /etc/os-release`
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
retry yum -q -y install epel-release
retry yum -q -y install expect moreutils
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
retry apt-get update
retry apt-get -y install expect moreutils
retry conda install -y -c eumetsat expect
retry conda install -y cmake
fi
- run:
name: Update compiler to devtoolset7
command: |
set -eux -o pipefail
source /env
if [[ "$DESIRED_DEVTOOLSET" == 'devtoolset7' ]]; then
source "/builder/update_compiler.sh"
# Env variables are not persisted into the next step
echo "export PATH=$PATH" >> /env
echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH" >> /env
else
echo "Not updating compiler"
fi
- run:
name: Build
no_output_timeout: "1h"
command: |
source "/pytorch/.circleci/scripts/binary_linux_build.sh"
# Preserve build log
if [ -f /pytorch/build/.ninja_log ]; then
cp /pytorch/build/.ninja_log /final_pkgs
fi
- run:
name: Output binary sizes
no_output_timeout: "1m"
command: |
ls -lah /final_pkgs
- run:
name: save binary size
no_output_timeout: "5m"
command: |
source /env
cd /pytorch && export COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0)
pip3 install requests && \
python3 -mpip install requests && \
SCRIBE_GRAPHQL_ACCESS_TOKEN=${SCRIBE_GRAPHQL_ACCESS_TOKEN} \
python3 /pytorch/.circleci/scripts/upload_binary_size_to_scuba.py || exit 0
- persist_to_workspace:
root: /
paths: final_pkgs
- store_artifacts:
path: /final_pkgs
# This should really just be another step of the binary_linux_build job above.
# This isn't possible right now b/c the build job uses the docker executor
# (otherwise they'd be really really slow) but this one uses the machine
@@ -63,11 +45,10 @@
binary_linux_test:
<<: *binary_linux_test_upload_params
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
# TODO: We shouldn't attach the workspace multiple times
- attach_workspace:
at: /home/circleci/project
- setup_linux_system_environment
@@ -83,25 +64,41 @@
- run:
<<: *binary_run_in_docker
binary_linux_upload:
<<: *binary_linux_test_upload_params
machine:
image: ubuntu-1604:201903-01
binary_upload:
parameters:
package_type:
type: string
description: "What type of package we are uploading (eg. wheel, libtorch, conda)"
default: "wheel"
upload_subfolder:
type: string
description: "What subfolder to put our package into (eg. cpu, cudaX.Y, etc.)"
default: "cpu"
docker:
- image: continuumio/miniconda3
environment:
- DRY_RUN: disabled
- PACKAGE_TYPE: "<< parameters.package_type >>"
- UPLOAD_SUBFOLDER: "<< parameters.upload_subfolder >>"
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- setup_linux_system_environment
- setup_ci_environment
- attach_workspace:
at: /home/circleci/project
- run:
<<: *binary_populate_env
- run:
<<: *binary_install_miniconda
- run:
name: Upload
no_output_timeout: "1h"
command: .circleci/scripts/binary_linux_upload.sh
- attach_workspace:
at: /tmp/workspace
- checkout
- designate_upload_channel
- run:
name: Install dependencies
no_output_timeout: "1h"
command: |
conda install -yq anaconda-client
pip install -q awscli
- run:
name: Do upload
no_output_timeout: "1h"
command: |
AWS_ACCESS_KEY_ID="${PYTORCH_BINARY_AWS_ACCESS_KEY_ID}" \
AWS_SECRET_ACCESS_KEY="${PYTORCH_BINARY_AWS_SECRET_ACCESS_KEY}" \
ANACONDA_API_TOKEN="${CONDA_PYTORCHBOT_TOKEN}" \
.circleci/scripts/binary_upload.sh
# Nightly build smoke tests defaults
# These are the second-round smoke tests. These make sure that the binaries are
@@ -111,9 +108,10 @@
smoke_linux_test:
<<: *binary_linux_test_upload_params
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
@@ -137,7 +135,7 @@
smoke_mac_test:
<<: *binary_linux_test_upload_params
macos:
xcode: "9.4.1"
xcode: "12.0"
steps:
- checkout
- run:
@@ -162,7 +160,7 @@
binary_mac_build:
<<: *binary_mac_params
macos:
xcode: "9.4.1"
xcode: "12.0"
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
@@ -176,7 +174,7 @@
- run:
name: Build
no_output_timeout: "1h"
no_output_timeout: "90m"
command: |
# Do not set -u here; there is some problem with CircleCI
# variable expansion with PROMPT_COMMAND
@@ -200,10 +198,13 @@
root: /Users/distiller/project
paths: final_pkgs
binary_mac_upload: &binary_mac_upload
- store_artifacts:
path: /Users/distiller/project/final_pkgs
binary_macos_arm64_build:
<<: *binary_mac_params
macos:
xcode: "9.4.1"
xcode: "12.3.0"
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
@@ -214,20 +215,31 @@
- brew_update
- run:
<<: *binary_install_miniconda
- attach_workspace: # TODO - we can `cp` from ~/workspace
at: /Users/distiller/project
- run:
name: Upload
no_output_timeout: "10m"
name: Build
no_output_timeout: "90m"
command: |
script="/Users/distiller/project/pytorch/.circleci/scripts/binary_macos_upload.sh"
# Do not set -u here; there is some problem with CircleCI
# variable expansion with PROMPT_COMMAND
set -ex -o pipefail
export CROSS_COMPILE_ARM64=1
script="/Users/distiller/project/pytorch/.circleci/scripts/binary_macos_build.sh"
cat "$script"
source "$script"
- persist_to_workspace:
root: /Users/distiller/project
paths: final_pkgs
- store_artifacts:
path: /Users/distiller/project/final_pkgs
binary_ios_build:
<<: *pytorch_ios_params
macos:
xcode: "11.2.1"
xcode: "12.0"
steps:
- attach_workspace:
at: ~/workspace
@@ -254,7 +266,7 @@
binary_ios_upload:
<<: *pytorch_ios_params
macos:
xcode: "11.2.1"
xcode: "12.0"
steps:
- attach_workspace:
at: ~/workspace
@@ -276,11 +288,16 @@
default: ""
executor:
type: string
default: "windows-cpu-with-nvidia-cuda"
default: "windows-xlarge-cpu-with-nvidia-cuda"
executor: <<parameters.executor>>
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- run:
name: _HACK_ Install CUDA compatible cmath
no_output_timeout: 1m
command: |
powershell .circleci/scripts/vs_install_cmath.ps1
- run:
<<: *binary_checkout
- run:
@@ -305,7 +322,7 @@
default: ""
executor:
type: string
default: "windows-cpu-with-nvidia-cuda"
default: "windows-medium-cpu-with-nvidia-cuda"
executor: <<parameters.executor>>
steps:
- checkout
@@ -324,28 +341,6 @@
cat "$script"
source "$script"
binary_windows_upload:
<<: *binary_windows_params
docker:
- image: continuumio/miniconda
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- attach_workspace:
at: /root/workspace
- run:
<<: *binary_checkout
- run:
<<: *binary_populate_env
- run:
name: Upload
no_output_timeout: "10m"
command: |
set -eux -o pipefail
script="/pytorch/.circleci/scripts/binary_windows_upload.sh"
cat "$script"
source "$script"
smoke_windows_test:
<<: *binary_windows_params
parameters:
@ -354,7 +349,7 @@
default: ""
executor:
type: string
default: "windows-cpu-with-nvidia-cuda"
default: "windows-medium-cpu-with-nvidia-cuda"
executor: <<parameters.executor>>
steps:
- checkout
@ -372,3 +367,32 @@
cat "$script"
source "$script"
anaconda_prune:
parameters:
packages:
type: string
description: "What packages are we pruning? (quoted, space-separated string. eg. 'pytorch', 'torchvision torchaudio', etc.)"
default: "pytorch"
channel:
type: string
description: "What channel are we pruning? (eq. pytorch-nightly)"
default: "pytorch-nightly"
docker:
- image: continuumio/miniconda3
environment:
- PACKAGES: "<< parameters.packages >>"
- CHANNEL: "<< parameters.channel >>"
steps:
- checkout
- run:
name: Install dependencies
no_output_timeout: "1h"
command: |
conda install -yq anaconda-client
- run:
name: Prune packages
no_output_timeout: "1h"
command: |
ANACONDA_API_TOKEN="${CONDA_PYTORCHBOT_TOKEN}" \
scripts/release/anaconda-prune/run.sh
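The prune script itself (scripts/release/anaconda-prune/run.sh) is not part of this diff. Purely as a hedged sketch, assuming anaconda-client's `anaconda remove` subcommand and the PACKAGES/CHANNEL variables exported by the job above, the pruning step might look something like this:

```
#!/usr/bin/env bash
# Hypothetical sketch -- the real run.sh is not shown in this diff.
# Assumes anaconda-client is installed and that ANACONDA_API_TOKEN,
# PACKAGES and CHANNEL are set by the job environment above.
set -eux
for pkg in ${PACKAGES}; do
  # Hypothetical helper: select the nightly versions old enough to drop.
  for version in $(list_versions_to_prune "${CHANNEL}" "${pkg}"); do
    # anaconda-client's "remove" deletes a package spec from a channel;
    # --force skips the interactive confirmation prompt.
    anaconda remove --force "${CHANNEL}/${pkg}/${version}"
  done
done
```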


@ -8,7 +8,8 @@
# then install the one with the most recent version.
update_s3_htmls: &update_s3_htmls
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
resource_class: medium
steps:
- checkout
- setup_linux_system_environment


@ -1,198 +0,0 @@
caffe2_linux_build:
<<: *caffe2_params
machine:
image: ubuntu-1604:201903-01
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- setup_linux_system_environment
- setup_ci_environment
- run:
name: Build
no_output_timeout: "1h"
command: |
set -e
cat >/home/circleci/project/ci_build_script.sh \<<EOL
# =================== The following code will be executed inside Docker container ===================
set -ex
export BUILD_ENVIRONMENT="$BUILD_ENVIRONMENT"
# Reinitialize submodules
git submodule sync && git submodule update -q --init --recursive
# conda must be added to the path for Anaconda builds (this location must be
# the same as that in install_anaconda.sh used to build the docker image)
if [[ "${BUILD_ENVIRONMENT}" == conda* ]]; then
export PATH=/opt/conda/bin:$PATH
sudo chown -R jenkins:jenkins '/opt/conda'
fi
# Build
./.jenkins/caffe2/build.sh
# Show sccache stats if it is running
if pgrep sccache > /dev/null; then
sccache --show-stats
fi
# =================== The above code will be executed inside Docker container ===================
EOL
chmod +x /home/circleci/project/ci_build_script.sh
echo "DOCKER_IMAGE: "${DOCKER_IMAGE}
time docker pull ${DOCKER_IMAGE} >/dev/null
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE})
docker cp /home/circleci/project/. $id:/var/lib/jenkins/workspace
export COMMAND='((echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && ./ci_build_script.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
# Push intermediate Docker image for next phase to use
if [ -z "${BUILD_ONLY}" ]; then
if [[ "$BUILD_ENVIRONMENT" == *cmake* ]]; then
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}-cmake-${CIRCLE_SHA1}
else
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}-${CIRCLE_SHA1}
fi
docker commit "$id" ${COMMIT_DOCKER_IMAGE}
time docker push ${COMMIT_DOCKER_IMAGE}
fi
caffe2_linux_test:
<<: *caffe2_params
machine:
image: ubuntu-1604:201903-01
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- setup_linux_system_environment
- setup_ci_environment
- run:
name: Test
no_output_timeout: "1h"
command: |
set -e
# TODO: merge this into Caffe2 test.sh
cat >/home/circleci/project/ci_test_script.sh \<<EOL
# =================== The following code will be executed inside Docker container ===================
set -ex
export BUILD_ENVIRONMENT="$BUILD_ENVIRONMENT"
# libdc1394 (dependency of OpenCV) expects /dev/raw1394 to exist...
sudo ln /dev/null /dev/raw1394
# conda must be added to the path for Anaconda builds (this location must be
# the same as that in install_anaconda.sh used to build the docker image)
if [[ "${BUILD_ENVIRONMENT}" == conda* ]]; then
export PATH=/opt/conda/bin:$PATH
fi
# Upgrade SSL module to avoid old SSL warnings
pip -q install --user --upgrade pyOpenSSL ndg-httpsclient pyasn1
pip -q install --user -b /tmp/pip_install_onnx "file:///var/lib/jenkins/workspace/third_party/onnx#egg=onnx"
# Build
./.jenkins/caffe2/test.sh
# Remove benign core dumps.
# These are tests for signal handling (including SIGABRT).
rm -f ./crash/core.fatal_signal_as.*
rm -f ./crash/core.logging_test.*
# =================== The above code will be executed inside Docker container ===================
EOL
chmod +x /home/circleci/project/ci_test_script.sh
if [[ "$BUILD_ENVIRONMENT" == *cmake* ]]; then
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}-cmake-${CIRCLE_SHA1}
else
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}-${CIRCLE_SHA1}
fi
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null
if [ -n "${USE_CUDA_DOCKER_RUNTIME}" ]; then
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --runtime=nvidia -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
else
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
fi
docker cp /home/circleci/project/. "$id:/var/lib/jenkins/workspace"
export COMMAND='((echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && ./ci_test_script.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
caffe2_macos_build:
<<: *caffe2_params
macos:
xcode: "9.4.1"
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- run_brew_for_macos_build
- run:
name: Build
no_output_timeout: "1h"
command: |
set -e
export IN_CIRCLECI=1
brew install cmake
# Reinitialize submodules
git submodule sync && git submodule update -q --init --recursive
# Reinitialize path (see man page for path_helper(8))
eval `/usr/libexec/path_helper -s`
export PATH=/usr/local/opt/python/libexec/bin:/usr/local/bin:$PATH
# Install Anaconda if we need to
if [ -n "${CAFFE2_USE_ANACONDA}" ]; then
rm -rf ${TMPDIR}/anaconda
curl --retry 3 -o ${TMPDIR}/conda.sh https://repo.anaconda.com/miniconda/Miniconda${ANACONDA_VERSION}-latest-MacOSX-x86_64.sh
chmod +x ${TMPDIR}/conda.sh
/bin/bash ${TMPDIR}/conda.sh -b -p ${TMPDIR}/anaconda
rm -f ${TMPDIR}/conda.sh
export PATH="${TMPDIR}/anaconda/bin:${PATH}"
source ${TMPDIR}/anaconda/bin/activate
fi
pip -q install numpy
# Install sccache
sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache --output /usr/local/bin/sccache
sudo chmod +x /usr/local/bin/sccache
export SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2
# This IAM user allows write access to S3 bucket for sccache
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4}
set -x
export SCCACHE_BIN=${PWD}/sccache_bin
mkdir -p ${SCCACHE_BIN}
if which sccache > /dev/null; then
printf "#!/bin/sh\nexec sccache $(which clang++) \$*" > "${SCCACHE_BIN}/clang++"
chmod a+x "${SCCACHE_BIN}/clang++"
printf "#!/bin/sh\nexec sccache $(which clang) \$*" > "${SCCACHE_BIN}/clang"
chmod a+x "${SCCACHE_BIN}/clang"
export PATH="${SCCACHE_BIN}:$PATH"
fi
# Build
if [ "${BUILD_IOS:-0}" -eq 1 ]; then
unbuffer scripts/build_ios.sh 2>&1 | ts
elif [ -n "${CAFFE2_USE_ANACONDA}" ]; then
# All conda build logic should be in scripts/build_anaconda.sh
unbuffer scripts/build_anaconda.sh 2>&1 | ts
else
unbuffer scripts/build_local.sh 2>&1 | ts
fi
# Show sccache stats if it is running
if which sccache > /dev/null; then
sccache --show-stats
fi


@ -4,7 +4,7 @@
type: string
default: ""
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
resource_class: large
environment:
IMAGE_NAME: << parameters.image_name >>
@ -13,20 +13,7 @@
DOCKER_BUILDKIT: 1
steps:
- checkout
- run:
name: Calculate docker tag
command: |
set -x
mkdir .circleci/shared
# git keeps a hash of all sub trees
echo "export DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker)" >> .circleci/shared/env_file
# Saves our calculated docker tag to our workspace for later use
- persist_to_workspace:
root: .
paths:
- .circleci/shared/
- load_shared_env:
root: .
- calculate_docker_image_tag
- run:
name: Check if image should be built
command: |
@ -35,7 +22,6 @@
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
eval $(aws ecr get-login --no-include-email --region us-east-1)
set -x
PREVIOUS_DOCKER_TAG=$(git rev-parse "$(git merge-base HEAD << pipeline.git.base_revision >>):.circleci/docker")
# Check if image already exists, if it does then skip building it
if docker manifest inspect "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/${IMAGE_NAME}:${DOCKER_TAG}"; then
circleci-agent step halt
@ -43,8 +29,15 @@
# explicitly exit the step here ourselves before it causes too much trouble
exit 0
fi
# Covers the case where a previous tag doesn't exist for the tree
# this is only really applicable on trees that don't have `.circleci/docker` at their merge base, i.e. nightly
if ! git rev-parse "$(git merge-base HEAD << pipeline.git.base_revision >>):.circleci/docker"; then
echo "Directory '.circleci/docker' not found in tree << pipeline.git.base_revision >>, you should probably rebase onto a more recent commit"
exit 1
fi
PREVIOUS_DOCKER_TAG=$(git rev-parse "$(git merge-base HEAD << pipeline.git.base_revision >>):.circleci/docker")
# If no image exists but the hash is the same as the previous hash then we should error out here
if [[ ${PREVIOUS_DOCKER_TAG} = ${DOCKER_TAG} ]]; then
if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then
echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch"
echo " contact the PyTorch team to restore the original images"
exit 1
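The tagging scheme this check relies on is just git's own tree hashing; a minimal illustration of the two commands involved, with origin/master standing in here for `<< pipeline.git.base_revision >>`:

```
# The tag is the tree hash of the Docker build context, so it only changes
# when something under .circleci/docker changes:
DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker)
# The same directory as it existed at the merge base with the target branch;
# if the two hashes match, a previously pushed image should already exist.
PREVIOUS_DOCKER_TAG=$(git rev-parse "$(git merge-base HEAD origin/master):.circleci/docker")
```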
@ -60,7 +53,7 @@
cd .circleci/docker && ./build_docker.sh
docker_for_ecr_gc_build_job:
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
- checkout
- run:
@ -113,23 +106,3 @@
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
set -x
/usr/bin/gc.py --filter-prefix ${PROJECT} --ignore-tags "${IMAGE_TAG},${GENERATED_IMAGE_TAG}"
docker_hub_index_job:
docker:
- image: 308535385114.dkr.ecr.us-east-1.amazonaws.com/gc/ecr
aws_auth:
aws_access_key_id: ${CIRCLECI_AWS_ACCESS_KEY_FOR_DOCKER_BUILDER_V1}
aws_secret_access_key: ${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
steps:
- run:
name: garbage collecting for ecr images
no_output_timeout: "1h"
command: |
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_DOCKER_BUILDER_V1}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
export DOCKER_HUB_USERNAME=${CIRCLECI_DOCKER_HUB_USERNAME}
export DOCKER_HUB_PASSWORD=${CIRCLECI_DOCKER_HUB_PASSWORD}
set -x
/usr/bin/docker_hub.py


@ -1,13 +1,39 @@
pytorch_python_doc_push:
pytorch_doc_push:
resource_class: medium
machine:
image: ubuntu-1604:202007-01
parameters:
branch:
type: string
default: "master"
steps:
- attach_workspace:
at: /tmp/workspace
- run:
name: Generate netrc
command: |
# set credentials for https pushing
cat > ~/.netrc \<<DONE
machine github.com
login pytorchbot
password ${GITHUB_PYTORCHBOT_TOKEN}
DONE
- run:
name: Docs push
command: |
pushd /tmp/workspace
git push -u origin "<< parameters.branch >>"
pytorch_python_doc_build:
environment:
BUILD_ENVIRONMENT: pytorch-python-doc-push
# TODO: stop hardcoding this
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:209062ef-ab58-422a-b295-36c4eed6e906"
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4"
resource_class: large
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
@ -15,49 +41,44 @@
no_output_timeout: "1h"
command: |
set -ex
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}-${CIRCLE_SHA1}
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}:${DOCKER_TAG}-${CIRCLE_SHA1}
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
tag=${CIRCLE_TAG:1:5}
target=${tag:-master}
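# e.g. CIRCLE_TAG=v1.8.1 yields tag=1.8.1 and target=1.8.1; on untagged builds tag is empty, so target falls back to master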
echo "building for ${target}"
time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
# master branch docs push
if [[ "${CIRCLE_BRANCH}" == "master" ]]; then
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GITHUB_PYTORCHBOT_TOKEN=${GITHUB_PYTORCHBOT_TOKEN}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && . ./.circleci/scripts/python_doc_push_script.sh docs/master master site") | docker exec -u jenkins -i "$id" bash) 2>&1'
# stable release docs push. Due to some circleci limitations, we keep
# an eternal PR open for merging v1.2.0 -> master for this job.
# XXX: The following code is only run on the v1.2.0 branch, which might
# not be exactly the same as what you see here.
elif [[ "${CIRCLE_BRANCH}" == "v1.2.0" ]]; then
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GITHUB_PYTORCHBOT_TOKEN=${GITHUB_PYTORCHBOT_TOKEN}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && . ./.circleci/scripts/python_doc_push_script.sh docs/stable 1.2.0 site dry_run") | docker exec -u jenkins -i "$id" bash) 2>&1'
# For open PRs: Do a dry_run of the docs build, don't push build
else
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GITHUB_PYTORCHBOT_TOKEN=${GITHUB_PYTORCHBOT_TOKEN}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && . ./.circleci/scripts/python_doc_push_script.sh docs/master master site dry_run") | docker exec -u jenkins -i "$id" bash) 2>&1'
fi
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && '"export CIRCLE_SHA1='$CIRCLE_SHA1'"' && . ./.circleci/scripts/python_doc_push_script.sh docs/'$target' '$target' site") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_artifacts
docker cp $id:/var/lib/jenkins/workspace/pytorch.github.io/docs/master ~/workspace/build_artifacts
docker cp $id:/var/lib/jenkins/workspace/pytorch.github.io /tmp/workspace
# Save the docs build so we can debug any problems
export DEBUG_COMMIT_DOCKER_IMAGE=${COMMIT_DOCKER_IMAGE}-debug
docker commit "$id" ${DEBUG_COMMIT_DOCKER_IMAGE}
time docker push ${DEBUG_COMMIT_DOCKER_IMAGE}
- persist_to_workspace:
root: /tmp/workspace
paths:
- .
- store_artifacts:
path: ~/workspace/build_artifacts/master
destination: docs
pytorch_cpp_doc_push:
pytorch_cpp_doc_build:
environment:
BUILD_ENVIRONMENT: pytorch-cpp-doc-push
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:209062ef-ab58-422a-b295-36c4eed6e906"
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4"
resource_class: large
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
@ -65,39 +86,36 @@
no_output_timeout: "1h"
command: |
set -ex
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}-${CIRCLE_SHA1}
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}:${DOCKER_TAG}-${CIRCLE_SHA1}
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
tag=${CIRCLE_TAG:1:5}
target=${tag:-master}
echo "building for ${target}"
time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
# master branch docs push
if [[ "${CIRCLE_BRANCH}" == "master" ]]; then
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GITHUB_PYTORCHBOT_TOKEN=${GITHUB_PYTORCHBOT_TOKEN}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && . ./.circleci/scripts/cpp_doc_push_script.sh docs/master master") | docker exec -u jenkins -i "$id" bash) 2>&1'
# stable release docs push. Due to some circleci limitations, we keep
# an eternal PR open (#16502) for merging v1.0.1 -> master for this job.
# XXX: The following code is only run on the v1.0.1 branch, which might
# not be exactly the same as what you see here.
elif [[ "${CIRCLE_BRANCH}" == "v1.0.1" ]]; then
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GITHUB_PYTORCHBOT_TOKEN=${GITHUB_PYTORCHBOT_TOKEN}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && . ./.circleci/scripts/cpp_doc_push_script.sh docs/stable 1.0.1") | docker exec -u jenkins -i "$id" bash) 2>&1'
# For open PRs: Do a dry_run of the docs build, don't push build
else
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GITHUB_PYTORCHBOT_TOKEN=${GITHUB_PYTORCHBOT_TOKEN}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && . ./.circleci/scripts/cpp_doc_push_script.sh docs/master master dry_run") | docker exec -u jenkins -i "$id" bash) 2>&1'
fi
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && '"export CIRCLE_SHA1='$CIRCLE_SHA1'"' && . ./.circleci/scripts/cpp_doc_push_script.sh docs/"$target" master") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_artifacts
docker cp $id:/var/lib/jenkins/workspace/cppdocs/ /tmp/workspace
# Save the docs build so we can debug any problems
export DEBUG_COMMIT_DOCKER_IMAGE=${COMMIT_DOCKER_IMAGE}-debug
docker commit "$id" ${DEBUG_COMMIT_DOCKER_IMAGE}
time docker push ${DEBUG_COMMIT_DOCKER_IMAGE}
- persist_to_workspace:
root: /tmp/workspace
paths:
- .
pytorch_macos_10_13_py3_build:
environment:
BUILD_ENVIRONMENT: pytorch-macos-10.13-py3-build
macos:
xcode: "9.4.1"
xcode: "12.0"
steps:
- checkout
- run_brew_for_macos_build
@ -106,7 +124,7 @@
no_output_timeout: "1h"
command: |
set -e
export IN_CIRCLECI=1
export IN_CI=1
# Install sccache
sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache --output /usr/local/bin/sccache
@ -131,7 +149,7 @@
environment:
BUILD_ENVIRONMENT: pytorch-macos-10.13-py3-test
macos:
xcode: "9.4.1"
xcode: "12.0"
steps:
- checkout
- attach_workspace:
@ -142,7 +160,7 @@
no_output_timeout: "1h"
command: |
set -e
export IN_CIRCLECI=1
export IN_CI=1
chmod a+x .jenkins/pytorch/macos-test.sh
unbuffer .jenkins/pytorch/macos-test.sh 2>&1 | ts
@ -152,13 +170,14 @@
pytorch_android_gradle_build:
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:209062ef-ab58-422a-b295-36c4eed6e906"
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
PYTHON_VERSION: "3.6"
resource_class: large
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
@ -166,7 +185,7 @@
no_output_timeout: "1h"
command: |
set -eux
docker_image_commit=${DOCKER_IMAGE}-${CIRCLE_SHA1}
docker_image_commit=${DOCKER_IMAGE}:${DOCKER_TAG}-${CIRCLE_SHA1}
docker_image_libtorch_android_x86_32=${docker_image_commit}-android-x86_32
docker_image_libtorch_android_x86_64=${docker_image_commit}-android-x86_64
@ -181,16 +200,16 @@
# x86_32
time docker pull ${docker_image_libtorch_android_x86_32} >/dev/null
export id_x86_32=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_32})
export id_x86_32=$(docker run --env-file "${BASH_ENV}" -e GRADLE_OFFLINE=1 --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_32})
export COMMAND='((echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_x86_32" bash) 2>&1'
export COMMAND='((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_x86_32" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
# arm-v7a
time docker pull ${docker_image_libtorch_android_arm_v7a} >/dev/null
export id_arm_v7a=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_arm_v7a})
export id_arm_v7a=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_arm_v7a})
export COMMAND='((echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_arm_v7a" bash) 2>&1'
export COMMAND='((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_arm_v7a" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_install_arm_v7a
@ -198,9 +217,9 @@
# x86_64
time docker pull ${docker_image_libtorch_android_x86_64} >/dev/null
export id_x86_64=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_64})
export id_x86_64=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_64})
export COMMAND='((echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_x86_64" bash) 2>&1'
export COMMAND='((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_x86_64" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_install_x86_64
@ -208,9 +227,9 @@
# arm-v8a
time docker pull ${docker_image_libtorch_android_arm_v8a} >/dev/null
export id_arm_v8a=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_arm_v8a})
export id_arm_v8a=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_arm_v8a})
export COMMAND='((echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_arm_v8a" bash) 2>&1'
export COMMAND='((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_arm_v8a" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_install_arm_v8a
@ -221,7 +240,7 @@
docker cp ~/workspace/build_android_install_arm_v8a $id_x86_32:/var/lib/jenkins/workspace/build_android_install_arm_v8a
# run gradle buildRelease
export COMMAND='((echo "source ./workspace/env" && echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GRADLE_OFFLINE=1" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec -u jenkins -i "$id_x86_32" bash) 2>&1'
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec -u jenkins -i "$id_x86_32" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_artifacts
@ -230,26 +249,9 @@
output_image=$docker_image_libtorch_android_x86_32-gradle
docker commit "$id_x86_32" ${output_image}
time docker push ${output_image}
- run:
name: save binary size
no_output_timeout: "5m"
command: |
docker_image=${DOCKER_IMAGE}-${CIRCLE_SHA1}-android-x86_32-gradle
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image})
echo "docker-id: $id"
cat \<< EOL | docker exec -u jenkins -i "$id" bash
# ============================== Begin Docker ==============================
cd workspace
source ./env
export ANDROID_BUILD_TYPE="prebuild"
export COMMIT_TIME=\$(git log --max-count=1 --format=%ct || echo 0)
export CIRCLE_BUILD_NUM="${CIRCLE_BUILD_NUM}"
export CIRCLE_SHA1="${CIRCLE_SHA1}"
export CIRCLE_BRANCH="${CIRCLE_BRANCH}"
export SCRIBE_GRAPHQL_ACCESS_TOKEN="${SCRIBE_GRAPHQL_ACCESS_TOKEN}"
python .circleci/scripts/upload_binary_size_to_scuba.py android
# ============================== End Docker ==============================
EOL
- upload_binary_size_for_android_build:
build_type: prebuilt
artifacts: /home/circleci/workspace/build_android_artifacts/artifacts.tgz
- store_artifacts:
path: ~/workspace/build_android_artifacts/artifacts.tgz
destination: artifacts.tgz
@ -257,22 +259,22 @@
pytorch_android_publish_snapshot:
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-publish-snapshot
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:209062ef-ab58-422a-b295-36c4eed6e906"
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
PYTHON_VERSION: "3.6"
resource_class: large
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- checkout
- setup_ci_environment
- run:
name: pytorch android gradle build
no_output_timeout: "1h"
command: |
set -eux
docker_image_commit=${DOCKER_IMAGE}-${CIRCLE_SHA1}
docker_image_commit=${DOCKER_IMAGE}:${DOCKER_TAG}-${CIRCLE_SHA1}
docker_image_libtorch_android_x86_32_gradle=${docker_image_commit}-android-x86_32-gradle
@ -281,9 +283,9 @@
# x86_32
time docker pull ${docker_image_libtorch_android_x86_32_gradle} >/dev/null
export id_x86_32=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_32_gradle})
export id_x86_32=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_32_gradle})
export COMMAND='((echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace" && echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export SONATYPE_NEXUS_USERNAME=${SONATYPE_NEXUS_USERNAME}" && echo "export SONATYPE_NEXUS_PASSWORD=${SONATYPE_NEXUS_PASSWORD}" && echo "export ANDROID_SIGN_KEY=${ANDROID_SIGN_KEY}" && echo "export ANDROID_SIGN_PASS=${ANDROID_SIGN_PASS}" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/publish_android_snapshot.sh") | docker exec -u jenkins -i "$id_x86_32" bash) 2>&1'
export COMMAND='((echo "sudo chown -R jenkins workspace" && echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export SONATYPE_NEXUS_USERNAME=${SONATYPE_NEXUS_USERNAME}" && echo "export SONATYPE_NEXUS_PASSWORD=${SONATYPE_NEXUS_PASSWORD}" && echo "export ANDROID_SIGN_KEY=${ANDROID_SIGN_KEY}" && echo "export ANDROID_SIGN_PASS=${ANDROID_SIGN_PASS}" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/publish_android_snapshot.sh") | docker exec -u jenkins -i "$id_x86_32" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
output_image=${docker_image_libtorch_android_x86_32_gradle}-publish-snapshot
@ -293,21 +295,14 @@
pytorch_android_gradle_build-x86_32:
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-only-x86_32
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:209062ef-ab58-422a-b295-36c4eed6e906"
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
PYTHON_VERSION: "3.6"
resource_class: large
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
- checkout
- run:
name: filter out non-PR runs
no_output_timeout: "5m"
command: |
echo "CIRCLE_PULL_REQUEST: ${CIRCLE_PULL_REQUEST:-}"
if [ -z "${CIRCLE_PULL_REQUEST:-}" ]; then
circleci step halt
fi
- calculate_docker_image_tag
- setup_linux_system_environment
- checkout
- setup_ci_environment
@ -316,14 +311,14 @@
no_output_timeout: "1h"
command: |
set -e
docker_image_libtorch_android_x86_32=${DOCKER_IMAGE}-${CIRCLE_SHA1}-android-x86_32
docker_image_libtorch_android_x86_32=${DOCKER_IMAGE}:${DOCKER_TAG}-${CIRCLE_SHA1}-android-x86_32
echo "docker_image_libtorch_android_x86_32: "${docker_image_libtorch_android_x86_32}
# x86
time docker pull ${docker_image_libtorch_android_x86_32} >/dev/null
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_32})
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_32})
export COMMAND='((echo "source ./workspace/env" && echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GRADLE_OFFLINE=1" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GRADLE_OFFLINE=1" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_x86_32_artifacts
@ -332,34 +327,57 @@
output_image=${docker_image_libtorch_android_x86_32}-gradle
docker commit "$id" ${output_image}
time docker push ${output_image}
- run:
name: save binary size
no_output_timeout: "5m"
command: |
docker_image=${DOCKER_IMAGE}-${CIRCLE_SHA1}-android-x86_32-gradle
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image})
echo "docker-id: $id"
cat \<< EOL | docker exec -u jenkins -i "$id" bash
# ============================== Begin Docker ==============================
cd workspace
source ./env
export ANDROID_BUILD_TYPE="prebuild-single"
export COMMIT_TIME=\$(git log --max-count=1 --format=%ct || echo 0)
export CIRCLE_BUILD_NUM="${CIRCLE_BUILD_NUM}"
export CIRCLE_SHA1="${CIRCLE_SHA1}"
export CIRCLE_BRANCH="${CIRCLE_BRANCH}"
export SCRIBE_GRAPHQL_ACCESS_TOKEN="${SCRIBE_GRAPHQL_ACCESS_TOKEN}"
python .circleci/scripts/upload_binary_size_to_scuba.py android
# ============================== End Docker ==============================
EOL
- upload_binary_size_for_android_build:
build_type: prebuilt-single
artifacts: /home/circleci/workspace/build_android_x86_32_artifacts/artifacts.tgz
- store_artifacts:
path: ~/workspace/build_android_x86_32_artifacts/artifacts.tgz
destination: artifacts.tgz
pytorch_android_gradle_custom_build_single:
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
PYTHON_VERSION: "3.6"
resource_class: large
machine:
image: ubuntu-1604:202007-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- checkout
- calculate_docker_image_tag
- setup_ci_environment
- run:
name: pytorch android gradle custom build single architecture (for PR)
no_output_timeout: "1h"
command: |
set -e
# Unlike other gradle jobs, it's not worth building libtorch in a separate CI job and sharing it via docker, because:
# 1) Not shareable: it's custom selective build, which is different from default libtorch mobile build;
# 2) Not parallelizable by architecture: it only builds libtorch for one architecture;
echo "DOCKER_IMAGE: ${DOCKER_IMAGE}:${DOCKER_TAG}"
time docker pull ${DOCKER_IMAGE}:${DOCKER_TAG} >/dev/null
git submodule sync && git submodule update -q --init --recursive
VOLUME_MOUNTS="-v /home/circleci/project/:/var/lib/jenkins/workspace"
export id=$(docker run --env-file "${BASH_ENV}" ${VOLUME_MOUNTS} --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE}:${DOCKER_TAG})
export COMMAND='((echo "export GRADLE_OFFLINE=1" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
# Skip docker push as this job is purely for size analysis purpose.
# Result binaries are already in `/home/circleci/project/` as it's mounted instead of copied.
- upload_binary_size_for_android_build:
build_type: custom-build-single
pytorch_ios_build:
<<: *pytorch_ios_params
macos:
xcode: "11.2.1"
xcode: "12.0"
steps:
- checkout
- run_brew_for_ios_build
@ -378,7 +396,7 @@
rm cert.txt
bundle exec fastlane install_cert
# install the provisioning profile
PROFILE=TestApp_CI.mobileprovision
PROFILE=PyTorch_CI_2021.mobileprovision
PROVISIONING_PROFILES=~/Library/MobileDevice/Provisioning\ Profiles
mkdir -pv "${PROVISIONING_PROFILES}"
cd "${PROVISIONING_PROFILES}"
@ -390,7 +408,7 @@
no_output_timeout: "1h"
command: |
set -e
export IN_CIRCLECI=1
export IN_CI=1
WORKSPACE=/Users/distiller/workspace
PROJ_ROOT=/Users/distiller/project
export TCLLIBPATH="/usr/local/lib"
@ -407,7 +425,7 @@
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
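# retry: up to five attempts total, sleeping 1s/2s/4s/8s between successive tries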
retry conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing requests --yes
retry conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi requests typing_extensions --yes
# sync submodules
cd ${PROJ_ROOT}
@ -421,6 +439,7 @@
chmod a+x ${PROJ_ROOT}/scripts/build_ios.sh
echo "IOS_ARCH: ${IOS_ARCH}"
echo "IOS_PLATFORM: ${IOS_PLATFORM}"
echo "USE_PYTORCH_METAL": "${USE_METAL}"
#check the custom build flag
echo "SELECTED_OP_LIST: ${SELECTED_OP_LIST}"
@ -429,6 +448,9 @@
fi
export IOS_ARCH=${IOS_ARCH}
export IOS_PLATFORM=${IOS_PLATFORM}
if [ ${IOS_PLATFORM} != "SIMULATOR" ]; then
export USE_PYTORCH_METAL=${USE_METAL}
fi
unbuffer ${PROJ_ROOT}/scripts/build_ios.sh 2>&1 | ts
- run:
name: Run Build Test
@ -436,7 +458,7 @@
command: |
set -e
PROJ_ROOT=/Users/distiller/project
PROFILE=TestApp_CI
PROFILE=PyTorch_CI_2021
# run the ruby build script
if ! [ -x "$(command -v xcodebuild)" ]; then
echo 'Error: xcodebuild is not installed.'
@ -475,9 +497,10 @@
pytorch_linux_bazel_build:
<<: *pytorch_params
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
@ -486,9 +509,9 @@
command: |
set -e
# Pull Docker image and run build
echo "DOCKER_IMAGE: "${DOCKER_IMAGE}
time docker pull ${DOCKER_IMAGE} >/dev/null
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE})
echo "DOCKER_IMAGE: "${DOCKER_IMAGE}:${DOCKER_TAG}
time docker pull ${DOCKER_IMAGE}:${DOCKER_TAG} >/dev/null
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE}:${DOCKER_TAG})
echo "Do NOT merge master branch into $CIRCLE_BRANCH in environment $BUILD_ENVIRONMENT"
@ -496,14 +519,14 @@
docker cp /home/circleci/project/. $id:/var/lib/jenkins/workspace
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/build.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/build.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
# Push intermediate Docker image for next phase to use
if [ -z "${BUILD_ONLY}" ]; then
# Augment our output image name with bazel to avoid collisions
output_image=${DOCKER_IMAGE}-bazel-${CIRCLE_SHA1}
output_image=${DOCKER_IMAGE}:${DOCKER_TAG}-bazel-${CIRCLE_SHA1}
export COMMIT_DOCKER_IMAGE=$output_image
docker commit "$id" ${COMMIT_DOCKER_IMAGE}
time docker push ${COMMIT_DOCKER_IMAGE}
@ -512,9 +535,10 @@
pytorch_linux_bazel_test:
<<: *pytorch_params
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
@ -522,16 +546,16 @@
no_output_timeout: "90m"
command: |
set -e
output_image=${DOCKER_IMAGE}-bazel-${CIRCLE_SHA1}
output_image=${DOCKER_IMAGE}:${DOCKER_TAG}-bazel-${CIRCLE_SHA1}
export COMMIT_DOCKER_IMAGE=$output_image
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null
if [ -n "${USE_CUDA_DOCKER_RUNTIME}" ]; then
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --runtime=nvidia -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --gpus all -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
else
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
fi
retrieve_test_reports() {
@ -541,9 +565,9 @@
trap "retrieve_test_reports" ERR
if [[ ${BUILD_ENVIRONMENT} == *"multigpu"* ]]; then
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/multigpu-test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/multigpu-test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
else
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export CIRCLE_PULL_REQUEST=${CIRCLE_PULL_REQUEST}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
fi
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
@ -555,13 +579,13 @@
pytorch_doc_test:
environment:
BUILD_ENVIRONMENT: pytorch-doc-test
# TODO: stop hardcoding this
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:209062ef-ab58-422a-b295-36c4eed6e906"
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4"
resource_class: medium
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
@ -569,9 +593,9 @@
no_output_timeout: "30m"
command: |
set -ex
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}-${CIRCLE_SHA1}
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}:${DOCKER_TAG}-${CIRCLE_SHA1}
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GITHUB_PYTORCHBOT_TOKEN=${GITHUB_PYTORCHBOT_TOKEN}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && . ./.jenkins/pytorch/docs-test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && . ./.jenkins/pytorch/docs-test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts


@ -2,12 +2,12 @@ jobs:
pytorch_linux_build:
<<: *pytorch_params
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- checkout
- optional_merge_target_branch
- setup_ci_environment
- run:
@ -15,33 +15,42 @@ jobs:
no_output_timeout: "1h"
command: |
set -e
if [[ "${DOCKER_IMAGE}" == *rocm3.9* ]]; then
export DOCKER_TAG="f3d89a32912f62815e4feaeed47e564e887dffd6"
fi
if [[ ${BUILD_ENVIRONMENT} == *"pure_torch"* ]]; then
echo 'BUILD_CAFFE2=OFF' >> "${BASH_ENV}"
fi
if [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then
echo 'ATEN_THREADING=TBB' >> "${BASH_ENV}"
echo 'USE_TBB=1' >> "${BASH_ENV}"
elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
echo 'ATEN_THREADING=NATIVE' >> "${BASH_ENV}"
fi
echo "Parallel backend flags: "${PARALLEL_FLAGS}
# Pull Docker image and run build
echo "DOCKER_IMAGE: "${DOCKER_IMAGE}
time docker pull ${DOCKER_IMAGE} >/dev/null
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE})
echo "DOCKER_IMAGE: "${DOCKER_IMAGE}:${DOCKER_TAG}
time docker pull ${DOCKER_IMAGE}:${DOCKER_TAG} >/dev/null
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE}:${DOCKER_TAG})
git submodule sync && git submodule update -q --init --recursive
docker cp /home/circleci/project/. $id:/var/lib/jenkins/workspace
if [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then
export PARALLEL_FLAGS="export ATEN_THREADING=TBB USE_TBB=1 "
elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
export PARALLEL_FLAGS="export ATEN_THREADING=NATIVE "
fi
echo "Parallel backend flags: "${PARALLEL_FLAGS}
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo '"$PARALLEL_FLAGS"' && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/build.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/build.sh && find ${BUILD_ROOT} -type f -name "*.a" -or -name "*.o" -delete") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
# Copy dist folder back
docker cp $id:/var/lib/jenkins/workspace/dist /home/circleci/project/. || echo "Dist folder not found"
# Push intermediate Docker image for next phase to use
if [ -z "${BUILD_ONLY}" ]; then
# Note [Special build images]
# The xla build uses the same docker image as
# pytorch-linux-trusty-py3.6-gcc5.4-build. In the push step, we have to
# pytorch_linux_bionic_py3_6_clang9_build. In the push step, we have to
# distinguish between them so the test can pick up the correct image.
output_image=${DOCKER_IMAGE}-${CIRCLE_SHA1}
output_image=${DOCKER_IMAGE}:${DOCKER_TAG}-${CIRCLE_SHA1}
if [[ ${BUILD_ENVIRONMENT} == *"xla"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-xla
elif [[ ${BUILD_ENVIRONMENT} == *"libtorch"* ]]; then
@ -60,20 +69,25 @@ jobs:
export COMMIT_DOCKER_IMAGE=$output_image-android-x86_32
elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-vulkan-x86_32"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-android-vulkan-x86_32
elif [[ ${BUILD_ENVIRONMENT} == *"vulkan-linux"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-vulkan
else
export COMMIT_DOCKER_IMAGE=$output_image
fi
docker commit "$id" ${COMMIT_DOCKER_IMAGE}
time docker push ${COMMIT_DOCKER_IMAGE}
fi
- store_artifacts:
path: /home/circleci/project/dist
pytorch_linux_test:
<<: *pytorch_params
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
@ -81,8 +95,12 @@ jobs:
no_output_timeout: "90m"
command: |
set -e
export PYTHONUNBUFFERED=1
if [[ "${DOCKER_IMAGE}" == *rocm3.9* ]]; then
export DOCKER_TAG="f3d89a32912f62815e4feaeed47e564e887dffd6"
fi
# See Note [Special build images]
output_image=${DOCKER_IMAGE}-${CIRCLE_SHA1}
output_image=${DOCKER_IMAGE}:${DOCKER_TAG}-${CIRCLE_SHA1}
if [[ ${BUILD_ENVIRONMENT} == *"xla"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-xla
elif [[ ${BUILD_ENVIRONMENT} == *"libtorch"* ]]; then
@ -91,30 +109,34 @@ jobs:
export COMMIT_DOCKER_IMAGE=$output_image-paralleltbb
elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-parallelnative
elif [[ ${BUILD_ENVIRONMENT} == *"vulkan-linux"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-vulkan
else
export COMMIT_DOCKER_IMAGE=$output_image
fi
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
if [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then
export PARALLEL_FLAGS="export ATEN_THREADING=TBB USE_TBB=1 "
echo 'ATEN_THREADING=TBB' >> "${BASH_ENV}"
echo 'USE_TBB=1' >> "${BASH_ENV}"
elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
export PARALLEL_FLAGS="export ATEN_THREADING=NATIVE "
echo 'ATEN_THREADING=NATIVE' >> "${BASH_ENV}"
fi
echo "Parallel backend flags: "${PARALLEL_FLAGS}
time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null
# TODO: Make this less painful
if [ -n "${USE_CUDA_DOCKER_RUNTIME}" ]; then
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --runtime=nvidia -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --gpus all --shm-size=2g -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
elif [[ ${BUILD_ENVIRONMENT} == *"rocm"* ]]; then
hostname
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size=8g --ipc=host --device /dev/kfd --device /dev/dri --group-add video -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
else
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size=1g --ipc=host -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
fi
echo "id=${id}" >> "${BASH_ENV}"
# Pass environment variables to the next step
# See https://circleci.com/docs/2.0/env-vars/#using-parameters-and-bash-environment
echo "export PARALLEL_FLAGS=\"${PARALLEL_FLAGS}\"" >> $BASH_ENV
echo "export id=$id" >> $BASH_ENV
- run:
name: Check for no AVX instruction by default
no_output_timeout: "20m"
@ -131,8 +153,8 @@ jobs:
}
if is_vanilla_build; then
echo "apt-get update && apt-get install -y qemu-user" | docker exec -u root -i "$id" bash
echo "cd workspace/build; qemu-x86_64 -cpu Broadwell -E ATEN_CPU_CAPABILITY=default ./bin/basic --gtest_filter=BasicTest.BasicTestCPU" | docker exec -u jenkins -i "$id" bash
echo "apt-get update && apt-get install -y qemu-user gdb" | docker exec -u root -i "$id" bash
echo "cd workspace/build; qemu-x86_64 -g 2345 -cpu Broadwell -E ATEN_CPU_CAPABILITY=default ./bin/basic --gtest_filter=BasicTest.BasicTestCPU & gdb ./bin/basic -ex 'set pagination off' -ex 'target remote :2345' -ex 'continue' -ex 'bt' -ex='set confirm off' -ex 'quit \$_isvoid(\$_exitcode)'" | docker exec -u jenkins -i "$id" bash
else
echo "Skipping for ${BUILD_ENVIRONMENT}"
fi
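For readability, here is the same check unwrapped from the docker exec plumbing (a sketch; the build directory, test binary, and ATEN_CPU_CAPABILITY setting come from the job's build step above):

```
# qemu-user emulates a fixed CPU model and, with -g 2345, waits for a gdb
# remote connection before executing; -E sets an env var for the guest.
qemu-x86_64 -g 2345 -cpu Broadwell -E ATEN_CPU_CAPABILITY=default \
    ./bin/basic --gtest_filter=BasicTest.BasicTestCPU &
# gdb attaches over the remote protocol, lets the test run to completion,
# and prints a backtrace if it stops (e.g. on an illegal instruction).
gdb ./bin/basic -ex 'set pagination off' -ex 'target remote :2345' \
    -ex 'continue' -ex 'bt' -ex 'set confirm off' -ex 'quit'
```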
@ -142,21 +164,61 @@ jobs:
command: |
set -e
cat >docker_commands.sh \<<EOL
# =================== The following code will be executed inside Docker container ===================
set -ex
export SCRIBE_GRAPHQL_ACCESS_TOKEN="${SCRIBE_GRAPHQL_ACCESS_TOKEN}"
${PARALLEL_FLAGS}
cd workspace
EOL
if [[ ${BUILD_ENVIRONMENT} == *"multigpu"* ]]; then
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "${PARALLEL_FLAGS}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/multigpu-test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ".jenkins/pytorch/multigpu-test.sh" >> docker_commands.sh
elif [[ ${BUILD_ENVIRONMENT} == *onnx* ]]; then
echo "pip install click mock tabulate networkx==2.0" >> docker_commands.sh
echo "pip -q install --user \"file:///var/lib/jenkins/workspace/third_party/onnx#egg=onnx\"" >> docker_commands.sh
echo ".jenkins/caffe2/test.sh" >> docker_commands.sh
else
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export CIRCLE_PULL_REQUEST=${CIRCLE_PULL_REQUEST}" && echo "${PARALLEL_FLAGS}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ".jenkins/pytorch/test.sh" >> docker_commands.sh
fi
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
echo "(cat docker_commands.sh | docker exec -u jenkins -i "$id" bash) 2>&1" > command.sh
unbuffer bash command.sh | ts
- run:
name: Report results
no_output_timeout: "5m"
command: |
set -e
docker stats --all --no-stream
echo "cd workspace; python test/print_test_stats.py test" | docker exec -u jenkins -i "$id" bash
cat >docker_commands.sh \<<EOL
# =================== The following code will be executed inside Docker container ===================
set -ex
export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}
export SCRIBE_GRAPHQL_ACCESS_TOKEN="${SCRIBE_GRAPHQL_ACCESS_TOKEN}"
export CIRCLE_TAG="${CIRCLE_TAG:-}"
export CIRCLE_SHA1="$CIRCLE_SHA1"
export CIRCLE_PR_NUMBER="${CIRCLE_PR_NUMBER:-}"
export CIRCLE_BRANCH="$CIRCLE_BRANCH"
export CIRCLE_JOB="$CIRCLE_JOB"
export CIRCLE_WORKFLOW_ID="$CIRCLE_WORKFLOW_ID"
cd workspace
python test/print_test_stats.py --upload-to-s3 test
EOL
echo "(cat docker_commands.sh | docker exec -u jenkins -i "$id" bash) 2>&1" > command.sh
unbuffer bash command.sh | ts
echo "Retrieving test reports"
docker cp $id:/var/lib/jenkins/workspace/test/test-reports ./ || echo 'No test reports found!'
if [[ ${BUILD_ENVIRONMENT} == *"coverage"* ]]; then
echo "Retrieving C++ coverage report"
docker cp $id:/var/lib/jenkins/workspace/build/coverage.info ./test
fi
if [[ ${BUILD_ENVIRONMENT} == *"coverage"* || ${BUILD_ENVIRONMENT} == *"onnx"* ]]; then
echo "Retrieving Python coverage report"
docker cp $id:/var/lib/jenkins/workspace/test/.coverage ./test
docker cp $id:/var/lib/jenkins/workspace/test/coverage.xml ./test
python3 -mpip install codecov
python3 -mcodecov
fi
when: always
- store_test_results:
path: test-reports
@ -166,7 +228,7 @@ jobs:
parameters:
executor:
type: string
default: "windows-cpu-with-nvidia-cuda"
default: "windows-xlarge-cpu-with-nvidia-cuda"
build_environment:
type: string
default: ""
@ -175,16 +237,16 @@ jobs:
default: ""
cuda_version:
type: string
default: "10"
default: "10.1"
python_version:
type: string
default: "3.6"
vc_version:
type: string
default: "14.11"
default: "14.16"
vc_year:
type: string
default: "2017"
default: "2019"
vc_product:
type: string
default: "BuildTools"
@ -195,11 +257,10 @@ jobs:
steps:
- checkout
- run:
name: Install VS2017
name: _HACK_ Install CUDA compatible cmath
no_output_timeout: 1m
command: |
if [[ "${VC_YEAR}" == "2017" ]]; then
powershell .circleci/scripts/vs_install.ps1
fi
powershell .circleci/scripts/vs_install_cmath.ps1
- run:
name: Install Cuda
no_output_timeout: 30m
@ -211,10 +272,7 @@ jobs:
name: Install Cudnn
command : |
if [[ "${USE_CUDA}" == "1" ]]; then
cd c:/
curl --retry 3 -O https://ossci-windows.s3.amazonaws.com/cudnn-10.1-windows10-x64-v7.6.4.38.zip
7z x cudnn-10.1-windows10-x64-v7.6.4.38.zip -ocudnn
cp -r cudnn/cuda/* "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.1/"
.circleci/scripts/windows_cudnn_install.sh
fi
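The inline cuDNN download above is being folded into .circleci/scripts/windows_cudnn_install.sh, which is not shown in this diff. Judging purely from the commands it replaces, a sketch of its body would be roughly the following (the real script may well parameterize the cuDNN version by CUDA_VERSION):

```
#!/bin/bash
# Sketch reconstructed from the inline commands replaced above; not the
# actual windows_cudnn_install.sh from this PR.
set -eux
cd c:/
curl --retry 3 -O https://ossci-windows.s3.amazonaws.com/cudnn-10.1-windows10-x64-v7.6.4.38.zip
7z x cudnn-10.1-windows10-x64-v7.6.4.38.zip -ocudnn
cp -r cudnn/cuda/* "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.1/"
```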
- run:
name: Build
@ -237,7 +295,7 @@ jobs:
parameters:
executor:
type: string
default: "windows-cpu-with-nvidia-cuda"
default: "windows-medium-cpu-with-nvidia-cuda"
build_environment:
type: string
default: ""
@ -246,16 +304,16 @@ jobs:
default: ""
cuda_version:
type: string
default: "10"
default: "10.1"
python_version:
type: string
default: "3.6"
vc_version:
type: string
default: "14.11"
default: "14.16"
vc_year:
type: string
default: "2017"
default: "2019"
vc_product:
type: string
default: "BuildTools"
@ -267,34 +325,27 @@ jobs:
- checkout
- attach_workspace:
at: c:/users/circleci/workspace
- run:
name: Install VS2017
command: |
if [[ "${VC_YEAR}" == "2017" ]]; then
powershell .circleci/scripts/vs_install.ps1
fi
- run:
name: Install Cuda
no_output_timeout: 30m
command: |
if [[ "${CUDA_VERSION}" != "cpu" && "${JOB_EXECUTOR}" != "windows-with-nvidia-gpu" ]]; then
.circleci/scripts/windows_cuda_install.sh
if [[ "${CUDA_VERSION}" != "cpu" ]]; then
if [[ "${CUDA_VERSION}" != "10" || "${JOB_EXECUTOR}" != "windows-with-nvidia-gpu" ]]; then
.circleci/scripts/windows_cuda_install.sh
fi
fi
- run:
name: Install Cudnn
command : |
if [[ "${CUDA_VERSION}" != "cpu" && "${JOB_EXECUTOR}" != "windows-with-nvidia-gpu" ]]; then
cd c:/
curl --retry 3 -O https://ossci-windows.s3.amazonaws.com/cudnn-10.1-windows10-x64-v7.6.4.38.zip
7z x cudnn-10.1-windows10-x64-v7.6.4.38.zip -ocudnn
cp -r cudnn/cuda/* "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.1/"
if [[ "${CUDA_VERSION}" != "cpu" ]]; then
.circleci/scripts/windows_cudnn_install.sh
fi
- run:
name: Test
no_output_timeout: "30m"
command: |
set -e
export IN_CIRCLECI=1
export IN_CI=1
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_WIN_BUILD_V1}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_WIN_BUILD_V1}


@ -11,7 +11,7 @@
- ecr_gc_job:
name: ecr_gc_job_for_pytorch
project: pytorch
tags_to_keep: "271,262,256,278,282,291,300,323,327,347,389,401,402,403,405,a8006f9a-272d-4478-b137-d121c6f05c83,6e7b11da-a919-49e5-b2ba-da66e3d4bb0a,f990c76a-a798-42bb-852f-5be5006f8026,e43973a9-9d5a-4138-9181-a08a0fc55e2f,8fcf46ef-4a34-480b-a8ee-b0a30a4d3e59,9a3986fa-7ce7-4a36-a001-3c9bef9892e2,1bc00f11-e0f3-4e5c-859f-15937dd938cd,209062ef-ab58-422a-b295-36c4eed6e906,be76e8fd-44e2-484d-b090-07e0cc3a56f0"
tags_to_keep: "271,262,256,278,282,291,300,323,327,347,389,401,402,403,405,a8006f9a-272d-4478-b137-d121c6f05c83,6e7b11da-a919-49e5-b2ba-da66e3d4bb0a,f990c76a-a798-42bb-852f-5be5006f8026,e43973a9-9d5a-4138-9181-a08a0fc55e2f,8fcf46ef-4a34-480b-a8ee-b0a30a4d3e59,9a3986fa-7ce7-4a36-a001-3c9bef9892e2,1bc00f11-e0f3-4e5c-859f-15937dd938cd,209062ef-ab58-422a-b295-36c4eed6e906,be76e8fd-44e2-484d-b090-07e0cc3a56f0,fff7795428560442086f7b2bb6004b65245dc11a,ab1632df-fa59-40e6-8c23-98e004f61148"
requires:
- docker_for_ecr_gc_build_job
- ecr_gc_job:
@ -32,4 +32,3 @@
tags_to_keep: "34"
requires:
- docker_for_ecr_gc_build_job
- docker_hub_index_job

File diff suppressed because it is too large.


@ -1,6 +1,7 @@
---
# NOTE there must be no spaces before the '-', so put the comma last.
Checks: '-*,
InheritParentConfig: true
Checks: '
bugprone-*,
-bugprone-forward-declaration-namespace,
-bugprone-macro-parentheses,
@ -17,9 +18,11 @@ cppcoreguidelines-*,
-cppcoreguidelines-pro-type-union-access,
-cppcoreguidelines-pro-type-vararg,
-cppcoreguidelines-special-member-functions,
-facebook-hte-RelativeInclude,
hicpp-exception-baseclass,
hicpp-avoid-goto,
modernize-*,
-modernize-concat-nested-namespaces,
-modernize-return-braced-init-list,
-modernize-use-auto,
-modernize-use-default-member-init,
@ -27,7 +30,7 @@ modernize-*,
-modernize-use-trailing-return-type,
performance-*,
-performance-noexcept-move-constructor,
'
'
HeaderFilterRegex: 'torch/csrc/.*'
AnalyzeTemporaryDtors: false
CheckOptions:

.flake8

@@ -12,5 +12,22 @@ ignore =
    B007,B008,
    # these ignores are from flake8-comprehensions; please fix!
    C400,C401,C402,C403,C404,C405,C407,C411,C413,C414,C415
-per-file-ignores = __init__.py: F401
-exclude = docs/src,venv,third_party,caffe2,scripts,docs/caffe2,torch/lib/include,torch/lib/tmp_install,build,torch/include,*.pyi,.git,build,build_test_custom_build,build_code_analyzer
+per-file-ignores = __init__.py: F401 torch/utils/cpp_extension.py: B950
+exclude =
+    docs/src,
+    docs/cpp/src,
+    venv,
+    third_party,
+    caffe2,
+    scripts,
+    docs/caffe2,
+    torch/lib/include,
+    torch/lib/tmp_install,
+    build,
+    torch/include,
+    *.pyi,
+    .git,
+    build,
+    build_test_custom_build,
+    build_code_analyzer,
+    test/generated_type_hints_smoketest.py


@@ -0,0 +1 @@
+Fixes #{issue number}


@@ -9,3 +9,5 @@ labels_to_circle_params:
        - release/.*
      tags:
        - v[0-9]+(\.[0-9]+)*-rc[0-9]+
+     set_to_false:
+       - run_build


@@ -0,0 +1,86 @@
#!/usr/bin/env python3

"""Generates a matrix to be utilized through github actions

Will output a condensed version of the matrix if on a pull request that only
includes the latest version of python we support built on three different
architectures:
    * CPU
    * Latest CUDA
    * Latest ROCM
"""

import json
import os
import itertools

CUDA_ARCHES = [
    "10.1",
    "10.2",
    "11.0"
]

ROCM_ARCHES = [
    "3.10",
    "4.0"
]

FULL_ARCHES = [
    "cpu",
    *CUDA_ARCHES,
    *ROCM_ARCHES
]

CONTAINER_IMAGES = {
    **{
        # TODO: Re-do manylinux CUDA image tagging scheme to be similar to
        #       ROCM so we don't have to do this replacement
        gpu_arch: f"pytorch/manylinux-cuda{gpu_arch.replace('.', '')}"
        for gpu_arch in CUDA_ARCHES
    },
    **{
        gpu_arch: f"pytorch/manylinux-rocm:{gpu_arch}"
        for gpu_arch in ROCM_ARCHES
    },
    "cpu": "pytorch/manylinux-cpu"
}

FULL_PYTHON_VERSIONS = [
    "3.6",
    "3.7",
    "3.8",
    "3.9",
]


def is_pull_request():
    return os.environ.get("GITHUB_HEAD_REF")


def generate_matrix():
    python_versions = FULL_PYTHON_VERSIONS
    arches = FULL_ARCHES
    if is_pull_request():
        python_versions = [python_versions[-1]]
        arches = ["cpu", CUDA_ARCHES[-1], ROCM_ARCHES[-1]]
    matrix = []
    for item in itertools.product(python_versions, arches):
        python_version, arch_version = item
        # Not my favorite code here
        gpu_arch_type = "cuda"
        if "rocm" in CONTAINER_IMAGES[arch_version]:
            gpu_arch_type = "rocm"
        elif "cpu" in CONTAINER_IMAGES[arch_version]:
            gpu_arch_type = "cpu"
        matrix.append({
            "python_version": python_version,
            "gpu_arch_type": gpu_arch_type,
            "gpu_arch_version": arch_version,
            "container_image": CONTAINER_IMAGES[arch_version]
        })
    return json.dumps({"include": matrix})


def main():
    print(generate_matrix())


if __name__ == "__main__":
    main()
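
For reference, on a pull request (`GITHUB_HEAD_REF` set) the script condenses the matrix to the latest Python version against three images. A sketch of the expected JSON, reconstructed from the logic above rather than captured from CI:

```python
import json

# Pull-request path of generate_matrix() above:
# python_versions == ["3.9"], arches == ["cpu", "11.0", "4.0"].
expected = {"include": [
    {"python_version": "3.9", "gpu_arch_type": "cpu",
     "gpu_arch_version": "cpu", "container_image": "pytorch/manylinux-cpu"},
    {"python_version": "3.9", "gpu_arch_type": "cuda",
     "gpu_arch_version": "11.0", "container_image": "pytorch/manylinux-cuda110"},
    {"python_version": "3.9", "gpu_arch_type": "rocm",
     "gpu_arch_version": "4.0", "container_image": "pytorch/manylinux-rocm:4.0"},
]}
print(json.dumps(expected, indent=2))
```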

.github/scripts/generate_pytorch_version.py

@@ -0,0 +1,113 @@
#!/usr/bin/env python3

import argparse
import os
import subprocess
import re

from datetime import datetime
from distutils.util import strtobool
from pathlib import Path

LEADING_V_PATTERN = re.compile("^v")
TRAILING_RC_PATTERN = re.compile("-rc[0-9]*$")
LEGACY_BASE_VERSION_SUFFIX_PATTERN = re.compile("a0$")


class NoGitTagException(Exception):
    pass


def get_pytorch_root():
    return Path(subprocess.check_output(
        ['git', 'rev-parse', '--show-toplevel']
    ).decode('ascii').strip())


def get_tag():
    root = get_pytorch_root()
    # We're on a tag
    am_on_tag = (
        subprocess.run(
            ['git', 'describe', '--tags', '--exact'],
            cwd=root,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL
        ).returncode == 0
    )
    tag = ""
    if am_on_tag:
        dirty_tag = subprocess.check_output(
            ['git', 'describe'],
            cwd=root
        ).decode('ascii').strip()
        # Strip leading v that we typically do when we tag branches
        # ie: v1.7.1 -> 1.7.1
        tag = re.sub(LEADING_V_PATTERN, "", dirty_tag)
        # Strip trailing rc pattern
        # ie: 1.7.1-rc1 -> 1.7.1
        tag = re.sub(TRAILING_RC_PATTERN, "", tag)
    return tag


def get_base_version():
    root = get_pytorch_root()
    dirty_version = open(root / 'version.txt', 'r').read().strip()
    # Strips trailing a0 from version.txt, not too sure why it's there in the
    # first place
    return re.sub(LEGACY_BASE_VERSION_SUFFIX_PATTERN, "", dirty_version)


class PytorchVersion:
    def __init__(self, gpu_arch_type, gpu_arch_version, no_build_suffix):
        self.gpu_arch_type = gpu_arch_type
        self.gpu_arch_version = gpu_arch_version
        self.no_build_suffix = no_build_suffix

    def get_post_build_suffix(self):
        if self.gpu_arch_type == "cuda":
            return f"+cu{self.gpu_arch_version.replace('.', '')}"
        return f"+{self.gpu_arch_type}{self.gpu_arch_version}"

    def get_release_version(self):
        if not get_tag():
            raise NoGitTagException(
                "Not on a git tag, are you sure you want a release version?"
            )
        return f"{get_tag()}{self.get_post_build_suffix()}"

    def get_nightly_version(self):
        date_str = datetime.today().strftime('%Y%m%d')
        build_suffix = self.get_post_build_suffix()
        return f"{get_base_version()}.dev{date_str}{build_suffix}"


def main():
    parser = argparse.ArgumentParser(
        description="Generate pytorch version for binary builds"
    )
    parser.add_argument(
        "--no-build-suffix",
        type=strtobool,
        help="Whether or not to add a build suffix typically (+cpu)",
        default=os.environ.get("NO_BUILD_SUFFIX", False)
    )
    parser.add_argument(
        "--gpu-arch-type",
        type=str,
        help="GPU arch you are building for, typically (cpu, cuda, rocm)",
        default=os.environ.get("GPU_ARCH_TYPE", "cpu")
    )
    parser.add_argument(
        "--gpu-arch-version",
        type=str,
        help="GPU arch version, typically (10.2, 4.0), leave blank for CPU",
        default=os.environ.get("GPU_ARCH_VERSION", "")
    )
    args = parser.parse_args()
    version_obj = PytorchVersion(
        args.gpu_arch_type,
        args.gpu_arch_version,
        args.no_build_suffix
    )
    try:
        print(version_obj.get_release_version())
    except NoGitTagException:
        print(version_obj.get_nightly_version())


if __name__ == "__main__":
    main()
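
A rough sketch of the script's two code paths under assumed inputs (a checkout whose version.txt reads `1.8.0a0`, no exact tag, CUDA 10.2, run on 2021-03-23; none of these values come from the diff itself):

```python
# Hypothetical session; real output depends on the checkout's tags and version.txt.
v = PytorchVersion(gpu_arch_type="cuda", gpu_arch_version="10.2",
                   no_build_suffix=False)

# Off-tag builds fall through to the nightly scheme: base "1.8.0a0" is
# stripped to "1.8.0", then .dev<date> and the arch suffix are appended.
print(v.get_nightly_version())   # -> 1.8.0.dev20210323+cu102

# On an exact tag such as v1.8.1-rc1, get_release_version() strips the
# leading "v" and the trailing "-rc1", yielding 1.8.1+cu102.
```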


@@ -0,0 +1,86 @@
name: Build Linux Wheels

on:
  # TODO: These are only runnable from workflow_dispatch, we need to eventually add
  #       a cron
  # TODO: Add an on_release trigger to build on tags
  workflow_dispatch:

jobs:
  generate-build-matrix:
    if: ${{ github.repository_owner == 'pytorch' }}
    runs-on: ubuntu-18.04
    outputs:
      matrix: ${{ steps.set-matrix.outputs.matrix }}
    container:
      image: python:3.9
    steps:
      - name: Clone pytorch/pytorch
        uses: actions/checkout@v2
      - name: Generating build matrix
        id: set-matrix
        run: |
          # outputting for debugging purposes
          python .github/scripts/generate_binary_build_matrix.py
          MATRIX=$(python .github/scripts/generate_binary_build_matrix.py)
          echo "::set-output name=matrix::${MATRIX}"
  build-wheel:
    if: ${{ github.repository_owner == 'pytorch' }}
    needs: generate-build-matrix
    runs-on: linux.2xlarge
    strategy:
      matrix:
        ${{ fromJson(needs.generate-build-matrix.outputs.matrix) }}
    container:
      image: ${{ matrix.container_image }}
    env:
      DESIRED_PYTHON: ${{ matrix.python_version }}
      # TODO: This is a legacy variable that we eventually want to get rid of in
      #       favor of GPU_ARCH_VERSION
      DESIRED_CUDA: ${{ matrix.gpu_arch_version }}
      GPU_ARCH_VERSION: ${{ matrix.GPU_ARCH_VERSION }}
      GPU_ARCH_TYPE: ${{ matrix.gpu_arch_type }}
      PYTORCH_BUILD_NUMBER: 1
      SKIP_ALL_TESTS: 1
    steps:
      - name: Clone pytorch/pytorch
        uses: actions/checkout@v2
        with:
          path: pytorch
          submodules: recursive
      - name: Clone pytorch/builder
        uses: actions/checkout@v2
        with:
          repository: pytorch/builder
          path: builder
      - name: Generate version string
        working-directory: pytorch/
        run: |
          version=$(.github/scripts/generate_pytorch_version.py)
          echo "Generated version: ${version}"
          echo "PYTORCH_BUILD_VERSION=${version}" >> $GITHUB_ENV
      # TODO: Remove this once we remove the need for the directories to be
      #       in specific locations
      - name: Symlink repositories to root directory (for legacy scripts purposes)
        run: |
          ln -s $(pwd)/pytorch /pytorch
          ln -s $(pwd)/builder /builder
      # TODO: Bundle the correct build script in the base container image so
      #       that we don't have to do this type of specification
      - name: Build PyTorch binary (CUDA specific)
        if: ${{ matrix.gpu_arch_type == 'cuda' }}
        run: |
          /builder/manywheel/build.sh
      - name: Build PyTorch binary (ROCM specific)
        if: ${{ matrix.gpu_arch_type == 'rocm' }}
        run: |
          /builder/manywheel/build_rocm.sh
      - name: Build PyTorch binary (CPU specific)
        if: ${{ matrix.gpu_arch_type == 'cpu' }}
        run: |
          /builder/manywheel/build_cpu.sh
      - uses: actions/upload-artifact@v2
        with:
          name: pytorch-wheel-py${{ matrix.python_version }}-${{matrix.gpu_arch_type}}-${{ matrix.gpu_arch_version }}
          path: /remote/**/*.whl
      # TODO: Add a step here for uploading binaries
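
The two jobs are glued together by that JSON string: `set-output` publishes it from `generate-build-matrix`, and `fromJson` expands it into `strategy.matrix`. Conceptually, the fan-out is equivalent to this Python sketch (an illustration of the mechanism, not GitHub's implementation):

```python
import json

# What generate-build-matrix published via ::set-output (one entry shown).
matrix_output = """{"include": [
    {"python_version": "3.9", "gpu_arch_type": "cpu",
     "gpu_arch_version": "cpu", "container_image": "pytorch/manylinux-cpu"}]}"""

# Each "include" entry becomes one build-wheel job, with the fields
# exposed to that job as ${{ matrix.* }}.
for combo in json.loads(matrix_output)["include"]:
    print("build-wheel: py{python_version} {gpu_arch_type}/{gpu_arch_version} "
          "in {container_image}".format(**combo))
```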


@@ -5,7 +5,7 @@ on:
jobs:
  clang-format:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-18.04
    steps:
      - name: Setup Python
        uses: actions/setup-python@v1
@@ -35,7 +35,7 @@
          HEAD_SHA=${{ github.event.pull_request.head.sha }}
          MERGE_BASE=$(git merge-base $BASE_SHA $HEAD_SHA)
-          # only run clang-format on whitelisted files
+          # only run clang-format on allowlisted files
          echo "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
          echo "| clang-format failures found! Run: "
          echo "|    tools/clang_format_ci.sh ${MERGE_BASE} "

.github/workflows/jit_triage.yml

@@ -0,0 +1,78 @@
name: jit-triage

on:
  issues:
    types: [labeled]

jobs:
  welcome:
    runs-on: ubuntu-18.04
    steps:
      - uses: actions/github-script@v2
        with:
          github-token: ${{secrets.GITHUB_TOKEN}}
          script: |
            // Arguments available:
            // - github: A pre-authenticated octokit/rest.js client
            // - context: An object containing the context of the workflow run
            // - core: A reference to the @actions/core package
            // - io: A reference to the @actions/io package

            // Check if issue has a JIT label.
            const kJitLabel = "oncall: jit";

            issue = await github.issues.get({
              owner: context.issue.owner,
              repo: context.issue.repo,
              issue_number: context.issue.number,
            })
            const hasJitLabel = issue.data.labels.filter(label => label.name == kJitLabel).length > 0;
            if (!hasJitLabel) {
              core.debug("Issue " + issue.data.title + " does not have JIT label");
              return;
            }

            // Get project column ID.
            const kProjectName = "JIT Triage";
            const kColumnName = "Need triage";

            // Query all projects in the repository.
            // TODO: Support pagination once there are > 30 projects.
            const projects = await github.projects.listForRepo({
              owner: context.issue.owner,
              repo: context.issue.repo,
            });

            // Filter out unwanted projects and get the ID for the JIT Triage project.
            const filteredProjects = projects.data.filter(project => project.name == kProjectName);
            if (filteredProjects.length != 1) {
              core.setFailed("Unable to find a project named " + kProjectName);
              return;
            }
            const projectId = filteredProjects[0].id;

            // First, query all columns in the project.
            // TODO: Support pagination once there are > 30 columns.
            const columns = await github.projects.listColumns({
              project_id: projectId,
            });

            // Filter out unwanted projects and get the ID for the Need triage column.
            const filteredColumns = columns.data.filter(column => column.name == kColumnName);
            if (filteredColumns.length != 1) {
              core.setFailed("Unable to find a column named " + kColumnName);
              return;
            }
            const columnId = filteredColumns[0].id;

            // Create a project card for this new issue.
            await github.projects.createCard({
              column_id: columnId,
              content_id: issue.data.id,
              content_type: "Issue",
            })
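
Outside of Actions, the same triage flow maps onto GitHub's classic Projects REST API. A minimal Python sketch, assuming `requests`, a token in `GITHUB_TOKEN`, and the preview media type the classic Projects endpoints required at the time; the function and its arguments are illustrative, not part of this diff:

```python
import os
import requests

API = "https://api.github.com"
session = requests.Session()
session.headers.update({
    "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
    # Classic Projects endpoints sat behind this preview media type.
    "Accept": "application/vnd.github.inertia-preview+json",
})

def file_issue_card(owner, repo, issue_number, project_name, column_name):
    # Mirror the workflow: fetch the issue, find the project and column by
    # name, then create a card pointing at the issue.
    issue = session.get(f"{API}/repos/{owner}/{repo}/issues/{issue_number}").json()
    projects = session.get(f"{API}/repos/{owner}/{repo}/projects").json()
    project = next(p for p in projects if p["name"] == project_name)
    columns = session.get(f"{API}/projects/{project['id']}/columns").json()
    column = next(c for c in columns if c["name"] == column_name)
    # Same payload the workflow passes to github.projects.createCard.
    session.post(f"{API}/projects/columns/{column['id']}/cards",
                 json={"content_id": issue["id"], "content_type": "Issue"})
```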


@@ -8,7 +8,7 @@ on:
jobs:
  quick-checks:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-18.04
    steps:
      - name: Setup Python
        uses: actions/setup-python@v1
@@ -17,17 +17,28 @@ jobs:
          architecture: x64
      - name: Checkout PyTorch
        uses: actions/checkout@v1
+      - name: Checkout PR tip
+        run: |
+          set -eux
+          if [[ "${{ github.event_name }}" == "pull_request" ]]; then
+            # We are on a PR, so actions/checkout leaves us on a merge commit.
+            # Check out the actual tip of the branch.
+            git checkout ${{ github.event.pull_request.head.sha }}
+          fi
+          echo ::set-output name=commit_sha::$(git rev-parse HEAD)
+        id: get_pr_tip
      - name: Ensure consistent CircleCI YAML config
        run: |
          pip install -r requirements.txt
          cd .circleci && ./ensure-consistency.py
-      - name: Ensure Docker version is correctly deployed
-        run: |
-          pip install pyyaml
-          .circleci/validate-docker-version.py
      - name: Shellcheck Jenkins scripts
+        # https://github.com/koalaman/shellcheck#installing-a-pre-compiled-binary
        run: |
-          sudo apt-get install -y shellcheck
+          scversion="stable"
+          wget -qO- "https://github.com/koalaman/shellcheck/releases/download/${scversion?}/shellcheck-${scversion?}.linux.x86_64.tar.xz" | tar -xJv
+          sudo cp "shellcheck-${scversion}/shellcheck" /usr/bin/
+          rm -r "shellcheck-${scversion}"
+          shellcheck --version
          .jenkins/run-shellcheck.sh
      - name: Ensure no tabs
        run: |
@@ -35,16 +46,23 @@
      - name: Ensure canonical include
        run: |
          (! git grep -I -l $'#include "' -- ./c10 ./aten ./torch/csrc ':(exclude)aten/src/ATen/native/quantized/cpu/qnnpack/**' || (echo "The above files have include with quotes; please convert them to #include <xxxx>"; false))
+      # note that this next step depends on a clean checkout;
+      # if you run it locally then it will likely complain
+      # about all the generated files in torch/test
      - name: Ensure C++ source files are not executable
        run: |
-          (! find . \( -path ./third_party -o -path ./.git -o -path ./torch/bin -o -path ./build \) -prune -o -type f -executable -regextype posix-egrep -not -regex '.+(\.(bash|sh|py|so)|git-pre-commit|git-clang-format)$' -print | grep . || (echo 'The above files have executable permission; please remove their executable permission by using `chmod -x`'; false))
+          (! find . \( -path ./third_party -o -path ./.git -o -path ./torch/bin -o -path ./build \) -prune -o -type f -executable -regextype posix-egrep -not -regex '.+(\.(bash|sh|py|so)|git-pre-commit|git-clang-format|gradlew)$' -print | grep . || (echo 'The above files have executable permission; please remove their executable permission by using `chmod -x`'; false))
      - name: C++ docs check
        run: |
          sudo apt-get install -y doxygen && pip install -r requirements.txt
          cd docs/cpp/source && ./check-doxygen.sh
+      - name: CUDA kernel launch check
+        run: |
+          set -eux
+          python torch/testing/check_kernel_launches.py |& tee ${GITHUB_WORKSPACE}/cuda_kernel_launch_checks.txt
  flake8-py3:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-18.04
    steps:
      - name: Setup Python
        uses: actions/setup-python@v1
@@ -66,23 +84,25 @@
      - name: Run flake8
        run: |
          set -eux
-          pip install flake8==3.8.2 flake8-mypy flake8-bugbear flake8-comprehensions flake8-executable flake8-pyi==20.5.0 mccabe pycodestyle==2.6.0 pyflakes==2.2.0
+          pip install -r requirements-flake8.txt
          flake8 --version
-          flake8 --exit-zero > ${GITHUB_WORKSPACE}/flake8-output.txt
-          cat ${GITHUB_WORKSPACE}/flake8-output.txt
+          flake8 | tee ${GITHUB_WORKSPACE}/flake8-output.txt
      - name: Add annotations
        uses: pytorch/add-annotations-github-action@master
        with:
          check_name: 'flake8-py3'
          linter_output_path: 'flake8-output.txt'
          commit_sha: ${{ steps.get_pr_tip.outputs.commit_sha }}
-          regex: '^(?<filename>.*?):(?<lineNumber>\d+):(?<columnNumber>\d+): (?<errorCode>\w\d+) (?<errorDesc>.*)'
+          regex: '^(?<filename>.*?):(?<lineNumber>\d+):(?<columnNumber>\d+): (?<errorCode>\w+\d+) (?<errorDesc>.*)'
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+      - name: Catch any other warnings
+        run: |
+          [ ! -s flake8-output.txt ]
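
The `\w\d+` to `\w+\d+` tweak in the annotation regex above is what lets multi-letter error codes through. A quick check with a hypothetical flake8 line (pattern transliterated to Python's `(?P<...>)` group syntax, since the annotation action itself is JavaScript):

```python
import re

line = "torch/utils/cpp_extension.py:12:1: EXE001 Shebang is present but file is not executable"

old = re.compile(r'^(?P<filename>.*?):(?P<lineNumber>\d+):(?P<columnNumber>\d+): (?P<errorCode>\w\d+) (?P<errorDesc>.*)')
new = re.compile(r'^(?P<filename>.*?):(?P<lineNumber>\d+):(?P<columnNumber>\d+): (?P<errorCode>\w+\d+) (?P<errorDesc>.*)')

# A single word character before the digits cannot match codes like EXE001,
# so the old pattern silently dropped flake8-executable findings.
print(old.match(line))                     # None
print(new.match(line).group("errorCode"))  # EXE001
```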
  clang-tidy:
    if: github.event_name == 'pull_request'
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-18.04
    steps:
      - name: Setup Python
        uses: actions/setup-python@v1
@@ -112,12 +132,12 @@ jobs:
          sudo apt-get update
          sudo apt-get --no-install-recommends -y install cuda-toolkit-10-2
          # Install dependencies
-          pip install pyyaml
+          pip install pyyaml typing_extensions
          wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
-          sudo apt-add-repository "deb http://apt.llvm.org/bionic/ llvm-toolchain-bionic-8 main"
+          sudo apt-add-repository "deb http://apt.llvm.org/bionic/ llvm-toolchain-bionic-11 main"
          sudo apt-get update
-          sudo apt-get install -y clang-tidy-8
-          sudo update-alternatives --install /usr/bin/clang-tidy clang-tidy /usr/bin/clang-tidy-8 1000
+          sudo apt-get install -y clang-tidy-11
+          sudo update-alternatives --install /usr/bin/clang-tidy clang-tidy /usr/bin/clang-tidy-11 1000
      - name: Run clang-tidy
        run: |
          set -eux
@@ -135,33 +155,45 @@
            time python setup.py --cmake-only build
            # Generate ATen files.
-            time python aten/src/ATen/gen.py \
+            time python -m tools.codegen.gen \
              -s aten/src/ATen \
-              -d build/aten/src/ATen \
-              aten/src/ATen/Declarations.cwrap \
-              aten/src/THCUNN/generic/THCUNN.h \
-              aten/src/ATen/nn.yaml \
-              aten/src/ATen/native/native_functions.yaml
+              -d build/aten/src/ATen
            # Generate PyTorch files.
            time python tools/setup_helpers/generate_code.py \
              --declarations-path build/aten/src/ATen/Declarations.yaml \
+              --native-functions-path aten/src/ATen/native/native_functions.yaml \
              --nn-path aten/src
          fi
          # Run Clang-Tidy
          # The negative filters below are to exclude files that include onnx_pb.h or
          # caffe2_pb.h, otherwise we'd have to build protos as part of this CI job.
+          # FunctionsManual.cpp is excluded to keep this diff clean. It will be fixed
+          # in a follow up PR.
+          # /torch/csrc/generic/*.cpp is excluded because those files aren't actually built.
+          # deploy/interpreter files are excluded due to using macros and other techniques
+          # that are not easily converted to accepted c++
          python tools/clang_tidy.py \
            --verbose \
            --paths torch/csrc/ \
            --diff "$MERGE_BASE" \
            -g"-torch/csrc/jit/passes/onnx/helper.cpp" \
            -g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
            -g"-torch/csrc/jit/serialization/onnx.cpp" \
            -g"-torch/csrc/jit/serialization/export.cpp" \
            -g"-torch/csrc/jit/serialization/import.cpp" \
            -g"-torch/csrc/jit/serialization/import_legacy.cpp" \
            -g"-torch/csrc/onnx/init.cpp" \
            -g"-torch/csrc/cuda/nccl.*" \
            -g"-torch/csrc/cuda/python_nccl.cpp" \
+            -g"-torch/csrc/autograd/FunctionsManual.cpp" \
+            -g"-torch/csrc/generic/*.cpp" \
+            -g"-torch/csrc/jit/codegen/cuda/runtime/*" \
+            -g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
+            -g"-torch/csrc/deploy/interpreter/interpreter.h" \
+            -g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
+            -g"-torch/csrc/deploy/interpreter/test_main.cpp" \
            "$@" > ${GITHUB_WORKSPACE}/clang-tidy-output.txt
          cat ${GITHUB_WORKSPACE}/clang-tidy-output.txt
@@ -176,7 +208,7 @@ jobs:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  cmakelint:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-18.04
    steps:
      - name: Setup Python
        uses: actions/setup-python@v1


@@ -0,0 +1,78 @@
name: quantization-triage

on:
  issues:
    types: [labeled]

jobs:
  welcome:
    runs-on: ubuntu-18.04
    steps:
      - uses: actions/github-script@v2
        with:
          github-token: ${{secrets.GITHUB_TOKEN}}
          script: |
            // Arguments available:
            // - github: A pre-authenticated octokit/rest.js client
            // - context: An object containing the context of the workflow run
            // - core: A reference to the @actions/core package
            // - io: A reference to the @actions/io package

            // Check if issue has a Quantization label.
            const kQuantizationLabel = "oncall: quantization";

            issue = await github.issues.get({
              owner: context.issue.owner,
              repo: context.issue.repo,
              issue_number: context.issue.number,
            })
            const hasQuantizationLabel = issue.data.labels.filter(label => label.name == kQuantizationLabel).length > 0;
            if (!hasQuantizationLabel) {
              core.debug("Issue " + issue.data.title + " does not have Quantization label");
              return;
            }

            // Get project column ID.
            const kProjectName = "Quantization Triage";
            const kColumnName = "Need Triage";

            // Query all projects in the repository.
            // TODO: Support pagination once there are > 30 projects.
            const projects = await github.projects.listForRepo({
              owner: context.issue.owner,
              repo: context.issue.repo,
            });

            // Filter out unwanted projects and get the ID for the Quantization Triage project.
            const filteredProjects = projects.data.filter(project => project.name == kProjectName);
            if (filteredProjects.length != 1) {
              core.setFailed("Unable to find a project named " + kProjectName);
              return;
            }
            const projectId = filteredProjects[0].id;

            // First, query all columns in the project.
            // TODO: Support pagination once there are > 30 columns.
            const columns = await github.projects.listColumns({
              project_id: projectId,
            });

            // Filter out unwanted projects and get the ID for the Need triage column.
            const filteredColumns = columns.data.filter(column => column.name == kColumnName);
            if (filteredColumns.length != 1) {
              core.setFailed("Unable to find a column named " + kColumnName);
              return;
            }
            const columnId = filteredColumns[0].id;

            // Create a project card for this new issue.
            await github.projects.createCard({
              column_id: columnId,
              content_id: issue.data.id,
              content_type: "Issue",
            })


@@ -0,0 +1,36 @@
name: 'Close stale pull requests'

on:
  schedule:
    # TODO: Reduce frequency once we work through the backlog of pull requests
    - cron: '0 * * * *'
  workflow_dispatch:

jobs:
  stale:
    if: ${{ github.repository_owner == 'pytorch' }}
    runs-on: ubuntu-18.04
    steps:
      - uses: actions/stale@v3
        with:
          stale-pr-message: >
            Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as `Stale`. <br>
            Feel free to remove the `Stale` label if you feel this was a mistake. <br>
            `Stale` pull requests will automatically be closed 30 days after being marked `Stale` <br>
          exempt-pr-labels: "no-stale,open source,high priority"
          days-before-stale: 60
          days-before-close: 90
  stale-open-source:
    if: ${{ github.repository_owner == 'pytorch' }}
    runs-on: ubuntu-18.04
    steps:
      - uses: actions/stale@v3
        with:
          stale-pr-message: >
            Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as `Stale`. <br>
            Feel free to remove the `Stale` label if you feel this was a mistake. <br>
            If you are unable to remove the `Stale` label please contact a maintainer in order to do so. <br>
            `Stale` pull requests will automatically be closed 30 days after being marked `Stale` <br>
          exempt-pr-labels: "no-stale,high priority"
          only-labels: "open source"
          days-before-stale: 150
          days-before-close: 180

Some files were not shown because too many files have changed in this diff.