Compare commits

...

2481 Commits

Author SHA1 Message Date
43404c4141 [1.7] Remove torch.vmap (#45571)
torch.vmap is a prototype feature and should not be in the stable
binary. This PR:
- Removes the `torch.vmap` API
- Removes the documentation entry for `torch.vmap`
- Changes the vmap tests to use an internal API instead of `torch.vmap`.

Test Plan:
- Tested locally (test_torch, test_type_hints, test_vmap), but also waiting
on CI.
2020-09-30 15:14:36 -05:00
cf07ba50fe Update target determinator to point to release/1.7
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
2020-09-30 09:39:23 -05:00
f5c95d5cf1 Source code level attribution in profiler (#43898)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43898

Adding a with_source parameter to enable tracking of source code
(filename and line) in the profiler for eager, TorchScript, and autograd
modes

Test Plan:
python test/test_profiler.py
```
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Source Location
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  --------------------------------------------
ts_method_1                          10.43%           235.364us        36.46%           822.920us        822.920us        1                test/test_profiler.py(70): test_source
aten::add                            7.52%            169.833us        8.88%            200.439us        200.439us        1                test/test_profiler.py(69): test_source
aten::normal_                        6.26%            141.380us        6.26%            141.380us        141.380us        1                test/test_profiler.py(67): test_source
aten::add                            5.80%            130.830us        8.41%            189.800us        63.267us         3                test/test_profiler.py(72): test_source
aten::sum                            5.02%            113.340us        8.39%            189.475us        189.475us        1                test/test_profiler.py(64): ts_method_1
aten::add                            4.58%            103.346us        6.33%            142.847us        142.847us        1                test/test_profiler.py(62): ts_method_1
aten::mul                            4.05%            91.498us         9.62%            217.113us        217.113us        1                test/test_profiler.py(71): test_source
aten::add                            4.03%            90.880us         5.60%            126.405us        126.405us        1                test/test_profiler.py(58): ts_method_2
aten::empty                          3.49%            78.735us         3.49%            78.735us         19.684us         4                test/test_profiler.py(72): test_source
```
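
A minimal sketch of how this might be exercised. The `with_source` keyword is taken from this commit's description and is illustrative; the name in the shipped API may differ:

```python
import torch
from torch.autograd import profiler

x = torch.randn(10, 10)

# 'with_source' is the flag named in this commit message; treat the
# keyword itself as an assumption.
with profiler.profile(with_source=True) as prof:
    y = x + x
    z = y.sum()

# Source locations appear as an extra column in the summary table.
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```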

Reviewed By: ngimel

Differential Revision: D23432664

Pulled By: ilia-cher

fbshipit-source-id: 83ad7ebe0c2502494d3b48c4e687802db9c77615
2020-09-30 00:57:35 -07:00
c2c7099944 Fix docs for kwargs, q-z (#43589)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43589

Reviewed By: zhangguanheng66

Differential Revision: D24006259

Pulled By: mruberry

fbshipit-source-id: 39abd474744f152648aad201d7311b42d20efc88
2020-09-29 22:57:02 -07:00
b4ba66ae32 Print tensor shapes and convolution parameters when cuDNN exception is thrown (#45023)
Summary:
Originally proposed at https://github.com/pytorch/pytorch/issues/44473#issuecomment-690670989 by colesbury .

This PR adds the functionality to print relevant tensor shapes and convolution parameters along with the stack trace once a cuDNN exception is thrown.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45023

Reviewed By: gchanan

Differential Revision: D23932661

Pulled By: ezyang

fbshipit-source-id: 5f5f570df6583271049dfc916fac36695f415331
2020-09-29 21:55:34 -07:00
93650a82c9 Move prim::tolist, math.log, and aten::cpu to lite interpreter for translation model (#45482)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45482

Working on some models that need these ops on the lite interpreter.

Test Plan: Locally built and loaded/ran the TS model without problems.

Reviewed By: iseeyuan

Differential Revision: D23906581

fbshipit-source-id: 01b9de2af2046296165892b837bc14a7e5d59b4e
2020-09-29 21:42:18 -07:00
4aca63d38a [TensorExpr] Change API for creating Load and Store expressions. (#45520)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45520

With this change, `Load`s and `Store`s no longer accept `Placeholder`s in
their constructors and `::make` functions and can only be built with
`Buf`.
`Placeholder` gets its own `store`, `load`, `storeWithMask`, and
`loadWithMask` methods for more convenient construction.

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D23998789

Pulled By: ZolotukhinM

fbshipit-source-id: 3fe018e00c1529a563553b2b215f403b34aea912
2020-09-29 20:52:38 -07:00
772ce9ac2c Fix memory corruption when running torch.svd for complex.doubles (#45486)
Summary:
According to http://www.netlib.org/lapack/explore-html/d3/da8/group__complex16_g_esing_gaccb06ed106ce18814ad7069dcb43aa27.html,
rwork should be an array of doubles, but it was allocated as an array of floats (actually ints).

Fixes crash from https://github.com/pytorch/pytorch/issues/45269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45486

Reviewed By: walterddr

Differential Revision: D23984444

Pulled By: malfet

fbshipit-source-id: 6a1b00a27de47046496ccf6a91b6e8ad283e42e6
2020-09-29 20:27:08 -07:00
ccad73ab41 Fix D23995953 import.
Summary: https://github.com/pytorch/pytorch/pull/45511 could not be properly imported

Test Plan: See https://github.com/pytorch/pytorch/pull/45511

Reviewed By: zhangguanheng66

Differential Revision: D23995953

fbshipit-source-id: a6224a67d54617ddf34c2392e65f2142c4e78ea4
2020-09-29 19:30:23 -07:00
c87ff2cb90 Enable transposed tensor copy for complex types (#45487)
Summary:
This enables a special copy operator for transposed tensors with more than 360 elements:
417e3f85e5/aten/src/ATen/native/Copy.cpp (L19)

Steps to repro: `python -c "import torch; print(torch.svd(torch.randn(61, 61, dtype=torch.complex64)))"`

Fixes https://github.com/pytorch/pytorch/issues/45269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45487

Reviewed By: anjali411

Differential Revision: D23984441

Pulled By: malfet

fbshipit-source-id: 10ce1d5f4425fb6de78e96adffd119e545b6624f
2020-09-29 19:22:05 -07:00
0a15646e15 CUDA RTX30 series support (#45489)
Summary:
I also opened a PR on cmake upstream: https://gitlab.kitware.com/cmake/cmake/-/merge_requests/5292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45489

Reviewed By: zhangguanheng66

Differential Revision: D23997844

Pulled By: ezyang

fbshipit-source-id: 4e7443dde9e70632ee429184f0d51cb9aa5a98b5
2020-09-29 18:19:23 -07:00
c1e6592964 Enable type-checking of torch.nn.quantized.* modules (#43110)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43029

I am not changing the following files in this PR:
* `torch/nn/quantized/dynamic/modules/rnn.py` due to https://github.com/pytorch/pytorch/issues/43072
* `torch/nn/quantized/modules/conv.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43110

Reviewed By: gchanan

Differential Revision: D23963258

Pulled By: ezyang

fbshipit-source-id: 0fb0fd13af283f6f7b3434e7bbf62165357d1f98
2020-09-29 18:14:29 -07:00
375a83e6c1 Annotate torch.utils.(tensorboard/show_pickle/hipify) (#44216)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44215

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44216

Reviewed By: gchanan

Differential Revision: D23963216

Pulled By: ezyang

fbshipit-source-id: b3fed51b2a1cbd05e3cd0222c89c38d61d8968c1
2020-09-29 18:14:26 -07:00
eb39542e67 Add typing annotations for torch.utils.data.* modules (#44136)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44135

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44136

Reviewed By: gchanan

Differential Revision: D23963273

Pulled By: ezyang

fbshipit-source-id: 939234dddbe89949bd8e5ff05d06f6c8add6935c
2020-09-29 18:12:05 -07:00
33aba57e4c Patch generate files for system protobuf (#44583)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42939

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44583

Reviewed By: albanD

Differential Revision: D23692639

Pulled By: ezyang

fbshipit-source-id: 49781f704dd6ceab7717b63225d0b4076ce33daa
2020-09-29 18:06:33 -07:00
22a34bcf4e ROCm ❤️ TensorExpr (#45506)
Summary:
This might be an alternative to reverting https://github.com/pytorch/pytorch/issues/45396 .
The obvious rough edge is that I'm not really seeing the work group limits that TensorExpr produces.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45506

Reviewed By: zhangguanheng66

Differential Revision: D23991410

Pulled By: Krovatkin

fbshipit-source-id: 11d3fc4600e4bffb1d1192c6b8dd2fe22c1e064e
2020-09-29 16:52:16 -07:00
637570405b Disable multi tensor tests on ROCm (#45535)
Summary:
Disable multi tensor tests on ROCm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45535

Reviewed By: ngimel

Differential Revision: D24002557

Pulled By: izdeby

fbshipit-source-id: 608c9389e3d9cd7dac49ea42c9bb0af55662c754
2020-09-29 15:49:21 -07:00
06a566373a [PyTorch/NCCL] Fix async error handling (#45456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45456

Remove work while not holding the lock, to avoid a deadlock with the watchdog thread while the GPU is at 100% utilization.

SyncBatchNorm failure trace: P143879560

Test Plan:
**Desync test:**
BACKEND=nccl WORLD_SIZE=3 NCCL_ASYNC_ERROR_HANDLING=1 ./buck-out/gen/caffe2/test/distributed/distributed_nccl_spawn#binary.par -r test_DistributedDataParallel_desync

**SyncBatchNorm test:**
BACKEND=nccl WORLD_SIZE=3 NCCL_ASYNC_ERROR_HANDLING=1 ./buck-out/gen/caffe2/test/distributed/distributed_nccl_fork#binary.par -r test_DistributedDataParallel_SyncBatchNorm_Diff_Input_Sizes_gradient

Reviewed By: osalpekar

Differential Revision: D23972071

fbshipit-source-id: f03d9637a6ec998d64dab1a062a81e0f3697275f
2020-09-29 15:44:34 -07:00
ef41472544 Create experimental FX graph manipulation library (#44775)
Summary:
This PR adds a new GraphManipulation library for operating on the GraphModule nodes.
It also adds an implementation of replace_target_nodes_with, which replaces all nodes in the GraphModule matching a specific op/target with a new specified op/target. An example use of this function would be replacing a generic operator with an optimized operator for specific sizes and shapes; a hand-rolled sketch of that kind of rewrite follows.
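
As a rough illustration, written against the public torch.fx surface (the actual signature of `replace_target_nodes_with` lives in the new GraphManipulation library and may differ), here is a hand-rolled version of that kind of node rewrite:

```python
import torch
import torch.fx

class M(torch.nn.Module):
    def forward(self, x, y):
        return torch.add(x, y)

gm = torch.fx.symbolic_trace(M())

# Swap every call to torch.add for torch.mul, standing in for an
# "optimized operator for specific sizes and shapes".
for node in gm.graph.nodes:
    if node.op == "call_function" and node.target is torch.add:
        node.target = torch.mul
gm.recompile()

print(gm(torch.tensor(2.0), torch.tensor(3.0)))  # tensor(6.)
```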

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44775

Reviewed By: jamesr66a

Differential Revision: D23874561

Pulled By: gcatron

fbshipit-source-id: e1497cd11e0bbbf1fabdf137d65c746248998e0b
2020-09-29 15:32:41 -07:00
d642992877 Quantized operators template selective (#45509)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45509

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44479

Test Plan: Imported from OSS

Reviewed By: dhruvbird

Differential Revision: D23626562

Pulled By: iseeyuan

fbshipit-source-id: c2fc8bad25f8e5e9a70eb1001b9066a711b8e8e7
2020-09-29 14:52:27 -07:00
ab5cf16b6c fix standard deviation gradient NaN behavior (#45468)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/4320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45468

Reviewed By: zhangguanheng66

Differential Revision: D23991064

Pulled By: albanD

fbshipit-source-id: d4274895f2dac8b2cdbd73e5276ce3df466fc341
2020-09-29 13:47:29 -07:00
18876b5722 Update backward formula for torch.dot and add backward definition for torch.vdot (#45074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45074

TODO: Add R -> C tests in https://github.com/pytorch/pytorch/pull/44744 (blocked on some JIT changes)

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D23975361

Pulled By: anjali411

fbshipit-source-id: 3512bd2962b588a198bc317673bd18cc96ac823f
2020-09-29 12:52:03 -07:00
147c88ef2d Add docs to a pytorch.github.io/doc/tag directory when repo is tagged (#45204)
Summary:
In coordination with jlin27.

This PR is meant to build documentation when the repo is tagged. For instance, tagging the repo with 1.7.0rc1 will push that commit's documentation to pytorch/pytorch.github.io/docs/1.7.

Subsequently tagging 1.7.0rc2 will override the 1.7 docs, as will 1.7.0 and 1.7.1. I think this is as it should be: there should be a single, latest version of the 1.7 docs. This can be tweaked differently if desired.

There is probably work that needs to be done to adjust the [versions.html](https://pytorch.org/docs/versions.html) to add the new tag?

Is there a way to test the tagging side of this without breaking the production documentation?

As an aside, the documentation is being built via the `pytorch_linux_xenial_py3_6_gcc5_4_build` image. Some projects are starting to move on from Python 3.6 since [it is in security-only support mode](https://devguide.python.org/#status-of-python-branches): no new binaries are being released.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45204

Reviewed By: zhangguanheng66

Differential Revision: D23996800

Pulled By: seemethere

fbshipit-source-id: a94a080348a47738c1de5832ab37b2b0d57d2d57
2020-09-29 12:31:30 -07:00
b66ac1e928 Updates nonzero's as_tuple behavior to no longer warn. (#45413)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44284.

[torch.nonzero](https://pytorch.org/docs/master/generated/torch.nonzero.html?highlight=nonzero#torch.nonzero) is distinct from [numpy.nonzero](https://numpy.org/doc/1.18/reference/generated/numpy.nonzero.html?highlight=nonzero#numpy.nonzero). The former returns a tensor by default, and the latter returns a tuple of arrays. The `as_tuple` argument was added as part of an intended deprecation process to make torch.nonzero consistent with numpy.nonzero, but this was a confusing change for users. A better deprecation path would be to offer torch.argwhere consistent with [numpy.argwhere](https://numpy.org/doc/stable/reference/generated/numpy.argwhere.html?highlight=argwhere#numpy.argwhere), which is equivalent to the default torch.nonzero behavior. Once this is offered, a change to torch.nonzero should be more straightforward with less user disruption, if we decide that's the correct change to pursue.
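
For reference, a quick illustration of the two behaviors discussed above:

```python
import torch

t = torch.tensor([[0.0, 1.0], [2.0, 0.0]])

# Default: one (n, d) tensor of indices, one row per nonzero element.
print(torch.nonzero(t))
# tensor([[0, 1],
#         [1, 0]])

# as_tuple=True: a tuple of 1-D index tensors, matching numpy.nonzero.
print(torch.nonzero(t, as_tuple=True))
# (tensor([0, 1]), tensor([1, 0]))
```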

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45413

Reviewed By: ngimel

Differential Revision: D23975015

Pulled By: mruberry

fbshipit-source-id: b59237d0d8c2df984e952b62d0a7c247b49d84dc
2020-09-29 12:16:59 -07:00
0df99ad470 Remove unnecessary __at_align32__ in int_elementwise_binary_256 (#45470)
Summary:
They were added in 4b3046ed286e92b5910769bf97f2bc6a1ad473d1 based on a
misunderstanding of `_mm256_storeu_si256`, but they
are actually unnecessary. The [documentation][1] of `_mm256_storeu_si256` says:

> Moves values from a integer vector to an **unaligned** memory location.

In this case, it's better to remove the `__at_align32__` qualifier to
give the compiler and linker more flexibility to optimize.

[1]: https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-vector-extensions/intrinsics-for-load-and-store-operations-1/mm256-storeu-si256.html

Close https://github.com/pytorch/pytorch/issues/44810

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45470

Reviewed By: zhangguanheng66

Differential Revision: D23980060

Pulled By: glaringlee

fbshipit-source-id: 12b3558b76c6e81d88a72081060fdb8674464768
2020-09-29 11:55:25 -07:00
6e55a26e10 Move mobile specific CPUCachingAllocator to c10/mobile folder. (#45364)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45364

Plus add some more comments about the usage, limitations and cons.

Test Plan: Build and run benchmark binary.

Reviewed By: gchanan

Differential Revision: D23944193

fbshipit-source-id: 30d4f4991d2185a0ab768d94c846d73730fc0835
2020-09-29 11:33:26 -07:00
b2925671b6 Updates deterministic flag to throw a warning, makes docs consistent (#45410)
Summary:
Per feedback in the recent design review. Also tweaks the documentation to clarify what "deterministic" means and adds a test for the behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45410

Reviewed By: ngimel

Differential Revision: D23974988

Pulled By: mruberry

fbshipit-source-id: e48307da9c90418fc6834fbd67b963ba2fe0ba9d
2020-09-29 11:17:33 -07:00
aa2bd7e1ae Conservative-ish persistent RNN heuristics for compute capability 8.0+ (#43165)
Summary:
Based on https://github.com/pytorch/pytorch/pull/43165#issuecomment-697033663 and tests by Vasily Volkov ([persistentRNN-speedup.xlsx](https://github.com/pytorch/pytorch/files/5298001/persistentRNN-speedup.xlsx)).  See comments in code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43165

Reviewed By: zhangguanheng66, mruberry

Differential Revision: D23991756

Pulled By: ngimel

fbshipit-source-id: 4c2c14c9002be2fec76fb21ba55b7dab79497510
2020-09-29 11:14:55 -07:00
f47fd0eb72 Updated cholesky_backward for complex inputs (#45267)
Summary:
Updated `cholesky_backward` to work correctly for complex input.
Note that the current implementation gives the conjugate of what JAX would return. anjali411, is that the correct thing to do?
Ref. https://github.com/pytorch/pytorch/issues/44895

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45267

Reviewed By: bwasti

Differential Revision: D23975269

Pulled By: anjali411

fbshipit-source-id: 9908b0bb53c411e5ad24027ff570c4f0abd451e6
2020-09-29 11:07:32 -07:00
15f85eea18 Support bfloat16 and complex dtypes for logical_not (#43537)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43537

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23751950

Pulled By: mruberry

fbshipit-source-id: d07ecd9aae263eb8e00928d4fc981e0d66066fbb
2020-09-29 11:00:05 -07:00
ea59251f51 Fix model_name not logged properly issue. (#45488)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45488

model_name logging was broken; the issue came from a recent change that assigned the method name to the module name. This diff fixes it.
ghstack-source-id: 113103942

Test Plan:
Made sure that the model_name is now logged from module_->name().
Verified with one model that does not contain the model metadata; the model_name field is logged as below:

09-28 21:59:30.065 11530 12034 W module.cpp: TESTINGTESTING run() module = __torch__.Model
09-28 21:59:30.065 11530 12034 W module.cpp: TESTINGTESTING metadata does not have model_name assigning to __torch__.Model
09-28 21:59:30.066 11530 12034 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod log  model_name = __torch__.Model
09-28 21:59:30.066 11530 12034 W MobileModuleQPLObserver.cpp: TESTINGTESTING onEnterRunMethod log  method_name = labels
09-28 21:59:30.068 11530 12034 W MobileModuleQPLObserver.cpp: TESTINGTESTING onExitRunMethod()

Reviewed By: linbinyu

Differential Revision: D23984165

fbshipit-source-id: 5b00f50ea82106b695c2cee14029cb3b2e02e2c8
2020-09-29 10:37:36 -07:00
09b3e16b40 [JIT] Enable @unused syntax for ignoring properties (#45261)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45261

**Summary**
This commit enables `unused` syntax for ignoring
properties. Ignoring properties is more intuitive with this feature enabled.
`ignore` is not supported because class type properties cannot be
executed in Python the way an `ignored` function can (they exist only as
TorchScript types), and module properties that cannot be scripted
are not added to the `ScriptModule` wrapper so that they
may still execute in Python.

**Test Plan**
This commit updates the existing unit tests for class type and module
properties to test properties ignored using `unused`.

Test Plan: Imported from OSS

Reviewed By: navahgar, Krovatkin, mannatsingh

Differential Revision: D23971881

Pulled By: SplitInfinity

fbshipit-source-id: 8d3cc1bbede7753d6b6f416619e4660c56311d33
2020-09-29 10:24:25 -07:00
5f49d14be2 Add mobile_optimized tag to optimized model. (#45479)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45479

Add a top-level boolean attribute called mobile_optimized to the model that is set to true if the model has been optimized.

Test Plan: buck test //caffe2/test:mobile passes

Reviewed By: kimishpatel

Differential Revision: D23956728

fbshipit-source-id: 79c5931702208b871454319ca2ab8633596b1eb8
2020-09-29 10:06:57 -07:00
17be7c6e5c [vulkan][android][test_app] Add test_app variant that runs module on Vulkan (#44897)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44897

Test Plan: Imported from OSS

Reviewed By: dreiss

Differential Revision: D23763770

Pulled By: IvanKobzarev

fbshipit-source-id: 6ad16b7271c745313a71da64a629a764258bbc85
2020-09-29 10:00:46 -07:00
2c300fd74c [android][vulkan] Module load argument to specify device cpu/vulkan (#44896)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44896

Test Plan: Imported from OSS

Reviewed By: dreiss

Differential Revision: D23763771

Pulled By: IvanKobzarev

fbshipit-source-id: 990a386ad13c704f03345dbe09e180281af913c9
2020-09-29 09:58:22 -07:00
fe9019cbfe Reorganized Sorting.cpp method order (#45083)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45083

This PR just reorders the methods in Sorting.cpp placing related methods next to each other.

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D23908817

Pulled By: heitorschueroff

fbshipit-source-id: 1dd7b693b5135fddf5dff12303474e85ce0c2f83
2020-09-29 09:49:31 -07:00
ab5edf21b0 Revert D23789657: [wip] fast typeMeta/ScalarType conversion approach 2
Test Plan: revert-hammer

Differential Revision:
D23789657 (1ed1a2f5b0)

Original commit changeset: 5afdd52d24bd

fbshipit-source-id: 6d827be8895bcb39c8e85342eee0f7a3f5056c76
2020-09-29 09:40:53 -07:00
b3135c2056 Enable torch.cuda.amp typechecking (#45480)
Summary:
Fix `torch._C._autocast_*_nesting` declarations in __init__.pyi

Fix iterable constructor logic: not every iterable can be constructed using the `type(val)(val)` trick; for example, it would not work for `val=range(10)` although `isinstance(val, Iterable)` is True (see the snippet below)
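
A quick demonstration of that failure mode:

```python
from collections.abc import Iterable

val = range(10)
assert isinstance(val, Iterable)

try:
    type(val)(val)  # equivalent to range(range(10)) -- the old reconstruction trick
except TypeError as e:
    print(e)  # 'range' object cannot be interpreted as an integer
```
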
Change optional resolution logic to meet mypy expectations

Fixes https://github.com/pytorch/pytorch/issues/45436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45480

Reviewed By: walterddr

Differential Revision: D23982822

Pulled By: malfet

fbshipit-source-id: 6418a28d04ece1b2427dcde4b71effb67856a872
2020-09-29 09:31:55 -07:00
df0de780c3 Add cusolver guard for cuda >= 10.1.243 (#45452)
Summary:
See https://github.com/pytorch/pytorch/issues/45403

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45452

Reviewed By: mruberry

Differential Revision: D23977009

Pulled By: ngimel

fbshipit-source-id: df66425773d7500fa37e64d5e4bcc98167016be3
2020-09-29 09:25:20 -07:00
bb19a55429 Improves fft doc consistency and makes deprecation warnings more prominent (#45409)
Summary:
This PR makes the deprecation warnings for existing fft functions more prominent and makes the torch.stft deprecation warning consistent with our current deprecation planning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45409

Reviewed By: ngimel

Differential Revision: D23974975

Pulled By: mruberry

fbshipit-source-id: b90d8276095122ac3542ab625cb49b991379c1f8
2020-09-29 09:07:49 -07:00
0a38aed025 Auto set libuv_ROOT env var for Gloo submodule on Windows platform (#45484)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45484

Reviewed By: lw

Differential Revision: D23990724

Pulled By: mrshenli

fbshipit-source-id: 1987ce7eb7d3f9d3120c07e954cd6581cd3caf59
2020-09-29 08:58:56 -07:00
6d37126a10 Makes rdiv consistent with div (#45407)
Summary:
In addition to making rdiv consistent with div, this PR significantly expands division testing, accounting for floor_divide actually performing truncation division, too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45407

Reviewed By: ngimel

Differential Revision: D23974967

Pulled By: mruberry

fbshipit-source-id: 82b46b07615603f161ab7cd1d3afaa6d886bfe95
2020-09-29 08:34:01 -07:00
7cde662f08 Add check for Complex Type to allow non integral alpha. (#45200)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45184

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45200

Reviewed By: gchanan

Differential Revision: D23940134

Pulled By: anjali411

fbshipit-source-id: cce7b1efc22ec189ba6c83e31ce712bb34997139
2020-09-29 07:36:46 -07:00
0806c58e9f Optimize view_as_complex and view_as_real (#44908)
Summary:
This avoids unnecessary memory allocations in `view_as_complex` and `view_as_real`. I construct the new tensor directly with the existing storage to avoid creating a new storage object and also use `DimVector`s to avoid allocating for the sizes and strides. Overall, this saves about 2 us of overhead from `torch.fft.fft` which currently has to call `view_as_real` and `view_as_complex` for every call.

I've used this simple benchmark to measure the overhead:
```python
In [1]: import torch
   ...: a = torch.rand(1, 2)
   ...: ac = torch.view_as_complex(a)
   ...: %timeit torch.view_as_real(ac)
   ...: %timeit torch.view_as_complex(a)
   ...: %timeit ac.real
```

Results before:
```
2.5 µs ± 62.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
2.22 µs ± 36 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.17 µs ± 8.76 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

and after:
```
1.83 µs ± 9.26 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.57 µs ± 7.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
3.47 µs ± 34.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44908

Reviewed By: agolynski

Differential Revision: D23793479

Pulled By: anjali411

fbshipit-source-id: 64b9cad70e3ec10891310cbfa8c0bdaa1d72885b
2020-09-29 07:30:38 -07:00
87f98a5b54 Updates torch.floor_divide documentation to clarify it's actually torch.trunc_divide (or torch.rtz_divide) (#45411)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/43874 for 1.7. 1.8 will need to take floor_divide through a proper deprecation process.
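
A small example of the behavior the updated documentation clarifies, assuming the truncation semantics named in the title:

```python
import torch

a = torch.tensor([-7.0])
b = torch.tensor([2.0])

# floor_divide currently truncates toward zero rather than flooring:
print(torch.floor_divide(a, b))  # tensor([-3.])

# A true floor division would round toward negative infinity:
print(torch.floor(a / b))        # tensor([-4.])
```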

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45411

Reviewed By: ngimel

Differential Revision: D23974997

Pulled By: mruberry

fbshipit-source-id: 16dd07e50a17ac76bfc93bd6b71d4ad72d909bf4
2020-09-29 05:55:44 -07:00
37f9af7f29 Missing tests about torch.xxx(out=...) (#44465)
Summary:
PR opened just to run the CI tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44465

Reviewed By: ngimel

Differential Revision: D23907565

Pulled By: mruberry

fbshipit-source-id: 620661667877f1e9a2bab17d19988e2dc986fc0f
2020-09-29 04:54:46 -07:00
56af122659 Revert D23966878: [pytorch][PR] This PR flips a switch to enable PE + TE
Test Plan: revert-hammer

Differential Revision:
D23966878 (dddb685c11)

Original commit changeset: 2010a0b07c59

fbshipit-source-id: 132556039730fd3e4babd0d7ca8daf9c8d14f728
2020-09-29 04:33:19 -07:00
1ed1a2f5b0 [wip] fast typeMeta/ScalarType conversion approach 2 (#44965)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44965

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23789657

Pulled By: bhosmer

fbshipit-source-id: 5afdd52d24bd097891ff4a7313033f7bd400165e
2020-09-29 02:39:36 -07:00
489af4ddcb [quant] Add quant APIs to save/load observer state_dict (#44846)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44846

The save function traverses the model state_dict to pick out the observer stats.
The load function traverses the module hierarchy to load the state dict into module attributes, depending on the observer type.

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_save_observer_state_dict

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23746821

fbshipit-source-id: 05c571b62949a2833602d736a81924d77e7ade55
2020-09-29 01:52:42 -07:00
bb478810e0 [quant] torch.max_pool1d (#45152)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45152

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23846473

Pulled By: z-a-f

fbshipit-source-id: 38fd611e568e4f8b39b7a00adeb42c7b99576360
2020-09-29 01:45:22 -07:00
b86008ab75 [TensorExpr] Remove buf_ field from class Tensor. (#45390)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45390

Tensor objects should always refer to their Function's bufs. Currently
we never create a Tensor with a buffer different from that of its function,
but having it in two places seems incorrect and dangerous.

Differential Revision: D23952865

Test Plan: Imported from OSS

Reviewed By: nickgg

Pulled By: ZolotukhinM

fbshipit-source-id: e63fc26d7078427514649d9ce973b74ea635a94a
2020-09-29 01:21:57 -07:00
3c33695a6d [TensorExpr] Rename Buffer to Placeholder. (#45389)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45389

Differential Revision: D23952866

Test Plan: Imported from OSS

Reviewed By: nickgg

Pulled By: ZolotukhinM

fbshipit-source-id: 17eedd3ac17897501403482ac1866c569d247c75
2020-09-29 01:21:54 -07:00
92306b85d5 [TensorExpr] Consolidate {buffer,function,tensor}.{h.cpp} in tensor.{h,cpp}. (#45388)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45388

Classes defined in these files are closely related, so it is reasonable
to have them all in one file. The change is purely a code move.

Differential Revision: D23952867

Test Plan: Imported from OSS

Reviewed By: nickgg

Pulled By: ZolotukhinM

fbshipit-source-id: 12cfaa968bdfc4dff00509e34310a497c7b59155
2020-09-29 01:17:10 -07:00
d2623da52c replaced whitelist with allowlist (#45260)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41754

**(1)**
Initially the file was named **gen_op_registration_whitelist.py**; I changed it to **gen_op_registration_allowlist.py**.

**(2)**
There were some instances of **whitelist** in comments inside the file; I changed them to **allowlist**.
![update1](https://user-images.githubusercontent.com/62737243/94106752-b296e780-fe59-11ea-8541-632a1dbf90d6.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45260

Reviewed By: dhruvbird

Differential Revision: D23947182

Pulled By: ljk53

fbshipit-source-id: 31b486592451dbb0605d7950e07747cbb72ab80f
2020-09-29 00:27:46 -07:00
8c309fc052 Add more tests for mt optimizers (#45475)
Summary:
Add more test cases for mt optimizers and fix Adam/AdamW

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45475

Reviewed By: soumith

Differential Revision: D23982727

Pulled By: izdeby

fbshipit-source-id: 4b24d37bd52a2fa3719d3e3a5dcf3b96990b0f5b
2020-09-28 23:59:58 -07:00
6bdb871d47 [FX] Lint pass for Graphs (#44973)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44973

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23792631

Pulled By: jamesr66a

fbshipit-source-id: d8faef0c311d8bd611ba0a7e1e2f353e3e5a1068
2020-09-28 23:00:32 -07:00
b0bdc82a00 [FX][EZ] Fix bug where copying node made non-unique name (#45311)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45311

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Differential Revision: D23917864

Pulled By: jamesr66a

fbshipit-source-id: 10d0a4017ffe160bce4ba0d830e035616bbded74
2020-09-28 22:55:20 -07:00
417e3f85e5 Support tuple inputs in NN Module test (#44853)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44853

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23750441

Pulled By: glaringlee

fbshipit-source-id: 1b111a370a726b40521134b711c35f48dda99411
2020-09-28 22:05:05 -07:00
dddb685c11 This PR flips a switch to enable PE + TE (#45396)
Summary:
This PR flips a switch to enable PE + TE
Next PR: https://github.com/pytorch/pytorch/pull/45397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45396

Reviewed By: suo

Differential Revision: D23966878

Pulled By: Krovatkin

fbshipit-source-id: 2010a0b07c595992a88b3fe0792d6af315cf421e
2020-09-28 21:57:50 -07:00
50b91103a9 add self cuda time to avoid double/quadruple counting (#45209)
Summary:
In the profiler, CUDA did not report self time, so for composite functions there was no way to determine which function was really taking time. In addition, the reported "total CUDA time" was frequently more than the total wallclock time. This PR adds "self CUDA time" to the profiler and computes total CUDA time based on self CUDA time, similar to how it is done for CPU. Also, slight formatting changes to make the table more compact. Before:
```
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                  Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
aten::matmul          0.17%            890.805us        99.05%           523.401ms        5.234ms          49.91%           791.184ms        7.912ms          100
aten::mm              98.09%           518.336ms        98.88%           522.511ms        5.225ms          49.89%           790.885ms        7.909ms          100
aten::t               0.29%            1.530ms          0.49%            2.588ms          25.882us         0.07%            1.058ms          10.576us         100
aten::view            0.46%            2.448ms          0.46%            2.448ms          12.238us         0.06%            918.936us        4.595us          200
aten::transpose       0.13%            707.204us        0.20%            1.058ms          10.581us         0.03%            457.802us        4.578us          100
aten::empty           0.14%            716.056us        0.14%            716.056us        7.161us          0.01%            185.694us        1.857us          100
aten::as_strided      0.07%            350.935us        0.07%            350.935us        3.509us          0.01%            156.380us        1.564us          100
aten::stride          0.65%            3.458ms          0.65%            3.458ms          11.527us         0.03%            441.258us        1.471us          300
--------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 528.437ms
CUDA time total: 1.585s

Recorded timeit time:  789.0814 ms

```
Note that the recorded timeit time (with proper CUDA syncs) is half the "CUDA time total" reported by the profiler.

After:
```
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
        aten::matmul         0.15%     802.716us        99.06%     523.548ms       5.235ms     302.451us         0.04%     791.151ms       7.912ms           100
            aten::mm        98.20%     519.007ms        98.91%     522.745ms       5.227ms     790.225ms        99.63%     790.848ms       7.908ms           100
             aten::t         0.27%       1.406ms         0.49%       2.578ms      25.783us     604.964us         0.08%       1.066ms      10.662us           100
          aten::view         0.45%       2.371ms         0.45%       2.371ms      11.856us     926.281us         0.12%     926.281us       4.631us           200
     aten::transpose         0.15%     783.462us         0.22%       1.173ms      11.727us     310.016us         0.04%     461.282us       4.613us           100
         aten::empty         0.11%     591.603us         0.11%     591.603us       5.916us     176.566us         0.02%     176.566us       1.766us           100
    aten::as_strided         0.07%     389.270us         0.07%     389.270us       3.893us     151.266us         0.02%     151.266us       1.513us           100
        aten::stride         0.60%       3.147ms         0.60%       3.147ms      10.489us     446.451us         0.06%     446.451us       1.488us           300
--------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 528.498ms
CUDA time total: 793.143ms

Recorded timeit time:  788.9832 ms

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45209

Reviewed By: zou3519

Differential Revision: D23925491

Pulled By: ngimel

fbshipit-source-id: 7f9c49238d116bfd2db9db3e8943355c953a77d0
2020-09-28 21:51:13 -07:00
35596d39e9 Coalesce TLS accesses in RecordFunction constructor (#44970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44970

Right now, when RecordFunction is not active (the usual case),
we do two TLS accesses (a check for thread local callbacks, and a check for
a thread local boolean).
This experiments with reducing the number of TLS accesses in the
RecordFunction constructor.

Test Plan: record_function_benchmark

Reviewed By: dzhulgakov

Differential Revision: D23791165

Pulled By: ilia-cher

fbshipit-source-id: 6137ce4bface46f540ece325df9864fdde50e0a4
2020-09-28 21:42:23 -07:00
5a6a31168f add circle ci job name dimension to report test stats (#45457)
Summary:
To support anomaly detection for test-time spikes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45457

Reviewed By: malfet

Differential Revision: D23975628

Pulled By: walterddr

fbshipit-source-id: f28d0f12559070004d637d5bde83289f029b15b8
2020-09-28 20:51:58 -07:00
5be954b502 Fix WorkerInfo link format (#45476)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45476

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D23982069

Pulled By: mrshenli

fbshipit-source-id: 6d932e77c1941dfd96592b388353f0fc8968dde6
2020-09-28 20:48:15 -07:00
8e47fcba5f Update docs for RPC async_execution (#45458)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45458

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23973366

Pulled By: mrshenli

fbshipit-source-id: 3697f07fa972db21746aa25eaf461c1b93293f58
2020-09-28 20:48:12 -07:00
c5ade5f698 Fix no_sync docs (#45455)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45455

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23973365

Pulled By: mrshenli

fbshipit-source-id: 87c9878cdc7310754670b83efa65ae6f877f86fb
2020-09-28 20:48:09 -07:00
6967e6295e Fix DDP docs (#45454)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45454

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23973367

Pulled By: mrshenli

fbshipit-source-id: 11f20d51d0d0f92f199e4023f02b86623867bae0
2020-09-28 20:43:22 -07:00
534f2ae582 Disable inplace abs for complex tensors (#45069)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45069

`torch.abs` is a `C -> R` function for complex input. Following the general semantics in torch, the in-place version of abs should be disabled for complex input.
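
A minimal sketch of the semantics this change enforces (the exact error message is not quoted here):

```python
import torch

z = torch.tensor([3 + 4j])
print(torch.abs(z))  # tensor([5.]) -- a real-valued result from complex input

# The in-place variant would have to store a real result in a complex
# tensor, so with this change it should raise a RuntimeError instead:
try:
    z.abs_()
except RuntimeError as e:
    print(e)
```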

Test Plan: Imported from OSS

Reviewed By: glaringlee, malfet

Differential Revision: D23818397

Pulled By: anjali411

fbshipit-source-id: b23b8d0981c53ba0557018824d42ed37ec13d4e2
2020-09-28 20:33:35 -07:00
208df1aeb8 Use python 3.8 in pytorch docker image (#45466)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45466

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D23975294

Pulled By: tierex

fbshipit-source-id: 964de7928b541121963e9de792630bcef172bb5c
2020-09-28 19:21:40 -07:00
8c66cd120b Disable complex inputs to torch.round (#45330)
Summary:
- Related with https://github.com/pytorch/pytorch/issues/44612
- Disable complex inputs to `torch.round`
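
A quick sketch of the new behavior (the exact error message is not quoted here):

```python
import torch

z = torch.tensor([1.2 + 3.4j])
try:
    torch.round(z)  # with this change, round rejects complex inputs
except RuntimeError as e:
    print(e)
```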

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45330

Reviewed By: gchanan

Differential Revision: D23970781

Pulled By: anjali411

fbshipit-source-id: b8c9ac315ae0fc872701aa132367c3171fd56185
2020-09-28 19:07:01 -07:00
0c8a6008ac Fix torch.pow when the scalar base is a complex number (#45259)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43829
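
For illustration, the call pattern this fix targets is a Python complex scalar base with a tensor exponent:

```python
import torch

# With the fix, promotion yields a complex result,
# e.g. tensor([0.+2.j, -4.+0.j]) for the call below.
print(torch.pow(2j, torch.tensor([1.0, 2.0])))
```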

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45259

Reviewed By: gchanan

Differential Revision: D23962073

Pulled By: anjali411

fbshipit-source-id: 1b16afbb98f33fa7bc53c6ca296c5ddfcbdd2b72
2020-09-28 18:25:53 -07:00
a0f0cb1608 [JIT] Add test for ignored class type property (#45233)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45233

**Summary**
This commit modifies `TestClassType.test_properties` to check that
properties on class types can be ignored with the same syntax as
ignoring properties on `Modules`.

**Test Plan**
`python test/test_jit.py TestClassType.test_properties`

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23971885

Pulled By: SplitInfinity

fbshipit-source-id: f2228f61fe26dff219024668cc0444a2baa8834c
2020-09-28 18:22:19 -07:00
4af4b71fdc [JIT] Update docs for recently added features (#45232)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45232

**Summary**
This commit updates the TorchScript language reference to include
documentation on recently-added TorchScript enums. It also removed
`torch.no_grad` from the list of known unsupported `torch` modules and
classes because it is now supported.

**Test Plan**
Continuous integration.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23971884

Pulled By: SplitInfinity

fbshipit-source-id: 5e2c164ed59bc0926b11201106952cff86e9356e
2020-09-28 18:17:42 -07:00
52cbc9e4ec [TensorExpr] Always inline and DCE in the LLVM backend (#45445)
Summary:
Inline the generated kernel into the wrapper, which is especially helpful in combination
with dead code elimination to reduce IR size and compilation times when
a lot of parameters are unused.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45445

Test Plan: CI

Reviewed By: ZolotukhinM

Differential Revision: D23969009

Pulled By: asuhan

fbshipit-source-id: a21509d07e4c130b6aa6eae5236bb64db2748a3d
2020-09-28 18:11:13 -07:00
7ac872b934 [JIT] Modify to_backend API so that it accepts wrapped modules (#43612)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43612

**Summary**
This commit modifies the `torch._C._jit_to_backend` function so that it
accepts `ScriptModules` as inputs. It already returns `ScriptModules`
(as opposed to C++ modules), so this makes sense and makes the API more
intuitive.

**Test Plan**
Continuous integration, which includes unit tests and out-of-tree tests
for custom backends.

**Fixes**
This commit fixes #41432.

Test Plan: Imported from OSS

Reviewed By: suo, jamesr66a

Differential Revision: D23339854

Pulled By: SplitInfinity

fbshipit-source-id: 08ecef729c4e1e6bddf3f483276947fc3559ea88
2020-09-28 17:17:01 -07:00
5855aa8dac Type check quasirandom (#45434)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42978.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45434

Reviewed By: walterddr

Differential Revision: D23967139

Pulled By: ajitmaths

fbshipit-source-id: bcee6627f367fd01aa9a5c10a7c24331fc1823ad
2020-09-28 16:49:38 -07:00
49b198c454 type check for torch.testing._internal.common_utils (#45375)
Summary:
part of torch.testing._internal.* effort

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45375

Reviewed By: malfet

Differential Revision: D23964315

Pulled By: walterddr

fbshipit-source-id: efdd643297f5c7f75670ffe60ff7e82fc413d18d
2020-09-28 16:28:46 -07:00
96f8755034 Fixed handling of nan for evenly_distribute_backward (#45280)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45280

Performance is the same on CPU, and on CUDA it is only 1-1.05x slower. This change is necessary for the future nan ops including nan(min|max|median)

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D23908796

Pulled By: heitorschueroff

fbshipit-source-id: c2b57acbe924cfa59fbd85216811f29f4af05088
2020-09-28 15:57:02 -07:00
6a206df891 20000x faster audio conversion for SummaryWriter (#44201)
Summary:
Stumbled upon a little gem in the audio conversion for `SummaryWriter.add_audio()`: two Python `for` loops to convert a float array to little-endian int16 samples. On my machine, this took 35 seconds for a 30-second 22.05 kHz excerpt. The same can be done directly in numpy in 1.65 milliseconds. (No offense, I'm glad that the functionality was there!)

I would also be ready to extend this to support stereo waveforms, or should that become a separate PR?
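
A hedged sketch of the vectorized numpy approach described above (the exact clipping/scaling details in the merged code may differ):

```python
import numpy as np

def float_to_int16_le(samples: np.ndarray) -> np.ndarray:
    """Convert float audio in [-1, 1] to little-endian int16 in one vectorized step."""
    clipped = np.clip(samples, -1.0, 1.0)
    return (clipped * 32767.0).astype("<i2")  # '<i2' = little-endian int16

# One second of a 22.05 kHz sine wave, converted in a single call.
audio = np.sin(np.linspace(0.0, 2.0 * np.pi * 440.0, 22050)).astype(np.float32)
print(float_to_int16_le(audio)[:5])
```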

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44201

Reviewed By: J0Nreynolds

Differential Revision: D23831002

Pulled By: edward-io

fbshipit-source-id: 5c8f1ac7823d1ed41b53c4f97ab9a7bac33ea94b
2020-09-28 15:44:29 -07:00
e54e1fe51e [package] Add dependency viz (#45214)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45214

When in verbose mode, the package exporter will produce an HTML visualization
of a module's dependencies to make it easier to trim out unneeded code,
or to debug the inclusion of things that cannot be exported.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D23873525

Pulled By: zdevito

fbshipit-source-id: 6801991573d8dd5ab8c284e09572b36a35e1e5a4
2020-09-28 15:38:41 -07:00
331ebaf7cb [Distributed] Adding Python tests for the TCPStore getNumKeys and deleteKey (#45402)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45402

Previous diffs in this stack implemented the getNumKeys and deleteKey
APIs in the c10d Store as well as added tests at the C++ layer. This diff adds
tests at the Python level in test_c10d.py
ghstack-source-id: 112997161

Test Plan: Running these new python tests as well as previous C++ tests

Reviewed By: mrshenli

Differential Revision: D23955729

fbshipit-source-id: c7e0af7c884de2d488320e2a1d94aec801a782e5
2020-09-28 15:35:24 -07:00
6b65b3cbd8 [Distributed] DeleteKey API for c10d TCP Store (#45401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45401

Added a DeleteKey API for the TCP Store
ghstack-source-id: 112997162

Test Plan:
Modified the existing get/set test to use delete. Verified that the
correct keys were deleted and that the numKeys API returned the right values.

Reviewed By: mrshenli

Differential Revision: D23955730

fbshipit-source-id: 5c9f82be34ff4521c59f56f8d9c1abf775c67f9f
2020-09-28 15:30:39 -07:00
190f91e3db Adding Histogram Binning Calibration to DSNN and Adding Type Double to Caffe2 ParallelSumOp/SumReluOp
Summary: As title.

Test Plan:
FBL job without this diff failed:
f221545832

Error message:
```
NonRetryableException: AssertionError: Label is missing in training stage for HistogramBinningCalibration
```

FBL job with canary package built in this diff is running without failure:
f221650379

Reviewed By: chenshouyuan

Differential Revision: D23959508

fbshipit-source-id: c077230de29f7abfd092c84747eaabda0b532bcc
2020-09-28 15:21:31 -07:00
1097fe0088 Remove CriterionTest.test_cuda code for dtype None. (#45316)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45316

It's never used.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23919449

Pulled By: gchanan

fbshipit-source-id: f9aaeeabf3940389156bfc01bc3118d348ca4cf6
2020-09-28 15:08:09 -07:00
a4486fe7ba [ROCm] Print name irrespective of seq number assignment for roctx traces (#45229)
Summary:
Recent changes to the seq_num correlation behavior in the profiler (PR https://github.com/pytorch/pytorch/issues/42565) have changed the behavior of emit_nvtx(record_shapes=True), which no longer prints the name of the operator properly.

This PR dumps out the name in roctx traces irrespective of the assigned sequence number, for ROCm only.
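
For context, the affected path is the nvtx/roctx emission context manager, exercised like this (requires a GPU build; on ROCm the ranges are emitted as roctx markers):

```python
import torch
from torch.autograd.profiler import emit_nvtx

x = torch.randn(64, 64, device="cuda")
with emit_nvtx(record_shapes=True):
    y = x.mm(x)
torch.cuda.synchronize()
```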

cc: jeffdaily sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45229

Reviewed By: zou3519

Differential Revision: D23932902

Pulled By: albanD

fbshipit-source-id: c782667ff002b70b51f1cc921afd1b1ac533b39d
2020-09-28 15:03:47 -07:00
c6b7eeb654 Gh/taylorrobie/timer cleanup (#45361)
Summary:
This PR cleans up some of the rough edges around `Timer` and `Compare`
* Moves `Measurement` to be dataclass based
* Adds a bunch of type annotations. MyPy is now happy.
* Allows missing entries in `Compare`. This is one of the biggest usability issues with `Compare` right now, both from an API perspective and because the current failure mode is really unpleasant.
* Greatly expands the testing of `Compare`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45361

Test Plan: Changes to Timer are covered under existing tests, changes to `Compare` are covered by the expanded `test_compare` method.

Reviewed By: bwasti

Differential Revision: D23966816

Pulled By: robieta

fbshipit-source-id: 826969f73b42f72fa35f4de3c64d0988b61474cd
2020-09-28 14:56:43 -07:00
a77d633db1 [ONNX] Fix view for dynamic input shape (#43558)
Summary:
Export of the view op with a dynamic input shape is broken when using tensors that have a dimension of size 0.
This fix removes the symbolic's use of the static input size to fix this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43558

Reviewed By: ailzhang

Differential Revision: D23965090

Pulled By: bzinodev

fbshipit-source-id: 628e9d7ee5d53375f25052340ca6feabf7ba7c53
2020-09-28 14:46:51 -07:00
5d1fee23b3 Remove convert_target from NN tests. (#45291)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45291

It's not necessary; you can just check whether the dtype is integral.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23911963

Pulled By: gchanan

fbshipit-source-id: 230139e1651eb76226f4095e31068dded30e03e8
2020-09-28 14:21:42 -07:00
986af53be2 type check for torch.testing._internal.codegen (#45368)
Summary:
part of `torch.testing._internal.*` effort

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45368

Reviewed By: malfet

Differential Revision: D23950512

Pulled By: walterddr

fbshipit-source-id: 399f712d12cdd9795b0136328f512c3f86a15f24
2020-09-28 14:04:52 -07:00
7a4c417ed3 Fix typo (#45379)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45379

Registeres -> Registers in reducer.h.
ghstack-source-id: 112982279

Test Plan: N/A

Reviewed By: mrshenli

Differential Revision: D23951203

fbshipit-source-id: 96c7dc2e1e12c132339b9ac83ce1da52c812740c
2020-09-28 14:02:01 -07:00
57c18127dc [ONNX] Update div export to perform true divide (#44831)
Summary:
related https://github.com/pytorch/pytorch/issues/43787

Now that PyTorch div is actually performing true divide, update onnx export code to stay consistent.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44831

Reviewed By: eellison

Differential Revision: D23880316

Pulled By: bzinodev

fbshipit-source-id: 3bb8db34142ac4fed4039295ad3c4cb79487987f
2020-09-28 13:53:43 -07:00
9163e8171e Adding Type Double to Caffe2 Mean Op
Summary: Adding support for type double to caffe2 MeanOp and MeanGradientOp.

Test Plan:
All tests passed.

Example FBL job failed without this diff:
f221169563

Error message:
```
c10::Error: [enforce fail at mean_op.h:72] . Mean operator only supports 32-bit float, but input was of type double (Error from operator:
input: "dpsgd_8/Copy_3" input: "dpsgd_8/Copy_4" output: "dpsgd_8/Mean_2" name: "" type: "Mean" device_option { device_type: 0 device_id: 0 })
```

Example FBL job is running without failure with the canary package built from this diff:
f221468723

Reviewed By: chenshouyuan

Differential Revision: D23956222

fbshipit-source-id: 6c81bbc390d812ae0ac235e7d025141c8402def1
2020-09-28 13:35:29 -07:00
47debdca42 Document change for DDP enabled on Windows platform (#45392)
Summary:
Document change for DDP enabled on Windows platform

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45392

Reviewed By: gchanan

Differential Revision: D23962344

Pulled By: mrshenli

fbshipit-source-id: 8924c6ca36d68699871d8add3e0aab6542ea269c
2020-09-28 13:22:42 -07:00
722faeb2a4 [RELAND] Added optimizers based on multi tensor apply (#45408)
Summary:
Original PR: https://github.com/pytorch/pytorch/pull/45299. The present PR fixes minor bugs that caused the revert.

Adding a new namespace `torch.optim._multi_tensor` with a bunch of updated optimizers. These optimizers use the _foreach APIs, which improve performance significantly.

### Tests
- updated existing tests to use both optimizers
- added `test_multi_tensor_optimizers` test to verify correctness.

### Perf results

**Adam**
timeit: 42.69 ms --> 10.16 ms
autorange: 41.96 ms --> 10.28 ms

**AdamW**
timeit: 51.38 ms --> 15.63 ms
autorange: 50.82 ms --> 16.07 ms

**SGD**
timeit: 6.28 ms --> 4.40 ms
autorange: 6.13 ms --> 4.73 ms

**RMSprop**
timeit: 28.63 ms --> 5.89 ms
autorange: 28.27 ms -->  5.76 ms

**Rprop**
timeit: 213.30 --> 178.42
autorange: 212.03 --> 178.03

**ASGD**
timeit: 21.67 --> 9.33
autorange: 21.64 --> 9.27

**Adamax**
timeit: 55.60 --> 48.29
autorange: 55.22 -> 49.13

**Perf script used**

```
import torch
import time
import torch.optim as optim
from torch.autograd import Variable
from torch.optim.lr_scheduler import ExponentialLR, ReduceLROnPlateau, StepLR
import torch.nn as nn
import time
import torchvision
import torch.utils._benchmark as benchmark_utils

device = "cuda"
model = torchvision.models.resnet.resnet101(pretrained=True).to(device)
targets = torch.randint(0, 1000, (100, 100), device=device)
criterion = nn.CrossEntropyLoss()

optimizer = optim.SGD(model.parameters(), lr=1e-3) # <----------------------- optimizer.
                                                          # would compare optim.SGD vs optim._multi_tensor.SGD
running_loss = 0.0
target = torch.empty(128, dtype=torch.long, device=device).random_(5)

optimizer.zero_grad()
inputs = torch.rand(128, 3, 100, 100, device=device , requires_grad=True)
outputs = model(inputs)
loss = criterion(outputs, target)
loss.backward()
optimizer.step()
running_loss += loss.item()

def main():
    timer = benchmark_utils.Timer(
        stmt="optimizer.step()",
        globals=globals(),
        label="str(optimizer)",
    )

    for i in range(1):
        print(f"Run: {i}\n{'-' * 40}")
        print(f"timeit:\n{timer.timeit(1000)}\n")
        print(f"autorange:\n{timer.blocked_autorange()}\n\n")

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45408

Reviewed By: gchanan

Differential Revision: D23956680

Pulled By: izdeby

fbshipit-source-id: c5eab7bf5fce14a287c15cead1cdc26e42cfed94
2020-09-28 13:14:04 -07:00
87b356d093 [static runtime] Split out graph preparation from runtime (#44131)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44131

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604305

Pulled By: bwasti

fbshipit-source-id: 7b47da4961d99074199417ef1407a788c7d80ee6
2020-09-28 13:01:23 -07:00
6ab1c0b1ca Disable a few tests in preparation for enabling PE+TE (#44815)
Summary:
Disable a few tests in preparation for enabling PE+TE.
Next PR: https://github.com/pytorch/pytorch/pull/45396

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44815

Reviewed By: ZolotukhinM

Differential Revision: D23948445

Pulled By: Krovatkin

fbshipit-source-id: 93e641b7b8a3f13bd3fd3840116076553408f224
2020-09-28 12:55:12 -07:00
36c3fbc9e3 CUDA BFloat Conv (non-cuDNN) (#45007)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45007

Reviewed By: zou3519

Differential Revision: D23933174

Pulled By: ngimel

fbshipit-source-id: 84eb028f09c9197993fb9981c0efb535014e5f78
2020-09-28 11:42:42 -07:00
03342af3a3 Add env variable to bypass CUDACachingAllocator for debugging (#45294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45294

While tracking down a recent memory corruption bug we found that
cuda-memcheck wasn't finding the bad accesses, and ngimel pointed out that
it's because we use a caching allocator so a lot of "out of bounds" accesses
land in a valid slab.

This PR adds a runtime knob (`PYTORCH_NO_CUDA_MEMORY_CACHING`) that, when set,
bypasses the caching allocator's caching logic so that allocations go straight
to cudaMalloc.  This way, cuda-memcheck will actually work.
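
A minimal sketch of using the knob from Python (setting it before `torch` is imported is the safe pattern, since it must take effect before the first CUDA allocation):

```
import os

os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"  # set before CUDA initializes

import torch

x = torch.empty(1024, device="cuda")  # now served directly by cudaMalloc
```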

Test Plan:
Insert some memory errors and run a test under cuda-memcheck;
observe that cuda-memcheck flags an error where expected.

Specifically I removed the output-masking logic here:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/cuda_codegen.cpp#L819-L826

And ran:
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 cuda-memcheck pytest -k test_superslomo test_jit_fuser_te.py
```

Reviewed By: ngimel

Differential Revision: D23964734

Pulled By: bertmaher

fbshipit-source-id: 04efd11e8aff037b9edde80c70585cb820ee6e39
2020-09-28 11:40:04 -07:00
993628c74a Build shape expressions and remove outputs that are only used by aten::sizes (#45080)
Summary:
Currently, TE materializes all intermediate results even if they are only used for computing their shapes. This diff ports the approach the OF (Old Fuser) took to deal with this issue. Namely, given the structure of a fusion group, we infer all the sizes outside the fusion group based on the fusion group's inputs.

A simple example would be:

```
        def test_fuse(a, b):
            c = a + b
            d = c + b
            return d
```

Here we don't need to cache `c` as computing a gradient for `b` in `d = c + b` doesn't need it. We do need to compute sizes for all arguments here in case broadcasts happen.

Without this optimization, TE would need to materialize `c` so we can get its size:

```
[DUMP profiling_graph_executor_impl.cpp:499] Optimized Graph:
[DUMP profiling_graph_executor_impl.cpp:499] graph(%a.1 : Tensor,
[DUMP profiling_graph_executor_impl.cpp:499]       %b.1 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:499]   %11 : Tensor = prim::DifferentiableGraph_0(%b.1, %a.1)
[DUMP profiling_graph_executor_impl.cpp:499]   return (%11)
[DUMP profiling_graph_executor_impl.cpp:499] with prim::DifferentiableGraph_0 = graph(%11 : Tensor,
[DUMP profiling_graph_executor_impl.cpp:499]       %13 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:499]   %59 : int[] = aten::size(%13) # <string>:3:44
[DUMP profiling_graph_executor_impl.cpp:499]   %62 : int[] = aten::size(%11) # <string>:3:93
[DUMP profiling_graph_executor_impl.cpp:499]   %83 : Double(1:1, requires_grad=0, device=cuda:0), %84 : Double(1:1, requires_grad=0, device=cuda:0), %85 : bool = prim::TypeCheck(%11, %13)
[DUMP profiling_graph_executor_impl.cpp:499]   %86 : Tensor, %87 : Tensor = prim::If(%85)
[DUMP profiling_graph_executor_impl.cpp:499]     block0():
[DUMP profiling_graph_executor_impl.cpp:499]       %d.4 : Double(1:1, requires_grad=0, device=cuda:0), %c.4 : Double(1:1, requires_grad=0, device=cuda:0) = prim::TensorExprGroup_0(%83, %84)
[DUMP profiling_graph_executor_impl.cpp:499]       -> (%d.4, %c.4)
[DUMP profiling_graph_executor_impl.cpp:499]     block1():
[DUMP profiling_graph_executor_impl.cpp:499]       %94 : Function = prim::Constant[name="fallback_function", fallback=1]()
[DUMP profiling_graph_executor_impl.cpp:499]       %95 : (Tensor, Tensor) = prim::CallFunction(%94, %11, %13)
[DUMP profiling_graph_executor_impl.cpp:499]       %96 : Tensor, %97 : Tensor = prim::TupleUnpack(%95)
[DUMP profiling_graph_executor_impl.cpp:499]       -> (%96, %97)
[DUMP profiling_graph_executor_impl.cpp:499]   %60 : int[] = aten::size(%87) # <string>:3:55
[DUMP profiling_graph_executor_impl.cpp:499]   %61 : int[]? = aten::_size_if_not_equal(%59, %60) # <string>:3:19
[DUMP profiling_graph_executor_impl.cpp:499]   %64 : int[]? = aten::_size_if_not_equal(%62, %60) # <string>:3:68
[DUMP profiling_graph_executor_impl.cpp:499]   %67 : int[] = aten::size(%86) # <string>:3:55
[DUMP profiling_graph_executor_impl.cpp:499]   %68 : int[]? = aten::_size_if_not_equal(%60, %67) # <string>:3:19
[DUMP profiling_graph_executor_impl.cpp:499]   %71 : int[]? = aten::_size_if_not_equal(%62, %67) # <string>:3:68
[DUMP profiling_graph_executor_impl.cpp:499]   return (%86, %61, %64, %68, %71)
[DUMP profiling_graph_executor_impl.cpp:499] with prim::TensorExprGroup_0 = graph(%1 : Double(1:1, requires_grad=0, device=cuda:0),
[DUMP profiling_graph_executor_impl.cpp:499]       %4 : Double(1:1, requires_grad=0, device=cuda:0)):
[DUMP profiling_graph_executor_impl.cpp:499]   %5 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:499]   %c.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%4, %1, %5) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2872:16
[DUMP profiling_graph_executor_impl.cpp:499]   %2 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:499]   %d.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%c.3, %1, %2) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2873:16
[DUMP profiling_graph_executor_impl.cpp:499]   return (%d.3, %c.3)
```

With this optimization, we use `prim::BroadcastSizes` to compute the size of `c`, so there is no need to materialize it:

```
[DUMP profiling_graph_executor_impl.cpp:499] Optimized Graph:
[DUMP profiling_graph_executor_impl.cpp:499] graph(%a.1 : Tensor,
[DUMP profiling_graph_executor_impl.cpp:499]       %b.1 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:499]   %11 : Tensor = prim::DifferentiableGraph_0(%b.1, %a.1)
[DUMP profiling_graph_executor_impl.cpp:499]   return (%11)
[DUMP profiling_graph_executor_impl.cpp:499] with prim::DifferentiableGraph_0 = graph(%11 : Tensor,
[DUMP profiling_graph_executor_impl.cpp:499]       %13 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:499]   %59 : int[] = aten::size(%13) # <string>:3:44
[DUMP profiling_graph_executor_impl.cpp:499]   %62 : int[] = aten::size(%11) # <string>:3:93
[DUMP profiling_graph_executor_impl.cpp:499]   %88 : Double(1:1, requires_grad=0, device=cuda:0), %89 : Double(1:1, requires_grad=0, device=cuda:0), %90 : bool = prim::TypeCheck(%11, %13)
[DUMP profiling_graph_executor_impl.cpp:499]   %91 : Tensor = prim::If(%90)
[DUMP profiling_graph_executor_impl.cpp:499]     block0():
[DUMP profiling_graph_executor_impl.cpp:499]       %d.4 : Double(1:1, requires_grad=0, device=cuda:0) = prim::TensorExprGroup_0(%88, %89)
[DUMP profiling_graph_executor_impl.cpp:499]       -> (%d.4)
[DUMP profiling_graph_executor_impl.cpp:499]     block1():
[DUMP profiling_graph_executor_impl.cpp:499]       %97 : Function = prim::Constant[name="fallback_function", fallback=1]()
[DUMP profiling_graph_executor_impl.cpp:499]       %98 : (Tensor) = prim::CallFunction(%97, %11, %13)
[DUMP profiling_graph_executor_impl.cpp:499]       %99 : Tensor = prim::TupleUnpack(%98)
[DUMP profiling_graph_executor_impl.cpp:499]       -> (%99)
[DUMP profiling_graph_executor_impl.cpp:499]   %85 : int[] = aten::size(%91)
[DUMP profiling_graph_executor_impl.cpp:499]   %86 : int[] = prim::BroadcastSizes(%59, %62)
[DUMP profiling_graph_executor_impl.cpp:499]   %61 : int[]? = aten::_size_if_not_equal(%59, %86) # <string>:3:19
[DUMP profiling_graph_executor_impl.cpp:499]   %64 : int[]? = aten::_size_if_not_equal(%62, %86) # <string>:3:68
[DUMP profiling_graph_executor_impl.cpp:499]   %68 : int[]? = aten::_size_if_not_equal(%86, %85) # <string>:3:19
[DUMP profiling_graph_executor_impl.cpp:499]   %71 : int[]? = aten::_size_if_not_equal(%62, %85) # <string>:3:68
[DUMP profiling_graph_executor_impl.cpp:499]   return (%91, %61, %64, %68, %71)
[DUMP profiling_graph_executor_impl.cpp:499] with prim::TensorExprGroup_0 = graph(%1 : Double(1:1, requires_grad=0, device=cuda:0),
[DUMP profiling_graph_executor_impl.cpp:499]       %4 : Double(1:1, requires_grad=0, device=cuda:0)):
[DUMP profiling_graph_executor_impl.cpp:499]   %5 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:499]   %c.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%4, %1, %5) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2872:16
[DUMP profiling_graph_executor_impl.cpp:499]   %2 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:499]   %d.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%c.3, %1, %2) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2873:16
[DUMP profiling_graph_executor_impl.cpp:499]   return (%d.3)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45080

Reviewed By: bertmaher

Differential Revision: D23856410

Pulled By: Krovatkin

fbshipit-source-id: 2956286eb03a4894a5baa151c35e6092466322b1
2020-09-28 10:45:56 -07:00
e5242aaf89 Update TensorPipe submodule (#45433)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45433

Primarily in order to pick up the fix landed in https://github.com/pytorch/tensorpipe/pull/225 which fixes the handling of scopes in link-local IPv6 addresses, which was reported by a user.

Test Plan: The specific upstream change is covered by new unit tests. The submodule update will be validated by the PyTorch CI.

Reviewed By: beauby

Differential Revision: D23962289

fbshipit-source-id: 4ed762fc19c4aeb1398d1337d61b3188c4c228be
2020-09-28 10:32:06 -07:00
48d29c830d [hotfix] disable problematic cuda tests on rocm builds (#45435)
Summary:
Disable the 3 recently added CUDA tests on AMD ROCm builds/tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45435

Reviewed By: malfet

Differential Revision: D23962881

Pulled By: walterddr

fbshipit-source-id: ad4ea1f835b4722cdbdce685806cfd64376cc16f
2020-09-28 10:02:12 -07:00
e2ffdf467a docker: Add torchelastic to docker image (#45438)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45438

Adds torchelastic (as well as its dependencies) to the official docker
images

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: tierex

Differential Revision: D23963787

Pulled By: seemethere

fbshipit-source-id: 54ebb4b9c50699e543f264975dadf99badf55753
2020-09-28 09:53:07 -07:00
e4950a093a Backward support for generalized eigenvalue solver with LOBPCG in forward [only k-rank SYMEIG case] (#43002)
Summary:
As per title. Fixes [#38948](https://github.com/pytorch/pytorch/issues/38948). Therein you can find some blueprints for the algorithm being used in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43002

Reviewed By: zou3519

Differential Revision: D23931326

Pulled By: albanD

fbshipit-source-id: e6994af70d94145f974ef87aa5cea166d6deff1e
2020-09-28 07:22:35 -07:00
6417a70465 Updates linalg warning + docs (#45415)
Summary:
Changes the deprecation of norm to a docs deprecation, since PyTorch components still rely on norm and some behavior, like automatically flattening tensors, may need to be ported to torch.linalg.norm. The documentation is also updated to clarify that torch.norm and torch.linalg.norm are distinct.
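
An illustrative sketch of one difference (assuming the documented semantics: torch.norm flattens when no dim is given, while torch.linalg.norm treats 2-D inputs as matrices):

```
import torch

m = torch.randn(3, 4)
torch.norm(m, p=2)           # flattens m: vector 2-norm (== Frobenius norm)
torch.linalg.norm(m, ord=2)  # matrix 2-norm: largest singular value
```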

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45415

Reviewed By: ngimel

Differential Revision: D23958252

Pulled By: mruberry

fbshipit-source-id: fd54e807c59a2655453a6bcd9f4073cb2c12e8ac
2020-09-28 05:28:42 -07:00
7818a214c5 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D23959094

fbshipit-source-id: 6caa046d263114bff38a38d756099aac357e4f04
2020-09-28 05:08:46 -07:00
95a97e51b5 [ONNX] Improve scripting inplace indexing ops (#44351)
Summary:
Fix a couple of issues with scripting inplace indexing in the prepare_inplace_ops_for_onnx pass.
1- Tracing index copy (such as cases like x[1:3] = data) already applies broadcasting on the rhs if needed. The broadcasting node (aten::expand) was missing in scripting cases.

2- Inplace indexing with ellipsis (aten::copy_) is replaced with aten::index_put and then handled with slice+select in this pass.
Support for negative indices for this op was added.

Shape inference is also enabled for scripting tests using the new JIT API.
A few more tests are enabled for scripting.
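
A minimal sketch of the kind of scripted slice assignment this pass handles (the function name is illustrative):

```
import torch

@torch.jit.script
def assign_slice(x: torch.Tensor, data: torch.Tensor) -> torch.Tensor:
    # Scripting this slice assignment yields the inplace indexing pattern that
    # prepare_inplace_ops_for_onnx rewrites, inserting aten::expand when the
    # rhs needs broadcasting.
    x[1:3] = data
    return x
```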

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44351

Reviewed By: ezyang

Differential Revision: D23880267

Pulled By: bzinodev

fbshipit-source-id: 78b33444633eb7ae0fbabc7415e3b16001f5207f
2020-09-28 00:32:36 -07:00
13f76f2be4 Fix preserve submodule attribute in freezing (#45143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45143

This PR prevents freezing from cleaning up a submodule when the user
requests that the submodule be preserved.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23844969

Pulled By: bzinodev

fbshipit-source-id: 80e6db3fc12460d62e634ea0336ae2a3551c2151
2020-09-28 00:05:38 -07:00
c3bf402cbb handle onnx nll with default ignore index (#44816)
Summary:
In the ONNX NegativeLogLikelihoodLoss specification, ignore_index is optional and has no default value.
Therefore, when converting the nll op to ONNX, we need to set the ignore_index attribute even if it is not specified (e.g. ignore_index=-100).
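
For reference, a minimal sketch of the PyTorch side (the functional nll_loss defaults to ignore_index=-100, which the exporter must now emit explicitly):

```
import torch
import torch.nn.functional as F

logits = torch.randn(3, 5)
target = torch.tensor([1, 0, 4])
# ignore_index is not passed here, so PyTorch's default of -100 applies.
loss = F.nll_loss(F.log_softmax(logits, dim=1), target)
```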

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44816

Reviewed By: ezyang

Differential Revision: D23880354

Pulled By: bzinodev

fbshipit-source-id: d0bdd58d0a4507ed9ce37133e68533fe6d1bdf2b
2020-09-27 23:26:19 -07:00
8bdbedd4ee Revert "Updates and simplifies nonzero as_tuple behavior"
This reverts commit 8b143771d0f0bcd93d925263adc8b0d6b235b398.
2020-09-27 20:58:42 -07:00
8b143771d0 Updates and simplifies nonzero as_tuple behavior 2020-09-27 20:56:30 -07:00
5b839bca78 [ONNX] Optimize export_onnx api to reduce string and model proto exchange (#44332)
Summary:
Optimize the export_onnx API to reduce string and model proto exchange in export.cpp.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44332

Reviewed By: bwasti, eellison

Differential Revision: D23880129

Pulled By: bzinodev

fbshipit-source-id: 1d216d8f710f356cbba2334fb21ea15a89dd16fa
2020-09-27 16:29:08 -07:00
4005afe94b [ONNX] Update narrow for dynamic inputs (#44039)
Summary:
Update narrow for dynamic inputs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44039

Reviewed By: mruberry

Differential Revision: D23742215

Pulled By: bzinodev

fbshipit-source-id: 0d58d2fe996f91a124af988a9a21ee433e842d07
2020-09-27 15:52:57 -07:00
78caa028b6 Revert D23009117: [Distributed] DeleteKey API for c10d TCP Store
Test Plan: revert-hammer

Differential Revision:
D23009117 (addf94f2d6)

Original commit changeset: 1a0d95b43d79

fbshipit-source-id: ad3fe5501267e1a0a7bf23410766f1e92b34b24d
2020-09-27 12:04:42 -07:00
f84b2e865f Revert D23878455: [Distributed] Adding Python tests for the TCPStore getNumKeys and deleteKey
Test Plan: revert-hammer

Differential Revision:
D23878455 (cf808bed73)

Original commit changeset: 0a17ecf66b28

fbshipit-source-id: 93e60b23f66324e3e5266c45abb0cec295bb3d23
2020-09-27 12:02:24 -07:00
bc5710f2f7 Benchmarks: tweak PE config settings. (#45349)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45349

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D23935518

Pulled By: ZolotukhinM

fbshipit-source-id: 5a7c508c6fc84eafbc23399f095d732b903510dc
2020-09-26 23:13:29 -07:00
a07d82982a CI: Add a run of FastRNN benchmarks in default executor/fuser configuration. (#45348)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45348

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D23935520

Pulled By: ZolotukhinM

fbshipit-source-id: efecaaab68caaaa057b354884f4ae37b6ef36983
2020-09-26 23:13:27 -07:00
8cef7326f4 Benchmarks: add 'default' options for fuser and executor. (#45347)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45347

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D23935519

Pulled By: ZolotukhinM

fbshipit-source-id: 8323fafe7828683c4d29c12a1e5722adb6f945ff
2020-09-26 23:09:02 -07:00
37a671abc7 Revert D23828257: Quantization: add API summary section
Test Plan: revert-hammer

Differential Revision:
D23828257 (d2bd556e7d)

Original commit changeset: 9311ee3f394c

fbshipit-source-id: 80b16fc123191e249e6a070ec5360a15fe91cf61
2020-09-26 22:53:10 -07:00
110aa45387 Revert D23842456: Quantization: combine previous summary with new summary
Test Plan: revert-hammer

Differential Revision:
D23842456 (278da57255)

Original commit changeset: db2399e51e9a

fbshipit-source-id: 7878257330bf83751cb17c0971a5c894bdf256ba
2020-09-26 22:53:07 -07:00
3da1061059 Revert D23916669: quant docs: add reduce_range explanation to top level doc
Test Plan: revert-hammer

Differential Revision:
D23916669 (eb39624394)

Original commit changeset: ef93fb774cb1

fbshipit-source-id: 7b56020427e76e13f847494044179c81d508db11
2020-09-26 22:48:38 -07:00
54a253fded Revert D23931987: Added optimizers based on multi tensor apply
Test Plan: revert-hammer

Differential Revision:
D23931987 (2b21e7767e)

Original commit changeset: 582134ef2d40

fbshipit-source-id: ffd500aea55fda34155442fb15e2529cb9c00100
2020-09-26 18:11:54 -07:00
e52762cbb7 Revert D23917034: quant docs: document how to customize qconfigs in eager mode
Test Plan: revert-hammer

Differential Revision:
D23917034 (7763e1d7b1)

Original commit changeset: ccf71ce4300c

fbshipit-source-id: 9ce99e880b4a22e824f4413354a0f3703e7c5c2c
2020-09-26 18:05:38 -07:00
23dfca8351 Support record_shapes in RPC profiling (#44419)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44419

Closes https://github.com/pytorch/pytorch/issues/39969

This PR adds support for propagation of input shapes over the wire when the profiler is invoked with `record_shapes=True` over RPC. Previously, we did not respect this argument.

This is done by saving the shapes as an ivalue list and recovering it as the expected type (`std::vector<std::vector<int>>` on the client). A test is added to ensure that remote ops have the same `input_shapes` as if the op were run locally.
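
A minimal sketch of the user-facing behavior (assumes RPC is already initialized and a peer named "worker1" exists):

```
import torch
import torch.distributed.rpc as rpc
from torch.autograd.profiler import profile

with profile(record_shapes=True) as prof:
    rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2, 2), torch.ones(2, 2)))

# Input shapes of the remote op now show up as if it had run locally.
print(prof.key_averages(group_by_input_shape=True).table())
```
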
ghstack-source-id: 112977899

Reviewed By: pritamdamania87

Differential Revision: D23591274

fbshipit-source-id: 7cf3b2e8df26935ead9d70e534fc2c872ccd6958
2020-09-26 13:26:44 -07:00
19dda7c68a Fallback to CPU when remote end does not have CUDA for profiling (#44967)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44967

When enabling the profiler on the server, the server may be a different
machine that does not have CUDA while the caller does. Previously we would
crash in this case; now we fall back to CPU profiling and log a warning.
ghstack-source-id: 112977906

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D23790729

fbshipit-source-id: dc6eba172b7e666842d54553f52a6b9d5f0a5362
2020-09-26 13:12:55 -07:00
2b21e7767e Added optimizers based on multi tensor apply (#45299)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45299

Adding a new namespace `torch.optim._multi_tensor` with a bunch of updated optimizers. Those optimizers are using _foreach APIs which improve performance significantly.

### Tests
- updated existing tests to use both optimizers
- added `test_multi_tensor_optimizers` test to verify correctness.

### Perf results

**Adam**
timeit: 42.69 ms --> 10.16 ms
autorange: 41.96 ms --> 10.28 ms

**AdamW**
timeit: 51.38 ms --> 15.63 ms
autorange: 50.82 ms --> 16.07 ms

**SGD**
timeit: 6.28 ms --> 4.40 ms
autorange: 6.13 ms --> 4.73 ms

**RMSprop**
timeit: 28.63 ms --> 5.89 ms
autorange: 28.27 ms -->  5.76 ms

**Rprop**
timeit: 213.30 ms --> 178.42 ms
autorange: 212.03 ms --> 178.03 ms

**ASGD**
timeit: 21.67 ms --> 9.33 ms
autorange: 21.64 ms --> 9.27 ms

**Adamax**
timeit: 55.60 ms --> 48.29 ms
autorange: 55.22 ms --> 49.13 ms

**Perf script used**

```
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils._benchmark as benchmark_utils
import torchvision

device = "cuda"
model = torchvision.models.resnet.resnet101(pretrained=True).to(device)
criterion = nn.CrossEntropyLoss()

# Swap in optim._multi_tensor.SGD (or Adam, AdamW, ...) here to compare.
optimizer = optim.SGD(model.parameters(), lr=1e-3)

target = torch.empty(128, dtype=torch.long, device=device).random_(5)
inputs = torch.rand(128, 3, 100, 100, device=device, requires_grad=True)

# Run one forward/backward so grads exist for the timed optimizer.step() calls.
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, target)
loss.backward()
optimizer.step()

def main():
    timer = benchmark_utils.Timer(
        stmt="optimizer.step()",
        globals=globals(),
        label=str(optimizer),
    )

    for i in range(1):
        print(f"Run: {i}\n{'-' * 40}")
        print(f"timeit:\n{timer.timeit(1000)}\n")
        print(f"autorange:\n{timer.blocked_autorange()}\n\n")

if __name__ == "__main__":
    main()
```

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23931987

Pulled By: izdeby

fbshipit-source-id: 582134ef2d402909d27d89a45c5b588fb7130ea1
2020-09-26 12:17:43 -07:00
0fa551f0ab [c2] Fix int types for learning rate
Summary: Currently GetSingleArgument is overflowing, since it expects an int instead of an int64, when using a 1cycle (hill policy) annealing schedule.

Test Plan:
unittest

buck test  caffe2/caffe2/python/operator_test:learning_rate_op_test

Differential Revision: D23938169

fbshipit-source-id: 20d65df800d7a0f1dd9520705af31f63ae716463
2020-09-26 10:59:29 -07:00
cf808bed73 [Distributed] Adding Python tests for the TCPStore getNumKeys and deleteKey (#45223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45223

Previous diffs in this stack implemented the getNumKeys and deleteKey
APIs in the c10d Store as well as added tests at the C++ layer. This diff adds
tests at the Python level in test_c10d.py
ghstack-source-id: 112939763

Test Plan: Ensured these new python tests as well as previous C++ tests pass

Reviewed By: jiayisuse

Differential Revision: D23878455

fbshipit-source-id: 0a17ecf66b28d46438a77346e5bf36414e05e25c
2020-09-26 00:54:28 -07:00
addf94f2d6 [Distributed] DeleteKey API for c10d TCP Store (#43963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43963

Added a DeleteKey API for the TCP Store
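
A hedged sketch of the API from Python (a single process acting as its own store server; the binding names `num_keys()` and `delete_key()` are assumptions based on this stack):

```
import torch.distributed as dist

store = dist.TCPStore("127.0.0.1", 29500, 1, True)  # host, port, world_size, is_master
store.set("k1", "v1")
store.delete_key("k1")   # assumed Python binding for the new DeleteKey API
print(store.num_keys())  # assumed Python binding for getNumKeys
```
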
ghstack-source-id: 112939762

Test Plan:
Modified the existing get/set test to use delete. Verified that the
correct keys were deleted and that the numKeys API returned the right values.

Reviewed By: jiayisuse

Differential Revision: D23009117

fbshipit-source-id: 1a0d95b43d79e665a69b2befbaa059b2b50a1f66
2020-09-26 00:54:21 -07:00
304e1d1e19 [Distributed] getNumKeys API to c10d TCPStore (#43962)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43962

TCPStore needs a getNumKeys API for our logging needs.
ghstack-source-id: 112939761

Test Plan: Adding tests to C++ Store Tests

Reviewed By: pritamdamania87

Differential Revision: D22985085

fbshipit-source-id: 8a0d286fbd6fd314dcc997bae3aad0e62b51af83
2020-09-26 00:49:00 -07:00
d9af3d2fcd [quant] ConvTranspose warnings (#45081)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45081

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23822449

Pulled By: z-a-f

fbshipit-source-id: f21a5f3ef4d09f703c96fff0bc413dbadeac8202
2020-09-25 22:30:14 -07:00
92189b34b7 Add get_all_users_of function to GraphManipulation (#45216)
Summary:
This PR adds get_all_users_of function. The function returns all the users of a specific node. A test unit is also added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45216

Reviewed By: ezyang

Differential Revision: D23883572

Pulled By: scottxu0730

fbshipit-source-id: 3eb68a411c3c6db39ed2506c9cb7bb7337520ee4
2020-09-25 19:32:49 -07:00
7763e1d7b1 quant docs: document how to customize qconfigs in eager mode (#45306)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45306

Adds details to the main quantization doc on how users can skip or
customize quantization of specific layers.
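
A minimal eager-mode sketch of the customization pattern the doc covers (the model is illustrative):

```
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 4))
model.eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
model[2].qconfig = None  # a qconfig of None skips quantization for that layer
prepared = torch.quantization.prepare(model)
```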

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23917034

Pulled By: vkuzo

fbshipit-source-id: ccf71ce4300c1946b2ab63d1f35a07691fd7a2af
2020-09-25 18:33:35 -07:00
eb39624394 quant docs: add reduce_range explanation to top level doc (#45305)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45305

Adds an explanation of reduce_range to the main quantization
doc page.

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23916669

Pulled By: vkuzo

fbshipit-source-id: ef93fb774cb15741cd92889f114f6ab76c39f051
2020-09-25 18:33:32 -07:00
278da57255 Quantization: combine previous summary with new summary (#45135)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45135

The previous quantization summary had steps on what to do for
dynamic, static, and QAT quantization. This PR moves these steps to comments in the
example code, so it is clearer how to accomplish them.

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23842456

Pulled By: vkuzo

fbshipit-source-id: db2399e51e9ae33c8a1ac610e3d7dbdb648742b0
2020-09-25 18:33:30 -07:00
d2bd556e7d Quantization: add API summary section (#45093)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45093

This adds a tl;dr-style summary of the quantization API
to the documentation. Hopefully this will make it easier
for new folks to learn how to use quantization.

This is not meant to be all-encompassing.  Future PRs
can improve the documentation further.

Test Plan:
1. build the doc as specified in https://github.com/pytorch/pytorch#building-the-documentation
2. inspect the quantization page in Chrome, format looks good

Reviewed By: jerryzh168

Differential Revision: D23828257

Pulled By: vkuzo

fbshipit-source-id: 9311ee3f394cd83af0aeafb6e2fcdc3e0321fa38
2020-09-25 18:30:51 -07:00
958c208666 [quant] conv_transpose graph patterns (#45078)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45078

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23821580

Pulled By: z-a-f

fbshipit-source-id: 813a4ef1bbc429720765d61791fe754b6678a334
2020-09-25 18:14:29 -07:00
606b1a9a2e Move xla codegen to aten. (#45241)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45241

Test Plan: Imported from OSS

Reviewed By: soumith

Differential Revision: D23926750

Pulled By: ailzhang

fbshipit-source-id: f768e24a9baeca9f9df069a62d6f8b94a853a1ee
2020-09-25 18:07:32 -07:00
32c355af5b [dist_optim] introduce distributed functional optimizer (#45221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45221

This PR introduces a distributed functional optimizer, so that the
distributed optimizer can reuse the functional optimizer APIs while
maintaining its own state. This enables a TorchScript-compatible
functional optimizer when using the distributed optimizer, which helps
get rid of the GIL and improves the overall performance of training,
especially distributed model parallel training.

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D23935256

Pulled By: wanchaol

fbshipit-source-id: 59b6d77ff4693ab24a6e1cbb6740bcf614cc624a
2020-09-25 17:13:10 -07:00
08caf15502 [optimizer] refactor Adam to use functional API (#44791)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44791

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D23935257

Pulled By: wanchaol

fbshipit-source-id: 6f6e22a9287f5515d2e4e6abd4dee2fe7e17b945
2020-09-25 17:13:08 -07:00
0444c372e1 [optimizer] introduce optimizer functional API, refactor Adagrad (#44715)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44715

We have provided a nice and intuitive API in Python. But in the context of large scale distributed training (e.g. Distributed Model Parallel), users often want to use multithreaded training instead of multiprocess training as it provides better resource utilization and efficiency.

This PR introduces the functional optimizer concept (similar to the concept of `nn.functional`): we split the optimizer into two parts: 1. optimizer state management, 2. optimizer computation. We expose the computation part as a separate functional API that is available to internal and OSS developers; the caller of the functional API maintains its own state in order to call the functional API directly. While keeping the end-user API the same, the functional API is TorchScript friendly and can be used by the distributed optimizer to speed up training without the GIL.
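
A schematic sketch of the split (illustrative names only, not the actual torch.optim functional API): the caller owns all state and passes it in, while the function is pure computation.

```
from typing import List

import torch

def sgd_step(params: List[torch.Tensor], grads: List[torch.Tensor], lr: float) -> None:
    # Pure computation: no state lives in this function.
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(g, alpha=-lr)

w = torch.randn(3, requires_grad=True)
(w ** 2).sum().backward()
sgd_step([w], [w.grad], lr=0.1)
```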

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D23935258

Pulled By: wanchaol

fbshipit-source-id: d2a5228439edb3bc64f7771af2bb9e891847136a
2020-09-25 17:10:26 -07:00
8ab2ad306d Enable torch.cuda.nccl typechecking (#45344)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45336

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45344

Reviewed By: walterddr

Differential Revision: D23935306

Pulled By: malfet

fbshipit-source-id: dd09d4f8ff7a327131764487158675027a13bf69
2020-09-25 17:02:47 -07:00
5211fb97ac Remove device maps from TensorPipe for v1.7 release (#45353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45353

Temporarily removing this feature, will add this back after branch cut.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D23939865

Pulled By: mrshenli

fbshipit-source-id: 7dceaffea6b9a16512b5ba6036da73e7f8f83a8e
2020-09-25 16:51:45 -07:00
439930c81b adding a beta parameter to the smooth_l1 loss fn (#44433)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44433

Not entirely sure why, but changing the type of beta from `float` to `double` in autocast_mode.cpp and FunctionsManual.h fixes my compiler errors, failing instead at link time.

Fixed some type errors and updated the fn signature in a few more files.

Removed my usage of Scalar, making beta a double everywhere instead.
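
A minimal usage sketch (assuming the parameter is exposed as the `beta` keyword on the functional form):

```
import torch
import torch.nn.functional as F

x, y = torch.randn(8), torch.randn(8)
# beta sets where the loss transitions from the quadratic to the linear region.
loss = F.smooth_l1_loss(x, y, beta=0.5)
```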

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23636720

Pulled By: bdhirsh

fbshipit-source-id: caea2a1f8dd72b3b5fd1d72dd886b2fcd690af6d
2020-09-25 16:36:28 -07:00
37513a1118 Use explicit templates in CUDALoops kernels (#44286)
Summary:
Reland attempt of https://github.com/pytorch/pytorch/pull/41059
Use explicit templates instead of lambdas to reduce binary size by 100-200Kb per arch per CU without affecting perf, namely:
BinaryMulDivKernel.cu 3.8Mb -> 3.5Mb
CompareEQKernel.cu 1.8Mb -> 1.7Mb
BinaryAddSubKernel.cu 2.0Mb -> 1.8Mb
BinaryBitwiseOpsKernels.cu 2.6Mb -> 2.3Mb

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44286

Reviewed By: ngimel

Differential Revision: D23859691

Pulled By: malfet

fbshipit-source-id: 2c4e86f35e0f94a62294dc5d52a3ba364db23e2d
2020-09-25 16:26:40 -07:00
a2b4177c5b Add barrier() at the end of init_process_group and new_group. (#45181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45181

`init_process_group` and `new_group` update a bunch of global
variables after initializing the actual process group. As a result, there is a
race that after initializing the process group on say rank 0, if we immediately
check the default process group on rank 1 (say via RPC), we might actually get
an error since rank 1 hasn't yet updated its _default_pg variable.

To resolve this issue, I've added barrier() at the end of both of these calls.
This ensures that once these calls return we are guaranteed about correct
initialization on all ranks.

Since these calls are usually done mostly during initialization, it should be
fine to add the overhead of a barrier() here.
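
A minimal sketch of the now-safe pattern (assumes RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT are set in the environment):

```
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo", init_method="env://")
# With the trailing barrier, every rank has finished updating its globals
# (e.g. _default_pg) by the time init returns, so immediate use is safe.
dist.all_reduce(torch.ones(1))
```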

Closes: https://github.com/pytorch/pytorch/issues/40434, https://github.com/pytorch/pytorch/issues/40378
ghstack-source-id: 112923112

Test Plan:
Reproduced the failures in
https://github.com/pytorch/pytorch/issues/40434 and
https://github.com/pytorch/pytorch/issues/40378 and verified that this PR fixes
the issue.

Reviewed By: mrshenli

Differential Revision: D23858025

fbshipit-source-id: c4d5e46c2157981caf3ba1525dec5310dcbc1830
2020-09-25 15:46:59 -07:00
3b7e4f89b2 Add deprecation warning to PG backend and make TP backend stable. (#45356)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45356

In this PR, I'm adding a warning to the PG backend mentioning that it will
be deprecated in the future. In addition, I removed the warning from the
TP backend saying that it is a beta feature.
ghstack-source-id: 112940501

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D23940144

fbshipit-source-id: d44054aa1e4ef61004a40bbe0ec45ff07829aad4
2020-09-25 15:41:00 -07:00
04be420549 [static runtime] Remove ops in static from backwards compatibility checks (#45354)
Summary:
This should get the builds green again

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45354

Reviewed By: zhangguanheng66

Differential Revision: D23939615

Pulled By: bwasti

fbshipit-source-id: e93b11bc9592205e52330bb15928603b0aea21ac
2020-09-25 14:46:42 -07:00
eee7dad376 Add torch.do_assert, which is symbolically traceable (#45188)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45188

This is a symbolically traceable alternative to Python's `assert`.
It should be useful to allow people who want to use FX to also
be able to assert things.
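
A hedged sketch of the intended use with FX (assuming the name lands as `torch.do_assert`, per the title, and the usual FX tracing entry point):

```
import torch
import torch.fx

def f(x: torch.Tensor) -> torch.Tensor:
    # Unlike a bare `assert`, this call is recorded as a node in the traced graph.
    torch.do_assert(x.shape[0] > 1, "batch dimension must be larger than 1")
    return x + 1

traced = torch.fx.symbolic_trace(f)
```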

A bunch of TODO(before land) notes are inline. Would love thoughts
on where the best place is for this code to live, and what this
function should be called (since `assert` is reserved).

Test Plan:
```
python test/test_fx.py TestFX.test_symbolic_trace_assert
```

Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23861567

fbshipit-source-id: d9d6b9556140faccc0290eba1fabea401d7850de
2020-09-25 13:46:28 -07:00
7c5436d557 [RPC profiling] Add tests to ensure RPC profiling works on single threaded (#44923)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44923

This ensures that RPC profiling works in single-threaded server
scenarios and that we won't make the assumption that we'll have multiple
threads when working on this code. For example, this assumption resulted in a
bug in the previous diff (which was fixed)
ghstack-source-id: 112868469

Test Plan: CI

Reviewed By: lw

Differential Revision: D23691304

fbshipit-source-id: b17d34ade823794cbe949b70a5ab35723d974203
2020-09-25 13:24:18 -07:00
27ab9bc0f9 [RPC profiling] Extend RPC profiling to support async function execution over RPC. (#44664)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44664

Closes https://github.com/pytorch/pytorch/issues/39971. This PR adds support for functions decorated with `rpc.functions.async_execution` to be profiled over RPC as builtins, jit functions, and blocking python UDFs currently can be. The reasoning for this is to provide complete feature support in terms of RPC profiling and the various types of functions users can run.

To enable this, the PR below this enables calling `disableProfiler()` safely from another thread. We use that functionality to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only the blocking `processRPC` call as was done previously). Since when the future completes we've kicked off the async function and the future corresponding to it has completed, we are able to capture any RPCs the function would have called and the actual work done on the other node.

For example, if the following async function is ran on a server over RPC:

```
def slow_add(x, y):
    time.sleep(1)
    return torch.add(x, y)

@rpc.functions.async_execution
def slow_async_add(to, x, y):
    return rpc.rpc_async(to, slow_add, args=(x, y))
```

we expect to see the original RPC profiled, the nested RPC profiled, and the actual torch.add() work. All of these events should be recorded with the correct node id. Here is an example profiling output:

```
Name                                                                                                                    Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  Number of Calls  Node ID
----------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
rpc_async#slow_async_add(worker1 -> worker2)                                                                              0.00%             0.000us         0            1.012s     1.012s        1                1
aten::empty                                                                                                               7.02%             11.519us        7.02%        11.519us   11.519us      1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                            0.00%             0.000us         0            1.006s     1.006s        1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                       7.21%             11.843us        7.21%        11.843us   11.843us      1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add       71.94%            118.107us       85.77%       140.802us  140.802us     1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty     13.82%            22.695us        13.82%       22.695us   22.695us      1                3

Self CPU time total: 164.164us
```

This PR also moves a bunch of the profiling logic to `rpc/utils.cpp` to declutter `request_callback` code.
ghstack-source-id: 112868470

Test Plan:
```
rvarm1@devbig978:fbcode  (52dd34f6)$ buck test mode/no-gpu mode/dev-nosan //caffe2/test/distributed/rpc:process_group_agent -- test_rpc_profiling_async_function --print-passing-details --stress-runs 1
```

Reviewed By: mrshenli

Differential Revision: D23638387

fbshipit-source-id: eedb6d48173a4ecd41d70a9c64048920bd4807c4
2020-09-25 13:19:26 -07:00
d5748d9a1a Enable binary ops with Scalar Lists with for foreach APIs (#45298)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45298

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23931986

Pulled By: izdeby

fbshipit-source-id: 281267cd6f90d57a169af89f9f10b0f4fcab47e3
2020-09-25 12:58:34 -07:00
f07ac6a004 Fix Windows build failure after DDP PR merged (#45335)
Summary:
Fixes #{issue number}
This is a resubmit of PR https://github.com/pytorch/pytorch/issues/42897, together with a fix for the Windows build issue introduced by PR https://github.com/pytorch/pytorch/issues/44344.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45335

Reviewed By: zou3519

Differential Revision: D23931471

Pulled By: mrshenli

fbshipit-source-id: f49b5a114944c1450b32934b3292170be064f494
2020-09-25 12:37:50 -07:00
c8166d4b58 Add torch.cuda.comm to typechecking CI (#45350)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45350

Reviewed By: walterddr

Differential Revision: D23935750

Pulled By: malfet

fbshipit-source-id: 5a7d2d4fbc976699d80bb5caf4727c19fa2c5bc8
2020-09-25 12:13:43 -07:00
22401b850b port all JIT tests to gtest (#45264)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45264

Context for why we are porting to gtest in: https://github.com/pytorch/pytorch/pull/45018.

This PR completes the process of porting and removes unused files/macros.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23901392

Pulled By: suo

fbshipit-source-id: 89526890e1a49462f3f77718f4ee273c5bc578ba
2020-09-25 11:37:43 -07:00
5a0514e3e6 [pytorch] Update fmt to 7.0.3 (#45304)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45304

As title

Test Plan: sandcastle

Reviewed By: malfet

Differential Revision: D23916328

fbshipit-source-id: 47c76886c1f17233304dc59289ff6baa16c50b8d
2020-09-25 11:33:36 -07:00
dc9e9c118e CUDA BFloat16 neg (#45240)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45240

Reviewed By: mruberry

Differential Revision: D23933392

Pulled By: ngimel

fbshipit-source-id: 2472dc550600ff470a1044ddee39054e22598038
2020-09-25 11:25:49 -07:00
e5f6e5af13 Add Deep and wide to test and flatten/tranpose for good measure (#44129)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44129

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604302

Pulled By: bwasti

fbshipit-source-id: 5787f6f32a80b22b1b712c4116f70370dad98f12
2020-09-25 11:05:41 -07:00
d1a11618f5 [static runtime] Add _out variants and reuse memory (#44128)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44128

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604304

Pulled By: bwasti

fbshipit-source-id: 06a23cb75700a0fc733069071843b7b498e7b9e9
2020-09-25 11:03:06 -07:00
d1d9017a66 [NNC] fix Half conversion of immediates in Cuda backend (#45213)
Summary:
The Cuda HalfChecker casts up all loads and stores of Half to Float, so we do math in Float on the device. It didn't cast up HalfImmediate (i.e. constants), so they could insert mixed-size ops. The fix is to cast those up as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45213

Reviewed By: ezyang

Differential Revision: D23885287

Pulled By: nickgg

fbshipit-source-id: 912991d85cc06ebb282625cfa5080d7525c8eba9
2020-09-25 10:53:36 -07:00
536580e976 Vectorize bitwise_not (#45103)
Summary:
Benchmark (Debian 10, Release build, gcc 8.3, no turbo, Intel(R) Xeon(R)
E-2136 CPU @ 3.30GHz):

```python
import timeit
for dtype in ('torch.int64', 'torch.int32', 'torch.int16', 'torch.int8', 'torch.uint8'):
    for n, t in [(10_000, 100000),
                (100_000, 10000)]:
        print(f'torch.bitwise_not(a), numel() == {n} for {t} times, dtype={dtype}')
        print(timeit.timeit('torch.bitwise_not(a)', setup=f'import torch; a = torch.arange(-{n//2}, {n//2}, dtype={dtype})', number=t))
```

Before:

```
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int64
0.5479081739904359
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int64
0.3350257440470159
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int32
0.39590477803722024
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int32
0.25563537096604705
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int16
0.31152817397378385
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int16
0.20817365101538599
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int8
0.8573925020173192
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int8
0.4150037349900231
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.uint8
0.8551108679967001
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.uint8
0.37137620500288904
```

After:

```
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int64
0.5232444299617782
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int64
0.33852163201663643
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int32
0.3931163849774748
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int32
0.24392802000511438
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int16
0.3122224889229983
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int16
0.1977886479580775
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.int8
0.26711542706470937
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.int8
0.18208567495457828
torch.bitwise_not(a), numel() == 10000 for 100000 times, dtype=torch.uint8
0.2615354140289128
torch.bitwise_not(a), numel() == 100000 for 10000 times, dtype=torch.uint8
0.17972210398875177
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45103

Reviewed By: ailzhang

Differential Revision: D23848675

Pulled By: ezyang

fbshipit-source-id: 6dde1ab32d9a343a49de66ad9f9b062fa23824d2
2020-09-25 10:18:30 -07:00
a117d968f6 [quant][graph] Remove redundant aten::wait calls in the graph (#45257)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45257

Currently we inline fork-wait calls when we insert observers for quantization.
In the case where fork and wait are in different subgraphs, inlining the fork-wait calls
only gets rid of the fork. This leaves the aten::wait call in the graph with a torch.Tensor as input,
which is currently not supported.
To avoid this, we check in the cleanup phase that the input to all wait calls in the graph
is of type Future[Tensor].

Test Plan:
python test/test_quantization.py TestQuantizeJitPasses.test_quantize_fork_wait

Imported from OSS

Reviewed By: qizzzh

Differential Revision: D23895412

fbshipit-source-id: 3c58c6be7d7e7904eb6684085832ac21f827a399
2020-09-25 09:52:52 -07:00
8b00c4c794 [ONNX] Correct a minor typo in warning (#45187)
Summary:
The warning for batch_norm mentioned dropout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45187

Reviewed By: glaringlee

Differential Revision: D23873215

Pulled By: ezyang

fbshipit-source-id: 1dcc82ad16522215f49b4cd0fc0e357b2094e4f2
2020-09-25 09:26:51 -07:00
b70fac75ac CMake: Fix python dependencies in codegen (#45275)
Summary:
I noticed while working on https://github.com/pytorch/pytorch/issues/45163 that edits to python files in the  `tools/codegen/api/` directory wouldn't trigger rebuilds. This tells CMake about all of the dependencies, so rebuilds are triggered automatically.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45275

Reviewed By: zou3519

Differential Revision: D23922805

Pulled By: ezyang

fbshipit-source-id: 0fbf2b6a9b2346c31b9b0384e5ad5e0eb0f70e9b
2020-09-25 09:16:38 -07:00
78fcde9c50 Trace scattered tensor options arguments (#44071)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44071

Previously, tracing re-gathered ScalarType, Layout, Device, bool into a TensorOptions object and called `tracer::addInput()` on the gathered TensorOptions argument. `tracer::addInput()` then scattered them again and added the individual scattered arguments to the traced graph. This PR avoids the extraneous gathering and re-scattering step and calls `tracer::addInput()` on the individual arguments directly. This avoids the perf hit of an unnecessary gathering step.

This applies to both c10-full and non-c10-full ops. In the case of c10-full ops, the tracing kernels take scattered arguments and we can directly pass them to `tracer::addInput()`. In the case of non-c10-full ops, the kernel takes a `TensorOptions` argument but we still call `tracer::addInput()` on the scattered arguments.
ghstack-source-id: 112825793

Test Plan:
waitforsandcastle

vs master: https://www.internalfb.com/intern/fblearner/details/216129483/

vs previous diff: https://www.internalfb.com/intern/fblearner/details/216170069/

Reviewed By: ezyang

Differential Revision: D23486638

fbshipit-source-id: e0b53e6673cef8d7f94158e718301eee261e5d22
2020-09-25 09:04:06 -07:00
2ac7de7d53 Remove hacky_wrapper from BackendSelect kernels (#44062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44062

Previously, BackendSelect kernels were still written in the legacy way, i.e. they took one TensorOptions argument instead of scattered dtype, layout, device, pin_memory, and they used hacky_wrapper to be callable. This caused a re-wrapping step. Calling into a BackendSelect kernel required taking the individual scattered arguments, packing them into a TensorOptions, and the kernel itself then gathered them again for redispatch.

Now with this PR, BackendSelect kernels are written in the new way and no hacky_wrapper or rewrapping is needed for them.
ghstack-source-id: 112825789

Test Plan:
vs master: https://www.internalfb.com/intern/fblearner/details/216117032/

vs previous diff: https://www.internalfb.com/intern/fblearner/details/216170194/

Reviewed By: ezyang

Differential Revision: D23484192

fbshipit-source-id: e8fb49c4692404b6b775d18548b990c4cdddbada
2020-09-25 09:04:03 -07:00
043bd51b48 Remove hacky_wrapper from VariableType and TraceType (#44005)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44005

Previously, VariableType and TraceType kernels were still written in the legacy way, i.e. they took one TensorOptions argument instead of scattered dtype, layout, device, pin_memory,  and they used hacky_wrapper to be callable.

Now with this PR, variable and tracing kernels are written in the new way and no hacky_wrapper is needed for them.
ghstack-source-id: 112825791

Test Plan:
waitforsandcastle

https://www.internalfb.com/intern/fblearner/details/215954270/

Reviewed By: ezyang

Differential Revision: D23466042

fbshipit-source-id: bde730a9e3bb4cb80ad484417be1ebecbdc2d377
2020-09-25 09:01:34 -07:00
bf8cd21f2a Py transformer coder test (#43976)
Summary:
Fixes [#37756](https://github.com/pytorch/pytorch/issues/37756)

Added the missing Transformer coder Python test scripts, ported from the C++ API test scripts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43976

Reviewed By: jamesr66a

Differential Revision: D23873250

Pulled By: glaringlee

fbshipit-source-id: cdeae53231e02208463e7629ba2c1f00990150ea
2020-09-25 08:22:24 -07:00
2739a7c599 Byte-for-byte compatibility fixes in codegen (#44879)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44879

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23825163

Pulled By: bdhirsh

fbshipit-source-id: 4d8028274f82c401b393c4fe1b9e32de3f4909c6
2020-09-25 08:06:50 -07:00
00e704e757 [fix] torch.repeat : dim-0 backward (#45212)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45201

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45212

Reviewed By: mrshenli

Differential Revision: D23905545

Pulled By: albanD

fbshipit-source-id: c5bf9cf481c8cf3ccc1fdbfb364006b29f67dc9f
2020-09-25 07:53:00 -07:00
76ee58e2ec [TensorExpr] Move inner loops vectorization logic to its own method (#45287)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45287

Test Plan: CI, build

Reviewed By: gmagogsfm

Differential Revision: D23913432

Pulled By: asuhan

fbshipit-source-id: 3bf8fe09753f349e3c857863a43d2b1fca5101c1
2020-09-25 02:29:36 -07:00
241afc9188 Migrate addr from the TH to Aten (CPU) (#44364)
Summary:
Related https://github.com/pytorch/pytorch/issues/24507
Fixes https://github.com/pytorch/pytorch/issues/24666

This PR modernizes the CPU implementation of the vector `outer product`.
The existing TH implementation for `torch.addr` is migrated to `aten`, as `torch.ger` uses the `addr` functions to calculate the outer product.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44364

Reviewed By: ezyang

Differential Revision: D23866733

Pulled By: mruberry

fbshipit-source-id: 5159ea22f0e3c991123fe7c19cc9beb6ad00301e
2020-09-25 01:18:09 -07:00
99e0a87bbb [nvFuser] Latency improvements for pointwise + reduction fusion (#45218)
Summary:
A lot of changes are in this update, some highlights:

- Added Doxygen config file
- Split the fusion IR (higher level TE like IR) from kernel IR (lower level CUDA like IR)
- Improved latency with dynamic shape handling for the fusion logic
- Prevent recompilation for pointwise + reduction fusions when not needed
- Improvements to inner dimension reduction performance
- Added input -> kernel + kernel launch parameters cache, added eviction policy
- Added reduction fusions with multiple outputs (still single reduction stage)
- Fixed code generation bugs for symbolic tiled GEMM example
- Added thread predicates to prevent shared memory from being loaded multiple times
- Improved sync threads placements with shared memory and removed read before write race
- Fixes to FP16 reduction fusions where output would come back as FP32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45218

Reviewed By: ezyang

Differential Revision: D23905183

Pulled By: soumith

fbshipit-source-id: 12f5ad4cbe03e9a25043bccb89e372f8579e2a79
2020-09-24 23:17:20 -07:00
95df8657c9 Enables test linalg (#45278)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45271.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45278

Reviewed By: ngimel

Differential Revision: D23926124

Pulled By: mruberry

fbshipit-source-id: 26692597f9a1988e5fa846f97b8430c3689cac27
2020-09-24 23:09:38 -07:00
bdf329ef8a SyncBN: preserve qconfig if it exists (#45317)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45317

Eager mode quantization depends on the presence of the `qconfig`
model attribute. Currently, converting a model to use `SyncBatchNorm`
removes the qconfig; this PR fixes that. This is important if a BN is not
fused to anything during quantization convert.
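
A minimal sketch of the preserved attribute (the model is illustrative):

```
import torch
import torch.nn as nn

model = nn.Sequential(nn.BatchNorm2d(4))
model[0].qconfig = torch.quantization.get_default_qconfig("fbgemm")

sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
assert hasattr(sync_model[0], "qconfig")  # preserved after this fix
```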

Test Plan:
```
python test/test_quantization.py TestDistributed.test_syncbn_preserves_qconfig
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23922072

fbshipit-source-id: cc1bc25c8e5243abb924c6889f78cf65a81be158
2020-09-24 22:52:07 -07:00
103fa3894a Revert D23841786: [pytorch][PR] Enable distributed package on windows, Gloo backend supported only
Test Plan: revert-hammer

Differential Revision:
D23841786 (0122299f9b)

Original commit changeset: 334ba1ed73ef

fbshipit-source-id: ec95432f9957df56a5a04e52661f5db920b7f57f
2020-09-24 22:44:33 -07:00
bc3151dee0 [quant] Remove unused qconfig argument in qat linear module (#45307)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45307

fixes: https://github.com/pytorch/pytorch/issues/35634

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23917339

fbshipit-source-id: 65f8844b98198bbf93547b3d71408c2a54605218
2020-09-24 22:15:16 -07:00
31ae8117ba [RFC] Remove per-op-registration related code in caffe2/tools/codegen/gen.py (#45134)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45134

Per-Op-Registration was a mechanism used for mobile selective build v0. Since then, a new dispatching mechanism has been built for PyTorch, and this code path isn't used any more. Remove it to simplify understanding/updating the code-generator's code-flow.
ghstack-source-id: 112723942

Test Plan: `buck build` and sandcastle.

Reviewed By: ezyang

Differential Revision: D23806632

fbshipit-source-id: d93cd324650c541d9bfc8eeff2ddb2833b988ecc
2020-09-24 22:02:49 -07:00
0122299f9b Enable distributed package on windows, Gloo backend supported only (#42897)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42095

The test cases will be committed to this PR later.

mrshenli, please help to review

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42897

Reviewed By: osalpekar

Differential Revision: D23841786

Pulled By: mrshenli

fbshipit-source-id: 334ba1ed73eff2f668857390fc32d1bc7f08e5f3
2020-09-24 21:13:55 -07:00
c6500bcf14 [reland] Make grad point to bucket buffer in DDP to save memory usage (#44344)
Summary:
[test all]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44344

reland #41954

Add an argument to the DDP API to enable/disable letting grads point to views. When it is disabled, behavior is the same as DDP today; when it is enabled, both variable.grad() and the grad in the dist autograd context point to DDP's bucket buffer to save memory.
In this case grads are views of the bucket buffer tensors; to make this compatible with optimizer.zero_grad(), we made changes in #41283. A minimal sketch of the flag follows below.

Note also that we cannot make variable.grad() point to the bucket buffer at construction time, because we want to keep grad undefined for unused parameters.
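
A minimal sketch of the flag (assuming it is exposed on the DDP constructor as `gradient_as_bucket_view`; a process group is assumed to be initialized already):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

model = torch.nn.Linear(8, 8)
# With the flag on, each param.grad becomes a view into DDP's
# communication bucket, avoiding a separate copy of the gradients.
ddp = DDP(model, gradient_as_bucket_view=True)

loss = ddp(torch.randn(2, 8)).sum()
loss.backward()
# optimizer.zero_grad() works with grad views per the changes in #41283.
```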
ghstack-source-id: 112845787

Test Plan:
1. When grad_is_view=false:
a. roberta_base, peak memory usage 8250MB, p50 per iteration latency 0.923second, https://www.internalfb.com/intern/fblearner/details/218029699/?notif_channel=cli
b. resnet, peak memory usage 3089MB, p50 per iteration latency 0.120second, https://www.internalfb.com/intern/fblearner/details/218029035/?notif_channel=cli
c. accuracy benchmark, distributed=false, .accuracy 40.914535522461, .loss: 1.6370717287064; distributed=true, .accuracy: 39.966053009033, .loss: 1.6849111318588
https://www.internalfb.com/intern/fblearner/details/218035688/?notif_channel=cli
d. classy vision uru production flow, https://www.internalfb.com/intern/fblearner/details/219065811/?notif_channel=cli
e. pytext flow, https://www.internalfb.com/intern/fblearner/details/219137458/?notif_channel=cli

2. When grad_is_view=true:
a. roberta_base, peak memory usage 7183MB, p50 per iteration latency 0.908second, https://www.internalfb.com/intern/fblearner/details/217882539?tab=operator_details
b. resnet, peak memory usage 2988 MB, p50 per iteration latency 0.119second, https://www.internalfb.com/intern/fblearner/details/218028479/?notif_channel=cli
c. accuracy benchmark, distributed=false, .accuracy 41.713260650635, .loss: 1.69939661026; distributed=true, .accuracy: 39.966053009033, .loss: 1.6849111318588, https://www.internalfb.com/intern/fblearner/details/218037058/?notif_channel=cli
d. classy vision uru production flow, expected, can not work well with apex.amp https://www.internalfb.com/intern/fblearner/details/219205218/?notif_channel=cli
e. pytext flow, detach_() related error, expected, as pytext zero_grad depends on apex repo where detach_() is called. also seeing the warning in finalize_bucket_dense due to tied weights, which is expected. https://www.internalfb.com/intern/fblearner/details/219150229/?notif_channel=cli

Reviewed By: mrshenli

Differential Revision: D23588186

fbshipit-source-id: f724d325b954ef6f06ede31759bf01dd29a6f5e5
2020-09-24 20:54:51 -07:00
630bd85aae [pytorch] refine dispatch keys in native_functions.yaml (2/N) (#45284)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45284

This is the 2nd batch of the change described in #45010.

In this batch we relaxed some filters to cover more 'backend specific' ops:
* ops that do not call any 'Tensor::is_xxx()' method OR only call
  'Tensor::is_cuda()' - we are adding the CUDA dispatch key anyway;
* ops that call other ATen ops but ARE differentiable - differentiability
  is a fuzzy indicator of not being 'composite';

Inherited other filters from the 1st batch:
* These ops don't already have dispatch section in native_functions.yaml;
* These ops call one or more DispatchStub (thus "backend specific");

Differential Revision: D23909901

Test Plan: Imported from OSS

Reviewed By: ailzhang

Pulled By: ljk53

fbshipit-source-id: 3b31e176324b6ac814acee0b0f80d18443bd81a1
2020-09-24 20:18:57 -07:00
7e5492e1be [minor] Fix undefined variable (#45246)
Summary:
Commit 2a37f3fd2f (https://github.com/pytorch/pytorch/pull/45130) deleted the Python variable `capability`, which is used in later lines.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45246

Reviewed By: walterddr

Differential Revision: D23923916

Pulled By: malfet

fbshipit-source-id: c5d7fef9e4a87ccc621191200e5965710e9d6aaa
2020-09-24 20:17:13 -07:00
0f2c648c97 log metadata when model loading failed (#44430)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44430

Log metadata even when model loading fails.

Test Plan: {F331550976}

Reviewed By: husthyc

Differential Revision: D23577711

fbshipit-source-id: 0504e75625f377269f1e5df0f1ebe34b8e564c4b
2020-09-24 20:09:22 -07:00
03dde4c62a Resend diff D23858329 (#45315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45314

in D23858329 (721cfbf842), we put the PriorCorrectionCalibrationPrediction unit test in an OSS file, which caused test failures in the public trunk.

This diff moves it to the FB-only test file.

Test Plan:
```
 buck test //caffe2/caffe2/python/operator_test:torch_integration_test -- test_gather_ranges_to_dense_op

buck test //caffe2/caffe2/fb/python/operator_test:torch_integration_test -- test_prior_correct_calibration_prediction_op
```
all pass.

Reviewed By: houseroad

Differential Revision: D23899012

fbshipit-source-id: 1ed97d8702e2765991e6caf5695d4c49353dae82
2020-09-24 18:41:49 -07:00
677a59dcaa [aten] Call fbgemm functions for embedding prepack/unpack (#44845)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44845

fbgemm functions are vectorized and faster
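
For reference, a small sketch of the ops being sped up (op names inferred from the benchmark names below; treat the exact signatures as assumptions):

```python
import torch

weight = torch.randn(80, 128)  # num_embeddings x embedding_dim

# 8-bit rowwise prepack, now dispatched to vectorized fbgemm kernels.
packed = torch.ops.quantized.embedding_bag_byte_prepack(weight)

# Unpack recovers a dequantized float tensor of the original shape.
unpacked = torch.ops.quantized.embedding_bag_byte_unpack(packed)
assert unpacked.shape == weight.shape
```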

```
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/6473924484856786
Summary (total time 15.08s):
  PASS: 7
  FAIL: 0
  SKIP: 0
  FATAL: 0
  TIMEOUT: 0
  OMIT: 0
```

Performance Before:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 68.727

# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 131.500

# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 248.190

# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 172.742

# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 333.008

# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 652.423

# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 167.282

# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 398.901

# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 785.254

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 122.653

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 230.617

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 408.807

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 176.087

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 337.514

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 659.716

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 342.529

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 665.197

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 1307.923
```

Performance After:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 10.782

# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 17.443

# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 25.898

# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 13.903

# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 18.575

# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 30.650

# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 14.158

# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 19.818

# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 30.852

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 47.596

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 91.025

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 131.425

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 12.637

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 20.856

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 33.944

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 21.181

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 34.213

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 59.622
```
ghstack-source-id: 112836216

Test Plan: buck test //caffe2/test:quantization -- 'test_embedding_bag*'  --print-passing-details

Reviewed By: radkris-git

Differential Revision: D23675777

fbshipit-source-id: 0b1a787864663daecc7449295f9ab6264eac52fc
2020-09-24 17:21:03 -07:00
0b6e5ad4a9 Resolve comments in #44354. (#45150)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45150

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D23846796

Pulled By: ailzhang

fbshipit-source-id: 7bef89d833848ac3f8993c4c037acf1d4f2ca674
2020-09-24 16:40:02 -07:00
92ebb04f92 added check for NumberType (#44375)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44107

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44375

Reviewed By: mrshenli

Differential Revision: D23906728

Pulled By: eellison

fbshipit-source-id: 3b534e5dd3af1f5e43a7314953e64117cbe8ffe4
2020-09-24 16:26:59 -07:00
bee1d448e7 Fix test_rpc_profiling_remote_record_function (#45162)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45162

This test was flaky because it was not able to validate that the
overall record_function's CPU times are greater than the sum of its children.
It turns out that this is a general bug in the profiler that can be reproduced
without RPC, see https://github.com/pytorch/pytorch/issues/45160. Hence,
removing this check from the test and replacing it with validation of the
expected children.

Ran the test 1000 times and they all passed.
ghstack-source-id: 112632327

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23851854

fbshipit-source-id: 5d9023acd17800a6668ba4849659d8cc902b8d6c
2020-09-24 15:57:32 -07:00
5dd288eb06 [JIT] Regularize tensorexpr fuser strategy with other fusers (#44972)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44972

Previously, our fusion strategy would be:
- start at the end of the block, find a fusable node
- iteratively try to merge inputs into the fusion group, sorted topologically

This strategy works pretty well, but has the possibility of missing fusion groups. See my attached test case for an example where we wouldn't find all possible fusion groups. bertmaher found an example of a missed fusion group in one of our RNN examples (jit_premul) that caused a regression from the legacy fuser.

Here, I'm updating our fusion strategy to be the same as our other fusion passes - create_autodiff_subgraphs, and graph_fuser.cpp.

The basic strategy is (see the pseudocode sketch below):
- iterate until you find a fusible node
- try to merge the node's inputs; whenever a successful merge occurs, restart at the beginning of the node's inputs
- after you've exhausted a node, continue searching the block for fusion opportunities from that node
- continue doing this on the block until an iteration completes without a successful merge
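
A pseudocode sketch of this strategy (illustrative only; the helper callbacks are hypothetical stand-ins for the C++ implementation):

```python
def fuse_block(block_nodes, is_fusible, make_group, try_merge_any_input):
    """Run block-level fusion to a fixpoint, per the strategy above."""
    changed_in_pass = True
    while changed_in_pass:            # repeat until a pass makes no merges
        changed_in_pass = False
        for node in block_nodes():
            if not is_fusible(node):
                continue
            group = make_group(node)  # seed a fusion group at this node
            merged = True
            while merged:             # restart over the inputs after each merge
                merged = try_merge_any_input(group)
                changed_in_pass = changed_in_pass or merged
```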

Since we create the fusion groups once, and only re-specialize within the fusion groups, we should be running this very infrequently (it only re-triggers when we fail undefinedness specializations). Also, because it's the same algorithm as the existing fusers, it is unlikely to cause a regression.

Test Plan: Imported from OSS

Reviewed By: Krovatkin, robieta

Differential Revision: D23821581

Pulled By: eellison

fbshipit-source-id: e513d1ef719120dadb0bfafc7a14f4254cd806ee
2020-09-24 15:34:21 -07:00
0137e3641d Refactor subgraph merging (#44238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44238

Refactor create_autodiff_subgraphs to use the same logic for updating output aliasing properties as the tensorexpr fuser, and factor that logic out into a common function in subgraph utils.

Test Plan: Imported from OSS

Reviewed By: Krovatkin, robieta

Differential Revision: D23871565

Pulled By: eellison

fbshipit-source-id: 72df253b16baf8e4aabf3d68b103b29e6a54d44c
2020-09-24 15:29:34 -07:00
1539d4a664 Add operator to compute the equalization scale (#45096)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45096

Add an operator to compute the equalization scale. This will be used when integrating equalization into the dper int8 fixed quant scheme quantization flow.

Design docs:
https://fb.quip.com/bb7SAGBxPGNC

https://fb.quip.com/PDAOAsgoLfRr

Test Plan: buck test caffe2/caffe2/quantization/server:compute_equalization_scale_test

Reviewed By: jspark1105

Differential Revision: D23779870

fbshipit-source-id: 5e6a8c220935a142ecf8e61100a8c71932afa8d7
2020-09-24 15:19:49 -07:00
5a59330647 Add architectural support for multi-GPU. (#44059)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44059

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23820825

Pulled By: AshkanAliabadi

fbshipit-source-id: 0719b00581487a77ebadff867d1e4ac89354bf90
2020-09-24 15:11:55 -07:00
6311c5a483 Minor touchups. (#44317)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44317

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23820828

Pulled By: AshkanAliabadi

fbshipit-source-id: b83bdea9aed2fb52bd254ff15914d55a1af58c04
2020-09-24 15:07:08 -07:00
b84dd771e6 Grammatically updated the tech docs (#45192)
Summary:
Small grammatical update to the [tensor docs](https://pytorch.org/docs/stable/tensors.html).

**_update1_**
![update1](https://user-images.githubusercontent.com/62737243/93969792-5c0ea800-fd8a-11ea-8c9f-0033f51a1fdc.png)

**_update2_**
![update2](https://user-images.githubusercontent.com/62737243/93969801-603ac580-fd8a-11ea-812d-d3026b9fc8a5.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45192

Reviewed By: bwasti

Differential Revision: D23877870

Pulled By: ezyang

fbshipit-source-id: 929ba3d479925b5132dbe87fad2da487408db7c7
2020-09-24 14:48:30 -07:00
cd7a682282 [caffe2] adds hypothesis test for queue ops cancel (#45178)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45178

## Motivation
* To be able to make C2 ops cancellable so we can safely exit.
* Some C2 operators are currently blocking and thus non-cancellable. If an error
occurs, we need to be able to safely stop all net execution so we can throw
the exception to the caller.

## Summary
* Adds a hypothesis test for queue ops cancellation.

Test Plan:
## Unit test added to verify that queue ops propagate errors

```
buck test caffe2/caffe2/python:hypothesis_test
buck test caffe2/caffe2/python:hypothesis_test -- test_safe_dequeue_blob__raises_exception_when_hang --stress-runs 1000
```

```
Summary
  Pass: 1000
  ListingSuccess: 1
```

Reviewed By: d4l3k

Differential Revision: D23847576

fbshipit-source-id: 2fc351e1ee13ea8b32d976216d2d01dfb6fcc1ad
2020-09-24 14:43:52 -07:00
71e6ce6616 [JIT] Specialize AutogradZero: merge AutogradAnyNonZero and Not(AutogradAnyNonZero) checks into one. (#44987)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44987

This PR introduces new `prim::AutogradAllZero` and
`prim::AutogradAllNonZero` ops that batch the checks for
multiple tensors. The specialize-autogradzero pass now generates one
check for all expected-to-be-undefined tensors, one check for all
expected-to-be-defined tensors, and a set of checks for size
parameters passed to `grad_sum_to_size` (this could probably be cleaned
up as well in the future).

An example of what we generated before this change:
```
%1626 : bool = prim::AutogradAnyNonZero(%0)
%1627 : bool = prim::AutogradAnyNonZero(%2)
%1628 : bool = aten::__not__(%1627)
%1629 : bool = prim::AutogradAnyNonZero(%3)
%1630 : bool = aten::__not__(%1629)
%1631 : bool = prim::AutogradAnyNonZero(%4)
%1632 : bool = aten::__not__(%1631)
%1633 : bool = prim::AutogradAnyNonZero(%5)
%1634 : bool = aten::__not__(%1633)
%1635 : bool = prim::AutogradAnyNonZero(%6)
%1636 : bool = aten::__not__(%1635)
%1637 : bool = prim::AutogradAnyNonZero(%7)
%1638 : bool = aten::__not__(%1637)
%1639 : bool = prim::AutogradAnyNonZero(%8)
%1640 : bool = aten::__not__(%1639)
%1641 : bool = prim::AutogradAnyNonZero(%9)
%1642 : bool = aten::__not__(%1641)
%1643 : bool = prim::AutogradAnyNonZero(%10)
%1644 : bool = aten::__not__(%1643)
%1645 : bool = prim::AutogradAnyNonZero(%11)
%1646 : bool = aten::__not__(%1645)
%1647 : bool = prim::AutogradAnyNonZero(%12)
%1648 : bool = aten::__not__(%1647)
%1649 : bool = prim::AutogradAnyNonZero(%13)
%1650 : bool = aten::__not__(%1649)
%1651 : bool = prim::AutogradAnyNonZero(%14)
%1652 : bool = aten::__not__(%1651)
%1653 : bool = prim::AutogradAnyNonZero(%15)
%1654 : bool = aten::__not__(%1653)
%1655 : bool = prim::AutogradAnyNonZero(%16)
%1656 : bool = aten::__not__(%1655)
%1657 : bool = prim::AutogradAnyNonZero(%17)
%1658 : bool = prim::AutogradAnyNonZero(%18)
%1659 : bool = prim::AutogradAnyNonZero(%19)
%1660 : bool = prim::AutogradAnyNonZero(%20)
%1661 : bool = aten::__is__(%self_size.16, %1625)
%1662 : bool = aten::__is__(%other_size.16, %1625)
%1663 : bool = aten::__is__(%self_size.14, %1625)
%1664 : bool = aten::__is__(%self_size.12, %1625)
%1665 : bool = prim::AutogradAnyNonZero(%ingate.7)
%1666 : bool = prim::AutogradAnyNonZero(%forgetgate.7)
%1667 : bool = prim::AutogradAnyNonZero(%cellgate.7)
%1668 : bool = prim::AutogradAnyNonZero(%30)
%1669 : bool = prim::AutogradAnyNonZero(%31)
%1670 : bool = aten::__is__(%self_size.10, %1625)
%1671 : bool = aten::__is__(%other_size.10, %1625)
%1672 : bool = prim::AutogradAnyNonZero(%34)
%1673 : bool = prim::AutogradAnyNonZero(%35)
%1674 : bool = aten::__is__(%self_size.8, %1625)
%1675 : bool = aten::__is__(%other_size.8, %1625)
%1676 : bool = aten::__is__(%self_size.6, %1625)
%1677 : bool = aten::__is__(%other_size.6, %1625)
%1678 : bool = prim::AutogradAnyNonZero(%outgate.7)
%1679 : bool = prim::AutogradAnyNonZero(%41)
%1680 : bool = prim::AutogradAnyNonZero(%42)
%1681 : bool = prim::AutogradAnyNonZero(%43)
%1682 : bool = aten::__is__(%self_size.4, %1625)
%1683 : bool = aten::__is__(%other_size.4, %1625)
%1684 : bool[] = prim::ListConstruct(%1626, %1628, %1630, %1632, %1634, %1636, %1638, %1640, %1642, %1644, %1646, %1648, %1650, %1652, %1654, %1656, %1657, %1658, %1659, %1660, %1661, %1662, %1663, %1664, %1665, %1666, %1667, %1668, %1669, %1670, %1671, %1672, %1673, %1674, %1675, %1676, %1677, %1678, %1679, %1680, %1681, %1682, %1683)
%1685 : bool = aten::all(%1684)
```

Same example after this change:
```
%1625 : None = prim::Constant()
%1626 : bool = aten::__is__(%self_size.16, %1625)
%1627 : bool = aten::__is__(%other_size.16, %1625)
%1628 : bool = aten::__is__(%self_size.14, %1625)
%1629 : bool = aten::__is__(%self_size.12, %1625)
%1630 : bool = aten::__is__(%self_size.10, %1625)
%1631 : bool = aten::__is__(%other_size.10, %1625)
%1632 : bool = aten::__is__(%self_size.8, %1625)
%1633 : bool = aten::__is__(%other_size.8, %1625)
%1634 : bool = aten::__is__(%self_size.6, %1625)
%1635 : bool = aten::__is__(%other_size.6, %1625)
%1636 : bool = aten::__is__(%self_size.4, %1625)
%1637 : bool = aten::__is__(%other_size.4, %1625)
%1638 : bool = prim::AutogradAllNonZero(%0, %17, %18, %19, %20, %ingate.7, %forgetgate.7, %cellgate.7, %30, %31, %34, %35, %outgate.7, %41, %42, %43)
%1639 : bool = prim::AutogradAllZero(%2, %3, %4, %5, %6, %7, %8, %9, %10, %11, %12, %13, %14, %15, %16)
%1640 : bool[] = prim::ListConstruct(%1626, %1627, %1628, %1629, %1630, %1631, %1632, %1633, %1634, %1635, %1636, %1637, %1638, %1639)
%1641 : bool = aten::all(%1640)
```

My performance measurements showed some changes, but I don't really
trust them and think that they are probably just noise. Below are
tables with min-aggregation over 10 runs:

FastRNN models:

| name                                             | base time (s) |   diff time (s) |   % change |
| :---                                             |          ---: |            ---: |       ---: |
| lstm[aten]:bwd                                   |     30.059927 |       29.834089 |      -0.8% |
| lstm[aten]:fwd                                   |     25.673708 |       25.700039 |       0.1% |
| lstm[cudnn]:bwd                                  |     17.866232 |       17.893120 |       0.2% |
| lstm[cudnn]:fwd                                  |     11.418444 |       11.408514 |      -0.1% |
| lstm[jit]:bwd                                    |     27.127205 |       27.141029 |       0.1% |
| lstm[jit]:fwd                                    |     17.018047 |       16.975451 |      -0.3% |
| lstm[jit_multilayer]:bwd                         |     27.502396 |       27.365149 |      -0.5% |
| lstm[jit_multilayer]:fwd                         |     16.918591 |       16.917767 |      -0.0% |
| lstm[jit_premul]:bwd                             |     22.281199 |       22.215082 |      -0.3% |
| lstm[jit_premul]:fwd                             |     14.848708 |       14.896231 |       0.3% |
| lstm[jit_premul_bias]:bwd                        |     20.761206 |       21.170969 |       2.0% |
| lstm[jit_premul_bias]:fwd                        |     15.013515 |       15.037978 |       0.2% |
| lstm[jit_simple]:bwd                             |     26.715771 |       26.697786 |      -0.1% |
| lstm[jit_simple]:fwd                             |     16.675898 |       16.545893 |      -0.8% |
| lstm[py]:bwd                                     |     56.327065 |       54.731030 |      -2.8% |
| lstm[py]:fwd                                     |     39.876324 |       39.230572 |      -1.6% |

Torch Hub models:

| name                                             | base time (s) |   diff time (s) |   % change |
| :---                                             |          ---: |            ---: |       ---: |
| test_eval[BERT_pytorch-cuda-jit]                 |      0.111706 |        0.106604 |      -4.6% |
| test_eval[LearningToPaint-cuda-jit]              |      0.002841 |        0.002801 |      -1.4% |
| test_eval[Super_SloMo-cuda-jit]                  |      0.384869 |        0.384737 |      -0.0% |
| test_eval[attension_is_all_you_nee...-cuda-jit]  |      0.123857 |        0.123923 |       0.1% |
| test_eval[demucs-cuda-jit]                       |      0.077270 |        0.076878 |      -0.5% |
| test_eval[fastNLP-cuda-jit]                      |      0.000255 |        0.000249 |      -2.3% |
| test_eval[moco-cuda-jit]                         |      0.426472 |        0.427380 |       0.2% |
| test_eval[pytorch_CycleGAN_and_pix...-cuda-jit]  |      0.026483 |        0.026423 |      -0.2% |
| test_eval[pytorch_mobilenet_v3-cuda-jit]         |      0.036202 |        0.035853 |      -1.0% |
| test_eval[pytorch_struct-cuda-jit]               |      0.001439 |        0.001495 |       3.9% |
| test_train[BERT_pytorch-cuda-jit]                |      0.247236 |        0.247188 |      -0.0% |
| test_train[Background_Matting-cuda-jit]          |      3.536659 |        3.581864 |       1.3% |
| test_train[LearningToPaint-cuda-jit]             |      0.015341 |        0.015331 |      -0.1% |
| test_train[Super_SloMo-cuda-jit]                 |      1.018626 |        1.019098 |       0.0% |
| test_train[attension_is_all_you_nee...-cuda-jit] |      0.446314 |        0.444893 |      -0.3% |
| test_train[demucs-cuda-jit]                      |      0.169647 |        0.169846 |       0.1% |
| test_train[fastNLP-cuda-jit]                     |      0.001990 |        0.001978 |      -0.6% |
| test_train[moco-cuda-jit]                        |      0.855323 |        0.856974 |       0.2% |
| test_train[pytorch_mobilenet_v3-cuda-jit]        |      0.497723 |        0.485416 |      -2.5% |
| test_train[pytorch_struct-cuda-jit]              |      0.309692 |        0.308792 |      -0.3% |

Differential Revision: D23794659

Test Plan: Imported from OSS

Reviewed By: bertmaher

Pulled By: ZolotukhinM

fbshipit-source-id: 859b68868ef839c5c6cbc7021879ee22d3144ea8
2020-09-24 14:31:49 -07:00
cbe1eac1f4 [caffe2] adds Cancel to SafeDequeueBlobsOp and SafeEnqueueBlobsOp (#45177)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45177

## Motivation
* To be able to make C2 ops cancellable so we can safely exit.
* Some C2 operators are currently blocking and thus non-cancellable. If an error
occurs, we need to be able to safely stop all net execution so we can throw
the exception to the caller.

## Summary
* When an error occurs in a net or the net is cancelled, running ops have their
`Cancel` method called.
This diff adds a `Cancel` method to `SafeEnqueueBlobsOp`
and `SafeDequeueBlobsOp` that calls queue->close() to force all the
blocking ops to return.
* Adds a unit test that verifies the error propagation.

Test Plan:
## Unit test added to verify that queue ops propagate errors

```
buck test caffe2/caffe2/python:hypothesis_test -- test_safe_dequeue_blob__raises_exception_when_hang --stress-runs 1000
```

```
Summary
  Pass: 1000
  ListingSuccess: 1
```

Reviewed By: d4l3k

Differential Revision: D23846967

fbshipit-source-id: c7ddd63259e033ed0bed9df8e1b315f87bf59394
2020-09-24 14:22:46 -07:00
022ba5a78b Make ddp_comm_hook_wrapper a private method. (#44643)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44643

This method is not used anywhere else.

Also formatted the file.

Test Plan: buck test caffe2/test/distributed/algorithms/ddp_comm_hooks:test_ddp_hooks

Reviewed By: pritamdamania87

Differential Revision: D23675945

fbshipit-source-id: 2d04f94589a20913e46b8d71e6a39b70940c1461
2020-09-24 13:29:48 -07:00
e2bcdc7b69 [Caffe2] Fix LayerNormOp when batch_size == 0. (#45250)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45250

[Caffe2] Fix LayerNormOp when batch_size == 0.

Test Plan: buck test mode/dev-nosan //caffe2/caffe2/python/operator_test:layer_norm_op_test

Reviewed By: houseroad

Differential Revision: D23892091

fbshipit-source-id: 9a34654dd6880c9d14b7111fcf850e4f48ffdf91
2020-09-24 12:30:03 -07:00
c3a5aed5f7 Run pytorch_core CUDA tests on GPU using TPX
Summary:
Modify contbuild to disable sanitizers, add option to run "cuda" test using TPX RE

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: walterddr, cspanda

Differential Revision: D23854578

fbshipit-source-id: 327d7cc3655c17034a6a7bc78f69967403290623
2020-09-24 12:12:23 -07:00
c211a9102f add rocm 3.8 to nightly builds (#45222)
Summary:
Corresponding change in builder repo: https://github.com/pytorch/builder/pull/528.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45222

Reviewed By: ezyang

Differential Revision: D23894831

Pulled By: walterddr

fbshipit-source-id: c6a256ec325ddcf5836b4d293f546368d58db538
2020-09-24 12:00:30 -07:00
26001a2334 Revert D23753711: [pytorch][PR] Add foreach APIs for binary ops with ScalarList
Test Plan: revert-hammer

Differential Revision:
D23753711 (71d1b5b0e2)

Original commit changeset: bf3e8c54bc07

fbshipit-source-id: 192692e0d3fff4cade9983db0a1760fedfc9674c
2020-09-24 11:55:49 -07:00
c79d493096 added rocm 3.8 docker image (#45205)
Summary:
jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45205

Reviewed By: malfet

Differential Revision: D23906606

Pulled By: walterddr

fbshipit-source-id: 604a12bf4c97260215a1881cc96e35e7c42b4578
2020-09-24 11:18:33 -07:00
3f5eee666c Adjust TF32 tests (#44240)
Summary:
- The thresholds of some tests are bumped up. Depending on the random generator, these tests sometimes fail with errors like "0.0059 is not smaller than 0.005". I ran `test_nn.py` and `test_torch.py` 10+ times to check that these are no longer flaky.
- Add `tf32_on_and_off` to new `matrix_exp` tests.
- Disable TF32 on test suites other than `test_nn.py` and `test_torch.py`

cc: ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44240

Reviewed By: mruberry

Differential Revision: D23882498

Pulled By: ngimel

fbshipit-source-id: 44a9ec08802c93a2efaf4e01d7487222478b6df8
2020-09-24 10:25:58 -07:00
b8eab8cdbd [hotfix] typo in NaiveConvolutionTranspose2d.cu (#45224)
Summary:
Fixes typo in e2f49c8
Fixes https://github.com/pytorch/pytorch/issues/45172

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45224

Reviewed By: ezyang

Differential Revision: D23879872

Pulled By: walterddr

fbshipit-source-id: c3db6d4c6f2ac0e6887862d4217a79c030647cb9
2020-09-24 10:06:29 -07:00
e57a08119b Add a warning log when there is high skew of uneven inputs in DDP training (#45238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45238

Adds a warning when the discrepancy in the number of inputs across different
processes is much higher than expected when running with uneven inputs. A skew
in the thousands of inputs can reduce performance by a nontrivial amount, as
shown in benchmarks, so this warning was proposed. Tested by running the tests
so that the threshold is hit and observing the output.
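
For context, a minimal sketch of the uneven-inputs mechanism this warning applies to (`ddp_model`, `per_rank_loader`, and `optimizer` are hypothetical stand-ins):

```python
def train_uneven(ddp_model, per_rank_loader, optimizer):
    # join() shadows the collectives of ranks whose loaders finish early;
    # the new warning fires when the input-count skew across ranks is
    # unexpectedly large, since large skews measurably hurt performance.
    with ddp_model.join():
        for batch in per_rank_loader:  # lengths may differ across ranks
            loss = ddp_model(batch).sum()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```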
ghstack-source-id: 112773552

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23719270

fbshipit-source-id: 306264f62c1de65e733696a912bdb6e9376d5622
2020-09-24 09:50:44 -07:00
2b38c09f69 Moves prim ops from C10 back to JIT (#45144)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45144

Moves prim ops from C10 back to JIT.

These were originally moved to C10 from JIT in D19237648 (f362cd510d)
ghstack-source-id: 112775781

Test Plan:
buck test //caffe2/test/cpp/jit:jit

https://pxl.cl/1l22N

buck test adsatlas/gavel/lib/ata_processor/tests:ata_processor_test

https://pxl.cl/1lBxD

Reviewed By: iseeyuan

Differential Revision: D23697598

fbshipit-source-id: 36d1eb8c346e9b161ba6af537a218440a9bafd27
2020-09-24 09:44:20 -07:00
8507ea22b2 replace timer test with a mocked variant (#45173)
Summary:
I noticed that the recently introduced adaptive_autorange tests occasionally time out in CI, and I've been meaning to improve the Timer tests for a while. This PR allows unit tests to swap the measurement portion of `Timer` with a deterministic mock so we can thoroughly test behavior without having to worry about flaky CI measurements. It also means that the tests can be much more detailed and still finish very quickly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45173

Test Plan: You're lookin' at it.

Reviewed By: ezyang

Differential Revision: D23873548

Pulled By: robieta

fbshipit-source-id: 26113e5cea0cbf46909b9bf5e90c878c29e87e88
2020-09-24 09:42:37 -07:00
bfdf4323ac Bump up NCCL to 2.7.8 (#45251)
Summary:
Use latest NCCL

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45251

Reviewed By: mingzhe09088

Differential Revision: D23893064

Pulled By: mrshenli

fbshipit-source-id: 820dd166039e61a5aa59b4c5bbc615a7b18be8c3
2020-09-24 09:33:57 -07:00
5195d727b5 adding a test for ddp save()/load() (#44906)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44906

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23825386

Pulled By: bdhirsh

fbshipit-source-id: 2276e6e030ef9cffd78fc78c2ffe34d60a1e160e
2020-09-24 09:15:53 -07:00
f9ae296a85 renaming TestDdpCommHook class so it doesn't get picked up as a test by pytest (#44905)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44905

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23825308

Pulled By: bdhirsh

fbshipit-source-id: 17a07b3bd211850d6ecca793fd9ef3f326ca9274
2020-09-24 08:46:25 -07:00
bc591d76a1 add skip_if_rocm to all requires_nccl tests (#45158)
Summary:
The requires_nccl annotation should apply skip_if_rocm as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45158

Reviewed By: seemethere

Differential Revision: D23879952

Pulled By: walterddr

fbshipit-source-id: 818fb31ab75d5f02e77fe3f1367faf748855bee7
2020-09-24 08:37:49 -07:00
71d1b5b0e2 Add foreach APIs for binary ops with ScalarList (#44743)
Summary:
In this PR:
1) Added binary operations with ScalarLists.
2) Fixed _foreach_div(...) bug in native_functions
3) Covered all possible cases with scalars and scalar lists in tests
4) [minor] fixed bug in native_functions by adding "use_c10_dispatcher: full" to all _foreach functions

Tested via unit tests. A usage sketch follows below.
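
A usage sketch (the `_foreach_*` APIs are internal, and the exact overload names here are taken from this PR's description, so treat them as assumptions):

```python
import torch

tensors = [torch.ones(2, 2) for _ in range(3)]
scalars = [1.0, 2.0, 3.0]  # one scalar per tensor

# One fused multi-tensor kernel instead of a Python loop:
out = torch._foreach_add(tensors, scalars)

# Reference semantics, tensor by tensor:
ref = [t + s for t, s in zip(tensors, scalars)]
```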

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44743

Reviewed By: bwasti, malfet

Differential Revision: D23753711

Pulled By: izdeby

fbshipit-source-id: bf3e8c54bc07867e8f6e82b5d3d35ff8e99b5a0a
2020-09-24 08:30:42 -07:00
bea7901e38 Enable torch.tensor typechecks (#45077)
Summary:
this fixes https://github.com/pytorch/pytorch/issues/42983.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45077

Reviewed By: ezyang

Differential Revision: D23842493

Pulled By: walterddr

fbshipit-source-id: 1c516a5ff351743a187d00cba7ed0be11678edf1
2020-09-24 08:22:06 -07:00
dc67b47bc9 Deprecate old fft functions (#44876)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44876

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23866715

Pulled By: mruberry

fbshipit-source-id: 73305eb02f92cbd1ef7d175419529d19358fedda
2020-09-24 02:39:44 -07:00
6d21d5f0b3 gtest-ify JIT tests, through the letter c (#45249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45249

Reland of https://github.com/pytorch/pytorch/pull/45055 and
https://github.com/pytorch/pytorch/pull/45020

See https://github.com/pytorch/pytorch/pull/45018 for context.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23892645

Pulled By: suo

fbshipit-source-id: e7fe58d5e1a5a0c44f4e2aec9694145afabde0fd
2020-09-24 00:21:20 -07:00
29dc3c5ec8 Sparse softmax support (CUDA) (#42307)
Summary:
This PR implements softmax support for sparse tensors.

Resolves gh-23651 for CUDA.

- [x]  sparse softmax
    - [x]  CUDA C++ implementation
    - [x]  unittests
    - [x]  update softmax documentation
    - [x]  autograd support
- [x]  sparse log_softmax
    - [x]  CUDA C++ implementation
    - [x]  unittests
    - [x]  update log_softmax documentation
    - [x]  autograd support

Here are some benchmark (script is [here](https://gist.github.com/aocsa/fbc1827b3e49901512a33ba96092cbc1)) results for `torch.sparse.softmax and torch.softmax`,  using CPU and GPU, values are float64 scalars, timing repeat is 1000:

| size         | density | sparse CUDA | sparse CPU |
|--------------|---------|-------------|------------|
| (32, 10000)  | 0.01    | 380.2       | 687.5      |
| (32, 10000)  | 0.05    | 404.3       | 2357.9     |
| (32, 10000)  | 0.1     | 405.9       | 3677.2     |
| (512, 10000) | 0.01    | 438.0       | 5443.4     |
| (512, 10000) | 0.05    | 888.1       | 24485.0    |
| (512, 10000) | 0.1     | 1921.3      | 45340.5    |

| size         | density | dense CUDA | dense CPU |
|--------------|---------|------------|-----------|
| (32, 10000)  | 0.01    | 23.6       | 1943.2    |
| (32, 10000)  | 0.05    | 23.6       | 1954.0    |
| (32, 10000)  | 0.1     | 23.5       | 1950.0    |
| (512, 10000) | 0.01    | 639.3      | 39797.9   |
| (512, 10000) | 0.05    | 640.3      | 39374.4   |
| (512, 10000) | 0.1     | 639.6      | 39192.3   |

Times are in microseconds (us).

Quick note:  I updated the performance test again.
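
Usage mirrors the dense op; a minimal sketch:

```python
import torch

i = torch.tensor([[0, 0, 1], [0, 2, 1]])
v = torch.tensor([1.0, 2.0, 3.0])
s = torch.sparse_coo_tensor(i, v, (2, 3), device="cuda")

# Softmax over the specified entries along dim=1; unspecified entries are
# treated as -inf, so they remain zero in the output.
out = torch.sparse.softmax(s, dim=1)
log_out = torch.sparse.log_softmax(s, dim=1)
```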

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42307

Reviewed By: ngimel

Differential Revision: D23774427

Pulled By: mruberry

fbshipit-source-id: bfabf726075b39dde544c10249f27ae1871f82c7
2020-09-24 00:07:30 -07:00
b3d7c2f978 [ONNX] Update ONNX docs for release (#45086)
Summary:
ONNX doc updates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45086

Reviewed By: ezyang

Differential Revision: D23880383

Pulled By: bzinodev

fbshipit-source-id: ca29782fd73024967ee7708c217a005233e7b970
2020-09-23 23:28:36 -07:00
3dd0e362db [TensorExpr] Fix min and max for integral inputs in CUDA backend (#44984)
Summary:
For integral types, isnan is meaningless. Provide specializations for
maximum and minimum which don't call it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44984

Test Plan: python test/test_jit_fuser_te.py -k TestTEFuser.test_minmax_int_ops

Reviewed By: ezyang

Differential Revision: D23885259

Pulled By: asuhan

fbshipit-source-id: 2e6da2c43c0ed18f0b648a2383d510894c574437
2020-09-23 23:19:12 -07:00
b470fa4500 Add complex number support for binary logical operators (#43174)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43174

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23684425

Pulled By: mruberry

fbshipit-source-id: 4857b16e18ec4c65327136badd7f04c74e32d330
2020-09-23 23:03:00 -07:00
0b6b735863 [fix] type promotion atan2 (#43466)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43466

Reviewed By: malfet

Differential Revision: D23834928

Pulled By: mruberry

fbshipit-source-id: 2e7e0b4fcf1a846efc171c275d65a6daffd3c631
2020-09-23 22:23:05 -07:00
6a2e9eb51c torch.fft: Multi-dimensional transforms (#44550)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44550

Part of the `torch.fft` work (gh-42175).
This adds n-dimensional transforms: `fftn`, `ifftn`, `rfftn` and `irfftn`.

This is aiming for correctness first, with the implementation on top of the existing `_fft_with_size` restrictions. I plan to follow up later with a more efficient rewrite that makes `_fft_with_size` work with arbitrary numbers of dimensions.
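
A minimal usage sketch of the new transforms:

```python
import torch
import torch.fft  # the module must be imported explicitly in this release

x = torch.randn(4, 8, 8)
X = torch.fft.fftn(x)                    # n-dimensional complex FFT
x_back = torch.fft.ifftn(X)              # round-trips up to float error

r = torch.randn(8, 8)
R = torch.fft.rfftn(r)                   # real input -> one-sided output
r_back = torch.fft.irfftn(R, s=r.shape)  # pass s to recover exact sizes
```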

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23846032

Pulled By: mruberry

fbshipit-source-id: e6950aa8be438ec5cb95fb10bd7b8bc9ffb7d824
2020-09-23 22:09:58 -07:00
070fe15e4c Add link to profiling recipe from rpc main docs (#45235)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45235

This is so that users know that the profiler works as expected with
RPC and they can learn how to use it to profile RPC-based workloads.
ghstack-source-id: 112773748

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23777888

fbshipit-source-id: 4805be9b949c8c7929182f291a6524c3c6a725c1
2020-09-23 22:02:38 -07:00
956a25d061 Revert D23858329: [PT Model Split] Support 2 operators in PT by C2 conversion
Test Plan: revert-hammer

Differential Revision:
D23858329 (721cfbf842)

Original commit changeset: ed37118ca7f0

fbshipit-source-id: 30c700f80665be11afc608b00a77766064e60b35
2020-09-23 21:20:21 -07:00
2d00ebd29f Failing test demonstrating problems with mixed output shapes (#44455)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44455

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23886119

Pulled By: bertmaher

fbshipit-source-id: 41787930f154cf4e8a1766613c4cf33b18246555
2020-09-23 21:15:37 -07:00
c760bc8fb1 Add GlowLoadAOTModel flag (#45189)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45189

Pull Request resolved: https://github.com/pytorch/glow/pull/4902

Test Plan: Test locally

Reviewed By: yinghai

Differential Revision: D23810445

fbshipit-source-id: 56e717d80abbfe76b15d0f4249e1e399a9722753
2020-09-23 20:50:04 -07:00
60665ace17 [quant] Add optimized approach to calculate qparams for qembedding_bag (#45149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45149

The choose_qparams_optimized op calculates the optimized qparams.
It uses a greedy approach to nudge the min and max, calculating the L2 norm
and trying to minimize the quantization error `torch.norm(x - fake_quant(x, s, z))`.
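
An illustrative sketch of the greedy search (simplified; not the shipped kernel):

```python
import torch

def quant_error(x, lo, hi, bits=8):
    qmax = 2 ** bits - 1
    scale = max((hi - lo) / qmax, 1e-8)
    zp = int(min(max(round(-lo / scale), 0), qmax))
    xq = torch.fake_quantize_per_tensor_affine(x, scale, zp, 0, qmax)
    return torch.norm(x - xq).item()

def greedy_qparams(x, steps=100):
    lo, hi = x.min().item(), x.max().item()
    step = (hi - lo) / steps
    best, improved = quant_error(x, lo, hi), True
    while improved:                  # nudge min/max inward while error drops
        improved = False
        for cand in ((lo + step, hi), (lo, hi - step)):
            err = quant_error(x, *cand)
            if err < best:
                (lo, hi), best, improved = cand, err, True
    return lo, hi
```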

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23848060

fbshipit-source-id: c6c57c9bb07664c3f1c87dd7664543e09f634aee
2020-09-23 19:00:22 -07:00
721cfbf842 [PT Model Split] Support 2 operators in PT by C2 conversion (#45231)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45231

Two operators, `PriorCorrectionCalibrationPrediction` and `GatherRangesToDense`, are not supported in PT, which prevents GLOW from working.

To unblock, we first try a C2->PT conversion. In the long term, we need to implement PT custom ops.

This diff does the conversion to unblock the current project.

Test Plan:
Run unit tests. The test input is from the current DPER example.
All pass.
```buck test //caffe2/caffe2/python/operator_test:torch_integration_test -- test_prior_correct_calibration_prediction_op  --print-passing-details

> c2 reference output
> [0.14285715 0.27272728 0.39130434 0.5 ]

> PT converted output
> tensor([0.1429, 0.2727, 0.3913, 0.5000])

buck test //caffe2/caffe2/python/operator_test:torch_integration_test -- test_gather_ranges_to_dense_op  --print-passing-details

c2 reference output
> [array([[6, 5, 4, 3], [0, 0, 0, 0]], dtype=int64)]

> PT converted output
> [tensor([[6, 5, 4, 3], [0, 0, 0, 0]])]
```

Reviewed By: allwu, qizzzh

Differential Revision: D23858329

fbshipit-source-id: ed37118ca7f09e1cd0ad1fdec3d37f66dce60dd9
2020-09-23 18:31:57 -07:00
27c7158166 Remove __future__ imports for legacy Python2 supports (#45033)
Summary:
There is a tool called `2to3` with a `future` fixer that you can target specifically to remove these; the `caffe2` directory has the most redundant imports:

```2to3 -f future -w caffe2```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45033

Reviewed By: seemethere

Differential Revision: D23808648

Pulled By: bugra

fbshipit-source-id: 38971900f0fe43ab44a9168e57f2307580d36a38
2020-09-23 17:57:02 -07:00
e9aa6898ab Revert D23802296: gtest-ify JIT tests, through the letter c
Test Plan: revert-hammer

Differential Revision:
D23802296 (d2b045030e)

Original commit changeset: 20c9798a414e

fbshipit-source-id: a28d56039ca404fe94ed7572f1febd1673e3e788
2020-09-23 17:42:19 -07:00
89c570ed0a Revert D23811085: gtestify dce and fuser tests
Test Plan: revert-hammer

Differential Revision:
D23811085 (246bd9422a)

Original commit changeset: 45008e41f239

fbshipit-source-id: 94c981f565cab9b710fe52a55bbe8dbf9c179c23
2020-09-23 17:27:59 -07:00
76c185dcca [TensorExpr] When lanes differ, insert Broadcast instead of Cast (#45179)
Summary:
We need to check if dtypes differ in scalar type or lanes to decide between
Cast and Broadcast.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45179

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.SimplifyBroadcastTermExpander

Reviewed By: bwasti

Differential Revision: D23873316

Pulled By: asuhan

fbshipit-source-id: ca141be67e10c2b6c5f2ff9c11e42dcfc62ac620
2020-09-23 17:06:54 -07:00
f93ead6d37 [quant][eagermode] Custom module support (#44835)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44835

This is for feature parity with fx graph mode quantization

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23745086

fbshipit-source-id: ae2fc86129f9896d5a9039b73006a4da15821307
2020-09-23 15:39:40 -07:00
0495998862 [TensorExpr] Disallow arithmetic binary operations on Bool (#44677)
Summary:
Arithmetic operations on Bool aren't fully supported in the evaluator. Moreover,
such semantics can be implemented by the client code through insertion of
explicit casts to widen and narrow to the desired types.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44677

Test Plan:
test_tensorexpr --gtest_filter=TensorExprTest.ExprDisallowBoolArithmetic
python test/test_jit_fuser_te.py

Reviewed By: agolynski

Differential Revision: D23801412

Pulled By: asuhan

fbshipit-source-id: fff5284e3a216655dbf5a9a64d1cb1efda271a36
2020-09-23 14:59:11 -07:00
8e0fc711f4 [TensorExpr] Remove unused EvalConstExpr function (#45180)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45180

Test Plan: build

Reviewed By: ezyang

Differential Revision: D23877151

Pulled By: asuhan

fbshipit-source-id: a5d4d211c1dc85e6f7045330606163a933b9474e
2020-09-23 14:55:27 -07:00
2a1a51facb Fix typos. (#45195)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45195

Fix some typos in reducer class.
ghstack-source-id: 112673443

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D23862399

fbshipit-source-id: 0dc69e5ea1fa7d33c85d1909b2216bcd1f579f6a
2020-09-23 14:51:15 -07:00
246bd9422a gtestify dce and fuser tests (#45055)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45055

See https://github.com/pytorch/pytorch/pull/45018 for context.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23811085

Pulled By: suo

fbshipit-source-id: 45008e41f2394d2ba319745b0340392e1b3d3172
2020-09-23 14:33:22 -07:00
d2b045030e gtest-ify JIT tests, through the letter c (#45020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45020

See https://github.com/pytorch/pytorch/pull/45018 for context.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23802296

Pulled By: suo

fbshipit-source-id: 20c9798a414e9ba30869a862012cbdee0613c8b1
2020-09-23 14:28:45 -07:00
3f89b779c4 [jit] allow submodule methods inference rule be different (#43872)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43872

This PR allows recursive scripting to use a separate
submodule_stubs_fn to create its submodules with specific user-provided
rules.

Fixes https://github.com/pytorch/pytorch/issues/43729

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D23430176

Pulled By: wanchaol

fbshipit-source-id: 20530d7891ac3345b36f1ed813dc9c650b28d27a
2020-09-23 14:10:31 -07:00
9e206ee9f1 [NNC] Fix a bug in SplitWithMask when splitting multiple times (#45141)
Summary:
When doing a splitWithMask we only mask if the loop extent is not cleanly divided by the split factor. However, the logic does not simplify the extents, so any nontrivial loop extent will always cause a mask to be added, e.g. if the loop had been previously split. Unlike splitWithTail, the masks added by splitWithMask are always overhead, and we don't have the analysis to optimize them out when they are unnecessary, so it's good to avoid inserting them if we can.

The fix is just to simplify the loop extents before doing the extent calculation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45141

Reviewed By: ezyang

Differential Revision: D23869170

Pulled By: nickgg

fbshipit-source-id: 44686fd7b802965ca4f5097b0172a41cf837a1f5
2020-09-23 14:04:58 -07:00
adb2b380ba [quant][graphmode][fx] qconfig_dict support more types of configurations (#44856)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44856

Support the following format of qconfig_dict:
```python
qconfig_dict = {
    # optional, global config
    "": qconfig?,

    # optional, used for module and function types
    # could also be split into module_types and function_types if we prefer
    "object_type": [
      (nn.Conv2d, qconfig?),
      (F.add, qconfig?),
      ...,
    ],

    # optional, used for module names
    "module_name": [
      ("foo.bar", qconfig?)
      ...,
    ],

    # optional, matched in order, first match takes precedence
    "module_name_regex": [
      ("foo.*bar.*conv[0-9]+", qconfig?)
      ...,
    ]
    # priority (in increasing order): global, object_type, module_name_regex, module_name
    # qconfig == None means fusion and quantization should be skipped for anything
    # matching the rule
}
```
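
A hedged usage sketch (assuming the prototype FX graph mode entry points `prepare_fx`/`convert_fx`; the module and qconfig choices are hypothetical):

```python
import torch
import torch.nn as nn
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx, convert_fx

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 1)

    def forward(self, x):
        return self.conv(x)

qconfig_dict = {
    "": get_default_qconfig("fbgemm"),  # global default
    "module_name": [("conv", None)],    # example: skip quantizing self.conv
}

m = M().eval()
prepared = prepare_fx(m, qconfig_dict)  # insert observers per the dict
prepared(torch.randn(1, 3, 4, 4))       # calibration pass
quantized = convert_fx(prepared)
```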

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23751304

fbshipit-source-id: 5b98f4f823502b12ae2150c93019c7b229c49c50
2020-09-23 13:59:53 -07:00
21fabae47a Remove expensive call to PyObject_GetAttrString in PyTorch_LookupSpecial (#44684)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44684

The ad-hoc quantization benchmarking script in D23689062 recently highlighted that quantized ops were surprisingly slow after the introduction of support for custom ops in torch.fx in D23203204 (f15e27265f).

Using strobelight, it's immediately clear that up to 66% of samples were seen in `c10::get_backtrace`, which descends from `torch::is_tensor_and_append_overloaded -> torch::check_has_torch_function -> torch::PyTorch_LookupSpecial -> PyObject_HasAttrString -> PyObject_GetAttrString`.

I'm no expert by any means so please correct any/all misinterpretation, but it appears that:
- `check_has_torch_function` only needs to return a bool
- `PyTorch_LookupSpecial` should return `NULL` if a matching method is not found on the object
- in the impl of `PyTorch_LookupSpecial` the return value from `PyObject_HasAttrString` only serves as a bool to return early, but ultimately ends up invoking `PyObject_GetAttrString`, which raises, spawning the generation of a backtrace
- `PyObject_FastGetAttrString` returns `NULL` (stolen ref to an empty py::object if the if/else if isn't hit) if the method is not found, anyway, so it could be used singularly instead of invoking both `GetAttrString` and `FastGetAttrString`
- D23203204 (f15e27265f) compounded (but maybe not directly caused) the problem by increasing the number of invocations

so, removing it in this diff and seeing how many things break :)

before:
strobelight: see internal section
output from D23689062 script:
```
$ ./buck-out/gen/scripts/v/test_pt_quant_perf.par
Sequential(
  (0): Quantize(scale=tensor([0.0241]), zero_point=tensor([60]), dtype=torch.quint8)
  (1): QuantizedLinear(in_features=4, out_features=4, scale=0.017489388585090637, zero_point=68, qscheme=torch.per_tensor_affine)
  (2): DeQuantize()
)
fp 0.010896682739257812
q 0.11908197402954102
```

after:
strobelight: see internal section
output from D23689062 script:
```
$ ./buck-out/gen/scripts/v/test_pt_quant_perf.par
Sequential(
  (0): Quantize(scale=tensor([0.0247]), zero_point=tensor([46]), dtype=torch.quint8)
  (1): QuantizedLinear(in_features=4, out_features=4, scale=0.012683945707976818, zero_point=41, qscheme=torch.per_tensor_affine)
  (2): DeQuantize()
)
fp 0.011141300201416016
q 0.022639036178588867
```

which roughly restores original performance seen in P142370729

UPDATE: 9/22 mode/opt benchmarks
```
buck run //scripts/x:test_pt_quant_perf mode/opt
Sequential(
  (0): Quantize(scale=tensor([0.0263]), zero_point=tensor([82]), dtype=torch.quint8)
  (1): QuantizedLinear(in_features=4, out_features=4, scale=0.021224206313490868, zero_point=50, qscheme=torch.per_tensor_affine)
  (2): DeQuantize()
)
fp 0.002968311309814453
q 0.5138928890228271
```

with patch:
```
buck run //scripts/x:test_pt_quant_perf mode/opt
Sequential(
  (0): Quantize(scale=tensor([0.0323]), zero_point=tensor([70]), dtype=torch.quint8)
  (1): QuantizedLinear(in_features=4, out_features=4, scale=0.017184294760227203, zero_point=61, qscheme=torch.per_tensor_affine)
  (2): DeQuantize()
)
fp 0.0026655197143554688
q 0.0064449310302734375
```

Reviewed By: ezyang

Differential Revision: D23697334

fbshipit-source-id: f756d744688615e01c94bf5c48c425747458fb33
2020-09-23 13:52:54 -07:00
99242eca1d Dockerfile: Support CUDA 11 (#45071)
Summary:
Although PyTorch already supports CUDA 11, the Dockerfile still relies on CUDA 10. This pull request upgrades all the necessary versions such that recent NVIDIA GPUs like A100 can be used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45071

Reviewed By: ezyang

Differential Revision: D23873224

Pulled By: seemethere

fbshipit-source-id: 822c25f183dcc3b4c5b780c00cd37744d34c6e00
2020-09-23 11:38:49 -07:00
4d80c8c648 Fix inlining interface call in fork subgraph (#43790)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43790

Interface calls were not handled properly when they are used in fork
subgraph. This PR fixes this issue.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23402039

Pulled By: bzinodev

fbshipit-source-id: 41adc5ee7d942250e732e243ab30e356d78d9bf7
2020-09-23 11:17:19 -07:00
da4033d32a Make cudaHostRegister actually useful on cudart. (#45159)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45159

By default, pybind11 binds void* to be capsules.  After a lot of
Googling, I have concluded that this is not actually useful:
you can't actually create a capsule from Python land, and our
data_ptr() function returns an int, which means that the
function is effectively unusable.  It didn't help that we had no
tests exercising it.

I've replaced the void* with uintptr_t, so that we now accept int
(and you can pass data_ptr() in directly).  I'm not sure if we
should make these functions accept ctypes types; unfortunately,
pybind11 doesn't seem to have any easy way to do this.

Fixes #43006

Also added cudaHostUnregister which was requested.
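
A hedged usage sketch of the resulting bindings (error-code checking omitted; per the description, a `data_ptr()` int can now be passed directly):

```python
import torch

t = torch.empty(1024)              # regular pageable CPU tensor
cudart = torch.cuda.cudart()       # raw cudart bindings

# Register the tensor's storage as page-locked (pinned) host memory.
size_bytes = t.numel() * t.element_size()
cudart.cudaHostRegister(t.data_ptr(), size_bytes, 0)

# ... t can now participate in faster/async host<->device copies ...

cudart.cudaHostUnregister(t.data_ptr())   # the newly added counterpart
```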

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D23849731

Pulled By: ezyang

fbshipit-source-id: 8a79986f3aa9546abbd2a6a5828329ae90fd298f
2020-09-23 11:05:44 -07:00
a5a4924c27 Warn if import torch is called from the source root. (#39995)
Summary:
This is a small developer quality of life improvement. I commonly try to run some snippet of python as I'm working on a PR and forget that I've cd-d into the local clone to run some git commands, resulting in annoying failures like:
`ImportError: cannot import name 'default_generator' from 'torch._C' (unknown location)`

This actually took a non-trivial amount of time to figure out the first time I hit it, and even now it's annoying because it happens just infrequently enough to not sit high in the mental cache.

This PR adds a check to `torch/__init__.py` and warns if `import torch` is likely resolving to the wrong thing:

```
WARNING:root:You appear to be importing PyTorch from a clone of the git repo:
  /data/users/taylorrobie/repos/pytorch
  This will prevent `import torch` from resolving to the PyTorch install
  (instead it will try to load /data/users/taylorrobie/repos/pytorch/torch/__init__.py)
  and will generally lead to other failures such as a failure to load C extensions.
```

so that the soon to follow internal import failure makes some sense. I elected to make this a warning rather than an exception because I'm not 100% sure that it's **always** wrong. (e.g. weird `PYTHONPATH` or `importlib` corner cases.)

EDIT: There are now separate cases for `cwd` vs. `PYTHONPATH`, and failure is an `ImportError`.
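
A minimal sketch of the kind of check involved; the helper name below is illustrative, not the merged code:

```python
import os
import textwrap

def _check_not_importing_from_source_root():
    # If the current directory contains torch/__init__.py, `import torch`
    # resolves to the source tree instead of the installed package.
    if os.path.exists(os.path.join(os.getcwd(), "torch", "__init__.py")):
        raise ImportError(textwrap.dedent(f"""\
            You appear to be importing PyTorch from a clone of the git repo:
              {os.getcwd()}
            This will prevent `import torch` from resolving to the PyTorch
            install and will generally lead to other failures such as a
            failure to load C extensions."""))
```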

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39995

Reviewed By: malfet

Differential Revision: D23817209

Pulled By: robieta

fbshipit-source-id: d9ac567acb22d9c8c567a8565a7af65ac624dbf7
2020-09-23 10:55:08 -07:00
9db3871288 Update true_divide_out to use at::. (#45079)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45079

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23821701

Pulled By: ailzhang

fbshipit-source-id: 562eac10faba7a503eda0029a0b026c1fb85fe1e
2020-09-23 10:50:48 -07:00
9e30a76697 Filter strtod_l is undeclared errors from sccache log (#45183)
Summary:
This prevents DrCI from misidentifying test failures for the compilation failures, such as:
```
/var/lib/jenkins/workspace/build/CMakeFiles/CMakeTmp/CheckSymbolExists.c:8:19: error: use of undeclared identifier \'strtod_l\'
  return ((int*)(&strtod_l))[argc];
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45183

Reviewed By: ezyang

Differential Revision: D23859267

Pulled By: malfet

fbshipit-source-id: 283d9bd2ab712f23239b72f3758d121e2d026fb0
2020-09-23 09:49:49 -07:00
5b20bf4fd9 Added support for complex input for Cholesky decomposition (#44895)
Summary:
Cholesky decomposition now works for complex inputs.

Fixes https://github.com/pytorch/pytorch/issues/44637.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44895

Reviewed By: ailzhang

Differential Revision: D23841583

Pulled By: anjali411

fbshipit-source-id: 3b1f34a7af17827884540696f8771a0d5b1df478
2020-09-23 08:25:56 -07:00
94c3cdd994 Let rpc._all_gather use default RPC timeout (#44983)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44983

`_all_gather` was converted from `_wait_all_workers` and inherited its
fixed 5-second timeout. As `_all_gather` is meant to support a broader
set of use cases, the timeout configuration should be more flexible.
This PR makes `rpc._all_gather` use the global default RPC timeout.

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23794383

Pulled By: mrshenli

fbshipit-source-id: 382f52c375f0f25c032c5abfc910f72baf4c5ad9
2020-09-23 08:06:09 -07:00
e5bade7b2c [PyTorch Mobile] Move string op registrations to prim and make them selective (#44960)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44960

Since we have templated selective build, it should be safe to move the operators to prim so that they can be selectively built on mobile.

Test Plan: CI

Reviewed By: linbinyu

Differential Revision: D23772025

fbshipit-source-id: 52cebae76e4df5a6b2b51f2cd82f06f75e2e45d0
2020-09-23 07:42:35 -07:00
76dc50e9c8 [RPC] Infer backend type if only options are given (#45065)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45065

To preserve backwards compatibility with applications that were passing in some ProcessGroupRpcBackendOptions but were not explicitly setting backend=BackendType.PROCESS_GROUP, we now infer the backend type from the options if only the latter are passed. If neither is passed, we default to TensorPipe, as before this change.
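
A hedged sketch of the now-supported call pattern (worker name and thread count are placeholders, and a rendezvous via MASTER_ADDR/MASTER_PORT is assumed):

```python
import torch.distributed.rpc as rpc

# The backend type is inferred as PROCESS_GROUP from the options alone;
# no explicit backend=BackendType.PROCESS_GROUP is required anymore.
opts = rpc.ProcessGroupRpcBackendOptions(num_send_recv_threads=8)
rpc.init_rpc("worker0", rank=0, world_size=1, rpc_backend_options=opts)
rpc.shutdown()
```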
ghstack-source-id: 112586258

Test Plan: Added new unit tests.

Reviewed By: pritamdamania87

Differential Revision: D23814289

fbshipit-source-id: f4be7919e0817a4f539a50ab12216dc3178cb752
2020-09-23 00:46:27 -07:00
215679573e [TensorExpr] Fix operator order in combineMultilane (#45157)
Summary:
combineMultilane used the wrong order when the ramp was on the left-hand side,
which matters for subtraction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45157

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.SimplifyRampSubBroadcast

Reviewed By: ailzhang

Differential Revision: D23851751

Pulled By: asuhan

fbshipit-source-id: 864d1611e88769fb43327ef226bb3310017bf858
2020-09-22 23:50:47 -07:00
7fba30c2be [quant][fx][bug] Fix error in convert step for QAT (#45050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45050

Update tests to actually test for QAT

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_linear

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23808022

fbshipit-source-id: d749ab2d215fe19238ff9d539307ffce9ef0ca9b
2020-09-22 22:48:31 -07:00
144dacd8d9 CUDA BFloat16 batched gemm (#45167)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45167

Reviewed By: mruberry

Differential Revision: D23860458

Pulled By: ngimel

fbshipit-source-id: 698de424a046963a30017b58d227fa510f85bf3f
2020-09-22 22:43:52 -07:00
989d877c95 [JIT] Do not allow creating generics with None types (#44958)
Summary:
Otherwise, invoking something like  `python -c "import torch._C;print(torch._C.ListType(None))"` will result in SIGSEGV

Discovered while trying to create a torch script for a function with the following type annotation: `Tuple[int, Ellipsis] -> None`
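
With the fix, the same call is assumed to raise a normal Python error instead of crashing:

```python
import torch

try:
    torch._C.ListType(None)    # previously a SIGSEGV
except Exception as e:         # now expected to raise instead
    print(type(e).__name__, e)
```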

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44958

Reviewed By: suo

Differential Revision: D23799906

Pulled By: malfet

fbshipit-source-id: 916a243007d13ed3e7a5b282dd712da3d66e3bf7
2020-09-22 21:50:40 -07:00
0a9ac98bed [reland][pytorch] refine dispatch keys in native_functions.yaml (1/N) (#45137)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45137

Reland https://github.com/pytorch/pytorch/pull/45010 - which broke
master due to merge conflict.

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D23843510

Pulled By: ljk53

fbshipit-source-id: 28aabb9da533b6b806ab8779a0ee96b695e9e242
2020-09-22 21:44:55 -07:00
25ed739ac9 [packaging] rstrip fix (#45166)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45166

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D23852505

Pulled By: zdevito

fbshipit-source-id: 6bb743b37333ae19fc24629686e8d06aef812c50
2020-09-22 21:23:47 -07:00
cb75addee4 torch.package - a way to package models and code (#45015)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45015

torch.package allows you to write packages of code, pickled python data, and
arbitrary binary and text resources into a self-contained package.

torch.package.PackageExporter writes the packages and
torch.package.PackageImporter reads them.

The importers can load this code in a hermetic way, such that code is loaded
from the package rather than the normal python import system. This allows
for the packaging of PyTorch model code and data so that it can be run
on a server or used in the future for transfer learning.

The code contained in packages is copied file-by-file from the original
source when it is created, and the file format is a specially organized
zip file. Future users of the package can unzip the package, and edit the code
in order to perform custom modifications to it.

The importer for packages ensures that code in the module can only be loaded from
within the package, except for modules explicitly listed as external using `extern_module`.
The file `extern_modules` in the zip archive lists all the modules that a package externally depends on.
This prevents "implicit" dependencies where the package runs locally because it is importing
a locally-installed package, but then fails when the package is copied to another machine.
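
A hedged usage sketch based on the description above; the package and resource names are placeholders, and real models may need additional extern/mock rules for their dependencies:

```python
import torch
from torch.package import PackageExporter, PackageImporter

my_model = torch.nn.Linear(2, 2)   # placeholder model

# Write code, pickled data, and resources into one self-contained zip.
with PackageExporter("model_package.zip") as exporter:
    exporter.extern_module("numpy")   # explicitly an external dependency
    exporter.save_pickle("my_pkg", "model.pkl", my_model)

# Load hermetically: module code resolves from inside the package.
importer = PackageImporter("model_package.zip")
loaded = importer.load_pickle("my_pkg", "model.pkl")
```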

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D23824337

Pulled By: zdevito

fbshipit-source-id: 1247c34ba9b656f9db68a83e31f2a0fbe3bea6bd
2020-09-22 21:21:21 -07:00
d4a634c209 [RPC profiling] Don't wrap toHere() calls with profiling (#44655)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44655

Since `toHere()` does not execute operations over RPC and simply
transfers the value to the local node, we don't need to enable the profiler
remotely for this message; doing so only adds unnecessary overhead.

Since `toHere` is a blocking call, we already profile the call on the local node using `RECORD_USER_SCOPE`, so this does not change the expected profiler results (validated by ensuring all remote profiling tests pass).
ghstack-source-id: 112605610

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23641466

fbshipit-source-id: 109d9eb10bd7fe76122b2026aaf1c7893ad10588
2020-09-22 21:17:00 -07:00
70d2e4d1f6 [RPC profiling] Allow disableProfiler() to be called from another thread. (#44653)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44653

This changes the profiler, per an offline discussion with ilia-cher, so that the `disableProfiler()` event consolidation logic can be called from different threads (i.e. threads where the profiler was not explicitly enabled). This is needed to support the functionality enabled by D23638387, where we defer profiling event collection until executing an async callback that can execute on a different thread, to support RPC async function profiling.

This is done by introducing 2 flags, `cleanupTLSState` and `consolidate`, which control whether we should clean up thread-local settings (we don't do this when calling `disableProfiler()` on non-main threads) and whether we should consolidate all profiled events. Backwards compatibility is ensured since both options are true by default.

Added a test in `test_misc.cpp` to test this.
ghstack-source-id: 112605620

Reviewed By: mrshenli

Differential Revision: D23638499

fbshipit-source-id: f5bbb0d41ef883c5e5870bc27e086b8b8908f46b
2020-09-22 21:16:58 -07:00
1bd6533d60 Remove thread_local RecordFunctionGuard from profiler. (#44646)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44646

Per a discussion with ilia-cher, this is not needed anymore and
removing it would make some future changes to support async RPC profiling
easier. Tested by ensuring profiling tests in `test_autograd.py` still pass.
ghstack-source-id: 112605618

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23683998

fbshipit-source-id: 4e49a439509884fe04d922553890ae353e3331ab
2020-09-22 21:15:31 -07:00
67a19fecef CUDA BFloat16 pooling (#45151)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45151

Reviewed By: ailzhang

Differential Revision: D23854056

Pulled By: ngimel

fbshipit-source-id: 32f0835218c2602a09654a9ac2d161c4eb360f90
2020-09-22 20:19:25 -07:00
666223df46 [jit] gtestify test_argument_spec.cpp (#45019)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45019

See https://github.com/pytorch/pytorch/pull/45018 for context.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23802298

Pulled By: suo

fbshipit-source-id: 0e36d095d4d81dcd5ebe6d56b3dc469d6d5482d0
2020-09-22 19:44:14 -07:00
f575df201f [quant][graphmode][jit][api] Expose preserved_attrs from finalize to convert_jit (#44490)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44490

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23631142

fbshipit-source-id: f0913f0cb4576067e2a7288326024942d12e0ae0
2020-09-22 19:37:25 -07:00
e045119956 [JIT] Add default arguments for class types (#45098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45098

**Summary**
This commit adds support for default arguments in methods of class
types. Similar to how default arguments are supported for regular
script functions and methods on scripted modules, default values are
retrieved from the definition of a TorchScript class in Python as Python
objects, converted to IValues, and then attached to the schemas of
already compiled class methods.

**Test Plan**
This commit adds a set of new tests to TestClassType to test default
arguments.

**Fixes**
This commit fixes #42562.
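
A hedged illustration of the feature (the class and values are made up):

```python
import torch

@torch.jit.script
class Accumulator(object):
    def __init__(self, start: int = 0):
        self.total = start

    def add(self, amount: int = 1) -> int:
        self.total += amount
        return self.total

acc = Accumulator()   # default start=0 taken from the Python definition
print(acc.add())      # default amount=1 -> 1
print(acc.add(5))     # -> 6
```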

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23844769

Pulled By: SplitInfinity

fbshipit-source-id: ceedff7703bf9ede8bd07b3abcb44a0f654936bd
2020-09-22 18:37:44 -07:00
ebde5a80bb [tensorexpr] Add flag to fuse with unknown shapes (#44401)
Summary:
This flag simply allows users to get fusion groups that will *eventually* have shapes (such that `getOperation` is valid).

This is useful for doing early analysis and compiling just in time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44401

Reviewed By: ZolotukhinM

Differential Revision: D23656140

Pulled By: bwasti

fbshipit-source-id: 9a26c202752399d1932ad7d69f21c88081ffc1e5
2020-09-22 18:17:47 -07:00
c0267c6845 [caffe2] Support data types in shape hints (#45110)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45110

A recent change in DSNN quantizes the ad embedding to 8 bits. Ad embeddings are part of the inputs to the DSNN merge net. To correctly pass shape hints of input tensors including quantized ad embeddings, we need to be able to annotate the data types in shape hints.

A bit on the corner cases: if the type is omitted or is not a valid type (e.g., whitespace), I decided to return the default type, float, instead of throwing an exception.

Test Plan:
```
buck test caffe2/caffe2/fb/opt:shape_info_utils_test
```

Reviewed By: yinghai

Differential Revision: D23834091

fbshipit-source-id: 5e072144a7a7ff4b5126b618062dfc4041851dd3
2020-09-22 17:49:33 -07:00
b98ac20849 install ATen/native/cuda and hip headers (#45097)
Summary:
The ATen/native/cuda headers were copied to torch/include, but then not included in the final package. Further, this adds the ATen/native/hip headers to the installation as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45097

Reviewed By: mruberry

Differential Revision: D23831006

Pulled By: malfet

fbshipit-source-id: ab527928185faaa912fd8cab208733a9b11a097b
2020-09-22 17:43:47 -07:00
2a37f3fd2f Relax CUDA architecture check (#45130)
Summary:
NVIDIA GPUs are binary compatible within a major compute capability revision.

This would prevent: "GeForce RTX 3080 with CUDA capability sm_86 is not compatible with the current PyTorch installation." messages from appearing, since CUDA 11 does not support code generation for sm_86.
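
Illustrative pseudologic for the relaxed check (not the actual C++ implementation):

```python
def arch_is_usable(device_capability, compiled_arches):
    # Binaries are compatible within a major compute capability: code
    # built for sm_80 runs on an sm_86 device such as the RTX 3080.
    dev_major, dev_minor = device_capability
    return any(major == dev_major and minor <= dev_minor
               for major, minor in compiled_arches)

print(arch_is_usable((8, 6), [(3, 7), (6, 0), (7, 0), (8, 0)]))  # True
```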

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45130

Reviewed By: ngimel

Differential Revision: D23841556

Pulled By: malfet

fbshipit-source-id: bcfc9e8da63dfe62cdec06909b6c049aaed6a18a
2020-09-22 17:26:47 -07:00
ccfbfe5eb5 [quant][graphmode][fx] Custom module support (#44766)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44766

There might be modules that are not symbolically traceable, e.g. LSTM (since it has
input-dependent control flow). To support quantization in these cases, the user provides
the corresponding observed and quantized versions of the custom module: the observed
custom module has observers already inserted, and the quantized version has the
corresponding ops quantized. Use
```
from torch.quantization import register_observed_custom_module_mapping
from torch.quantization import register_quantized_custom_module_mapping
register_observed_custom_module_mapping(CustomModule, ObservedCustomModule)
register_quantized_custom_module_mapping(CustomModule, QuantizedCustomModule)
```
to register the custom module mappings, we'll also need to define a custom delegate class
for symbolic trace in order to prevent the custom module from being traced:
```python
class CustomDelegate(DefaultDelegate):
      def is_leaf_module(self, m):
          return (m.__module__.startswith('torch.nn') and
                    not isinstance(m, torch.nn.Sequential)) or \
                    isinstance(m, CustomModule)
m = symbolic_trace(original_m, delegate_class=CustomDelegate)
```

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23723455

fbshipit-source-id: 50d666e29b94cbcbea5fb6bcc73b00cff87eb77a
2020-09-22 17:11:46 -07:00
7f4a27be3a [resubmit][FX] s/get_param/get_attr/ (#45147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45147

ghstack-source-id: 112605923

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23845096

fbshipit-source-id: 9ca209aa84cbaddd6e89c52b541e43b11197e2d5
2020-09-22 17:06:18 -07:00
35cdb01327 [PyTorch] Enable type check for autocast_test_lists (#45107)
Summary:
This is a sub-task of https://github.com/pytorch/pytorch/issues/42969. We re-enable the type check for `autocast_test_lists`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45107

Test Plan:
`python test/test_type_hints.py` passed:
```
(pytorch) bash-5.0$ with-proxy python test/test_type_hints.py
....
----------------------------------------------------------------------
Ran 4 tests in 103.871s

OK
```

Reviewed By: walterddr

Differential Revision: D23842884

Pulled By: Hangjun

fbshipit-source-id: a39f3810e3abebc6b4c1cb996b06312f6d42ffd6
2020-09-22 16:54:26 -07:00
cddcfde81d [JIT] Fix WithTest.test_with_exceptions (#45106)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45106

**Summary**
This commit fixes `WithTest.test_with_exceptions`. It's been running
in regular Python this whole time; none of the functions created and
invoked for the test were scripted. Fortunately, the tests still pass
after being fixed.

**Test Plan**
Ran unit tests + continuous integration.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23848206

Pulled By: SplitInfinity

fbshipit-source-id: fd975ee34db9441ef4e4a4abf2fb21298166bbaa
2020-09-22 16:31:17 -07:00
d1c68a7069 Clarify that 5-D 'bilinear' grid_sample is actually trilinear (#45090)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41528
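
A small example of the clarified behavior:

```python
import torch
import torch.nn.functional as F

inp = torch.randn(1, 1, 4, 4, 4)           # 5-D input: N, C, D, H, W
grid = torch.rand(1, 2, 2, 2, 3) * 2 - 1   # normalized coords in [-1, 1]

# For 5-D inputs, mode='bilinear' actually performs trilinear interpolation.
out = F.grid_sample(inp, grid, mode='bilinear', align_corners=False)
print(out.shape)  # torch.Size([1, 1, 2, 2, 2])
```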

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45090

Reviewed By: ailzhang

Differential Revision: D23841046

Pulled By: zou3519

fbshipit-source-id: 941770cd5b3e705608957739026e9113e5f0c616
2020-09-22 15:10:22 -07:00
79fe794f87 [FX] Make Graphs immutable and make GraphModule recompile after assigning graph (#44830)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44830

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23743850

Pulled By: jamesr66a

fbshipit-source-id: 501b92a89ff636c26abeff13105a75462384554c
2020-09-22 15:02:11 -07:00
def433bbb6 .circleci: Upgrade all xcode 9 workers to xcode 11 (#45153)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45153

Xcode 9 is being deprecated within the CircleCI infra, so we should get
everything else on a more recent version of Xcode.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D23852774

Pulled By: seemethere

fbshipit-source-id: c02e162f1993d408de439fee21b340e9640e5a24
2020-09-22 14:57:43 -07:00
a4ce3f4194 Fix type hint warnings for common_methods_invocations.py (#44971)
Summary:
Fixes a subtask of https://github.com/pytorch/pytorch/issues/42969

Tested the following and no warnings were seen.

python test/test_type_hints.py
....
----------------------------------------------------------------------
Ran 4 tests in 180.759s

OK

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44971

Reviewed By: walterddr

Differential Revision: D23822274

Pulled By: visweshfb

fbshipit-source-id: e3485021e348ee0a8508a9d128f04bad721795ef
2020-09-22 13:40:46 -07:00
c253b10154 Fix incorrect EnumValue serialization issue (#44891)
Summary:
Previously, `prim::EnumValue` was serialized to `ops.prim.EnumValue`, which doesn't have the right implementation to refine the return type. This diff correctly serializes it to `enum.value`, thus fixing the issue.

Fixes https://github.com/pytorch/pytorch/issues/44892

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44891

Reviewed By: malfet

Differential Revision: D23818962

Pulled By: gmagogsfm

fbshipit-source-id: 6edfdf9c4b932176b08abc69284a916cab10081b
2020-09-22 11:59:45 -07:00
2b1f25885e [quant] Fix ConvTranspose mapping (#44844)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44844

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23746466

Pulled By: z-a-f

fbshipit-source-id: cb84e0fef5ab82e8ed8dd118d9fb21ee7b480ef7
2020-09-22 11:59:42 -07:00
09aee06e82 [caffe2] Replace embedding conversion ops with fbgemm functions (#44843)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44843

Replace perfkernels calls with fbgemm kernels to avoid code duplication
ghstack-source-id: 112496292

Test Plan: CI

Reviewed By: radkris-git

Differential Revision: D23675519

fbshipit-source-id: 05c285a9eeb9ea109a04a78cb442a24ee40a4aec
2020-09-22 11:57:01 -07:00
e2b40ce793 Support BFloat16 for binary logical operators on CUDA (#42485)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42485

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23684423

Pulled By: mruberry

fbshipit-source-id: edc2b46b726361d4c8bf8a4bf4e4a09197b20428
2020-09-22 11:42:34 -07:00
ef885c10d8 [pytorch] Add triplet margin loss with custom distance (#43680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43680

As discussed [here](https://github.com/pytorch/pytorch/issues/43342),
this adds a Python-only implementation of the triplet margin loss that takes a
custom distance function. Still discussing whether this is necessary to add to
PyTorch Core.
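
A hedged usage sketch; the functional name and keyword below follow the API discussed in the linked issue and may differ from what finally lands:

```python
import torch
import torch.nn.functional as F

anchor, positive, negative = torch.randn(3, 32, 128).unbind(0)

# Custom distance: cosine distance instead of the default pairwise L2.
loss = F.triplet_margin_with_distance_loss(
    anchor, positive, negative,
    distance_function=lambda a, b: 1.0 - F.cosine_similarity(a, b),
    margin=0.5)
print(loss)
```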

Test Plan:
python test/run_tests.py

Imported from OSS

Reviewed By: albanD

Differential Revision: D23363898

fbshipit-source-id: 1cafc05abecdbe7812b41deaa1e50ea11239d0cb
2020-09-22 11:35:52 -07:00
10f287539f Align casing in test_dispatch with dispatch keys. (#44933)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44933

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23778247

Pulled By: ailzhang

fbshipit-source-id: bc3725eae670b03543015afe763cb3bb16baf8f6
2020-09-22 10:50:08 -07:00
1fd48a9d1f Revert D23798016: [FX] s/get_param/get_attr/
Test Plan: revert-hammer

Differential Revision:
D23798016 (c941dd3492)

Original commit changeset: 1d2f3db1994a

fbshipit-source-id: 974d930064b37d396c5d66c905a63d45449813e5
2020-09-22 10:32:51 -07:00
8501b89a87 [ONNX] Update ort release (#45095)
Summary:
Update ort release

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45095

Reviewed By: bwasti

Differential Revision: D23832041

Pulled By: malfet

fbshipit-source-id: 39c47a87e451c4c43ba4d4e8be385cc195cc611a
2020-09-22 10:08:48 -07:00
4b42f0b613 Support Math keyword in native_functions.yaml. (#44556)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44556

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D23698386

Pulled By: ailzhang

fbshipit-source-id: f10ea839a2cfe7d16f5823a75b8b8c5f1ae22dde
2020-09-22 10:00:40 -07:00
ae286d81e0 [JIT] improve alias analysis for list constructs (#39111)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39111

In our present alias analysis, we consider any Value that enters another container as entering the heap, and thus aliasing all other heap values of the same type. There are a number of advantages to this approach:
- it is not too hard to maintain the aliasDb implementation
- it is much easier from an op schema perspective - there are many composite list ops registered internally and externally that would be tricky to register and get right if we did something more complicated
- It limits the size of the AliasDb, because a container of size 10 only contains a single memory dag element instead of 10 elements.

The downside is that we are unable to handle the simple and extremely common case of a list of tensors being used in an ATen op.

In an example like:

```
def foo(input):
    x = torch.tensor([1, 2, 3, 4])
    y = [x, x]
    input.add_(1)
    return torch.cat(y)
```

we will consider x to be written to. Any write to any wildcard element (an element that enters a tuple, an element that is taken from a list) will mark x as written to. This limits our ability to create a functional subset and fuse graphs; as a result, 4 of the TorchVision classification models could not be functionalized.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D23828003

Pulled By: eellison

fbshipit-source-id: 9109fcb6f2ca20ca897cae71683530285da9d537
2020-09-22 09:38:59 -07:00
9fc7a942f0 Change from self to self.__class__() in _DecoratorManager to ensure a new object is created every time a function is called recursively (#44633)
Summary:
Change from self to self.__class__() in _DecoratorManager to ensure a new object is created every time a function is called recursively

Fixes https://github.com/pytorch/pytorch/issues/44531
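
An illustrative repro of the class of failure being fixed, using `torch.no_grad()`, which is built on this decorator machinery (the exact symptom here assumes the shared-state clobbering described in the issue):

```python
import torch

@torch.no_grad()
def countdown(n: int) -> int:
    # Each recursive call re-enters the same decorator; with one shared
    # context-manager object, the saved previous grad-mode state could be
    # clobbered, leaving grad mode disabled after the outermost return.
    if n == 0:
        return 0
    return countdown(n - 1)

countdown(3)
print(torch.is_grad_enabled())  # True with the fix
```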

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44633

Reviewed By: agolynski

Differential Revision: D23783601

Pulled By: albanD

fbshipit-source-id: a818664dee7bdb061a40ede27ef99e9546fc80bb
2020-09-22 09:13:39 -07:00
63fd257879 Add Ellipsis constant to the list of recognized tokens (#44959)
Summary:
Per https://docs.python.org/3.6/library/constants.html
> `Ellipsis` is the same as ellipsis literal `...`
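
A hedged sketch of what recognizing the token permits in scripted code (the indexing use below is an illustrative assumption):

```python
import torch

@torch.jit.script
def first_channel(x: torch.Tensor) -> torch.Tensor:
    # `Ellipsis` now parses the same as the literal `...`
    return x[Ellipsis, 0]

print(first_channel(torch.randn(2, 3, 4)).shape)  # torch.Size([2, 3])
```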

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44959

Reviewed By: suo

Differential Revision: D23785660

Pulled By: malfet

fbshipit-source-id: f68461849e7d16ef68042eb96566f2c936c06b0f
2020-09-22 09:05:25 -07:00
e155fbe915 add warning when ParameterList/Dict is used with DataParallel (#44405)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44405

Test Plan: Imported from OSS

Reviewed By: agolynski

Differential Revision: D23783987

Pulled By: albanD

fbshipit-source-id: 5018b0d381cb09301d2f88a98a910854f740ace1
2020-09-22 08:58:00 -07:00
4a0aa69a66 Fix undefined variable 'namedshape' in tensor.py (#45085)
Summary:
Hot Fix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45085

Reviewed By: malfet, seemethere

Differential Revision: D23824444

Pulled By: walterddr

fbshipit-source-id: c9f37b394d281b7ef44b14c30699bb7510a362a7
2020-09-22 08:52:47 -07:00
36ec8f8fb8 [dper3] Create dper LearningRate low-level module (#44639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44639

As title; this will unblock migration of several modules that need learning rate functionality.

Test Plan:
```
buck test //dper3/dper3/modules/low_level_modules/tests:learning_rate_test
```

Reviewed By: yf225

Differential Revision: D23681733

fbshipit-source-id: 1d98cb35bf6a4ff0718c9cb6abf22401980b523c
2020-09-22 08:26:07 -07:00
58b6ab69e5 torch.sgn for complex tensors (#39955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39955

resolves https://github.com/pytorch/pytorch/issues/36323 by adding `torch.sgn` for complex tensors.
`torch.sgn` returns `x/abs(x)` for `x != 0` and `0 + 0j` for `x == 0`.
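
A quick numerical check of that definition:

```python
import torch

x = torch.tensor([3 + 4j, 0 + 0j])
print(torch.sgn(x))  # tensor([0.6000+0.8000j, 0.0000+0.0000j])
```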

This PR doesn't test the correctness of the gradients. That will be done as part of auditing all the ops in the future, once we decide on the autograd behavior (JAX vs TF) and add gradcheck.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23460526

Pulled By: anjali411

fbshipit-source-id: 70fc4e14e4d66196e27cf188e0422a335fc42f92
2020-09-22 08:24:53 -07:00
1b059f2c6d Directly use work.result() to retrieve tensor rather than passing as a separate argument (#44914)
Summary:
We currently fetch an allreduced tensor from Python in C++, storing the resulting tensor in a struct's member. This PR removes the extra tensor argument from the function signature and instead fetches the result from a single place via `work.result()`.

Fixes https://github.com/pytorch/pytorch/issues/43960

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44914

Reviewed By: rohan-varma

Differential Revision: D23798888

Pulled By: bugra

fbshipit-source-id: ad1b8c31c15e3758a57b17218bbb9dc1f61f1577
2020-09-22 06:28:47 -07:00
71aeb84ab4 Revert D23803951: [pytorch] refine dispatch keys in native_functions.yaml (1/N)
Test Plan: revert-hammer

Differential Revision:
D23803951 (339961187a)

Original commit changeset: aaced7c34427

fbshipit-source-id: fcc4fb6a2c1d79b587f62347b43f8851fe1647fd
2020-09-22 05:41:59 -07:00
339961187a [pytorch] refine dispatch keys in native_functions.yaml (1/N) (#45010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45010

The motivation of this change is to differentiate "backend specific" ops
and "generic" ops.

"backend specific" ops are those invoking backend specific kernels thus
only able to run on certain backends, e.g.: CPU, CUDA.

"generic" ops are those not *directly* invoking backend specific kernels.
They are usually calling other "backend specific" ops to get things
done. Thus, they are also referred to as "composite" ops, or "math" ops
(because they are usually pure C++ code constructed from math formula).

The other way to see the difference is that: we have to implement new
kernels for the "backend specific" ops if we want to run these ops on a
new backend. In contrast, "generic"/"composite" ops can run on the new
backend if we've added support for all the "backend specific" ops to
which they delegate their work.

Historically we didn't make a deliberate effort to always populate
supported backends to the "dispatch" section for all the "backend specific"
ops in native_functions.yaml. So now there are many ops which don't have
"dispatch" section but are actually "backend specific" ops. Majority
of them are calling "DispatchStub" kernels, which usually only support
CPU/CUDA (via TensorIterator) or QuantizedCPU/CUDA.

The ultimate goal is to be able to differentiate these two types of ops
by looking at the "dispatch" section in native_functions.yaml.

This PR leveraged the analysis script on #44963 to populate missing
dispatch keys for a set of "backend specific" ops. As the initial step,
we only deal with the simplest case:
* These ops don't already have dispatch section in native_functions.yaml;
* These ops call one or more DispatchStub (thus "backend specific");
* These ops don't call any other aten ops - except for some common
  ones almost every op calls via framework, e.g. calling aten::eq via
  Dispatcher::checkSchemaCompatibility. Calling other nontrivial aten
  ops is a sign of being "composite", so we don't want to deal with this
  case now;
* These ops don't call Tensor::is_quantized() / Tensor::is_sparse() / etc.
  Some ops call these Tensor::is_XXX() methods to dispatch to quantized /
  sparse kernels internally. We don't deal with this case now.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23803951

Pulled By: ljk53

fbshipit-source-id: aaced7c34427d1ede72380af4513508df366ea16
2020-09-22 03:20:01 -07:00
c947ab0bb9 Added sparse support for asin and neg functions, updated log1p (#44028)
Summary:
Description:

- [x] added C++ code for sparse `asin` and `neg` ops similarly to `log1p` op
- [x] added tests
  - [x] coalesced input CPU/CUDA
  - [x] uncoalesced input CPU/CUDA
- [x] added tests for `negative`  and `arcsin`

Backprop will be addressed in another PR.
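
A short example of the added coverage:

```python
import torch

indices = torch.tensor([[0, 2]])
values = torch.tensor([0.5, -0.5])
s = torch.sparse_coo_tensor(indices, values, (4,))

print(torch.asin(s).to_dense())  # asin applied elementwise to the values
print(torch.neg(s).to_dense())
```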

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44028

Reviewed By: agolynski

Differential Revision: D23793027

Pulled By: mruberry

fbshipit-source-id: 5fd642808da8e528cf6acd608ca0dcd720c4ccc3
2020-09-22 02:04:38 -07:00
d126a0d4fd [iOS] Disable the iOS nightly build until the cert issue has resolved (#45094)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45094

Test Plan: Imported from OSS

Reviewed By: husthyc

Differential Revision: D23831152

Pulled By: xta0

fbshipit-source-id: 6327edba01e4d5abad63ac35680eefb22276423f
2020-09-22 01:47:41 -07:00
5aed75b21b [quant][graphmode][jit] Try to support append (#44641)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44641

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23682356

fbshipit-source-id: 09a03dfde0b1346a5764e8e28ba56e32b343d239
2020-09-21 23:13:56 -07:00
2111ec3bf3 CUDA BFloat16 losses (#45011)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45011

Reviewed By: mruberry

Differential Revision: D23805840

Pulled By: ngimel

fbshipit-source-id: 3eb60d4367c727100763879e20e9df9d58bf5ad6
2020-09-21 22:51:17 -07:00
32c1a8c79f adjust shape inference in sls tests (#44936)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44936

We need to provide the max sequence size and max element size instead of the total.
Also added a check that onnxifi was successful.

Test Plan: sls tests

Reviewed By: yinghai

Differential Revision: D23779437

fbshipit-source-id: 5048d6536ca00f0a3b0b057c4e2cf6584b1329d6
2020-09-21 22:09:55 -07:00
0dda65ac77 [ONNX] add jit pass for lists (#43820)
Summary:
Add jit preprocessing pass for adding int lists.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43820

Reviewed By: albanD

Differential Revision: D23674598

Pulled By: bzinodev

fbshipit-source-id: 35766403a073e202563bba5251c07efb7cc5cfb1
2020-09-21 22:05:25 -07:00
09e7f62ce2 Fix RPC and ProcessGroup GIL deadlock (#45088)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45088

Fixes #45082

Found a few problems while working on #44983

1. We deliberately swallow RPC timeouts during shutdown, as we haven't
found a good way to handle those. When we converted `_wait_all_workers`
into `_all_gather`, the same logic was inherited. However, as
`_all_gather` is meant to be used in more general scenarios, we should
no longer keep silent about errors. This commit lets errors propagate
from `_all_gather` and lets `shutdown()` catch and log them.
2. After fixing (1), I found that `UnpickledPythonCall` needs to
acquire the GIL on destruction, and this can lead to deadlock when used
in conjunction with `ProcessGroup`, because the `ProcessGroup` ctor is a
synchronization point which holds the GIL. In `init_rpc`, followers
(`rank != 0`) can exit before the leader (`rank == 0`). If the two
happen together, a follower can exit `init_rpc` after running
`_broadcast_to_followers` and before reaching the dtor of
`UnpickledPythonCall`. It then runs the ctor of `ProcessGroup`, which
holds the GIL and waits for the leader to join. However, the leader is
waiting for the response from `_broadcast_to_followers`, which is
blocked by the dtor of `UnpickledPythonCall`. Hence the deadlock. This
commit drops the GIL in the `ProcessGroup` ctor.
3. After fixing (2), I found that the `TensorPipe` backend
nondeterministically fails with `test_local_shutdown`, for a reason
similar to (2), but this time because `shutdown()` on a follower runs
before the leader finishes `init_rpc`. This commit adds a join for the
`TensorPipe` backend `init_rpc` after `_all_gather`.

The third fix should be able to solve the second issue as well. But
since I didn't see a reason to hold the GIL during the `ProcessGroup`
ctor, I made that change too.

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23825592

Pulled By: mrshenli

fbshipit-source-id: 94920f2ad357746a6b8e4ffaa380dd56a7310976
2020-09-21 21:47:27 -07:00
dfc88d4fd0 [vulkan] support dimensions negative indexing (#45068)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45068

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D23816081

Pulled By: IvanKobzarev

fbshipit-source-id: bda753f3f216dac7c05b6f728a3bd6068e5d06a0
2020-09-21 21:24:16 -07:00
5621ba87a2 [vulkan] reshape op to use infer_size to expand -1 (#45104)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45104

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D23834249

Pulled By: IvanKobzarev

fbshipit-source-id: 0e3699d6a4227788d1d634349c0bf259c0ad5e8d
2020-09-21 21:08:59 -07:00
8968030f19 [WIP] Add vec256 test to linux CI (#44912)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44912

This adds the vec256 test to the Linux CI system.
The whole test takes 50 to 70 seconds.

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D23772923

Pulled By: glaringlee

fbshipit-source-id: ef929b53f3ea7894abcd9510a8e0389979cab4a2
2020-09-21 21:00:29 -07:00
4b3046ed28 Vectorize int8_t on CPU (#44759)
Summary:
int8_t is not vectorized in vec256_int.h. This PR adds vectorization for
int8_t. As pointed out in https://github.com/pytorch/pytorch/issues/43033, this is an important type for vectorization because
a lot of images are loaded in this data type.

Related issue: https://github.com/pytorch/pytorch/issues/43033

Benchmark (Debian Buster,  Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz, Turbo off, Release build):

```python
import timeit
dtype = 'torch.int8'
for op in ('+', '-'):
    for n, t in [(10_000, 200000),
                (100_000, 20000)]:
        print(f'a {op} b, numel() == {n} for {t} times, dtype={dtype}')
        print(timeit.timeit(f'c = a {op} b', setup=f'import torch; a = torch.arange(1, {n}, dtype={dtype}); b = torch.arange({n}, 1, -1, dtype={dtype})', number=t))
```

Results:

Before:

```
a + b, numel() == 10000 for 200000 times, dtype=torch.int8
1.2223373489978258
a + b, numel() == 100000 for 20000 times, dtype=torch.int8
0.6108450189931318
a - b, numel() == 10000 for 200000 times, dtype=torch.int8
1.256775538000511
a - b, numel() == 100000 for 20000 times, dtype=torch.int8
0.6101213909860235
```

After:

```
a + b, numel() == 10000 for 200000 times, dtype=torch.int8
0.5713336059998255
a + b, numel() == 100000 for 20000 times, dtype=torch.int8
0.39169703199877404
a - b, numel() == 10000 for 200000 times, dtype=torch.int8
0.5838428330025636
a - b, numel() == 100000 for 20000 times, dtype=torch.int8
0.37486923701362684
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44759

Reviewed By: malfet

Differential Revision: D23786383

Pulled By: glaringlee

fbshipit-source-id: 67f5bcd344c0b5014bacbc876143231fca156713
2020-09-21 19:55:13 -07:00
f77ba0e48c Change typo 'momemtum' to 'momentum' (#45045)
Summary:
As the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45045

Reviewed By: mruberry

Differential Revision: D23808563

Pulled By: mrshenli

fbshipit-source-id: ca818377f4c23d67b037c146fef667ab8731961e
2020-09-21 19:03:26 -07:00
20f52cdd76 [hpc]optimize the torch.cat cuda kernel (#44833)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44833

The current cat CUDA kernel uses pinned memory to pass the tensor data. 1) This is much slower than passing the data as a kernel argument through constant memory; 2) the H2D copy sometimes overlaps with other H2D copies in training, and thus generates random delays and leads to desync issues.

For small N, we actually saw 2X improvements.

Test Plan:
benchmark
```
./buck-out/opt/gen/caffe2/benchmarks/operator_benchmark/pt/cat_test.par --tag_filter all --device cuda
```
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 38.825

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 45.440

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 38.765

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 60.075

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 65.203

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 83.941

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f0d50fc2440>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f0d50fc2440>, 111, 65], N: 5, dim: 0, device: cuda
Forward Execution Time (us) : 51.059

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7f0d50fc2b90>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7f0d50fc2b90>, 64], N: 5, dim: 1, device: cuda
Forward Execution Time (us) : 42.134

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7f0b22b7e3b0>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7f0b22b7e3b0>], N: 5, dim: 2, device: cuda
Forward Execution Time (us) : 78.333

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f0b22b7e5f0>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f0b22b7e5f0>, 32, 64], N: 50, dim: 0, device: cuda
Forward Execution Time (us) : 77.065

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7f0b22b7e680>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7f0b22b7e680>, 64], N: 50, dim: 1, device: cuda
Forward Execution Time (us) : 74.632

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7f0b22b7e710>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7f0b22b7e710>], N: 50, dim: 2, device: cuda
Forward Execution Time (us) : 81.846

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 99.291

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
Forward Execution Time (us) : 114.060

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
Forward Execution Time (us) : 478.777

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f0b22b7e7a0>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f0b22b7e7a0>], N: 100, dim: 0, device: cuda
Forward Execution Time (us) : 80.165

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f0b22b7e830>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f0b22b7e830>], N: 1000, dim: 0, device: cuda
Forward Execution Time (us) : 491.983

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f0b22b7e8c0>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f0b22b7e8c0>], N: 2000, dim: 0, device: cuda
Forward Execution Time (us) : 966.613

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f0b22b7e950>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f0b22b7e950>], N: 3000, dim: 0, device: cuda
Forward Execution Time (us) : 1500.133
```

After optimization
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 22.168

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 33.430

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 19.884

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 48.082

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 53.261

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 71.294

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f837a135200>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f837a135200>, 111, 65], N: 5, dim: 0, device: cuda
Forward Execution Time (us) : 40.165

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7f837a135950>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7f837a135950>, 64], N: 5, dim: 1, device: cuda
Forward Execution Time (us) : 32.666

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7f82e50e2440>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7f82e50e2440>], N: 5, dim: 2, device: cuda
Forward Execution Time (us) : 67.003

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f82e50e24d0>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f82e50e24d0>, 32, 64], N: 50, dim: 0, device: cuda
Forward Execution Time (us) : 67.035

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7f82e50e2560>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7f82e50e2560>, 64], N: 50, dim: 1, device: cuda
Forward Execution Time (us) : 63.803

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7f82e50e25f0>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7f82e50e25f0>], N: 50, dim: 2, device: cuda
Forward Execution Time (us) : 69.969

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 98.327

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
Forward Execution Time (us) : 112.363

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
Forward Execution Time (us) : 478.224

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f82e50e2680>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f82e50e2680>], N: 100, dim: 0, device: cuda
Forward Execution Time (us) : 63.269

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f82e50e2710>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f82e50e2710>], N: 1000, dim: 0, device: cuda
Forward Execution Time (us) : 470.141

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f82e50e27a0>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f82e50e27a0>], N: 2000, dim: 0, device: cuda
Forward Execution Time (us) : 966.668

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f82e50e2830>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f82e50e2830>], N: 3000, dim: 0, device: cuda
Forward Execution Time (us) : 1485.309
```

Reviewed By: ngimel

Differential Revision: D23727275

fbshipit-source-id: 171275ac541c649f7aeab0a2f8f0fea9486d0180
2020-09-21 18:38:25 -07:00
81bb19c9f0 [JIT] Prohibit subscripted assignments for tuple types (#44929)
Summary:
This would force jit.script to raise an error if someone tries to mutate a tuple:
```
Tuple[int, int] does not support subscripted assignment:
  File "/home/nshulga/test/tupleassignment.py", line 9
torch.jit.script
def foo(x: Tuple[int, int]) -> int:
    x[-1] = x[0] + 1
    ~~~~~ <--- HERE
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44929

Reviewed By: suo

Differential Revision: D23777668

Pulled By: malfet

fbshipit-source-id: 8efaa4167354ffb4930ccb3e702736a3209151b6
2020-09-21 16:35:44 -07:00
9a31eee107 [vulkan] Remove duplication of op registration and clean unused vars (#44932)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44932

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D23778203

Pulled By: IvanKobzarev

fbshipit-source-id: d1bc0a5c2cdd711d8a4cd983154a4f6774987674
2020-09-21 15:57:32 -07:00
dfb8f2d51f CUDA BFloat16 addmm, addmv (#44986)
Summary:
This PR was originally authored by slayton58. I steal his implementation and added some tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44986

Reviewed By: mruberry

Differential Revision: D23806039

Pulled By: ngimel

fbshipit-source-id: 305d66029b426d8039fab3c3e011faf2bf87aead
2020-09-21 14:28:27 -07:00
581a364437 CUDA BFloat16 unary ops part 1 (#44813)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44813

Reviewed By: mruberry

Differential Revision: D23805816

Pulled By: ngimel

fbshipit-source-id: 28c645dc31f094c8b6c3d3803f0b4152f0475a64
2020-09-21 14:22:31 -07:00
1cab27d485 Add a torch.hub.load_local() function that can load models from any local directory with a hubconf.py (#44204)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43622

- Moves the model loading part of `torch.hub.load()` into a new `torch.hub.load_local()` function that takes in a path to a local directory that contains a `hubconf.py` instead of a repo name.
- Refactors `torch.hub.load()` so that it now calls `torch.hub.load_local()` after downloading and extracting the repo.
- Updates `torch.hub` docs to include the new function + minor fixes.
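
A hedged usage sketch of the new entry point (the directory, entrypoint name, and kwargs are placeholders):

```python
import torch

# Previously this required a GitHub "owner/repo" string; now any local
# directory containing a hubconf.py can be loaded from directly:
model = torch.hub.load_local('/path/to/local/repo', 'resnet18',
                             pretrained=True)
```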

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44204

Reviewed By: malfet

Differential Revision: D23817429

Pulled By: ailzhang

fbshipit-source-id: 788fd83c87a94f487b558715b2809d346ead02b2
2020-09-21 14:17:21 -07:00
c941dd3492 [FX] s/get_param/get_attr/ (#45000)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45000

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D23798016

Pulled By: jamesr66a

fbshipit-source-id: 1d2f3db1994a62b95d0ced03bf958e54d30c35dd
2020-09-21 14:09:32 -07:00
9dc2bcdc07 Introducing (Const)StridedRandomAccessor + CompositeRandomAccessor + migrate sort to ATen (CPU) (#39744)
Summary:
This PR introduces a (Const)StridedRandomAccessor, a [random access iterator](https://en.cppreference.com/w/cpp/named_req/RandomAccessIterator) over a strided array, and a CompositeRandomAccessor, a random access iterator over two random access iterators.

The main motivation is to be able to use a handful of operations from STL and thrust in numerous dim-apply types of algorithms and eliminate unnecessary buffer allocations. Plus, more advanced algorithms will become available with C++17.

Porting `sort` provides a hands-on example of how these iterators could be used.

Fixes [https://github.com/pytorch/pytorch/issues/24770](https://github.com/pytorch/pytorch/issues/24770).

Some benchmarks:
```python
import torch
from IPython import get_ipython

torch.manual_seed(13)

ipython = get_ipython()

sizes = [
        [10000, 10000],
        [1000, 1000, 100]
        ]
for size in sizes:
    t = torch.randn(*size)
    dims = len(size)

    print(f"Tensor of size {size}")
    for dim in range(dims):
        print(f"sort for dim={dim}")
        print("float:")
        ipython.magic("timeit t.sort(dim)")
    print()

```
#### Master
```
Tensor of size [10000, 10000]
sort for dim=0
float:
10.7 s ± 201 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
6.27 s ± 50.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Tensor of size [1000, 1000, 100]
sort for dim=0
float:
7.21 s ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
6.1 s ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=2
float:
3.58 s ± 27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

```
#### This PR
```
Tensor of size [10000, 10000]
sort for dim=0
float:
10.5 s ± 209 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
6.16 s ± 28.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Tensor of size [1000, 1000, 100]
sort for dim=0
float:
5.94 s ± 60.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
5.1 s ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=2
float:
3.43 s ± 8.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

```
As you can see, the legacy sorting routine is actually quite efficient. The performance gain is likely due to the improved reduction with TensorIterator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39744

Reviewed By: malfet

Differential Revision: D23796486

Pulled By: glaringlee

fbshipit-source-id: 7bddad10dfbc0a0e5cad7ced155d6c7964e8702c
2020-09-21 13:24:58 -07:00
7118d53711 add .cache to gitignore (#45017)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45017

This is the default indexing folder for clangd 11.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23817619

Pulled By: suo

fbshipit-source-id: 6a60136e591b2fec3d432ac5343cb76ac0934502
2020-09-21 12:51:35 -07:00
1a580c1021 Adding test to quantized copy for 'from float' (#43681)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43681

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D23364507

Pulled By: z-a-f

fbshipit-source-id: ef1b00937b012b0647d9b9afa054437f2bce032a
2020-09-21 12:38:59 -07:00
7de512ced8 nightly robustness fixes for linking across devices (#43771)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43761

CC rgommers ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43771

Reviewed By: glaringlee

Differential Revision: D23819835

Pulled By: malfet

fbshipit-source-id: a3be2780c4b8bdbf347d456c4d14df863c2ff8c2
2020-09-21 12:32:32 -07:00
42af2c7923 [jit] gtest-ify test_alias_analysis.cpp (#45018)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45018

Now that https://github.com/pytorch/pytorch/pull/44795 has landed, we
can convert the bulk of our cpp tests to use gtest APIs. Eventually
we'll want to get rid of our weird harness for cpp tests entirely in
favor of using regular gtest everywhere. This PR demonstrates some of
the benefits of this approach:
1. You don't need to register your test twice (once to define it, once
in tests.h).
2. Consequently, it's easier to have many individual test cases.
Failures can be reported independently (rather than having huge
functions to test entire modules).
3. Some nicer testing APIs, notably test fixtures.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23802297

Pulled By: suo

fbshipit-source-id: 774255da7716294ac573747dcd5e106e5fe3ac8f
2020-09-21 12:19:37 -07:00
92f8f75c59 Add alias dispatch key Math. (#44354)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44354

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23591481

Pulled By: ailzhang

fbshipit-source-id: 6e93c4ec99a07f3fc920ba2d09dc222e6ced5adf
2020-09-21 11:10:39 -07:00
acc2a1e5fa Update submodule gloo (#45025)
Summary:
Includes commits that fix the Windows CI failure of the "enable distributed training on Windows" PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45025

Reviewed By: beauby

Differential Revision: D23807995

Pulled By: mrshenli

fbshipit-source-id: a2f4c1684927ca66d7d3e9920ecb588fb4386f7c
2020-09-21 10:28:37 -07:00
a4aba1d465 fix compile error (#45052)
Summary:
Update the vulkanOptimizeForMobile invocation in optimize_for_mobile.cc to align with the latest call contract from PR https://github.com/pytorch/pytorch/pull/44903.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45052

Reviewed By: malfet

Differential Revision: D23814953

Pulled By: mrshenli

fbshipit-source-id: 0fa844a8291e952715b9de35cdec0e411c42b7f9
2020-09-21 10:23:49 -07:00
ac8c7c4e9f Make Channel API accept buffer structs rather than raw pointers. (#45014)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45014

Pull Request resolved: https://github.com/pytorch/tensorpipe/pull/219

Pull Request resolved: https://github.com/pytorch/tensorpipe/pull/212

+ Introduce buffer.h defining the buffer struct(s). The `CpuBuffer`
struct is always defined, while the `CudaBuffer` struct is defined
only when `TENSORPIPE_SUPPORTS_CUDA` is true.
+ Update all channels to take a `CpuBuffer` or `CudaBuffer` for
`send`/`recv` rather than a raw pointer and a length.
+ Make the base `Channel`/`Context` classes templated on `TBuffer`,
effectively creating two channel hierarchies (one for CPU channels,
one for CUDA channels).
+ Update the Pipe and the generic channel tests to use the new API. So
far, generic channel tests are CPU only, and tests for the CUDA IPC
channel are (temporarily) disabled. A subsequent PR will take care of
refactoring tests so that generic tests work for CUDA channels. Another
PR will add support for CUDA tensors in the Pipe.

Differential Revision: D23598033

Test Plan: Imported from OSS

Reviewed By: lw

Pulled By: beauby

fbshipit-source-id: 1d6c3f91e288420858835cd5e7962e8da051b44b
2020-09-21 10:18:45 -07:00
4bbb6adff5 [NNC] fix SyncThreads insertion and reenable CudaSharedMem test (#44909)
Summary:
A previous fix for masking Cuda dimensions (https://github.com/pytorch/pytorch/issues/44733) changed the behaviour of inserting thread synchronization barriers in the Cuda CodeGen, causing the CudaSharedMemReduce_1 test to become flaky and ultimately be disabled.

The issue is working out where these barriers must be inserted. Solving this optimally is very hard, and I think not possible without dependency analysis we don't have, so I've changed our logic to be quite pessimistic. We'll insert barriers before and after any blocks that have thread dimensions masked (even between blocks that have no data dependencies). This should be correct, but it's an area where we could improve performance. To address this somewhat I've added a simplifier pass that removes obviously unnecessary syncThreads.

To avoid this test being flaky again, I've added a check against the generated code to ensure there is a syncThread in the right place.

Also fixed a couple of non-functional clarity issues in the generated code: added the missing newline after Stores in the CudaPrinter, and prevented the PrioritizeLoad mutator from pulling out loads contained within simple Let statements (such as those produced by the Registerizer).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44909

Reviewed By: agolynski

Differential Revision: D23800565

Pulled By: nickgg

fbshipit-source-id: bddef1f40d8d461da965685f01d00b468d8a2c2f
2020-09-21 09:27:22 -07:00
e2f49c8437 skip im2col & vol2col in cpu/cuda convolution methods (#44600)
Summary:
This fixes https://github.com/pytorch/pytorch/issues/44482.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44600

Reviewed By: ngimel

Differential Revision: D23733483

Pulled By: walterddr

fbshipit-source-id: 90e188027ef6bb08588619b6629110b5f73d63e3
2020-09-21 09:20:23 -07:00
a6895d43b6 Turn on gradgrad check for BCELoss Criterion Tests. (#44894)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44894

Looks like we added double backwards support but only turned on the ModuleTests.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23762544

Pulled By: gchanan

fbshipit-source-id: b5cef579608dd71f3de245c4ba92e49216ce8a5e
2020-09-21 07:14:22 -07:00
4810365576 Enabled torch.testing._internal.jit_utils.* typechecking. (#44985)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44985

Reviewed By: malfet

Differential Revision: D23794444

Pulled By: kauterry

fbshipit-source-id: 9893cc91780338a8223904fb574efa77fa3ab2b9
2020-09-21 01:19:06 -07:00
9f67176b82 Complex gradcheck logic (#43208)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43208

This PR adds gradcheck for complex. The logic used for complex gradcheck is described in Section 3.5.3 here: https://arxiv.org/pdf/1701.00392.pdf

More concretely, this PR introduces the following changes:
1. Updates get_numerical_jacobian to take a scalar value for the vector (v) as input. Adds gradcheck logic for C -> C, C -> R, and R -> C functions. For R -> C functions, only the real part of the gradient is propagated.
2. Adds backward definition for `torch.complex` and also adds a test to verify the definition added.
3. Updates backward for `mul`, `sin`, `cos`, `sinh`, `cosh`.
4. Adds tests for `torch.real`, `torch.imag`, `torch.view_as_real`, `torch.view_as_complex`, and `torch.conj`.
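
As an illustration of what this enables (a minimal sketch, not taken from the PR; assumes double-precision complex inputs for numerical stability):

```python
import torch
from torch.autograd import gradcheck

# C -> C case: gradcheck now accepts complex inputs.
x = torch.randn(3, dtype=torch.cdouble, requires_grad=True)
assert gradcheck(torch.sin, (x,))
```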

Follow up tasks:
1. Add more thorough tests for R -> C cases. Specifically, add R -> C test variants for functions, e.g. `torch.mul(complex_tensor, real_tensor)`
2. Add back commented test in `common_methods_invocation.py`.
3. Add more special case checking for complex gradcheck to make debugging easier.
4. Update complex autograd note.
5. disable complex autograd for operators not tested for complex.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23655088

Pulled By: anjali411

fbshipit-source-id: caa75e09864b5f6ead0f988f6368dce64cf15deb
2020-09-20 22:05:04 -07:00
da7863f46b Add one dimensional FFTs to torch.fft namespace (#43011)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43011

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23751850

Pulled By: mruberry

fbshipit-source-id: 8dc5fec75102d8809eeb85a3d347ba1b5de45b33
2020-09-19 23:32:22 -07:00
49db7b59e0 For logical tests, use the dtypes decorator (#42483)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42483

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23684424

Pulled By: mruberry

fbshipit-source-id: ba7ab5c3a6eaa0c16975728200f27d164ed4f852
2020-09-19 19:01:49 -07:00
60709ad1bf Adds multiply and divide aliases (#44463)
Summary:
These aliases are consistent with NumPy. Note that C++'s naming would be different (std::multiplies and std::divides), and that PyTorch's existing names (mul and div) are consistent with Python's dunders.

This also improves the instructions for adding an alias to clarify that dispatch keys should be removed when copying native_functions.yaml entries to create the alias entries.
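
A quick sketch of the new aliases in action:

```python
import torch

a = torch.tensor([6.0])
b = torch.tensor([3.0])
assert torch.equal(torch.multiply(a, b), torch.mul(a, b))  # alias of mul
assert torch.equal(torch.divide(a, b), torch.div(a, b))    # alias of div
```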

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44463

Reviewed By: ngimel

Differential Revision: D23670782

Pulled By: mruberry

fbshipit-source-id: 9f1bdf8ff447abc624ff9e9be7ac600f98340ac4
2020-09-19 15:47:52 -07:00
faef89c89f CUDA BFloat Pooling (#44836)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44836

Reviewed By: mruberry

Differential Revision: D23800992

Pulled By: ngimel

fbshipit-source-id: 2945a27874345197cbd1d8a4fbd20816afc02c86
2020-09-19 15:43:36 -07:00
7ecfaef7ec CUDA BFloat16 layernorm (#45002)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45002

Reviewed By: mruberry

Differential Revision: D23800931

Pulled By: ngimel

fbshipit-source-id: cc213d02352907a3e945cd9fffd1de29e355a16c
2020-09-19 15:36:03 -07:00
2163d31016 histogram observer: ensure buffer shape consistency (#44956)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44956

Makes buffer shapes for HistogramObserver have the
same shapes in uninitialized versus initialized states.

This is useful because the detectron2 checkpointer assumes
that these states will stay the same, so it removes the
need for manual hacks around the shapes changing.

Test Plan:
```
python test/test_quantization.py TestObserver.test_histogram_observer_consistent_buffer_shape
```

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23785382

fbshipit-source-id: 1a83fd4f39b244b00747c368d5d305a07d877c92
2020-09-19 09:29:39 -07:00
0714c003ee [pytorch][tensorexpr] Make gtest-style macros in tests match actual gtest signatures (#44861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44861

We were redefining things like ASSERT_EQ to take a `__VA_ARGS__` parameter, so compiling these files with gtest (instead of pytorch's custom python-based cpp test infra) fails.

Test Plan: buck build //caffe2/test/cpp/tensorexpr

Reviewed By: asuhan

Differential Revision: D23711293

fbshipit-source-id: 8af14fa7c1f1e8169d14bb64515771f7bc3089e5
2020-09-19 07:25:05 -07:00
9e5045e978 [pytorch] clean up normalized_dynamic_type() hack (#44889)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44889

This HACK doesn't seem to be necessary any more: there is no 'real'
type in the generated Declarations.yaml file.
Verified by comparing generated code before/after.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23761624

Pulled By: ljk53

fbshipit-source-id: de996f04d77eebea3fb9297dd90a8ebeb07647bb
2020-09-18 23:49:46 -07:00
d75c402755 Add cusolver to build, rewrite MAGMA inverse with cusolver (#42403)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42265

This PR adds cusolver to the pytorch build, and enables the use of cusolver/cublas library functions on GPU `torch.inverse` on certain tensor shapes.

Specifically, when

* the tensor is two dimensional (single batch), or
* has >2 dimensions (multiple batches) and `batch_size <= 2`, or
* magma is not linked,

cusolver/cublas will be used. In other conditions, the current implementation of MAGMA will still be used.
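
For intuition, here is a hypothetical Python restatement of that dispatch rule (illustrative only; the actual logic is the C++ code linked below):

```python
def inverse_backend(ndim: int, batch_size: int, magma_linked: bool) -> str:
    """Hypothetical sketch of the dispatch rule described above."""
    if ndim == 2 or (ndim > 2 and batch_size <= 2) or not magma_linked:
        return "cusolver/cublas"
    return "magma"
```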

8c0949ae45/aten/src/ATen/native/cuda/BatchLinearAlgebra.cu (L742-L752)

The reason for this is that for tensors with large batch_size, `cublasXgetrfBatched` and `cublasXgetriBatched` don't perform very well. For `batch_size > 1`, we launch cusolver functions in multiple streams. This lets the cusolver functions run in parallel, and can greatly increase performance. When `batch_size > 2`, the parallel-launched cusolver functions are slightly slower than the current magma implementation, so we still use the current magma impl.

On CUDA 9.2, some numerical issues were detected, so the cusolver impl will not be used there. The cusolver impl will also not be used on platforms other than Nvidia CUDA.

060769feaf/aten/src/ATen/native/cuda/BatchLinearAlgebraLib.h (L10-L13)

Note that there is a new heuristic used before cusolver/cublas calls here:

8c0949ae45/aten/src/ATen/native/cuda/MiscUtils.h (L113-L121)

where `use_loop_launch = true` means launching single-batch cusolver functions in parallel, and `use_loop_launch = false` means using cublas_X_batched functions. When magma is enabled (only `batch_size <= 2` is dispatched to cusolver/cublas), the heuristic always returns `true`, and the cusolver calls are faster than small-batch_size magma calls. When magma is disabled, this adds the functionality of `torch.inverse`, which was previously disabled for all shapes (though large-batch_size cublas performance may not be as good as magma's).

Checklist:
- [X] Add benchmark, cpu, gpu-before (magma), gpu-after (cusolver)
- [X] Rewrite single inverse (ndim == 2) with cusolver
- [X] Rewrite batched inverse (ndim > 2) with cublas
- [X] Add cusolver to build
- [x] Clean up functions related to `USE_MAGMA` define guard
- [x] Workaround for non-cuda platform
- [x] Workaround for cuda 9.2
- [x] Add zero size check
- [x] Add tests

Next step:

If cusolver doesn't cause any problem in pytorch build, and there are no major performance regressions reported after this PR being merged, I will start porting other cusolver/cublas functions for linear algebra to improve the performance.

<details>
<summary> benchmark 73499c6 </summary>

benchmark code: https://github.com/xwang233/code-snippet/blob/master/torch.inverse/inverse-cusolver.ipynb

shape meaning:

* `[] 2 torch.float32 -> torch.randn(2, 2, dtype=torch.float32)`
* `[2] 4 torch.float32 -> torch.randn(2, 4, 4, dtype=torch.float32)`

| shape | cpu_time (ms) | gpu_time_before (magma) (ms) | gpu_time_after (ms) |
| --- | --- | --- | --- |
| [] 2 torch.float32 |  0.095 |  7.534 |  0.129  |
| [] 4 torch.float32 |  0.009 |  7.522 |  0.129  |
| [] 8 torch.float32 |  0.011 |  7.647 |  0.138  |
| [] 16 torch.float32 |  0.075 |  7.582 |  0.135  |
| [] 32 torch.float32 |  0.073 |  7.573 |  0.191  |
| [] 64 torch.float32 |  0.134 |  7.694 |  0.288  |
| [] 128 torch.float32 |  0.398 |  8.073 |  0.491  |
| [] 256 torch.float32 |  1.054 |  11.860 |  1.074  |
| [] 512 torch.float32 |  5.218 |  14.130 |  2.582  |
| [] 1024 torch.float32 |  19.010 |  18.780 |  6.936  |
| [1] 2 torch.float32 |  0.009 |  0.113 |  0.128 ***regressed |
| [1] 4 torch.float32 |  0.009 |  0.113 |  0.131 ***regressed |
| [1] 8 torch.float32 |  0.011 |  0.116 |  0.129 ***regressed |
| [1] 16 torch.float32 |  0.015 |  0.122 |  0.135 ***regressed |
| [1] 32 torch.float32 |  0.032 |  0.177 |  0.178 ***regressed |
| [1] 64 torch.float32 |  0.070 |  0.420 |  0.281  |
| [1] 128 torch.float32 |  0.328 |  0.816 |  0.490  |
| [1] 256 torch.float32 |  1.125 |  1.690 |  1.084  |
| [1] 512 torch.float32 |  4.344 |  4.305 |  2.576  |
| [1] 1024 torch.float32 |  16.510 |  16.340 |  6.928  |
| [2] 2 torch.float32 |  0.009 |  0.113 |  0.186 ***regressed |
| [2] 4 torch.float32 |  0.011 |  0.115 |  0.184 ***regressed |
| [2] 8 torch.float32 |  0.012 |  0.114 |  0.184 ***regressed |
| [2] 16 torch.float32 |  0.019 |  0.119 |  0.173 ***regressed |
| [2] 32 torch.float32 |  0.050 |  0.170 |  0.240 ***regressed |
| [2] 64 torch.float32 |  0.120 |  0.429 |  0.375  |
| [2] 128 torch.float32 |  0.576 |  0.830 |  0.675  |
| [2] 256 torch.float32 |  2.021 |  1.748 |  1.451  |
| [2] 512 torch.float32 |  9.070 |  4.749 |  3.539  |
| [2] 1024 torch.float32 |  33.655 |  18.240 |  12.220  |
| [4] 2 torch.float32 |  0.009 |  0.112 |  0.318 ***regressed |
| [4] 4 torch.float32 |  0.010 |  0.115 |  0.319 ***regressed |
| [4] 8 torch.float32 |  0.013 |  0.115 |  0.320 ***regressed |
| [4] 16 torch.float32 |  0.027 |  0.120 |  0.331 ***regressed |
| [4] 32 torch.float32 |  0.085 |  0.173 |  0.385 ***regressed |
| [4] 64 torch.float32 |  0.221 |  0.431 |  0.646 ***regressed |
| [4] 128 torch.float32 |  1.102 |  0.834 |  1.055 ***regressed |
| [4] 256 torch.float32 |  4.042 |  1.811 |  2.054 ***regressed |
| [4] 512 torch.float32 |  18.390 |  4.884 |  5.087 ***regressed |
| [4] 1024 torch.float32 |  69.025 |  19.840 |  20.000 ***regressed |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42403

Reviewed By: ailzhang, mruberry

Differential Revision: D23717984

Pulled By: ngimel

fbshipit-source-id: 54cbd9ea72a97989cff4127089938e8a8e29a72b
2020-09-18 20:43:29 -07:00
620c999979 update gloo submodule (#45008)
Summary:
Revert accidental gloo submodule changes in https://github.com/pytorch/pytorch/issues/41977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45008

Reviewed By: malfet

Differential Revision: D23799892

Pulled By: ngimel

fbshipit-source-id: e8dab244c6abad32ed60efe3c26cab40837e57c8
2020-09-18 19:02:36 -07:00
21a1b9c7cf skip more nccl tests that causes flaky timeouts on rocm build (#44996)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44996

Reviewed By: malfet

Differential Revision: D23797564

Pulled By: walterddr

fbshipit-source-id: 4d60f76bb8ae54bb04a9f4143a68623933461b2a
2020-09-18 18:53:47 -07:00
1c15452703 Update Windows builders to latest VS2019 (#44746)
Summary:
Restore https://github.com/pytorch/pytorch/issues/44706 (which should work around a VC compiler crash) after it was reverted by https://github.com/pytorch/pytorch/issues/41977.
Update configs to use ":stable" Windows images.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44746

Reviewed By: walterddr

Differential Revision: D23793682

Pulled By: malfet

fbshipit-source-id: bfdc36c35b920f58798a18c15642ec7efc68f00e
2020-09-18 18:46:44 -07:00
e9941a5dd4 [vulkan][py] torch.utils.optimize_for_vulkan (#44903)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44903

Test Plan: Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D23766039

Pulled By: IvanKobzarev

fbshipit-source-id: dbdf484ee7d3a7719aab105efba51b92ebc51568
2020-09-18 18:20:11 -07:00
572f7e069c Enable type check for torch.testing._internal.te_utils.* (#44927)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44927

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D23776842

Pulled By: sshawnwu

fbshipit-source-id: 65c028169a37e1f2f7d9fdce8a958234ee1caa26
2020-09-18 18:09:15 -07:00
043466f978 [FX] Pass module's qualname to is_leaf_module (#44966)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44966

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Differential Revision: D23790360

Pulled By: jamesr66a

fbshipit-source-id: 7ef569fd93646584b27af7a615fa69c8d8bbdd3b
2020-09-18 17:02:33 -07:00
40c09cfe14 [CircleCI] Fix CUDA test setup (#44982)
Summary:
Circle updated the windows-nvidia-2019:canary image to exclude VC++ 14.26.
Update the config to use 14.27 instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44982

Reviewed By: seemethere

Differential Revision: D23794116

Pulled By: malfet

fbshipit-source-id: f3281f7d51acae4a4d06cecff01100fa77bd81ff
2020-09-18 16:20:24 -07:00
e255a4e1fd Enable bfloat16 random kernels on Windows (#44918)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44918

Reviewed By: pbelevich

Differential Revision: D23777548

Pulled By: ngimel

fbshipit-source-id: 9cf13166d7deba17bc72e402b82ed0afe347cb9b
2020-09-18 15:55:32 -07:00
06389406bb CUDA BFloat activations 1 (#44834)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44834

Reviewed By: mruberry

Differential Revision: D23752660

Pulled By: ngimel

fbshipit-source-id: 209a937e8a9afe12b7dd86ecfa493c9417fd22fb
2020-09-18 15:48:49 -07:00
76a109c930 [caffe2/aten] Fix clang build (#44934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44934

Fix build errors when using clang to build cuda sources:

```
In file included from aten/src/ATen/native/cuda/DistributionBernoulli.cu:4:
In file included from aten/src/ATen/cuda/CUDAApplyUtils.cuh:5:
caffe2/aten/src/THC/THCAtomics.cuh:321:1: error: control reaches end of non-void function [-Werror,-Wreturn-type]
}
^
1 error generated when compiling for sm_70.

In file included from aten/src/ATen/native/cuda/DistributionBernoulli.cu:4:
In file included from aten/src/ATen/cuda/CUDAApplyUtils.cuh:5:
caffe2/aten/src/THC/THCAtomics.cuh:321:1: error: control reaches end of non-void function [-Werror,-Wreturn-type]
}
^
1 error generated when compiling for sm_60.

In file included from aten/src/ATen/native/cuda/DistributionBernoulli.cu:4:
In file included from aten/src/ATen/cuda/CUDAApplyUtils.cuh:5:
caffe2/aten/src/THC/THCAtomics.cuh:321:1: error: control reaches end of non-void function [-Werror,-Wreturn-type]
}
^
1 error generated when compiling for sm_52.
```

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D23775266

fbshipit-source-id: 141e6624e2da870a8c50ff9f71fcf0717222fb17
2020-09-18 15:22:09 -07:00
fd4e21c91e Add optional string support to native_functions schema (#43010)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43010

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23751851

Pulled By: mruberry

fbshipit-source-id: 648f7430e1b7311eff28421f38e01f52d998fcbd
2020-09-18 14:57:24 -07:00
2d884f2263 Optimize Scale function (#44913)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44913

Pull Request resolved: https://github.com/pytorch/pytorch/pull/18322

Optimize Scale function

i-am-not-moving-c2-to-c10

Test Plan: buck test mode/dbg caffe2/caffe2/python/operator_test:weighted_sum_test

Reviewed By: BIT-silence

Differential Revision: D14575780

fbshipit-source-id: db333a7964581dcaff6e432ff1d6b517ba1a075f
2020-09-18 14:31:33 -07:00
374e9373b5 [jit] Pull (most) tests out of libtorch_python (#44795)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44795

Today, we build our cpp tests twice, once as a standalone gtest binary,
and once linked in `libtorch_python` so we can call them from
`test_jit.py`.

This is convenient (it means that `test_jit.py` is a single entry point
for all our tests), but has a few drawbacks:
1. We can't actually use the gtest APIs, since we don't link gtest into
`libtorch_python`. We're stuck with the subset that we want to write
polyfills for, and an awkward registration scheme where you have to
write a test and then include it in `tests.h`.
2. More seriously, we register custom operators and classes in these
tests. In a world where we may be linking many `libtorch_python`s, this
has a tendency to cause errors with `libtorch`.

So now, only tests that explicitly require cooperation with Python are
built into `libtorch_python`. The rest are built into
`build/bin/test_jit`.

There are tests which require that we define custom classes and
operators. In these cases, I've built them into separate `.so`s that we
call `torch.ops.load_library()` on.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity, ZolotukhinM

Differential Revision: D23735520

Pulled By: suo

fbshipit-source-id: d146bf4e7eb908afa6f96b394e4d395d63ad72ff
2020-09-18 14:04:40 -07:00
af3fc9725d Extract rpc/tensorpipe_utils.{cpp,h} from rpc/utils.{cpp,h} (#44803)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44803

Test Plan: CI

Reviewed By: lw

Differential Revision: D23732022

fbshipit-source-id: 5b839c7997bbee162a14d03414ee32baabbc8ece
2020-09-18 13:51:43 -07:00
d22dd80128 Enable type check for torch.testing._internal.common_device_type. (#44911)
Summary:
This PR fixes the type check errors in common_device_type.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44911

Reviewed By: walterddr

Differential Revision: D23768397

Pulled By: wuyangzhang

fbshipit-source-id: 053692583b4d6169b0eb5ffe0c3d30635c0db699
2020-09-18 13:42:11 -07:00
a47e3697ab Use iterator of DispatchKeySet. (#44682)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44682

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23698387

Pulled By: ailzhang

fbshipit-source-id: 4fa140db9254c2c9c342bf1c8dfd952469b0b779
2020-09-18 13:34:27 -07:00
6d312132e1 Beef up vmap docs and expose to master documentation (#44825)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44825

Test Plan: - build and view docs locally.

Reviewed By: ezyang

Differential Revision: D23742727

Pulled By: zou3519

fbshipit-source-id: f62b7a76b5505d3387b7816c514c086c01089de0
2020-09-18 13:26:25 -07:00
c2cf6efd96 Enable type check for torch.testing._internal.dist_utils.* (#44832)
Summary:
Addresses a sub-task of https://github.com/pytorch/pytorch/issues/44752.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44832

Reviewed By: malfet

Differential Revision: D23744260

Pulled By: samestep

fbshipit-source-id: 46aede57b4fa66a770d5df382b0aea2bd6772b9b
2020-09-18 12:50:48 -07:00
7bd8a6913d CUDA BFloat div, addcdiv, addcmul, mean, var (#44758)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44758

Reviewed By: mruberry

Differential Revision: D23752317

Pulled By: ngimel

fbshipit-source-id: 77992cf991f4e2b4b6839de73ea7e6ce2e1061c6
2020-09-18 11:51:11 -07:00
f175830558 [NNC] Fuse identical conditions in simplifier (#44886)
Summary:
Adds a pass to the IR Simplifier which fuses together the bodies of Cond statements that have identical conditions, e.g.

```
if (i < 10) {
  do_thing_1;
} else {
  do_thing_2;
}
if (i < 10) {
  do_thing_3;
}
```

is transformed into:

```
if (i < 10) {
  do_thing_1;
  do_thing_3;
} else {
  do_thing_2;
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44886

Reviewed By: glaringlee

Differential Revision: D23768565

Pulled By: nickgg

fbshipit-source-id: 3fe40d91e82bdfff8dcb8c56a02a4fd579c070df
2020-09-18 11:38:03 -07:00
09f2c6a94c Back out "Revert D23494065: Refactor CallbackManager as a friend class of RecordFunction." (#44699)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44699

Original commit changeset: 3b1ec928e3db

Previous revert (D23698861) was on the wrong diff stack. Backing out the revert.

Test Plan: Passed unit tests and previously landed.

Reviewed By: mruberry

Differential Revision: D23702258

fbshipit-source-id: 5c3e197bca412f454db5a7e86251ec85faf621c1
2020-09-18 11:08:27 -07:00
174cbff00a Improve sugared value's error message (#42889)
Summary:
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **https://github.com/pytorch/pytorch/issues/42889 Improve sugared value's error message**

I think most (if not all) cases where this code path is reached can be attributed to closing over a global variable.
Improving the error message to make this clearer to users.

close https://github.com/pytorch/pytorch/issues/41288

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42889

Reviewed By: SplitInfinity

Differential Revision: D23779347

Pulled By: gmagogsfm

fbshipit-source-id: ced702a96234040f79eb16ad998d202e360d6654
2020-09-18 11:01:40 -07:00
0063512a4b [ONNX] Updates to diagnostic tool to find missing ops (#44124)
Summary:
Moved the description of the tool and changed the function name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44124

Reviewed By: albanD

Differential Revision: D23674618

Pulled By: bzinodev

fbshipit-source-id: 5db0bb14fc106fc96358b1e0590f08e975388c6d
2020-09-18 10:32:30 -07:00
c68cc78299 Add a device parameter to RemoteModule (#44254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44254

Add a device parameter to RemoteModule, so it can be placed on any device
and not just CPU.

Original PR issue: RemoteModule enhancements #40550

Test Plan: buck test test/distributed/rpc:process_group_agent -- RemoteModule

Reviewed By: pritamdamania87

Differential Revision: D23483803

fbshipit-source-id: 4918583c15c6a38a255ccbf12c9168660ab7f6db
2020-09-18 10:31:03 -07:00
cff0e57c31 Remove Incorrect Comment in tools/build_libtorch and remove Python2 support in the module import (#44888)
Summary:
Fixes #44293 and removes Python 2 imports from the MNIST download module, as Python 2 is no longer supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44888

Reviewed By: agolynski

Differential Revision: D23785579

Pulled By: bugra

fbshipit-source-id: d9380502380876282008dd2d5feb92a446648982
2020-09-18 10:03:36 -07:00
07b7e44ed1 Stop using check_criterion_jacobian. (#44786)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44786

This predates gradcheck and gradcheck does the same and more.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23731902

Pulled By: gchanan

fbshipit-source-id: 425fd30e943194f63a663708bada8960265b8f05
2020-09-18 07:04:57 -07:00
6d178f6b8e Stop ignoring errors in cuda nn module tests. (#44783)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44783

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23731778

Pulled By: gchanan

fbshipit-source-id: 32df903a9e36bbf3f66645ee2d77efa5ed6ee429
2020-09-18 07:03:41 -07:00
df39c40054 Cleanup tracer handling of optional arguments (#43009)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43009

* **#43009 Cleanup tracer handling of optional arguments**

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23766621

Pulled By: mruberry

fbshipit-source-id: c1b46cd23b58b18ef4c03021b2514d7e692badb6
2020-09-18 06:54:09 -07:00
caea1adc35 Complex support for stft and istft (#43886)
Summary:
Ref https://github.com/pytorch/pytorch/issues/42175, fixes https://github.com/pytorch/pytorch/issues/34797

This adds complex support to `torch.stft` and `torch.istft`. Note that there are really two issues with complex here: complex signals, and returning complex tensors.

## Complex signals and windows
`stft` currently assumes all signals are real and uses `rfft` with `onesided=True` by default. Similarly, `istft` always takes a complex Fourier series and uses `irfft` to return real signals.

For `stft`, I now allow complex inputs and windows by calling the full `fft` if either are complex. If the user gives `onesided=True` and the signal is complex, then this doesn't work and raises an error instead. For `istft`, there's no way to automatically know what to do when `onesided=False` because that could either be a redundant representation of a real signal or a complex signal. So there, the user needs to pass the argument `return_complex=True` in order to use `ifft` and get a complex result back.

## stft returning complex tensors
The other issue is that `stft` returns a complex result, represented as a `(... X 2)` real tensor. I think ideally we want this to return proper complex tensors, but to preserve BC I've had to add a `return_complex` argument to manage this transition. `return_complex` defaults to false for real inputs to preserve BC, but defaults to True for complex inputs where there is no BC to consider.

In order to `return_complex` by default everywhere without a sudden BC-breaking change, a simple transition plan could be:
1. Introduce `return_complex`, defaulting to false when BC is an issue but emitting a warning (this PR).
2. Raise an error in cases where `return_complex` defaults to false, making it a required argument.
3. Change the `return_complex` default to true in all cases.
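
A usage sketch of the new argument (illustrative; for a real input, `return_complex=True` must be requested explicitly):

```python
import torch

x = torch.randn(1024)                                 # real signal
spec = torch.stft(x, n_fft=256, return_complex=True)  # complex tensor result
print(spec.dtype)                                     # torch.complex64
x_rec = torch.istft(spec, n_fft=256)                  # real reconstruction
```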

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43886

Reviewed By: glaringlee

Differential Revision: D23760174

Pulled By: mruberry

fbshipit-source-id: 2fec4404f5d980ddd6bdd941a63852a555eb9147
2020-09-18 01:39:47 -07:00
e400150c3b Fixed for caffe2/opt/tvm_transformer.cc (#44249)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44249

Reviewed By: gmagogsfm

Differential Revision: D23752331

Pulled By: SplitInfinity

fbshipit-source-id: 1d7297e080bc1e065129259e406af7216f3f0665
2020-09-18 00:03:59 -07:00
f2b3480795 CUDA BFloat softmax (#44837)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44837

Reviewed By: glaringlee

Differential Revision: D23767981

Pulled By: ngimel

fbshipit-source-id: be92c25a1b66ed50a52e090db167079def6f6b39
2020-09-17 21:52:47 -07:00
1694fde7eb Fix a GroupNorm cuda bug when input does not require_grad (#44863)
Summary:
Fix https://discuss.pytorch.org/t/illegal-memory-access-when-i-use-groupnorm/95800

`dX` is a Tensor; comparing `dX` with `nullptr` was wrong.

cc BIT-silence who wrote the kernel.

The test couldn't pass with `rtol=0` and `x.requires_grad=True`, so I had to update that to `1e-5`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44863

Reviewed By: mruberry

Differential Revision: D23754101

Pulled By: BIT-silence

fbshipit-source-id: 2eb0134dd489480e5ae7113a7d7b84629104cd49
2020-09-17 19:01:28 -07:00
5dbcbea265 TorchScript with record_function (#44345)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44345

As part of enhancing profiler support for RPC, when executing TorchScript functions over RPC, we would like to be able to support user-defined profiling scopes created by `with record_function(...)`.

Since after https://github.com/pytorch/pytorch/pull/34705, we support `with` statements in TorchScript, this PR adds support for `with torch.autograd.profiler.record_function` to be used within TorchScript.

This can be accomplished via the following without this PR:
```
torch.ops.profiler._record_function_enter(...)
# Script code, such as forward pass
torch.ops.profiler._record_function_exit(...)
```

This is a bit hacky, and it would be much cleaner to use the context manager now that we support `with` statements. Also, the `_record_function_*` operators are internal and subject to change; this change will help avoid BC issues in the future.
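
With this PR the context-manager form works inside scripted code; a minimal sketch:

```python
import torch

@torch.jit.script
def forward_with_profiling(x: torch.Tensor) -> torch.Tensor:
    with torch.autograd.profiler.record_function("my_scope"):
        y = x + 1
    return y
```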

Tested with `python test/test_jit.py TestWith.test_with_record_function -v`
ghstack-source-id: 112320645

Test Plan:
Repro instructions:
1) Change `def script_add_ones_return_any(x) -> Any` to `def script_add_ones_return_any(x) -> Tensor` in `jit/rpc_test.py`
2) `buck test mode/dev-nosan //caffe2/test/distributed/rpc:process_group_agent -- test_record_function_on_caller_rpc_async --print-passing-details`
3) The function which ideally should accept `Future[Any]` is `def _call_end_callbacks_on_future` in `autograd/profiler.py`.

python test/test_jit.py TestWith.test_with_foo -v

Reviewed By: pritamdamania87

Differential Revision: D23332074

fbshipit-source-id: 61b0078578e8b23bfad5eeec3b0b146b6b35a870
2020-09-17 18:45:00 -07:00
4a9c80e82e [pytorch][bot] update mobile op deps (#44854)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44854

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23751925

Pulled By: ljk53

fbshipit-source-id: 8e1905091bf3abaac20d97182eb88f96e905ffc2
2020-09-17 18:33:13 -07:00
9a007ba4cb [jit] stop parsing the block after seeing exit statements (#44870)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44870

fix https://github.com/pytorch/pytorch/issues/44864

Test Plan: buck test mode/dev-nosan //caffe2/test:jit -- 'test_assert_is_script'

Reviewed By: eellison

Differential Revision: D23755094

fbshipit-source-id: ca3f8b27dc6f9dc9364a22a1bce0e2f588ed4308
2020-09-17 18:09:16 -07:00
60ae6c9c18 [FX] Fix GraphModule copy methods not regenerating forward (#44806)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44806

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23738732

Pulled By: jamesr66a

fbshipit-source-id: 14e13551c6568c562f3f789b6274b6c86afefd0b
2020-09-17 17:14:38 -07:00
e14b2080be [reland] move rebuild buckets from end of first iteration to beginning of second iteration (#44798)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44798

[test all]

Update for relanding: in ddp.join(), moved _rebuild_buckets from end of backward to beginning of forward as well.

Part of relanding PR #41954, this refactoring is to move the rebuild_buckets call from the end of the first iteration to the beginning of the second iteration
ghstack-source-id: 112279261

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D23735185

fbshipit-source-id: c26e0efeecb3511640120faa1122a2c856cd694e
2020-09-17 17:10:21 -07:00
2043fbdfb6 Enable torch.backends.cuda typechecking in CI (#44916)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44916

Reviewed By: walterddr

Differential Revision: D23769844

Pulled By: malfet

fbshipit-source-id: 3be3616fba9e2f9c6d89cc71d5f0d24ffcc45cf2
2020-09-17 15:31:38 -07:00
18b77d7d17 [TensorExpr] Add Mod support to the LLVM backend (#44823)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44823

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.LLVMElemwiseMod_LLVM

Reviewed By: glaringlee

Differential Revision: D23761996

Pulled By: asuhan

fbshipit-source-id: c3c5b2fe0d989dec04f0152ce47c5cae35ed19c9
2020-09-17 15:25:42 -07:00
e535fb3f7d [ONNX] Enable true_divide scripting export with ONNX shape inference (#43991)
Summary:
Fixes the `true_divide` symbolic to cast tensors correctly.
The logic depends on knowing input types at export time, which is a known gap for exporting scripted modules. To that end, we are improving the exporter by enabling ONNX shape inference (https://github.com/pytorch/pytorch/issues/40628) and starting to increase coverage for scripting support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43991

Reviewed By: mruberry

Differential Revision: D23674614

Pulled By: bzinodev

fbshipit-source-id: 1b1b85340eef641f664a14c4888781389c886a8b
2020-09-17 14:38:24 -07:00
1c996b7170 Enable typechecking for torch.testing._internal.common_quantized.* (#44805)
Summary:
Addresses a subproblem of [Issue 42969](https://github.com/pytorch/pytorch/issues/42969)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44805

Reviewed By: malfet

Differential Revision: D23742754

Pulled By: janeyx99

fbshipit-source-id: e916a6a0c049cac318549a485d47f19363087d15
2020-09-17 14:24:32 -07:00
f5b92332c1 [TensorExpr] Fix order comparisons for unsigned types (#44857)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44857

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.LLVMCompareSelectByte*_LLVM

Reviewed By: glaringlee

Differential Revision: D23762162

Pulled By: asuhan

fbshipit-source-id: 1553429bd2d5292ccda57910326b8c70e4e6ab88
2020-09-17 14:16:54 -07:00
a153eafab7 Let logspace support bfloat16 on both CPU and CUDA (#44675)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44675

Reviewed By: ngimel

Differential Revision: D23710801

Pulled By: mruberry

fbshipit-source-id: 12d8e56f41bb635b500e89aaaf5df86a1795eb72
2020-09-17 14:13:55 -07:00
40e44c5f0a Make nuclear and frobenius norm non-out depend on out variants (#44095)
Summary:
Part of https://github.com/pytorch/pytorch/issues/24802

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44095

Reviewed By: ngimel

Differential Revision: D23735893

Pulled By: mruberry

fbshipit-source-id: bd1264b6a8e7f9220033982b0118aa962991ca88
2020-09-17 14:11:31 -07:00
086a2e7a4e [caffe2] add cost inference for FusedFakeQuantFC and FusedFakeQuantFCGradient (#44840)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44840

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44762

Move CostInferenceForFCGradient to fc_inference.cc/h to be used in multiple .cc files.

Test Plan: CI

Reviewed By: qizzzh

Differential Revision: D23714877

fbshipit-source-id: d27f33e270a93b0e053f2af592dc4a24e35526cd
2020-09-17 14:07:17 -07:00
4066022146 Do not use PRId64 in torch/csrc (#44767)
Summary:
Instead use `fmt::format()` or `%lld` and cast argument to `(long long)`
Fix typos and add helper `PyErr_SetString()` method in torch/csrc/Exceptions.h

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44767

Reviewed By: ezyang

Differential Revision: D23723671

Pulled By: malfet

fbshipit-source-id: c0101aed222184aa436b1e8768480d1531dff232
2020-09-17 14:00:02 -07:00
5d57025206 [TensorExpr] Add log1p support to the LLVM backend (#44839)
Summary:
Also corrected the Sleef_log1p registrations; the float versions had a redundant f.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44839

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.LLVMElemwiseLog1pFloat_LLVM

Reviewed By: glaringlee

Differential Revision: D23762113

Pulled By: asuhan

fbshipit-source-id: b5cf003b5c0c1ad549c7f04470352231929ac459
2020-09-17 13:38:35 -07:00
f5440a448a CUDA BFloat16 i0 support (#44750)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44750

Reviewed By: glaringlee

Differential Revision: D23764383

Pulled By: ngimel

fbshipit-source-id: d0e784d89241e8028f97766fdac51fe1ab4c188c
2020-09-17 13:30:10 -07:00
bee97d5be0 Document the default behavior for dist.new_group() when ranks=None (#44000)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44000

This wasn't documented, so add a doc saying all ranks are used when
ranks=None
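
A small sketch of the documented behavior (assumes the default process group has been initialized):

```python
import torch.distributed as dist

# These are equivalent: ranks=None means every rank joins the new group.
g1 = dist.new_group()
g2 = dist.new_group(ranks=list(range(dist.get_world_size())))
```
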
ghstack-source-id: 111206308

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D23465034

fbshipit-source-id: 4c51f37ffcba3d58ffa5a0adcd5457e0c5676a5d
2020-09-17 11:30:37 -07:00
2558e5769d Implement sort for list of tuples (#43448)
Summary:
* Implements tuple sort by traversing the contained IValue types and generating a lambda function as the sort comparator.
* Tuples and class objects can now nest arbitrarily within each other and still be sortable (see the sketch below).

Fixes https://github.com/pytorch/pytorch/issues/43219
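
An illustrative sketch of what this enables in TorchScript:

```python
from typing import List, Tuple

import torch

@torch.jit.script
def sort_pairs(xs: List[Tuple[int, str]]) -> List[Tuple[int, str]]:
    xs.sort()  # tuples are compared element-wise, as in Python
    return xs

print(sort_pairs([(2, "b"), (1, "a"), (2, "a")]))
# [(1, 'a'), (2, 'a'), (2, 'b')]
```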

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43448

Reviewed By: eellison

Differential Revision: D23352273

Pulled By: gmagogsfm

fbshipit-source-id: b6efa8d00e112178de8256da3deebdba7d06c0e1
2020-09-17 11:20:56 -07:00
c189328e5d CUDA BFloat16 unary ops part 2 (#44824)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44824

Reviewed By: mruberry

Differential Revision: D23752360

Pulled By: ngimel

fbshipit-source-id: 3aadaf9db9d4e4937aa38671e8589ecbeece709d
2020-09-17 10:57:43 -07:00
c1fa42497b fix legacy GET_BLOCKS code from THCUNN/common.h (#44789)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44472

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44789

Reviewed By: malfet

Differential Revision: D23732762

Pulled By: walterddr

fbshipit-source-id: c3748e365e9a1d009b00140ab0ef892da905d09b
2020-09-17 10:49:53 -07:00
24df3b7373 torch.empty_like and torch.zeros_like raise error if any memory format is provided with sparse input (#43699) (#44058)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43699

- Changed the order of `TORCH_CHECK` and `if (options.layout() == kSparse && self.is_sparse())`
inside `empty_like` method.

- [x] Added tests

EDIT:

More details on this, and why we cannot take the zeros_like approach.
Python code :
```python
res = torch.zeros_like(input_coalesced, memory_format=torch.preserve_format)
```
is routed to
```c++
// TensorFactories.cpp
Tensor zeros_like(
    const Tensor& self,
    const TensorOptions& options,
    c10::optional<c10::MemoryFormat> optional_memory_format) {
  if (options.layout() == kSparse && self.is_sparse()) {
    auto res = at::empty({0}, options); // to be resized
    res.sparse_resize_and_clear_(
        self.sizes(), self.sparse_dim(), self.dense_dim());
    return res;
  }
  auto result = at::empty_like(self, options, optional_memory_format);
  return result.zero_();
}
```
and passed to `if (options.layout() == kSparse && self.is_sparse())`

When we call in Python
```python
res = torch.empty_like(input_coalesced, memory_format=torch.preserve_format)
```
it is routed to
```c++
Tensor empty_like(
    const Tensor& self,
    const TensorOptions& options_,
    c10::optional<c10::MemoryFormat> optional_memory_format) {
  TORCH_CHECK(
    !(options_.has_memory_format() && optional_memory_format.has_value()),
    "Cannot set memory_format both in TensorOptions and explicit argument; please delete "
    "the redundant setter.");
  TensorOptions options =
      self.options()
          .merge_in(options_)
          .merge_in(TensorOptions().memory_format(optional_memory_format));
  TORCH_CHECK(
      !(options.layout() != kStrided &&
          optional_memory_format.has_value()),
      "memory format option is only supported by strided tensors");
  if (options.layout() == kSparse && self.is_sparse()) {
    auto result = at::empty({0}, options); // to be resized
    result.sparse_resize_and_clear_(
        self.sizes(), self.sparse_dim(), self.dense_dim());
    return result;
  }
```
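
The resulting user-visible behavior, sketched:

```python
import torch

s = torch.tensor([[0.0, 1.0], [2.0, 0.0]]).to_sparse()
torch.zeros_like(s)  # fine: no memory format requested
# Now raises "memory format option is only supported by strided tensors":
# torch.zeros_like(s, memory_format=torch.preserve_format)
```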

cc pearu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44058

Reviewed By: albanD

Differential Revision: D23672494

Pulled By: mruberry

fbshipit-source-id: af232274dd2b516dd6e875fc986e3090fa285658
2020-09-17 10:25:31 -07:00
1fde54d531 [quant][qat] Ensure fake_quant and observer can be disabled on scriptmodule (#44773)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44773

The model is created and prepared using fx APIs and then scripted for training.
In order to test QAT on the scripted model we need to be able to disable/enable fake_quant
and observer modules on it.

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_qat_and_script

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23741354

fbshipit-source-id: 3fee7aa9b049d9901313b977710f4dc1c4501532
2020-09-17 10:21:52 -07:00
361b38da19 [quant][fx] Add node name as prefix to observer module name (#44765)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44765

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_save_observer_state_dict

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23741355

fbshipit-source-id: 7185ceae5b3b520ac0beebb627c44eab7ae7d231
2020-09-17 10:17:42 -07:00
74c3dcd1d2 Revert D23725053: [pytorch][PR] change self.generator to generator
Test Plan: revert-hammer

Differential Revision:
D23725053 (a011b86115)

Original commit changeset: 89706313013d

fbshipit-source-id: 035214f0d4298d29a52f8032d364b52dfd956fe8
2020-09-17 09:42:37 -07:00
d2b4534d4d refactor initialize bucket views (#44330)
Summary:
[test all]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44330

Part of relanding PR #41954, this refactoring is to separate initialize_bucket_views and populate_bucket_views_out, as they do different things and are called from different callsites as well
ghstack-source-id: 112257271

Test Plan: unit tests

Reviewed By: mrshenli

Differential Revision: D23583347

fbshipit-source-id: a5f2041b2c4f2c2b5faba1af834c7143eaade938
2020-09-17 09:20:23 -07:00
6006e45028 .circleci: Switch to dynamic MAX_JOBS (#44729)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44729

Switches our MAX_JOBS from a hardcoded value to a more dynamic value so
that we can always utilize all of the cores that are available to us

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D23759643

Pulled By: seemethere

fbshipit-source-id: ad26480cb0359c988ae6f994e26a09f601b728e3
2020-09-17 09:16:36 -07:00
f605d7581e Implement better caching allocator for segmentation usecase. (#44618)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44618

This diff refactors the caching allocator to allow overriding its behavior by
making it a virtual class.

Test Plan: https://www.internalfb.com/intern/fblearner/details/218419618?tab=Experiment%20Results

Reviewed By: dreiss

Differential Revision: D23672902

fbshipit-source-id: 976f02922178695fab1c87f453fcb59142c258ec
2020-09-17 08:56:14 -07:00
4affbbd9f8 minor style edits to torch/testing/_internal/common_quantized.py (#44807)
Summary:
style nits

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44807

Reviewed By: malfet

Differential Revision: D23742537

Pulled By: janeyx99

fbshipit-source-id: 446343822d61f8fd9ef6dfcb8e5da4feff6522b6
2020-09-17 08:02:43 -07:00
a40ef25e30 [te] Disable flaky test CudaSharedMemReduce_1 (#44862)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44862

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23753831

Pulled By: bertmaher

fbshipit-source-id: d7d524ac34e4ca208df022a5730c2d11b3068f12
2020-09-17 07:58:16 -07:00
503c74888f Always use NewModuleTest instead of ModuleTest. (#44745)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44745

Much like CriterionTest/NewCriterionTest, these are outdated formulations and we should just use the new one.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23717808

Pulled By: gchanan

fbshipit-source-id: eb91982eef23452456044381334bfc9a5bbd837e
2020-09-17 07:36:39 -07:00
28085cbd39 Fixed quantile nan propagation and implemented nanquantile (#44393)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44393

torch.quantile now correctly propagates nan, and torch.nanquantile is implemented similarly to numpy.nanquantile.
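
A short sketch of the new behavior:

```python
import torch

t = torch.tensor([1.0, 2.0, float("nan"), 4.0])
print(torch.quantile(t, 0.5))     # tensor(nan): nan now propagates
print(torch.nanquantile(t, 0.5))  # tensor(2.): nan values are ignored
```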

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23649613

Pulled By: heitorschueroff

fbshipit-source-id: 5201d076745ae1237cedc7631c28cf446be99936
2020-09-17 05:53:25 -07:00
99093277c0 Support Python Slice class in TorchScript (#44335)
Summary:
Implements support for the [Python Slice class](https://docs.python.org/3/c-api/slice.html) (not the slice expression, which is already supported)

A Slice object can be used in any place that supports a slice expression, including multi-dim tensor slicing.
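
An illustrative sketch (the example names are mine, not from the PR):

```python
import torch

@torch.jit.script
def take_rows(t: torch.Tensor) -> torch.Tensor:
    s = slice(0, 2)  # a Slice object, usable wherever a slice expression is
    return t[s]

print(take_rows(torch.arange(9).reshape(3, 3)))
```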

Fixes https://github.com/pytorch/pytorch/issues/43511
Fixes https://github.com/pytorch/pytorch/issues/43125

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44335

Reviewed By: suo, jamesr66a

Differential Revision: D23682213

Pulled By: gmagogsfm

fbshipit-source-id: f74fe25370e89fbfd2b3727d95ce4e1c4ba8dec4
2020-09-17 00:41:53 -07:00
b6f4bb0a70 Revert D23236088: [pytorch][PR] [caffe2] adds Cancel to SafeDequeueBlobsOp and SafeEnqueueBlobsOp
Test Plan: revert-hammer

Differential Revision:
D23236088 (0ccc38b773)

Original commit changeset: daa90d9ee324

fbshipit-source-id: 933c7deab177250075683a9bea143ac37f16a598
2020-09-16 23:32:50 -07:00
e18a2219dd Implement scatter reductions (CUDA), remove divide/subtract (#41977)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33394.

This PR does two things:
1. Implement CUDA scatter reductions with revamped GPU atomic operations.
2. Remove support for divide and subtract for CPU reduction, as was discussed with ngimel.

I've also updated the docs to reflect the existence of only multiply and add.
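
A hedged usage sketch, assuming the `reduce=` keyword form of `scatter_` (requires a CUDA device):

```python
import torch

out = torch.zeros(2, 4, device="cuda")
index = torch.tensor([[0, 1, 0, 1]], device="cuda")
src = torch.ones(1, 4, device="cuda")
out.scatter_(0, index, src, reduce="add")  # "multiply" is the other option
```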

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41977

Reviewed By: mruberry

Differential Revision: D23748888

Pulled By: ngimel

fbshipit-source-id: ea643c0da03c9058e433de96db02b503514c4e9c
2020-09-16 23:25:21 -07:00
fdeee74590 [pytorch][vulkan] Fix downcast warnings-errors, aten_vulkan buck target
Summary:
The buck build has -Wall for downcasts, so we need to add safe_downcast<int32_t> everywhere.

BUCK build changes for aten_vulkan to include the vulkan_wrapper lib.

Test Plan: The next diff with segmentation demo works fine

Reviewed By: dreiss

Differential Revision: D23739445

fbshipit-source-id: b22a30e1493c4174c35075a68586defb0fccd2af
2020-09-16 20:49:34 -07:00
b61d3d8be8 Implement torch.kaiser_window (#44271)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349
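
A minimal usage sketch of the new function:

```python
import torch

w = torch.kaiser_window(10, periodic=True, beta=12.0)
print(w.shape)  # torch.Size([10])
```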

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44271

Reviewed By: ngimel

Differential Revision: D23727972

Pulled By: mruberry

fbshipit-source-id: b4c931b2eb3a536231ad6d6c3cb66e52a13286ac
2020-09-16 20:41:31 -07:00
34331b0e0f CUDA BFloat16 and other improvements on abs (#44804)
Summary:
Not sure if ROCm supports `std::abs` today; let's see the CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44804

Reviewed By: mruberry

Differential Revision: D23748837

Pulled By: ngimel

fbshipit-source-id: ccf4e63279f3e5927a85d8d8f70ba4b8c334156b
2020-09-16 20:37:07 -07:00
ba6534ae2b enable type check common_distributed (#44821)
Summary:
Enabled type checking in common_distributed by using tensors of ints

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44821

Test Plan: Run python test/test_type_hints.py; errors are no longer ignored by mypy.ini

Reviewed By: walterddr

Differential Revision: D23747466

Pulled By: alanadakotashine

fbshipit-source-id: 820fd502d7ff715728470fbef0be90ae7f128dd6
2020-09-16 19:19:36 -07:00
e48201c5cf Mention TF32 on related docs (#44690)
Summary:
cc: ptrblck

![image](https://user-images.githubusercontent.com/1032377/93168022-cbbfcb80-f6d6-11ea-8f6e-f2c8a15c5bea.png)
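
For reference, the flags documented here (a sketch; TF32 only takes effect on supported GPUs):

```python
import torch

torch.backends.cuda.matmul.allow_tf32 = True  # TF32 for matmuls
torch.backends.cudnn.allow_tf32 = True        # TF32 for cuDNN convolutions
```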

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44690

Reviewed By: ngimel

Differential Revision: D23727921

Pulled By: mruberry

fbshipit-source-id: db7cc8e74cde09c13d6a57683129fd839863b914
2020-09-16 19:18:30 -07:00
79108fc16c [JIT] Improve Future subtype checking (#44570)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44570

**Summary**
This commit improves subtype checking for futures so that
`Future[T]` is considered to be a subtype of `Future[U]` if `T` is a
subtype of `U`.

**Test Plan**
This commit adds a test case to `test_async.py` that tests this.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23660588

Pulled By: SplitInfinity

fbshipit-source-id: b606137c91379debab91b9f41057f7b1605757c5
2020-09-16 18:54:51 -07:00
29664e6aa3 [FX] Further sanitize generated names (#44808)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44808

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D23739413

Pulled By: jamesr66a

fbshipit-source-id: b759c3ea613dfa717fb23977b72ff4773d9dcc99
2020-09-16 18:47:38 -07:00
204f985fc3 [NNC] Add simplification of Loop + Condition patterns. (#44764)
Summary:
Adds a new optimization to the IRSimplifier which changes this pattern:
```
for ...
  if ...
   do thing;
```
into:
```
if ...
  for ...
    do thing;
```

Which should be almost strictly better.

There are many cases where this isn't safe to do, hence tests. Most  obviously when the condition depends on something modified within the loop.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44764

Reviewed By: mruberry

Differential Revision: D23734463

Pulled By: nickgg

fbshipit-source-id: 51617e837de96b354fb702d0090ac65ddc523d36
2020-09-16 18:41:58 -07:00
8ec6bc7292 [pytorch][vulkan][jni] LiteModuleLoader load argument to use vulkan device
Summary:
### Java, CPP
Introducing an additional parameter `device` to LiteModuleLoader to specify the device on which `forward` will run.

On the Java side this is an enum that contains CPU and VULKAN, passed as a jint to the JNI side and stored as a member field on the same level as the module.

In pytorch_jni_lite.cpp, all input tensors are converted to Vulkan.

In pytorch_jni_common.cpp (also goes to OSS): if the result tensor is not CPU, call .cpu() on it (at the moment, the only non-CPU device here is Vulkan).

### BUCK
Introducing a `pytorch_jni_lite_with_vulkan` target that depends on `pytorch_jni_lite` and adds `aten_vulkan`.

In that case `pytorch_jni_lite_with_vulkan` can be used in place of `pytorch_jni_lite`.

Test Plan:
After the following diff with aidemo segmentation:
```
buck install -r aidemos-android
```
{F296224521}

Reviewed By: dreiss

Differential Revision: D23198335

fbshipit-source-id: 95328924e398901d76718c4d828f96e112dfa1b0
2020-09-16 18:35:22 -07:00
0ccc38b773 [caffe2] adds Cancel to SafeDequeueBlobsOp and SafeEnqueueBlobsOp (#44495)
Summary:
## Motivation

* To be able to make C2 ops cancellable so we can safely exit.
* Some C2 operators are blocking and thus non-cancellable. If an error
  occurs we need to be able to safely stop all net execution so we can throw
  the exception to the caller.

* When an error occurs in a net, or the net gets cancelled, running ops will have
 their `Cancel` method called.

* This diff adds a `Cancel` method to `SafeEnqueueBlobsOp`
and `SafeDequeueBlobsOp` that calls queue->close() to force all the
 blocking ops to return.
* Adds a unit test that verifies the error propagation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44495

Test Plan:
## Unit Test added to verify that queue ops propagate errors
```
buck test caffe2/caffe2/python:hypothesis_test
```

Reviewed By: dzhulgakov

Differential Revision: D23236088

Pulled By: dahsh

fbshipit-source-id: daa90d9ee32483fb51195e269a52cf5987bb0a5a
2020-09-16 18:17:34 -07:00
3fa7f515a5 [pytorch][bot] update mobile op deps (#44700)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44700

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23719486

Pulled By: ljk53

fbshipit-source-id: 39219ceeee51861f90b228fdfe2ab59ac8a9704d
2020-09-16 17:20:15 -07:00
6befc09465 Fix misuse of PyObject_IsSubclass (#44769)
Summary:
PyObject_IsSubclass may set the live Python exception bit if the given object is not a class. `IsNamedTuple` is currently using it incorrectly, which may trip all subsequent Python operations in a debug-build Python. A normal release-build Python is not affected because `assert` is a no-op in release builds.

Fixes https://github.com/pytorch/pytorch/issues/43577

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44769

Reviewed By: jamesr66a

Differential Revision: D23725584

Pulled By: gmagogsfm

fbshipit-source-id: 2dabd4f8667a045d5bf75813500876c6fd81542b
2020-09-16 16:19:01 -07:00
43fe034514 [JIT] Disallow plain Optional type annotation without arg (#44586)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44586

**Summary**
This commit disallows plain `Optional` type annotations without
any contained types both in type comments and in-line as
Python3-style type annotations.

**Test Plan**
This commit adds a unit test for these two situations.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23721517

Pulled By: SplitInfinity

fbshipit-source-id: ead411e94aa0ccce227af74eb0341e2a5331370a
2020-09-16 16:07:26 -07:00
574f9af160 [NCCL] Add option to run NCCL on high priority cuda stream (#43796)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43796

This diff adds an option for the process group NCCL backend to pick high priority cuda streams.
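
A minimal single-process sketch, assuming the option is exposed to Python as `ProcessGroupNCCL.Options` with an `is_high_priority_stream` field (names are assumptions, not confirmed by this diff):

```
import tempfile
import torch.distributed as dist

# Single-process FileStore just for illustration.
store = dist.FileStore(tempfile.mktemp(), 1)
opts = dist.ProcessGroupNCCL.Options()
opts.is_high_priority_stream = True  # assumed field name: request high-priority CUDA streams
pg = dist.ProcessGroupNCCL(store, 0, 1, opts)  # rank 0, world size 1
```

High-priority streams can reduce the chance that NCCL communication kernels are starved by concurrently running compute kernels.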

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D23404286

fbshipit-source-id: b79ae097b7cd945a26e8ba1dd13ad3147ac790eb
2020-09-16 16:00:41 -07:00
161490d441 Move torch/version.py generation to cmake (#44577)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44577

I would like to move this to cmake so that I can depend on it
happening from other parts of the build.

This PR pulls out the logic for determining the version string and
writing the version file into its own module. `setup.py` still receives
the version string and uses it as before, but now the code for writing
out `torch/version.py` lives in a custom command in torch/CMakeLists.txt
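
For context, the generated `torch/version.py` is the file that backs version queries like the following (attribute names are from the existing generated file; the printed values are illustrative):

```
import torch.version as v

print(v.__version__)   # e.g. '1.7.0a0+<sha>'
print(v.git_version)   # the git revision the build was produced from
print(v.debug)         # whether this is a debug build
print(v.cuda)          # CUDA version the build targets, or None for CPU-only
```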

I noticed a small inconsistency in how version info is populated.
`TORCH_BUILD_VERSION` is populated from `setup.py` at configuration
time, while `torch/version.py` is written at build time. So if, e.g., you
configured cmake on a certain git rev, then built on another, the
two versions would be inconsistent.

This does not appear to matter, so I opted to preserve the existing
behavior.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23734781

Pulled By: suo

fbshipit-source-id: 4002c9ec8058503dc0550f8eece2256bc98c03a4
2020-09-16 15:49:22 -07:00
ffe127e4f1 [JIT] Disallow plain Tuple type annotation without arg (#44585)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44585

**Summary**
This commit disallows plain `Tuple` type annotations without any
contained types both in type comments and in-line as Python3-style
type annotations.

**Test Plan**
This commit adds a unit test for these two situations.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23721515

Pulled By: SplitInfinity

fbshipit-source-id: e11c77a4fac0b81cd535c37a31b9f4129c276592
2020-09-16 15:49:19 -07:00
09a84071a3 enable mypy check for jit_metaprogramming_utils (#44752)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42969
Enables the mypy check for jit_metaprogramming_utils.py and fixes all errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44752

Reviewed By: walterddr

Differential Revision: D23741285

Pulled By: qxu-fb

fbshipit-source-id: 21e36ca5d25c8682fb93b806e416b9e1db76f71e
2020-09-16 15:44:37 -07:00
3f5bb2bade [quant] Support clone for per channel affine quantized tensor (#44573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44573

fixes: https://github.com/pytorch/pytorch/issues/33309

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D23663828

fbshipit-source-id: 9a021a22b6075b1e94b3f91c0c101fbb9246ec0e
2020-09-16 15:37:44 -07:00
7b3432caff [TensorExpr] Support boolean in simplifier (#44659)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44659

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.ConstantFoldCastToBool

Reviewed By: ngimel

Differential Revision: D23714675

Pulled By: asuhan

fbshipit-source-id: 4c18d972b628d5ad55bad58eddd5f6974e043d9c
2020-09-16 15:30:19 -07:00
ac0d13cc88 Vectorize complex copy. (#44722)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44722

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D23731276

Pulled By: ezyang

fbshipit-source-id: 4902c4b79577ae3c70aca94828006b12914ab7f9
2020-09-16 15:15:12 -07:00
78b806ab4a [JIT] Disallow plain List type annotation without arg (#44584)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44584

**Summary**
This commit extends the work done in #38130 and disallows plain
Python3-style `List` type annotations.

**Test Plan**
This commit extends `TestList.test_no_element_type_annotation` to the
Python3-style type annotation.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23721514

Pulled By: SplitInfinity

fbshipit-source-id: 48957868286f44ab6d5bf5e1bf97f0a4ebf955df
2020-09-16 15:08:04 -07:00
cb3b8a33f1 [JIT] Disallow plain Dict type annotation without arg (#44334)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44334

**Summary**
This commit detects and prohibits the case in which `typing.Dict` is
used as an annotation without type arguments (i.e. not as `typing.Dict[K, V]`).
At present, `typing.Dict` is always assumed to have two arguments, and
when it is used without them, `typing.Dict.__args__` is nonempty and
contains some `typing.TypeVar` instances, which have no JIT type equivalent.
Consequently, trying to convert `typing.Dict` to a JIT type results in
a `c10::DictType` with `nullptr` for its key and value types, which can cause
a segmentation fault.

This is fixed by returning a `DictType` from
`jit.annotations.try_ann_to_type` only if the key and value types are converted
successfully to a JIT type and returning `None` otherwise.
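
A small illustration of the rejected and accepted forms:

```
import torch
from typing import Dict

# Rejected after this change: a bare `Dict` annotation with no key/value
# arguments -- the case that previously produced a DictType with null
# contained types and could segfault.
#
# @torch.jit.script
# def bad(d: Dict):
#     return d

# Accepted: key and value types are fully specified.
@torch.jit.script
def good(d: Dict[str, int]) -> int:
    return d["a"]

print(good({"a": 1}))  # 1
```

The plain `Optional`, `Tuple`, and `List` commits above enforce the same rule for those annotations.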

**Test Plan**
This commit adds a unit test to `TestDict` that tests the plain `Dict`
annotations throw an error.

**Fixes**
This commit closes #43530.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23610766

Pulled By: SplitInfinity

fbshipit-source-id: 036b10eff6e3206e0da3131cfb4997d8189c4fec
2020-09-16 14:38:28 -07:00
5027c161a9 Add TORCH_SELECTIVE_NAME to AMP definitions (#44711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44711

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23711425

Pulled By: ezyang

fbshipit-source-id: d4b0ef77893af80fe9b74791e66825e223ae221d
2020-09-16 14:25:17 -07:00
82ab167cce [NNC] Fix masking for all block and thread dimensions in CudaCodeGen (#44733)
Summary:
Unifies a number of partial solutions to the thread and block dimension extent masking, including the NoThreadIdxWriter and my last fix https://github.com/pytorch/pytorch/issues/44325. The NoThreadIdxWriter is gone in favour of tracking the current loop extents and masking any statements that have a lower rank than the launch parameters in any Block or Thread dimension, which handles both the "no" and "smaller" axis binding cases.

For example it will transform the following:
```
for i in 0..10 // blockIdx.x
  for j in 0..10 // threadIdx.x
    do thing(i, j);
  for k in 0..5 // threadIdx.x
    do other thing(i, k);
```

Into:
```
do thing(blockIdx.x, threadIdx.x);
if (threadIdx.x < 5) {
  do other thing(blockIdx.x, threadIdx.x);
}
```

It also handles the case where statements are not bound by any axis, e.g.
```
do outer thing;
for i in 0..10 // blockIdx.x
  for j in 0..10 // threadIdx.x
    do thing(i, j);
  do other thing(i);
```

will become:

```
if (blockIdx.x < 1) {
  if (threadIdx.x < 1) {
    do outer thing;
  }
}
syncthreads();
do thing(blockIdx.x, threadIdx.x);
syncthreads();
if (threadIdx.x < 1) {
  do other thing(blockIdx.x);
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44733

Reviewed By: mruberry

Differential Revision: D23736878

Pulled By: nickgg

fbshipit-source-id: 52d08626ae8043d53eb937843466874d479a6768
2020-09-16 14:23:47 -07:00
a3835179a1 [FakeLowP] Addressing FakeLowP OSS issues. (#44819)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44819

This should address issues that our validation team found:
A) test_op_nnpi_fp16: hypothesis settings trigger max_examples * max_examples runs.
B) batchnorm: the batchNorm test is derived from a unit test which doesn't have the settings required for hypothesis, hence the default value of 100 gets set.

Test Plan:
buck test //caffe2/caffe2/contrib/fakelowp/test/...
https://our.intern.facebook.com/intern/testinfra/testrun/5910974543950859

Reviewed By: hyuen

Differential Revision: D23740970

fbshipit-source-id: 16fcc49f7bf84a5d7342786f671cd0b4e0fc87d3
2020-09-16 13:56:11 -07:00
07d9cc80a4 Fix error code checks for triangular_solve (CPU) (#44720)
Summary:
Added missing error checks for the CPU version of `triangular_solve`.
Fixes https://github.com/pytorch/pytorch/issues/43141.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44720

Reviewed By: mruberry

Differential Revision: D23733400

Pulled By: ngimel

fbshipit-source-id: 9837e01b04a6bfd9181e08d46bf96329f292cae0
2020-09-16 13:54:45 -07:00
f3bd984e44 Move the description comment of compute_bucket_assignment_by_size from cpp to the header file. (#44703)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44703

The description of this public function should be in the header file.

Also fix some typos.

Test Plan: N/A.

Reviewed By: pritamdamania87

Differential Revision: D23703661

fbshipit-source-id: 24ae63de9498e321b31dfb2efadb44183c6370df
2020-09-16 13:44:14 -07:00
20ac736200 Remove py2 compatible future imports (#44735)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44735

Reviewed By: mruberry

Differential Revision: D23731306

Pulled By: ezyang

fbshipit-source-id: 0ba009a99e475ddbe22981be8ac636f8a1c8b02f
2020-09-16 12:55:57 -07:00
6debe825be [vulkan] glsl shaders relaxed precision mode to cmake option (#43076)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43076

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D23143354

Pulled By: IvanKobzarev

fbshipit-source-id: 7b3ead1e63cf8acf6e8e547080a8ead7a2db994b
2020-09-16 12:51:34 -07:00
e9c6449b46 [FX][EZ] Allow constructing GraphModule with dict for root (#44679)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44679

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23696766

Pulled By: jamesr66a

fbshipit-source-id: fe18b7b579c1728d00589bd5fd5e54c917cc61fe
2020-09-16 12:43:23 -07:00
1718b16d15 [Caffe2] gcs_cuda_only is trivial if CUDA not available (#44578)
Summary:
Make `gcs_cuda_only` and `gcs_gpu_only` return empty device lists if CUDA / GPU (CUDA or ROCm) is not available

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44578

Reviewed By: walterddr

Differential Revision: D23664227

Pulled By: malfet

fbshipit-source-id: 176b5d964c0b02b8379777cd9a38698c11818690
2020-09-16 12:24:08 -07:00
c44e4878ae Enable torch.backends.quantized typechecks (#44794)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44794

Reviewed By: walterddr

Differential Revision: D23734353

Pulled By: malfet

fbshipit-source-id: 491bd7c8f147759715eb296d7537a172685aa066
2020-09-16 12:21:20 -07:00
1cd5ba49c6 Add batching rule for "is_complex", "conj" (#44649)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44649

To unblock #43208, which adds "is_complex" checks to backward formulas
that are being tested for batched gradient support with vmap.

Test Plan: - `pytest test/test_vmap.py -v`

Reviewed By: anjali411

Differential Revision: D23685356

Pulled By: zou3519

fbshipit-source-id: 29e41a9296336f6d1008e3040cade4c643bf5ebf
2020-09-16 12:19:46 -07:00
cce7680a23 Add bound method tests for async_execution with RRef helper (#44716)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44716

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D23707326

Pulled By: mrshenli

fbshipit-source-id: a2f8db17447e9f82c9f6ed941ff1f8cb9090ad74
2020-09-16 12:01:07 -07:00
257c6d0fde Make async_execution compatible with RRef helpers (#44666)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44666

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D23691989

Pulled By: mrshenli

fbshipit-source-id: b36f4b1c9d7782797a0220434a8272610a23e83e
2020-09-16 12:01:05 -07:00
924717bf51 Add _get_type() API to RRef (#44663)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44663

The new API returns the type of the data object referenced by this
`RRef`. On the owner, this is same as `type(rref.local_value())`.
On a user, this will trigger an RPC to fetch the `type` object from
the owner. After this function is run once, the `type` object is
cached by the `RRef`, and subsequent invocations no longer trigger
RPC.
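
A minimal single-process sketch of the owner-side behavior (worker name and port are placeholders):

```
import os
import torch
import torch.distributed.rpc as rpc

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker0", rank=0, world_size=1)

rref = rpc.remote("worker0", torch.add, args=(torch.ones(2), 1))
# On the owner this is answered locally; on a user it would RPC once, then cache.
print(rref._get_type())  # <class 'torch.Tensor'>
rpc.shutdown()
```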

closes #33210

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D23691990

Pulled By: mrshenli

fbshipit-source-id: a2d87cd601a691dd75164b6bcd7315245e9cf6bd
2020-09-16 11:59:22 -07:00
6954ae1278 Vec256 Test cases (#42685)
Summary:
[Tests for Vec256 classes https://github.com/pytorch/pytorch/issues/15676](https://github.com/pytorch/pytorch/issues/15676)

Testing
Current list:

- [x] Blends
- [x] Memory: UnAlignedLoadStore
- [x] Arithmetics: Plus, Minus, Multiplication, Division
- [x] Bitwise: BitAnd, BitOr, BitXor
- [x] Comparison: Equal, NotEqual, Greater, Less, GreaterEqual, LessEqual
- [x] MinMax: Minimum, Maximum, ClampMin, ClampMax, Clamp
- [x] SignManipulation: Absolute, Negate
- [x] Interleave: Interleave, DeInterleave
- [x] Rounding: Round, Ceil, Floor, Trunc
- [x] Mask: ZeroMask
- [x] SqrtAndReciprocal: Sqrt, RSqrt, Reciprocal
- [x] Trigonometric: Sin, Cos, Tan
- [x] Hyperbolic: Tanh, Sinh, Cosh
- [x] InverseTrigonometric: Asin, ACos, ATan, ATan2
- [x] Logarithm: Log, Log2, Log10, Log1p
- [x] Exponents: Exp, Expm1
- [x] ErrorFunctions: Erf, Erfc, Erfinv
- [x] Pow: Pow
- [x] LGamma: LGamma
- [x] Quantization: quantize, dequantize, requantize_from_int
- [x] Quantization: widening_subtract, relu, relu6
Missing:
- [ ] Constructors, initializations
- [ ] Conversion , Cast
- [ ] Additional: imag, conj, angle (note: imag and conj only checked for float complex)

#### Notes on tests and testing framework
- some math functions are tested within a domain range
- the testing framework mostly tests randomly against the std implementation, within the domain (or within the implementation domain for some math functions)
- some functions are tested against the local version. ~~For example, std::round and the vector version of round differ, so it was tested against the local version~~
- round was tested against pytorch's at::native::round_impl. ~~For double type on **VSX, vec_round failed for (even)+0.5 values**~~; it was solved by using vec_rint
- ~~**complex types are not tested**~~ **After enabling complex testing, due to precision and domain some of the complex functions failed for VSX and x86 AVX as well. I will either test them against the local implementation or check within the accepted domain**
- ~~quantizations are not tested~~ Added tests for the quantize, dequantize, requantize_from_int, relu, relu6, and widening_subtract functions
- the testing framework should be improved further
- ~~For now `-DBUILD_MOBILE_TEST=ON` will be used for Vec256Test too~~
  Vec256 test cases will be built for each CPU_CAPABILITY

Fixes: https://github.com/pytorch/pytorch/issues/15676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42685

Reviewed By: malfet

Differential Revision: D23034406

Pulled By: glaringlee

fbshipit-source-id: d1bf03acdfa271c88744c5d0235eeb8b77288ef8
2020-09-16 11:48:02 -07:00
e6101f5507 fixes lda condition for blas functions, fixes bug with beta=0 in addmv slow path (#44681)
Summary:
per title. If `beta=0` and slow path was taken, `nan` and `inf` in the result were not masked as is the case with other linear algebra functions. Similarly, since `mv` is implemented as `addmv` with `beta=0`, wrong results were sometimes produced for `mv` slow path.
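
A quick illustration of the expected masking semantics:

```
import torch

# With beta=0, the contents of `input` must be ignored entirely, so nan/inf
# in it may not leak into the result; this is what the slow path got wrong.
inp = torch.full((3,), float("nan"))
A = torch.ones(3, 4)
v = torch.ones(4)
print(torch.addmv(inp, A, v, beta=0))  # tensor([4., 4., 4.]) -- no nan
```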

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44681

Reviewed By: mruberry

Differential Revision: D23708653

Pulled By: ngimel

fbshipit-source-id: e2d5d3e6f69b194eb29b327e1c6f70035f3b231c
2020-09-16 11:47:56 -07:00
570102ce85 Remove many unused THC pointwise math operators (#44230)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44230

Reviewed By: albanD

Differential Revision: D23701185

Pulled By: ngimel

fbshipit-source-id: caf7b7a815b37d50232448d6965e591508546bd7
2020-09-16 11:47:51 -07:00
07d07e3c6c Remove EXPERIMENTAL_ENUM_SUPPORT feature guard (#44243)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44243

Reviewed By: ZolotukhinM

Differential Revision: D23605979

Pulled By: gmagogsfm

fbshipit-source-id: 098ae69049c4664ad5d1521c45b8a7dd22e72f6c
2020-09-16 11:45:59 -07:00
3e6bb5233f Reference amp tutorial (recipe) from core amp docs (#44725)
Summary:
https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html is live.  Core amp docs should reference it.

Also, I fixed some typos in the `zero_grad` docs that we ignored when git was behaving weirdly during ngimel's merge of https://github.com/pytorch/pytorch/pull/44423.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44725

Reviewed By: mruberry

Differential Revision: D23723807

Pulled By: ngimel

fbshipit-source-id: ca0b76365f8ca908bd978e3b38bf81857fa6c2a3
2020-09-16 11:37:58 -07:00
a011b86115 change self.generator to generator (#44461)
Summary:
bug fix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44461

Reviewed By: mruberry

Differential Revision: D23725053

Pulled By: ngimel

fbshipit-source-id: 89706313013d9eae96aaaf144924867457efd2c0
2020-09-16 11:32:17 -07:00
ee493e1a91 CUDA bfloat compare ops (#44748)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44748

Reviewed By: mruberry

Differential Revision: D23725997

Pulled By: ngimel

fbshipit-source-id: 4f89dce3a8b8f1295ced522011b59e60d756e749
2020-09-16 11:32:14 -07:00
eb75cfb9c0 Back out "Revert D23323486: DPP Async Tracing" plus windows build fix. (#44702)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44702

Original commit changeset: c6bd6d277aca

The original diff caused the Windows build to fail due to a compiler bug in VS2019 (lambda capture of a constant int value). This back-out of the revert works around the issue with an explicit capture of the const int value.

Test Plan: Tested and previously landed.

Reviewed By: mruberry

Differential Revision: D23703215

fbshipit-source-id: f9ef23be97540bc9cf78a855295fb8c69f360459
2020-09-16 11:32:11 -07:00
ced8727d88 Fix a broken link in CONTRIBUTING.md (#44701)
Summary:
as the title says :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44701

Reviewed By: ngimel

Differential Revision: D23724919

Pulled By: mrshenli

fbshipit-source-id: 5ca5ea974ee6a94ed132dbe7892a9b4b9c3dd9be
2020-09-16 11:30:05 -07:00
5e717f0d5e delete the space for the docs rendering (#44740)
Summary:
see the docs rendering of `jacobian` and `hessian` at https://pytorch.org/docs/stable/autograd.html

![image](https://user-images.githubusercontent.com/20907377/93268949-f0618500-f762-11ea-9ec6-ddd062540c59.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44740

Reviewed By: ngimel

Differential Revision: D23724899

Pulled By: mrshenli

fbshipit-source-id: f7558ff53989e5dc7e678706207be2ac7ce22c66
2020-09-16 11:13:45 -07:00
a5cc151b8c Build EigenBlas as static library (#44747)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44747

Reviewed By: ezyang

Differential Revision: D23717927

Pulled By: malfet

fbshipit-source-id: c46fbcf5a55895cb984dd4c5301fbcb784fc17d5
2020-09-16 10:25:26 -07:00
b63b684394 Consolidate CODEOWNERS file for distributed package. (#44763)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44763

The file had separate rules for RPC and DDP/c10d, consolidated all of
it together and placed all the distributed rules together.
ghstack-source-id: 112140871

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D23721162

fbshipit-source-id: d41c757eb1615376d442bd6b2802909624bd1d3f
2020-09-16 10:19:25 -07:00
dbf17a1d4c Fixing a few links in distributed CONTRIBUTING.md (#44753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44753

ghstack-source-id: 112132781

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D23719077

fbshipit-source-id: 3d943dfde100d175f417554fc7fca1fdb295129f
2020-09-16 10:14:19 -07:00
06036f76b6 CUDA BFloat16 pow (#44760)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44760

Reviewed By: ngimel

Differential Revision: D23727936

Pulled By: mruberry

fbshipit-source-id: 8aa89e989294347d7f593b1a63ce4a1dbfdf783e
2020-09-16 10:01:21 -07:00
63469da3bb Add a test to ensure DDP join works with RPC (#44439)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44439

Adds a test to ddp_under_dist_autograd_test to ensure that the uneven
inputs join() API works properly when DDP + RPC are combined. We test that when
running in outside-DDP mode (DDP applied to the whole hybrid module) we can
correctly process uneven inputs across different trainers.
ghstack-source-id: 112156980

Test Plan: CI

Reviewed By: albanD

Differential Revision: D23612409

fbshipit-source-id: f1e328c096822042daaba263aa8747a9c7e89de7
2020-09-16 09:51:43 -07:00
3f512b0de2 [quant][qat] Ensure observers and fq modules are scriptable (#44749)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44749

Ensure fx module is scriptable after calling prepare_qat on it

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_qat_and_script

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23718380

fbshipit-source-id: abf63ffb21e707f7def8f6c88246877f5aded58c
2020-09-16 09:30:07 -07:00
b85568a54a [CI] Add profiling-te benchmarks. (#44756)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44756

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23719728

Pulled By: ZolotukhinM

fbshipit-source-id: 739940e02a6697fbed2a43a13682a6e5268f710b
2020-09-15 21:33:03 -07:00
d66520ba08 [TensorExpr] Fuser: try merging adjacent fusion groups. (#43671)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43671

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23360796

Pulled By: ZolotukhinM

fbshipit-source-id: 60ec318fe77ae9f2c821d9c4d106281845266e0f
2020-09-15 21:31:02 -07:00
2efc618f19 lr_schedule.py redundant code (#44613)
Summary:
The subclass sets "self.last_epoch" when this is set in the parent class's init function. Why would we need to set last_epoch twice? I think calling "super" resets last_epoch anyway, so I am not sure why we would want to include this in the subclass. Am I missing something?

For the record, I am just a Pytorch enthusiast. I hope my question isn't totally silly.

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44613

Reviewed By: albanD

Differential Revision: D23691770

Pulled By: mrshenli

fbshipit-source-id: 080d9acda86e1a2bfaafe2c6fcb8fc1544f8cf8a
2020-09-15 20:28:39 -07:00
2c1b215b48 [fx] remove delegate, replace with tracer (#44566)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44566

The Delegate objects were confusing. They were supposed to be a way to
configure how tracing works, but in some cases they appeared necessary
for constructing graphs, which was not true. This makes the organization
clearer by removing Delegate and moving its functionality into a Tracer class,
similar to how pickle has a Pickler class.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23683177

Pulled By: zdevito

fbshipit-source-id: 7605a34e65dfac9a487c0bada39a23ca1327ab00
2020-09-15 16:52:22 -07:00
993b4651fd Convert num_kernels to int64 before calling into CUDA GET_BLOCKS (#44688)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44688

this fixes https://github.com/pytorch/pytorch/issues/44472

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D23699819

Pulled By: soulitzer

fbshipit-source-id: 7ecfe78d09344178d1e6c7e1503417feb6beff6c
2020-09-15 15:10:55 -07:00
fb085d90e3 Revert D23583017: move rebuild buckets from end of first iteration to beginning of second iteration
Test Plan: revert-hammer

Differential Revision:
D23583017 (f5d231d593)

Original commit changeset: ef67f79437a8

fbshipit-source-id: fd914b7565aba6a5574a32b31403525abb80ff07
2020-09-15 15:10:52 -07:00
26a91a9f04 [WIP][JIT] Add benchmarking support of NV Fuser with FP16 dtype support (#44101)
Summary:
Modified files in `benchmarks/tensorexpr` to add support for NVIDIA's Fuser for the jit compiler.

This support has some modifications besides adding an option to support the NVIDIA fuser:

* Adds FP16 Datatype support
* Fixes SOL/Algo calculations to generally use the data type instead of being fixed to 4 bytes
* Adds IR printing and kernel printing knobs
* Adds a knob `input_iter` to create ranges of inputs currently only for reductions
* Adds further reduction support for Inner and Outer dimension reductions that are compatible with the `input_iter` knob.
* Added `simple_element`, `reduce2d_inner`, and `reduce2d_outer` to isolate performance on elementwise and reduction operations in the most minimal fashion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44101

Reviewed By: ngimel

Differential Revision: D23713658

Pulled By: bertmaher

fbshipit-source-id: d6b83cfab559aefe107c23b3c0f2df9923b3adc1
2020-09-15 15:10:49 -07:00
2f4c31ce3a [jit] Speed up saving in case of many classes (#44589)
Summary:
There's an annoying O(N^2) in the module export logic that makes saving some models (if they have many classes) take an eternity.

I'm not familiar enough with this code to properly untangle the deps and make it a pure hash lookup. So I just added a side lookup table for raw pointers. It's still quadratic, but it's O(num_classes^2) instead of O(num_classes * num_references), which already gives huge savings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44589

Test Plan:
Tested with one of the offending models - just loading a saving a Torchscript file:

```
Before:
load 1.9239683151245117
save 165.74712467193604

After:
load 1.9409027099609375
save 1.4711427688598633
```

Reviewed By: suo

Differential Revision: D23675278

Pulled By: dzhulgakov

fbshipit-source-id: 8f3fa7730941085ea20d9255b49a149ac1bf64fe
2020-09-15 15:10:45 -07:00
285ba0d068 Enable fp16 for UniformFill (#44540)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44540

Support output type to be fp16 for UniformFill

Reviewed By: jianyuh

Differential Revision: D23558030

fbshipit-source-id: 53a5b2c92cfe78cd11f55e6ee498e1bd682fe4a1
2020-09-15 15:09:18 -07:00
69839ea3f6 [NNC] make inlining immediate (take 3) (#44231)
Summary:
This is a reup https://github.com/pytorch/pytorch/issues/43885 with an extra commit which should fix the bugs that caused it to be reverted. Read that for general context.

The issue here was that we were still using the side maps `tensor_to_stmt_` and `stmt_to_tensor_` which get invalidated by any transform of the IR (rather than just any transform that isn't computeInline). I added a comment about this but didn't actually address our usages of it.

I've removed these maps and changed the `getLoopBodyFor` and `getLoopStatementsFor` helpers to search the root stmt directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44231

Reviewed By: albanD

Differential Revision: D23689688

Pulled By: nickgg

fbshipit-source-id: 1c6009a880f8c0cebf2300fd06b5cc9322bffbf9
2020-09-15 11:12:24 -07:00
8df0400a50 Fix fallback graph in specialize autogradzero (#44654)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44654

Previously we weren't creating a fallback graph as intended in specialize autograd zero, so if a Tensor failed one of our undefinedness checks we would run the backward normally without reprofiling & optimizing.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23691764

Pulled By: eellison

fbshipit-source-id: 10c6fa79518c84a6f5ef2bfbd9ea10843af751eb
2020-09-15 11:12:20 -07:00
4ce6af35c4 Enable fp16 for CUDA SparseLengthsSum/Mean (#44089)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44089

Add support for fp16 as the input type in the SparseLengthsSum/Mean caffe2 operators

Reviewed By: xianjiec

Differential Revision: D23436877

fbshipit-source-id: 02fbef2fde17d4b0abea9ca5d17a36aa989f98a0
2020-09-15 11:10:54 -07:00
07cba8b1fc Run vmap tests in CI (#44656)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44656

All this time, test_vmap wasn't running in the CI. Fortunately all the
tests pass locally for me. h/t to anjali411 for pointing this out.

Test Plan: - Wait for CI

Reviewed By: anjali411

Differential Revision: D23689355

Pulled By: zou3519

fbshipit-source-id: 543c3e6aed0af77bfd6ea7a7549337f8230e3d32
2020-09-15 10:59:00 -07:00
d62994a94d ci: Add anaconda pruning to CI pipeline (#44651)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44651

Adds pruning for our anaconda channels (pytorch-nightly, pytorch-test)
into our CI pipeline so that it gets run on a more consistent basis.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D23692851

Pulled By: seemethere

fbshipit-source-id: fa69b506b73805bf2ffbde75d221aef1ee3f753e
2020-09-15 10:51:05 -07:00
1d733d660d [docs] torch.min/max: remove incorrect warning from docs (#44615)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44195

cc: mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44615

Reviewed By: ngimel

Differential Revision: D23703525

Pulled By: mruberry

fbshipit-source-id: 471ebd764be667e29c03a30f3ef341440adc54d2
2020-09-15 10:42:08 -07:00
6bc77f4d35 Use amax/maximum instead of max in optimizers (#43797)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43797

Reviewed By: malfet

Differential Revision: D23406641

Pulled By: mruberry

fbshipit-source-id: 0cd075124aa6533b21375fe2c90c44a5d05ad6e6
2020-09-15 10:39:40 -07:00
9c364da9b9 Fix doc builds for bool kwargs (#44686)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43669

The bool will still link to https://docs.python.org/3/library/functions.html#bool.
Tested using bmm:
![image](https://user-images.githubusercontent.com/16063114/93156438-2ad11080-f6d6-11ea-9b81-96e02ee68d90.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44686

Reviewed By: ngimel

Differential Revision: D23703823

Pulled By: mruberry

fbshipit-source-id: 7286afad084f5ab24a1254ad84e5d01907781c85
2020-09-15 10:34:58 -07:00
f5d231d593 move rebuild buckets from end of first iteration to beginning of second iteration (#44326)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44326

Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration
ghstack-source-id: 112011490

Test Plan: unit tests

Reviewed By: mrshenli

Differential Revision: D23583017

fbshipit-source-id: ef67f79437a820d9b5699b651803622418499a83
2020-09-15 09:51:33 -07:00
5f692a67db qat conv_fused.py: one more patch for forward compatibility (#44671)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44671

See comments inline - the FC between
https://github.com/pytorch/pytorch/pull/38478 and
https://github.com/pytorch/pytorch/pull/38820 was broken,
patching it.

Test Plan: Verified with customer hitting the issue that this fixes their issue.

Reviewed By: jerryzh168

Differential Revision: D23694029

fbshipit-source-id: a5e1733334e22305a111df750b190776889705d0
2020-09-15 09:43:29 -07:00
72b5665c4f Upgrade oneDNN (mkl-dnn) to v1.6 (#44706)
Summary:
- Bump oneDNN (mkl-dnn) to 1.6 for bug fixes
    - Fixes https://github.com/pytorch/pytorch/issues/42446. RuntimeError: label is redefined for convolutions with large filter size on Intel AVX512
    - Implemented workaround for internal compiler error when building oneDNN with Microsoft Visual Studio 2019 (https://github.com/pytorch/pytorch/pull/43169)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44706

Reviewed By: ngimel

Differential Revision: D23705967

Pulled By: albanD

fbshipit-source-id: 65e8fecc52a76c9f3324403a8b60ffa8a8948bc6
2020-09-15 09:30:01 -07:00
7036e91abd Revert D23323486: DPP Async Tracing
Test Plan: revert-hammer

Differential Revision:
D23323486 (71673b31f9)

Original commit changeset: 4b6ca6c0e320

fbshipit-source-id: c6bd6d277aca070bef2de3522c2a60e23b4395ad
2020-09-15 01:19:23 -07:00
2435d941b1 Fix FP16 fastAtomicAdd for one case where tensor start address is not 32 bit aligned (#44642)
Summary:
For https://github.com/pytorch/pytorch/issues/44206 and https://github.com/pytorch/pytorch/issues/42218, I'd like to update trilinear interpolate backward and grid_sample backward to use `fastAtomicAdd`.

As a prelude, I spotted a UB risk in `fastAtomicAdd`. I think the existing code incurs a misaligned `__half2` atomicAdd when `index` is odd and `tensor` is not 32-bit aligned (`index % 2 == 1` and `reinterpret_cast<std::uintptr_t>(tensor) % sizeof(__half2) != 0`). In this case we think we're `!low_bit` and go down the `!low_bit` code path, but in fact we are `low_bit`. It appears the original [fastAtomicAdd PR](https://github.com/pytorch/pytorch/pull/21879#discussion_r295040377)'s discussion did not consider that case explicitly.

I wanted to push my tentative fix for discussion ASAP; cc jjsjann123 and mkolod as the original authors of `fastAtomicAdd`. (I'm also curious why we need to `reinterpret_cast<std::uintptr_t>(tensor...` for the address modding, but that's minor.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44642

Reviewed By: mruberry

Differential Revision: D23699820

Pulled By: ngimel

fbshipit-source-id: 0db57150715ebb45e6a1fb36897e46f00d61defd
2020-09-14 22:07:29 -07:00
2fd142a2ef Small clarification to amp gradient penalty example (#44667)
Summary:
requested by https://discuss.pytorch.org/t/what-is-the-correct-way-of-computing-a-grad-penalty-using-amp/95827/3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44667

Reviewed By: mruberry

Differential Revision: D23692768

Pulled By: ngimel

fbshipit-source-id: 83c61b94e79ef9f86abed2cc066f188dce0c8456
2020-09-14 21:56:09 -07:00
aedce773ed Deleted docker images for rocm 3.3 and rocm 3.5 (#44672)
Summary:
jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44672

Reviewed By: malfet

Differential Revision: D23694924

Pulled By: xw285cornell

fbshipit-source-id: 0066dc4b36c366588e1f309c82e7e1dc2ce8eec1
2020-09-14 21:50:41 -07:00
c71ce10cfc add dilation to transposeconv's _output_padding method (#43793)
Summary:
This PR adds dilation to _ConvTransposeNd._output_padding method and tests using a bunch of different sized inputs.
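
A small sketch of the path this affects: `forward(..., output_size=...)` goes through `_output_padding`, which now accounts for dilation when recovering the requested spatial size:

```
import torch
import torch.nn as nn

x = torch.randn(1, 3, 10, 10)
down = nn.Conv2d(3, 8, kernel_size=3, stride=2, dilation=2)
up = nn.ConvTranspose2d(8, 3, kernel_size=3, stride=2, dilation=2)

# Asking for the original spatial size exercises _output_padding with dilation.
out = up(down(x), output_size=x.shape[-2:])
print(out.shape)  # torch.Size([1, 3, 10, 10])
```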

Fixes https://github.com/pytorch/pytorch/issues/14272

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43793

Reviewed By: zou3519

Differential Revision: D23493313

Pulled By: ezyang

fbshipit-source-id: bca605c428cbf3a97d3d24316d8d7fde4bddb307
2020-09-14 21:28:27 -07:00
ed862d3682 Split CUDA_NVCC_FLAGS by space (#44603)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44603

Reviewed By: albanD

Differential Revision: D23692320

Pulled By: ezyang

fbshipit-source-id: 6a63d94ab8b88e7a82f9d65f03523d6ef639c754
2020-09-14 20:25:37 -07:00
2c4b4aa81b Revert D23494065: Refactor CallbackManager as a nested class of RecordFunction.
Test Plan: revert-hammer

Differential Revision:
D23494065 (63105fd5b1)

Original commit changeset: 416d5bf6c942

fbshipit-source-id: 3b1ec928e3db0cc203bb63ec4db3da1584b9b884
2020-09-14 19:43:50 -07:00
e7d782e724 [JIT] Add property support for ScriptModules (#42390)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42390

**Summary**
This commit extends support for properties to include
ScriptModules.
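
A minimal sketch, assuming properties referenced from scripted methods are compiled along with the module:

```
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.base = 2

    @property
    def doubled(self) -> int:
        return 2 * self.base

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.doubled  # property access is scripted

m = torch.jit.script(M())
print(m(torch.zeros(1)))  # tensor([4.])
```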

**Test Plan**
This commit adds a unit test that has a ScriptModule with
a user-defined property.

`python test/test_jit_py3.py TestScriptPy3.test_module_properties`

Test Plan: Imported from OSS

Reviewed By: eellison, mannatsingh

Differential Revision: D22880298

Pulled By: SplitInfinity

fbshipit-source-id: 74f6cb80f716084339e2151ca25092b6341a1560
2020-09-14 18:49:21 -07:00
63105fd5b1 Refactor CallbackManager as a nested class of RecordFunction. (#44645)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44645

Moved CallbackManager as a nested class of RecordFunction to allow private access to the call handles and context without exposing them publicly. It still hides the singleton instance of the CallbackManager inside record_function.cpp.

Test Plan: Unit tests.

Reviewed By: ilia-cher

Differential Revision: D23494065

fbshipit-source-id: 416d5bf6c9426e112877fbd233a6f4dff7bef455
2020-09-14 18:44:40 -07:00
71673b31f9 DPP Async Tracing (#44252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44252

Add tracing to the DPP client. Because DPP requests are async, we need to be able to start a trace event in one thread and potentially end it in a different thread. RecordFunction and LibgpumonObserver previously assumed each trace event starts and finishes in the same thread, so they used a thread-local context to track enter and exit callbacks. Async events break this assumption. This change attaches the event context to the RecordFunction object so we do not need to use a thread-local context.

Test Plan:
Tested with dpp perf test and able to collect trace.

{F307824044}

Reviewed By: ilia-cher

Differential Revision: D23323486

fbshipit-source-id: 4b6ca6c0e32028fb38a476cd1f44c17a001fc03b
2020-09-14 18:43:14 -07:00
e107ef5ca2 Add type annotations for torch.nn.utils.* (#43080)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43013

Redo of gh-42954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43080

Reviewed By: albanD

Differential Revision: D23681334

Pulled By: malfet

fbshipit-source-id: 20ec78aa3bfecb7acffc12eb89d3ad833024394c
2020-09-14 17:52:37 -07:00
551494b01d [JIT] Fix torch.tensor for empty multidimensional-typed lists (#44652)
Summary:
We were hitting an assert error when you passed in an empty `List[List[int]]` - this fixes that error by not recursing into 0-element tensors.
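
A small repro-style sketch of the fixed case:

```
import torch
from typing import List

@torch.jit.script
def make(vals: List[List[int]]):
    return torch.tensor(vals)

# Passing an empty List[List[int]] previously tripped an assert; it now
# constructs an empty tensor instead of crashing.
print(make([]))
```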

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44652

Reviewed By: ZolotukhinM

Differential Revision: D23688247

Pulled By: eellison

fbshipit-source-id: d48ea24893044fae96bc39f76c0f1f9726eaf4c7
2020-09-14 17:28:23 -07:00
2254e5d976 Add note comments to enforce nondeterministic alert documentation (#44140)
Summary:
This PR fulfills Ed's request (https://github.com/pytorch/pytorch/pull/41692#discussion_r473122076) for a strategy to keep the functions that have nondeterministic alerts fully documented.

Part of https://github.com/pytorch/pytorch/issues/15359

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44140

Reviewed By: colesbury

Differential Revision: D23644469

Pulled By: ezyang

fbshipit-source-id: 60936ccced13f071c620f7d25ef6dcbca338de7f
2020-09-14 16:48:22 -07:00
a91c2be2a9 Automated submodule update: FBGEMM (#44647)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 1d710393d5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44647

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: dskhudia

Differential Revision: D23684528

fbshipit-source-id: 316ff2e448707a6e5a83248c9b22e58118bc8741
2020-09-14 16:43:59 -07:00
686e281bcf Updates div to perform true division (#42907)
Summary:
This PR:

- updates div to perform true division
- makes torch.true_divide an alias of torch.div

This follows on work in previous PyTorch releases that first deprecated div performing "integer" or "floor" division, then prevented it by throwing a runtime error.
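
The user-visible change, in short:

```
import torch

a = torch.tensor([5, 3])
b = torch.tensor([2, 2])
print(torch.div(a, b))           # tensor([2.5000, 1.5000]) -- true division, even for integer inputs
print(torch.true_divide(a, b))   # identical: now an alias of torch.div
print(torch.floor_divide(a, b))  # tensor([2, 1]) -- floor division stays available explicitly
```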

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42907

Reviewed By: ngimel

Differential Revision: D23622114

Pulled By: mruberry

fbshipit-source-id: 414c7e3c1a662a6c3c731ad99cc942507d843927
2020-09-14 15:50:38 -07:00
e594c30bc2 [quant][graphmode][fx] Support fp16 dynamic quantization for linear (#44582)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44582

Test Plan:
test_quantize_fx.py

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23665974

fbshipit-source-id: 19ba6c61a9c77ef570b00614016506e9a2729f7c
2020-09-14 15:43:08 -07:00
43406e218a [ONNX] Update ONNX shape inference (#43929)
Summary:
* Support sequence type (de)serialization, enabling onnx shape inference on sequence nodes.
* Fix shape inference with block input/output: e.g. Loop and If nodes.
* Fix bugs in symbolic discovered by coverage of onnx shape inference.
* Improve debuggability: added more jit logs. For simplicity, the default log level, when jit log is enabled, will not dump ir graphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43929

Reviewed By: albanD

Differential Revision: D23674604

Pulled By: bzinodev

fbshipit-source-id: ab6aacb16d0e3b9a4708845bce27c6d65e567ba7
2020-09-14 15:36:19 -07:00
89aed1a933 [vulkan][op] avg_pool2d (#42675)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42675

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22978765

Pulled By: IvanKobzarev

fbshipit-source-id: 64938d8965aeeb408dd5c40d688eca13fb7ebb8a
2020-09-14 15:07:34 -07:00
8f327cd6c5 [vulkan][op] add.Scalar, mul.Scalar (#42674)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42674

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22978763

Pulled By: IvanKobzarev

fbshipit-source-id: 9fd97d394205e3fa51992ee99d5bfafc33f75efa
2020-09-14 15:03:22 -07:00
f7cfbac89b [ONNX] Update len symbolic (#43824)
Summary:
Update len symbolic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43824

Reviewed By: izdeby

Differential Revision: D23575765

Pulled By: bzinodev

fbshipit-source-id: 0e5c8c8d4a5297f65e2dc43168993350f784c776
2020-09-14 15:00:44 -07:00
da11d932bc [ONNX] Update arange op to support out argument (#43777)
Summary:
Update arange op to support out argument

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43777

Reviewed By: albanD

Differential Revision: D23674583

Pulled By: bzinodev

fbshipit-source-id: 6fb65e048c6b1a551569d4d2a33223522d2a960c
2020-09-14 14:56:17 -07:00
62ebad4ff9 [ONNX] Export new_empty and new_zeros (#43506)
Summary:
Adding symbolic to export new_empty and new_zeros

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43506

Reviewed By: houseroad

Differential Revision: D23674574

Pulled By: bzinodev

fbshipit-source-id: ecfcdbd4845fd3a3c6618a060129fbeee4df5dd7
2020-09-14 14:48:34 -07:00
d0a56cab07 [quant] Fixing the output shape for the linear (#44513)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44513

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23637508

Pulled By: z-a-f

fbshipit-source-id: d19d4c1b234b05e8d9813e864863d937b6c35bf5
2020-09-14 14:31:00 -07:00
742654d1b6 [quant] ConvTranspose1d / ConvTranspose2d (#40371)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40371

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D22158981

Pulled By: z-a-f

fbshipit-source-id: defbf6fbe730a58d5b155dcb2460dd969797215c
2020-09-14 14:25:06 -07:00
84949672bf Fix exception chaining in test/ (#44193)
Summary:
## Motivation
This PR fixes https://github.com/pytorch/pytorch/issues/43770 and is the continuation of https://github.com/pytorch/pytorch/issues/43836.

## Description of the change
This PR fixes exception chaining only in files under `test/` where appropriate.
To fix exception chaining, I used either (see the sketch after this list):
1. `raise new_exception from old_exception` where `new_exception` itself seems not descriptive enough to debug or `old_exception` delivers valuable information.
2. `raise new_exception from None` where raising both of `new_exception` and `old_exception` seems a bit noisy and redundant.
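
A minimal sketch of the two patterns, with hypothetical exceptions:

```
def load_port(cfg):
    try:
        return cfg["port"]
    except KeyError as e:
        # 1. `raise new from old`: the KeyError shows exactly which key was missing.
        raise ValueError("invalid config: no port given") from e

def parse_int(text):
    try:
        return int(text)
    except ValueError:
        # 2. `raise new from None`: the original ValueError adds only noise here.
        raise TypeError(f"expected an integer, got {text!r}") from None

try:
    load_port({})
except ValueError as e:
    print(repr(e), "| caused by:", repr(e.__cause__))

try:
    parse_int("abc")
except TypeError as e:
    print(repr(e), "| cause suppressed:", e.__cause__)
```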

## List of lines containing `raise` in `except` clause:
I wrote [this simple script](https://gist.github.com/akihironitta/4223c1b32404b36c1b349d70c4c93b4d) using [ast](https://docs.python.org/3.8/library/ast.html#module-ast) to list lines where `raise`ing in `except` clause.

- [x] f8f35fddd4/test/test_cpp_extensions_aot.py (L16)
- [x] f8f35fddd4/test/test_jit.py (L2503)
- [x] f8f35fddd4/test/onnx/model_defs/word_language_model.py (L22)
- [x] f8f35fddd4/test/onnx/verify.py (L73)
- [x] f8f35fddd4/test/onnx/verify.py (L110)
- [x] f8f35fddd4/test/onnx/test_verify.py (L31)
- [x] f8f35fddd4/test/distributed/test_c10d.py (L255)
- [x] f8f35fddd4/test/distributed/test_c10d.py (L2992)
- [x] f8f35fddd4/test/distributed/test_c10d.py (L3025)
- [x] f8f35fddd4/test/distributed/test_c10d.py (L3712)
- [x] f8f35fddd4/test/distributed/test_distributed.py (L3180)
- [x] f8f35fddd4/test/distributed/test_distributed.py (L3198)
- [x] f8f35fddd4/test/distributed/test_data_parallel.py (L752)
- [x] f8f35fddd4/test/distributed/test_data_parallel.py (L776)
- [x] f8f35fddd4/test/test_type_hints.py (L151)
- [x] f8f35fddd4/test/test_jit_fuser.py (L771)
- [x] f8f35fddd4/test/test_jit_fuser.py (L773)
- [x] f8f35fddd4/test/test_dispatch.py (L105)
- [x] f8f35fddd4/test/test_distributions.py (L4738)
- [x] f8f35fddd4/test/test_nn.py (L9824)
- [x] f8f35fddd4/test/test_namedtensor.py (L843)
- [x] f8f35fddd4/test/test_jit_fuser_te.py (L875)
- [x] f8f35fddd4/test/test_jit_fuser_te.py (L877)
- [x] f8f35fddd4/test/test_dataloader.py (L31)
- [x] f8f35fddd4/test/test_dataloader.py (L43)
- [x] f8f35fddd4/test/test_dataloader.py (L365)
- [x] f8f35fddd4/test/test_dataloader.py (L391)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44193

Reviewed By: albanD

Differential Revision: D23681529

Pulled By: malfet

fbshipit-source-id: 7c2256ff17334625081137b35baeb816c1e53e0b
2020-09-14 14:20:16 -07:00
a188dbdf3f Check for index-rank consistency in FunctionInliner (#44561)
Summary:
When caller / callee pairs are inserted into the mapping, verify that
the arity of the buffer access is consistent with its declared rank.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44561

Test Plan: CI, test_tensorexpr --gtest_filter=TensorExprTest.DetectInlineRankMismatch

Reviewed By: albanD

Differential Revision: D23684342

Pulled By: asuhan

fbshipit-source-id: dd3a0cdd4c2492853fa68381468e0ec037136cab
2020-09-14 14:07:22 -07:00
b5dd6e3e61 split torch.testing._internal.* and add type checking for torch.testing._internal.common_cuda (#44575)
Summary:
First step to fix https://github.com/pytorch/pytorch/issues/42969.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44575

Reviewed By: malfet

Differential Revision: D23668740

Pulled By: walterddr

fbshipit-source-id: eeb3650b1780aaa5727b525b4e6182e1bc47a83f
2020-09-14 14:04:02 -07:00
cfba33bde3 Fix the ELU formula in the docs (#43764)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43389.

This PR replaces the old ELU formula from the docs that yields wrong results for negative alphas with the new one that fixes the issue and relies on the cases notation which makes the formula more straightforward.
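
For reference, the cases-style formula now used in the docs:

```
\text{ELU}(x) =
\begin{cases}
x, & \text{if } x > 0 \\
\alpha \left( \exp(x) - 1 \right), & \text{if } x \leq 0
\end{cases}
```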

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43764

Reviewed By: ailzhang

Differential Revision: D23425532

Pulled By: albanD

fbshipit-source-id: d0931996e5667897d926ba4fc7a8cc66e8a66837
2020-09-14 14:01:56 -07:00
9d4943daaf [quant] conv_transpose1d / conv_transpose2d (#40370)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40370

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D22158979

Pulled By: z-a-f

fbshipit-source-id: f5cb812c9953efa7608f06cf0188de447f73f358
2020-09-14 13:45:28 -07:00
ecac8294a6 enable type checking for torch._classes (#44576)
Summary:
Fix https://github.com/pytorch/pytorch/issues/42980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44576

Reviewed By: malfet

Differential Revision: D23668741

Pulled By: walterddr

fbshipit-source-id: 4201ea3187a40051ebff53d28c8e571ea1a61126
2020-09-14 13:26:46 -07:00
ad7a2eb1c9 Simplify nested Min and Max patterns. (#44142)
Summary:
Improve simplification of nested Min and Max patterns.

Specifically, handles the following pattern simplications:
  * `Max(A, Max(A, Const)) => Max(A, Const)`
  * `Max(Min(A, B), Min(A, C)) => Min(A, Max(B, C))`
  * `Max(Const, Max(A, OtherConst) => Max(A, Max(Const, OtherConst))`
     - This case can have an arbitrarily long chain of Max ops. For example: `Max(5, Max(x, Max(y, Max(z, 8)))) => Max(Max(Max(x, 8), y), z)`

Similarly, for the case of Min as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44142

Reviewed By: albanD

Differential Revision: D23644486

Pulled By: navahgar

fbshipit-source-id: 42bd241e6c2af820566744c8494e5dee172107f4
2020-09-14 13:24:46 -07:00
199435af90 Update median doc to note return value of even-sized input (#44562)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44562

Add a note that torch.median returns the smaller of the two middle elements for even-sized input and refer user to torch.quantile for the mean of the middle values.
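
Concretely:

```
import torch

x = torch.tensor([1., 2., 3., 4.])
print(torch.median(x))         # tensor(2.) -- smaller of the two middle values
print(torch.quantile(x, 0.5))  # tensor(2.5000) -- mean of the two middle values
```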

fixes https://github.com/pytorch/pytorch/issues/39520

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23657208

Pulled By: heitorschueroff

fbshipit-source-id: 2747aa652d1e7f10229d9299b089295aeae092c2
2020-09-14 13:18:33 -07:00
a475613d1d [static runtime] Swap to out-variant compatible nodes (#44127)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44127

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604306

Pulled By: bwasti

fbshipit-source-id: 18ccfb9b466b822e28130be3d5c4fae36c76820b
2020-09-14 12:38:25 -07:00
856510c96d [JIT] Dont optimize shape info in batch_mm (#44565)
Summary:
We run remove profile nodes and specialize types before batch_mm, so we cannot run peepholes on the type information of tensors since these properties have not been guarded to be guaranteed to be correct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44565

Reviewed By: albanD

Differential Revision: D23661538

Pulled By: eellison

fbshipit-source-id: 0dd23a65714f047f49b4db4ec582b21870925fe1
2020-09-14 12:34:20 -07:00
e261e0953e Fix centos8 gcc (#44644)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44198 properly this time

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44644

Reviewed By: albanD

Differential Revision: D23684909

Pulled By: malfet

fbshipit-source-id: cea6f6e2ae28138f6b93a6513d1abd36d14ae573
2020-09-14 12:28:09 -07:00
ace81b6794 Remove an extra empty line in the warning comments. (#44622)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44622

Remove an extra empty line in the warning comments.Remove an extra empty line.

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D23674070

fbshipit-source-id: 4ee570590c66a72fb808e9ee034fb773b833efcd
2020-09-14 11:15:35 -07:00
21a09ba94d Fix lerp.cu bug when given discontiguous out tensor (#44559)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44559

Please refer to the discussion at the bottom of https://github.com/pytorch/pytorch/pull/43541 about the bug.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23655403

Pulled By: heitorschueroff

fbshipit-source-id: 10e4ce5c2fe7bf6e95bcfac4033202430292b03f
2020-09-14 11:03:02 -07:00
95a69a7d09 adds list_gpu_processes function (#44616)
Summary:
per title, to make it easier to track the creation of stray contexts:
```
python -c "import torch; a=torch.randn(1, device='cuda'); print(torch.cuda.memory.list_gpu_processes(0)); print(torch.cuda.memory.list_gpu_processes(1))"
GPU:0
process      79749 uses      601.000 MB GPU memory
GPU:1
no processes are running
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44616

Reviewed By: mruberry

Differential Revision: D23675739

Pulled By: ngimel

fbshipit-source-id: ffa14cad9d7144e883de13b1c2c6817bd432f53a
2020-09-14 09:54:32 -07:00
105132b891 Move ONNX circle ci build to torch and remove all caffe2 CI job/workflows (#44595)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44595

Reviewed By: seemethere

Differential Revision: D23670280

Pulled By: walterddr

fbshipit-source-id: b32633912f6c8b4606be36b90f901e636567b355
2020-09-14 09:50:13 -07:00
bd257a17a1 Add HIP/ROCm version to collect_env.py (#44106)
Summary:
This adds HIP version info to the `collect_env.py` output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44106

Reviewed By: VitalyFedyunin

Differential Revision: D23652341

Pulled By: zou3519

fbshipit-source-id: a1f5bce8da7ad27a1277a95885934293d0fd43c5
2020-09-14 09:19:18 -07:00
7040a070e3 [torch] Minor: Avoid ostreamstring in Operator's canonicalSchemaString() (#44442)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44442

I noticed lock contention on startup as lookupByLiteral() was
calling registerPendingOperators() - some calls were holding the
lock for 10+ ms, as operators were being registered.

canonicalSchemaString() was using ostringstream, which isn't typically
particularly fast (partly because of C++ spec locale requirements).
If we replace it with regular C++ string appends, it's somewhat faster
(which isn't hard when comparing with stringstream; albeit a bit
more codegen)

Over the first minute or so, this cuts out 1.4 seconds under the
OperatorRegistry lock (as part of registerPendingOperators) in the
first couple minutes of run time (mostly front-loaded) when running
sync sgd.

As an example, before:
   registerPendingOperators 12688 usec for 2449 operators
After:
   registerPendingOperators 6853 usec for 2449 operators
ghstack-source-id: 111862971

Test Plan: buck test mode/dev-nosan caffe2/test/cpp/...

Reviewed By: ailzhang

Differential Revision: D23614515

fbshipit-source-id: e712f9dac5bca0b1876e11fb8f0850402f03873a
2020-09-14 08:24:16 -07:00
c68a99bd61 [numpy] Add torch.exp2 (#44184)
Summary:
Reference https://github.com/pytorch/pytorch/issues/42515

TODO
* [x] Add tests
* [x] Add docs
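
A minimal usage sketch of the new op (it matches `numpy.exp2`; the printed values are illustrative):
```python
import torch

x = torch.tensor([0.0, 1.0, 3.0, -1.0])
print(torch.exp2(x))      # tensor([1.0000, 2.0000, 8.0000, 0.5000])
print(torch.pow(2.0, x))  # equivalent reference computation: 2 ** x, elementwise
```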

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44184

Reviewed By: ngimel

Differential Revision: D23674237

Pulled By: mruberry

fbshipit-source-id: 7f4fb1900fad3051cd7fc9d3d7f6d985c5fb093c
2020-09-14 04:05:37 -07:00
870f647040 Automated submodule update: FBGEMM (#44581)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 0725301da5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44581

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: dskhudia, VitalyFedyunin

Differential Revision: D23665173

fbshipit-source-id: 03cee22335eef0517e561827795bbe2036942ea0
2020-09-13 21:26:56 -07:00
68a5c361ae Adding Adapative Autorange to benchmark utils. (#44607)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44219

Rebasing https://github.com/pytorch/pytorch/pull/44288 and fixing the git history.

This allows users to benchmark code without having to specify how long to run the benchmark. It runs the benchmark until the variance (IQR / median) is low enough that we can be confident in the measurement.
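
A minimal sketch, assuming the API landed as `Timer.adaptive_autorange` (the name and defaults may differ slightly from this PR):
```python
import torch
from torch.utils.benchmark import Timer

t = Timer(stmt="x @ x", setup="x = torch.randn(64, 64)")
# Runs until IQR / median of the measurements falls below a threshold,
# instead of requiring the caller to pick a fixed run time.
measurement = t.adaptive_autorange()
print(measurement)
```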

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44607

Test Plan: There are unit tests, and we manually tested using Examples posted in git.

Reviewed By: robieta

Differential Revision: D23671208

Pulled By: bitfort

fbshipit-source-id: d63184290b88b26fb81c2452e1ae701c7d513d12
2020-09-13 20:55:40 -07:00
8daaa3bc7e Fix latex error in heaviside docs (#44481)
Summary:
This fixes a `katex` error I was getting trying to build the docs:
```
ParseError: KaTeX parse error: Undefined control sequence: \0 at position 55: …gin{cases}
```

This failure was introduced in https://github.com/pytorch/pytorch/issues/42523.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44481

Reviewed By: colesbury

Differential Revision: D23627700

Pulled By: mruberry

fbshipit-source-id: 9cc09c687a7d9349da79a0ac87d6c962c9cfbe2d
2020-09-13 16:42:19 -07:00
fe26102a0e Enable TE in test_jit.py (#44200)
Summary:
Enable TE in test_jit.py and adjust/fix tests accordingly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44200

Reviewed By: SplitInfinity

Differential Revision: D23673624

Pulled By: Krovatkin

fbshipit-source-id: 5999725c7aacc6ee77885eb855a41ddfb4d9a8d8
2020-09-13 15:58:20 -07:00
7862827269 [pytorch] Add variadic run_method for lite intepreter (#44337)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44337

Add a new run_method to mobile Module which is variadic (takes any number of arguments) to match full jit.
ghstack-source-id: 111909068

Test Plan: Added new unit test to test_jit test suite

Reviewed By: linbinyu, ann-ss

Differential Revision: D23585763

fbshipit-source-id: 007cf852290f03615b78c35aa6f7a21287ccff9e
2020-09-13 13:26:30 -07:00
bcf97b8986 [JIT] Cleanup some places where we log graphs in executors. (#44588)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44588

1) SOURCE_DUMP crashes when invoked on a backward graph since
   `prim::GradOf` nodes can't be printed as sources (they don't have
   schema).
2) Dumping the graph each time we execute an optimized plan produces lots
   of output in tests where we run the graph multiple times (e.g.
   benchmarks). Outputting that at the lowest verbosity level seems
   like overkill.
3) Duplicated log statement is removed.

Differential Revision: D23666812

Test Plan: Imported from OSS

Reviewed By: bertmaher

Pulled By: ZolotukhinM

fbshipit-source-id: b9a30e34fd39c85f3e13c3f1e3594e157e1c130f
2020-09-13 11:31:02 -07:00
82da6b3702 [JIT] Fix jit-log verbosity selection logic. (#44587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44587

Currently it's skewed by one.

The following test demonstrates it:
```
$ cat test.py

import torch
def foo(a,b):
    return a*a*b
torch._C._jit_set_profiling_executor(True)
torch._C._jit_set_profiling_mode(True)
torch._C._jit_override_can_fuse_on_cpu(True)
torch._C._jit_set_texpr_fuser_enabled(True)
f = torch.jit.script(foo)
for _ in range(10):
    f(torch.rand(10), torch.rand(10))

$ cat test_logging_levels.sh

PYTORCH_JIT_LOG_LEVEL="tensorexpr_fuser"    python test.py 2>&1 | grep DUMP   >& /dev/null && echo OK || echo FAIL
PYTORCH_JIT_LOG_LEVEL="tensorexpr_fuser"    python test.py 2>&1 | grep UPDATE >& /dev/null && echo FAIL || echo OK
PYTORCH_JIT_LOG_LEVEL="tensorexpr_fuser"    python test.py 2>&1 | grep DEBUG  >& /dev/null && echo FAIL || echo OK

PYTORCH_JIT_LOG_LEVEL=">tensorexpr_fuser"   python test.py 2>&1 | grep DUMP   >& /dev/null && echo OK || echo FAIL
PYTORCH_JIT_LOG_LEVEL=">tensorexpr_fuser"   python test.py 2>&1 | grep UPDATE >& /dev/null && echo OK || echo FAIL
PYTORCH_JIT_LOG_LEVEL=">tensorexpr_fuser"   python test.py 2>&1 | grep DEBUG  >& /dev/null && echo FAIL || echo OK

PYTORCH_JIT_LOG_LEVEL=">>tensorexpr_fuser"  python test.py 2>&1 | grep DUMP   >& /dev/null && echo OK || echo FAIL
PYTORCH_JIT_LOG_LEVEL=">>tensorexpr_fuser"  python test.py 2>&1 | grep UPDATE >& /dev/null && echo OK || echo FAIL
PYTORCH_JIT_LOG_LEVEL=">>tensorexpr_fuser"  python test.py 2>&1 | grep DEBUG  >& /dev/null && echo OK || echo FAIL
```

Before this change:
```
OK
FAIL
OK
OK
OK
FAIL
OK
OK
OK
```

With this change everything passes.

Differential Revision: D23666813

Test Plan: Imported from OSS

Reviewed By: bertmaher

Pulled By: ZolotukhinM

fbshipit-source-id: 4adaa5a3d06deadf54eae014a0d76588cdc5e20a
2020-09-13 11:29:25 -07:00
6d4a605ce9 Fix bug simplifying if-then-else when it can be removed (#44462)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44462

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23671157

Pulled By: bertmaher

fbshipit-source-id: b9b92ad0de1a7bd9bc1fcac390b542d885d0ca58
2020-09-13 10:29:28 -07:00
7e91728f68 Deprecates calling linspace and logspace without setting steps explicitly (#43860)
Summary:
**BC-breaking note**

This change is BC-breaking for C++ callers of linspace and logspace if they were providing a steps argument that could not be converted to an optional.

**PR note**

This PR deprecates calling linspace and logspace without setting steps explicitly by:

- updating the documentation to warn that not setting steps is deprecated
- warning (once) when linspace and logspace are called without steps being specified

A test for this behavior is added to test_tensor_creation_ops. The warning only appears once per process, however, so the test would pass even if no warning were thrown. Ideally there would be a mechanism to force all warnings, including those from TORCH_WARN_ONCE, to trigger.
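
A minimal sketch of the deprecation, assuming the historical default of `steps=100` still applies at this point:
```python
import torch

torch.linspace(0, 1, steps=5)  # explicit steps: no warning
torch.linspace(0, 1)           # deprecated: warns once, falls back to steps=100
```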

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43860

Reviewed By: izdeby

Differential Revision: D23498980

Pulled By: mruberry

fbshipit-source-id: c48d7a58896714d184cb6ff2a48e964243fafc90
2020-09-13 06:09:19 -07:00
e703c17967 Revert D23584071: [dper3] Create dper LearningRate low-level module
Test Plan: revert-hammer

Differential Revision:
D23584071 (a309355be3)

Original commit changeset: f6656531b1ca

fbshipit-source-id: b0a93f4286053fb8576a70278edca3a7d89c722b
2020-09-12 20:45:30 -07:00
a309355be3 [dper3] Create dper LearningRate low-level module
Summary: As title; this will unblock migration of several modules that need learning rate functionality.

Test Plan:
```
buck test //dper3/dper3/modules/low_level_modules/tests:learning_rate_test
```

WIP: need to add more learning rate tests for the different policies

Reviewed By: yf225

Differential Revision: D23584071

fbshipit-source-id: f6656531b1caba38c3e3a7d6e16d9591563391e2
2020-09-12 15:33:29 -07:00
0743d013a6 fuse layernorm + quantize (#44232)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44232

enhance layernorm to optionally quantize its output
add fusion code to replace instances of layernorm +quantization

Test Plan:
tested layernorm
net_runner

P141557987

Reviewed By: venkatacrc

Differential Revision: D23510893

fbshipit-source-id: 32f57ba2090d35d86dcc951e0f3f6a8901ab3153
2020-09-12 13:32:33 -07:00
6f2c3c39d2 Add SNPE deps for caffe2 benchmark android binary
Summary:
Adding SNPE dependencies to caffe2_benchmark so that it can benchmark SNPE models on portal devices.

We also need to change ndk_libcxx to gnustl until SNPE is updated to work with the NDK.

Test Plan: Tested on top of the stack.

Reviewed By: linbinyu

Differential Revision: D23569397

fbshipit-source-id: a6281832804ed4fbb5a8406f436caeae1ff4fd2b
2020-09-12 12:34:56 -07:00
05c1f1d974 [ROCm] remove thrust workaround in ScanKernels (#44553)
Summary:
Remove ROCm workaround added in https://github.com/pytorch/pytorch/issues/39180.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44553

Reviewed By: mruberry

Differential Revision: D23663988

Pulled By: ngimel

fbshipit-source-id: 71b2fd7db006d9d3459b908a996c4d96838ba742
2020-09-11 21:12:43 -07:00
d191caa3e7 Cleanup workarounds for compiler bug of ROCm (#44579)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44579

Reviewed By: mruberry

Differential Revision: D23664481

Pulled By: ngimel

fbshipit-source-id: ef698f26455e5827c5b5c0e5d42a1c95bcac8af4
2020-09-11 21:10:33 -07:00
8641b55158 fix dangling ptr in embedding_bag (#44571)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44571

Test Plan: Imported from OSS

Reviewed By: malfet, ngimel

Differential Revision: D23661007

Pulled By: glaringlee

fbshipit-source-id: e4a54acd0de55f275828c1d1289a1f069de07291
2020-09-11 20:40:44 -07:00
82b4477948 Pass the input tensor vector by const reference. (#44340)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44340

Changed the constructor of GradBucket to pass the input by const
reference and hence avoided unnecessary explicit move semantics. Since
previously the declaration and definition are separated, passing the input
tensor vector by value looks quite bizarre.

Test Plan: buck test caffe2/torch/lib/c10d:ProcessGroupGlooTest

Reviewed By: pritamdamania87

Differential Revision: D23569939

fbshipit-source-id: db761d42e76bf938089a0b38e98e76a05bcf4162
2020-09-11 18:03:56 -07:00
ab5fee2784 Move the inline implementations of GradBucket class to the header. (#44339)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44339

Moved the inline implementations of GradBucket class to the header for
succinctness and readability. This coding style is also consistent with
reducer.h under the same directory.

Test Plan: buck test caffe2/torch/lib/c10d:ProcessGroupGlooTest

Reviewed By: pritamdamania87

Differential Revision: D23569701

fbshipit-source-id: 237d9e2c5f63a6bcac829d0fcb4a5ba3bede75e5
2020-09-11 18:01:37 -07:00
1f0dcf39fc [JIT] dont optimize device dtype on inline (#43363)
Summary:
Follow up to https://github.com/pytorch/pytorch/pull/36404

Adding prim::device and prim::dtype to the list of skipped peepholes when we run inlining. In the long term, another fix may be to not encode shape / dtype info on the traced graph at all, because it is not guaranteed to be correct; this is currently blocked by ONNX.

Partial fix for https://github.com/pytorch/pytorch/issues/43134

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43363

Reviewed By: glaringlee

Differential Revision: D23383987

Pulled By: eellison

fbshipit-source-id: 2e9c5160d39d690046bd9904be979d58af8d3a20
2020-09-11 17:29:54 -07:00
d729e2965e [TensorExpr] Do not inline autodiff graphs if they contain prim::TypeCheck nodes. (#44564)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44564

Before this change we sometimes inlined autodiff subgraphs containing
fusion groups. This happened because we didn't look for 'unsupported'
nodes recursively (maybe we should), and the fusion groups were inside
if-nodes.

The problem was detected by bertmaher in 'LearningToPaint' benchmark
investigation where this bug caused us to keep constantly hitting
fallback paths of the graph.

Test Plan: Imported from OSS

Reviewed By: bwasti

Differential Revision: D23657049

Pulled By: ZolotukhinM

fbshipit-source-id: 7c853424f6dce4b5c344d6cd9c467ee04a8f167e
2020-09-11 17:28:53 -07:00
64b4307d47 [NNC] Cuda Codegen - mask loops bound to block/thread dimensions (#44325)
Summary:
Fix an issue where loops of different sizes are bound to the same Cuda dimension / metavar.

More info and tests coming soon...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44325

Reviewed By: colesbury

Differential Revision: D23628859

Pulled By: nickgg

fbshipit-source-id: 3621850a4cc38a790b62ad168d32e7a0e2462fad
2020-09-11 16:48:16 -07:00
2ae74c0632 Compile less legacy code when BUILD_CAFFE2 is set to False (take 2) (#44453)
Summary:
2nd attempt to land https://github.com/pytorch/pytorch/pull/44079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44453

Reviewed By: walterddr, seemethere

Differential Revision: D23619528

Pulled By: malfet

fbshipit-source-id: c7c206ebd327dcf3994789bd47008b05ff862fe7
2020-09-11 16:27:47 -07:00
566b8d0650 handle missing NEON vst1_*_x2 intrinsics (#44198) (#44199)
Summary:
CentOS 8 on AArch64 has vld1_* intrinsics but lacks vst1q_f32_x2 one.

This patch checks for it and handle it separately to vld1_* ones.

Fixes https://github.com/pytorch/pytorch/issues/44198

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44199

Reviewed By: seemethere

Differential Revision: D23641273

Pulled By: malfet

fbshipit-source-id: c2053c8e0427705eaeeeb82ec030925bff22623a
2020-09-11 16:02:44 -07:00
db24c5c582 Change code coverage option name (#43999)
Summary:
According to the [documentation](https://github.com/pytorch/pytorch/blob/master/tools/setup_helpers/cmake.py#L265), only options starting with `BUILD_` / `USE_` / `CMAKE_` in `CMakeLists.txt` can be imported from environment variables.

 ---
This diff was originally intended to enable `c++` source coverage with `CircleCI` and `codecov.io`, but we will finish that in the future; the related information is in the diff history. The original procedure was:

Based on [this pull request](1bda5e480c), life becomes much easier this time.
1. in `build.sh`
- Enable the coverage build option for c++
- `apt-get install lcov`

2. in `test.sh`
- run `lcov`

3. in `pytorch-job-specs.yml`
- copy coverage.info to the `test/` folder and upload it to codecov.io

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43999

Test Plan: Test on github

Reviewed By: malfet

Differential Revision: D23464656

Pulled By: scintiller

fbshipit-source-id: b2365691f04681d25ba5c00293fbcafe8e8e0745
2020-09-11 15:55:05 -07:00
b6f0ea0c71 [quant][graphmode][fx][fix] Remove qconfig in convert (#44526)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44526

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23641960

fbshipit-source-id: 546da1c16694d1e1dfb72629085acaae2165e759
2020-09-11 15:51:47 -07:00
42f9f2f38f [fix] ReduceOps throw error if dim is repeated (#44281)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44273

TODO

* [x] Add test
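
A minimal sketch of the fixed behavior (the exact error message is illustrative):
```python
import torch

x = torch.randn(2, 3)
try:
    x.sum(dim=(0, 0))  # repeated dim now raises a RuntimeError
except RuntimeError as e:
    print(e)
```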

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44281

Reviewed By: zhangguanheng66

Differential Revision: D23569004

Pulled By: ezyang

fbshipit-source-id: 1ca6523fef168c8ce252aeb7ca418be346b297bf
2020-09-11 15:34:06 -07:00
f3a79b881f add lcov to oss for beautiful html report (#44568)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44568

With `lcov`, we can generate a nice HTML report, which is better than the current file report and line report. Therefore, for gcc coverage in OSS, remove the `export` code and the file/line-level report code and use only the HTML report.

For clang, since such a tool is not available, we still use the file report and line report that we generate ourselves.

Test Plan:
Tested on a Docker Ubuntu machine.
## Measurement
1. After running `atest`, it takes about 15 mins to collect code coverage and generate the report.
```
# gcc code coverage
python oss_coverage.py --run-only=atest
```

## Presentation
**The html result looks like:**

*Top Level:*

{F328330856}

*File Level:*

{F328336709}

Reviewed By: malfet

Differential Revision: D23550784

fbshipit-source-id: 1fff050e7f7d1cc8e86a6a200fd8db04b47f5f3e
2020-09-11 15:29:24 -07:00
c2b40b056a Filter default tests for clang coverage in oss
Summary: Some tests like `test_dataloader.py` are not able to run under `clang` in OSS, because they generate intermediate files that are too large (~40 GB) to be merged by `llvm`. Skip them when the user doesn't specify the `--run-only` option.

Test Plan: Tested locally. But still, we do not recommend running `clang` coverage in default mode, because it takes too much space.

Reviewed By: malfet

Differential Revision: D23549829

fbshipit-source-id: 0737e6e9dcbe3f38de00580ee6007906e743e52f
2020-09-11 15:28:15 -07:00
a82ea6a91f [quant][graphmode][fx][fix] Support None qconfig in convert (#44524)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44524

None qconfig is not handled previously
closes: https://github.com/pytorch/pytorch/issues/44438

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23640269

fbshipit-source-id: 8bfa88c8c78d4530338d9d7fa9669876c386d91f
2020-09-11 15:22:25 -07:00
1fb5883072 removing conv filters from conv pattern matching (#44512)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44512

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23637409

Pulled By: z-a-f

fbshipit-source-id: ad5be0fa6accfbcceaae9171bf529772d87b4098
2020-09-11 15:16:29 -07:00
dd4bbe1a79 Add iterator like functionality for DispatchKeySet (#44066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44066

Add STL Input iterator to DispatchKeySet:
* Iterator is able to iterate from first not undefined DispatchKey
to NumDispatchKeys.
* Iterator is invalidated once underlying DispatchKeySet is invalidated

Note see http://www.cplusplus.com/reference/iterator/ for comparisons of
different iterators.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23611405

Pulled By: linux-jedi

fbshipit-source-id: 131b287d60226a1d67a6ee0f88571f8c4d29f9c3
2020-09-11 15:08:15 -07:00
e2bb34e860 Batched grad support for: slice, select, diagonal (#44505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44505

Added batching rules for slice_backward, select_backward, and
diagonal_backward.

Test Plan: - new tests: `pytest test/test_vmap.py -v -k "BatchedGrad"`

Reviewed By: agolynski, anjali411

Differential Revision: D23650409

Pulled By: zou3519

fbshipit-source-id: e317609d068c88ee7bc07fab88b2b3acb8fad7e1
2020-09-11 14:59:58 -07:00
7632484000 Add some batched gradient tests (#44494)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44494

These tests check (most) operations that are useful for bayesian logistic
regression (BLR) models. Said operators are basically those found in the
log_prob functions of Distributions objects. This PR is not a general,
structured solution for testing batched gradients (see "Alternative
solution" for that), but I wanted to test a small subset of operations
to confirm that the BLR use case works.

There will be follow-up PRs implementing support for some missing
operations for the BLR use case.

Alternative solution
=====================

Ideally, and in the future, I want to autogenerate tests from
common_method_invocations and delete all of the manual tests
introduced by this PR. However, if we were to do this now,
we would need to store the following additional metadata somewhere:
- operator name, supports_batched_grad, allow_vmap_fallback_usage

We could store that metadata as a separate table from
common_method_invocations, or add two columns to
common_method_invocations. Either way that seems like a lot of work and
the situation will get better once vmap supports batched gradients for
all operators (on the fallback path).

I am neutral between performing the alternative approach now v.s. just
manually writing out some tests for these operations, so I picked the
easier approach. Please let me know if you think it would be better to
pursue the alternative approach now.

Test Plan: - `pytest test/test_vmap.py -v -k "BatchedGrad"`

Reviewed By: anjali411

Differential Revision: D23650408

Pulled By: zou3519

fbshipit-source-id: 2f26c7ad4655318a020bdaab5c767cd3956ea5eb
2020-09-11 14:59:54 -07:00
ab6126b50e [rpc][jit] support remote call in TorchScript (#43046)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43046

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23621108

Pulled By: wanchaol

fbshipit-source-id: e8152c6cdd3831f32d72d46ac86ce22f3f13c651
2020-09-11 14:59:51 -07:00
3e5df5f216 [rpc][jit] support rpc_sync in TorchScript (#43043)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43043

This adds support for rpc_sync in TorchScript, in a way similar to
rpc_async.
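
A minimal sketch, assuming an initialized RPC agent; the destination worker name and `script_add` are placeholders, and the exact scriptable signature may differ from this PR:
```python
import torch
import torch.distributed.rpc as rpc

@torch.jit.script
def script_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return x + y

@torch.jit.script
def call_remote(dst: str, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # rpc_sync is now scriptable, mirroring the existing rpc_async support
    return rpc.rpc_sync(dst, script_add, (x, y))
```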

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23252039

Pulled By: wanchaol

fbshipit-source-id: 8a05329cb8a24079b2863178b73087d47273914c
2020-09-11 14:59:47 -07:00
8bec7cfa91 [rpc] rename some functions (#43042)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43042

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23228894

Pulled By: wanchaol

fbshipit-source-id: 3702b7826ecb455073fabb9dc5dca804c0e092b2
2020-09-11 14:58:39 -07:00
70dfeb44bd MinMax based observers: respect device affinity for state_dict (#44537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44537

Originally, the `min_val`, `max_val`, `min_vals`, `max_vals`
attributes of observers were Tensors but not buffers.  They had custom
state_dict save/load code to ensure their state was saved.

At some point, these attributes became buffers, and the custom
save/load code remained. This introduced a subtle bug:
* create model A, move it to a device (cpu/cuda) and save its state_dict
* create model B, load its state dict.
* `min_val|min_vals|max_val|max_vals` would always be loaded to model A's device, even if the rest of model B was on a different device
* the above is inconsistent with how save/load on different devices is expected to work (see https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-model-across-devices)

In practice, the case people would sometimes hit is:
* model A is on CPU, state dict is saved
* model B is created and moved to GPU, state_dict from model A is loaded
* assertions throw when operations are attempted across different devices

This PR fixes the behavior by removing the custom save/load where
possible and letting the default `nn.Module` save/load code handle
device assignment.  We special case `PerChannelMinMaxObserver` and its
children to allow for loading buffers of different size, which is
normal.

There are some followups to also enable this for HistogramObserver
and FakeQuantize, which can be done in separate PRs due to higher
complexity.
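
A minimal sketch of the scenario this fixes, assuming `MinMaxObserver` is importable from `torch.quantization` and a CUDA device is available:
```python
import torch
from torch.quantization import MinMaxObserver

obs_a = MinMaxObserver()
obs_a(torch.randn(4))          # record min_val / max_val on the CPU
state = obs_a.state_dict()

if torch.cuda.is_available():
    obs_b = MinMaxObserver().cuda()
    obs_b.load_state_dict(state)
    # min_val / max_val now follow obs_b's device instead of obs_a's
    print(obs_b.min_val.device)  # cuda:0
```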

Test Plan:
```
python test/test_quantization.py TestObserver.test_state_dict_respects_device_affinity
```

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23644493

fbshipit-source-id: 0dbb6aa309ad569a91a663b9ee7e44644080032e
2020-09-11 14:48:56 -07:00
192c4111a3 Simplify target handling in nn gradcheck. (#44507)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44507

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23635799

Pulled By: gchanan

fbshipit-source-id: 75090d6a48771e5c92e737a0829fbfa949f7c8a7
2020-09-11 13:25:59 -07:00
8a574c7104 [Cmake] Drop quotation marks around $ENV{MAX_JOBS} (#44557)
Summary:
Solves the `'-j' option requires a positive integer argument` error on some systems when MAX_JOBS is not defined.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44557

Reviewed By: vkuzo

Differential Revision: D23653511

Pulled By: malfet

fbshipit-source-id: 7d86fb7fb6c946c34afdc81bf2c3168a74d00a1f
2020-09-11 12:57:11 -07:00
2b8f0b2023 [caffe2] adds Cancel to OperatorBase and NetBase (#44145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44145

## Motivation

* To be able to make C2 ops cancellable so we can safely exit.
* Some C2 operators are currently blocking and thus non-cancellable. If an error
  occurs we need to be able to safely stop all net execution so we can throw
  the exception to the caller.

## Summary
*  Adds `NetBase::Cancel()` to NetBase, which iterates over the entire list of
   operators and calls Cancel.
* Cancel on all ops was added to Net since there's nothing async-specific about it.
* `AsyncSchedulingNet` calls the parent Cancel.
* To preserve backwards compatibility, `AsyncSchedulingNet`'s Cancel still calls
   `CancelAndFinishAsyncTasks`.
* Adds `Cancel()` to `OperatorBase`.

Reviewed By: dzhulgakov

Differential Revision: D23279202

fbshipit-source-id: e1bb0ff04a4e1393f935dbcac7c78c0baf728550
2020-09-11 12:50:26 -07:00
5579b53a7f Fix SmoothL1Loss when target.requires_grad is True. (#44486)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44486

SmoothL1Loss had a completely different (and incorrect, see #43228) path when target.requires_grad was True.

This PR does the following:

1) adds derivative support for target via the normal derivatives.yaml route
2) kill the different (and incorrect) path for when target.requires_grad was True
3) modify the SmoothL1Loss CriterionTests to verify that the target derivative is checked.
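
A minimal sketch of the new target derivative support (the same pattern applies to the analogous L1Loss and MSELoss fixes in this range):
```python
import torch
import torch.nn.functional as F

inp = torch.randn(3, requires_grad=True)
target = torch.randn(3, requires_grad=True)
F.smooth_l1_loss(inp, target).backward()
print(target.grad)  # gradients now flow to target as well
```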

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23630699

Pulled By: gchanan

fbshipit-source-id: 0f94d1a928002122d6b6875182867618e713a917
2020-09-11 12:13:36 -07:00
b7ef4eec46 [NNC] Add loop slicing transforms (#43854)
Summary:
Add new transforms `sliceHead` and `sliceTail` to `LoopNest`, for example:

Before transformation:
```
for x in 0..10:
  A[x] = x*2
```

After `sliceHead(x, 4)`:

```
for x in 0..4:
  A[x] = x*2
for x in 4..10:
  A[x] = x*2
```

After `sliceTail(x, 1)`:
```
for x in 0..4:
  A[x] = x*2
for x in 4..9:
  A[x] = x*2
for x in 9..10:
  A[x] = x*2
```

`sliceHead(x, 10)` and `sliceTail(x, 10)` are no-ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43854

Test Plan: Tests are added in `test_loopnest.cpp`, the tests cover the basic transformations, and also tests the combination with other transformations such as `splitWithTail`.

Reviewed By: nickgg

Differential Revision: D23417366

Pulled By: cheng-chang

fbshipit-source-id: 06c6348285f2bafb4be3286d1642bfbe1ea499bf
2020-09-11 12:09:12 -07:00
39bb455e36 Update fallback kernel for Autograd keys. (#44349)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44349

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23589807

Pulled By: ailzhang

fbshipit-source-id: 0e4b0bf3e07bb4e35cbf1bda22f7b03193eb3dc4
2020-09-11 12:04:52 -07:00
11fb51d093 [quant][graphmode][fx][fix] Support dictionary output (#44508)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44508

Bug fix for dictionary output

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23636182

fbshipit-source-id: 0c00cd6b9747fa3f8702d7f7a0d5edb31265f466
2020-09-11 11:29:20 -07:00
442957d8b6 [pytorch] Remove mobile nonvariadic run_method (#44235)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44235

Removes nonvariadic run_method() from mobile Module entirely (to be later replaced by a variadic version). All use cases should have been migrated to use get_method() and Method::operator() in D23436351
ghstack-source-id: 111848220

Test Plan: CI

Reviewed By: iseeyuan

Differential Revision: D23484577

fbshipit-source-id: 602fcde61e13047a34915b509da048b9550103b1
2020-09-11 10:23:08 -07:00
a61318a535 [pytorch] Replace mobile run_method with get_method and operator() (#44202)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44202

In preparation for changing mobile run_method() to be variadic, this diff:

* Implements get_method() for mobile Module, which is similar to find_method but expects the method to exist.
* Replaces calls to the current nonvariadic implementation of run_method() by calling get_method() and then invoking the operator() overload on Method objects.
ghstack-source-id: 111848222

Test Plan: CI, and all the unit tests which currently contain run_method that are being changed.

Reviewed By: iseeyuan

Differential Revision: D23436351

fbshipit-source-id: 4655ed7182d8b6f111645d69798465879b67a577
2020-09-11 10:23:06 -07:00
cdf5e2ae86 add typing annotations for a few torch.utils.* modules (#43806)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43431. Depends on [gh-43862](https://github.com/pytorch/pytorch/pull/43862) (EDIT: now merged)

Modules:
- torch.utils.mkldnn
- torch.utils.mobile_optimizer
- torch.utils.bundled_inputs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43806

Reviewed By: gmagogsfm

Differential Revision: D23635151

Pulled By: SplitInfinity

fbshipit-source-id: a85b75a7927dde6cc55bcb361f8ff601ffb0b2a1
2020-09-11 10:20:55 -07:00
7d78a6fcdd Update interpolate to use new upsample overloads (#43025)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43025

- Use new overloads that better reflect the arguments to interpolate.
- More uniform interface for upsample ops allows simplifying the Python code.
- Also reorder overloads in native_functions.yaml to give them priority.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/37177

ghstack-source-id: 106938111

Test Plan:
test_nn has pretty good coverage.

Relying on CI for ONNX, etc.

Didn't test FC because this change is *not* forward compatible.

To ensure backwards compatibility, I ran this code before this change

```python
def test_func(arg):
    interp = torch.nn.functional.interpolate
    with_size = interp(arg, size=(16,16))
    with_scale = interp(arg, scale_factor=[2.1, 2.2], recompute_scale_factor=False)
    with_compute = interp(arg, scale_factor=[2.1, 2.2])
    return (with_size, with_scale, with_compute)

traced_func = torch.jit.trace(test_func, torch.randn(1,1,1,1))

sample = torch.randn(1, 3, 7, 7)
output = traced_func(sample)

assert not torch.allclose(output[1], output[2])

torch.jit.save(traced_func, "model.pt")
torch.save((sample, output), "data.pt")
```

then this code after this change

```python
model = torch.jit.load("model.pt")
sample, golden = torch.load("data.pt")
result = model(sample)
for r, g in zip(result, golden):
    assert torch.allclose(r, g)
```

Reviewed By: AshkanAliabadi

Differential Revision: D21209991

fbshipit-source-id: 5b2ebb7c3ed76947361fe532d1dbdd6faa3544c8
2020-09-11 09:59:14 -07:00
df6ea62526 Add nondeterministic check to new upsample overloads
Summary: I think these were missed due to a code landing race condition.

Test Plan: Fixes CUDA tests with PR 43025 applied.

Reviewed By: iseeyuan, AshkanAliabadi

Differential Revision: D23639566

fbshipit-source-id: 1322d7708e246b075a66588e7e54f4e12092477f
2020-09-11 09:58:07 -07:00
3de2c0b42f Fix L1Loss when target.requires_grad is True. (#44471)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44471

L1Loss had a completely different (and incorrect, see #43228) path when target.requires_grad was True.

This PR does the following:

1) adds derivative support for target via the normal derivatives.yaml route
2) kill the different (and incorrect) path for when target.requires_grad was True
3) modify the L1Loss CriterionTests to verify that the target derivative is checked.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23626008

Pulled By: gchanan

fbshipit-source-id: 2828be16b56b8dabe114962223d71b0e9a85f0f5
2020-09-11 09:51:16 -07:00
ea55820606 [dper3] Export PackSegments and UnpackSegments to Pytorch
Summary: As title.

Test Plan:
```
buck test //caffe2/caffe2/python/operator_test/:torch_integration_test -- test_pack_segments
```

Reviewed By: yf225

Differential Revision: D23610495

fbshipit-source-id: bd8cb61f2284a08a54091a4f982f01fcf681f215
2020-09-11 09:29:24 -07:00
b73b44f976 [PyTorch Mobile] Move some string ops to register_prim_ops.cpp and make them selective (#44500)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44500

Some user models are using those operators. Unblock them while keeping the ops selective.

Test Plan: CI

Reviewed By: linbinyu

Differential Revision: D23634769

fbshipit-source-id: 55841d1b07136b6a27b6a39342f321638dc508cd
2020-09-11 09:24:35 -07:00
567c51cce9 In common_distributed, fix TEST_SKIPS multiprocessing manager (#44525)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44525

Since `TEST_SKIPS` is a global multiprocessing.manager, this was causing
issues when one test would fail and make the rest of the tests fail during
setup due to networking errors.

See the failed CI job: https://app.circleci.com/pipelines/github/pytorch/pytorch/212491/workflows/0450151d-ca09-4cf6-863d-272de6ed917f/jobs/7389065 for an example, where `test_ddp_backward` failed but then caused the rest of the tests to fail at the line `test_skips.update(TEST_SKIPS)`.

To fix this issue, at the end of every test we revert `TEST_SKIPS` back to a regular dict, and redo the conversion to a `multiprocessing.Manager` in the next test, which prevents these errors.
ghstack-source-id: 111844724

Test Plan: CI

Reviewed By: malfet

Differential Revision: D23641618

fbshipit-source-id: 27ce823968ece9804bb4dda898ffac43ef732b89
2020-09-11 09:16:33 -07:00
d07d25a8c5 Fix MSELoss when target.requires_grad is True. (#44437)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44437

MSELoss had a completely different (and incorrect, see https://github.com/pytorch/pytorch/issues/43228) path when target.requires_grad was True.

This PR does the following:
1) adds derivative support for target via the normal derivatives.yaml route
2) kill the different (and incorrect) path for when target.requires_grad was True
3) modify the MSELoss CriterionTests to verify that the target derivative is checked.

TODO:
1) do we still need check_criterion_jacobian when we run grad/gradgrad checks?
2) ensure the Module tests check when target.requires_grad
3) do we actually test when reduction='none' and reduction='mean'?

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23612166

Pulled By: gchanan

fbshipit-source-id: 4f74d38d8a81063c74e002e07fbb7837b2172a10
2020-09-11 08:51:28 -07:00
9a3b83cbf2 Update submodule gloo to the latest commits so that it can work on Windows (#44529)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44529

Reviewed By: rohan-varma

Differential Revision: D23650123

Pulled By: mrshenli

fbshipit-source-id: b5b891cbcec51a14379d6604af63c714c32d93e7
2020-09-11 08:47:02 -07:00
b6b1c01adf torch.view_as_complex fails with segfault for a zero dimensional tensor (#44175)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44061
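
A minimal sketch of the fixed behavior, assuming the fix turns the segfault into a proper `RuntimeError`:
```python
import torch

t = torch.tensor(1.0)  # zero-dimensional tensor
try:
    torch.view_as_complex(t)  # needs a last dimension of size 2
except RuntimeError as e:
    print(e)  # previously a segfault; now a proper error
```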

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44175

Reviewed By: colesbury

Differential Revision: D23628103

Pulled By: anjali411

fbshipit-source-id: 6f70b5824150121a1617c0757499832923ae02b5
2020-09-11 08:35:49 -07:00
a9754fb860 Use TP Tensor.metadata to carry device info (#44396)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44396

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D23602576

Pulled By: mrshenli

fbshipit-source-id: c639789979b2b71fc165efbcf70f37b4c39469df
2020-09-11 08:33:22 -07:00
f44de7cdc3 Add missing rpc.shutdown() (#44417)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44417

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D23626208

Pulled By: mrshenli

fbshipit-source-id: 4ff8cad0e1193f99518804c21c9dd26ae718f4eb
2020-09-11 08:32:15 -07:00
77cc7d1ecd C++ APIs Transformer NN Module Top Layer (#44333)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44333

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23584010

Pulled By: glaringlee

fbshipit-source-id: 990026e3f1b5ae276776e344ea981386cb7528fe
2020-09-11 08:25:27 -07:00
09892de815 Clarify track_running_stats docs; Make SyncBatchNorm track_running_stats behavior consistent (#44445)
Summary:
context: https://github.com/pytorch/pytorch/pull/38084

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44445

Reviewed By: colesbury

Differential Revision: D23634216

Pulled By: mrshenli

fbshipit-source-id: d1242c694dec0e7794651f8031327625eb9989ee
2020-09-11 08:20:34 -07:00
30fccc53a9 [NNC] Don't attempt to refactor conditional scalars (#44223)
Summary:
Fixes a bug in the NNC registerizer for Cuda where it would hoist reads out of a conditional context when trying to cache them. As a quick fix, prevent scalar replacement if a usage is within a condition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44223

Reviewed By: gchanan

Differential Revision: D23551247

Pulled By: nickgg

fbshipit-source-id: 17a7bf2be4c8c3dd8a9ab7997dce9aea200c3685
2020-09-11 04:22:16 -07:00
c967e7724e [quant] conv_transpose1d_prepack / conv_transpose1d_unpack (#40360)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40360

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D22158982

Pulled By: z-a-f

fbshipit-source-id: 844d02806554aaa68b521283703e630cc544d419
2020-09-11 04:12:28 -07:00
8b8986662f [JIT] Remove profiling nodes in autodiff forward graph (#44420)
Summary:
Previously we were not removing profiling nodes in graphs that required grad and contained diff graphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44420

Reviewed By: bertmaher

Differential Revision: D23607482

Pulled By: eellison

fbshipit-source-id: af095f3ed8bb3c5d09610f38cc7d1481cbbd2613
2020-09-11 02:59:39 -07:00
c6febc6480 [JIT] Add a python hook for a function to interpret JIT graphs. (#44493)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44493

This function allows executing a graph exactly as it is, without going
through a graph executor, which would run passes on the graph before
interpreting it. I found this feature extremely helpful when I worked on
a stress-testing script to shake out bugs from the TE fuser: I needed to
run a very specific set of passes on a graph and nothing else, and
then execute exactly that graph.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23632505

Pulled By: ZolotukhinM

fbshipit-source-id: ea81fc838933743e2057312d3156b77284d832ef
2020-09-11 02:55:26 -07:00
51ed31269e Replace FutureMessage with c10::ivalue::Future in DistEngine. (#44239)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44239

As part of https://github.com/pytorch/pytorch/issues/41574, use
c10::ivalue::Future everywhere in DistEngine.
ghstack-source-id: 111645070

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D23553507

fbshipit-source-id: 1b51ba13d1ebfa6c5c70b12028e9e96ce8ba51ff
2020-09-11 01:03:42 -07:00
b5d75dddd9 Enable lerp on half type; fix output memory format (#43541)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43541

Reviewed By: zou3519

Differential Revision: D23499592

Pulled By: ezyang

fbshipit-source-id: 9efdd6cbf0a334ec035ddd467667ba874b892549
2020-09-10 21:50:35 -07:00
0c58a017bd [quant][eagermode][refactor] Add set/get method for quantization and fusion mappings (#43990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43990

Allow user to register custom quantization and fusion patterns

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23485344

fbshipit-source-id: 4f0174ee6d8000d83de0f73cb370e9a1941d54aa
2020-09-10 21:29:39 -07:00
f7278473d3 [NCCL] Fix NCCL_BLOCKING_WAIT functionality with Async Error Handling (#44411)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44411

This basically aborts errored NCCL communicators if either blocking
wait or async error handling is enabled. Otherwise we may abort nccl
communicators where neither are enabled, and this may result in subsequent GPU
operations using corrupted data.
ghstack-source-id: 111839264

Test Plan: Succesful Flow run: f217591683

Reviewed By: jiayisuse

Differential Revision: D23605382

fbshipit-source-id: 6c16f9626362be3b0ce2feaf0979b2dff97ce61b
2020-09-10 20:57:55 -07:00
6ee41974e3 Speedup Linux nightly builds (#44532)
Summary:
`stdbuf` affects not only the process it launches, but all of its subprocesses, which has a very negative effect on the IPC communication between nvcc and the C++ preprocessor and results in a 2x slowdown, for example:

```
$ time /usr/local/cuda/bin/nvcc /pytorch/aten/src/THC/generated/THCTensorMathPointwiseByte.cu -c ...
real	0m34.623s
user	0m31.736s
sys	0m2.825s
```
but
```
time stdbuf -i0 -o0 -e0 /usr/local/cuda/bin/nvcc /pytorch/aten/src/THC/generated/THCTensorMathPointwiseByte.cu -c ...
real	1m14.113s
user	0m37.989s
sys	0m36.104s
```
because the OS spends lots of time transferring the preprocessed source back to nvcc byte by byte, as requested via the stdbuf call

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44532

Reviewed By: ngimel

Differential Revision: D23643411

Pulled By: malfet

fbshipit-source-id: 9fdaf8b8a49574e6b281f68a5dd9ba9d33464dff
2020-09-10 20:32:08 -07:00
69f6d94caa Register diag_backward, diagonal_backward, infinitely...gelu_backward as operators (#44422)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44422

See #44052 for context.

Test Plan:
- `pytest test/test_autograd.py -v`
- `pytest test/test_nn.py -v`

Reviewed By: mrshenli

Differential Revision: D23607691

Pulled By: zou3519

fbshipit-source-id: 09fbcd66b877af4fa85fd9b2f851ed3912ce84d6
2020-09-10 18:43:18 -07:00
7ff7e6cfc8 Register cummaxmin_backward, cumprod_backward as operators (#44410)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44410

See #44052 for context. One of the cumprod_backward overloads was unused
so I just deleted it.

Test Plan: - `pytest test/test_autograd.py -v`

Reviewed By: mrshenli

Differential Revision: D23605503

Pulled By: zou3519

fbshipit-source-id: f9c5b595e62d2d6e71f26580ba96df15cc9de4f7
2020-09-10 18:43:15 -07:00
08b431f54c Add trace_backward, masked_select_backward, and take_backward as ops (#44408)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44408

See #44052 for context.

Test Plan: - `pytest test/test_autograd.py -v`

Reviewed By: mrshenli

Differential Revision: D23605504

Pulled By: zou3519

fbshipit-source-id: b9b1646d13caa6e536d08669c29bfc2ad8ff89a3
2020-09-10 18:41:07 -07:00
41f62b17e7 Fix DDP join() API in the case of model.no_sync() (#44427)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44427

Closes https://github.com/pytorch/pytorch/issues/44425

DDP join API currently does not work properly with `model.no_sync()`, see https://github.com/pytorch/pytorch/issues/44425 for details. This PR fixes the problem via the approach mentioned in the issue, namely scheduling an allreduce that tells joined ranks whether to sync in the backwards pass or not. Tests are added for skipping gradient synchronization for various `sync_interval`s.
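
A minimal sketch of the fixed interaction; this is run once per rank (e.g. via `torch.multiprocessing.spawn`), and `sync_interval` and the batch counts are placeholders:
```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    model = DDP(torch.nn.Linear(4, 4))
    sync_interval = 2
    with model.join():  # tolerates uneven inputs across ranks
        for step in range(4 + rank):  # ranks see different numbers of batches
            batch = torch.randn(2, 4)
            if (step + 1) % sync_interval != 0:
                with model.no_sync():  # accumulate locally, no allreduce
                    model(batch).sum().backward()
            else:
                model(batch).sum().backward()  # synchronizing backward pass
    dist.destroy_process_group()
```
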
ghstack-source-id: 111786479

Reviewed By: pritamdamania87

Differential Revision: D23609070

fbshipit-source-id: e8716b7881f8eee95e3e3499283e716bd3d7fe76
2020-09-10 18:31:40 -07:00
129d52aef2 Fix uniqueness check in movedim (#44307)
Summary:
Noticed this bug in `torch.movedim` (https://github.com/pytorch/pytorch/issues/41480). [`std::unique`](https://en.cppreference.com/w/cpp/algorithm/unique) only guarantees uniqueness for _sorted_ inputs. The current check lets through non-unique values when they aren't adjacent to each other in the list, e.g. `(0, 1, 0)` wouldn't raise an exception and instead the algorithm fails later with an internal assert.
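
A minimal sketch of the fixed check (the exact error message is illustrative):
```python
import torch

x = torch.randn(2, 3, 4)
try:
    torch.movedim(x, (0, 1, 0), (0, 1, 2))  # repeated source dim
except RuntimeError as e:
    print(e)  # now rejected up front instead of an internal assert later
```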

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44307

Reviewed By: mrshenli

Differential Revision: D23598311

Pulled By: zou3519

fbshipit-source-id: fd6cc43877c42bb243cfa85341c564b6c758a1bf
2020-09-10 17:41:07 -07:00
c48f511c7e Moves some of TestTorchMathOps to OpInfos (#44277)
Summary:
This PR fixes three OpInfo-related bugs and moves some functions from TestTorchMathOps to be tested using the OpInfo pattern. The bugs are:

- A skip test path in test_ops.py incorrectly formatted its string argument
- Decorating the tests in common_device_type.py was incorrectly always applying decorators to the original test, not the op-specific variant of the test. This could cause the same decorator to be applied multiple times, overriding past applications.
- make_tensor was incorrectly constructing tensors in some cases

The functions moved are:

- asin
- asinh
- sinh
- acosh
- tan
- atan
- atanh
- tanh
- log
- log10
- log1p
- log2

In a follow-up PR more or all of the remaining functions in TestTorchMathOps will be refactored as OpInfo-based tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44277

Reviewed By: mrshenli, ngimel

Differential Revision: D23617361

Pulled By: mruberry

fbshipit-source-id: edb292947769967de9383f6a84eb327f027509e0
2020-09-10 17:31:50 -07:00
2e744b1820 Support work.result() to get result tensors for allreduce for Gloo, NCCL backends (#43970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43970

It is a resubmission of #43386

Original commit changeset: 27fbeb161706
ghstack-source-id: 111775070
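
A minimal sketch using a single-process Gloo group (the address/port values are placeholders):
```python
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

t = torch.ones(4)
work = dist.all_reduce(t, async_op=True)
work.wait()
print(work.result())  # list of result tensors from the allreduce
dist.destroy_process_group()
```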

Test Plan:
Added checks to existing unit test and ran it on gpu devserver.
Verified the test that was failing in original diff also passes: https://app.circleci.com/pipelines/github/pytorch/pytorch/210229/workflows/86bde47b-f2da-48e3-a618-566ae2713102/jobs/7253683

Reviewed By: pritamdamania87

Differential Revision: D23455047

fbshipit-source-id: b8dc4a30b95570d68a482c19131674fff2a3bc7c
2020-09-10 17:13:37 -07:00
91b16bff1e Disable PyTorch iOS ARM64 builds until cert problem is fixed (#44499)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44499

Reviewed By: seemethere, xta0

Differential Revision: D23634961

Pulled By: malfet

fbshipit-source-id: e32ae29c42c351bcb4f48bc52d4082ae56545e5b
2020-09-10 16:24:11 -07:00
1dd3fae3d2 [pytorch] Add logging to mobile Method run (#44234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44234

Changes mobile Method to point to a mobile Module directly instead of the Module ivalue in order to access metadata for logging/debugging, and then adds said logging.
ghstack-source-id: 111775806

Test Plan:
CI/existing unit tests to test BC
Testing fb4a logging:
Built fb4a on D23436351 (because usage of run_method isn't replaced yet in this diff), and then checked the Scuba logs to see that the appropriate ad clicks were logged (one ad for Buzzfeed shopping and another about Netflix from Bustle)

{F328510687}
{F328511201}
[Scuba sample of QPL metrics](https://www.internalfb.com/intern/scuba/query/?dataset=qpl_metrics%2Fpytorch_employee&pool=uber&view=samples_client&drillstate=%7B%22sampleCols%22%3A[%22device_model%22%2C%22instance_id_sampled%22%2C%22method%22%2C%22ios_device_class%22%2C%22points_path%22%2C%22userid_sampled%22%2C%22client_sample_rate%22%2C%22browser_name%22%2C%22ios_device_name%22%2C%22points%22%2C%22is_employee%22%2C%22is_test_user%22%2C%22network_only_queries%22%2C%22annotations%22%2C%22oncall_shortname%22%2C%22environment_tags%22%2C%22revoked_queries%22%2C%22annotations_bool%22%2C%22points_data%22%2C%22annotations_double_array%22%2C%22annotations_string_array%22%2C%22revoked_steps%22%2C%22points_set%22%2C%22device_os_version%22%2C%22ota_version_rollout%22%2C%22steps%22%2C%22vadar_calculation_result%22%2C%22app_name%22%2C%22client_push_phase%22%2C%22vadar%22%2C%22release_channel%22%2C%22interaction_class%22%2C%22exposures%22%2C%22annotations_double%22%2C%22deviceid_sampled%22%2C%22is_logged_in%22%2C%22device_os%22%2C%22time%22%2C%22major_os_ver%22%2C%22annotations_int_array%22%2C%22duration_ns%22%2C%22app_build%22%2C%22bucket_id%22%2C%22cache_and_network_queries%22%2C%22value%22%2C%22vadar_v2%22%2C%22quicklog_event%22%2C%22unixname%22%2C%22vadar_calculation_result_v2%22%2C%22trace_tags%22%2C%22annotations_int%22%2C%22quicklog_module%22%2C%22push_phase%22%2C%22year_class%22%2C%22country%22%2C%22capped_duration%22%2C%22ram_class%22%2C%22weight%22%2C%22carrier%22%2C%22app_id%22%2C%22app_version%22%2C%22react_bundle_version%22%2C%22logging_source%22%2C%22is_unsampled_for_scuba%22%2C%22instrumentation_errors%22%2C%22android_cpu_abi_list%22%2C%22days_after_release%22%2C%22cpu_cores%22%2C%22user_bucket%22%2C%22quicklog_action%22%2C%22server_scuba_sample_rate%22%2C%22points_vector%22%2C%22annotations_bool_array%22%2C%22android_device_class%22%2C%22browser_full_version%22%2C%22major_app_ver%22]%2C%22derivedCols%22%3A[]%2C%22mappedCols%22%3A[]%2C%22enumCols%22%3A[]%2C%22hideEmptyColumns%22%3Afalse%2C%22focused_event%22%3A%22%22%2C%22show_metadata%22%3A%22false%22%2C%22start%22%3A%222020-09-08%2011%3A27%3A00%22%2C%22end%22%3A%22start%20%2B%201%20minute%22%2C%22timezone%22%3A%22America%2FLos_Angeles%22%2C%22samplingRatio%22%3A%221%22%2C%22num_samples%22%3A%22100%22%2C%22aggregateList%22%3A[]%2C%22param_dimensions%22%3A[]%2C%22modifiers%22%3A[]%2C%22order%22%3A%22none%22%2C%22order_desc%22%3Atrue%2C%22filterMode%22%3A%22DEFAULT%22%2C%22constraints%22%3A[[%7B%22column%22%3A%22quicklog_event%22%2C%22op%22%3A%22eq%22%2C%22value%22%3A[%22[%5C%22MOBILE_MODULE_STATS%5C%22]%22]%7D%2C%7B%22column%22%3A%22userid_sampled%22%2C%22op%22%3A%22eq%22%2C%22value%22%3A[%22[%5C%22100013484978975%5C%22]%22]%7D]]%2C%22c_constraints%22%3A[[]]%2C%22b_constraints%22%3A[[]]%2C%22metrik_view_params%22%3A%7B%22should_use_legacy_colors%22%3Afalse%2C%22columns_skip_formatting%22%3A[]%2C%22view%22%3A%22samples_client%22%2C%22width%22%3A%221358%22%2C%22height%22%3A%22912%22%2C%22tableID%22%3A%22qpl_metrics%2Fpytorch_employee%22%2C%22fitToContent%22%3Afalse%2C%22format_tooltip_in_percent%22%3Afalse%2C%22use_y_axis_hints_as_limits%22%3Atrue%2C%22has_dynamic_context_menu%22%3Atrue%2C%22has_context_menu%22%3Afalse%2C%22legend_mode%22%3A%22nongrid%22%2C%22connect_nulls%22%3Atrue%2C%22timezone_offset%22%3A420%2C%22timezone%22%3A%22America%2FLos_Angeles%22%2C%22y_min_hint%22%3A0%2C%22should_render_plugins_menu%22%3Afalse%7D%7D&normalized=1599581160)
[Scuba sample showing ad source; just the bottom two results](https://www.internalfb.com/intern/scuba/query/?dataset=business_integrity_webpage_semantic&pool=uber&drillstate=%7B%22sampleCols%22%3A[%22from_custom_sampling%22%2C%22data_version%22%2C%22scribe_category_type%22%2C%22page_id%22%2C%22name%22%2C%22source_url%22%2C%22time%22%2C%22title_semantic%22%2C%22major_version%22%2C%22server_protocol%22%2C%22custom_sampling_enabled%22%2C%22ad_id%22%2C%22appversion%22%2C%22clienttime%22%2C%22isemployee%22%2C%22title%22%2C%22images%22%2C%22weight%22%2C%22carrier%22%2C%22is_ad%22%2C%22locale%22%2C%22appid%22%2C%22ip_country%22%2C%22iab_models%22]%2C%22derivedCols%22%3A[]%2C%22mappedCols%22%3A[]%2C%22enumCols%22%3A[]%2C%22return_remainder%22%3Afalse%2C%22should_pivot%22%3Afalse%2C%22is_timeseries%22%3Afalse%2C%22hideEmptyColumns%22%3Afalse%2C%22main_dimension%22%3A%22time%22%2C%22start%22%3A%22-5%20minutes%22%2C%22samplingRatio%22%3A%221%22%2C%22compare%22%3A%22none%22%2C%22axes%22%3A%22linked%22%2C%22overlay_types%22%3A[]%2C%22minBucketSamples%22%3A%22%22%2C%22dimensions%22%3A[]%2C%22scale_type%22%3A%22absolute%22%2C%22num_samples%22%3A%22100%22%2C%22metric%22%3A%22avg%22%2C%22fill_missing_buckets%22%3A%22connect%22%2C%22smoothing_bucket%22%3A%221%22%2C%22top%22%3A%227%22%2C%22markers%22%3A%22%22%2C%22timezone%22%3A%22America%2FLos_Angeles%22%2C%22end%22%3A%22now%22%2C%22show_p95_ci%22%3Afalse%2C%22time_bucket%22%3A%22auto%22%2C%22compare_mode%22%3A%22normal%22%2C%22aggregateList%22%3A[]%2C%22param_dimensions%22%3A[]%2C%22modifiers%22%3A[]%2C%22order%22%3A%22none%22%2C%22order_desc%22%3Atrue%2C%22filterMode%22%3A%22DEFAULT%22%2C%22constraints%22%3A[[%7B%22column%22%3A%22major_version%22%2C%22op%22%3A%22eq%22%2C%22value%22%3A[%22[%5C%22288%5C%22]%22]%7D]]%2C%22c_constraints%22%3A[[]]%2C%22b_constraints%22%3A[[]]%2C%22metrik_view_params%22%3A%7B%22should_use_legacy_colors%22%3Afalse%2C%22columns_skip_formatting%22%3A[]%2C%22view%22%3A%22time_view%22%2C%22width%22%3A%221358%22%2C%22height%22%3A%22912%22%2C%22tableID%22%3A%22business_integrity_webpage_semantic%22%2C%22fitToContent%22%3Afalse%2C%22format_tooltip_in_percent%22%3Afalse%2C%22use_y_axis_hints_as_limits%22%3Atrue%2C%22has_dynamic_context_menu%22%3Atrue%2C%22has_context_menu%22%3Afalse%2C%22legend_mode%22%3A%22nongrid%22%2C%22connect_nulls%22%3Atrue%2C%22timezone_offset%22%3A420%2C%22timezone%22%3A%22America%2FLos_Angeles%22%2C%22y_min_hint%22%3A0%2C%22should_render_plugins_menu%22%3Afalse%7D%7D&view=samples_client&normalized=1599587280)

Reviewed By: iseeyuan

Differential Revision: D23548687

fbshipit-source-id: 3e63085663f5fd8de90a4c7dbad0a17947aee973
2020-09-10 15:26:33 -07:00
a2a81e1335 Add a CONTRIBUTING.md for the distributed package. (#44224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44224

The purpose of this file is to help developers on PT distributed get
upto speed on the code structure and layout for PT Distributed.
ghstack-source-id: 111644842

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D23548377

fbshipit-source-id: 561d5b8e257642de172def8fdcc1311fae20690b
2020-09-10 14:58:00 -07:00
4bead6438a Enable torch.autograd typechecks (#44451)
Summary:
To help with further typing, move dynamically added native contributions from `torch.autograd` to `torch._C._autograd`.
Fix an invalid error handling pattern in
89ac30afb8/torch/csrc/autograd/init.cpp (L13-L15)
`PyImport_ImportModule` already raises a Python exception, and nullptr should be returned to properly propagate it to the Python runtime.

All native methods/types in `torch/autograd/__init__.py` are now added after `torch._C._init_autograd()` has been called.
Use f-strings instead of `.format` in test_type_hints.py.
Fixes https://github.com/pytorch/pytorch/issues/44450

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44451

Reviewed By: ezyang

Differential Revision: D23618261

Pulled By: malfet

fbshipit-source-id: fa5f739d7cff8410641128b55b810318c5f636ae
2020-09-10 13:37:29 -07:00
cc5a1cf616 [JIT] Erase shapes before fallback graph (#44434)
Summary:
Previously the specialized types were copied over to the fallback function, even though the tensors reaching the fallback were not of those types.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44434

Reviewed By: SplitInfinity

Differential Revision: D23611943

Pulled By: eellison

fbshipit-source-id: 2ea88a97529409f6c5c4c1f59a14b623524933de
2020-09-10 12:07:31 -07:00
b3f0297a94 ConvPackedParams: remove legacy format (#43651)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43651

This is a forward compatibility follow-up to
https://github.com/pytorch/pytorch/pull/43086/. We switch the
conv serialization to output the v2 format instead of the v1 format.

The plan is to land this 1 - 2 weeks after the base PR.

Test Plan:
```
python test/test_quantization.py TestSerialization.test_conv2d_graph_v2
python test/test_quantization.py TestSerialization.test_conv2d_nobias_graph_v2
```

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23355480

fbshipit-source-id: 4cb04ed8b90a0e3e452297a411d641a15f6e625f
2020-09-10 11:47:34 -07:00
d232fec1f1 Partly fix cuda builds of dper broken by caffe2 c++
Summary:
CUDA builds using clang error out when building caffe2 due to an incorrect std::move

This does not fix all known errors, but it's a step in the right direction.

Differential Revision: D23626667

fbshipit-source-id: 7d9df886129f671ec430a166dd22e4af470afe1e
2020-09-10 11:37:49 -07:00
38c10b4f30 [NCCL] Fix the initialization of futureNCCLCallbackStreams (#44347)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44347

Cloned from Pull Request resolved: https://github.com/pytorch/pytorch/pull/44097, because the original author Sinan has completed the internship and is now unable to submit this diff.

As johnsonpaul mentioned in D23277575 (7d517cf96f), it looks like all processes were allocating memory on GPU 0.

I was able to reproduce it by running `test_ddp_comm_hook_allreduce_with_then_hook_nccl` unit test of `test_c10d.py` and running `nvidia-smi` while test was running. The issue was reproduced as:
```
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0   3132563      C   python                                       777MiB |
|    0   3132564      C   python                                       775MiB |
|    4   3132564      C   python                                       473MiB |
+-----------------------------------------------------------------------------+
```
I realized that as we initialize ProcessGroupNCCL, both processes were initially allocating memory on GPU 0.

We later also realized that I forgot the `isHighPriority` input of `getStreamFromPool`, and `futureNCCLCallbackStreams_.push_back(std::make_shared<at::cuda::CUDAStream>(at::cuda::getStreamFromPool(device_index)));` was just creating a vector of GPU 0 streams. After I changed `at::cuda::getStreamFromPool(device_index)` to `at::cuda::getStreamFromPool(false, device_index)`, `nvidia-smi` looked like:
```
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    673925      C   python                                       771MiB |
|    0    673926      C   python                                       771MiB |
|    1    673925      C   python                                       771MiB |
|    1    673926      C   python                                       771MiB |
|    2    673925      C   python                                       771MiB |
|    2    673926      C   python                                       771MiB |
|    3    673925      C   python                                       771MiB |
|    3    673926      C   python                                       771MiB |
|    4    673925      C   python                                       771MiB |
|    4    673926      C   python                                       771MiB |
|    5    673925      C   python                                       771MiB |
|    5    673926      C   python                                       771MiB |
|    6    673925      C   python                                       771MiB |
|    6    673926      C   python                                       771MiB |
|    7    673925      C   python                                       707MiB |
|    7    673926      C   python                                       623MiB |
+-----------------------------------------------------------------------------+
```
This confirms that we were just getting GPU 0 streams for the callback. I think this does not explain the `fp16_compress` stability issue, because we were able to reproduce that even without any `then` callback, just by copying from fp32 to fp16 before allreduce. However, this can explain other issues where `allreduce` was not on par with `no_hook`. I'll run some additional simulations with this diff.

I tried to replace `getStreamFromPool` with `getDefaultCUDAStream(deviceIndex)`, and it wasn't causing additional memory usage. In this diff, I temporarily solved the issue by just initializing null pointers for each device in the constructor and setting the callback stream for the corresponding devices inside `ProcessGroupNCCL::getNCCLComm`. After the fix it looks like the memory issue was resolved:
```
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0   2513142      C   python                                       745MiB |
|    4   2513144      C   python                                       747MiB |
+-----------------------------------------------------------------------------+
```
I could use a dictionary instead of a vector for `futureNCCLCallbackStreams_`, but since the number of devices is fixed, I think it isn't necessary. Please let me know what you think in the comments.
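A rough Python analogue of the intended structure (the actual fix lives in the C++ `ProcessGroupNCCL`; the list name here is illustrative): the callback-stream container must hold one stream per device, each created on its own device, instead of repeated GPU 0 streams.

```
import torch

# one callback stream per CUDA device, each created on its own device;
# the bug was effectively building this list with every stream on device 0
callback_streams = [
    torch.cuda.Stream(device=idx) for idx in range(torch.cuda.device_count())
]
```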
ghstack-source-id: 111485483

Test Plan:
`test_c10d.py` and some perf tests. Also check `nvidia-smi` while running tests to validate memory looks okay.

This diff also fixes the regression in HPC tests as we register a hook:

{F322730175}

See https://fb.quip.com/IGuaAbD8 (474fdd7e2d)bnvy for details.

Reviewed By: pritamdamania87

Differential Revision: D23495436

fbshipit-source-id: ad08e1d94343252224595d7c8a279fe75e244822
2020-09-10 11:25:38 -07:00
cb90fef770 Fix return value of PyErr_WarnEx ignored (SystemError) (#44371)
Summary:
This PR fixes unexpected `SystemError` when warnings are emitted and warning filters are set.

## Current behavior

```
$ python -Werror
>>> import torch
>>> torch.range(1, 3)
UserWarning: torch.range is deprecated in favor of torch.arange and will be removed in 0.5. Note that arange generates values in [start; end), not [start; end].

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
SystemError: <built-in method range of type object at 0x7f38c7703a60> returned a result with an error set
```

## Expected behavior

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UserWarning: torch.range is deprecated and will be removed in a future release because its behavior is inconsistent with Python's range builtin. Instead, use torch.arange, which produces values in [start, end).
```

## Note

A Python exception must be raised if `PyErr_WarnEx` returns `-1` ([python docs](https://docs.python.org/3/c-api/exceptions.html#issuing-warnings)). This PR fixes warnings raised in the following code:
```py
import torch

torch.range(1, 3)
torch.autograd.Variable().volatile
torch.autograd.Variable().volatile = True
torch.tensor(torch.tensor([]))
torch.tensor([]).new_tensor(torch.tensor([]))
```
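A minimal check of the expected behavior after the fix, using the in-process equivalent of `python -Werror` (a sketch; the exact deprecation message may differ across versions):

```
import warnings

import torch

warnings.simplefilter("error")  # in-process equivalent of `python -Werror`
try:
    torch.range(1, 3)  # deprecated; emits a UserWarning
except UserWarning:
    print("warning surfaced as a plain Python exception, no SystemError")
```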

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44371

Reviewed By: mrshenli

Differential Revision: D23598410

Pulled By: albanD

fbshipit-source-id: 2fbcb13fe4025dbebaf1fd837d4c8e0944e05010
2020-09-10 10:15:21 -07:00
f9a0d0c21e Allow Tensor-likes in torch.autograd.gradcheck (#43877)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43877

Reviewed By: zou3519

Differential Revision: D23493257

Pulled By: ezyang

fbshipit-source-id: 6cdaabe17157b484e9491189706ccc15420ac239
2020-09-10 09:02:17 -07:00
c8914afdfa Merge criterion_tests and new_criterion_tests. (#44398)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44398

These end up executing the same tests, so no reason to have them separate.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23600855

Pulled By: gchanan

fbshipit-source-id: 0952492771498bf813f1bf8e1d7c8dce574ec965
2020-09-10 08:29:59 -07:00
fa158c4ca6 Combine criterion and new criterion tests in test_jit. (#43958)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43958

There is not any difference between these tests (I'm merging them), so let's merge them in the JIT as well.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23452337

Pulled By: gchanan

fbshipit-source-id: e6d13cdb164205eec3dbb7cdcd0052b02c961778
2020-09-10 08:28:14 -07:00
af9cad761a Stop ignoring NotImplementedErrors in cuda CriterionTests. (#44381)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44381

Perhaps this was necessary when the test was originally introduced, but it's difficult to figure out what is actually tested.  And I don't think we actually use NotImplementedErrors.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23598646

Pulled By: gchanan

fbshipit-source-id: aa18154bfc4969cca22323e61683a301198823be
2020-09-10 08:18:33 -07:00
208ad45b4b fix scripts (#44464)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44464

Reviewed By: agolynski

Differential Revision: D23624921

Pulled By: colesbury

fbshipit-source-id: 72bed69edcf467a99eda9a3b97e894015c992dce
2020-09-10 08:13:48 -07:00
356aa54694 [Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D23621463

fbshipit-source-id: 1cd7e94e480c7073c9a0aad55aeba98de4b96164
2020-09-10 04:24:43 -07:00
6c98d904c0 handle the case of -0.0 on tanh quantization (#44406)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44406

this fix makes fakelowp identical to hw

- mask out the floating-point number with 0x7fff so we are always dealing with positive numbers
- the DSP implementation is correct; ice-ref suffers from this same problem
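A hypothetical sketch of the masking, using NumPy for the fp16 bit manipulation (`fp16_magnitude_bits` is an illustrative name, not the fakelowp function):

```
import numpy as np

def fp16_magnitude_bits(h):
    # h is a uint16 holding a raw fp16 bit pattern; clear the sign bit
    return np.uint16(h & 0x7FFF)

neg_zero = np.float16(-0.0).view(np.uint16)  # bit pattern 0x8000
pos_zero = np.float16(+0.0).view(np.uint16)  # bit pattern 0x0000
assert fp16_magnitude_bits(neg_zero) == fp16_magnitude_bits(pos_zero)
```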

Test Plan: - tested with test_fusions.py, can't enable the test until the fix in ice-ref appears

Reviewed By: venkatacrc

Differential Revision: D23603878

fbshipit-source-id: a72d93a4bc811f98d1b5e82ddb204be028addfeb
2020-09-10 01:18:45 -07:00
28a23fce4c Deprecate torch.norm and torch.functional.norm (#44321)
Summary:
Part of https://github.com/pytorch/pytorch/issues/24802

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44321

Reviewed By: mrshenli

Differential Revision: D23617273

Pulled By: mruberry

fbshipit-source-id: 6f88b5cb097fd0acb9cf0e415172c5a86f94e9f2
2020-09-10 01:16:41 -07:00
7b547f086f To fix extra memory allocation when using circular padding (#39273)
Summary:
For fixing https://github.com/pytorch/pytorch/issues/39256

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39273

Reviewed By: anjali411

Differential Revision: D23471811

Pulled By: mruberry

fbshipit-source-id: fb324b51baea765311715cdf14642b334f335733
2020-09-10 00:15:31 -07:00
65d4a6b7c0 [ROCm] fix cub hipify mappings (#44431)
Summary:
Fixes ROCm-specific workarounds introduced by https://github.com/pytorch/pytorch/issues/44259.  This adds new hipify mappings that properly handle cub outside of caffe2 sources.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44431

Reviewed By: mrshenli

Differential Revision: D23617417

Pulled By: ngimel

fbshipit-source-id: 5d16afb6b8e6ec5ed049c51571866b0878d534ca
2020-09-09 23:39:25 -07:00
28bd4929bd [NNC] Make it able to normalize loop with variable start (#44133)
Summary:
Loops with variable start can also be normalized.
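A sketch of the transformation in plain Python (the actual pass rewrites NNC IR, not Python loops):

```
def body(i):
    print(i)

start, stop = 3, 7  # `start` no longer needs to be a constant

# original loop:          for i in range(start, stop): body(i)
# normalized equivalent:  iteration space shifted to begin at 0
for j in range(stop - start):
    body(j + start)
```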

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44133

Test Plan: updated testNormalizeStartVariable.

Reviewed By: navahgar

Differential Revision: D23507097

Pulled By: cheng-chang

fbshipit-source-id: 4e9aad1cd4f4a839f59a00bf8ddf97637a1a6648
2020-09-09 23:05:57 -07:00
c515881137 Add reset_grad() function (#44423)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44423

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42754

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23010859

Pulled By: ngimel

fbshipit-source-id: 56eec43eba88b98cbf714841813977c68f983564
2020-09-09 22:05:45 -07:00
6324ef4ced [caffe2] Speed up compilation of aten-op.cc (#44440)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44440

`aten-op.cc` takes a long time to compile due to the large generated constructor. For each case, the `std::function` constructor and the initialization functions are inlined, producing a huge amount of intermediate code that takes a long time to optimize, given that many compiler optimization passes are superlinear in the function size.

This diff moves each case to a separate function, so that each one is cheap to optimize, and the constructor is just a large jump table, which is easy to optimize.

Reviewed By: dzhulgakov

Differential Revision: D23593741

fbshipit-source-id: 1ce7a31cda10d9b0c9d799716ea312a291dc0d36
2020-09-09 21:21:48 -07:00
89ac30afb8 [JIT] Propagate type sharing setting to submodule compilation (#44226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44226

**Summary**
At present, the `share_types` argument to `create_script_module` is used
to decide whether to reuse a previously created type for a top-level
module that has not yet been compiled. However, that setting does not apply
to the compilation of submodules of the top-level module; types are
still reused if possible.

This commit modifies `create_script_module` so that the `share_types`
flag is honoured during submodule compilation as well.

**Test Plan**
This commit adds a unit test to `TestTypeSharing` that checks that
submodule types are not shared or reused when `share_types` is set to
`False`.

**Fixes**
This commit fixes #43605.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23602371

Pulled By: SplitInfinity

fbshipit-source-id: b909b8b6abbe3b4cb9be8319ac263ade90e83bd3
2020-09-09 20:06:35 -07:00
d3b6d5caf1 [JIT] Add support for del to TS classes (#44352)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44352

**Summary**
This commit adds support for `del` with class instances. If a class
implements `__delitem__`, then `del class_instance[key]` is syntactic
sugar for `class_instance.__delitem__(key)`.
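A plain-Python illustration of the desugaring (the commit makes the same sugar work for TorchScript classes):

```
class Table:
    def __init__(self):
        self.data = {"a": 1, "b": 2}

    def __delitem__(self, key):
        del self.data[key]

t = Table()
del t["a"]  # desugars to t.__delitem__("a")
assert "a" not in t.data
```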

**Test Plan**
This commit adds a unit test to TestClassTypes to test this feature.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23603102

Pulled By: SplitInfinity

fbshipit-source-id: 28ad26ddc9a693a58a6c48a0e853a1c7cf5c9fd6
2020-09-09 19:52:35 -07:00
058d7228ec Expose the interface of nesterov of SGD Optimizer from caffe2 to dper
Summary:
Expose the interface of `nesterov` of the SGD optimizer from caffe2 to dper.

The dper SGD optimizer (https://fburl.com/diffusion/chpobg0h) refers to the NAG SGD optimizer in caffe2 (https://fburl.com/diffusion/uat2lnan), so we just need to add the parameter 'nesterov' to the dper SGD optimizer.
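For reference, a minimal sketch of the SGD momentum update with and without Nesterov (standard textbook formulation; names and hyperparameters are illustrative, not the caffe2 operator's signature):

```
def sgd_step(param, grad, buf, lr=0.1, momentum=0.9, nesterov=True):
    buf = momentum * buf + grad                          # momentum buffer
    update = grad + momentum * buf if nesterov else buf  # NAG look-ahead
    return param - lr * update, buf

p, b = 1.0, 0.0
p, b = sgd_step(p, grad=0.5, buf=b)
```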

Analysis of run results: N345540.

- train_ne increases as momentum (m) decreases.
- for m=0.95, 0.9: eval_ne is lower with NAG than production (no NAG, m = 0.95).
- for m=0.99: eval_ne with or without NAG is higher than production, indicating larger variance in validation and overfitting in training (lower train_ne).

Test Plan:
1. Unit tests:
`buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test -- test_sgd_without_nesterov`
`buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test -- test_sgd_with_nesterov`
.
2. Build dper front end package: `flow-cli canary   ads.dper3.workflows.sparse_nn.train --mode opt --entitlement      ads_global --run-as-secure-group      team_ads_ml_ranking`. The build result (refreshed) is here https://www.internalfb.com/intern/buck/build/2a368b55-d94b-45c1-8617-2753fbce994b. Flow package version is ads_dper3.canary:856b545cc6b249c0bd328f845adeb0d2.
.
3. Build dper back end package: `flow-cli canary  dper.workflows.dper3.train --mode opt --entitlement      ads_global --run-as-secure-group      team_ads_ml_ranking`. The build result (refreshed) is here: https://www.internalfb.com/intern/buck/build/70fa91cd-bf6e-4a08-8a4d-41e41a77fb52. Flow package version is aml.dper2.canary:84123a34be914dfe86b1ffd9925869de.
.
4. Compare prod with NAG-enabled runs:
a) refreshed prod run (m=0.95): f213877098
NAG enabled run (m=0.95): f213887113
.
b) prod run (m=0.9): f214065288
NAG enabled run (m=0.9): f214066319
.
c) prod run (m=0.99): f214065804
NAG enabled run (m=0.99): f214066725
.
d) changed the data type of `nesterov` to `bool` and launched a validation run
NAG enabled (m=0.95): f214500597

Reviewed By: ustctf

Differential Revision: D23152229

fbshipit-source-id: 61703ef6b4e72277f4c73171640fb8afc6d31f3c
2020-09-09 19:37:00 -07:00
5ee31308e6 [caffe2] exposes Net cancellation through pybind state (#44043)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44043

To invoke `cancel` from the net instance in Python, we expose it through pybind state.

Reviewed By: dzhulgakov

Differential Revision: D23249660

fbshipit-source-id: 45a1e9062dca811746fcf2e5e42199da8f76bb54
2020-09-09 18:13:13 -07:00
e028ad0762 Fix HashStoreTests and move to Gtest (#43384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43384

Much like the FileStoreTests, the HashStoreTests were also run as a single blob and threw exceptions upon failure. This modularizes the suite by splitting each function into a separate gtest test case.
ghstack-source-id: 111690834

Test Plan: Confirmed that the tests pass on devvm.

Reviewed By: jiayisuse

Differential Revision: D23257579

fbshipit-source-id: 7e821f0e9ee74c8b815f06facddfdb7dc2724294
2020-09-09 17:56:33 -07:00
69a3ff005d Modularize FileStoreTest and move to Gtest (#43383)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43383

FileStore Test currently has a large blob of tests that throw
exceptions upon failure. This PR modularizes each test so they can run
independently, and migrates the framework to gtest.
ghstack-source-id: 111690831

Test Plan: Confirmed tests pass on devvm

Reviewed By: jiayisuse

Differential Revision: D22879473

fbshipit-source-id: 6fa5468e594a53c9a6b972757068dfc41645703e
2020-09-09 17:56:30 -07:00
a7fba7de22 Convert StoreTestUtils to Gtest (#43382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43382

StoreTestCommon defines standard helper functions that are used by all of our Store tests. These helpers currently throw exceptions upon failure; this PR changes them to use gtest assertions instead.
ghstack-source-id: 111690833

Test Plan: Tested the 2 PR's above this on devvm

Reviewed By: jiayisuse

Differential Revision: D22828156

fbshipit-source-id: 9e116cf2904e05ac0342a441e483501e00aad3dd
2020-09-09 17:55:25 -07:00
b69c28d02c Improving ModuleList indexing error msg (#43361)
Summary:
Follow-up to https://github.com/pytorch/pytorch/pull/41946/, suggesting enumeration over the module as an alternative when a user tries to index into a ModuleList/Sequential with a non-integer literal; a sketch of the pattern follows.
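A sketch of the suggested pattern (module and parameter names are illustrative):

```
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(4, 4) for _ in range(3))

    def forward(self, x, pick: int):
        # enumerate instead of self.layers[pick], which scripting rejects
        # for non-constant indices
        for i, layer in enumerate(self.layers):
            if i == pick:
                x = layer(x)
        return x

scripted = torch.jit.script(Net())
```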

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43361

Reviewed By: mrshenli

Differential Revision: D23602388

Pulled By: eellison

fbshipit-source-id: 51fa28d5bc45720529b3d45e92d367ee6c9e3316
2020-09-09 16:22:57 -07:00
c010ef7f0c use non-overflowing divide in cuda kernel util GET_BLOCKS (#44391)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43476.
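A sketch of the arithmetic in Python (the overflow only matters for the C++ `int` math in the CUDA helper; the thread count is illustrative):

```
def get_blocks(n: int, threads_per_block: int = 1024) -> int:
    # (n + threads_per_block - 1) // threads_per_block can overflow a C++
    # int when n is near INT_MAX; this form cannot, for n >= 1
    assert n >= 1
    return (n - 1) // threads_per_block + 1

assert get_blocks(1) == 1
assert get_blocks(1024) == 1
assert get_blocks(1025) == 2
```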

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44391

Reviewed By: mrshenli

Differential Revision: D23602424

Pulled By: walterddr

fbshipit-source-id: 40ed81547f933194ce5bf4a5bcebdb3434298bc1
2020-09-09 16:20:41 -07:00
ba6ddaf04c [pyper] export caffe2 bucketize GPU operator to pytorch
Summary: Exporting the Bucketize operator on CUDA. Also adding unit test.

Test Plan: buck test mode/dev-nosan caffe2/torch/fb/sparsenn:gpu_test -- test_bucketize

Differential Revision: D23581321

fbshipit-source-id: 7f21862984c04d840410b8718db93006f526938a
2020-09-09 16:08:53 -07:00
e0c65abd38 Revert D23568330: [pytorch][PR] Moves some of TestTorchMathOps to OpInfos
Test Plan: revert-hammer

Differential Revision:
D23568330 (a953a825cc)

Original commit changeset: 03e69fccdbfd

fbshipit-source-id: 04ec6843c5eb3c84ddf226dad0088172d9bed84d
2020-09-09 15:48:56 -07:00
fc51047af5 Small fixes in Dependency.cmake and run_test.py (#44414)
Summary:
Do not add gencode flags to NVCC_FLAGS twice: they are first added in `cmake/public/cuda.cmake`, so there is no need to do it again in `cmake/Dependencies.cmake`.
Copy `additional_unittest_args` before appending local options to it in the `run_test()` method.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44414

Reviewed By: seemethere

Differential Revision: D23605733

Pulled By: malfet

fbshipit-source-id: 782a0da61650356a978a892fb03c66cb1a1ea26b
2020-09-09 15:09:33 -07:00
b0bcdbb1ab [JIT] Support partially specified sizes/strides in IRParser (#44113)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44113

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23508149

Pulled By: Lilyjjo

fbshipit-source-id: b6b2d32109fae599bc5347dae742b67a2e4a0a49
2020-09-09 14:45:51 -07:00
3674264947 [quant] quantized path for ConstantPadNd (#43304)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43304

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23231946

Pulled By: z-a-f

fbshipit-source-id: 8c77f9a81f5a36c268467a190b5b954df0a8f5a4
2020-09-09 14:04:41 -07:00
032480d365 fix typo in embedding_bag_non_contiguous_weight test (#44382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44382

This fixes a typo that was introduced in #44032.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23601316

Pulled By: glaringlee

fbshipit-source-id: 17d6de5900443ea46c7a6ee9c7614fe6f2d92890
2020-09-09 13:30:36 -07:00
a00d36b0e7 [PyTorch][Mobile] Insert the module name as name() to metadata dict if metadata doesn't contain "model_name" (#44400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44400

This diff does the same thing as D23549149 (398409f072), with a fix included for the OSS CI job pytorch_windows_vs2019_py36_cuda10.1_test1.
ghstack-source-id: 111679745

Test Plan:
- CI
- OSS CI

Reviewed By: xcheng16

Differential Revision: D23601050

fbshipit-source-id: 8ebdcd8fdc5865078889b54b0baeb397a90ddc40
2020-09-09 13:01:17 -07:00
24efd29d19 Check commutativity for computed dispatch table and add a test to check entries. (#44088)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44088

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23492793

Pulled By: ailzhang

fbshipit-source-id: 37502f2a8a4d755219b400fcbb029e49d6cdb6e9
2020-09-09 12:48:34 -07:00
48c47db8fe [NCCL] Add Environment Variable to guard Async Error Handling feature (#44163)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44163

In this PR, we introduce a new environment variable
(NCCL_ASYNC_ERROR_HANDLING), which guards the asynchronous error handling
feature. We intend to eventually turn this feature on by default for all users,
but this temporary guard ensures that the change in behavior from hanging to
crashing is not suddenly forced on all users.
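A sketch of how a user would opt in (the variable must be set before the NCCL process group is initialized; `rank`/`world_size` are placeholders):

```
import os

# opt in: timed-out collectives will raise instead of hanging
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"

import torch.distributed as dist
# dist.init_process_group("nccl", rank=rank, world_size=world_size)
```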
ghstack-source-id: 111637788

Test Plan:
CI/Sandcastle. We will turn on this env var by default in
torchelastic and HPC trainer soon.

Reviewed By: jiayisuse

Differential Revision: D23517895

fbshipit-source-id: e7cd244b2ddf2dc0800ff7df33c73a6f00b63dcc
2020-09-09 12:26:25 -07:00
211ece7267 [NCCL] ProcessGroupNCCL Destructor Blocks on WorkNCCL Completion (#41054)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41054

**This Commit:**
ProcessGroupNCCL destructor now blocks until all WorkNCCL objects have either been aborted or completed and removed from the work vector.

**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.

ghstack-source-id: 111614314

Test Plan:
1. **DDP Sanity Check**: First we have a sanity check based on the PyTorch DDP benchmark. This verifies that the baseline DDP training with NCCL for standard CU workloads works well (esp. with standard models like Resnet50 and BERT). Here is a sample Flow: f213293473

2. **HPC Performance Benchmarks**: This stack has undergone thorough testing and profiling on the Training Cluster with varying numbers of nodes. It introduces only a 1-1.5% QPS regression (~200-400 QPS for 8-64 GPUs).

3. **HPC Accuracy Benchmarks**: We've confirmed NE parity with the existing NCCL/DDP stack without this change.

4. **Kernel-Specific Benchmarks**: We have profiled other approaches for this system (such as cudaStreamAddCallback) and performed microbenchmarks to confirm the current solution is optimal.

5. **Sandcastle/CI**: Apart from the recently fixed ProcessGroupNCCL tests, we will also introduce a new test for desynchronization scenarios.

Reviewed By: jiayisuse

Differential Revision: D22054298

fbshipit-source-id: 2b95a4430a4c9e9348611fd9cbcb476096183c06
2020-09-09 12:26:22 -07:00
afbf2f140b [NCCL] WorkNCCL Helper Functions (#41053)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41053

**This Commit:**
Some minor refactoring: added a helper to check whether `WorkNCCL` objects have timed out, added a new finish function to ProcessGroupNCCL::WorkNCCL that avoids notifying the CV and uses `lock_guard`, and renamed the timeoutCVMutex mutex to be more descriptive.

**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.

ghstack-source-id: 111614315

Test Plan: See D22054298 for verification of correctness and performance

Reviewed By: jiayisuse

Differential Revision: D21943520

fbshipit-source-id: b27ee329f0da6465857204ee9d87953ed6072cbb
2020-09-09 12:26:18 -07:00
f8f7b7840d [NCCL] Abort Errored and Timed Out NCCL Communicators from Watchdog Thread (#41052)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41052

**This Commit:**
The watchdog thread checks for errored or timed-out `WorkNCCL` objects and aborts all associated NCCL communicators. For now, we also process these aborted communicators as with the existing watchdog logic (by adding them to abortedCommIds and writing aborted communicator ids to the store).

**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.

ghstack-source-id: 111614313

Test Plan: See D22054298 for verification of correctness and performance

Reviewed By: jiayisuse

Differential Revision: D21943151

fbshipit-source-id: 337bfcb8af7542c451f1e4b3dcdfc5870bdec453
2020-09-09 12:26:15 -07:00
4e5c55ef69 [NCCL] Use cudaEventQuery to Poll for GPU operation errors (#41051)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41051

**This Commit:**
In the workCleanupThread, we process completion and exception handling for workNCCL objects corresponding to collective calls that have either completed GPU Execution, or have already thrown an exception. This way, we throw an exception from the workCleanupThread for failed GPU operations. This approach replaces the previous (and lower performance) approach of enqueuing a callback on the CUDA stream to process failures.

**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.

ghstack-source-id: 111614319

Test Plan: See D22054298 for verification of correctness and performance

Reviewed By: jiayisuse

Differential Revision: D21938498

fbshipit-source-id: df598365031ff210afba57e0c7be865e3323ca07
2020-09-09 12:26:12 -07:00
1df24fd457 [NCCL] Timeout Loop Thread for Async Error Handling (#41050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41050

**This Commit:**
We introduce a workVector to track live workNCCL objects corresponding to collective operations. Further, we introduce a workCleanupLoop, which busy-polls the vector of workNCCL objects and removes them upon completion.

**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.

Test Plan: See D22054298 for verification of correctness and performance

Reviewed By: jiayisuse

Differential Revision: D21916637

fbshipit-source-id: f8cadaab0071aaad1c4e31f9b089aa23cba0cfbe
2020-09-09 12:25:06 -07:00
15cbd1cf4b Preserve .ninja_log in build artifacts (#44390)
Summary:
Helpful for later analysis of build-time trends.
Also, save .whl files out of the regular Linux build job.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44390

Reviewed By: walterddr

Differential Revision: D23602049

Pulled By: malfet

fbshipit-source-id: 4d55c9aa2d161a7998ad991a3da0436da83f70ad
2020-09-09 12:19:46 -07:00
ef4475f902 [Reland] Optimize code path for adaptive_avg_pool2d when output size is (1, 1) (#44211)
Summary:
Reland of https://github.com/pytorch/pytorch/issues/43986

DO NOT MERGE YET. XLA failure seems real.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44211

Reviewed By: mrshenli

Differential Revision: D23590505

Pulled By: ngimel

fbshipit-source-id: 6ee516b0995bfff6efaf740474c82cb23055d274
2020-09-09 12:08:14 -07:00
37093f4d99 Benchmarks: make fuser and executor configurable from command line. (#44291)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44291

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23569089

Pulled By: ZolotukhinM

fbshipit-source-id: ec25b2f0bba303adaa46c3e85b1a9ce4fa3cf076
2020-09-09 11:59:35 -07:00
364d03a67c Misc. FakeLowP OSS cleanup (#44331)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44331

[10:22 AM] Cherckez, Tal
summary of issues (just to have a clear list):
* std::clamp forces the user to use c++17
* using `settings` without `given` fails the test
* avoid using `max_examples` for tests

(Note: this ignores all push blocking failures!)

Test Plan: https://www.internalfb.com/intern/testinfra/testconsole/testrun/6192449509073222/

Reviewed By: hyuen

Differential Revision: D23581440

fbshipit-source-id: fe9fbc341f8fca02352f531cc622fc1035d0300c
2020-09-09 11:53:43 -07:00
758c2b96f5 BUG: make cholesky_solve_out do broadcast, error checking (#43137)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42695

Add a test, and fix `cholesky_solve_out` to use the error checking and broadcasting from `cholesky_solve`. The test segfaults before and passes after the fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43137

Reviewed By: izdeby

Differential Revision: D23568589

Pulled By: malfet

fbshipit-source-id: 41b67ba964b55e59f1897eef0d96e0f6e1725bef
2020-09-09 11:38:36 -07:00
683380fc91 Use compile time cudnn version if linking with it statically (#44402)
Summary:
This should prevent torch_python from linking the entire cudnn library statically just to query its version

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44402

Reviewed By: seemethere

Differential Revision: D23602720

Pulled By: malfet

fbshipit-source-id: 185b15b789bd48b1df178120801d140ea54ba569
2020-09-09 11:33:41 -07:00
6ec8fabc29 Fix frac in CUDA fuser (#44152)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44152

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23528506

fbshipit-source-id: bfd468d72fa55ce317f88ae83e1f2d5eee041aa0
2020-09-09 11:10:08 -07:00
350130a69d Prevent the TE fuser from getting datatypes it can't handle (#44160)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44160

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D23528508

Pulled By: bertmaher

fbshipit-source-id: 03b22725fb2666f441cb504b35397ea6d155bb85
2020-09-09 11:10:04 -07:00
960c088a58 [te] Fix casting of unsigned char, and abs(int) (#44157)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44157

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D23528507

Pulled By: bertmaher

fbshipit-source-id: c5ef0422a91a4665b616601bed8b7cd137be39f9
2020-09-09 11:08:36 -07:00
7c464eed16 Skipping CUDA tests in ProcessGroupGloo and logs (#42488)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42488

Currently, ProcessGroupGloo tests do not emit logs if the test was
skipped due to CUDA not being available or there not being enough CUDA devices.
This PR clarifies the reason for skipping through these logs.
ghstack-source-id: 111638111

Test Plan: tested on devvm and devgpu

Reviewed By: jiayisuse

Differential Revision: D22879396

fbshipit-source-id: d483ca46b5e22ed986521262c11a1c6dbfbe7efd
2020-09-09 10:52:52 -07:00
2a87742ffa Autocast wrappers for RNN cell apis (#44296)
Summary:
Should fix https://github.com/pytorch/pytorch/issues/42605.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44296

Reviewed By: izdeby

Differential Revision: D23580447

Pulled By: ezyang

fbshipit-source-id: 86027b693fd2b648f043ab781b84ffcc1f72854d
2020-09-09 09:44:59 -07:00
a953a825cc Moves some of TestTorchMathOps to OpInfos (#44277)
Summary:
This PR fixes three OpInfo-related bugs and moves some functions from TestTorchMathOps to be tested using the OpInfo pattern. The bugs are:

- A skip test path in test_ops.py incorrectly formatted its string argument
- Decorating the tests in common_device_type.py was incorrectly always applying decorators to the original test, not the op-specific variant of the test. This could cause the same decorator to be applied multiple times, overriding past applications.
- make_tensor was incorrectly constructing tensors in some cases

The functions moved are:

- asin
- asinh
- sinh
- acosh
- tan
- atan
- atanh
- tanh
- log
- log10
- log1p
- log2

In a follow-up PR more or all of the remaining functions in TestTorchMathOps will be refactored as OpInfo-based tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44277

Reviewed By: ngimel

Differential Revision: D23568330

Pulled By: mruberry

fbshipit-source-id: 03e69fccdbfd560217c34ce4e9a5f20e10d05a5e
2020-09-09 09:41:03 -07:00
f044b17ae2 Disable a test (#44348)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44348

Reviewed By: mrshenli

Differential Revision: D23592524

Pulled By: Krovatkin

fbshipit-source-id: 349057606ce39dd5de24314c9ba8f40516d2ae1c
2020-09-09 08:36:19 -07:00
cfd3620b76 Don't use VCOMP if Intel OMP is used (#44280)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44096.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44280

Reviewed By: malfet

Differential Revision: D23568557

Pulled By: ezyang

fbshipit-source-id: bd627e497a9f71be9ba908852bf3ae437b1a5c94
2020-09-09 08:12:34 -07:00
d23f3170ef Remove pybind11 from required submodules (#44278)
Summary:
pybind11 can be taken from the system, in which case it is not used from the submodule. Hence the check here limits the usage unnecessarily.

ccing malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44278

Reviewed By: malfet

Differential Revision: D23568552

Pulled By: ezyang

fbshipit-source-id: 7fd2613251567f649b12eca0b1fe7663db9cb58d
2020-09-09 08:07:13 -07:00
8acce55015 Dump optimized graph when logging in already-optimized PE (#44315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44315

I find it more intuitive to dump the optimized graph if we have one;
when I first saw the unoptimized graph being dumped I thought we had failed to
apply any optimizations.

Test Plan: Observe output by hand

Reviewed By: Lilyjjo

Differential Revision: D23578813

Pulled By: bertmaher

fbshipit-source-id: e2161189fb0e1cd53aae980a153aea610871662a
2020-09-09 01:28:48 -07:00
7a64b0c27a Export Node::isBefore/isAfter for PythonAPI (#44162)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44162

This diff exports the Node::isBefore/isAfter methods to the Python API.
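A hedged usage sketch, assuming the bindings are exposed as `isBefore`/`isAfter` on JIT graph nodes as the title suggests:

```
import torch

@torch.jit.script
def f(x):
    y = x + 1
    return y * 2

nodes = list(f.graph.nodes())
assert nodes[0].isBefore(nodes[1])
assert nodes[1].isAfter(nodes[0])
```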

Test Plan: Tested locally. Please let me know if there is a set of unit tests to be passed.

Reviewed By: soumith

Differential Revision: D23514448

fbshipit-source-id: 7ef709b036370217ffebef52fd93fbd68c464e89
2020-09-09 00:57:08 -07:00
135ebbde6d [Caffe2] Add RMSNormOp (#44338)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44338

Add RMSNormOp in Caffe2
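For reference, a minimal sketch of RMS normalization (assuming the standard formulation; the Caffe2 operator's exact signature and defaults may differ):

```
import torch

def rms_norm(x, gamma, beta, eps=1e-6):
    # normalize by the root mean square over the last dimension
    rms = x.pow(2).mean(dim=-1, keepdim=True).add(eps).sqrt()
    return x / rms * gamma + beta

x = torch.randn(2, 8)
y = rms_norm(x, torch.ones(8), torch.zeros(8))
```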

Test Plan: buck test mode/dev-nosan //caffe2/caffe2/python/operator_test:rms_norm_op_test

Reviewed By: houseroad

Differential Revision: D23546424

fbshipit-source-id: 8f3940a0bb42230bfa647dc66b5e359cc84491c6
2020-09-08 23:50:44 -07:00
106459acac Rename test_distributed to test_distributed_fork (#42932)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42932

Follow up from https://github.com/pytorch/pytorch/pull/41769, rename `test_distributed` to `test_distributed_fork` to make it explicit that it forks.

New command to run test:
`python test/run_test.py -i distributed/test_distributed_fork -v`
ghstack-source-id: 111632568

Test Plan: `python test/run_test.py -i distributed/test_distributed_fork -v`

Reviewed By: izdeby

Differential Revision: D23072201

fbshipit-source-id: 48581688b6c5193a309e803c3de38e70be980872
2020-09-08 23:13:37 -07:00
b22abbe381 Enable test_distributed to work with spawn mode (#41769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41769

Currently the tests in `test_distributed` only work with the `fork` mode multiprocessing, this PR introduces support for `spawn` mode multiprocessing as well (while keeping the `fork` mode intact).

Motivations for the change:
1) Spawn multiprocessing is the default on MacOS, so it better emulates how MacOS users would use distributed
2) With python 3.8+, spawn is the default on linux, so we should have test coverage for this
3) PT multiprocessing suggests using spawn/forkserver over fork, for sharing cuda tensors: https://pytorch.org/docs/stable/multiprocessing.html
4) Spawn is better supported with respect to certain sanitizers such as TSAN, so adding this sanitizer coverage may help us uncover issues.

How it is done:
1) Move `test_distributed` tests in `_DistTestBase` class to a shared file `distributed_test` (similar to how the RPC tests are structured)
2) For `Barrier`, refactor the setup of temp directories, as the current version did not work with spawn, each process would get a different randomly generated directory and thus would write to different barriers.
3) Add all the relevant builds to run internally and in OSS.
Running test_distributed with spawn mode in OSS can be done with:
`python test/run_test.py -i distributed/test_distributed_spawn -v`

Reviewed By: izdeby

Differential Revision: D22408023

fbshipit-source-id: e206be16961fd80438f995e221f18139d7e6d2a9
2020-09-08 23:11:12 -07:00
1d01fcdc24 [quant] fill_ path for quantized tensors (#43303)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43303

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23231947

Pulled By: z-a-f

fbshipit-source-id: fd5110ff15a073f326ef590436f8c6e5a2608324
2020-09-08 21:34:06 -07:00
4aacfab221 Resolve Autograd key for disable_variable_dispatch flag. (#44268)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44268

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23561042

Pulled By: ailzhang

fbshipit-source-id: 6f35cd9a543bea3f9e294584f1db7c3622ebb741
2020-09-08 21:27:52 -07:00
ecc6358dbe Port nonzero cuda from THC to ATen (#44259)
Summary:
1) Ports nonzero from THC to ATen
2) replaces most thrust uses with cub, to avoid synchronization and to improve performance. There is still one necessary synchronization point, communicating the number of nonzero elements from GPU to CPU
3) slightly changes the algorithm: now we first compute the number of nonzeros and then allocate a correctly sized output, instead of allocating a full-sized output as was done before to account for possibly all elements being non-zero
4) unfortunately, since the last transforms are still done with thrust, 2) is slightly beside the point; however, it is a step towards a future without thrust
5) hard limits the number of elements in the input tensor to MAX_INT. The previous implementation allocated a Long tensor of size ndim*nelements, so that would be at least 16 GB for a tensor with MAX_INT elements. It is reasonable to say that larger tensors could not be used anyway.

Benchmarking is done for tensors with approximately half non-zeros
<details><summary>Benchmarking script</summary>
<p>

```
import torch
from torch.utils._benchmark import Timer
from torch.utils._benchmark import Compare
import sys

device = "cuda"
results = []
for numel in (1024 * 128,):#, 1024 * 1024, 1024 * 1024 * 128):
    inp = torch.randint(2, (numel,), device="cuda", dtype=torch.float)
    for ndim in range(2,3):#(1,4):
        if ndim == 1:
            shape = (numel,)
        elif ndim == 2:
            shape = (1024, numel // 1024)
        else:
            shape = (1024, 128, numel // 1024 // 128)
        inp = inp.reshape(shape)
        repeats = 3
        timer = Timer(stmt="torch.nonzero(inp, as_tuple=False)", label="Nonzero", sub_label=f"number of elts {numel}",
        description = f"ndim {ndim}", globals=globals())
        for i in range(repeats):
            results.append(timer.blocked_autorange())
        print(f"\rnumel {numel} ndim {ndim}", end="")
        sys.stdout.flush()

comparison = Compare(results)
comparison.print()
```
</p>
</details>

### Results
Before:
```
[--------------------------- Nonzero ---------------------------]
                                 |  ndim 1  |   ndim 2  |   ndim 3
 1 threads: ------------------------------------------------------
       number of elts 131072     |    55.2  |     71.7  |     90.5
       number of elts 1048576    |   113.2  |    250.7  |    497.0
       number of elts 134217728  |  8353.7  |  23809.2  |  54602.3

 Times are in microseconds (us).
```
After:
```
[-------------------------- Nonzero --------------------------]
                                |  ndim 1  |  ndim 2  |  ndim 3
1 threads: ----------------------------------------------------
      number of elts 131072     |    48.6  |    79.1  |    90.2
      number of elts 1048576    |    64.7  |   134.2  |   161.1
      number of elts 134217728  |  3748.8  |  7881.3  |  9953.7

Times are in microseconds (us).

```
There's a real regression for smallish 2D tensors due to the added work of computing the number of nonzero elements; however, for other sizes there are significant gains, and there are drastically lower memory requirements. Perf gains would be even larger for tensors with fewer nonzeros.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44259

Reviewed By: izdeby

Differential Revision: D23581955

Pulled By: ngimel

fbshipit-source-id: 0b99a767fd60d674003d83f0848dc550d7a363dc
2020-09-08 20:52:51 -07:00
bd8e38cd88 [TensorExpr] Fuser: check node inputs' device before merging the node into a fusion group. (#44241)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44241

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23554192

Pulled By: ZolotukhinM

fbshipit-source-id: fb03262520303152b83671603e08e7aecc24f5f2
2020-09-08 19:32:23 -07:00
646ffd4886 [quant] Move EmbeddingBag eager quantization to static (#44217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44217

Move the tests to static ones as well

Test Plan:
python test/test_quantization.py TestStaticQuantizedModule.test_embedding_bag_api

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23547386

fbshipit-source-id: 41f81c31e1613098ecf6a7eff601c7dcd4b09c76
2020-09-08 19:05:02 -07:00
57b87aaf59 [quant] Add quantized Embedding module (#44208)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44208

Add quantized module in static quantization namespace. Embedding
quantization requires only weights to be quantized so it is static.
Internally it calls the embedding_bag_byte op with the offsets set corresponding to the
indices.

Future PR will move EmbeddingBag quantization from dynamic to static as well.

Test Plan:
python test/test_quantization.py test_embedding_api

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23547384

fbshipit-source-id: eddc6fb144b4a771060e7bab5853656ccb4443f0
2020-09-08 19:04:59 -07:00
6013a29fc0 [quant] Support quantization of embedding lookup operators (#44207)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44207

Use the existing embedding_bag operator but set offsets to [0, 1, ..., len(indices) - 1]
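A sketch of the offsets trick (tensor values are illustrative):

```
import torch

indices = torch.tensor([3, 1, 4, 1])
# one offset per index: every "bag" contains exactly one embedding row,
# so an embedding_bag lookup behaves like a plain embedding lookup
offsets = torch.arange(indices.numel())  # tensor([0, 1, 2, 3])
```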

Test Plan:
python test/test_quantization.py TestEmbeddingOps.test_embedding_byte

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23547385

fbshipit-source-id: ccce348bc192c6a4a65a8eca4c8b90f99f40f1b1
2020-09-08 19:03:59 -07:00
f27be2f781 [caffe2] fix wrong comment (#42735)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42735

We use reduced precision only for the embedding table (not for momentum) in RowWiseSparseAdagrad

Test Plan: .

Reviewed By: jianyuh

Differential Revision: D23003939

fbshipit-source-id: 062290d94b160100bc4c2f48b797833819f8e88a
2020-09-08 18:54:24 -07:00
f9146b4598 fix lint (#44346)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44346

Reviewed By: jamesr66a

Differential Revision: D23589324

Pulled By: eellison

fbshipit-source-id: a4e22b69196909ec200ac3e262f04d2aaf78e9cf
2020-09-08 18:29:44 -07:00
6269b6e0f0 [quant][graphmode][fx][api] Call fuse in prepare (#43984)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43984

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23459261

fbshipit-source-id: 6b56b0916d76df67b9cc2f4be1fcee905d604019
2020-09-08 18:09:26 -07:00
be94dba429 [NNC] fix support for FP16 in CudaCodgen (#44209)
Summary:
Fixes a bug where FP16 values could be incorrectly cast to a half type that doesn't have a cast operator. The fix inserts the CUDA-specific cast to float during handling of the Cast node, rather than as a wrapper around printing Loads and Stores. Two main changes: the HalfChecker now inserts the casts to float explicitly in the IR, and the PrioritizeLoad mutator now consumes both Loads and a Cast which immediately precedes a Load.

Tested with test_jit_fuser_te.py and test_tensorexpr.py, plus the C++ tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44209

Reviewed By: izdeby

Differential Revision: D23575577

Pulled By: nickgg

fbshipit-source-id: 808605aeb2af812758f96f9fdc11b07e08053b46
2020-09-08 18:00:39 -07:00
9f54bcc522 [quant][graphmode][fx] Support inplace option (#43983)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43983

Support the inplace option in the APIs

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23459260

fbshipit-source-id: 80409c7984f17d1a4e13fb1eece8e18a69ee43b3
2020-09-08 17:39:13 -07:00
0351d31722 add rocm nightly build (#44250)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44250

Reviewed By: izdeby

Differential Revision: D23585431

Pulled By: walterddr

fbshipit-source-id: c798707f5cb55f720e470bc40f30ab82718e0ddf
2020-09-08 17:09:32 -07:00
40d138f7c1 Added alpha overloads for add/sub ops with lists (#43413)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43413

Test Plan: Imported from OSS

Reviewed By: cpuhrsch

Differential Revision: D23331896

Pulled By: izdeby

fbshipit-source-id: 2e7484339fec533e21224f18979fddbeca649d2c
2020-09-08 17:02:08 -07:00
00b5bd536f fx quant: add docblocks to _find_matches and _find_quants (#43928)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43928

Improving readability, no logic change.

Test Plan:
CI

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23440249

fbshipit-source-id: a7ebfc7ad15c73e26b9a94758e7254413cc17d29
2020-09-08 16:13:11 -07:00
6dd53fb58d [fix] output of embedding_bag with non-contiguous weight (#44032)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43723

Use weight.contiguous() on the fast path, as it expects a contiguous tensor.

TODO:
* [x] Add tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44032

Reviewed By: izdeby

Differential Revision: D23502200

Pulled By: glaringlee

fbshipit-source-id: 4a7b546b3e8b1ad35c287a634b4e990a1ccef874
2020-09-08 16:07:13 -07:00
43e38d60d6 [quant][graphmode][fx] Support quantize per channel in all cases (#44042)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44042

Missed one case last time

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23479345

fbshipit-source-id: 30e6713120c494e9fab5584de4df9b25bec83d32
2020-09-08 15:45:14 -07:00
49e979bfde Set default compiler differently according to platform (#43890)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43890

1. auto-detect the `CXX` default compiler type in OSS, and use `clang` as the default compiler type in fbcode (because auto-detection would report `gcc` as the default compiler on devservers).

2. change `compiler type` from str `"CLANG" "GCC"` to enum type
3. rename function `get_cov_type` to `detect_compiler_type`
4. auto-set the default pytorch folder for users in oss

Test Plan:
on devserver:
```
buck run :coverage //caffe2/c10:
```

on oss:
```
python oss_coverage.py --run-only=atest
```

Reviewed By: malfet

Differential Revision: D23420034

fbshipit-source-id: c0ea88188578bb1343a286f2090eb8a74cdf3982
2020-09-08 14:57:35 -07:00
1fcccd6a18 [FX] Minor fixups in Graph printout (#44214)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44214

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D23545501

Pulled By: jamesr66a

fbshipit-source-id: dabb3b051ed4da213b2087979ade8a649288bd5d
2020-09-08 14:45:32 -07:00
47ac9bb105 Enable temp disabled tests in test_jit_fuser_te.py (#44222)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44222

Reviewed By: izdeby

Differential Revision: D23582214

Pulled By: Krovatkin

fbshipit-source-id: 27caa3ea02ce10b163212f6a45a81b446898953d
2020-09-08 14:40:32 -07:00
54931ebb7b Release saved variable from DifferentiableGraphBackward (#42994)
Summary:
When the backward ops execute via the autograd engine's evaluate_function(), fn.release_variables() is called to release the SavedVariables. For eager-mode ops, this releases the saved inputs that were required for the backward grad function. However, with TorchScript, we get a DifferentiableGraph, and DifferentiableGraphBackward doesn't implement release_variables(). This causes the SavedVariables to stay alive longer. Implement release_variables() for DifferentiableGraphBackward to release these SavedVariables early.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42994

Reviewed By: izdeby

Differential Revision: D23503172

Pulled By: albanD

fbshipit-source-id: d87127498cfa72883ae6bb31d0e6c7056c4c36d4
2020-09-08 14:36:52 -07:00
63d62d3e44 Skips test_addcmul_cuda if using ROCm (#44304)
Summary:
This test is failing consistently on linux-bionic-rocm3.7-py3.6-test2. Relevant log snippet:

```
03:43:11 FAIL: test_addcmul_cuda_float16 (__main__.TestForeachCUDA)
03:43:11 ----------------------------------------------------------------------
03:43:11 Traceback (most recent call last):
03:43:11   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 818, in wrapper
03:43:11     method(*args, **kwargs)
03:43:11   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 258, in instantiated_test
03:43:11     result = test(self, *args)
03:43:11   File "test_foreach.py", line 83, in test_addcmul
03:43:11     self._test_pointwise_op(device, dtype, torch._foreach_addcmul, torch._foreach_addcmul_, torch.addcmul)
03:43:11   File "test_foreach.py", line 58, in _test_pointwise_op
03:43:11     self.assertEqual(tensors, expected)
03:43:11   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1153, in assertEqual
03:43:11     exact_dtype=exact_dtype, exact_device=exact_device)
03:43:11   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1127, in assertEqual
03:43:11     self.assertTrue(result, msg=msg)
03:43:11 AssertionError: False is not true : Tensors failed to compare as equal! With rtol=0.001 and atol=1e-05, found 10 element(s) (out of 400) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.00048828125 (-0.46484375 vs. -0.46533203125), which occurred at index (11, 18).
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44304

Reviewed By: malfet, izdeby

Differential Revision: D23578316

Pulled By: mruberry

fbshipit-source-id: 558eecf42677383e7deaa4961e12ef990ffbe28c
2020-09-08 13:14:25 -07:00
de89261abe Reduce sccache log levels for RocM to a default state (#44310)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44310

Reviewed By: walterddr

Differential Revision: D23576966

Pulled By: malfet

fbshipit-source-id: c7fa063ec2be92de8f3768aaa3e6a032913004f7
2020-09-08 12:55:23 -07:00
477f489137 Don't register a fallback for private use to let extensions do it themselves (#44149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44149

Thanks Christian Puhrsch for reporting.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D23574739

Pulled By: ezyang

fbshipit-source-id: 8c9d0d78e6970139e0103cd1e0004b743e3c7f9e
2020-09-08 12:30:26 -07:00
caf23d110f [JIT] Unshare types for modules that define() in __init__ (#44233)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44233

**Summary**
By default, scripting tries to share concrete and JIT types across
compilations. However, this can lead to incorrect results if a module
extends `torch.jit.ScriptModule`, and injects instance variables into
methods defined using `define`.

This commit detects when this has happened and disables type sharing
for the compilation of the module that uses `define` in `__init__`.
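
For illustration, a hypothetical minimal repro of the pattern described above (the class and the injected values are made up; the real coverage lives in TestTypeSharing):

```
import torch

class M(torch.jit.ScriptModule):
    def __init__(self, return_value):
        super().__init__()
        # define() injects an instance-specific constant into the method
        # body, so two M instances must not share a compiled JIT type.
        self.define(f"def forward(self) -> int: return {return_value}")

a, b = M(1), M(2)
assert a() == 1 and b() == 2
```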

**Test Plan**
This commit adds a test to TestTypeSharing that tests this scenario.

**Fixes**
This commit fixes #43580.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23553870

Pulled By: SplitInfinity

fbshipit-source-id: d756e87fcf239befa0012998ce29eeb25728d3e1
2020-09-08 12:16:45 -07:00
4e0ac120e9 [FX] Only copy over training attr if it's there (#44314)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44314

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Differential Revision: D23578189

Pulled By: jamesr66a

fbshipit-source-id: fb7643f28582bd5009a826663a937fbe188c50bc
2020-09-08 11:50:08 -07:00
fd8e2064e0 quant: switch observers to use min_max (#42957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42957

Switches observers to use the new min_max function to calculate
min and max at the same time.  We see around 45-50% speedup on
representative input shapes on the microbenchmarks for all observers except `HistogramObserver`.
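
The idea, sketched with the public reduction API (the internal helper the observers call is an implementation detail whose name has changed across releases; `torch.aminmax` is the public equivalent in recent versions):

```
import torch

x = torch.randn(1024)

# Two separate passes over the data:
mn, mx = x.min(), x.max()

# One fused pass computing both at once -- this is where the observer
# speedup comes from.
mn_fused, mx_fused = torch.aminmax(x)

assert mn == mn_fused and mx == mx_fused
```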

Test Plan:
CI for correctness

performance:
```
cd benchmarks/operator_benchmark
// repeat (before diff, after diff) x (cpu, cuda)
python -m pt.qobserver_test --tag_filter all --device cpu
/*
    * before, cpu: https://our.intern.facebook.com/intern/paste/P138633280/
    * before, cuda: https://our.intern.facebook.com/intern/paste/P138639473/
    * after, cpu: https://our.intern.facebook.com/intern/paste/P138635458/
    * after, cuda: https://our.intern.facebook.com/intern/paste/P138636344/
*/
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D23093995

fbshipit-source-id: 9f416d144109b5b80baf089eb4bcfabe8fe358d5
2020-09-08 11:39:44 -07:00
de980f937b skip test_tanhquantize for now (#44312)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44312

This test is failing now when running on card. Let's disable it while Intel is investigating the issue.

Test Plan: Sandcastle

Reviewed By: hyuen

Differential Revision: D23577475

fbshipit-source-id: 84f957c69ed75e0e0f563858b8b8ad7a2158da4e
2020-09-08 11:21:41 -07:00
8d212d3f7a add 'run_duration' stats for binary builds to scuba (#44251)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44251

Reviewed By: seemethere

Differential Revision: D23575312

Pulled By: walterddr

fbshipit-source-id: 29d737f5bee1540d6595d4d0ca1386b9ce5ab2ee
2020-09-08 11:13:00 -07:00
1130de790c Automated submodule update: FBGEMM (#44177)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: d5ace7ca70

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44177

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: dskhudia

Differential Revision: D23533561

fbshipit-source-id: 9e580f8dbfb83e57bebc28f8e459caa0c5fc7317
2020-09-08 10:12:21 -07:00
5de805d8a7 [dper3] Export Caffe2 operator LearningRate to PyTorch
Summary: Exports the operator to PyTorch, to be made into a low-level module.

Test Plan:
```
buck test //caffe2/caffe2/python/operator_test:torch_integration_test -- test_learning_rate
```

Reviewed By: yf225

Differential Revision: D23545582

fbshipit-source-id: 6b6d9aa6a47b2802ccef0f87c1263c6cc2d2fdf6
2020-09-08 08:50:09 -07:00
cce5982c4c Add unary ops: exp and sqrt (#42537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42537

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient in cases when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process. We need to reduce the number of kernels that we start.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What is 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can fall back to a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via the 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.

----------------
**In this PR**
Adding APIs:
```
torch._foreach_exp(TensorList tl1)
torch._foreach_exp_(TensorList tl1)
torch._foreach_sqrt(TensorList tl1)
torch._foreach_sqrt_(TensorList tl1)
```
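
A minimal usage sketch of these private APIs (prototype functions; exact names and behavior may change):

```
import torch

tensors = [torch.randn(4) for _ in range(10)]

# One multi-tensor call instead of one kernel launch per tensor:
exp_out = torch._foreach_exp(tensors)
for t, e in zip(tensors, exp_out):
    assert torch.allclose(e, t.exp())

# The trailing-underscore variant mutates each tensor in place:
abs_tensors = [t.abs() for t in tensors]
torch._foreach_sqrt_(abs_tensors)
```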

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors

**Plan for the next PRs**
1. APIs
- Pointwise Ops

2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS

Reviewed By: cpuhrsch

Differential Revision: D23331889

Pulled By: izdeby

fbshipit-source-id: 8b04673b8412957472ed56361954ca3884eb9376
2020-09-07 19:57:34 -07:00
6134ac17ba Revert D23561500: Benchmarks: re-enable profiling-te configuration (try 2).
Test Plan: revert-hammer

Differential Revision:
D23561500 (589a2024c8)

Original commit changeset: 7fe86d34afa4

fbshipit-source-id: 10e48f230402572fcece56662ad4413ac0bd3cb5
2020-09-07 19:10:30 -07:00
7c61f57bec test_ops: skipTest only takes a single argument (#44181)
Summary:
Fixes a broken skipTest from https://github.com/pytorch/pytorch/issues/43451, e.g. in the ROCm CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44181

Reviewed By: ngimel

Differential Revision: D23568608

Pulled By: malfet

fbshipit-source-id: 557048bd5f0086ffac38d1c48255badb63869899
2020-09-07 18:32:59 -07:00
0e64b02912 FindCUDA error handling (#44236)
Summary:
Check the return code of `nvcc --version`; if it's not zero, print a warning and mark CUDA as not found.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44236

Test Plan: Run `CUDA_NVCC_EXECUTABLE=/foo/bar cmake ../`

Reviewed By: ezyang

Differential Revision: D23552336

Pulled By: malfet

fbshipit-source-id: cf9387140a8cdbc8dab12fcc4bfaf55ae8e6a502
2020-09-07 18:17:55 -07:00
5d748e6d22 [TensorExpr] Re-enable tests. (#44218)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44218

Differential Revision: D23546100

Test Plan: Imported from OSS

Reviewed By: ngimel

Pulled By: ZolotukhinM

fbshipit-source-id: 4c4c5378ec9891ef72b60ffb59081a009e0df049
2020-09-07 15:52:03 -07:00
589a2024c8 Benchmarks: re-enable profiling-te configuration (try 2). (#44270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44270

The previous PR (#44212) was reverted since I didn't update the
`upload_scribe.py` script and it was looking for 'executor_and_fuser'
field in the json which now is replaced with two separate fields:
'executor' and 'fuser'.

Differential Revision: D23561500

Test Plan: Imported from OSS

Reviewed By: ngimel

Pulled By: ZolotukhinM

fbshipit-source-id: 7fe86d34afa488a0e43d5ea2aaa7bc382337f470
2020-09-07 15:50:39 -07:00
10dd25dcd1 Add binary ops for _foreach APIs (#42536)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42536

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient in cases when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process. We need to reduce the number of kernels that we start.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What is 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can fall back to a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via the 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.

----------------
**In this PR**
Adding APIs:
```
torch._foreach_sub(TensorList tl1, TensorList tl2)
torch._foreach_sub_(TensorList self, TensorList tl2)
torch._foreach_mul(TensorList tl1, TensorList tl2)
torch._foreach_mul_(TensorList self, TensorList tl2)
torch._foreach_div(TensorList tl1, TensorList tl2)
torch._foreach_div_(TensorList self, TensorList tl2)

torch._foreach_sub(TensorList tl1, Scalar scalar)
torch._foreach_sub_(TensorList self, Scalar scalar)
torch._foreach_mul(TensorList tl1, Scalar scalar)
torch._foreach_mul_(TensorList self, Scalar scalar)
torch._foreach_div(TensorList tl1, Scalar scalar)
torch._foreach_div_(TensorList self, Scalar scalar)
```
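
A minimal usage sketch of the list-list and list-scalar variants (private prototype APIs, subject to change):

```
import torch

a = [torch.randn(3) for _ in range(10)]
b = [torch.randn(3) for _ in range(10)]

prod = torch._foreach_mul(a, b)      # list-list, one multi-tensor call
halved = torch._foreach_div(a, 2.0)  # list-scalar

for x, y, p, h in zip(a, b, prod, halved):
    assert torch.allclose(p, x * y)
    assert torch.allclose(h, x / 2.0)
```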

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors

**Plan for the next PRs**
1. APIs
- Unary Ops for list
- Pointwise Ops

2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS

Reviewed By: cpuhrsch

Differential Revision: D23331891

Pulled By: izdeby

fbshipit-source-id: 18c5937287e33e825b2e391e41864dd64e226f19
2020-09-07 10:29:32 -07:00
626e410e1d Revert D23544563: Benchmarks: re-enable profiling-te configuration.
Test Plan: revert-hammer

Differential Revision:
D23544563 (ac1f471fe2)

Original commit changeset: 98659e8860fa

fbshipit-source-id: 5dab7044699f59c709e64d178758f5f462ebb788
2020-09-06 21:01:19 -07:00
1b2da9ed82 Expose alias key info in dumpState and update test_dispatch. (#44081)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44081

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23492794

Pulled By: ailzhang

fbshipit-source-id: 27a2978591900463bda2e92e0201c9fd719f9792
2020-09-06 18:43:05 -07:00
514f20ea51 Histogram Binning Calibration
Summary:
Adding a calibration module called histogram binning:

Divide the prediction range (e.g., [0, 1]) into B bins. In each bin, use two parameters to store the number of positive examples and the number of examples that fall into this bucket. So we basically have a histogram for the model prediction.

As a result, for each bin, we have a statistical value for the real CTR (num_pos / num_example). We use this statistical value as the final calibrated prediction if the pre-calibration prediction falls into the corresponding bin.

In this way, the predictions within each bin should be well-calibrated if we have sufficient examples. That is, we have a fine-grained calibrated model by this calibration module.

Theoretically, this calibration layer can fix any uncalibrated model or prediction if we have sufficient bins and examples. It opens up the possibility of applying any kind of training-weight allocation to our training data without worrying about calibration.
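
A minimal sketch of the scheme (illustrative only; the class and method names are made up and do not match the dper3 module):

```
import torch

class HistogramBinning:
    def __init__(self, num_bins=100):
        self.num_bins = num_bins
        self.pos = torch.zeros(num_bins)    # positive examples per bin
        self.total = torch.zeros(num_bins)  # all examples per bin

    def observe(self, preds, labels):
        # Bucket each prediction in [0, 1] into one of B equal-width bins.
        idx = (preds * self.num_bins).long().clamp_(0, self.num_bins - 1)
        self.total.index_add_(0, idx, torch.ones_like(preds))
        self.pos.index_add_(0, idx, labels.float())

    def calibrate(self, preds):
        idx = (preds * self.num_bins).long().clamp_(0, self.num_bins - 1)
        # The calibrated prediction is the empirical CTR of the bin.
        return self.pos[idx] / self.total[idx].clamp(min=1.0)
```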

Test Plan:
buck test dper3/dper3/modules/calibration/tests:calibration_test -- test_histogram_binning_calibration

buck test dper3/dper3_models/ads_ranking/tests:model_paradigm_e2e_tests -- test_sparse_nn_histogram_binning_calibration

All tests passed.

Example workflows:
f215431958

{F326445092}

f215445048

{F326445223}

Reviewed By: chenshouyuan

Differential Revision: D23356450

fbshipit-source-id: c691b66c51ef33908c17575ce12e5bee5fb325ff
2020-09-06 17:11:16 -07:00
ac1f471fe2 Benchmarks: re-enable profiling-te configuration. (#44212)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44212

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23544563

Pulled By: ZolotukhinM

fbshipit-source-id: 98659e8860fa951d142e0f393731c4a769463c6c
2020-09-06 10:22:16 -07:00
bb861e1d69 Ports CUDA var and std reduce all (with no out argument) to ATen, fixes var docs (#43858)
Summary:
When var and std are called without args (other than unbiased) they currently call into TH or THC. This PR:

- Removes the THC var_all and std_all functions and updates CUDA var and std to use the ATen reduction
- Fixes var's docs, which listed its arguments in the incorrect order
- Adds new tests comparing var and std with their NumPy counterparts

Performance appears to have improved as a result of this change. I ran experiments on 1D tensors, 1D tensors with every other element viewed ([::2]), 2D tensors and 2D transposed tensors. Some notable datapoints:

- torch.randn((8000, 8000))
  - var measured 0.0022215843200683594s on CUDA before the change
  - var measured 0.0020322799682617188s on CUDA after the change
- torch.randn((8000, 8000)).T
  - var measured .015128850936889648 on CUDA before the change
  - var measured 0.001912832260131836 on CUDA after the change
- torch.randn(8000 ** 2)
  - std measured 0.11031460762023926 on CUDA before the change
  - std measured 0.0017833709716796875 on CUDA after the change

Timings for var and std are, as expected, similar.

On the CPU, however, the performance change from making the analogous update was more complicated, and ngimel and I decided not to remove CPU var_all and std_all. ngimel wrote the following script that showcases how single-threaded CPU inference would suffer from this change:

```
import torch
import numpy as np
from torch.utils._benchmark import Timer
from torch.utils._benchmark import Compare
import sys
base = 8
multiplier = 1

def stdfn(a):
    meanv = a.mean()
    ac = a-meanv
    return torch.sqrt(((ac*ac).sum())/a.numel())

results = []
num_threads=1
for _ in range(7):
    size = base*multiplier
    input = torch.randn(size)

    tasks = [("torch.var(input)", "torch_var"),
             ("torch.var(input, dim=0)", "torch_var0"),
             ("stdfn(input)", "stdfn"),
             ("torch.sum(input, dim=0)", "torch_sum0")
            ]
    timers = [Timer(stmt=stmt, num_threads=num_threads, label="Index", sub_label=f"{size}",
    description=label, globals=globals()) for stmt, label in tasks]
    repeats = 3

    for i, timer in enumerate(timers * repeats):
        results.append(
            timer.blocked_autorange()
        )
        print(f"\r{i + 1} / {len(timers) * repeats}", end="")
        sys.stdout.flush()
    multiplier *=10
print()

comparison = Compare(results)

comparison.print()
```

The TH timings using this script on my devfair are:

```
[------------------------------ Index ------------------------------]
          |  torch_var  |  torch_var0  |   stdfn   |  torch_sum0
1 threads: ----------------------------------------------------------
  8       |     16.0    |       5.6    |     40.9  |       5.0
  80      |     15.9    |       6.1    |     41.6  |       4.9
  800     |     16.7    |      12.0    |     42.3  |       5.0
  8000    |     27.2    |      72.7    |     51.5  |       6.2
  80000   |    129.0    |     715.0    |    133.0  |      18.0
  800000  |   1099.8    |    6961.2    |    842.0  |     112.6
  8000000 |  11879.8    |   68948.5    |  20138.4  |    1750.3
```

and the ATen timings are:

```
[------------------------------ Index ------------------------------]
          |  torch_var  |  torch_var0  |   stdfn   |  torch_sum0
1 threads: ----------------------------------------------------------
  8       |      4.3    |       5.4    |     41.4  |       5.4
  80      |      4.9    |       5.7    |     42.6  |       5.4
  800     |     10.7    |      11.7    |     43.3  |       5.5
  8000    |     69.3    |      72.2    |     52.8  |       6.6
  80000   |    679.1    |     676.3    |    129.5  |      18.1
  800000  |   6770.8    |    6728.8    |    819.8  |     109.7
  8000000 |  65928.2    |   65538.7    |  19408.7  |    1699.4
```

which demonstrates that performance is analogous to calling the existing var and std with `dim=0` on a 1D tensor. This would be a significant performance hit. Another simple script shows the performance is mixed when using multiple threads, too:

```
import torch
import time

# Benchmarking var and std, 1D with varying sizes
base = 8
multiplier = 1

op = torch.var
reps = 1000

for _ in range(7):
    size = base * multiplier
    t = torch.randn(size)
    elapsed = 0
    for _ in range(reps):
        start = time.time()
        op(t)
        end = time.time()
        elapsed += end - start
    multiplier *= 10

    print("Size: ", size)
    print("Avg. elapsed time: ", elapsed / reps)
```

```
var cpu TH vs ATen timings

Size:  8
Avg. elapsed time:  1.7853736877441406e-05 vs 4.9788951873779295e-06 (ATen wins)
Size:  80
Avg. elapsed time:  1.7803430557250977e-05 vs 6.156444549560547e-06 (ATen wins)
Size:  800
Avg. elapsed time:  1.8569469451904296e-05 vs 1.2302875518798827e-05 (ATen wins)
Size:  8000
Avg. elapsed time:  2.8756141662597655e-05 vs. 6.97789192199707e-05 (TH wins)
Size:  80000
Avg. elapsed time:  0.00026622867584228516 vs. 0.0002447957992553711 (ATen wins)
Size:  800000
Avg. elapsed time:  0.0010556647777557374 vs 0.00030616092681884767 (ATen wins)
Size:  8000000
Avg. elapsed time:  0.009990205764770508 vs 0.002938544034957886 (ATen wins)

std cpu TH vs ATen timings

Size:  8
Avg. elapsed time:  1.6681909561157225e-05 vs. 4.659652709960938e-06 (ATen wins)
Size:  80
Avg. elapsed time:  1.699185371398926e-05 vs. 5.431413650512695e-06 (ATen wins)
Size:  800
Avg. elapsed time:  1.768803596496582e-05 vs. 1.1279821395874023e-05 (ATen wins)
Size:  8000
Avg. elapsed time:  2.7791500091552735e-05  vs 7.031106948852539e-05 (TH wins)
Size:  80000
Avg. elapsed time:  0.00018650460243225096 vs 0.00024368906021118164 (TH wins)
Size:  800000
Avg. elapsed time:  0.0010522041320800782 vs 0.0003039860725402832 (ATen wins)
Size:  8000000
Avg. elapsed time:  0.009976618766784668 vs. 0.0029211788177490234 (ATen wins)
```

These results show the TH solution still performs better than the ATen solution with default threading for some sizes.

It seems like removing CPU var_all and std_all will require an improvement in ATen reductions. https://github.com/pytorch/pytorch/issues/40570 has been updated with this information.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43858

Reviewed By: zou3519

Differential Revision: D23498981

Pulled By: mruberry

fbshipit-source-id: 34bee046c4872d11c3f2ffa1b5beee8968b22050
2020-09-06 09:40:54 -07:00
83a6e7d342 Adds inequality testing aliases for better NumPy compatibility (#43870)
Summary:
This PR adds the following aliases:

- not_equal for torch.ne
- greater for torch.gt
- greater_equal for torch.ge
- less for torch.lt
- less_equal for torch.le

These aliases are consistent with NumPy's naming for these functions.
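
The aliases behave identically to the originals:

```
import torch

a = torch.tensor([1, 2, 3])
b = torch.tensor([2, 2, 2])

assert torch.equal(torch.not_equal(a, b), torch.ne(a, b))
assert torch.equal(torch.greater(a, b), torch.gt(a, b))
assert torch.equal(torch.greater_equal(a, b), torch.ge(a, b))
assert torch.equal(torch.less(a, b), torch.lt(a, b))
assert torch.equal(torch.less_equal(a, b), torch.le(a, b))
```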

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43870

Reviewed By: zou3519

Differential Revision: D23498975

Pulled By: mruberry

fbshipit-source-id: 78560df98c9f7747e804a420c1e53fd1dd225002
2020-09-06 09:36:23 -07:00
671160a963 Revert D23557576: Revert D23519521: [dper3] replace LengthsGather lowlevel module's PT implementation to use caffe2 op
Test Plan: revert-hammer

Differential Revision:
D23557576

Original commit changeset: 33631299eabe

fbshipit-source-id: 704d36a16346f047b30e2da8be882062135f8617
2020-09-06 01:50:43 -07:00
e358d516c8 Revert D23549149: [PyTorch][Mobile] Insert the module name as name() to metadata dict if metadata doesn't contain "model_name"
Test Plan: revert-hammer

Differential Revision:
D23549149 (398409f072)

Original commit changeset: fad742a8d4e6

fbshipit-source-id: bd92a2033a804d3e6a2747b4fda4ca527991a993
2020-09-06 00:06:35 -07:00
70c8daf439 Apply selective build on RNN operators (#44132)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43985

Added
```
def(detail::SelectiveStr<true>, ...)
impl(detail::SelectiveStr<true>, ...)
```
in torch/library, which can also be used for other templated selective registration.

Size savings for this diff:
fbios-pika: 78 KB
igios: 87 KB

Test Plan: Imported from OSS

Reviewed By: ljk53, smessmer

Differential Revision: D23459774

Pulled By: iseeyuan

fbshipit-source-id: 86d34cfe8e3f852602f203db06f23fa99af2c018
2020-09-05 23:47:51 -07:00
68297eeb1a Add support for integer dim arg in torch.linalg.norm (#43907)
Summary:
Since PR https://github.com/pytorch/pytorch/issues/43262 is merged, this works now.

Part of https://github.com/pytorch/pytorch/issues/24802
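
With this change, a plain integer `dim` is accepted where a one-element tuple was previously required:

```
import torch

x = torch.randn(3, 4)
assert torch.allclose(torch.linalg.norm(x, dim=1),
                      torch.linalg.norm(x, dim=(1,)))
```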

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43907

Reviewed By: anjali411

Differential Revision: D23471964

Pulled By: mruberry

fbshipit-source-id: ef2f11f78343fc866f752c9691b0c1fa687353ba
2020-09-05 23:16:36 -07:00
719d29dab5 Implement torch.i0 and torch.kaiser_window (#43132)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349
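
Basic usage of the two new functions (values follow NumPy's `np.i0` and `np.kaiser`):

```
import torch

# i0 is the zeroth-order modified Bessel function of the first kind; i0(0) == 1.
assert torch.isclose(torch.i0(torch.tensor(0.0)), torch.tensor(1.0))

# kaiser_window is defined in terms of i0, matching numpy.kaiser.
w = torch.kaiser_window(8, periodic=False, beta=12.0)
assert w.shape == (8,)
```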

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43132

Reviewed By: smessmer

Differential Revision: D23479072

Pulled By: mruberry

fbshipit-source-id: 4fb1de44830771c6a7222cf19f7728d9ac7c043b
2020-09-05 23:11:47 -07:00
4fc29e9c43 Revert D23519521: [dper3] replace LengthsGather lowlevel module's PT implementation to use caffe2 op
Test Plan: revert-hammer

Differential Revision:
D23519521 (8c64bb4f47)

Original commit changeset: ed9bd16a8af3

fbshipit-source-id: 33631299eabec05a1a272bfd0040d96203cf62a0
2020-09-05 20:43:04 -07:00
396469f18c Explicitly forbidden the other inherited methods of RemoteModule. (#43895)
Summary:
Throw exceptions when the methods except for forwardXXX are used.

Original PR issue: RemoteModule enhancements #40550

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43895

Test Plan: buck test test/distributed/rpc:process_group_agent -- RemoteModule

Reviewed By: rohan-varma

Differential Revision: D23392842

Pulled By: SciPioneer

fbshipit-source-id: 7c09a55a03f9f0b7e9f9264a42bfb907607f4651
2020-09-05 14:48:56 -07:00
199c73be0f [quant][pyper] Support quantization of ops in fork-wait subgraph (#44048)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44048

Inline the fork-wait calls to make sure we can see the ops to be quantized in the main graph.

Also fix the InlineForkWait JIT pass to account for the case where the aten::wait call isn't present in the main graph
and we return a future tensor from the subgraph.

Example

```
graph(%self.1 : __torch__.dper3.core.interop.___torch_mangle_6325.DperModuleWrapper,
       %argument_1.1 : Tensor,
       %argument_2.1 : Tensor):
   %3 : Future[Tensor[]] = prim::fork_0(%self.1, %argument_1.1, %argument_2.1) # :0:0
   return (%3)
 with prim::fork_0 = graph(%self.1 : __torch__.dper3.core.interop.___torch_mangle_5396.DperModuleWrapper,
       %argument_1.1 : Tensor,
       %argument_2.1 : Tensor):
   %3 : __torch__.dper3.core.interop.___torch_mangle_6330.DperModuleWrapper = prim::GetAttr[name="x"](%self.1)
   %4 : __torch__.dper3.core.interop.___torch_mangle_5397.DperModuleWrapper = prim::GetAttr[name="y"](%self.1)
   %5 : __torch__.dper3.core.interop.___torch_mangle_6327.DperModuleWrapper = prim::GetAttr[name="z"](%4)
   %6 : Tensor = prim::CallMethod[name="forward"](%5, %argument_1.1, %argument_2.1) # :0:0
   %7 : None = prim::CallMethod[name="forward"](%3, %6) # :0:0
   %8 : Tensor[] = prim::ListConstruct(%6)
   return (%8)
```

Test Plan:
python test/test_quantization.py test_interface_with_fork

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23481003

fbshipit-source-id: 2e756be73c248319da38e053f021888b40593032
2020-09-05 12:06:19 -07:00
164b96c34c [quant][pyper] make embedding_bag quantization static (#44008)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44008

embedding_bag requires only quantization of weights (no dynamic quantization of inputs),
so the type of quantization is essentially static (without calibration).
This will enable pyper to do fc and embedding_bag quantization using the same API call.

Test Plan:
python test/test_quantization.py test_embedding_bag

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23467019

fbshipit-source-id: 41a61a17ee34bcb737ba5b4e19fb7a576d4aeaf9
2020-09-05 12:06:16 -07:00
a0ae416d60 [quant] Support aten::embedding_bag quantization in graph mode (#43989)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43989

When we trace the model it produces an aten::embedding_bag node in the graph.
Add the necessary passes in graph mode to help support quantizing it as well.

Test Plan:
python test/test_quantization.py TestQuantizeDynamicJitOps.test_embedding_bag

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23460485

fbshipit-source-id: 328c5e1816cfebb10ba951113f657665b6d17575
2020-09-05 12:05:06 -07:00
15a7368115 Add const to getTensors method of GradBucket. (#44126)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44126

Add const to getTensors method of GradBucket.

Test Plan: buck test caffe2/torch/lib/c10d:ProcessGroupGlooTest

Reviewed By: sinannasir, jiayisuse

Differential Revision: D23504088

fbshipit-source-id: 427d9591042e0c03cde02629c1146ff1e5e027f9
2020-09-05 09:19:42 -07:00
5bd2902796 [JIT] Remove references to no longer generated _tanh_backward and _sigmoid_backward (#44138)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44138

If you look at the sigmoid and tanh backward they are composed of other ops: https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/runtime/symbolic_script.cpp#L786
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/runtime/symbolic_script.cpp#L164

So tanh_backward and sigmoid_backward are no longer generated; they are legacy ops.
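
For reference, the decomposition in question: tanh's gradient is expressible entirely with primitive ops, so no dedicated backward node is needed.

```
import torch

x = torch.randn(5, requires_grad=True)
torch.tanh(x).sum().backward()

# tanh_backward(grad, y) decomposes to grad * (1 - y * y); grad is all
# ones here because of the sum().
manual = 1 - torch.tanh(x.detach()) ** 2
assert torch.allclose(x.grad, manual)
```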

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23543603

Pulled By: eellison

fbshipit-source-id: ce8353e53043cf969b536aac47c9576d66d4ce02
2020-09-05 01:41:36 -07:00
df67f0beab [TensorExpr fuser] Guard nodes that have tensor output properties determined by non-tensor inputs (#44137)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44137

We only insert guards on Tensor types, so we rely on the output
of a node being uniquely determined by its input types.
Bail if any non-Tensor input affects the output type
and cannot be reasoned about statically.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23543602

Pulled By: eellison

fbshipit-source-id: abd6fe0b1fd7fe6fc251694d4cd442b19c032dd7
2020-09-05 01:40:18 -07:00
5a0d65b06b Further expand coverage of addmm/addmv, fix 0 stride (#43980)
Summary:
- test beta=0, self=nan (see the sketch below)
- test transposes
- fixes broadcasting of addmv
- not supporting tf32 yet; will do it in a future PR together with other testing fixes
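
For the first bullet, `beta=0` must ignore `input` entirely rather than multiply it by zero, so NaNs in `input` must not propagate. A quick sketch of the documented contract:

```
import torch

inp = torch.full((3,), float("nan"))
mat, vec = torch.randn(3, 4), torch.randn(4)

# With beta=0 the input is ignored, so its NaNs must not leak through.
out = torch.addmv(inp, mat, vec, beta=0)
assert not torch.isnan(out).any()
```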

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43980

Reviewed By: mruberry

Differential Revision: D23507559

Pulled By: ngimel

fbshipit-source-id: 14ee39d1a0e13b9482932bede3fccb61fe6d086d
2020-09-04 23:03:23 -07:00
d07a36e0c1 Revert D23490149: [pytorch][PR] Compile less legacy code when BUILD_CAFFE2 is set to False
Test Plan: revert-hammer

Differential Revision:
D23490149 (15e99b6ff6)

Original commit changeset: a76382c30d83

fbshipit-source-id: 75057fa9af2c19eb976962552118bf0a99911b38
2020-09-04 22:59:39 -07:00
618b4dd763 fx quant prepare: clarify naming (#44125)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44125

In `Quantizer._prepare`, `observed` was used for two different variables
with different types.  Making the names a bit cleaner and removing the
name conflict.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```

Imported from OSS

Reviewed By: dskhudia

Differential Revision: D23504109

fbshipit-source-id: 0f73eac3d6dd5f72ad5574a4d47d33808a70174a
2020-09-04 21:29:56 -07:00
a940f5ea5d torchscript graph mode quant: remove benchmark filter (#44165)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44165

Allows convolutions to be quantized if the `torch.backends.cudnn.benchmark`
flag was set.

Not for land yet, just testing.

Test Plan:
in the gist below, the resulting graph now has quantized convolutions
https://gist.github.com/vkuzo/622213cb12faa0996b6700b08d6ab2f0

Imported from OSS

Reviewed By: supriyar

Differential Revision: D23518775

fbshipit-source-id: 294f678c6afbd3feeb89b7a6655bc66ac9f8bfbc
2020-09-04 21:25:35 -07:00
8c64bb4f47 [dper3] replace LengthsGather lowlevel module's PT implementation to use caffe2 op
Summary: Use a more efficient C++ implementation in a caffe2 op to get rid of control flow statements here.

Test Plan:
- Ran `buck test dper3/dper3/modules/low_level_modules/tests:single_operators_test`
- Ran `buck-out/gen/dper3/dper3_models/experimental/pytorch/ads_model_generation_script.par --model_type="inline_cvr_post_imp" --model_version="april_2020" --gen_inference_model` and observed files getting generated:
```
[ashenoy@devbig086.ash8 ~/fbsource/fbcode] ls -l /tmp/ashenoy/inline_cvr_post_imp_april_2020/
total 278332
-rw-r--r--. 1 ashenoy users 71376941 Sep  3 23:10 serialized_inline_cvr_post_imp_april_2020_model_inference.pt
-rw-r--r--. 1 ashenoy users 71437424 Sep  3 22:09 serialized_inline_cvr_post_imp_april_2020_model_inference_shrunk.pt
-rw-r--r--. 1 ashenoy users    14952 Sep  3 22:38 serialized_inline_cvr_post_imp_april_2020_model_io_metadata_map.pt
-rw-r--r--. 1 ashenoy users    14952 Sep  3 21:42 serialized_inline_cvr_post_imp_april_2020_model_io_metadata_map_shrunk.pt
-rw-r--r--. 1 ashenoy users 67001662 Sep  3 22:38 serialized_inline_cvr_post_imp_april_2020_model_main.pt
-rw-r--r--. 1 ashenoy users 67126415 Sep  3 21:42 serialized_inline_cvr_post_imp_april_2020_model_main_shrunk.pt
-rw-r--r--. 1 ashenoy users  3945257 Sep  3 22:34 serialized_inline_cvr_post_imp_april_2020_model_preproc.pt
-rw-r--r--. 1 ashenoy users  4077266 Sep  3 21:37 serialized_inline_cvr_post_imp_april_2020_model_preproc_shrunk.pt
```
- Ran `buck-out/gen/dper3/dper3_models/experimental/pytorch/ads_model_generation_script.par --model_type="ctr_mbl_feed" --model_version="april_2020" --gen_inference_model` and observed model files getting generated:
```
[ashenoy@devbig086.ash8 ~/fbsource/fbcode] ls -l /tmp/ashenoy/ctr_mbl_feed_april_2020/
total 170304
-rw-r--r--. 1 ashenoy users  2641870 Sep  3 23:06 ctr_mbl_feed_april_2020_prod_eval_training_options
-rw-r--r--. 1 ashenoy users  2641870 Sep  3 23:06 ctr_mbl_feed_april_2020_prod_train_training_options
-rw-r--r--. 1 ashenoy users 42225079 Sep  3 23:59 serialized_ctr_mbl_feed_april_2020_model_inference.pt
-rw-r--r--. 1 ashenoy users 42576708 Sep  3 22:33 serialized_ctr_mbl_feed_april_2020_model_inference_shrunk.pt
-rw-r--r--. 1 ashenoy users    11194 Sep  3 23:29 serialized_ctr_mbl_feed_april_2020_model_io_metadata_map.pt
-rw-r--r--. 1 ashenoy users    11194 Sep  3 22:05 serialized_ctr_mbl_feed_april_2020_model_io_metadata_map_shrunk.pt
-rw-r--r--. 1 ashenoy users 39239139 Sep  3 23:29 serialized_ctr_mbl_feed_april_2020_model_main.pt
-rw-r--r--. 1 ashenoy users 39250842 Sep  3 22:05 serialized_ctr_mbl_feed_april_2020_model_main_shrunk.pt
-rw-r--r--. 1 ashenoy users  2839097 Sep  3 23:24 serialized_ctr_mbl_feed_april_2020_model_preproc.pt
-rw-r--r--. 1 ashenoy users  2944239 Sep  3 22:01 serialized_ctr_mbl_feed_april_2020_model_preproc_shrunk.pt
```

Reviewed By: houseroad

Differential Revision: D23519521

fbshipit-source-id: ed9bd16a8af3cca3a865d9614d67d07f01d8b18a
2020-09-04 21:19:53 -07:00
398409f072 [PyTorch][Mobile] Insert the module name as name() to metadata dict if metadata doesn't contain "model_name" (#44227)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44227

As title
ghstack-source-id: 111490242

Test Plan: CI

Reviewed By: xcheng16

Differential Revision: D23549149

fbshipit-source-id: fad742a8d4e6f844f83495514cd60ff2bf0d5bcb
2020-09-04 21:18:12 -07:00
15e99b6ff6 Compile less legacy code when BUILD_CAFFE2 is set to False (#44079)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44079

Reviewed By: walterddr

Differential Revision: D23490149

Pulled By: malfet

fbshipit-source-id: a76382c30d83127d180ec63ac15093a7297aae53
2020-09-04 20:04:21 -07:00
f3bf6a41ca [ONNX] Update repeat op (#43430)
Summary:
Update the repeat op so that the inputs to the sizes argument can be a mixture of dynamic and constant inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43430

Reviewed By: houseroad

Differential Revision: D23494257

Pulled By: bzinodev

fbshipit-source-id: 90c5e90e4f73e98f3a9d5c8772850e72cecdf0d4
2020-09-04 18:53:31 -07:00
3699274ce2 [DPER3] AOT integration
Summary: Integrate aot flow with model exporter.

Test Plan:
buck test dper3/dper3_backend/delivery/tests:dper3_model_export_test

replayer test see D23407733

Reviewed By: ipiszy

Differential Revision: D23313689

fbshipit-source-id: 39ae8d578ed28ddd6510db959b65974a5ff62888
2020-09-04 18:37:22 -07:00
8b17fd2516 Add remote_parameters() into RemoteModule class. (#43906)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43906

This method returns a list of RRefs of remote parameters that can be fed into the DistributedOptimizer.
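
A hypothetical usage sketch (assumes an initialized RPC framework and a worker named `worker1`; names here are illustrative):

```
import torch
from torch.distributed.nn import RemoteModule
from torch.distributed.optim import DistributedOptimizer

remote_linear = RemoteModule("worker1/cpu", torch.nn.Linear, args=(10, 10))

# remote_parameters() returns the list of parameter RRefs that
# DistributedOptimizer expects.
opt = DistributedOptimizer(
    torch.optim.SGD,
    remote_linear.remote_parameters(),
    lr=0.05,
)
```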

Original PR issue: RemoteModule enhancements #40550

Test Plan: buck test caffe2/test/distributed/rpc:process_group_agent -- RemoteModule

Reviewed By: rohan-varma

Differential Revision: D23399586

fbshipit-source-id: 4b0f1ccf2e47c8a9e4f79cb2c8668f3cdbdff820
2020-09-04 16:22:40 -07:00
8f37ad8290 [BUILD] Guard '#pragma unroll' with COMPILING_FOR_MIN_SIZE
Summary: Disable unroll hints when COMPILING_FOR_MIN_SIZE is on. We were seeing hundreds of errors in the build because the optimization was not being performed.

Test Plan: Smoke builds

Differential Revision: D23513255

fbshipit-source-id: 87da2fdc3c1146e8ffcacf14a49d5151d313f367
2020-09-04 15:55:28 -07:00
3d7c22a2ce [ONNX] Enable new scripting passes for functionalization and remove_mutation (#43791)
Summary:
Duplicate of https://github.com/pytorch/pytorch/issues/41413
This PR initiates the process of updating the TorchScript backend interface used by the ONNX exporter.

Replace the jit lower graph pass with the freeze module pass.

Enable ScriptModule tests for ONNX operator tests (ORT backend) and model tests by default.

Replace the jit remove_inplace_ops pass with remove_mutation, consolidating all passes for handling inplace ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43791

Reviewed By: houseroad

Differential Revision: D23421872

Pulled By: bzinodev

fbshipit-source-id: a98710c45ee905748ec58385e2a232de2486331b
2020-09-04 15:21:45 -07:00
70bbd08402 [FX] Fix forward merge conflict breakage (#44221)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44221

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23547373

Pulled By: jamesr66a

fbshipit-source-id: df47fce0f6ff2988093208fc8370544b7985288d
2020-09-04 15:12:33 -07:00
4562b212db Fix potential divide by zero for CostInferenceForRowWiseSparseAdagrad
Summary: Fix the potential divide-by-zero error in CostInferenceForRowWiseSparseAdagrad when n has zero elements.

Test Plan:
Ran buck test caffe2/caffe2/python/operator_test:adagrad_test
Result: https://our.intern.facebook.com/intern/testinfra/testrun/562950122086369

Reviewed By: idning

Differential Revision: D23520763

fbshipit-source-id: 191345bd24f5179a9dbdb41c6784eab102cfe89c
2020-09-04 14:14:49 -07:00
2ad5a82c43 [fx] get rid of graph_module.root (#44092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44092

Instead, submodules and weights are installed directly on the
graph_module by transferring the original modules. This makes it more
likely that scripting will succeed (since we no longer have submodules
that are not used in the trace). It also prevents layered transforms
from having to special-case handling of the `root` module. GraphModules
can now be re-traced as part of the input to other transforms.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23504210

Pulled By: zdevito

fbshipit-source-id: f79e5c4cbfc52eb0ffb5d6ed89b37ce35a7dc467
2020-09-04 11:35:32 -07:00
0c2bc4fe20 Revert D23468286: [pytorch][PR] Optimize code path for adaptive_avg_pool2d when output size is (1, 1)
Test Plan: revert-hammer

Differential Revision:
D23468286 (f8f35fddd4)

Original commit changeset: cc181f705fea

fbshipit-source-id: 3a1db0eef849e0c2f3c0c64040d2a8b799644fa3
2020-09-04 11:28:15 -07:00
6474057c76 Revert D23503636: [pytorch][PR] [NNC] make inlining immediate (take 2) and fix bugs
Test Plan: revert-hammer

Differential Revision:
D23503636 (70aecd2a7f)

Original commit changeset: cdbdc902b7a1

fbshipit-source-id: b5164835f874a56213de4bed9ad690164eae9230
2020-09-04 10:58:23 -07:00
539d029d8c [ONNX] Fix split export using slice (#43670)
Summary:
Fix for exporting split with fixed output shape using slice.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43670

Reviewed By: houseroad

Differential Revision: D23420318

Pulled By: bzinodev

fbshipit-source-id: 09c2b58049fe32dca2f2977d91dd64de6ee9a72f
2020-09-04 10:52:44 -07:00
af13faf18b [FX] __str__ for GraphModule and Graph (#44166)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44166

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23520801

Pulled By: jamesr66a

fbshipit-source-id: f77e3466e435127ec01e66291964395f32a18992
2020-09-04 10:46:43 -07:00
0e3cf6b8d2 [pytorch] remove code analyzer build folder between builds (#44148)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44148

Automatically remove the build_code_analyzer folder each time build.sh is run
ghstack-source-id: 111458413

Test Plan:
Run build.sh with different options and compare the outputs (should be different).
Ex:
`ANALYZE_TORCH=1 DEPLOY=1 BASE_OPS_FILE=/path/to/baseops MOBILE_BUILD_FLAGS='-DBUILD_MOBILE_AUTOGRAD=OFF' tools/code_analyzer/build.sh `

should produce a shorter file than
`ANALYZE_TORCH=1 DEPLOY=1 BASE_OPS_FILE=/path/to/baseops MOBILE_BUILD_FLAGS='-DBUILD_MOBILE_AUTOGRAD=ON' tools/code_analyzer/build.sh`

Reviewed By: iseeyuan

Differential Revision: D23503886

fbshipit-source-id: 9b95d4365540da0bd2d27760e1315caed5f44eec
2020-09-04 10:38:12 -07:00
f38e7aee71 Updates to SCCACHE for ROCm case (#44155)
Summary:
- Collecting sccache trace logs
- Change the SCCACHE_IDLE_TIMEOUT to unlimited

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44155

Reviewed By: ngimel

Differential Revision: D23516192

Pulled By: malfet

fbshipit-source-id: aa93052d7b9a1832eeaa8e81ee8706aeb9f7a508
2020-09-04 10:11:18 -07:00
2a1fc56694 replace the white list from default mappings (#41802)
Summary:
Replaced "whitelist" from default_mappings.py
Fixes https://github.com/pytorch/pytorch/issues/41756

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41802

Reviewed By: ngimel

Differential Revision: D23521452

Pulled By: malfet

fbshipit-source-id: 019a2d5c06dc59dc53d6c48b70fb35b216299cf4
2020-09-04 10:04:28 -07:00
4d431881d1 Control NCCL build parallelism via MAX_JOBS environment var (#44167)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44167

Reviewed By: walterddr, ngimel

Differential Revision: D23522419

Pulled By: malfet

fbshipit-source-id: 31b25a71fef3e470bdf382eb3698e267326fa354
2020-09-04 10:02:53 -07:00
6aba58cfd3 Limit MAX_JOBS to 18 for linux binary builds (#44168)
Summary:
Because those jobs are running in a Docker2XLarge+ container that has 20 cores.
Unfortunately `nproc` returns the number of cores available on the host rather than the number of cores available to the container.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44168

Reviewed By: walterddr, ngimel

Differential Revision: D23539558

Pulled By: malfet

fbshipit-source-id: 3df858722e153a8fcbe8ef6370b1a9c1993ada5b
2020-09-04 09:58:17 -07:00
6cecf7ec68 Enable test_cublas_config_deterministic_error for windows (#42796)
Summary:
test_cublas_config_deterministic_error can pass on Windows, so enable it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42796

Reviewed By: seemethere

Differential Revision: D23520002

Pulled By: malfet

fbshipit-source-id: eccedbbf202b1cada795071a34e266b2c635c2cf
2020-09-04 09:52:57 -07:00
9a5a732866 Register some backwards functions as operators (#44052)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44052

Summary
=======

This PR registers the following backwards functions as operators:
- slice_backward
- select_backward
- gather_backward
- index_select_backward (the backward function for index_select)
- select_index_backward (previously known as index_select_backward, but is actually the backward function for max.dim, min.dim, etc)

In the future, I'd like to register more backward functions as operators
so that we can write batching rules for the backward functions. Batching
rules for backward functions makes it so that we can compute batched
gradients.

Motivation
==========
The rationale behind this PR is that a lot of backwards functions (27 in total)
are incompatible with BatchedTensor due to using in-place operations.
Sometimes we can allow the in-place operations, but other times we can't.
For example, consider select_backward:

```
Tensor select_backward(const Tensor& grad, IntArrayRef input_sizes, int64_t dim, int64_t index) {
  auto grad_input = at::zeros(input_sizes, grad.options());
  grad_input.select(dim, index).copy_(grad);
  return grad_input;
}
```

and consider the following code:
```
x = torch.randn(5, requires_grad=True)
def select_grad(v):
    torch.autograd.grad(x[0], x, v)

vs = torch.randn(B0)  # B0: the batch size being vmapped over
batched_grads = vmap(select_grad)(vs)
```

For the batched gradient use case, `grad` is a BatchedTensor.
The physical version of `grad` has size `(B0,)`.
However, select_backward creates a `grad_input` of shape `(5)`, and
tries to copy `grad` to a slice of it.

Other approaches
================

I've considered the following:
- register select_backward as an operator (this PR)
- have a branch inside select_backward for if `grad` is batched.
    - this is OK, but what if we have more tensor extensions that want to override this?
- modify select_backward to work with BatchedTensor, by creating a new operator for the "select + copy_ behavior".
    - select + copy_ isn't used elsewhere in derivative formulas so this doesn't seem useful

Test Plan
=========

- `pytest test/test_autograd.py -v`
- Registering backward functions may impact performance. I benchmarked
select_backward to see if registering it as an operator led to any noticable
performance overheads: https://gist.github.com/zou3519/56d6cb53775649047b0e66de6f0007dc.
The TL;DR is that the overhead is pretty minimal.

Test Plan: Imported from OSS

Reviewed By: ezyang, fbhuba

Differential Revision: D23481183

Pulled By: zou3519

fbshipit-source-id: 125af62eb95824626dc83d06bbc513262ee27350
2020-09-04 08:30:39 -07:00
0c01f136f3 [BE] Use f-string in various Python functions (#44161)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44161

Reviewed By: seemethere

Differential Revision: D23515874

Pulled By: malfet

fbshipit-source-id: 868cf65aedd58fce943c08f8e079e84e0a36df1f
2020-09-04 07:38:25 -07:00
28b1360d24 [Codemod][FBSourceGoogleJavaFormatLinter] Daily arc lint --take GOOGLEJAVAFORMAT
Reviewed By: zertosh

Differential Revision: D23536088

fbshipit-source-id: d4c6c26ed5bad4e8c1b80ac1c05bd86b36cb6aaa
2020-09-04 07:30:50 -07:00
f8f35fddd4 Optimize code path for adaptive_avg_pool2d when output size is (1, 1) (#43986)
Summary:
Benchmark:

code: https://github.com/xwang233/code-snippet/blob/master/adaptive-avg-pool2d-output-1x1/adap.ipynb

| shape | time_before (ms) | time_after (ms) |
| --- | --- | --- |
| (2, 3, 4, 4), torch.contiguous_format, cpu  |  0.035 |  0.031 |
| (2, 3, 4, 4), torch.contiguous_format, cuda  |  0.041 |  0.031 |
| (2, 3, 4, 4), torch.channels_last, cpu  |  0.027 |  0.029 |
| (2, 3, 4, 4), torch.channels_last, cuda  |  0.031 |  0.034 |
| (2, 3, 4, 4), non_contiguous, cpu  |  0.037 |  0.026 |
| (2, 3, 4, 4), non_contiguous, cuda  |  0.062 |  0.033 |
| (4, 16, 32, 32), torch.contiguous_format, cpu  |  0.063 |  0.055 |
| (4, 16, 32, 32), torch.contiguous_format, cuda  |  0.043 |  0.031 |
| (4, 16, 32, 32), torch.channels_last, cpu  |  0.052 |  0.064 |
| (4, 16, 32, 32), torch.channels_last, cuda  |  0.190 |  0.033 |
| (4, 16, 32, 32), non_contiguous, cpu  |  0.048 |  0.035 |
| (4, 16, 32, 32), non_contiguous, cuda  |  0.062 |  0.033 |
| (8, 128, 64, 64), torch.contiguous_format, cpu  |  0.120 |  0.109 |
| (8, 128, 64, 64), torch.contiguous_format, cuda  |  0.043 |  0.044 |
| (8, 128, 64, 64), torch.channels_last, cpu  |  1.303 |  0.260 |
| (8, 128, 64, 64), torch.channels_last, cuda  |  1.237 |  0.049 |
| (8, 128, 64, 64), non_contiguous, cpu  |  0.132 |  0.128 |
| (8, 128, 64, 64), non_contiguous, cuda  |  0.062 |  0.031 |
| (16, 256, 224, 224), torch.contiguous_format, cpu  |  17.232 |  14.807 |
| (16, 256, 224, 224), torch.contiguous_format, cuda  |  1.930 |  1.930 |
| (16, 256, 224, 224), torch.channels_last, cpu  |  245.025 |  24.345 |
| (16, 256, 224, 224), torch.channels_last, cuda  |  15.593 |  1.944 |
| (16, 256, 224, 224), non_contiguous, cpu  |  11.738 |  6.460 |
| (16, 256, 224, 224), non_contiguous, cuda  |  0.524 |  0.251 |
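
The `(1, 1)` case is special because it reduces to a plain mean over the spatial dimensions, which is the equivalence the optimized path exploits:

```
import torch
import torch.nn.functional as F

x = torch.randn(8, 128, 64, 64).to(memory_format=torch.channels_last)

ref = F.adaptive_avg_pool2d(x, (1, 1))
fast = x.mean(dim=(2, 3), keepdim=True)

assert torch.allclose(ref, fast, atol=1e-6)
```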

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43986

Reviewed By: anjali411

Differential Revision: D23468286

Pulled By: ngimel

fbshipit-source-id: cc181f705feacb2f86df420d648cc59fda69fdb7
2020-09-04 03:37:33 -07:00
ef28ee50b0 [Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D23536086

fbshipit-source-id: 56e9c70a6998086515f59d74c5d8a2280ac2f669
2020-09-04 03:33:32 -07:00
98ad5ff41f [te] Disable reductions by default (#44122)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44122

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D23504769

Pulled By: bertmaher

fbshipit-source-id: 1889217cd22da529e46ab30c9319a5646267e4ec
2020-09-03 23:37:45 -07:00
a37c199b8b [c2][cuda] small improvement to dedup adagrad by avoiding recompute of x_ij (#44173)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44173

it has small 10~15% speed improvement

Test Plan:
== Correctness ==
`buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_sum_gradient '`

Reviewed By: jianyuh

Differential Revision: D23494030

fbshipit-source-id: cdb7ee716a7e559903b72ed9f93bf106813f88fa
2020-09-03 22:50:53 -07:00
2f8a43341d Add API for onnxifi with AOT Glow ONNX (#44021)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44021

Pull Request resolved: https://github.com/pytorch/glow/pull/4854

Test Plan: Added `test_onnxifi_aot.py`

Reviewed By: yinghai

Differential Revision: D23307003

fbshipit-source-id: e6d4f3e394f96fd22f80eb2b8a686cf8171a54c0
2020-09-03 22:46:20 -07:00
d221256888 [Message] Add what to do for missing operators.
Summary: As title.

Test Plan: N/A

Reviewed By: gaurav-work

Differential Revision: D23502416

fbshipit-source-id: a341eb10030e3f319266019ba4c02d9d9a0a6298
2020-09-03 22:41:27 -07:00
addfd7a9b9 Add tests against autograd precedence and multiple dispatch. (#44037)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44037

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23480154

Pulled By: ailzhang

fbshipit-source-id: 28b68e67975397c76ce6c73ceaeec9d5cc934635
2020-09-03 22:19:08 -07:00
b60ffcdfdd Enable typechecks for torch.nn.quantized.modules.linear (#44154)
Summary:
Also import `Optional` directly from `typing` rather than from `_jit_internal`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44154

Reviewed By: seemethere

Differential Revision: D23511833

Pulled By: malfet

fbshipit-source-id: f78c5fd679c002b218e4d287a9e56fa198171981
2020-09-03 19:52:49 -07:00
538d3bd364 Enable CUDA 11 jobs for Windows nightly builds (#44086)
Summary:
Fixes https://github.com/pytorch/pytorch/pull/43366/files#r474333051.
Testing with https://github.com/pytorch/pytorch/pull/44007.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44086

Reviewed By: ezyang

Differential Revision: D23493553

Pulled By: malfet

fbshipit-source-id: 34b3e5b2e8dece5e97db9d507c34d61d33bd0863
2020-09-03 17:45:31 -07:00
69e38828f5 [quant] conv_transpose2d_prepack/conv_transpose2d_unpack (#40351)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40351

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D22158983

Pulled By: z-a-f

fbshipit-source-id: 3ca064c2d826609724b2740fcc9b9eb40556168d
2020-09-03 17:21:32 -07:00
c40e3f9f98 [android][jni] Support Tensor MemoryFormat in java wrappers (#40785)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40785

The main goal of this change is to support creating Tensors from a blob specified in NHWC (ChannelsLast) format.

ChannelsLast is supported only for 4-dim tensors; this is enforced on the LibTorch side. I have not added asserts on the Java side, both to avoid duplicate asserts and in case this limitation changes in the future.

Additional changes in `aten/src/ATen/templates/Functions.h`:

`from_blob` creates an `at::empty({0}, options)` tensor first and sets its Storage with sizes and strides afterwards.

But as ChannelsLast is only for 4-dim tensors, that creation fails, since dim == 1.

I've added a `zero_sizes()` function that returns `{0, 0, 0, 0}` for ChannelsLast and ChannelsLast3d.

Test Plan: Imported from OSS

Reviewed By: dreiss

Differential Revision: D22396244

Pulled By: IvanKobzarev

fbshipit-source-id: 02582d748a554e0f859aefe71cd2c1e321fb8979
2020-09-03 17:01:35 -07:00
70aecd2a7f [NNC] make inlining immediate (take 2) and fix bugs (#43885)
Summary:
A rework of `computeInline` which makes it work a bit better, particularly when combined with other transformations. Previously we stored Functions that were inlined and then deferred the actual inlining of the function body until prepareForCodegen was called. This has an issue when transformations are applied to the LoopNest: the function body can differ from what appears in the root_stmt and result in inlining that a) fails, b) reverses other transformations, or c) produces a weird, unpredictable combination of the two.

This PR changes that behaviour so that the inlining occurs in the root stmt immediately, which means it reflects any previous transformations and any future transformations have a true view of the internal IR. It also has the benefit that inspecting the root statement gives an accurate view of it without needing to call prepareForCodegen. I also removed the difference between `computeInline` and `computeInlineWithRand`, and we handle calls to `rand()` in all branches.

This is a rework of https://github.com/pytorch/pytorch/issues/38696, with the agreed changes from ZolotukhinM and zheng-xq: we should only inline if the dimensions are trivial (i.e. they are vars, not exprs).

This PR is mostly tests, and I fixed a bunch of bugs I found along the way. Partial list:
* When inlining an expression involving rand, we would create random vars equal to the dimensionality of the enclosing Tensor, not the produced Tensor, meaning we'd use an incorrect value if the inlined tensor was smaller. E.g. `X[i] = rand(); A[i, j] = X[i]` would produce a tensor where `A[0, 0] != A[0, 1]`. This is fixed by inserting the Let binding of the random variable at the correct loop body.
* When inlining we'd replace all calls to `rand()` rather than just those present in the Tensor being inlined.
* `rand()` was treated symbolically by the simplifier and we would aggregate or cancel calls to `rand()`. I have fixed the hasher to hash all calls to `rand()` distinctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43885

Reviewed By: gmagogsfm

Differential Revision: D23503636

Pulled By: nickgg

fbshipit-source-id: cdbdc902b7a14d269911d978a74a1c11eab004fa
2020-09-03 16:49:24 -07:00
bc4a00c197 [TVM] Support Fused8BitRowwiseQuantizedToFloat op (#44098)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44098

Reviewed By: yinghai

Differential Revision: D23470129

fbshipit-source-id: 1959e2167859f7cbc16e1423b957072bbc743ece
2020-09-03 16:39:53 -07:00
3105d8a9b2 [TensorExpr] Fuser: rely on input types when checking whether a device is supported. (#44139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44139

Also, make sure that we're checking that condition when we're starting a
new fusion group, not only when we merge a node into an existing fusion
group. Oh, and one more: add a test checking that we're rejecting graphs
with unspecified shapes.

Differential Revision: D23507510

Test Plan: Imported from OSS

Reviewed By: bertmaher

Pulled By: ZolotukhinM

fbshipit-source-id: 9c268825ac785671d7c90faf2aff2a3e5985ac5b
2020-09-03 16:27:14 -07:00
71510c60ad fx qat: respect device affinity (#44115)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44115

Fixes device affinity in the FX prepare pass for QAT. Before this PR, observers
were always created on CPU. After this PR, observers are created on the
same device as the rest of the model. This will enable QAT prepare to
work regardless of whether users move the model to cuda before or after
calling this pass.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx.test_qat_prepare_device_affinity
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D23502291

fbshipit-source-id: ec4ed20c21748a56a25e3395b35ab8640d71b5a8
2020-09-03 16:16:59 -07:00
7816d53798 [JIT] Add mypy type annotations for JIT (#43862)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43862

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23491151

Pulled By: SplitInfinity

fbshipit-source-id: 88367b89896cf409bb9ac3db7490d6779efdc3a4
2020-09-03 15:09:24 -07:00
9dd8670d7d [jit] Better match behavior of loaded ScriptModules vs. freshly created ones (#43298)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43298

IR emitter uses `ModuleValue` to represent ScriptModules and emit IR for
attribute access, submodule access, etc.

`ModuleValue` relies on two pieces of information, the JIT type of the
module, and the `ConcreteModuleType`, which encapsulates Python-only
information about the module.

ScriptModules loaded from a package used to create a dummy
ConcreteModuleType without any info in it. This led to divergences in
behavior during compilation.

This PR makes the two ways of constructing a ConcreteModuleType equivalent,
modulo any py-only information (which, by definition, is never present in
packaged files anyway).

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23228738

Pulled By: suo

fbshipit-source-id: f6a660f42272640ca1a1bb8c4ee7edfa2d1b07cc
2020-09-03 15:03:39 -07:00
74f18476a2 [jit] fix segfault in attribute lookup on loaded ScriptModules (#43284)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43284

The IR emitter looks for attributes on modules like:
1. Check the JIT type for the attribute
2. Check the originating Python class, in order to fulfill requests for, e.g. static methods or ignored methods.

In the case where you do:
```
inner_module = torch.jit.load("inner.pt")
wrapped = Wrapper(inner_module)  # wrap the loaded ScriptModule in an nn.Module
torch.jit.script(wrapped)
```

The IR emitter may check for attributes on `inner_module`. There is no
originating Python class for `inner_module`, since it was directly
compiled from the serialized format.

Due to a bug in the code, we don't guard for this case, and a segfault
results if the wrapper asks for an undefined attribute. The lookup in
this case looks like:
1. Check the JIT type for the attribute (not there!)
2. Check the originating Python class (this is a nullptr! segfault!)

This PR guards this case and properly just raises an attribute missing
compiler error instead of segfaulting.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23224337

Pulled By: suo

fbshipit-source-id: 0cf3060c427f2253286f76f646765ec37b9c4c49
2020-09-03 15:01:59 -07:00
e64879e180 [tensorexpr] Alias analysis tests (#44110)
Summary:
Some tests for alias analysis.

The first aliases at the module level and the second at the input level.

Please let me know if there are other alias situations!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44110

Reviewed By: nickgg

Differential Revision: D23509473

Pulled By: bwasti

fbshipit-source-id: fbfe71a1d40152c8fbbd8d631f0a54589b791c34
2020-09-03 14:52:47 -07:00
6868bf95c6 [JIT] Fuser match on schemas not node kind (#44083)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44083

Match on the complete schema of a node instead of its node kind when deciding to fuse it. Previously we matched on node kind, which could fail for something like `aten::add(int, int)`: if a new overload was added to an op without corresponding NNC support, we would still fuse it.

Follow ups are:
 - bail when an output tensor type isn't uniquely determined by the input types (e.g. aten::add and the second input could be either a float or an int)
- remove NNC lowering for _tanh_backward & _sigmoid_backward
- Validate that we support all of the overloads here. I optimistically added ops that included Tensors; it's possible that we do not support every overload here. This isn't a regression, and this PR is at least improving our failures in that regard.

I can do any of these as part of this PR if desired, but there are a number of failures people have run into that this PR fixes so I think it would be good to land this sooner than later.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D23503704

Pulled By: eellison

fbshipit-source-id: 3ce971fb1bc3a7f1cbaa38f1ed853e2db3d67c18
2020-09-03 14:47:19 -07:00
9b3c72d46e [pytorch] Make mobile find_method return an optional (#43965)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43965

As part of a larger effort to unify the API between the lite interpreter and full JIT:
- implement torch::jit::mobile::Method, a proxy for torch::jit::mobile::Function
- add support for overloaded operator() to mobile Method and Function
- mobile find_method now returns a c10::optional<Method> (so signature matches full jit)
- moves some implementation of Function from module.cpp to function.cpp
ghstack-source-id: 111161942

Test Plan: CI

Reviewed By: iseeyuan

Differential Revision: D23330762

fbshipit-source-id: bf0ba0d711d9566c92af31772057ecd35983ee6d
2020-09-03 14:46:18 -07:00
f91bdbeabd Enable function calls in TEFuser and SpecializeAutogradZero (#43866)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43866

Reviewed By: ezyang

Differential Revision: D23452798

Pulled By: Krovatkin

fbshipit-source-id: 2cff4c905bf1b5d9de56e7869458ffa6fce1f1b5
2020-09-03 14:42:52 -07:00
e05fa2f553 [quant] Prep for conv_transpose packing (#39714)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39714

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D22087071

Pulled By: z-a-f

fbshipit-source-id: 507f8a414026eb4c9926f68c1e94d2f56119bca6
2020-09-03 14:10:32 -07:00
352a32e7f3 [caffe2] fix clang build
Summary:
* multiple -Wpessimizing-moves
* `static` within a `__host__` `__device__` function

Test Plan:
```lang=bash
buck build -c fbcode.cuda_use_clang=true fblearner/flow/projects/dper:workflow
```

Reviewed By: andrewjcg

Differential Revision: D23506573

fbshipit-source-id: 1490a1267e39e067d3ef836ef9b1cd5d7a28f724
2020-09-03 14:02:27 -07:00
f3da9e3b50 Enable Enum pickling/unpickling. (#43188)
Summary:
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **https://github.com/pytorch/pytorch/issues/43188 Enable Enum pickling/unpickling.**
* https://github.com/pytorch/pytorch/issues/42963 Add Enum TorchScript serialization and deserialization support
* https://github.com/pytorch/pytorch/issues/42874 Fix enum constant printing and add FileCheck to all Enum tests
* https://github.com/pytorch/pytorch/issues/43121 Add Enum convert back to Python object support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43188

Reviewed By: zdevito

Differential Revision: D23365141

Pulled By: gmagogsfm

fbshipit-source-id: f0c93d4ac614dec047ad8640eb6bd9c74159b558
2020-09-03 13:51:02 -07:00
d0421ff1cc Benchmarks: add scripts for FastRNNs results comparison. (#44134)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44134

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23505810

Pulled By: ZolotukhinM

fbshipit-source-id: d0b3d70d4c2a44a8c3773631d09a25a98ec59370
2020-09-03 13:44:42 -07:00
3806c939bd Polish DDP join API docstrings (#43973)
Summary:
Polishes DDP join api docstrings and makes a few minor cosmetic changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43973

Reviewed By: zou3519

Differential Revision: D23467238

Pulled By: rohan-varma

fbshipit-source-id: faf0ee56585fca5cc16f6891ea88032336b3be56
2020-09-03 13:39:45 -07:00
442684cb25 Enable typechecks for torch.nn.modules.[activation|upsampling] (#44093)
Summary:
Add missing `hardsigmoid`, `silu`, `hardswish` and `multi_head_attention_forward` to functional.pyi.in.
Embed some typing annotations into functional.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44093

Reviewed By: ezyang

Differential Revision: D23494384

Pulled By: malfet

fbshipit-source-id: 27023c16ff5951ceaebb78799c4629efa25f7c5c
2020-09-03 13:20:04 -07:00
a153f69417 Fix replaceAtenConvolution for BC. (#44036)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44036

Running replaceAtenConvolution on older traced models won't work, as the
_convolution signature has changed and replaceAtenConvolution was
changed to account for that.
But we did not preserve the old behavior during that change. This change
restores the old behavior while keeping the new one.

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23476775

fbshipit-source-id: 73a0c2b7387f2a8d82a8d26070d0059972126836
2020-09-03 12:57:57 -07:00
ba65cce2a2 Fix transposed conv2d rewrite pattern to account for convolution api (#44035)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44035

Accounts for the convolution API change.

Also added a test to capture such cases in the future.

Test Plan:
python test/test_xnnpack_integration.py

Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23476773

fbshipit-source-id: a62c4429351c909245106a70b4c60b1bacffa817
2020-09-03 12:55:43 -07:00
55ff9aa185 Test TE fuser unary ops and fix sigmoid(half) (#44094)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44094

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D23494950

Pulled By: bertmaher

fbshipit-source-id: 676c4e57267c4ad92065ea90b06323918dd5b0de
2020-09-03 12:48:46 -07:00
bfa1fa5249 Update rocm-3.5.1 build job to rocm-3.7 (#44123)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44123

Reviewed By: seemethere

Differential Revision: D23504193

Pulled By: malfet

fbshipit-source-id: 3570dc0aa879a3fdd43f3ecd41ee9e745006cfde
2020-09-03 12:39:30 -07:00
49215d7f26 For CriterionTests, have check_gradgrad actually only affect gradgrad checks. (#44060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44060

Right now it skips grad checks as well.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23484018

Pulled By: gchanan

fbshipit-source-id: 24a8f1af41f9918aaa62bc3cd78b139b2f8de1e1
2020-09-03 12:29:32 -07:00
42f9897983 Mark bucketize as not subject to autograd (#44102)
Summary:
Bucketize returns integers; currently this triggers an internal assert, so we apply the mechanism for this case (also used for argmax etc.).
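
A minimal sketch of the behavior in question (values are arbitrary; integer outputs cannot require grad):
```python
import torch

boundaries = torch.tensor([1.0, 3.0, 5.0])
x = torch.tensor([0.5, 2.0, 4.0, 6.0])

idx = torch.bucketize(x, boundaries)
print(idx)        # tensor([0, 1, 2, 3])
print(idx.dtype)  # torch.int64 -- an integer-valued result, hence no autograd
```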

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44102

Reviewed By: zou3519

Differential Revision: D23500048

Pulled By: albanD

fbshipit-source-id: fdd869cd1feead6616b532b3e188bd5512adedea
2020-09-03 12:05:47 -07:00
91b0d1866a add tanh + quantize unit test (#44076)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44076

add fakelowp test for tanh + quantize

Test Plan: net runner

Reviewed By: venkatacrc

Differential Revision: D23339662

fbshipit-source-id: 96c2cea12b41bf3df24aa46e601e053dca8e9481
2020-09-03 12:00:36 -07:00
de672e874d [JIT] Improve error message for unsupported Optional types (#44054)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44054

**Summary**
This commit improves the error message that is printed when an
`Optional` type annotation with an unsupported contained type is
encountered. At present, the `Optional` is printed as-is, and
`Optional[T]` is syntatic sugar for `Union[T, None]`, so that is what
shows up in the error message and can be confusing. This commit modifies
the error message so that it prints `T` instead of `Union[T, None]`.

**Test Plan**
Continuous integration.

Example of old message:
```
AssertionError: Unsupported annotation typing.Union[typing.List, NoneType] could not be resolved.
```
Example of new message:
```
AssertionError: Unsupported annotation typing.Union[typing.List, NoneType] could not be resolved because typing.List could not be resolved.
```

**Fixes**
This commit fixes #42859.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23490365

Pulled By: SplitInfinity

fbshipit-source-id: 2aa9233718e78cf1ba3501ae11f5c6f0089e29cd
2020-09-03 11:55:06 -07:00
d11603de38 [TensorExpr] Benchmarks: set number of profiling runs to 2 for PE. (#44112)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44112

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23500904

Pulled By: ZolotukhinM

fbshipit-source-id: d0dd54752b7ea5ae11f33e865c96d2d61e98d573
2020-09-03 11:29:35 -07:00
b10c527a1f [pytorch][bot] update mobile op deps (#44100)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44100

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23496532

Pulled By: ljk53

fbshipit-source-id: 1e5b9059482e423960349d1361a7a98718c2d9ed
2020-09-03 11:24:26 -07:00
f96b91332f [caffe2.proto] Add AOTConfig (#44020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44020

Pull Request resolved: https://github.com/pytorch/glow/pull/4853

Add AOT config

Reviewed By: yinghai

Differential Revision: D23414435

fbshipit-source-id: 3c48acf29889fcf63def37a48de382e675e0e1f3
2020-09-03 11:07:45 -07:00
c59e11bfbb Add soft error reporting to capture all the inference runtime failure. (#44078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44078

When PyTorch mobile inference fails and throws an exception, if the caller catches it and does not crash the app, we are not able to track the inference failure.

So we are adding native soft error reporting to capture all the failures occurring during module loading and running, including both crashing and non-crashing failures. Since c10::Error has good error-messaging stack handling (D21202891 (a058e938f9)), we are utilizing it for the error handling and message printout.
ghstack-source-id: 111307080

Test Plan:
Verified that the soft error reporting is sent through module.cpp when an operator is missing, and made sure a logview mid is generated with a stack trace: https://www.internalfb.com/intern/logview/details/facebook_android_softerrors/5dd347d1398c1a9a73c804b20f7c2179/?selected-logview-tab=latest.

Error message with context is logged below:

```
soft_error.cpp		[PyTorchMobileInference] : Error occured during model running entry point: Could not run 'aten::embedding' with arguments from the 'CPU' backend. 'aten::embedding' is only available for these backends: [BackendSelect, Named, Autograd, Autocast, Batched, VmapMode].

BackendSelect: fallthrough registered at xplat/caffe2/aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Named: registered at xplat/caffe2/aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Autograd: fallthrough registered at xplat/caffe2/aten/src/ATen/core/VariableFallbackKernel.cpp:31 [backend fallback]
Autocast: fallthrough registered at xplat/caffe2/aten/src/ATen/autocast_mode.cpp:253 [backend fallback]
Batched: registered at xplat/caffe2/aten/src/ATen/BatchingRegistrations.cpp:317 [backend fallback]
VmapMode: fallthrough registered at xplat/caffe2/aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]

Exception raised from reportError at xplat/caffe2/aten/src/ATen/core/dispatch/OperatorEntry.cpp:261 (m
```

Reviewed By: iseeyuan

Differential Revision: D23428636

fbshipit-source-id: 82d5d9c054300dff18d144f264389402d0b55a8a
2020-09-03 10:54:43 -07:00
5973b44d9e Rename NewCriterionTest to CriterionTest. (#44056)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44056

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23482573

Pulled By: gchanan

fbshipit-source-id: dde0f1624330dc85f48e5a0b9d98fb55fdb72f68
2020-09-03 10:29:20 -07:00
7d95eb8633 [fbgemm] manual submodule update (#44082)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44082

The automated submodule update is running into some test failures, and I am not sure how to rebase it.

automated submodule update:
https://github.com/pytorch/pytorch/pull/43817

Test Plan: CI tests

Reviewed By: jianyuh

Differential Revision: D23489240

fbshipit-source-id: a49b01786ebf0a59b719a0abf22398e1eafa90af
2020-09-03 10:07:46 -07:00
c10f30647f Fix CUDA debug nightly build failure (#44085)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43607.
Tested in https://github.com/pytorch/pytorch/pull/44007.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44085

Reviewed By: malfet

Differential Revision: D23493663

Pulled By: ezyang

fbshipit-source-id: 4c01f3fc5a52814a23773a56b980c455851c2686
2020-09-03 09:12:52 -07:00
98320061ad DDP Communication hook: (Patch) Fix the way we pass future result to buckets. (#43734)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43734

Following the additional GH comments on the original PR https://github.com/pytorch/pytorch/pull/43307.
ghstack-source-id: 111327130

Test Plan: Run `python test/distributed/test_c10d.py`

Reviewed By: smessmer

Differential Revision: D23380288

fbshipit-source-id: 4b8889341c57b3701f0efa4edbe1d7bbc2a82ced
2020-09-03 08:59:10 -07:00
768c2b0fb2 Fix THPVariable_float_scalar (#43842)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43842

Reviewed By: ailzhang

Differential Revision: D23426892

Pulled By: ezyang

fbshipit-source-id: 63318721fb3f4a57d417f9a87e57c74f6d4e6e18
2020-09-03 08:39:41 -07:00
b6e2b1eac7 BatchedFallback: stop emitting the entire schema in the fallback warning (#44051)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44051

Instead, just emit the operator name. The entire schema is pretty wordy
and doesn't add any additional information.

Test Plan: - modified test: `pytest test/test_vmap.py -v`

Reviewed By: ezyang

Differential Revision: D23481184

Pulled By: zou3519

fbshipit-source-id: 9fbda61fc63565507b04c8b87e0e326a2036effa
2020-09-03 08:33:51 -07:00
cae52b4036 Merge CriterionTest into NewCriterionTest. (#44055)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44055

There is no functional change here.  Another patch will rename NewCriterionTest to CriterionTest.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23482572

Pulled By: gchanan

fbshipit-source-id: de364579067e2cc9de7df6767491f8fa3a685de2
2020-09-03 08:14:34 -07:00
15643de941 With fixes, Back out "Back out "Selective meta programming preparation for prim ops""
Summary: Original commit changeset: b2c712a512a2

Test Plan: CI

Reviewed By: jiatongzhou

Differential Revision: D23477710

fbshipit-source-id: 177ee56a82234376b7a5c3fc33441f8acfd59fea
2020-09-03 08:02:20 -07:00
24ca6aab02 Improves type-checking guards. (#43339)
Summary:
PR https://github.com/pytorch/pytorch/issues/38157 fixed type checking for mypy by including `if False` guards on some type-checker-only imports. However other typecheckers - [like pyright](https://github.com/microsoft/pylance-release/issues/262#issuecomment-677758245) - will respect this logic and ignore the imports. Using [`if TYPE_CHECKING`](https://docs.python.org/3/library/typing.html#typing.TYPE_CHECKING) instead means both mypy and pyright will work correctly.

[For background, an example of where the current code fails](https://github.com/microsoft/pylance-release/issues/262) is if you make a file `tmp.py` with the contents
```python
import torch
torch.ones((1,))
```
Then [`pyright tmp.py --lib`](https://github.com/microsoft/pyright#command-line) will fail with a `"ones" is not a known member of module` error. This is because it can't find the `_VariableFunctions.pyi` stub file, as pyright respects the `if False` logic. After adding the `TYPE_CHECKING` guard, all works correctly.

Credit to erictraut for suggesting the fix.
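
For reference, a minimal sketch of the guard pattern (the imported names are hypothetical):
```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by mypy and pyright, but never executed at runtime.
    from some_package import SomeType  # hypothetical type-checker-only import

def f(x: "SomeType") -> None:  # string annotation keeps runtime safe
    ...
```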

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43339

Reviewed By: agolynski

Differential Revision: D23348142

Pulled By: ezyang

fbshipit-source-id: c8a58122a7b0016845c311da39a1cc48748ba03f
2020-09-03 07:45:53 -07:00
b6d5973e13 Delete THCStream.cpp (#43733)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43733

Reviewed By: malfet

Differential Revision: D23405121

Pulled By: ezyang

fbshipit-source-id: 95fa80b5dcb11abaf4d2507af15646a98029c80d
2020-09-03 07:41:24 -07:00
68a1fbe308 Allow criterion backwards test on modules requiring extra args (i.e. CTCLoss). (#44050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44050

We don't actually turn on the CTCLoss tests since they fail, but this allows you to toggle check_forward_only and have the code actually run.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23481091

Pulled By: gchanan

fbshipit-source-id: f2a3b0a2dee27341933c5d25f1e37a878b04b9f6
2020-09-03 07:41:21 -07:00
5f89aa36cf Actually run backward criterion tests. (#44030)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44030

This looks to have been a mistake from https://github.com/pytorch/pytorch/pull/9287.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23476274

Pulled By: gchanan

fbshipit-source-id: 81ed9d0c9a40d49153fc97cd69fdcd469bec0c73
2020-09-03 07:39:13 -07:00
665feda15b Adds opinfo-based autograd tests and (un)supported dtype tests (#43451)
Summary:
This PR adds a new test suite, test_ops.py, designed for generic tests across all operators with OpInfos. It currently has two kinds of tests:

- it validates that the OpInfo has the correct supported dtypes by verifying that unsupported dtypes throw an error and supported dtypes do not
- it runs grad and gradgrad checks on each op and its variants (method and inplace) that has an OpInfo

This is a significant expansion and simplification of the current autogenerated autograd tests, which spend considerable time processing their inputs. As an alternative, this PR extends OpInfos with "SampleInputs" that are much easier to use. These sample inputs are analogous to the existing tuples in `method_tests()`.

Future PRs will extend OpInfo-based testing to other uses of `method_tests()`, like test_jit.py, to ensure that new operator tests can be implemented entirely using an OpInfo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43451

Reviewed By: albanD

Differential Revision: D23481723

Pulled By: mruberry

fbshipit-source-id: 0c2cdeacc1fdaaf8c69bcd060d623fa3db3d6459
2020-09-03 02:50:48 -07:00
ab7606702c Rectified a few grammatical errors in documentation (#43695)
Summary:
Rectified a few grammatical errors in the PyTorch documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43695

Reviewed By: anjali411

Differential Revision: D23451600

Pulled By: ezyang

fbshipit-source-id: bc7b34c240fde1b31cac811080befa2ff2989395
2020-09-02 23:59:45 -07:00
40fec4e739 [TensorExpr] Fuser: do not fuse ops with 0-dim tensors. (#44073)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44073

We don't yet have proper support for them on the NNC side or in the JIT IR->NNC lowering.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D23487905

Pulled By: ZolotukhinM

fbshipit-source-id: da0da7478fc8ce7b455176c95d8fd610c94352c1
2020-09-02 22:59:04 -07:00
3da82aee03 [JIT] Remove profile nodes before BatchMM. (#43961)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43961

Currently we remove prim::profile nodes and embed the type info
directly in the IR right before the fuser, because it is difficult to
fuse in the presence of prim::profile nodes. It turns out that BatchMM has
a similar problem: it doesn't work when there are prim::profile nodes in
the graph. These two passes run next to each other, so we can simply
remove prim::profile nodes slightly earlier: before the BatchMM pass.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23453266

Pulled By: ZolotukhinM

fbshipit-source-id: 92cb50863962109b3c0e0112e56c1f2cb7467ff1
2020-09-02 22:57:39 -07:00
ae7699829c Remove THC max and min, which are no longer used (#43903)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43903

Reviewed By: smessmer

Differential Revision: D23493225

Pulled By: ezyang

fbshipit-source-id: bc89d8221f3351da0ef3cff468ffe6a91dae96a6
2020-09-02 22:05:05 -07:00
32e0cedc53 [ONNX] Move tests to test_pytorch_onnx_onnxruntime (#42684)
Summary:
Move tests to test_pytorch_onnx_onnxruntime from test_utility_fun

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42684

Reviewed By: smessmer

Differential Revision: D23480360

Pulled By: bzinodev

fbshipit-source-id: 8876ba0a0c3e1d7104511de7a5cca5262b32f574
2020-09-02 21:47:38 -07:00
bc45c47aa3 Expand the coverage of test_addmm and test_addmm_sizes (#43831)
Summary:
- This test is very fast and very important, so it makes no sense to mark it as slowTest
- This test should also run on CUDA
- This test should check alpha and beta support
- This test should check `out=` support (see the sketch below)
- manual computation should use a list instead of index_put because the list is much faster
- precision for TF32 needs to be fixed. Will do it in a future PR.
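
A hedged sketch of what the expanded coverage exercises (shapes and scalars are arbitrary):
```python
import torch

M = torch.randn(2, 3)
a = torch.randn(2, 4)
b = torch.randn(4, 3)

out = torch.empty(2, 3)
torch.addmm(M, a, b, beta=0.5, alpha=2.0, out=out)  # out = 0.5*M + 2.0*(a @ b)
assert torch.allclose(out, 0.5 * M + 2.0 * (a @ b))
```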

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43831

Reviewed By: ailzhang

Differential Revision: D23435032

Pulled By: ngimel

fbshipit-source-id: d1b8350addf1e2fe180fdf3df243f38d95aa3f5a
2020-09-02 20:51:49 -07:00
f5ba489f93 Move dependent configs to CUDA-10.2 (#44057)
Summary:
Move `multigpu`, `noavx` and `slow` test configs to CUDA-10.2, but keep them as master-only tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44057

Reviewed By: walterddr, seemethere

Differential Revision: D23482732

Pulled By: malfet

fbshipit-source-id: a6b050701cbc1d8f176ebb302f7f5076a78f1f58
2020-09-02 20:07:48 -07:00
a76a56d761 Add "torch/testing/_internal/data/*.pt" to .gitignore (#43941)
Summary:
I usually get this extra "legacy_conv2d.pt" file in my git "changed files". I found that this is from tests with `download_file`
42c895de4d/test/test_nn.py (L410-L426)

and its definition (see `data_dir` for download output location)
f17d7a5556/torch/testing/_internal/common_utils.py (L1338-L1357)

I assume a file "generated" by a test should not be tracked in VCS? Also, if the file is updated on the server, users may still use the old version if they have already downloaded it before.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43941

Reviewed By: anjali411

Differential Revision: D23451264

Pulled By: ezyang

fbshipit-source-id: 7fcdfb24685a7e483914cc46b3b024df798bf7f7
2020-09-02 20:00:31 -07:00
37658b144b Remove useless py2 compatibility import __future__, part 1 (#43808)
Summary:
To avoid conflicts, this PR does not remove all imports. More are coming in further PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43808

Reviewed By: wanchaol

Differential Revision: D23436675

Pulled By: ailzhang

fbshipit-source-id: ccc21a1955c244f0804277e9e47e54bfd23455cd
2020-09-02 19:15:11 -07:00
b2a9c3baa9 [TVM] Support fp16 weights in c2_frontend (#44070)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44070

Reviewed By: yinghai

Differential Revision: D23444253

fbshipit-source-id: 0bfa98172dfae835eba5ca7cbe30383ba964c2a6
2020-09-02 19:07:35 -07:00
b2aaf212aa [TensorExpr] Add option to enforce TensorExprKernel fallbacks. (#43972)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43972

When debugging, it is useful to disable the NNC backend to see whether
the bug is there or in the fuser logic.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23455624

Pulled By: ZolotukhinM

fbshipit-source-id: f7c0452a29b860afc806e2d58acf35aa89afc060
2020-09-02 18:34:24 -07:00
6a6552576d rename _min_max to _aminmax (#44001)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44001

This is to align with the naming in numpy and in
https://github.com/pytorch/pytorch/pull/43092
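
A minimal usage sketch of the renamed op (a private API, so subject to change):
```python
import torch

x = torch.arange(6.0).reshape(2, 3)
mn, mx = torch._aminmax(x)  # single pass: mn == tensor(0.), mx == tensor(5.)
```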

Test Plan:
```
python test/test_torch.py TestTorchDeviceTypeCPU.test_aminmax_cpu_float32
python test/test_torch.py TestTorchDeviceTypeCUDA.test_aminmax_cuda_float32
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23465298

fbshipit-source-id: b599035507156cefa53942db05f93242a21c8d06
2020-09-02 18:07:55 -07:00
486a9fdab2 _min_max.dim: CUDA implementation (#42943)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42943

Adds a CUDA kernel for _min_max_val.dim

Test Plan:
correctness:
```
python test/test_torch.py TestTorchDeviceTypeCUDA.test_minmax_cuda_float32
```

performance: ~50% savings on a tensor representative of quantization workloads: https://gist.github.com/vkuzo/3e16c645e07a79dd66bcd50629ff5db0

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23086797

fbshipit-source-id: 04a2d310f64a388d48ab8131538dbd287900ca4a
2020-09-02 18:07:51 -07:00
834279f4ab _min_max_val.dim: CPU implementation (#42894)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42894

Continuing the min_max kernel implementation, this PR adds the
CPU path when a dim is specified.  Next PR will replicate for CUDA.

Note: after a discussion with ngimel, we are taking the fast path
of calculating the values only and not the indices, since that is what
is needed for quantization, and calculating indices would require support
for reductions on 4 outputs, which is additional work. So, the API
doesn't fully match `min.dim` and `max.dim`.

Flexible on the name; let me know if something else is better.

Test Plan:
correctness:
```
python test/test_torch.py TestTorchDeviceTypeCPU.test_minmax_cpu_float32
```

performance: seeing a 49% speedup on a min+max tensor with similar shapes
to what we care about for quantization observers (bench:
https://gist.github.com/vkuzo/b3f24d67060e916128a51777f9b89326). For
other shapes (more dims, different dim sizes, etc), I've noticed a
speedup as low as 20%, but we don't have a good use case to optimize
that so perhaps we can save that for a future PR.

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23086798

fbshipit-source-id: b24ce827d179191c30eccf31ab0b2b76139b0ad5
2020-09-02 18:07:47 -07:00
78994d165f min_max kernel: add CUDA (#42868)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42868

Adds a CUDA kernel for the _min_max function.

Note: this is a re-submit of https://github.com/pytorch/pytorch/pull/41805,
was faster to resubmit than to resurrect that one.  Thanks to durumu
for writing the original implementation!

Future PRs will add index support, docs, and hook this up to observers.

Test Plan:
```
python test/test_torch.py TestTorchDeviceTypeCUDA.test_minmax_cuda_float32
```

Basic benchmarking shows a 50% reduction in time to calculate min + max:
https://gist.github.com/vkuzo/b7dd91196345ad8bce77f2e700f10cf9


Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23057766

fbshipit-source-id: 70644d2471cf5dae0a69343fba614fb486bb0891
2020-09-02 18:06:03 -07:00
33d51a9b32 Respect canFuseOn{CPU,GPU} in TE fuser (#43967)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43967

Test Plan: Imported from OSS

Reviewed By: asuhan

Differential Revision: D23469048

Pulled By: bertmaher

fbshipit-source-id: 1005a7ae08974059ff9d467492caa3a388070eeb
2020-09-02 18:00:25 -07:00
041573c8cd Add Cost Inference for AdaGrad and RowWiseSparseAdagrad
Summary: Add cost inference for AdaGrad and RowWiseSparseAdagrad

Test Plan:
Ran `buck test caffe2/caffe2/python/operator_test:adagrad_test`
Result: https://our.intern.facebook.com/intern/testinfra/testrun/5629499567799494

Reviewed By: bwasti

Differential Revision: D23442607

fbshipit-source-id: 67800fb82475696512ad19a43067774247f8b230
2020-09-02 17:52:40 -07:00
2f044d4ee5 Fix CI build (#44068)
Summary:
Some of our machines have only 1 device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44068

Reviewed By: wanchaol

Differential Revision: D23485730

Pulled By: izdeby

fbshipit-source-id: df6bc0aba18feefc50c56a8f376103352fa2a2ea
2020-09-02 17:09:30 -07:00
129f406062 Make torch.conj() a composite function and return self for real tensors (#43270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43270

`torch.conj` is a very commonly used operator for complex tensors, but it's mathematically a no-op for real tensors. Switching to tensorflow gradients for complex tensors (as discussed in #41857) would involve adding `torch.conj()` to the backward definitions for a lot of operators. In order to preserve autograd performance for real tensors and maintain numpy compatibility for `torch.conj`, this PR updates `torch.conj()` so that it behaves the same for complex tensors but simply returns the `self` tensor for tensors of non-complex dtypes. The documentation states that the returned tensor for a real input shouldn't be mutated. We could perhaps return an immutable tensor for this case in the future when that functionality is available (zdevito ezyang).
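
A small sketch of the resulting behavior (outputs shown as comments):
```python
import torch

c = torch.tensor([1 + 1j, 2 - 2j])
r = torch.tensor([1.0, 2.0])

torch.conj(c)  # tensor([1.-1.j, 2.+2.j]) -- negates the imaginary part
torch.conj(r)  # for non-complex dtypes this now just returns `self`; don't mutate it
```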

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23460493

Pulled By: anjali411

fbshipit-source-id: 3b3bf0af55423b77ff2d0e29f5d2c160291ae3d9
2020-09-02 17:06:04 -07:00
f9efcb646b fx quant: clarify state in Quantizer object (#43927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43927

Adds uninitialized placeholders for various state
used throughout the Quantizer object, with documentation
on what they are. No logic change.

Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestQuantizeFx
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23439473

fbshipit-source-id: d4ae83331cf20d81a7f974f88664ccddca063ffc
2020-09-02 16:34:00 -07:00
f15e27265f [torch.fx] Add support for custom op (#43248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43248

We add support for __torch_function__ overrides for C++ custom ops. The logic is the same as for other components, like torch.nn.Module.
Refactored some code a little bit to make it reusable.

Test Plan: buck test //caffe2/test:fx -- test_torch_custom_ops

Reviewed By: bradleyhd

Differential Revision: D23203204

fbshipit-source-id: c462a86e407e46c777171da32d7a40860acf061e
2020-09-02 16:08:37 -07:00
7a77d1c5c2 [FX] Only copy over forward() from exec (#44006)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44006

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23466542

Pulled By: jamesr66a

fbshipit-source-id: 12a1839ddc65333e3e3d511eeb53206f06546a87
2020-09-02 15:35:49 -07:00
402e9953df [pytorch][bot] update mobile op deps (#44018)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44018

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23470528

Pulled By: ljk53

fbshipit-source-id: b677e1c5677fc8929713ee108df69098502c50ea
2020-09-02 14:34:33 -07:00
297c938729 Add _foreach_add(TensorList tl1, TensorList tl2) and _foreach_add_(TensorList tl1, TensorList tl2) APIs (#42533)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42533

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors: launching a lot of kernels slows down the whole process. We need to reduce the number of kernels that we launch.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model version and Apex's version with both its original Adam optimizer and its FusedAdam optimizer.

**Current API restrictions**
- Lists can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What the 'Fast' and 'Slow' routes are**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can fall back to a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via the 'fast' or the 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.

----------------
**In this PR**
- Adding a `_foreach_add(TensorList tl1, TensorList tl2)` API
- Adding a `_foreach_add_(TensorList tl1, TensorList tl2)` API
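
A minimal usage sketch of the two new APIs (assuming matching dtypes, devices and sizes, per the restrictions above):
```python
import torch

xs = [torch.ones(2, 2) for _ in range(3)]
ys = [torch.full((2, 2), 2.0) for _ in range(3)]

res = torch._foreach_add(xs, ys)  # out-of-place: a list of new tensors, each == 3
torch._foreach_add_(xs, ys)       # in-place: every tensor in xs is now == 3
```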

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists

**Plan for the next PRs**
1. APIs
- Binary Ops for list with Scalar
- Binary Ops for list with list
- Unary Ops for list
- Pointwise Ops

2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23331894

Pulled By: izdeby

fbshipit-source-id: 876dd1bc82750f609b9e3ba23c8cad94d8d6041c
2020-09-02 12:18:28 -07:00
f6f9d22228 [ONNX] Export KLDivLoss (#41858)
Summary:
Enable export for KLDivLoss

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41858

Reviewed By: mrshenli

Differential Revision: D22918004

Pulled By: bzinodev

fbshipit-source-id: e3debf77a4cf0eae0df6ed5a72ee91c43e482b62
2020-09-02 11:45:13 -07:00
4716284904 Update persons_of_interest.rst (#44031)
Summary:
Adding Geeta to the POI for TorchServe

cc chauhang


Pull Request resolved: https://github.com/pytorch/pytorch/pull/44031

Reviewed By: jspisak

Differential Revision: D23476439

Pulled By: soumith

fbshipit-source-id: 6936d46c201e1437143d85e1dce24da355857628
2020-09-02 10:56:27 -07:00
b167402e2e [redo] Fix SyncBatchNorm forward pass for non-default process group (#43861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43861

This is a redo of https://github.com/pytorch/pytorch/pull/38874, and
fixing my original bug from
https://github.com/pytorch/pytorch/pull/38246.

Test Plan:
CI

Imported from OSS

Reviewed By: supriyar

Differential Revision: D23418816

fbshipit-source-id: 2a3a3d67fc2d03bb0bf30a87cce4e805ac8839fb
2020-09-02 10:44:46 -07:00
544a56ef69 [JIT] Always map node output in vmap (#43988)
Summary:
Previously, when merging a node without a subgraph, we would map the node's outputs to the corresponding subgraph values, but when merging a node with a subgraph, the node's outputs would be absent from the value mapping. This PR makes it so they are included.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43988

Reviewed By: ZolotukhinM

Differential Revision: D23462116

Pulled By: eellison

fbshipit-source-id: 232c081261e9ae040df0accca34b1b96a5a5af57
2020-09-02 10:30:43 -07:00
276158fd05 .circleci: Remove un-needed steps from binary builds (#43974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43974

We already install devtoolset7 in our docker images for binary builds,
and tclsh shouldn't be needed since we're not relying on unbuffer
anymore.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D23462531

Pulled By: seemethere

fbshipit-source-id: 83cbb8b0782054f0b543dab8d11fa6ac57685272
2020-09-02 09:57:52 -07:00
73f009a2aa refactor manual function definitions (#43711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43711

this makes them available in forward if needed

No change to the file content, just a copy-paste.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23454146

Pulled By: albanD

fbshipit-source-id: 6269a4aaf02ed53870fadf8b769ac960e49af195
2020-09-02 09:23:21 -07:00
a6789074fc Implement ChannelShuffle op with XNNPACK (#43602)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43602

Test Plan: Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D23334952

Pulled By: kimishpatel

fbshipit-source-id: 858ef3db599b1c521ba3a1855c9a3c35fe3b02b0
2020-09-02 09:18:25 -07:00
df8da5cb5a fx quant: make load_arg function more clear (#43923)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43923

Readability improvements to `Quantizer.convert.load_arg`, makes
things easier to read.
1. add docblock
2. `arg` -> `arg_or_args`, to match what's actually happening
3. `loaded_arg` -> `loaded_args`, to match what's actually happening

Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestQuantizeFx
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23438745

fbshipit-source-id: f886b324d2e2e33458b72381499e37dccfc3bd30
2020-09-02 09:06:05 -07:00
77ef77e5fa fx quant: rename matches -> is_match (#43914)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43914

Renames the `matches` function to `is_match`, since there is also
a list named `matches` that we pass around in `Quantizer`,
and it would be good to reduce name conflicts.

Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23435601

fbshipit-source-id: 394af11e0120cfb07dedc79d5219247330d4dfd6
2020-09-02 09:06:01 -07:00
6f5282adc8 add quantization debug util to pretty print FX graphs (#43910)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43910

Adds a debug function to get a representation of all nodes in the
graph, such as

```
name          op      target         args               kwargs
x             plchdr  x              ()                 {}
linear_weight gt_prm  linear.weight  ()                 {}
add_1         cl_fun  <bi_fun add>   (x, linear_weight) {}
linear_1      cl_mod  linear         (add_1,)           {}
relu_1        cl_meth relu           (linear_1,)        {}
sum_1         cl_fun  <bi_meth sum>  (relu_1,)          {'dim': -1}
topk_1        cl_fun  <bi_meth topk> (sum_1, 3)         {}
```

using only the Python standard library. This is useful for printing the
internal state of graphs when working on FX code.

Has some on-by-default logic to shorten things so that node reprs for
toy models and unit tests fit into 80 chars.

Flexible on function name and location; I care more that this is
accessible both from inside PT and from debug scripts that
are not checked in.

Test Plan:
see
https://gist.github.com/vkuzo/ed0a50e5d6dc7442668b03bb417bd603 for
example usage

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23435029

fbshipit-source-id: 1a2df797156a19cedd705e9e700ba7098b5a1376
2020-09-02 09:04:44 -07:00
b6b5ebc345 Add torch.vdot (#43004)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42747
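
A short usage sketch of the new op (it mirrors `numpy.vdot`: the first argument is conjugated):
```python
import torch

a = torch.tensor([1 + 2j, 3 - 1j])
b = torch.tensor([2 - 1j, 1 + 1j])

torch.vdot(a, b)  # equivalent to (a.conj() * b).sum()
```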

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43004

Reviewed By: mruberry

Differential Revision: D23318935

Pulled By: anjali411

fbshipit-source-id: 12d4824b7cb42bb9ca703172c54ec5c663d9e325
2020-09-02 09:00:30 -07:00
14ebb2c67c Allow no-bias MKLDNN Linear call (#43703)
Summary:
MKLDNN linear incorrectly assumes that bias is defined and will fail for no-bias calls.
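
A minimal sketch of the previously failing pattern, using the MKLDNN module conversion utility (requires an MKLDNN-enabled build):
```python
import torch
from torch.utils import mkldnn as mkldnn_utils

lin = torch.nn.Linear(4, 4, bias=False).eval()
lin_mkldnn = mkldnn_utils.to_mkldnn(lin)  # previously assumed bias was defined

x = torch.randn(2, 4).to_mkldnn()
y = lin_mkldnn(x).to_dense()
```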

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43703

Reviewed By: glaringlee

Differential Revision: D23373182

Pulled By: bwasti

fbshipit-source-id: 1e817674838a07d237c02eebe235c386cf5b191e
2020-09-02 08:54:50 -07:00
c88ac25679 Check for internal memory overlap in some indexing-type functions (#43423)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43423

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23298652

Pulled By: zou3519

fbshipit-source-id: c13c59aec0c6967ef0d6365d782c1f4c98c04227
2020-09-02 08:51:50 -07:00
5807bb92d3 TensorIteratorConfig: Check memory overlap by default (#43422)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43422

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D23298653

Pulled By: zou3519

fbshipit-source-id: a7b66a8a828f4b35e31e8be0c07e7fe9339181f2
2020-09-02 08:50:29 -07:00
cd58114c6c Adjust level of verbosity of debug dumps in graph executor T74227880 (#43682)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43682

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23397980

Pulled By: Lilyjjo

fbshipit-source-id: b0114efbd63b2a29eb14086b0a8963880023c2a8
2020-09-02 08:45:16 -07:00
8722952dbd Add benchmark for channel_shuffle operator (#43509)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43509

Test Plan: Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D23299972

Pulled By: kimishpatel

fbshipit-source-id: 6189d209859da5a41067eb9e8317e3bf7a0fc754
2020-09-02 08:15:19 -07:00
6512032699 [Static Runtime] Add OSS build for static runtime benchmarks (#43881)
Summary:
Adds CMake option.  Build with:

```
BUILD_STATIC_RUNTIME_BENCHMARK=ON python setup.py install
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43881

Reviewed By: hlu1

Differential Revision: D23430708

Pulled By: bwasti

fbshipit-source-id: a39bf54e8d4d044a4a3e4273a5b9a887daa033ec
2020-09-02 08:00:18 -07:00
c61a16b237 Kill dead code in common_nn as part of merging Criterion and NewCriterionTests. (#43956)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43956

See https://github.com/pytorch/pytorch/pull/43769 and https://github.com/pytorch/pytorch/pull/43776 for proof this code is dead.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23452217

Pulled By: gchanan

fbshipit-source-id: 6850aab2daaa1c321a6b7714f6f113f364f41973
2020-09-02 07:54:05 -07:00
95f912ab13 Use NewCriterionTest in test_cpp_api_parity.py. (#43954)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43954

CriterionTest is basically dead -- see https://github.com/pytorch/pytorch/pull/43769 and https://github.com/pytorch/pytorch/pull/43776.

The only exception is the cpp parity test, but the difference there doesn't actually have any effect -- get_target has unpack=True, but none of the examples require unpacking (I checked).

As a pre-requisite for merging these tests, have the cpp parity test start using the NewCriterionTest.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23452144

Pulled By: gchanan

fbshipit-source-id: 5dca1eb0878b882c93431d3b0e880b5bb1764522
2020-09-02 07:53:03 -07:00
4bb5d33076 is_numpy_scalar should also consider bool and complex types (#43644)
Summary:
Before this PR,

```python
import torch
import numpy as np

a = torch.tensor([1, 2], dtype=torch.bool)
c = np.array([1, 2], dtype=np.bool)
print(a[0] == c[0])

a = torch.tensor([1, 2], dtype=torch.complex64)
c = np.array([1, 2], dtype=np.complex64)
print(a[0] == c[0])

 # This case is still broken
a = torch.tensor([1 + 1j, 2 + 2j], dtype=torch.complex64)
c = np.array([1 + 1j, 2 + 2j], dtype=np.complex64)
print(a[0] == c[0])
```

outputs

```
False
False
False
```

After this PR, it outputs:

```
tensor(True)
/home/user/src/pytorch/torch/tensor.py:25: ComplexWarning: Casting complex values to real discards the imaginary part return f(*args, **kwargs)
tensor(True)
tensor(False)
```

Related issue: https://github.com/pytorch/pytorch/issues/43579

cc anjali411 mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43644

Reviewed By: ailzhang

Differential Revision: D23425569

Pulled By: anjali411

fbshipit-source-id: a868209376b30cea601295e54015c47803923054
2020-09-02 07:41:50 -07:00
7000c2efb5 [2/2][PyTorch][Mobile] Added mobile module metadata logging (#43853)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43853

Add QPL logging for mobile module's metadata
ghstack-source-id: 111113492

(Note: this ignores all push blocking failures!)

Test Plan:
- CI

- Load the model trained by `mobile_model_util.py`

- Local QPL logger standard output.
{F319012106}

Reviewed By: xcheng16

Differential Revision: D23417304

fbshipit-source-id: 7bc834f39e616be1eccfae698b3bccdf2f7146e5
2020-09-01 22:27:10 -07:00
1dd658f28f [Codemod][GleanFbcode] Remove dead includes in caffe2/test (#43953)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43953

Reviewed By: malfet

Differential Revision: D23445556

fbshipit-source-id: 89cd6833aa06f35c5d3c99d698abb08cd61ae4ab
2020-09-01 21:48:28 -07:00
c259146477 add missing NEON {vld1,vst1}_*_x2 intrinsics (#43683)
Summary:
Workaround for issue https://github.com/pytorch/pytorch/issues/43265.
Add the missing intrinsics until gcc-7 gets the missing patches backported.

Fixes https://github.com/pytorch/pytorch/issues/43265.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43683

Reviewed By: albanD

Differential Revision: D23467867

Pulled By: malfet

fbshipit-source-id: 7c138dd3de3c45852a60f2cfe8b4d7f7cf76bc7e
2020-09-01 21:19:39 -07:00
137a4fcc3b Back out "Selective meta programming preparation for prim ops"
Summary:
The diff D22618309 (bacee6aa2e) breaks CYA ACP e2e tests. (https://www.internalfb.com/intern/ods/chart/?rapido=%7B%22queries%22%3A[%7B%22entity%22%3A%22regex(assistant%5C%5C.cya%5C%5C..*acp.*)%2C%5Cn%2C%20!regex(assistant%5C%5C.cya%5C%5C..*fair.*)%2C%22%2C%22key%22%3A%22overview.pct_passed_x_1000%2C%22%2C%22transform%22%3A%22formula(%2F%20%241%201000.0)%2C%22%2C%22reduce_keys%22%3Atrue%2C%22datatypes%22%3A[%22raw%22]%2C%22reduce%22%3A%22%22%2C%22id%22%3A%22ds1%22%2C%22source%22%3A%22ods%22%2C%22active%22%3Atrue%7D]%2C%22period%22%3A%7B%22minutes_back%22%3A720%2C%22time_type%22%3A%22dynamic%22%7D%7D&view=%7B%22type%22%3A%22line_chart_client%22%2C%22params%22%3A%7B%22title%22%3A%22Pass%20Rates%20of%20All%20Continuous%20Runs%20in%20PROD%22%2C%22haspoints%22%3Afalse%2C%22state%22%3A%22published%22%2C%22title_use_v2%22%3Atrue%2C%22tooltip_outside%22%3Atrue%2C%22series_names_preg_replace_list%22%3A[%7B%22series_name_preg_replace_list_group%22%3Anull%2C%22pattern%22%3A%22%2Fassistant%5C%5C.cya%5C%5C.(%5C%5Cw%2B)%5C%5C.([%5E%3A]%2B)%3A%3A.*%2F%22%2C%22replacement%22%3A%22%241%2F%242%22%7D]%2C%22sort_by_series_name%22%3A%22ASC%22%2C%22use_y_axis_hints_as_limits%22%3Atrue%7D%7D&version=2)

So I back out the diff.

Test Plan:
```
cya test -n aloha.acp.arv2.prod --tp ~/tmp/cyaTests/assistant/cya/aloha_acp/whatsapp_call_who_ondevice_oacr.yaml --device_no_new_conn --retries 0
Installing: finished in 13.4 sec
More details at https://www.internalfb.com/intern/buck/build/c48882e8-1032-43ca-ba8f-8
Running "aloha.acp.arv2.prod (acp)" [1 tests] with endpoint "https://prod.facebookvirtualassistant.com"
.
  %100.0 tests passed:  1/1
  Avg turn duration:    12.6s
  P99 turn duration:    24.4s
  CTP report:  https://our.intern.facebook.com/intern/testinfra/testrun/2814749804232321

[jaeholee@32384.od ~/fbsource (7934576f)]$
```

Differential Revision: D23464555

fbshipit-source-id: b2c712a512a207c4813585f4ee57fdb5607317c6
2020-09-01 21:05:45 -07:00
263412e536 Rename is_complex_t -> is_complex (#39906)
Summary:
`is_complex_t` is a bad name. For example, in std there is `std::is_same` but no `std::is_same_t`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39906

Reviewed By: mrshenli

Differential Revision: D22665013

Pulled By: anjali411

fbshipit-source-id: 4b71745f5e2ea2d8cf5845d95ada4556c87e040d
2020-09-01 21:04:19 -07:00
9db90fe1f3 [TensorExpr] Remove unused functions in kernel.cpp (#43966)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43966

Test Plan: build.

Reviewed By: ZolotukhinM

Differential Revision: D23456660

Pulled By: asuhan

fbshipit-source-id: c13411b61cf62dd5d038e7246f79a8682822b472
2020-09-01 20:25:16 -07:00
8fd9fe93be [quant][graphmode][fx] Support dynamic quantization without calibration (#43952)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43952

Run the weight observer for dynamic quantization before inserting the quant/dequant nodes

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D23452123

fbshipit-source-id: c322808fa8025bbadba36c2e5ab89f59e85de468
2020-09-01 19:09:48 -07:00
fbea2ee917 broadcast_object API for c10d (#43887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43887

As part of addressing #23232, this PR adds support for `broadcast_object_list` which is an API to broadcast arbitrary picklable objects to all the other ranks.  This has been a long-requested feature, so would be good for Pytorch to natively support this.

The implementation follows a similar approach to https://github.com/pytorch/pytorch/pull/42189. The input is a list of objects to be broadcast, and the operation is in place, meaning all ranks in the group will have their input list modified to contain the broadcasted objects from the src rank.

Note that the API is designed to match the tensor-based collectives other than supporting async_op. For now, it is a blocking call. If we see demand to support async_op, we will have to make more progress on merging work/future to support this.
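
A minimal usage sketch (assuming a process group has already been initialized):
```python
import torch.distributed as dist

# Rank 0 supplies the payload; other ranks pass placeholders of equal length.
if dist.get_rank() == 0:
    objects = ["foo", {"answer": 42}, (1, 2)]
else:
    objects = [None, None, None]

dist.broadcast_object_list(objects, src=0)
# Every rank now sees ["foo", {"answer": 42}, (1, 2)] in `objects`.
```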
ghstack-source-id: 111180436

Reviewed By: mrshenli

Differential Revision: D23422577

fbshipit-source-id: fa700abb86eff7128dc29129a0823e83caf4ab0e
2020-09-01 18:54:17 -07:00
4134b7abfa Pass CC env variable as ccbin argument to nvcc (#43931)
Summary:
This is the common behavior when one builds PyTorch (or any other CUDA project) using CMake, so it should hold true for Torch CUDA extensions as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43931

Reviewed By: ezyang, seemethere

Differential Revision: D23441793

Pulled By: malfet

fbshipit-source-id: 1af392107a94840331014fda970ef640dc094ae4
2020-09-01 17:26:08 -07:00
0ffe3d84d5 [quant][graphmode][fx] Support dynamic quantization without calibration (#43892)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43892

Run the weight observer in the convert function, so users do not need to run calibration

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23429758

fbshipit-source-id: 5bc222e3b731789ff7a86463c449690a58dffb7b
2020-09-01 17:01:48 -07:00
d15b9d980c [quant][graphmode][fx][refactor] Move patterns to separate files (#43891)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43891

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23429759

fbshipit-source-id: f19add96beb7c8bac323ad78f74588ca1393040c
2020-09-01 16:37:33 -07:00
8d53df30ea [FX] Better error when unpacking Proxy (#43740)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43740

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D23380964

Pulled By: jamesr66a

fbshipit-source-id: 9658ef1c50d0f9c4de38781a7485002487f6d3f7
2020-09-01 16:28:50 -07:00
ec7f14943c [OSS] Update README.md -- Explain more complex arguments and functionalities
Summary: Update `README.md` for OSS to explain the usage of `--run`, `--export`, and `--summary`

Test Plan: Test locally.

Reviewed By: malfet

Differential Revision: D23431508

fbshipit-source-id: 368b8dd8cd5099f39c7f5bc985203c417bf7af39
2020-09-01 16:10:33 -07:00
e49dd9fa05 Delete raise_from from torch._six (#43981)
Summary:
No need for a compatibility wrapper in a Python 3+ world

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43981

Reviewed By: seemethere

Differential Revision: D23458325

Pulled By: malfet

fbshipit-source-id: 00f822895625f4867c22376fe558c50316f5974d
2020-09-01 15:46:18 -07:00
5e97f251a8 Enable TF32 support for cuDNN (#40737)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40737
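
For context, a minimal sketch of the backend toggles involved (the matmul flag is the analogous cuBLAS switch; both only take effect on Ampere GPUs):
```python
import torch

torch.backends.cudnn.allow_tf32 = True        # let cuDNN use TF32 for convolutions
torch.backends.cuda.matmul.allow_tf32 = True  # the analogous switch for matmuls
```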

Reviewed By: mruberry

Differential Revision: D22801525

Pulled By: ngimel

fbshipit-source-id: ac7f7e728b4b3e01925337e8c9996f26a6433fd2
2020-09-01 15:34:24 -07:00
93fbbaab2a Update README.md in oss (#43893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43893

Update `README.md` in OSS: provide more examples, starting from the most common use and moving to more specialized uses. Make `README.md` friendlier and more specific.

Test Plan: `README.md` doesn't need test.

Reviewed By: malfet, seemethere

Differential Revision: D23420203

fbshipit-source-id: 1a4c146393fbcaf2893321e7892740edf5d0c248
2020-09-01 14:58:28 -07:00
24eea364f7 Check SparseAdam params are dense on init (#41966) (#43668)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41966

Raises a ValueError if a user attempts to create a SparseAdam optimizer with sparse parameter tensors.
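
A minimal sketch of the new check (the exact error text is illustrative):
```python
import torch

dense = torch.randn(3, 3, requires_grad=True)
torch.optim.SparseAdam([dense])  # OK: dense params (sparse *gradients* are the supported case)

sparse = torch.randn(3, 3).to_sparse().requires_grad_()
torch.optim.SparseAdam([sparse])  # now raises ValueError at construction
```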

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43668

Reviewed By: glaringlee

Differential Revision: D23388109

Pulled By: ranman

fbshipit-source-id: 1fbcc7527d49eac6fae9ce51b3307c609a6ca38b
2020-09-01 14:25:59 -07:00
bacee6aa2e Selective meta programming preparation for prim ops (#43540)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43540

selected_mobile_ops.h, which contains the whitelist of root operators, is generated at BUCK build time. It's used for templated selective build when XPLAT_MOBILE_BUILD is defined.

ghstack-source-id: 111014372

Test Plan: CI and BSB

Reviewed By: ljk53

Differential Revision: D22618309

fbshipit-source-id: ddf813904892f99c3f4ae0cd14ce8b27727be5a2
2020-09-01 13:51:44 -07:00
a1a23669f2 [FX] Pickle serialization of GraphModule via forward source (#43674)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43674

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Differential Revision: D23362396

Pulled By: jamesr66a

fbshipit-source-id: cb8181edff70643b7bbe548cc6b0957328d4eedd
2020-09-01 13:31:18 -07:00
73f7d63bc9 [FX] Support tensor-valued constants (#43666)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43666

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Differential Revision: D23359110

Pulled By: jamesr66a

fbshipit-source-id: 8569a2db0ef081ea7d8e81d7ba26a92bc12ed423
2020-09-01 13:30:04 -07:00
06c277f38e [TVM] Support slice op (#43969)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43969

Reviewed By: yinghai

Differential Revision: D23413340

fbshipit-source-id: 20168bd573b81ce538e3589b72aba9590c3c055e
2020-09-01 12:34:30 -07:00
5472426b9f Reset DataLoader workers instead of creating new ones (#35795)
Summary:
This PR needs discussion as it changes the behavior of `DataLoader`. It can be closed if it's not considered good practice.

Currently, the `DataLoader` spawns a new `_BaseDataLoaderIter` object every epoch.
In the case of the multiprocess DataLoader, the worker processes are re-created every epoch, and each makes a copy of the original `Dataset` object.
If users want to cache data or do some tracking on their datasets, all their data is wiped out every epoch. Notice that this doesn't happen when the number of workers is 0, which makes the multiprocess and serial data loaders inconsistent.

This PR keeps the `_BaseDataLoaderIter` object alive and just resets it between epochs, so the workers remain active, and so do their own `Dataset` objects. People file issues about this often.
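
A sketch of the caching pattern this enables (the dataset class and sizes below are illustrative):

```
import torch
from torch.utils.data import Dataset, DataLoader

class CachingDataset(Dataset):
    def __init__(self):
        self.cache = {}

    def __len__(self):
        return 8

    def __getitem__(self, i):
        if i not in self.cache:  # expensive load, ideally done once per worker
            self.cache[i] = torch.randn(2)
        return self.cache[i]

if __name__ == "__main__":
    loader = DataLoader(CachingDataset(), num_workers=2)
    for epoch in range(2):
        for batch in loader:
            pass
    # Previously each epoch re-created the workers, so their Dataset copies
    # (and self.cache) were discarded; with num_workers=0 the cache survived.
```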

Pull Request resolved: https://github.com/pytorch/pytorch/pull/35795

Reviewed By: ailzhang

Differential Revision: D23426612

Pulled By: VitalyFedyunin

fbshipit-source-id: e16950036bae35548cd0cfa78faa06b6c232a2ea
2020-09-01 11:48:00 -07:00
db6bd9d60b rename input argument interested-folder to interest-only -- be consistent with other arguments (#43889)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43889

1. Rename the input argument `interested-folder` to `interest-only` -- consistent with `run-only` and `coverage-only`, and shorter.

Test Plan: Test on devserver and linux docker.

Reviewed By: malfet

Differential Revision: D23417338

fbshipit-source-id: ce9711e75ca3a1c30801ad6bd1a620f3b06819c5
2020-09-01 11:46:23 -07:00
bc64efae48 Back out "Revert D19987020: [pytorch][PR] Add the sls tensor train op" (#43938)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43938

resubmit

Test Plan: unit test included

Reviewed By: mruberry

Differential Revision: D23443493

fbshipit-source-id: 7b68f8f7d1be58bee2154e9a498b5b6a09d11670
2020-09-01 11:42:12 -07:00
7035cd0f84 Revert D23216393: Support work.result() to get result tensors for allreduce for Gloo, NCCL backends
Test Plan: revert-hammer

Differential Revision:
D23216393 (0b2694cd11)

Original commit changeset: fed5e37fbabb

fbshipit-source-id: 27fbeb1617066fa3f271a681cb089622027d6689
2020-09-01 10:32:38 -07:00
63a0bb0ab9 Add typing annotations for torch.nn.quantized.dynamic.modules.rnn (#43186)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43185

xref: [gh-43072](https://github.com/pytorch/pytorch/issues/43072)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43186

Reviewed By: ezyang

Differential Revision: D23441259

Pulled By: malfet

fbshipit-source-id: 80265ae7f3a70f0087e620969dbd4aa8ca17c317
2020-09-01 10:25:10 -07:00
8ca3913f47 Introduce BUILD_CAFFE2 flag (#43673)
Summary:
Introduce the BUILD_CAFFE2 flag, defaulting to `ON`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43673

Reviewed By: malfet

Differential Revision: D23381035

Pulled By: walterddr

fbshipit-source-id: 1f4582987fa0c4a911f0b18d311c04fdbf8dd8f0
2020-09-01 10:18:23 -07:00
76ca365661 [pytorch][bot] update mobile op deps (#43937)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43937

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23443927

Pulled By: ljk53

fbshipit-source-id: 526ca08dfb5bd32527bff98b243da90dbbf2ea49
2020-09-01 10:07:52 -07:00
e3cb582e05 Error printing extension support for multiline errors (#43807)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43807

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23407457

Pulled By: Lilyjjo

fbshipit-source-id: 05a6a50dc39c00474d9087ef56028a2c183aa53a
2020-09-01 10:02:43 -07:00
224232032c Move Autograd to an alias dispatch key (#43070)
Summary:
This PR moves `DispatchKey::Autograd` to an alias dispatch key mapping to `AutogradCPU, AutogradCUDA, AutogradXLA, AutogradOther, AutogradPrivate*` keys.

A few things are handled in this PR:
- Update alias dispatch key mapping and precompute dispatchTable logic
- Move `Autograd` key from `always_included` set to TensorImpl constructor.
- Update `dummyTensor` constructor to take `requires_grad` as optional argument so that it's closer to the real application in op_registration_test.
- Use `BackendSelect` key for both backend select before and after autograd layer. (1 liner in backend_select codegen)

A few planned followups ordered by priority:
- [cleanup] Update `test_dispatch.py` to include testing `Autograd`.
- [cleanup] Add Math alias key and move catchAll to Math. (to remove 2.2 in `computeDispatchTableEntryWithDebug`)
- [new feature] Add support for Math in native_functions.yaml
- [cleanup] Add iterator like functionality to DispatchKeySet
- [cleanup/large] Only add Autograd backend keys when tensor requires grad. (cc: ljk53 ?)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43070

Reviewed By: ezyang

Differential Revision: D23281535

Pulled By: ailzhang

fbshipit-source-id: 9ad00b17142e9b83304f63cf599f785500f28f71
2020-09-01 09:05:29 -07:00
13a48ac1f3 MaxPool1d without indices optimization (#43745)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43745

This is part of a larger effort to refactor and optimize the pooling code. Previously I started working on MaxPool2d here https://github.com/pytorch/pytorch/pull/43267 but since it uses MaxPool1d as a subroutine, it made more sense to work on 1D first and get it tested and optimized and then move up to 2D and then 3D.

Below are some benchmarking results; the Python script I used follows the results.

## Benchmarking
```
Name (time in us)                            Min                   Max                Mean             StdDev              Median                 IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_googlenet[(3, 2, 0, 1, 0)-new]      79.7659 (1.03)     1,059.6327 (5.32)      90.6280 (1.01)     19.1196 (1.41)      84.2176 (1.01)       2.4289 (1.0)     1079;2818       11.0341 (0.99)       9055           1
test_googlenet[(3, 2, 0, 1, 0)-old]     505.1531 (6.55)       830.8962 (4.17)     563.4763 (6.29)     65.3974 (4.81)     538.3361 (6.43)      80.5371 (33.16)      242;99        1.7747 (0.16)       1742           1
test_googlenet[(3, 2, 0, 1, 1)-new]      80.2949 (1.04)       233.0020 (1.17)      97.6498 (1.09)     19.1228 (1.41)      89.2282 (1.07)      18.5743 (7.65)     1858;741       10.2407 (0.92)       9587           1
test_googlenet[(3, 2, 0, 1, 1)-old]     513.5350 (6.66)       977.4677 (4.91)     594.4559 (6.63)     69.9372 (5.15)     577.9080 (6.90)      79.8218 (32.86)      503;84        1.6822 (0.15)       1675           1
test_googlenet[(3, 2, 1, 1, 0)-new]      77.1061 (1.0)        199.1168 (1.0)       89.6529 (1.0)      13.5864 (1.0)       83.7557 (1.0)        7.5139 (3.09)    1419;1556       11.1541 (1.0)        7434           1
test_googlenet[(3, 2, 1, 1, 0)-old]     543.6055 (7.05)       964.5708 (4.84)     636.9867 (7.11)     84.0732 (6.19)     616.7777 (7.36)     100.4562 (41.36)      434;65        1.5699 (0.14)       1552           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_inception[(3, 2, 0, 1, 0)-new]      84.5827 (1.00)       184.2827 (1.0)       90.5438 (1.01)      9.6324 (1.0)       89.3027 (1.05)      4.5672 (1.03)      637;759       11.0444 (0.99)       6274           1
test_inception[(3, 2, 0, 1, 0)-old]     641.2268 (7.59)     1,704.8977 (9.25)     686.9383 (7.65)     57.2499 (5.94)     682.5905 (8.01)     58.3753 (13.17)       86;21        1.4557 (0.13)        802           1
test_inception[(3, 2, 0, 1, 1)-new]      84.5008 (1.0)      1,093.6335 (5.93)      89.8233 (1.0)      14.0443 (1.46)      85.2682 (1.0)       4.4331 (1.0)      802;1106       11.1330 (1.0)        9190           1
test_inception[(3, 2, 0, 1, 1)-old]     643.7078 (7.62)       851.4188 (4.62)     687.4905 (7.65)     41.1116 (4.27)     685.1386 (8.04)     60.2733 (13.60)      286;14        1.4546 (0.13)       1300           1
test_inception[(3, 2, 1, 1, 0)-new]     106.0739 (1.26)       258.5649 (1.40)     115.3597 (1.28)     17.5436 (1.82)     106.9643 (1.25)      5.5470 (1.25)     894;1402        8.6685 (0.78)       7635           1
test_inception[(3, 2, 1, 1, 0)-old]     651.0504 (7.70)       955.2278 (5.18)     698.0295 (7.77)     45.5097 (4.72)     692.8109 (8.13)     64.6794 (14.59)      145;15        1.4326 (0.13)        909           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_large_batch_size[new]       2.9608 (1.0)        5.1127 (1.0)        3.3096 (1.0)      0.1936 (1.0)        3.3131 (1.0)      0.2093 (1.0)          71;6  302.1515 (1.0)         297           1
test_large_batch_size[old]     130.6583 (44.13)    152.9521 (29.92)    137.1385 (41.44)    7.4352 (38.40)    135.1784 (40.80)    5.1358 (24.53)         1;1    7.2919 (0.02)          7           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_large_channel_size[new]      2.9696 (1.0)       5.5595 (1.0)       3.5997 (1.0)      0.5836 (1.0)       3.3497 (1.0)      0.3445 (1.0)         58;54  277.8014 (1.0)         277           1
test_large_channel_size[old]     19.6838 (6.63)     22.6637 (4.08)     21.1775 (5.88)     0.8610 (1.48)     21.3739 (6.38)     1.4930 (4.33)         13;0   47.2199 (0.17)         36           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_large_width[new]      1.7714 (1.0)       2.4104 (1.0)       1.8988 (1.0)      0.0767 (1.0)       1.8911 (1.0)      0.0885 (1.0)         86;13  526.6454 (1.0)         373           1
test_large_width[old]     19.5708 (11.05)    22.8755 (9.49)     20.7987 (10.95)    0.7009 (9.14)     20.6623 (10.93)    0.8584 (9.70)         14;1   48.0799 (0.09)         46           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_multithreaded[new]      15.0560 (1.0)       24.2891 (1.0)       16.1627 (1.0)      1.5657 (1.0)       15.7182 (1.0)      0.7598 (1.0)           4;6  61.8709 (1.0)          65           1
test_multithreaded[old]     115.7614 (7.69)     120.9670 (4.98)     118.3004 (7.32)     1.6259 (1.04)     118.4164 (7.53)     1.9613 (2.58)          2;0   8.4531 (0.14)          8           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
```

### Benchmarking script
To run the benchmark, make sure you have pytest-benchmark installed (`pip install pytest-benchmark`) and use the following command: `pytest benchmark.py --benchmark-sort='name'`

```
import torch
import pytest

def _test_speedup(benchmark, batches=1, channels=32, width=32,
                  kernel_size=2, stride=None, padding=0, dilation=1, ceil_mode=False, return_indices=False):
    torch.set_num_threads(1)
    x = torch.randn((batches, channels, width))
    model = torch.nn.MaxPool1d(kernel_size, stride, padding, dilation, return_indices, ceil_mode)
    benchmark(model, x)

pytest.mark.benchmark(group="inception")
pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
pytest.mark.parametrize("params", [(3, 2), (3, 2, 0, 1, True), (3, 2, 1)],
                         ids=["(3, 2, 0, 1, 0)",
                              "(3, 2, 0, 1, 1)",
                              "(3, 2, 1, 1, 0)"])
def test_inception(benchmark, params, return_indices):
    _test_speedup(benchmark, 10, 64, 147, *params, return_indices=return_indices)

pytest.mark.benchmark(group="googlenet")
pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
pytest.mark.parametrize("params", [(3, 2), (3, 2, 0, 1, True), (3, 2, 1)],
                         ids=["(3, 2, 0, 1, 0)",
                              "(3, 2, 0, 1, 1)",
                              "(3, 2, 1, 1, 0)"])
def test_googlenet(benchmark, params, return_indices):
    _test_speedup(benchmark, 10, 64, 112, *params, return_indices=return_indices)

pytest.mark.benchmark(group="large batch size")
pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_large_batch_size(benchmark, return_indices):
    _test_speedup(benchmark, 100000, 1, 32, return_indices=return_indices)

pytest.mark.benchmark(group="large channel size")
pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_large_channel_size(benchmark, return_indices):
    _test_speedup(benchmark, 1, 100000, 32, return_indices=return_indices)

pytest.mark.benchmark(group="large width")
pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_large_width(benchmark, return_indices):
    _test_speedup(benchmark, 1, 32, 100000, return_indices=return_indices)

pytest.mark.benchmark(group="multithreading")
pytest.mark.parametrize("return_indices", [True, False], ids=["old", "new"])
def test_multithreaded(benchmark, return_indices):
    x = torch.randn((40, 10000, 32))
    model = torch.nn.MaxPool1d(2, return_indices=return_indices)
    benchmark(model, x)
```

## Discussion

The new algorithm is on average 7x faster than the old one. Moreover, because the old algorithm had many issues with how it parallelized the code and used the cache, one can find input parameters (like a large batch size) for which the new algorithm is far faster still than the original.

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D23425348

Pulled By: heitorschueroff

fbshipit-source-id: 3fa3f9b8e71200da48424a95510124a83f50d7b2
2020-09-01 08:40:01 -07:00
a044c039c0 updated documentation to streamline setup (#42850)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42850

Reviewed By: mrshenli

Differential Revision: D23449055

Pulled By: osandoval-fb

fbshipit-source-id: 6db695d4fe5f6d9b7bb2895c85c855db4779516b
2020-09-01 08:25:48 -07:00
b1f19c20d6 Run function check and out check in TestTensorDeviceOps (#43830)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43830

Reviewed By: ailzhang

Differential Revision: D23438101

Pulled By: mruberry

fbshipit-source-id: b581ce779ea2f50ea8dfec51d5469031ec7a0a67
2020-09-01 08:21:53 -07:00
9b98bcecfa torch.cat and torch.stack batching rules (#43798)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43798

These are relatively straightforward.

Test Plan: - `pytest test/test_vmap.py -v`

Reviewed By: ezyang

Differential Revision: D23405000

Pulled By: zou3519

fbshipit-source-id: 65c78da3dee43652636bdb0a65b636fca69e765d
2020-09-01 08:12:46 -07:00
dbc4218f11 Batching rules for: torch.bmm, torch.dot (#43781)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43781

Test Plan: - `pytest test/test_vmap.py -v`

Reviewed By: ezyang

Differential Revision: D23400843

Pulled By: zou3519

fbshipit-source-id: a901bba6dc2d8435d314cb4dac85bbd5cd4ee2a5
2020-09-01 08:12:43 -07:00
fa12e225d3 Batching rule for torch.mv (#43780)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43780

The general strategy is:
- unsqueeze the physical inputs enough
- pass the unsqueezed physical inputs to at::matmul
- squeeze any extra dimensions
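
A rough Python-level sketch of this strategy (illustrative only; the actual batching rule is implemented in C++):

```
import torch

B, N, M = 4, 3, 5
mats = torch.randn(B, N, M)  # batched matrix operand
vecs = torch.randn(B, M)     # batched vector operand

# unsqueeze -> matmul -> squeeze
out = torch.matmul(mats, vecs.unsqueeze(-1)).squeeze(-1)
expected = torch.stack([torch.mv(m, v) for m, v in zip(mats, vecs)])
assert torch.allclose(out, expected)
```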

Test Plan: - `pytest test/test_vmap.py -v`

Reviewed By: ezyang

Differential Revision: D23400842

Pulled By: zou3519

fbshipit-source-id: c550eeb935747c08e3b083609ed307a4374b9096
2020-09-01 08:12:41 -07:00
2789a4023b TestVmapOperators: add structured tests that batching rules get invoked (#43731)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43731

After this PR, TestVmapOperators checks that each of its tests never
invokes the slow vmap fallback path. The
rationale behind this change is that TestVmapOperators is used for
testing batching rules and we want confidence that the batching rules
actually get invoked.

We set this up using a similar mechanism to the CUDA memory leak check:
(bff741a849/torch/testing/_internal/common_utils.py (L506-L511))

This PR also implements the batching rule for `to.dtype_layout`; the new
testing caught that we were testing vmap on `to.dtype_layout` but it
didn't actually have a batching rule implemented!

Test Plan: - New tests in `pytest test/test_vmap.py -v` that test the mechanism.

Reviewed By: ezyang

Differential Revision: D23380729

Pulled By: zou3519

fbshipit-source-id: 6a4b97a7fa7b4e1c5be6ad80d6761e0d5b97bb8c
2020-09-01 08:11:35 -07:00
0b2694cd11 Support work.result() to get result tensors for allreduce for Gloo, NCCL backends (#43386)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43386

Resolves #43178
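
A minimal sketch of the added API (assumes an initialized process group; setup elided):

```
import torch
import torch.distributed as dist

# after dist.init_process_group(...):
t = torch.ones(2)
work = dist.all_reduce(t, async_op=True)
work.wait()
print(work.result())  # per this PR: the result tensor(s) of the allreduce
```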

ghstack-source-id: 111109716

Test Plan: Added checks to existing unit test and ran it on gpu devserver.

Reviewed By: rohan-varma

Differential Revision: D23216393

fbshipit-source-id: fed5e37fbabbd2ac4a9055b20057fffe3c416c0b
2020-09-01 08:05:55 -07:00
a67246b2d4 Add reduction string test for ctc_loss. (#43884)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43884

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23427907

Pulled By: gchanan

fbshipit-source-id: 889bd92e9d3e0528b57e3952fc83e25bc7abe293
2020-09-01 07:01:54 -07:00
fab012aa28 Revert "Added support for Huber Loss (#37599)" (#43351)
Summary:
This reverts commit 11e5174926d807a540fc7b54fb45a26ec0c5d9c0 due to [comment](https://github.com/pytorch/pytorch/pull/37599#pullrequestreview-471950192).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43351

Reviewed By: pbelevich, seemethere

Differential Revision: D23249511

Pulled By: vincentqb

fbshipit-source-id: 18b8b346f00eaf0ef7376b06579d404a84add4de
2020-09-01 06:34:26 -07:00
c14a3613a8 Fix NaN propagation in TE fuser's min/max implementation (#43609)
Summary:
Per the eager-mode source of truth, NaNs shall be propagated by min/max.
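
For reference, a small eager-mode example of the behavior the fuser must match:

```
import torch

a = torch.tensor([1.0, float("nan")])
b = torch.tensor([2.0, 2.0])
print(torch.min(a, b))  # tensor([1., nan]) -- NaN propagates
print(torch.max(a, b))  # tensor([2., nan])
```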

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43609

Reviewed By: ZolotukhinM

Differential Revision: D23349184

Pulled By: bertmaher

fbshipit-source-id: 094eb8b89a02b27d5ecf3988d0f473c0f91e4afb
2020-09-01 02:10:13 -07:00
820c4b05a9 [ONNX] Update slice symbolic function (#42935)
Summary:
During scripting, the combination of shape (or size()) and slice (e.g. x.shape[2:]) produces the following error:
 slice() missing 1 required positional argument: 'step'
This happens because aten::slice has two signatures:

- aten::slice(Tensor self, int dim, int start, int end, int step) -> Tensor
- aten::slice(t[] l, int start, int end, int step) -> t[]

and when a list is passed instead of a tensor, the second of the two signatures is called; since it has 4 instead of 5 arguments, it produced the above exception.
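
A minimal sketch of the failing pattern (a hypothetical repro, not taken from the PR):

```
import torch

@torch.jit.script
def f(x):
    # shape + slice: x.shape[2:] hits the list overload of aten::slice
    return x.new_zeros(x.shape[2:])

# Exporting a scripted graph containing this pattern previously failed with:
#   slice() missing 1 required positional argument: 'step'
```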

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42935

Reviewed By: houseroad

Differential Revision: D23398435

Pulled By: bzinodev

fbshipit-source-id: 4151a8f878c520cea199b265973fb476b17801fe
2020-09-01 02:08:48 -07:00
f1624b82b5 Preserve python backtrace in autograd engine errors. (#43684)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43684

This PR attempts to address #42560 by capturing the appropriate
exception_ptr in the autograd engine and passing it over to the Future.

As part of this change, there is a significant change to the Future API: we
now only accept an exception_ptr as part of setError.

For the example in #42560, the exception trace would now look like:

```
> Traceback (most recent call last):
>   File "test_autograd.py", line 6914, in test_preserve_backtrace
>     Foo.apply(t).sum().backward()
>   File "torch/tensor.py", line 214, in backward
>     torch.autograd.backward(self, gradient, retain_graph, create_graph)
>   File "torch/autograd/__init__.py", line 127, in backward
>     allow_unreachable=True)  # allow_unreachable flag
>   File "torch/autograd/function.py", line 87, in apply
>     return self._forward_cls.backward(self, *args)
>   File "test_autograd.py", line 6910, in backward
>     raise ValueError("something")
> ValueError: something
```
ghstack-source-id: 111109637

Test Plan: waitforbuildbot

Reviewed By: albanD

Differential Revision: D23365408

fbshipit-source-id: 1470c4776ec8053ea92a6ee1663460a3bae6edc5
2020-09-01 01:28:47 -07:00
825c109eb7 [reland][quant][graphmode][fx] Add support for weight prepack folding (#43728) (#43902)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43902

Trace back from the weight node until we hit getattr, reconstruct the graph module with the traced nodes,
and run the graph module to pack the weight. Then replace the original chain of ops with the packed weight.

Test Plan:
Imported from OSS

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23432431

fbshipit-source-id: 657f21a8287494f7f87687a9d618ca46376d3aa3
2020-09-01 00:26:19 -07:00
6da26cf0d9 Update torch.range warning message regarding the removal version number (#43569)
Summary:
`torch.range` still hasn't been removed, long after version 0.5. This PR fixes the warning message; alternatively, we could remove `torch.range`.
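
For context, the deprecated call and its replacement:

```
import torch

torch.range(0, 3)   # warns that torch.range is deprecated; returns tensor([0., 1., 2., 3.])
torch.arange(0, 4)  # preferred half-open equivalent: tensor([0, 1, 2, 3])
```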

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43569

Reviewed By: ngimel

Differential Revision: D23408233

Pulled By: mruberry

fbshipit-source-id: 86c4f9f018ea5eddaf80b78a3c54dfa41cfc6fa6
2020-08-31 22:23:32 -07:00
85d91a3230 [TensorExpr] Check statements in test_kernel.cpp (#43911)
Summary:
Check statements and fix all the warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43911

Test Plan: test_tensorexpr

Reviewed By: ZolotukhinM

Differential Revision: D23441092

Pulled By: asuhan

fbshipit-source-id: f671eef4b4eb9b51acb15054131152ae650fedbd
2020-08-31 22:16:25 -07:00
f229d2c07b Revert D23335106: [quant][graphmode][fix] Fix insert quant dequant for observers without qparams
Test Plan: revert-hammer

Differential Revision:
D23335106 (602209751e)

Original commit changeset: 84af2884d521

fbshipit-source-id: 8d227fe2048b532016407d8ecfbaa6ffd1c313fd
2020-08-31 22:12:37 -07:00
69080e9e7e simplify profile text output by displaying only top-level ops statistics (#42262)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42262

Test Plan:
Imported from OSS
```
==================================================================================================================================================================================
TEST
-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
Name                           Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Input Shapes
-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
aten::add_                     3.61%            462.489us        3.61%            462.489us        462.489us        1                [[3, 20], [3, 20], []]
aten::slice                    1.95%            249.571us        1.95%            250.018us        250.018us        1                [[3, 80], [], [], [], []]
aten::lstm                     1.89%            242.534us        22.41%           2.872ms          2.872ms          1                [[5, 3, 10], [], [], [], [], [], [], [], []]
aten::lstm                     1.68%            215.852us        18.18%           2.330ms          2.330ms          1                [[5, 3, 10], [], [], [], [], [], [], [], []]
aten::lstm                     1.68%            215.767us        18.49%           2.370ms          2.370ms          1                [[5, 3, 10], [], [], [], [], [], [], [], []]
aten::lstm                     1.60%            205.014us        20.15%           2.582ms          2.582ms          1                [[5, 3, 10], [], [], [], [], [], [], [], []]
aten::lstm                     1.55%            198.213us        18.53%           2.375ms          2.375ms          1                [[5, 3, 10], [], [], [], [], [], [], [], []]
aten::addmm                    0.95%            122.359us        1.01%            129.857us        129.857us        1                [[80], [3, 20], [20, 80], [], []]
aten::stack                    0.29%            36.745us         0.63%            80.179us         80.179us         1                [[], []]
aten::add_                     0.28%            35.694us         0.28%            35.694us         35.694us         1                [[3, 20], [3, 20], []]
-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
Self CPU time total: 12.817ms

-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
Name                           Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Input Shapes
-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
aten::mul                      11.45%           1.467ms          12.88%           1.651ms          11.006us         150              [[3, 20], [3, 20]]
aten::lstm                     8.41%            1.077ms          97.76%           12.529ms         2.506ms          5                [[5, 3, 10], [], [], [], [], [], [], [], []]
aten::addmm                    7.65%            979.982us        11.38%           1.459ms          29.182us         50               [[80], [3, 20], [20, 80], [], []]
aten::sigmoid_                 6.78%            869.295us        9.74%            1.249ms          8.327us          150              [[3, 20]]
aten::add_                     5.82%            745.801us        5.82%            745.801us        14.916us         50               [[3, 20], [3, 20], []]
aten::slice                    5.58%            715.532us        6.61%            847.445us        4.237us          200              [[3, 80], [], [], [], []]
aten::unsafe_split             4.24%            544.015us        13.25%           1.698ms          33.957us         50               [[3, 80], [], []]
aten::tanh                     3.11%            398.881us        6.05%            775.024us        15.500us         50               [[3, 20]]
aten::empty                    3.04%            389.055us        3.04%            389.055us        1.319us          295              [[], [], [], [], [], []]
aten::sigmoid                  2.96%            379.686us        2.96%            379.686us        2.531us          150              [[3, 20], [3, 20]]
-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
Self CPU time total: 12.817ms

==================================================================================================================================================================================
TEST
==================================================================================================================================================================================
This report only display top-level ops statistics
-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
Name                           Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Input Shapes
-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
aten::lstm                     1.89%            242.534us        22.41%           2.872ms          2.872ms          1                [[5, 3, 10], [], [], [], [], [], [], [], []]
aten::lstm                     1.68%            215.852us        18.18%           2.330ms          2.330ms          1                [[5, 3, 10], [], [], [], [], [], [], [], []]
aten::lstm                     1.68%            215.767us        18.49%           2.370ms          2.370ms          1                [[5, 3, 10], [], [], [], [], [], [], [], []]
aten::lstm                     1.60%            205.014us        20.15%           2.582ms          2.582ms          1                [[5, 3, 10], [], [], [], [], [], [], [], []]
aten::lstm                     1.55%            198.213us        18.53%           2.375ms          2.375ms          1                [[5, 3, 10], [], [], [], [], [], [], [], []]
-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
Self CPU time total: 12.817ms

==================================================================================================================================================================================
This report only display top-level ops statistics
-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
Name                           Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Input Shapes
-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
aten::lstm                     8.41%            1.077ms          97.76%           12.529ms         2.506ms          5                [[5, 3, 10], [], [], [], [], [], [], [], []]
-----------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------
Self CPU time total: 12.817ms

Total time based on python measurements:  13.206ms
CPU time measurement python side overhead: 3.03%
```

Reviewed By: ilia-cher

Differential Revision: D22830328

Pulled By: ilia-cher

fbshipit-source-id: c9a71be7b23a8f84784117c788faa43caa96f545
2020-08-31 21:41:40 -07:00
d7ee84c9b5 Update determinism documentation (#41692)
Summary:
Add user-facing documentation for set_deterministic
Also update grammar and readability on the Reproducibility page
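
A short usage sketch of the documented API (assuming the 1.7-era `torch.set_deterministic` entry point):

```
import torch

torch.manual_seed(0)           # seeding, per the Reproducibility notes
torch.set_deterministic(True)  # error out when a known-nondeterministic op is used
```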

Issue https://github.com/pytorch/pytorch/issues/15359

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41692

Reviewed By: ailzhang

Differential Revision: D23433061

Pulled By: mruberry

fbshipit-source-id: 4c4552950803c2aaf80f7bb4792d2095706d07cf
2020-08-31 21:06:24 -07:00
69fbc705d8 Remaining changes of #43578 (#43921)
Summary:
Not all of https://github.com/pytorch/pytorch/issues/43578 was merged. This PR contains the remaining part.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43921

Reviewed By: ailzhang

Differential Revision: D23438504

Pulled By: mruberry

fbshipit-source-id: 9c5e26346dfc423b7a440b8a986420a27349090f
2020-08-31 20:42:07 -07:00
3c2f6d2ecf [caffe2] Extend dedup SparseAdagrad fusion with stochastic rounding FP16 (#43124)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43124

Add the stochastic rounding FP16 support for dedup version of SparseAdagrad fusion.
ghstack-source-id: 111037723

Test Plan:
```
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_sum_gradient \(caffe2\.caffe2\.fb\.net_transforms\.tests\.fuse_sparse_ops_test\.TestFuseSparseOps\)' --print-passing-details
```

https://our.intern.facebook.com/intern/testinfra/testrun/5629499566042000

```
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_mean_gradient \(caffe2\.caffe2\.fb\.net_transforms\.tests\.fuse_sparse_ops_test\.TestFuseSparseOps\)' --print-passing-details
```

https://our.intern.facebook.com/intern/testinfra/testrun/1125900076333177

Reviewed By: xianjiec

Differential Revision: D22893851

fbshipit-source-id: 81c7a7fe4b0d2de0e6b4fc965c5d23210213c46c
2020-08-31 20:35:22 -07:00
f17d7a5556 Fix exception chaining in torch/ (#43836)
Summary:
## Motivation
Fixes https://github.com/pytorch/pytorch/issues/43770.

## Description of the change
This PR fixes exception chaining only in files under `torch/` where appropriate.
To fix exception chaining, I used either:
1. `raise new_exception from old_exception` where `new_exception` alone is not descriptive enough for debugging, or `old_exception` carries valuable information.
2. `raise new_exception from None` where raising both `new_exception` and `old_exception` would be noisy and redundant.
I subjectively chose which one to use from the above options.
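
As a sketch of the two patterns (function names hypothetical):

```
def load_config(path):
    try:
        with open(path) as f:
            return f.read()
    except OSError as e:
        # option 1: the old exception carries valuable context, so chain it
        raise RuntimeError("could not load config from " + path) from e

def require_numpy():
    try:
        import numpy
        return numpy
    except ImportError:
        # option 2: chaining would be noisy and redundant, so suppress it
        raise RuntimeError("numpy is required for this feature") from None
```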

## List of lines containing raise in except clause:
I wrote [this simple script](https://gist.github.com/akihironitta/4223c1b32404b36c1b349d70c4c93b4d) using [ast](https://docs.python.org/3.8/library/ast.html#module-ast) to list the lines that `raise` inside an `except` clause.

- [x] 000739c31a/torch/jit/annotations.py (L35)
- [x] 000739c31a/torch/jit/annotations.py (L150)
- [x] 000739c31a/torch/jit/annotations.py (L158)
- [x] 000739c31a/torch/jit/annotations.py (L231)
- [x] 000739c31a/torch/jit/_trace.py (L432)
- [x] 000739c31a/torch/nn/utils/prune.py (L192)
- [x] 000739c31a/torch/cuda/nvtx.py (L7)
- [x] 000739c31a/torch/utils/cpp_extension.py (L1537)
- [x] 000739c31a/torch/utils/tensorboard/_pytorch_graph.py (L292)
- [x] 000739c31a/torch/utils/data/dataloader.py (L835)
- [x] 000739c31a/torch/utils/data/dataloader.py (L849)
- [x] 000739c31a/torch/utils/data/dataloader.py (L856)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L186)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L189)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L424)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L1279)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L1283)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L1356)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L1388)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L1391)
- [ ] 000739c31a/torch/testing/_internal/common_utils.py (L1412)
- [x] 000739c31a/torch/testing/_internal/codegen/random_topo_test.py (L310)
- [x] 000739c31a/torch/testing/_internal/codegen/random_topo_test.py (L329)
- [x] 000739c31a/torch/testing/_internal/codegen/random_topo_test.py (L332)
- [x] 000739c31a/torch/testing/_internal/jit_utils.py (L183)
- [x] 000739c31a/torch/testing/_internal/common_nn.py (L4789)
- [x] 000739c31a/torch/onnx/utils.py (L367)
- [x] 000739c31a/torch/onnx/utils.py (L659)
- [x] 000739c31a/torch/onnx/utils.py (L892)
- [x] 000739c31a/torch/onnx/utils.py (L897)
- [x] 000739c31a/torch/serialization.py (L108)
- [x] 000739c31a/torch/serialization.py (L754)
- [x] 000739c31a/torch/distributed/rpc/_testing/faulty_agent_backend_registry.py (L76)
- [x] 000739c31a/torch/distributed/rpc/backend_registry.py (L260)
- [x] 000739c31a/torch/distributed/distributed_c10d.py (L184)
- [x] 000739c31a/torch/_utils_internal.py (L57)
- [x] 000739c31a/torch/hub.py (L494)
- [x] 000739c31a/torch/contrib/_tensorboard_vis.py (L16)
- [x] 000739c31a/torch/distributions/lowrank_multivariate_normal.py (L100)
- [x] 000739c31a/torch/distributions/constraint_registry.py (L142)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43836

Reviewed By: ailzhang

Differential Revision: D23431212

Pulled By: malfet

fbshipit-source-id: 5f7f41b391164a5ad0efc06e55cd58c23408a921
2020-08-31 20:26:23 -07:00
da32bf4cc6 Move type annotations for remaining torch.utils stub files inline (#43406)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43406

Reviewed By: mruberry

Differential Revision: D23319736

Pulled By: malfet

fbshipit-source-id: e25fbb49f27aa4893590b022441303d6d98263a9
2020-08-31 18:44:09 -07:00
602209751e [quant][graphmode][fix] Fix insert quant dequant for observers without qparams (#43606)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43606

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23335106

fbshipit-source-id: 84af2884d52118c069fc43a9f166dc336a8a87c8
2020-08-31 18:27:53 -07:00
7db7da7151 [reland][quant][graphmode][fx] Add top level APIs (#43581) (#43901)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43901

Add similar APIs like eager and graph mode on torchscript
- fuse_fx
- quantize_fx (for both post training static and qat)
- quantize_dynamic_fx (for post training dynamic)
- prepare_fx (for both post training static and qat)
- prepare_dynamic_fx (for post training dynamic)
- convert_fx (for all modes)

Test Plan:
Imported from OSS

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23432430

fbshipit-source-id: fc99eb75cbecd6ee7a3aa6c8ec71cd499ff7e3c1
2020-08-31 18:24:26 -07:00
deb5fde51c [TensorExpr] Make KernelSumMultipleAxes much faster (#43905)
Summary:
Reduce input size, skip the dtype conversion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43905

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.KernelSum*

Reviewed By: ailzhang

Differential Revision: D23433398

Pulled By: asuhan

fbshipit-source-id: 0d95ced3c1382f10595a9e5745bf4bef007cc913
2020-08-31 17:58:43 -07:00
ee53a335c0 [ONNX] Floordiv (#43022)
Summary:
Add export of floordiv op

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43022

Reviewed By: houseroad

Differential Revision: D23398493

Pulled By: bzinodev

fbshipit-source-id: f929a88b3bc0c3867e8fbc4e50afdf0c0c71553d
2020-08-31 17:54:40 -07:00
f73ba88946 Avoid resizing in MinMaxObserver (#43789)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43789

Since it's single element.. In some cases we may not be able to resize the
buffers.

Test Plan: unit tests

Reviewed By: supriyar

Differential Revision: D23393108

fbshipit-source-id: 46cd7f73ed42a05093662213978a01ee726433eb
2020-08-31 17:41:39 -07:00
98b846cd1d [JIT] Remove loop peeling from the profiling executor pipeline. (#43847)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43847

It seems to slowdown two fastRNN benchmarks and does not speed up
others.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23416197

Pulled By: ZolotukhinM

fbshipit-source-id: 598144561979e84bcf6bccf9b0ca786f5af18383
2020-08-31 17:26:55 -07:00
d69d603061 [JIT] Specialize autograd zero: actually remove the original graph after we created its versioned copy. (#43900)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43900

The original code assumed that the versioning `if` was inserted at the
beginning of the graph, while in fact it was inserted at the end. We also
no longer remove `profile_optional` nodes and instead rely on DCE to clean
them up later (the reason we're not removing them is that deletion could
invalidate the insertion point being used).

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23432175

Pulled By: ZolotukhinM

fbshipit-source-id: 1bf55affaa3f17af1bf71bad3ef64edf71a3e3fb
2020-08-31 17:26:51 -07:00
f150f924d3 [JIT] Specialize autograd zero: fix the guarding condition. (#43846)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43846

We are looking for tensors that are expected to be undefined (according
to the profile info) and should check that they satisfy the
following condition: "not(have any non-zero)", which is equivalent to
"tensor is all zeros". The issue was that we had been checking tensors
that were expected *not* to be undefined.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23416198

Pulled By: ZolotukhinM

fbshipit-source-id: 71e22f552680f68f2af29f427b7355df9b1a4278
2020-08-31 17:25:50 -07:00
9b820fe904 Fix ImportError in the OSS land. (#43912)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43912

Fixed the ImportError: cannot import name 'compute_ulp_error' from 'caffe2.python.oss.fakelowp.test_utils'

Test Plan: test_op_nnpi_fp16.py

Reviewed By: hyuen

Differential Revision: D23435218

fbshipit-source-id: be0b240ee62090d06fdc8efac85fb1c32803da0d
2020-08-31 16:48:54 -07:00
7137327646 log message at per-test level forperfpipe_pytorch_test_times (#43752)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43752

Test Plan:
{F315930458}

{F315930459}

Reviewed By: walterddr, malfet

Differential Revision: D23387998

Pulled By: dhuang29

fbshipit-source-id: 2da8b607c049a6f8f21d98dbb25e664ea6229f27
2020-08-31 16:22:44 -07:00
4c19a1e350 Move torch/autograd/grad_mode.pyi stubs inline (#43415)
Summary:
- Add `torch._C` bindings from `torch/csrc/autograd/init.cpp`
- Renamed `torch._C.set_grad_enabled` to `torch._C._set_grad_enabled`
  so it doesn't conflict with torch.set_grad_enabled anymore

This is a continuation of gh-38201. All I did was resolve merge conflicts and finish the annotation of `_DecoratorContextManager.__call__` that ezyang started in the first commit.

~Reverts commit b5cd3a80bbc, which was only motivated by not having `typing_extensions` available.~ (JIT can't be made to understand `Literal[False]`, so keep as is).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43415

Reviewed By: ngimel

Differential Revision: D23301168

Pulled By: malfet

fbshipit-source-id: cb5290f2e556b4036592655b9fe54564cbb036f6
2020-08-31 16:14:41 -07:00
e941a462a3 Enable gcc coverage in OSS (#43883)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43883

Check that the result of GCC coverage in OSS is reasonable and ready to ship.

The number of executable lines is not the same between `gcc` and `clang`, for the following reasons:
* The following lines are counted in `clang` but not in `gcc`:
1. an empty line, or a line containing only "{" or "}"
2. some comments are counted in clang but not in gcc
3. `#define ...` -- not supported by gcc according to the official documentation

* Besides, a statement that spans more than one line is counted as only one executable line in gcc, but as several lines in clang

## Advantage of `gcc` coverage
1. Much faster
- code coverage tool runtime is only **4 min** (*amazing!*) with `gcc`, compared to **3 hours!!** with `clang`, to analyze all the tests' artifacts
2. Uses less disk
- `Clang`'s artifacts take up as much as 170G, while `GCC`'s take 980M

Besides, also update `README.md`.

Test Plan:
Compare the result in OSS `clang` and OSS `gcc` with the same command:
```
python oss_coverage.py --run-only atest test_nn.py --interested-folder=aten
```

----

## GCC
**Summary**
> time: 0:15:45
summary percentage: 44.85%

**Report and Log**
[File Coverage Report](P140825162)
[Line Coverage Report](P140825196)
[Log](P140825385)

------

## CLANG

**Summary**
> time: 0:21:35
summary percentage: 44.08%

**Report and Log**
[File Coverage Report](P140825845)
[Line Coverage Report](P140825923)
[Log](P140825950)

----------

# Run all tests
```
# run all tests and get coverage over Pytorch
python oss_coverage.py
```
**Summary**
> time: 1:27:20. ( time to run tests:  1:23:33)
summary percentage: 56.62%

**Report and Log**
[File Coverage Report](P140837175)
[Log](P140837121)

Reviewed By: malfet

Differential Revision: D23416772

fbshipit-source-id: a6810fa4d8199690f10bd0a4f58a42ab2a22182b
2020-08-31 16:11:33 -07:00
da0e93a8c3 Move fbcode related coverage code to fb/ folder and add TARGETS (#43800)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43800

1. Move fbcode-related coverage code to the fb/ folder and add TARGETS so that we can use buck run to run the tool; this also solves the import problem.

2. Write `README.md` to give users guidance about the tool

Test Plan:
On devserver:
```
buck run //caffe2/fb/code_coverage/tool:coverage -- //caffe2/c10:
```

More examples in README.md

Reviewed By: malfet

Differential Revision: D23404988

fbshipit-source-id: 4942cd0e0fb7bd28a5e884d9835b93f00adb7b92
2020-08-31 16:10:33 -07:00
3682df77db Implementing NumPy-like function torch.heaviside() (#42523)
Summary:
- Related to https://github.com/pytorch/pytorch/issues/38349
- Implements the NumPy-like function `torch.heaviside()`.
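
For illustration, a small example of the new function:

```
import torch

x = torch.tensor([-1.5, 0.0, 2.0])
values = torch.tensor([0.5])       # used where x == 0
print(torch.heaviside(x, values))  # tensor([0.0000, 0.5000, 1.0000])
```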

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42523

Reviewed By: ngimel

Differential Revision: D23416743

Pulled By: mruberry

fbshipit-source-id: 9975bd9c9fa73bd0958fe9879f79a692aeb722d5
2020-08-31 15:54:56 -07:00
7680d87a76 Let linspace support bfloat16 and complex dtypes (#43578)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43578

Reviewed By: malfet

Differential Revision: D23413690

Pulled By: mruberry

fbshipit-source-id: 8c24f7b054269e1317fe53d26d523fea4decb164
2020-08-31 14:54:22 -07:00
3278beff44 Skip target determination for codecov test (#43899)
Summary:
Python code coverage tests should not rely on target determination as it will negatively impact the coverage score

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43899

Reviewed By: seemethere

Differential Revision: D23432069

Pulled By: malfet

fbshipit-source-id: 341fcadafaab6bd96d33d23973e01f7d421a6593
2020-08-31 14:43:12 -07:00
ffca81e38b [pytorch][bot] update mobile op deps (#43871)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43871

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23422523

Pulled By: ljk53

fbshipit-source-id: 95f2a1b6a2d25b13618c65944a2b919922083fb8
2020-08-31 14:42:12 -07:00
4e4626a23d Join-based API to support DDP uneven inputs (#42577)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42577

Closes https://github.com/pytorch/pytorch/issues/38174. Implements a join-based API to support training with the DDP module in the scenario where different processes have different no. of inputs. The implementation follows the description in https://github.com/pytorch/pytorch/issues/38174. Details are available in the RFC, but as a summary, we make the following changes:

#### Approach
1) Add a context manager `torch.nn.parallel.distributed.join`
2) In the forward pass, we schedule a "present" allreduce where non-joined processes contribute 1 and joined processes contribute 0. This lets us keep track of joined processes and know when all procs are joined.
3) When a process depletes its input and exits the context manager, it enters "joining" mode and attempts to "shadow" the collective comm. calls made in the model's forward and backward pass. For example we schedule the same allreduces in the same order as the backward pass, but with zeros
4) We adjust the allreduce division logic to divide by the effective world size (no. of non-joined procs) rather than the absolute world size to maintain correctness.
5) At the end of training, the last joined process is selected to be the "authoritative" model copy

We also make some misc. changes such as adding a `rank` argument to `_distributed_broadcast_coalesced` and exposing some getters/setters on `Reducer` to support the above changes.
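
A minimal usage sketch (the names `module`, `loader`, `optimizer`, and `rank` are placeholders, and process-group setup is elided):

```
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# after dist.init_process_group(...):
model = DDP(module, device_ids=[rank])
with model.join():
    for inputs in loader:  # per-rank loaders may have different lengths
        loss = model(inputs).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```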

#### How is it tested?
We have tests covering the following models/scenarios:
- [x] Simple linear model
- [x] Large convolutional model
- [x] Large model with module buffers that are broadcast in the forward pass (resnet). We verify this with a helper function `will_sync_module_buffers` and ensure this is true for ResNet (due to batchnorm)
- [x] Scenario where a rank calls join() without iterating at all, so without rebuilding buckets (which requires collective comm)
- [x] Model with unused params (with find unused parameters=True)
- [x] Scenarios where different processes iterate for a varying number of different iterations.
- [x] Test consistency in tie-breaking when multiple ranks are the last ones to join
- [x] Test that we divide by the effective world_size (no. of unjoined processes)

#### Performance implications

###### Trunk vs PR patched, 32 GPUs, batch size = 32
P50, forward + backward + optimizer batch latency & total QPS: 0.121 264/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.087 369/s vs 0.087 368/s

###### join(enable=True) vs without join, 32 GPUs, batch size = 32, even inputs
P50, forward + backward + optimizer batch latency & total QPS: 0.120 265/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.088 364/s vs 0.087 368/s

###### join(enable=False) vs without join, 32 GPUs, batch size = 32, even inputs
P50 forward + backward + optimizer batch latency & total QPS: 0.121 264/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.087 368/s vs 0.087 368/s

###### join(enable=True) with uneven inputs (offset = 2000), 32 GPUs, batch size = 32
P50 forward + backward + optimizer batch latency & total QPS: 0.183 174/s vs 0.121 264/s
P50 backwards only batch latency & total QPS: 0.150 213/s vs 0.087 368/s

###### join(enable=True) with uneven inputs ((offset = 2000)), 8 GPUs, batch size = 32
P50 forward + backward + optimizer batch latency & total QPS: 0.104 308/s vs 0.104 308/s
P50 backwards only batch latency & total QPS: 0.070 454/s vs 0.070 459/s

The two uneven-input benchmarks above were conducted on 32 GPUs, with 4 GPUs immediately depleting their inputs and entering "join" mode (i.e. not iterating at all) while the other 28 iterated as normal. It looks like there is a pretty significant perf hit for this case when there are uneven inputs and multi-node training. Strangely, with a single node (8 GPUs), this does not reproduce.

#### Limitations
1) This is only implemented for MPSD, not SPMD. Per a discussion with mrshenli we want to encourage the use of MPSD over SPMD for DDP.
2) This does not currently work with SyncBN or custom collective calls made in the model's forward pass. This is because the `join` class only shadows the `broadcast` for buffers in the forward pass, the gradient allreduces in the bwd pass, unused parameters reduction, and (optionally) the rebuild buckets broadcasting in the backwards pass. Supporting this will require additional design thought.
3) Has not been tested with the [DDP comm. hook](https://github.com/pytorch/pytorch/issues/39272) as this feature is still being finalized/in progress. We will add support for this in follow up PRs.
ghstack-source-id: 111033819

Reviewed By: mrshenli

Differential Revision: D22893859

fbshipit-source-id: dd02a7aac6c6cd968db882c62892ee1c48817fbe
2020-08-31 13:29:03 -07:00
2f52748515 Publish all_gather_object and gather_object docs (#43772)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43772

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D23398495

Pulled By: rohan-varma

fbshipit-source-id: 032e1d628c0c0f2dec297226167471698c56b605
2020-08-31 13:28:00 -07:00
f7bae5b6b1 Revert D23385091: [quant][graphmode][fx] Add top level APIs
Test Plan: revert-hammer

Differential Revision:
D23385091 (eb4199b0a7)

Original commit changeset: b789e54e1a0f

fbshipit-source-id: dc3dd9169d34beab92488d78d42d7e7d05e771d1
2020-08-31 12:18:29 -07:00
68304c527a Revert D23385090: [quant][graphmode][fx] Add support for weight prepack folding
Test Plan: revert-hammer

Differential Revision:
D23385090 (ef08f92076)

Original commit changeset: 11341f0af525

fbshipit-source-id: fe2bcdc16106923a2cee99eb5cc0a1e9c14ad2c5
2020-08-31 12:17:28 -07:00
0394c5a283 [fix] torch.multinomial : fix for 0 size dim (#43775)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43768

TO-DO:
* [x] Add test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43775

Reviewed By: ZolotukhinM

Differential Revision: D23421979

Pulled By: ngimel

fbshipit-source-id: 949fcdd30f18d17ae1c372fa6ca6a0b8d0d538ce
2020-08-31 11:57:42 -07:00
3c8b1d73c9 Update aliasing in tensorexpr fuser (#43743)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43743

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D23385205

Pulled By: eellison

fbshipit-source-id: 097a15d5bcf216453e1dd144d6117108b3deae4d
2020-08-31 11:52:26 -07:00
5da8a7bf2d use types in the IR instead of vmap (#43742)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43742

We can remove all prim::profiles, update the values to their specialized profiled types, and then later guard the input graphs based on the input types of the fusion group. After that we remove specialized tensor types from the graph. This removes the need to update the vmap and eliminates all of the profile nodes during fusing.

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D23385206

Pulled By: eellison

fbshipit-source-id: 2c84bd1d1c38df0d7585e523c30f7bd28f399d7c
2020-08-31 11:52:23 -07:00
259e5b7d71 Add passes to profiling executor pipeline (#43636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43636

We weren't running inlining in the forward graph of differentiable subgraphs, and we weren't getting rid of all profiles as part of optimization.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23358804

Pulled By: eellison

fbshipit-source-id: 05ede5fa356a15ca385f899006cb5b35484ef620
2020-08-31 11:52:20 -07:00
a7e7981c0b Use prim::TensorExprGroup interned symbol (#43635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43635

Intern the symbol; no functional changes. Aliasing needs to be looked at, but that should be done in a separate PR; this PR just changes the symbol.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23358806

Pulled By: eellison

fbshipit-source-id: f18bcd142a0daf514136f019ae607e4c3f45d9f8
2020-08-31 11:52:16 -07:00
1c0faa759e Update requires grad property (#43634)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43634

Because differentiable graphs detach the gradients of input Tensors, creating and inlining differentiable graphs changes the requires_grad property of tensors in the graph. In the legacy executor, this was not a problem, as the Fuser would simply ignore the gradient property because it was an invariant that the LegacyExecutor only passed tensors with grad = False. This is not the case with the profiler, as the Fuser does its own guarding.

Updating the type also helps with other typechecks, e.g. the ones specializing the backward, and with debugging the graph.

Other possibilities considered were:
- Fuser/Specialize AutogradZero always guards against requires_grad=False regardless of the profiled type
- Re-profile forward execution of differentiable graph

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23358803

Pulled By: eellison

fbshipit-source-id: b106998accd5d0f718527bc00177de9af5bad5fc
2020-08-31 11:51:06 -07:00
2bede78a05 add qr_backward functionality for wide case (#42216)
Summary:
Unblocks implementation of https://github.com/pytorch/pytorch/issues/27036. Note that this PR ***does not*** fix #27036.
Currently QR decomposition only supports the square and tall (a.k.a. skinny) cases.
This PR adds functionality for wide A matrices/tensors, includes 3 unit tests for the new case,
and restructures the `qr_backward` method to use the same Walther method as a helper.

cc albanD t-vi

I don't have a GPU machine, so I haven't tested on CUDA, but everything passes on my local machine on CPU.

The basic idea of the PR is noted in the comments in the `Functions.cpp` file, but I'll note it here too for clarity:

Let $A_{m,n}$ be a matrix with $m < n$, and partition it as $A_{m,n} = [\, X_{m,m} \mid Y_{m,n-m} \,]$.
Take the QR of $X$ and call it $X = QU$; the $Q$ obtained from $X$ is the same as the $Q$ from QR on the entire $A$ matrix. Then transform $Y$ with the rotation $Q$ obtained from $X$ to get $V = Q^{T}Y$. Now $R = [\, U \mid V \,]$, and similarly for the grads of each piece: e.g. if $\bar{A}$ is `grad_A`, then
$\bar{A} = [\, \bar{X} \mid \bar{Y} \,]$ and $\bar{R} = [\, \bar{U} \mid \bar{V} \,]$, and then
$\bar{Y} = Q\bar{V}$, where
$\bar{V}$ is the `narrow()` of `grad_R`.
$\bar{X}$ is calculated very similarly to the original Walther formula (exactly the same in the tall and square cases) but is slightly modified here for wide-case matrices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42216

Reviewed By: glaringlee

Differential Revision: D23373118

Pulled By: albanD

fbshipit-source-id: 3702ba7e7e23923868c02cdb7e10a96036052344
2020-08-31 11:46:45 -07:00
69dd0bab90 [RPC profiling] Add test to ensure using record_function works for RPC (#43657)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43657

We didn't have a test ensuring that functions run over RPC under the profiler can use `with record_function()` to profile specific blocks of their execution. This is useful if the user wants information about specific blocks in a function run over RPC that is composed of many torch ops and some custom logic, for example.
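
For reference, a minimal sketch (mine, not the test's exact code) of the pattern being tested, where a function executed over RPC annotates an inner block:

```
import torch
from torch.autograd.profiler import record_function

def rpc_target(x):
    # the profiler will report "my_block" as its own event
    with record_function("my_block"):
        return torch.matmul(x, x) + 1
```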

Currently, this will not work if the function is TorchScripted, since `with record_function()` is not scriptable yet. We can add support in future PRs so that TorchScript RPC functions can also be profiled this way.
ghstack-source-id: 111033981

Reviewed By: mrshenli

Differential Revision: D23355215

fbshipit-source-id: 318d92e285afebfeeb2a7896b4959412c5c241d4
2020-08-31 11:43:09 -07:00
4ef12be900 Add __complex__ (#43844)
Summary:
fixes https://github.com/pytorch/pytorch/issues/43833
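
A small sketch (assuming the new dunder mirrors the existing `__float__`/`__int__` conversions requested in the linked issue):

```
import torch

z = torch.tensor(3 + 4j)
print(complex(z))   # (3+4j): zero-dim complex tensor -> Python complex
```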

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43844

Reviewed By: ZolotukhinM

Differential Revision: D23422000

Pulled By: ngimel

fbshipit-source-id: ebc6a27a9b04c77c3977e6c184cefce9e817cc2f
2020-08-31 11:39:41 -07:00
c5d0f091b2 addmm/addmv should accept complex alpha and beta (#43827)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43827

Reviewed By: malfet

Differential Revision: D23415869

Pulled By: ngimel

fbshipit-source-id: a47b76df5fb751f76d36697f5fd95c69dd3a6efe
2020-08-31 11:35:58 -07:00
89452a67de [fx] GraphModule.src -> GraphModule.code (#43655)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43655

Pure, unadulterated bikeshed. The good stuff.

This makes things more consistent with ScriptModule.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23401528

Pulled By: suo

fbshipit-source-id: 7dd8396365f118abcd045434acd9348545314f44
2020-08-31 11:26:05 -07:00
1390cad2d8 [NNC] Hook up registerizer to Cuda codegen [2/x] (#42878)
Summary:
Insert the registerizer into the CUDA codegen pass list to enable scalar replacement and close the gap in simple reduction performance.

First up, the good stuff. Benchmark before:
```
          Column sum          Caffe2             NNC          Simple          Better
           (10, 100)          5.7917          9.7037          6.9386          6.0448
          (100, 100)          5.9338          14.972          7.1139          6.3254
        (100, 10000)          21.453          741.54          145.74          12.555
        (1000, 1000)          8.0678          122.75          22.833          9.0778

             Row sum          Caffe2             NNC          Simple          Better
           (10, 100)          5.4502          7.9661          6.1469          5.5587
          (100, 100)          5.7613          13.897           21.49          5.5808
        (100, 10000)          21.702          82.398          75.462          22.793
        (1000, 1000)          22.527             129          176.51          22.517

```

After:
```
          Column sum          Caffe2             NNC          Simple          Better
           (10, 100)          6.0458          9.4966          7.1094           6.056
          (100, 100)          5.9299          9.1482          7.1693           6.593
        (100, 10000)          21.739          121.97          162.63          14.376
        (1000, 1000)          9.2374           29.01          26.883          10.127

             Row sum          Caffe2             NNC          Simple          Better
           (10, 100)          5.9773          8.1792          7.2307          5.8941
          (100, 100)          6.1456          9.3155          24.563          5.8163
        (100, 10000)          25.384          30.212          88.531          27.185
        (1000, 1000)          26.517          32.702          209.31          26.537
```

Speedup is about 3-8x depending on the size of the data (increasing with bigger inputs).

The gap between NNC and simple is closed or eliminated - remaining issue appears to be kernel launch overhead. Next up is getting us closer to the _Better_ kernel.

It required a lot of refactoring and bug fixes on the way:
* Refactored flattening of parallelized loops out of the CudaPrinter and into its own stage, so we can transform the graph in the stage between flattening and printing (where registerization occurs).
* Made AtomicAddFuser less pessimistic: it will now recognize that if an Add to a buffer depends on all used Block and Thread vars, then it has no overlap and does not need to be atomic. This allows registerization to apply to these stores.
* Fixed PrioritizeLoad mutator so that it does not attempt to separate the Store and Load to the same buffer (i.e. reduction case).
* Moved CudaAnalysis earlier in the process, allowing later stages to use the analyzed bufs.
* Fixed a bug in the Registerizer where when adding a default initializer statement it would use the dtype of the underlying var (which is always kHandle) instead of the dtype of the Buf.
* Fixed a bug in the IRMutator where Allocate statements logic was inverted to be replaced only if they did not change.
* Added simplification of simple Division patterns to the IRSimplifier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42878

Reviewed By: glaringlee

Differential Revision: D23382499

Pulled By: nickgg

fbshipit-source-id: 3640a98fd843723abad9f54e67070d48c96fe949
2020-08-31 10:39:46 -07:00
63dbef3038 Better msg (#43848)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43848

Missing space in logging.

Test Plan: build

Reviewed By: hl475

Differential Revision: D23416698

fbshipit-source-id: bf7c494f33836601f5f380c03a0910f419c2e62b
2020-08-31 10:36:59 -07:00
ef08f92076 [quant][graphmode][fx] Add support for weight prepack folding (#43728)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43728

Trace back from the weight node until we hit getattr, reconstruct the graph module with the traced nodes,
and run the graph module to pack the weight. Then replace the original chain of ops with the packed weight.

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23385090

fbshipit-source-id: 11341f0af525a02ecec36f163a9cd35dee3744a1
2020-08-31 10:35:11 -07:00
eb4199b0a7 [quant][graphmode][fx] Add top level APIs (#43581)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43581

Add APIs similar to those for eager mode and for graph mode on TorchScript:
- fuse_fx
- quantize_fx (for both post training static and qat)
- quantize_dynamic_fx (for post training dynamic)
- prepare_fx (for both post training static and qat)
- prepare_dynamic_fx (for post training dynamic)
- convert_fx (for all modes)

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23385091

fbshipit-source-id: b789e54e1a0f3af6b026fd568281984e253e0433
2020-08-31 10:12:55 -07:00
42c895de4d Properly check that reduction strings are valid for l1_loss, smoothl1_loss, and mse_loss. (#43527)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43527

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23306786

Pulled By: gchanan

fbshipit-source-id: f3b7c9c02ae02813da116cb6b247a95727c47587
2020-08-31 09:53:56 -07:00
b8d34547ee [quant][graphmode][fx][fix] enable per channel quantization for functional ops (#43534)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43534

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23310857

fbshipit-source-id: ff7a681ee55bcc51f564e9de78319249b989366c
2020-08-31 09:35:25 -07:00
6ea89166bd Rewrite of ATen code generator (#42629)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42629

How to approach reviewing this diff:

- The new codegen itself lives in `tools/codegen`. Start with `gen.py`, then read `model.py` and then the `api/` folder. The comments at the top of the files describe what is going on. The CLI interface of the new codegen is similar to the old one, but (1) it is no longer necessary to explicitly specify cwrap inputs (and now we will error if you do so) and (2) the default settings for source and install dir are much better; to the extent that if you run the codegen from the root source directory as just `python -m tools.codegen.gen`, something reasonable will happen.
- The old codegen is (nearly) entirely deleted; every Python file in `aten/src/ATen` was deleted except for `common_with_cwrap.py`, which now permanently finds its home in `tools/shared/cwrap_common.py` (previously cmake copied the file there), and `code_template.py`, which now lives in `tools/codegen/code_template.py`. We remove the copying logic for `common_with_cwrap.py`.
- All of the inputs to the old codegen are deleted.
- Build rules now have to be adjusted to not refer to files that no longer exist, and to abide by the (slightly modified) CLI.
- LegacyTHFunctions files have been generated and checked in. We expect these to be deleted as these final functions get ported to ATen. The deletion process is straightforward; just delete the functions of the ones you are porting. There are 39 more functions left to port.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D23183978

Pulled By: ezyang

fbshipit-source-id: 6073ba432ad182c7284a97147b05f0574a02f763
2020-08-31 09:00:22 -07:00
576880febf Print all traceback for nested backwards in detect_anomaly (#43626)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43405.

This pull request adds a feature of printing all tracebacks if a `detect_anomaly` mode detects `nan` in nested backward operations.
The way I did it is by assigning a node as a parent to all nodes it produces during its backward calculation. Then if one of the children produces `nan`, it will print the traceback from the parent and grandparents (if any).

The parent is assigned in `parent_node_` member in `Node` class which is accessible in C++ by function `node->parent()` and in Python by `node.parent_function`.
A node has a parent iff:

1. it is created from a backward operation, and
2. created when anomaly mode and grad mode are both enabled.

An example of this feature:

    import torch

    def example():
        x = torch.tensor(1.0, requires_grad=True)
        y = torch.tensor(1e-8, requires_grad=True)  # small to induce nan in n-th backward
        a = x * y
        b = x * y
        z1 = a / b  # can produce nan in n-th backward as long as https://github.com/pytorch/pytorch/issues/43414 is unsolved
        z = z1 * z1
        gy , = torch.autograd.grad( z , (y,), create_graph=True)
        gy2, = torch.autograd.grad(gy , (y,), create_graph=True)
        gy3, = torch.autograd.grad(gy2, (y,), create_graph=True)
        gy4, = torch.autograd.grad(gy3, (y,), create_graph=True)
        return gy4

    with torch.autograd.detect_anomaly():
        gy4 = example()

with output:

    example.py:16: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
      with torch.autograd.detect_anomaly():
    /home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py:190: UserWarning: Error detected in DivBackward0. Traceback of forward call that caused the error:
      File "example.py", line 17, in <module>
        gy4 = example()
      File "example.py", line 12, in example
        gy3, = torch.autograd.grad(gy2, (y,), create_graph=True)
      File "/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 190, in grad
        return Variable._execution_engine.run_backward(
     (Triggered internally at  ../torch/csrc/autograd/python_anomaly_mode.cpp:61.)
      return Variable._execution_engine.run_backward(
    /home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py:190: UserWarning:

    Traceback of forward call that induces the previous calculation:
      File "example.py", line 17, in <module>
        gy4 = example()
      File "example.py", line 11, in example
        gy2, = torch.autograd.grad(gy , (y,), create_graph=True)
      File "/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 190, in grad
        return Variable._execution_engine.run_backward(
     (Triggered internally at  ../torch/csrc/autograd/python_anomaly_mode.cpp:65.)
      return Variable._execution_engine.run_backward(
    /home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py:190: UserWarning:

    Traceback of forward call that induces the previous calculation:
      File "example.py", line 17, in <module>
        gy4 = example()
      File "example.py", line 8, in example
        z1 = a / b  # can produce nan in n-th backward as long as https://github.com/pytorch/pytorch/issues/43414 is unsolved
     (Triggered internally at  ../torch/csrc/autograd/python_anomaly_mode.cpp:65.)
      return Variable._execution_engine.run_backward(
    Traceback (most recent call last):
      File "example.py", line 17, in <module>
        gy4 = example()
      File "example.py", line 13, in example
        gy4, = torch.autograd.grad(gy3, (y,), create_graph=True)
      File "/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 190, in grad
        return Variable._execution_engine.run_backward(
    RuntimeError: Function 'DivBackward0' returned nan values in its 1th output.

cc & thanks to albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43626

Reviewed By: malfet

Differential Revision: D23397499

Pulled By: albanD

fbshipit-source-id: aa7435ec2a7f0d23a7a02ab7db751c198faf3b7d
2020-08-31 08:23:07 -07:00
1cdb9d2ab5 Test runner for batched gradient computation with vmap (#43664)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43664

This PR implements the test runner for batched gradient computation with
vmap. It also implements the batching rule for sigmoid_backward and
tests that one can compute batched gradients with sigmoid (and batched
2nd gradients).

Test Plan: - New tests: `python test/test_vmap.py -v`

Reviewed By: ezyang

Differential Revision: D23358555

Pulled By: zou3519

fbshipit-source-id: 7bb05b845a41b638b7cca45a5eff1fbfb542a51f
2020-08-31 08:21:41 -07:00
1dcc4fb6b7 Kill unused _pointwise_loss function. (#43523)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43523

The code is also wrong, see https://github.com/pytorch/pytorch/issues/43228.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23305461

Pulled By: gchanan

fbshipit-source-id: 9fe516d87a4243d5ce3c29e8822417709a1d6346
2020-08-31 07:58:04 -07:00
a860be898e [resubmit] Add amax/amin (#43819)
Summary:
Resubmit for landing next week.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43819

Reviewed By: ngimel

Differential Revision: D23421906

Pulled By: mruberry

fbshipit-source-id: 23dd60d1e365bb1197d660c3bfad7ee07ba3e97f
2020-08-31 04:54:48 -07:00
8fb7c50250 Enable complex blas for ROCm. (#43744)
Summary:
Revert "Skips some complex tests on ROCm (https://github.com/pytorch/pytorch/issues/42759)".  This reverts commit 55b1706775726418ddc5dd3b7756ea0388c0817c.

Use new cuda_to_hip_mappings.py from https://github.com/pytorch/pytorch/issues/43004.

Fixes https://github.com/pytorch/pytorch/pull/42383#issuecomment-670771922

CC sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43744

Reviewed By: glaringlee

Differential Revision: D23391263

Pulled By: ngimel

fbshipit-source-id: ddf734cea3ba69c24f0d79cf1b87c05cdb45ec3d
2020-08-30 22:43:54 -07:00
08126c9153 [ONNX] Utilize ONNX shape inference for ONNX exporter (#40628)
Summary:
It is often the case that converting a torch operator to an ONNX operator requires the input rank/dtype/shape to be known. Previously, the conversion depended on the tracer to provide this info, leaving a gap in the conversion of scripted modules.

We are extending the export with support from ONNX shape inference. If enabled, ONNX shape inference will be called whenever an ONNX node is created. This is the first PR introducing the initial look of the feature; more and more cases will be supported in follow-ups.

* Added pass to run onnx shape inference on a given node. The node has to have namespace `onnx`.
* Moved helper functions from `export.cpp` to a common place for re-use.
* This feature is currently experimental, and can be turned on through flag `onnx_shape_inference` in internal api `torch.onnx._export`.
* Currently skipping ONNX Sequence ops, If/Loop and ConstantOfShape due to limitations. Support will be added in the future.
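
A hedged sketch of turning the experimental flag on; the keyword comes from the description above and the exact signature may differ between versions:

```
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x.relu()

# internal API; onnx_shape_inference is the experimental flag described above
torch.onnx._export(M(), (torch.randn(2, 3),), "m.onnx",
                   onnx_shape_inference=True)
```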

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40628

Reviewed By: mrshenli

Differential Revision: D22709746

Pulled By: bzinodev

fbshipit-source-id: b52aeeae00667e66e0b0c1144022f7af9a8b2948
2020-08-30 18:35:46 -07:00
3aeb70db0b Documents sub properly, adds subtract alias (#43850)
Summary:
`torch.sub` was undocumented, so this PR adds its documentation, analogous to `torch.add`'s documentation, and adds the alias `torch.subtract` for `torch.sub`, too. This alias comes from NumPy (see https://numpy.org/doc/stable/reference/generated/numpy.subtract.html?highlight=subtract#numpy.subtract)
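
The alias in action (a small illustration, not from the PR):

```
import torch

a, b = torch.tensor([3., 5.]), torch.tensor([1., 2.])
torch.subtract(a, b)        # tensor([2., 3.]), same as torch.sub
torch.sub(a, b, alpha=2)    # tensor([1., 1.]): computes a - alpha * b
```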

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43850

Reviewed By: ngimel

Differential Revision: D23416908

Pulled By: mruberry

fbshipit-source-id: 6c4d2ebaf6ecae91f3a6efe484ce6c4dad96f016
2020-08-30 15:44:56 -07:00
3dc9645430 Disable RocM CircleCI jobs (#42630)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42630

Reviewed By: seemethere

Differential Revision: D22957640

Pulled By: malfet

fbshipit-source-id: 9f7d633310c653fcd14e66755168c0e559307b69
2020-08-30 11:41:40 -07:00
7b835eb887 Update CUDA11 docker container (#42200)
Summary:
- no more `-rc`
- add magma

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42200

Reviewed By: ZolotukhinM, mruberry

Differential Revision: D23411686

Pulled By: malfet

fbshipit-source-id: 04532bc1cc65b3e14ddf29e8bf61a7a3b4c706ad
2020-08-30 11:39:20 -07:00
5021ec826b Fix docs for kwargs, f-p (#43586)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43586

Reviewed By: glaringlee

Differential Revision: D23390667

Pulled By: mruberry

fbshipit-source-id: dd51a4a48ff4e2fc10675ec817a206041957982f
2020-08-30 10:13:36 -07:00
1830e4f08c Remove unnamed namespace in headers (#43689)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43689

Test Plan: Imported from OSS

Reviewed By: eellison, asuhan

Differential Revision: D23367636

Pulled By: bertmaher

fbshipit-source-id: ddb6d34d2f7cadff3a591c3650e1dd1b401c3d2d
2020-08-29 22:45:53 -07:00
ab3ea95e90 #include <string> in loopnest.h (#43835)
Summary:
This file is causing a compilation failure with my gcc 10.1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43835

Reviewed By: bhosmer

Differential Revision: D23416417

Pulled By: ZolotukhinM

fbshipit-source-id: d0c2998347438fb729212574d52ce20dd6faae85
2020-08-29 19:06:44 -07:00
628db9699f Vulkan command buffer and pool. (#42930)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42930

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252333

Pulled By: AshkanAliabadi

fbshipit-source-id: 738385e0058edf3d3b34173e1b1011356adb7b3c
2020-08-29 17:48:19 -07:00
d1df098956 Vulkan resource cache. (#42709)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42709

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252339

Pulled By: AshkanAliabadi

fbshipit-source-id: 977ab3fdedfe98789a48dd263127529d8be0ed37
2020-08-29 17:48:17 -07:00
87e8f50aae Vulkan descriptor and descriptor layout cache. (#42642)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42642

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252337

Pulled By: AshkanAliabadi

fbshipit-source-id: 075acc8c093e639bb24a0d4653d5c922b36a1128
2020-08-29 17:48:14 -07:00
15aaeb8867 Vulkan pipeline and pipeline layout cache. (#42395)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42395

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252334

Pulled By: AshkanAliabadi

fbshipit-source-id: 6b4e88f9794a7879d47a1cdb671076d50f1944d9
2020-08-29 17:48:12 -07:00
387dc24c92 Vulkan memory allocator. (#42786)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42786

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252332

Pulled By: AshkanAliabadi

fbshipit-source-id: 14e848ad81b4ba1367e8cf719343a51995457827
2020-08-29 17:48:10 -07:00
287fb273cd Vulkan (source and binary) shader and shader layout cache. (#42325)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42325

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252336

Pulled By: AshkanAliabadi

fbshipit-source-id: f3f26c78366be45c90a370db9194d88defbf08d8
2020-08-29 17:48:08 -07:00
6373063a98 Generic Vulkan object cache. (#42394)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42394

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252340

Pulled By: AshkanAliabadi

fbshipit-source-id: 34e753964b94153ed6ed1fcaa7f3b4a7c6b5f340
2020-08-29 17:48:06 -07:00
4e39c310eb Move torch/csrc/utils/hash.h to c10/util/hash.h. (#42503)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42503

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252331

Pulled By: AshkanAliabadi

fbshipit-source-id: 3c4c0e27b9a7eec8560e374c2a3ba5f1c65dae48
2020-08-29 17:47:00 -07:00
7f967c08b8 Document the beta=0 behavior of BLAS functions (#43823)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43823

Reviewed By: mruberry

Differential Revision: D23413899

Pulled By: ngimel

fbshipit-source-id: d3c4e5631db729a3f3d5eb9290c76cb1aa529f74
2020-08-29 13:03:16 -07:00
cc52386096 Revert D19987020: [pytorch][PR] Add the sls tensor train op
Test Plan: revert-hammer

Differential Revision:
D19987020 (f31b111a35)

Original commit changeset: e3ca7b00a374

fbshipit-source-id: a600c747a45dfb51e0882196e382a21ccaa7b989
2020-08-29 12:46:11 -07:00
45ba836876 Revert "Revert D23252335: Refactor Vulkan context into its own files. Use RAII." (#43628)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43628

This reverts commit 6c772515ed1a87ec676382492ff3c019c6d194c3.

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23356714

Pulled By: AshkanAliabadi

fbshipit-source-id: a44af3b3c7b00a097eae1b0c9a00fdabc7ab6f86
2020-08-29 12:39:22 -07:00
f31b111a35 Add the sls tensor train op (#33525)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33525

Reviewed By: wx1988

Differential Revision: D19987020

Pulled By: lly-zero-one

fbshipit-source-id: e3ca7b00a374a75ee42716c4e6236bf168ebebf1
2020-08-29 12:16:44 -07:00
550fb2fd52 Expand the coverage of test_blas_empty (#43822)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43822

Reviewed By: mruberry

Differential Revision: D23413359

Pulled By: ngimel

fbshipit-source-id: fcdb337e32ed2d1c791fa0762d5233b346b26d14
2020-08-29 12:13:15 -07:00
60ad7e9c04 [TensorExpr] Make sum available from Python (#43730)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43730

Test Plan:
python test/test_jit_fuser_te.py -k TestTEFuser.test_sum
test_tensorexpr --gtest_filter=TensorExprTest.KernelSum*

Reviewed By: ZolotukhinM

Differential Revision: D23407600

Pulled By: asuhan

fbshipit-source-id: e6da4690ae6d802f9be012e39e61b7467aa5285c
2020-08-29 10:38:21 -07:00
8a41fa4718 [Selective Build] Move register_prim_ops and register_special_ops to app level (#43539)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43539

Move the two source files out of the base internal mobile library to the app level, making them ready for app-based selective build. The open-source build should not be affected; the file list change in build_variables.bzl affects internal builds only.

ghstack-source-id: 111006135

Test Plan: CI

Reviewed By: ljk53

Differential Revision: D23287661

fbshipit-source-id: 9b2d688544e79e0fca9c84730ef0259952cd8abe
2020-08-29 03:12:28 -07:00
d10056652b Enable torch.half for lt and masked_select (#43704)
Summary:
Enable testing of those options in `TestTorchDeviceTypeCPU.test_logical_cpu` and `TestTorchDeviceTypeCPU.test_masked_select_cpu_float16`
Add `view_as_real` testing for `torch.complex32` type

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43704

Reviewed By: albanD

Differential Revision: D23373070

Pulled By: malfet

fbshipit-source-id: 00f17f23b48513379a414227aea91e2d3c0dd5f9
2020-08-29 02:37:26 -07:00
931b8b4ac8 Use ivalue::Future in autograd engine and DistEngine. (#43676)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43676

This is one part of https://github.com/pytorch/pytorch/issues/41574 to
ensure we consolidate everything around ivalue::Future.

I've removed the use of torch/csrc/utils/future.h from the autograd engines and
used ivalue::Future instead.
ghstack-source-id: 110895545

Test Plan: waitforbuildbot.

Reviewed By: albanD

Differential Revision: D23362415

fbshipit-source-id: aa109b3f8acf0814d59fc5264a85a8c27ef4bdb6
2020-08-29 02:15:26 -07:00
000739c31a Function calls for fallback paths (#43274)
Summary:
This PR adds an API to package unoptimized/fallback blocks as function calls. It is mainly meant to be used by the TensorExpressionsFuser and SpecializeAutogradZero passes, as both specialize the original graph but would also like to provide a fallback path in case the assumptions under which the graph was specialized do not hold for some inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43274

Reviewed By: malfet

Differential Revision: D23406961

Pulled By: Krovatkin

fbshipit-source-id: ef21fc9ad886953461b09418d02c75c58375490c
2020-08-28 23:31:02 -07:00
8538a79bfe [jit][static] Basic executor (#43647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43647

Nothing fancy, just a basic implementation of the graph executor without using stack machine.

Reviewed By: bwasti

Differential Revision: D23208413

fbshipit-source-id: e483bb6ad7ba8591bbe1767e669654d82f42c356
2020-08-28 23:20:07 -07:00
6aaae3b08b [ONNX] Addition of diagnostic tool API (#43020)
Summary:
Added initial diagnostic tool API

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43020

Reviewed By: malfet

Differential Revision: D23398459

Pulled By: bzinodev

fbshipit-source-id: 7a6d9164a19e3ba51676fbcf645c4d358825eb42
2020-08-28 23:04:59 -07:00
58148c85f4 Use template OperatorGenerator for prim and special operator registration (#43481)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43481

Apply OperatorGenerator to prim and special operator registration. It does not affect the existing build by default. However, if a whitelist of operators exists, only the operators in the whitelist will be registered. This has the potential to save up to 200 KB of binary size, depending on usage.

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D23287251

Pulled By: iseeyuan

fbshipit-source-id: 3ca39fbba645bad8d69e69195f3680e4f6d633c5
2020-08-28 21:18:00 -07:00
8997a4b56b [typing] Enable typing in torch.quantization.fuse_modules typechecks … (#43786)
Summary:
Enable typing in torch.quantization.fuse_modules typechecks during CI.

Fixes #42971

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43786

Reviewed By: malfet

Differential Revision: D23403258

Pulled By: yizhouyu

fbshipit-source-id: 4cd24a4fcf1408341a210fa50f574887b6db5e0e
2020-08-28 20:42:23 -07:00
eae92b7187 Updated README.md by correcting grammatical errors (#43779)
Summary:
Fixed grammatical errors and punctuation so that it can be more understandable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43779

Reviewed By: ZolotukhinM

Differential Revision: D23407849

Pulled By: malfet

fbshipit-source-id: 09c064ce68d0f37f8023c2ecae8775fc00541a2c
2020-08-28 20:30:03 -07:00
13c7c6227e Python/C++ API Parity: TransformerDecoder (#42886)
Summary:
Fixes [#37756](https://github.com/pytorch/pytorch/issues/37756)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42886

Reviewed By: zhangguanheng66

Differential Revision: D23385631

Pulled By: glaringlee

fbshipit-source-id: 610a2fabb4c25b2dfd37b33287215bb8872d653d
2020-08-28 20:13:53 -07:00
64906497cd Revert D23391941: [pytorch][PR] Implementing NumPy-like function torch.heaviside()
Test Plan: revert-hammer

Differential Revision:
D23391941 (a1eae6d158)

Original commit changeset: 7b942321a625

fbshipit-source-id: c2a7418a1fedaa9493300945c30e2392fc0d08ee
2020-08-28 19:16:58 -07:00
47e489b135 Make ExtraFilesMap return bytes instead of str (#43241)
Summary:
This is for the case where we want to store binary files using the `ScriptModule.save(..., _extra_files=...)` functionality. With Python 3 we can just use bytes and not bother about encoding.

I had to do a copy-paste from the pybind sources; maybe we should upstream it, but it'd mean adding a bunch of template arguments to `bind_map`, which is a bit untidy.

Let me know if there's a better place to park this function (it seems to be the only invocation of `bind_map` so I put it in the same file)
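
A rough sketch of the round trip this enables (my own example; the in-memory buffer is just for illustration):

```
import io
import torch

m = torch.jit.script(torch.nn.ReLU())
buf = io.BytesIO()
torch.jit.save(m, buf, _extra_files={"blob.bin": b"\x00\xffdata"})

buf.seek(0)
extra = {"blob.bin": ""}
torch.jit.load(buf, _extra_files=extra)
print(type(extra["blob.bin"]))   # bytes after this change, str before
```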

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43241

Reviewed By: zdevito

Differential Revision: D23205244

Pulled By: dzhulgakov

fbshipit-source-id: 8f291eb4294945fe1c581c620d48ba2e81b3dd9c
2020-08-28 19:11:33 -07:00
1a79d7bb28 DDP communication hook examples (#43310)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43310

In this diff, we prepared some example DDP communication hooks [#40848](https://github.com/pytorch/pytorch/pull/40848):

1\. `allreduce_hook`: This DDP communication hook just calls ``allreduce`` using ``GradBucket`` tensors. Once gradient tensors are aggregated across all workers, its ``then`` callback takes the mean and returns the result. If a user registers this hook, the DDP results are expected to be the same as in the case where no hook was registered. Hence, this won't change the behavior of DDP, and users can use it as a reference or modify it to log useful information or for any other purpose, without affecting DDP behavior.

2\. `allgather_then_aggregate_hook`: Similar to ``allreduce_hook``, this hook first gathers ``GradBucket`` tensors, and its ``then`` callback aggregates the gathered gradient tensors and takes the mean. Instead of ``allreduce``, this hook uses ``allgather``. Note that with W workers, both the computation and communication time scale as O(W) for allgather compared to O(logW) for allreduce. Therefore, this hook is expected to be much slower than ``allreduce_hook``, although both essentially do the same thing with the gradients.

3\. `fp16_compress_hook`: This DDP communication hook implements a simple gradient compression approach that converts ``GradBucket`` tensors whose type is assumed to be ``torch.float32`` to half-precision floating point format (``torch.float16``). It allreduces those ``float16`` gradient tensors. Once the compressed gradient tensors are allreduced, its ``then`` callback, called ``decompress``, converts the aggregated result back to ``float32`` and takes the mean.

4\. `quantization_pertensor_hook` does quantization per tensor and uses the idea in https://pytorch.org/docs/master/generated/torch.quantize_per_tensor.html.  Note that we separately send scale and zero_point (two floats per rank) before quantized tensors.

5\. `quantization_perchannel_hook` does quantization per channel, similar to https://pytorch.org/docs/master/generated/torch.quantize_per_channel.html. The main motivation is that, after the initial QSGD study diff, we realized that for considerably large gradient tensors (such as a tensor containing 6 million floats), dividing them into smaller channels (512-float chunks) and quantizing each independently may significantly increase the resolution and result in lower error.
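
For a flavor of the hook shape, here is a rough sketch in the spirit of `allreduce_hook`; the `GradBucket` accessor and registration names are approximations of the API at the time, not exact signatures:

```
import torch.distributed as dist

def allreduce_hook(process_group, bucket):
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = group.size()
    tensor = bucket.get_tensors()[0]   # gradients bundled into this bucket
    fut = dist.all_reduce(tensor, group=group, async_op=True).get_future()

    def then_callback(fut):
        # average the summed gradients once the allreduce completes
        return [fut.value()[0].div_(world_size)]

    return fut.then(then_callback)

# ddp_model._register_comm_hook(state=None, hook=allreduce_hook)
```
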
ghstack-source-id: 110923269

Test Plan:
python torch/distributed/algorithms/ddp_comm_hooks/test_ddp_hooks.py
Couldn't download test skip set, leaving all tests enabled...
.....
----------------------------------------------------------------------
Ran 4 tests in 26.724s

OK

Internal testing:
```
buck run mode/dev-nosan //caffe2/test/distributed/algorithms/ddp_comm_hooks:test_ddp_hooks
```

Reviewed By: malfet

Differential Revision: D22937999

fbshipit-source-id: 274452e7932414570999cb978ae77a97eb3fb0ec
2020-08-28 18:59:14 -07:00
68b9daa9bf Add torch.linalg.norm (#42749)
Summary:
Adds `torch.linalg.norm` function that matches the behavior of `numpy.linalg.norm`.

Additional changes:
* Add support for dimension wrapping in `frobenius_norm` and `nuclear_norm`
* Fix `out` argument behavior for `nuclear_norm`
* Fix issue where `frobenius_norm` allowed duplicates in `dim` argument
* Add `_norm_matrix`

Closes https://github.com/pytorch/pytorch/issues/24802
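
A brief usage sketch (my illustration) mirroring the NumPy semantics:

```
import torch

x = torch.arange(9, dtype=torch.float) - 4
A = x.reshape(3, 3)

torch.linalg.norm(x)                           # vector 2-norm (NumPy default)
torch.linalg.norm(A)                           # Frobenius norm for a matrix
torch.linalg.norm(A, ord='nuc')                # nuclear norm
torch.linalg.norm(A, ord=float('inf'), dim=1)  # per-row inf-norm reduction
```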

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42749

Reviewed By: ngimel

Differential Revision: D23336234

Pulled By: mruberry

fbshipit-source-id: f0aba3089a3a0bf856aa9c4215e673ff34228fac
2020-08-28 18:28:33 -07:00
cd0bab8d8d [ONNX] Where op (#41544)
Summary:
Extending where op export

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41544

Reviewed By: malfet

Differential Revision: D23279515

Pulled By: bzinodev

fbshipit-source-id: 4627c95ba18c8a5ac8d06839c343e06e71c46aa7
2020-08-28 18:15:01 -07:00
a1eae6d158 Implementing NumPy-like function torch.heaviside() (#42523)
Summary:
- Related with https://github.com/pytorch/pytorch/issues/38349
- Implementing the NumPy-like function `torch.heaviside()`.
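
A short illustration of the semantics (my example, not from the PR): 0 below zero, the second argument's value at exactly zero, 1 above zero:

```
import torch

x = torch.tensor([-1.5, 0.0, 2.0])
torch.heaviside(x, torch.tensor(0.5))   # tensor([0.0000, 0.5000, 1.0000])
```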

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42523

Reviewed By: glaringlee

Differential Revision: D23391941

Pulled By: mruberry

fbshipit-source-id: 7b942321a62567a5fc0a3679a289f4c4c19e6134
2020-08-28 18:11:20 -07:00
633d239409 [torch.fx] Pass placeholders through delegate too (#43432)
Summary:
It's useful if we add additional attributes to nodes in the graph - it's easier to set the attribute on all nodes, even if the value happens to be None.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43432

Reviewed By: jamesr66a

Differential Revision: D23276433

Pulled By: dzhulgakov

fbshipit-source-id: c69e7cb723bbbb4dba3b508a3d6c0e456fe610df
2020-08-28 18:07:52 -07:00
3f0120edb4 Revert D23360705: [pytorch][PR] Add amax/amin
Test Plan: revert-hammer

Differential Revision:
D23360705 (bcec8cc3f9)

Original commit changeset: 5bdeb08a2465

fbshipit-source-id: 76a9e199823c7585e55328bad0778bcd8cd49381
2020-08-28 18:01:25 -07:00
7d517cf96f [NCCL] Dedicated stream to run all FutureNCCL callbacks. (#43447)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43447

Two main better-engineering motivations to run all FutureNCCL callbacks on a dedicated stream:
1. Each time a ``then`` callback was called, we would get a stream from the pool and run the callback on that stream. With that approach, the stream traces show a lot of streams, and debugging becomes more complicated. If we have a dedicated stream to run all ``then`` callback operations, the trace results will be much cleaner and easier to follow.
2. getStreamFromPool may eventually return the default stream or a stream that is used for other operations. This can cause slowdowns.

Unless the ``then`` callback takes longer than the preceding allreduce, this approach will be as performant as the previous one.
ghstack-source-id: 110909401

Test Plan:
Perf trace runs to validate the desired behavior:
See the dedicated stream 152 is running the then callback operations:

{F299759342}

I ran pytorch.benchmark.main.workflow using resnet50 and 32 GPUs, registering allreduce with a ``then`` hook.
See f213777896 [traces](https://www.internalfb.com/intern/perfdoctor/results?run_id=26197585)

After updates, same observation: see f214890101

Reviewed By: malfet

Differential Revision: D23277575

fbshipit-source-id: 67a89900ed7b70f3daa92505f75049c547d6b4d9
2020-08-28 17:26:23 -07:00
3f5ea2367e Adding a version serialization type to ConvPackedParam (#43086)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43086

This PR changes the format of `ConvPackedParam` in a nearly backwards-compatible way:
* a new format is introduced which has more flexibility and a lower on-disk size
* custom pickle functions are added to `ConvPackedParams` which know how to load the old format
* the custom pickle functions are **not** BC because the output type of `__getstate__` has changed.  We expect this to be acceptable as no user flows are actually broken (loading a v1 model with v2 code works), which is why we whitelist the failure.

Test plan (TODO finalize):

```
// adhoc testing of saving v1 and loading in v2: https://gist.github.com/vkuzo/f3616c5de1b3109cb2a1f504feed69be

// test that loading models with v1 conv params format works and leads to the same numerics
python test/test_quantization.py TestSerialization.test_conv2d_graph
python test/test_quantization.py TestSerialization.test_conv2d_nobias_graph

// test that saving and loading models with v2 conv params format works and leads to same numerics
python test/test_quantization.py TestSerialization.test_conv2d_graph_v2
python test/test_quantization.py TestSerialization.test_conv2d_nobias_graph_v2

// TODO before land:
// test numerics for a real model
// test legacy ONNX path
```

Note: this is a newer copy of https://github.com/pytorch/pytorch/pull/40003

Test Plan: Imported from OSS

Reviewed By: dreiss

Differential Revision: D23347832

Pulled By: vkuzo

fbshipit-source-id: 06bbe4666421ebad25dc54004c3b49a481d3cc92
2020-08-28 15:41:30 -07:00
af4ecb3c11 quantized conv: add support for graph mode BC testing, and increase coverage (#43524)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43524

1. adds support for testing BC on data format and numerics for graph mode
quantized modules
2. using the above, adds coverage for quantized conv2d on graph mode

Test Plan:
```
python test/test_quantization.py TestSerialization.test_conv2d_nobias
python test/test_quantization.py TestSerialization.test_conv2d_graph
python test/test_quantization.py TestSerialization.test_conv2d_nobias_graph
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D23335222

fbshipit-source-id: 0c9e93a940bbf6c676c2576eb62fcc725247588b
2020-08-28 15:40:22 -07:00
4cb8d306e6 Add _foreach_add_(TensorList tensors, Scalar scalar) API (#42531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42531

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](https://github.com/pytorch/pytorch/pull/41554).

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient in cases where we work with a lot of small feature tensors: starting a lot of kernels slows down the whole process, so we need to reduce the number of kernels we start.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- The list can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What the 'fast' and 'slow' routes are**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can fall back to a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. A few checks decide whether the op will be performed via the 'fast' or the 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.

---------------
**In this PR**
- Adding a `std::vector<Tensor> _foreach_add_(TensorList tensors, Scalar scalar)` API
- Resolving some additional comments from previous [PR](https://github.com/pytorch/pytorch/pull/41554).
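
A minimal usage sketch of the difference this makes (illustrative; the op remains a private API):

```
import torch

params = [torch.randn(3, 3) for _ in range(10)]

# loop version: one dispatch (and, on CUDA, one kernel launch) per tensor
for p in params:
    p.add_(1.0)

# foreach version: a single call over the whole list
torch._foreach_add_(params, 1.0)
```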

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists

**Plan for the next PRs**
1. APIs
- Binary Ops for list with Scalar
- Binary Ops for list with list
- Unary Ops for list
- Pointwise Ops

2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23331892

Pulled By: izdeby

fbshipit-source-id: c585b72e1e87f6f273f904f75445618915665c4c
2020-08-28 14:34:46 -07:00
20abfc21e4 Adds arctanh, arcsinh aliases, simplifies arc* alias dispatch (#43762)
Summary:
Adds two more "missing" NumPy aliases: arctanh and arcsinh, and simplifies the dispatch of other arc* aliases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43762

Reviewed By: ngimel

Differential Revision: D23396370

Pulled By: mruberry

fbshipit-source-id: 43eb0c62536615fed221d460c1dec289526fb23c
2020-08-28 13:59:19 -07:00
0564d7a652 Land code coverage tool for OSS (#43778)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43778

Move code_coverage_tool from experimental folder to caffe2/tools folder.

Delete `TODO` and fb-related code.

Test Plan: Test locally

Reviewed By: malfet

Differential Revision: D23399983

fbshipit-source-id: 92316fd3cc88409d087d2dc6ed0be674155b3762
2020-08-28 13:56:15 -07:00
89e2a3591e Add 1% threshold to codecov (#43783)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43783

Reviewed By: seemethere

Differential Revision: D23402196

Pulled By: malfet

fbshipit-source-id: bd11d6edc6d1f15bd227636a549b9ea7b3aca256
2020-08-28 13:51:23 -07:00
b23e9cdd64 .circleci: Add slash to end of s3 cp (#43792)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43792

This fixes the issue we had with the nightlies not being uploaded
properly. Basically, what was happening was that `aws s3 cp` doesn't
automatically distinguish between prefixes that are already
"directories" and a single file with the same name.

This means that if you'd like to upload a file to a "directory" in S3
you need to suffix your destination with a slash.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D23402074

Pulled By: seemethere

fbshipit-source-id: 6085595283fcbbbab0836ccdfe0f8aa2a6abd7c8
2020-08-28 13:37:25 -07:00
776c2d495f [JIT] IRParser: store list attributes as generic ivalue lists. (#43785)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43785

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23400565

Pulled By: ZolotukhinM

fbshipit-source-id: e248eb1854c4ec40da9455d4279ea6e47b1f2a16
2020-08-28 13:27:28 -07:00
bcec8cc3f9 Add amax/amin (#43092)
Summary:
Add a max/min operator that only return values.

## Some important decision to discuss
| **Question**                          | **Current State** |
|---------------------------------------|-------------------|
| Expose torch.max_values to python?    | No                |
| Remove max_values and only keep amax? | Yes               |
| Should amax support named tensors?    | Not in this PR    |

## Numpy compatibility

Reference: https://numpy.org/doc/stable/reference/generated/numpy.amax.html

| Parameter                                                                                                                                                                                                                                              | PyTorch Behavior                                                                  |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------|
| `axis`:  None or int or tuple of ints, optional. Axis or axes along which to operate. By default, flattened input is used. If this is a tuple of ints, the maximum is selected over multiple axes, instead of a single axis or all the axes as before. | Named `dim`, behavior same as `torch.sum` (https://github.com/pytorch/pytorch/issues/29137)                                |
| `out`: ndarray, optional. Alternative output array in which to place the result. Must be of the same shape and buffer length as the expected output.                                                                                                   | Same                                                                              |
| `keepdims`: bool, optional. If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.                                      | implemented as `keepdim`                                                          |
| `initial`: scalar, optional. The minimum value of an output element. Must be present to allow computation on empty slice.                                                                                                                              | Not implemented in this PR. Better to implement for all reductions in the future. |
| `where`: array_like of bool, optional. Elements to compare for the maximum.                                                                                                                                                                            | Not implemented in this PR. Better to implement for all reductions in the future. |

**Note from numpy:**
> NaN values are propagated, that is if at least one item is NaN, the corresponding max value will be NaN as well. To ignore NaN values (MATLAB behavior), please use nanmax.

PyTorch has the same behavior
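
A quick illustration (mine, not from the PR) of the values-only semantics, contrasted with `torch.max`:

```
import torch

t = torch.tensor([[1., 4.], [3., 2.]])
torch.amax(t, dim=0)        # tensor([3., 4.])
torch.amin(t, dim=(0, 1))   # tensor(1.): multiple dims, like NumPy
torch.max(t, dim=0)         # (values, indices) named tuple, for contrast
```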

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43092

Reviewed By: ngimel

Differential Revision: D23360705

Pulled By: mruberry

fbshipit-source-id: 5bdeb08a2465836764a5a6fc1a6cc370ae1ec09d
2020-08-28 12:51:03 -07:00
f4695203c2 Fixes fft function calls for C++ API (#43749)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43732.

Requires importing the fft namespace in the C++ API, just as the Python API does, to avoid clobbering the `torch::fft` function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43749

Reviewed By: glaringlee

Differential Revision: D23391544

Pulled By: mruberry

fbshipit-source-id: d477d0b6d9a689d5c154ad6c31213a7d96fdf271
2020-08-28 12:41:30 -07:00
dc5d365514 Fix bug in caching allocator. (#43719)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43719

This accidentally slipped through: the with-guard did not update the current
context.

Test Plan: cpu_caching_allocator_test

Reviewed By: linbinyu

Differential Revision: D23374453

fbshipit-source-id: 1d3ef21cc390d0a8bde98fb1b5c2175b40ab571b
2020-08-28 11:56:23 -07:00
be3ec6ab3e [caffe2][torch] correctly re-raise Manifold StorageException
Summary:
1) Manifold raises StorageException when it sees an error: https://fburl.com/diffusion/kit3me8a
2) torch re-raises the exception: https://fburl.com/diffusion/zbw9wmpu
The issue here is that StorageException's first argument is a bool, canRetry, while the re-raise passes a str as the first argument, as in all Python exceptions.

Test Plan:
Existing tests should pass. +
```
In [1]: from manifold.clients.python import StorageException
In [2]: getattr(StorageException, "message", None)
Out[2]: <attribute 'message' of 'manifold.blobstore.blobstore.types.StorageException' objects>
In [3]: getattr(Exception, "message", None) is None
Out[3]: True
```

Reviewed By: haijunz

Differential Revision: D23195514

fbshipit-source-id: baa1667dbba4086db6ec93f009e400611ac9b938
2020-08-28 11:41:10 -07:00
b72da0cf28 OneDNN: report error for dilation max_pooling and replace AT_ERROR with TORCH_CHECK in oneDNN codes (#43538)
Summary:
Fix https://github.com/pytorch/pytorch/issues/43514.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43538

Reviewed By: agolynski

Differential Revision: D23364302

Pulled By: ngimel

fbshipit-source-id: 8d17752cf33dcacd34504e32b5e523e607cfb497
2020-08-28 10:57:19 -07:00
1f7434d1ea Fix 'module' to 'model' in quantize_dynamic doc (#43693)
Summary:
Fixes issue https://github.com/pytorch/pytorch/issues/43503

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43693

Reviewed By: malfet

Differential Revision: D23397641

Pulled By: mrshenli

fbshipit-source-id: bc216cea4f0a30c035e84a6cfebabd3755ef1305
2020-08-28 10:44:43 -07:00
a76184fe1e grammatical error fix (#43697)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43697

Reviewed By: malfet

Differential Revision: D23397655

Pulled By: mrshenli

fbshipit-source-id: fb447dcde4f83bc6650f0faa0728a1867cfa5213
2020-08-28 10:38:46 -07:00
b630c1870d Add stateful XNNPack deconvolution2d operator to torch. (#43233)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43233

XNNPack is already being used for the convolution2d operation. Add the
ability for it to be used with transpose convolution.

Test Plan: buck run caffe2/test:xnnpack_integration

Reviewed By: kimishpatel

Differential Revision: D23184249

fbshipit-source-id: 3fa728ce1eaca154d24e60f800d5e946d768c8b7
2020-08-28 10:31:36 -07:00
58a7e73a95 [TensorExpr] Block Codegen (#40054)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40054

Reviewed By: ZolotukhinM

Differential Revision: D22061350

Pulled By: protonu

fbshipit-source-id: 004f7c316629b16610ecdbb97e43036c72c65067
2020-08-28 09:53:42 -07:00
9063bcee04 Don't proceed into setup.py too far if Python version is unsupported (#42870)
Summary:
This prevents confusing errors when the interpreter encounters some
syntax errors in the middle.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42870

Reviewed By: albanD

Differential Revision: D23269265

Pulled By: ezyang

fbshipit-source-id: 61f62cbe294078ad4a909fa87aa93abd08c26344
2020-08-28 09:04:55 -07:00
c177d25edf TensorIterator: Check for memory overlap in all nullary_ops (#43421)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43421

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23298654

Pulled By: zou3519

fbshipit-source-id: 71b401f6ea1e3b50b830fef650927cc5b3fb940f
2020-08-28 08:40:25 -07:00
dc0722e9b7 TensorIterator: Check for memory overlap in all compare_ops (#43420)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43420

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23298650

Pulled By: zou3519

fbshipit-source-id: 171cd17a3012880a5d248ffd0ea6942fbfb6606f
2020-08-28 08:40:22 -07:00
065ebdb92f TensorIterator: Check for memory overlap in all binary_ops (#43419)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43419

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23298655

Pulled By: zou3519

fbshipit-source-id: 82e0ff308a6a7e46b4342d57ddb4c1d73745411a
2020-08-28 08:40:19 -07:00
bdee8e02c0 TensorIterator: Check memory overlap in all unary_ops (#43418)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43418

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23298651

Pulled By: zou3519

fbshipit-source-id: 84be498f5375813fd10cf30b8beabbd2d15210a3
2020-08-28 08:39:13 -07:00
0ab83f7f9f Fixed undefined behavior in BatchedFallback (#43705)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43705

This was causing fb-internal flakiness. I'm surprised that the ASAN
builds don't catch this behavior.

The problem is that dereferencing the end() pointer of a vector is
undefined behavior. This PR fixes one callsite where BatchedFallback
dereferences the end() pointer and adds an assert to make sure another
callsite doesn't do that.

Test Plan:
- Make sure all tests pass (`pytest test/test_vmap.py -v`)
- It's hard to write a new test for this because most of the time this
doesn't cause a crash. It really depends on what lives at the end()
pointer.

Reviewed By: ezyang

Differential Revision: D23373352

Pulled By: zou3519

fbshipit-source-id: 61ea0be80dc006f6d4e73f2c5badd75096f63e56
2020-08-28 08:09:17 -07:00
8e507ad00e Update the div formula for numerical stability (#43627)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43414

See the issue for numerical improvements and quick benchmark.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43627

Reviewed By: agolynski

Differential Revision: D23350124

Pulled By: albanD

fbshipit-source-id: 19d51640b3f200db37c32d2233a4244480e5a15b
2020-08-28 07:49:35 -07:00
b29375840a Revert D23379383: Land code_coverage_tool to caffe2/tools folder
Test Plan: revert-hammer

Differential Revision:
D23379383 (f06d3904f2)

Original commit changeset: f6782389ebb1

fbshipit-source-id: 33a26761deb58dfe81314ea912bf485c5fc962b7
2020-08-28 07:19:12 -07:00
c7787f7fbf [numpy compatibility]Fix argmin/argmax when multiple max/min values (#42004)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41998
Fixes https://github.com/pytorch/pytorch/issues/22853
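A small illustration of the NumPy-compatible tie-breaking targeted here (behavior assumed from the linked issues: ties resolve to the first occurrence):

```
import torch

x = torch.tensor([1, 3, 3, 2])
print(torch.argmax(x))  # tensor(1): index of the first maximal value, as in NumPy
print(torch.argmin(torch.tensor([2, 0, 0, 5])))  # tensor(1)
```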

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42004

Reviewed By: ngimel

Differential Revision: D23049003

Pulled By: mruberry

fbshipit-source-id: a6fddbadfec4b8696730550859395ce4f0cf50d6
2020-08-28 06:42:42 -07:00
26161e8ab6 [Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D23393950

fbshipit-source-id: 6a31b7ab6961cba88014f41b3ed1eda108edebab
2020-08-28 05:38:13 -07:00
f06d3904f2 Land code_coverage_tool to caffe2/tools folder
Summary:
Move `code_coverage_tool` from `experimental` folder to `caffe2/tools` folder.

Not sure if the fb-related code is something we don't want to share with OSS. Can reviewers please help me check `fbcode_coverage.py` and the files in the `fbcode/` folder?

Test Plan: Test locally

Reviewed By: malfet

Differential Revision: D23379383

fbshipit-source-id: f6782389ebb1b147eaf6d3664b5955db79d24ff3
2020-08-27 18:44:40 -07:00
654ab209c6 [JIT] Disable broken tests (#43750)
Summary:
These started failing after **https://github.com/pytorch/pytorch/pull/43633** for indecipherable reasons; disable them temporarily. The errors on the PRs were
```
Downloading workspace layers
  workflows/workspaces/3ca9ca71-7449-4ae1-bb7b-b7612629cc62/0/8607ba99-5ced-473b-b60a-0025b48739a6/0/105.tar.gz - 8.4 MB
Applying workspace layers
  8607ba99-5ced-473b-b60a-0025b48739a6
```
which is not too helpful...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43750

Reviewed By: ZolotukhinM

Differential Revision: D23388060

Pulled By: eellison

fbshipit-source-id: 96afa0160ec948049f3e194787a0a7ddbeb5124a
2020-08-27 18:12:57 -07:00
1a21c92364 [ONNX] Update in scatter ONNX export when scalar src has different type (#43440)
Summary:
`torch.scatter` allows `src` to be of a different type when `src` is a scalar. This requires an explicit cast op to be inserted in the ONNX graph, because ONNX `ScatterElements` does not allow different types. This PR updates the export of `torch.scatter` with this logic.
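The eager-mode behavior the export has to reproduce, as a quick sketch (values are illustrative):

```
import torch

self_ = torch.zeros(2, 3)            # float32
index = torch.tensor([[0, 1, 2]])
out = self_.scatter(1, index, 7)     # int scalar src into a float tensor
print(out.dtype)                     # torch.float32: the scalar is cast to self's dtype
```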

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43440

Reviewed By: hl475

Differential Revision: D23352317

Pulled By: houseroad

fbshipit-source-id: c9eeddeebb67fc3c40ad01def134799ef2b4dea6
2020-08-27 16:45:37 -07:00
87d7c362b1 [JIT] Add JIT support for torch.no_grad (#41371)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41371

**Summary**
This commit enables the use of `torch.no_grad()` in a with item of a
with statement within JIT. Note that the use of this context manager as
a decorator is not supported.
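A minimal sketch of the newly supported pattern:

```
import torch

@torch.jit.script
def fn(x):
    with torch.no_grad():
        y = x * 2
    return y

x = torch.ones(3, requires_grad=True)
print(fn(x).requires_grad)  # False: gradient tracking is disabled inside the block
```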

**Test Plan**
This commit adds a test case to the existing with statements tests for
`torch.no_grad()`.

**Fixes**
This commit fixes #40259.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D22649519

Pulled By: SplitInfinity

fbshipit-source-id: 7fa675d04835377666dfd0ca4e6bc393dc541ab9
2020-08-27 15:32:57 -07:00
8032dbc117 Add Rowwise Prune PyTorch op (#42708)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42708

Add rowwise prune pytorch op.

This operator introduces sparsity to the 'weights' matrix with the help
of the importance indicator 'mask'.

A row is considered important (and is not pruned) if the mask value for
that particular row is 1 (True), and unimportant otherwise.
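Illustrating the pruning semantics only (this is plain indexing, not the new op's API):

```
import torch

weights = torch.randn(4, 3)
mask = torch.tensor([True, False, True, False])  # importance indicator per row
kept = weights[mask]     # rows whose mask value is True survive pruning
print(kept.shape)        # torch.Size([2, 3])
```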

Test Plan:
buck test caffe2/torch/fb/sparsenn:test -- rowwise_prune
buck test caffe2/test:pruning

Reviewed By: supriyar

Differential Revision: D22849432

fbshipit-source-id: 456f4f77c04158cdc3830b2e69de541c7272a46d
2020-08-27 15:16:23 -07:00
3a0e35c9f2 [pytorch] deprecate static dispatch (#43564)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43564

Static dispatch was originally introduced for mobile selective build.

Since we have added selective build support for dynamic dispatch and
tested it in FB production for months, we can deprecate static dispatch
to reduce the complexity of the codebase.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23324452

Pulled By: ljk53

fbshipit-source-id: d2970257616a8c6337f90249076fca1ae93090c7
2020-08-27 14:52:48 -07:00
3afd24d62c [pytorch] check in default generated op dependency graph (#43570)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43570

Add the default op dependency graph to the source tree - use it if user runs
custom build in dynamic dispatch mode without providing the graph.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23326988

Pulled By: ljk53

fbshipit-source-id: 5fefe90ca08bb0ca20284e87b70fe1dba8c66084
2020-08-27 14:51:44 -07:00
9a2d4d550e update build flags for benchmark binaries
Summary:
As suggested by Shoaib Meenai, we should use mode/ndk_libcxx to replace mode/gnustl.

This diff updated all build flags for caffe2 and pytorch in aibench. For easy management, I created two mode files in xplat/caffe2/mode and deleted buckconfig.ptmobile.pep.

Test Plan:
caffe2
```
buck run aibench:run_bench -- -b aibench/specifications/models/caffe2/squeezenet/squeezenet.json --remote --devices s9f
```
https://our.intern.facebook.com/intern/aibench/details/433604719423848

full jit
```
buck run aibench:run_bench -- -b aibench/specifications/models/pytorch/fbnet/fbnet_mobile_inference.json --platform android/full_jit --framework pytorch --remote --devices SM-G960F-8.0.0-26
```
https://our.intern.facebook.com/intern/aibench/details/189359776958060

lite interpreter
```
buck run aibench:run_bench -- -b aibench/specifications/models/pytorch/fbnet/fbnet_mobile_inference.json --platform android --framework pytorch --remote --devices s9f
```
https://our.intern.facebook.com/intern/aibench/details/568178969092066

Reviewed By: smeenai

Differential Revision: D23338089

fbshipit-source-id: 62f4ae2beb004ceaab1f73f4de8ff9e0c152d5ee
2020-08-27 14:40:01 -07:00
01f974eb1e Specialize optionals for grad_sum_to_size (#43633)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43633

In the backward graph, _grad_sum_to_size is inserted whenever a possibly broadcasting op is called:
`"aten::_grad_sum_to_size(Tensor(a) self, int[]? size) -> Tensor(a)"`
If a broadcast occurred, a sum is called; otherwise the second input is None and the op is a no-op. Most of the time it's a no-op (in the fast RNNs benchmark, > 90% of the time).

We can get rid of this op by profiling the optionality of the second input. I added `prim::profile_optional` to do this, which counts the number of times it saw a None value and the number of times it saw a value present. When specializing the backward graph, we insert checks for values we profiled as None, and in the optimized block can remove the grad_sum_to_size calls that use those values.

In the future we may revisit this when NNC supports reductions and we want to replace grad_sum_to_size with sums as well, but I think this is worth landing now.
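A conceptual eager-mode equivalent of `aten::_grad_sum_to_size` (a sketch, not the internal implementation):

```
import torch

def grad_sum_to_size(grad, size):
    if size is None:
        return grad                     # no broadcast happened: the call is a no-op
    return grad.sum_to_size(*size)      # undo the broadcast by summing

g = torch.randn(4, 3)
print(grad_sum_to_size(g, None).shape)    # torch.Size([4, 3])
print(grad_sum_to_size(g, (1, 3)).shape)  # torch.Size([1, 3])
```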

Test Plan: Imported from OSS

Reviewed By: bwasti, ZolotukhinM

Differential Revision: D23358809

Pulled By: eellison

fbshipit-source-id: a30a148ca581370789d57ba082d23cbf7ef2cd4d
2020-08-27 14:35:37 -07:00
a19fd3a388 Add undefined specializations in backward (#43632)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43632

Specialize the backward graph by guarding on the undefinedness of the input tensors. The graph will look like:
```
ty1, ty2, successful_checks = prim::TypeCheck(...)
if (successful_checks)
-> optimized graph
else:
-> fallback graph
```

Specializing on the undefinedness of tensors allows us to clean up the
```
if any_defined(inputs):
 outputs = <original_computation>
else:
 outputs = autograd zero tensors
```
blocks that make up the backward graph, so that we can fuse the original_computation nodes together.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23358808

Pulled By: eellison

fbshipit-source-id: f5bb28f78a4a3082ecc688a8fe0345a8a098c091
2020-08-27 14:35:35 -07:00
a4cf4c2437 refactor tests (#43631)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43631

I added a new test for just profiler stuff - I don't think the test should go in test_jit.py. Maybe this should just go in test_tensorexpr_fuser, but I'm not really testing tensorexpr stuff either... LMK

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23358810

Pulled By: eellison

fbshipit-source-id: 074238e1b60e4c4a919a052b7a5312b790ad5d82
2020-08-27 14:35:33 -07:00
e189ef5577 Refactor pass to class (#43630)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43630

No functional changes here - just refactoring specialize autograd zero to a class, and standardizing its API to take in a shared_ptr<Graph>

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23358805

Pulled By: eellison

fbshipit-source-id: 42e19ef2e14df66b44592252497a47d03cb07a7f
2020-08-27 14:35:30 -07:00
d1c4d75c14 Add API for unexecuted op (#43629)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43629

We have a few places where we count the size of a block / subgraph - it's nice to have a shared API to ignore operators that are not executed in the optimized graph (will be used when I add a new profiling node in the PR above).

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23358807

Pulled By: eellison

fbshipit-source-id: 62c745d9025de94bdafd9f748f7c5a8574cace3f
2020-08-27 14:34:05 -07:00
5da97a38d1 Check if input is ChannelsLast or ChannelsLast3d for quantized AdaptivePool3d. (#42780)
Summary:
cc z-a-f, vkuzo. This serves as a very simple first step toward the issue mentioned in https://github.com/pytorch/pytorch/issues/42779.

# Description
Since `ChannelsLast` and `ChannelsLast3d` are not equivalent [(MemoryFormat.h)](4e93844ab1/c10/core/MemoryFormat.h (L27)), the "fast" path for `NDHWC` is ignored.

This PR would produce the expected behaviour for 4 (5 if including batch) dimensional tensors.
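A quick check that the two formats really are distinct, which is why the 5-D input needs its own test:

```
import torch

x4 = torch.randn(1, 8, 4, 4).to(memory_format=torch.channels_last)        # NHWC
x5 = torch.randn(1, 8, 2, 4, 4).to(memory_format=torch.channels_last_3d)  # NDHWC
print(x4.is_contiguous(memory_format=torch.channels_last))      # True
print(x5.is_contiguous(memory_format=torch.channels_last_3d))   # True
```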

# Benchmarks
## Notes
- For channels `< 8`, it is actually slower than before.
- For `qint32`, it is actually `2x` slower than before.
- For channels `> 8`, the execution time decreases up to `9-10` times in the benchmarks.
- While execution time does improve, it remains slower than the `contiguous` variant when channels `> 64`.

## C++
<img width="1667" alt="before_after_py" src="https://user-images.githubusercontent.com/37529096/89711911-5da22d80-d9e1-11ea-9b30-0c23d46c2c93.png">

## Python
<img width="1523" alt="before_after_cpp" src="https://user-images.githubusercontent.com/37529096/89711906-58dd7980-d9e1-11ea-9696-1963f394198a.png">

## Reproduce
See https://github.com/pytorch/pytorch/issues/42779.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42780

Reviewed By: smessmer

Differential Revision: D23035424

Pulled By: z-a-f

fbshipit-source-id: 15594846f66b73c22d2371eb8e47c472324d6139
2020-08-27 14:23:57 -07:00
cdc3e232e9 Add __str__ and __repr__ bindings to SourceRange (#43601)
Summary:
Added the bindings for `__str__` and `__repr__` methods for SourceRange

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43601

Test Plan:
`python test/test_jit.py`

cc gmagogsfm

Reviewed By: agolynski

Differential Revision: D23366500

Pulled By: gmagogsfm

fbshipit-source-id: ab4be6e8f9ad5f67a323554437878198483f4320
2020-08-27 12:30:47 -07:00
04ccd3ed77 Fix bazel dependencies (#43688)
Summary:
Add `header_template_rule` to `substitution.bzl`
Use it in BUILD.bazel to specify dependencies on autogenerated headers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43688

Test Plan: bazel build --sandbox_writable_path=$HOME/.ccache -c dbg :caffe2

Reviewed By: seemethere

Differential Revision: D23374702

Pulled By: malfet

fbshipit-source-id: 180dd996d1382df86258bb6abab9f2c7e964152e
2020-08-27 12:11:34 -07:00
bff741a849 Improve save_for_mobile cxx binary (#43721)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43721

We can combine the optimization pass and save_for_mobile to reduce friction. Since a lite interpreter model can also be used in the full JIT, I don't think we need the option to save it as a full JIT model.

Also
- improved usage message
- print op list before and after optimization pass

Test Plan:
```
buck run //xplat/caffe2:optimize_for_mobile -- --model=/home/linbin/sparkspot.pt

Building: finished in 12.4 sec (100%) 2597/2597 jobs, 2 updated
  Total time: 12.5 sec

pt_operator_library(
        name = "old_op_library",
        ops = [
                "aten::_convolution",
                "aten::adaptive_avg_pool2d",
                "aten::add_.Tensor",
                "aten::batch_norm",
                "aten::mul.Tensor",
                "aten::relu_",
                "aten::softplus",
                "aten::sub.Tensor",
        ],
)

pt_operator_library(
        name = "new_op_library",
        ops = [
                "aten::adaptive_avg_pool2d",
                "aten::add_.Tensor",
                "aten::batch_norm",
                "aten::mul.Tensor",
                "aten::relu_",
                "aten::softplus",
                "aten::sub.Tensor",
                "prepacked::conv2d_clamp_run",
        ],
)

The optimized model for lite interpreter was saved to /home/linbin/sparkspot_mobile_optimized.bc
```

```
buck run //xplat/caffe2:optimize_for_mobile -- --model=/home/linbin/sparkspot.pt --backend=vulkan
```

Reviewed By: kimishpatel

Differential Revision: D23363533

fbshipit-source-id: f7fd61aaeda5944de5bf198e7f93cacf8368babd
2020-08-27 11:01:12 -07:00
3830998ac3 [fx] When generating names, avoid shadowing builtins (#43653)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43653

When nodes are created without an explicit name, a name is generated
for them based on the target. In these cases, we need to avoid shadowing
builtin names. Otherwise, code like:
```
a.foo.bar
```
results in pretty-printed code like:
```
getattr = a.foo
getattr_1 = getattr.bar
```

While this is technically allowed in Python, it's probably a bad idea,
and more importantly is not supported by TorchScript (where `getattr` is
hardcoded).

This PR changes the name generation logic to avoid shadowing all
builtins and language keywords. We already do this for PyTorch
built-ins, so just extend that logic. So now the generated code will
look like:

```
getattr_1 = a.foo
getattr_2 = getattr_1.bar
```
Fixes #43522

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23357420

Pulled By: suo

fbshipit-source-id: 91e9974adc22987eca6007a2af4fb4fe67f192a8
2020-08-27 10:43:56 -07:00
5a1aa0e21e [reland][quant][graphmode][fx] Add e2e test on torchvision (#43587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43587

Add tests for graph mode quantization on torchvision and make sure it matches
current eager mode quantization

Test Plan:
Imported from OSS

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23331253

fbshipit-source-id: 0445a44145d99837a2c975684cd0a0b7d965c8f9
2020-08-27 10:12:07 -07:00
73dcfc5e78 Update RNN op registration format (#43599)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43599

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D23350223

Pulled By: iseeyuan

fbshipit-source-id: 94c528799e31b2ffb02cff675604e7cce639687f
2020-08-27 07:27:14 -07:00
288a2effa0 Operator generator based on templated selective build. (#43456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43456

Introduce the template OperatorGenerator, which returns an optional Operator. The optional is empty if the templated bool value is false.

RegisterOperators() is updated to take the optional Operator. An empty optional will not be registered.

With this update the selective operator registration can be done at compile time. Tests are added to show that an operator is registered if it's in the whitelist and not registered if it isn't.

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D23283563

Pulled By: iseeyuan

fbshipit-source-id: 456e0c72b2f335256be800aeabb797bd83bcf0b3
2020-08-27 07:26:07 -07:00
c25d0015f0 Autograd code clean up (#43167)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43167

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23222358

Pulled By: anjali411

fbshipit-source-id: b738c63b294bcee7d680fa64c6300007d988d218
2020-08-27 07:07:52 -07:00
de84db2a9d [TensorExpr] Add aten::sum lowering to the kernel (#43585)
Summary:
Handles all dimensions and selected dimensions, per PyTorch semantics.
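The two reduction forms the lowering must cover, in eager terms:

```
import torch

x = torch.randn(2, 3)
print(torch.sum(x))                       # reduce over all dimensions -> scalar
print(torch.sum(x, dim=1))                # reduce over a selected dimension
print(torch.sum(x, dim=1, keepdim=True))  # keep the reduced dimension
```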

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43585

Test Plan: test_tensorexpr

Reviewed By: bertmaher

Differential Revision: D23362382

Pulled By: asuhan

fbshipit-source-id: e8d8f1197a026be0b46603b0807d996a0de5d58c
2020-08-27 02:46:47 -07:00
48e08f884e C++ APIs TransformerEncoder (#43187)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43187

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23182770

Pulled By: glaringlee

fbshipit-source-id: 968846138d4b1c391a74277216111dba8b72d683
2020-08-27 01:31:46 -07:00
f63d06a57b Fix docs for kwargs, a-e (#43583)
Summary:
To reduce the chance of conflicts, not all ops are fixed. Ops starting with the letter `f` will be fixed in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43583

Reviewed By: ZolotukhinM

Differential Revision: D23330347

Pulled By: mruberry

fbshipit-source-id: 3387cb1e495faebd16fb183039197c6d90972ad4
2020-08-27 00:14:05 -07:00
a070c619b9 [FX] Native callables in FX lowering (#43426)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43426

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23273427

Pulled By: jamesr66a

fbshipit-source-id: 3a9d04486c72933d8afd9c181578fe98c3d825b0
2020-08-27 00:00:03 -07:00
79e6aaeb4c pull empty() out of use_c10_dispatcher: full (#43572)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43572

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D23326019

Pulled By: bhosmer

fbshipit-source-id: 10a4d7ffe33b4be4ae45396725456c6097ce1757
2020-08-26 22:51:06 -07:00
01b5c06254 [fix] handle empty args in chain_matmul (#43553)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41817

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43553

Reviewed By: agolynski

Differential Revision: D23342586

Pulled By: mruberry

fbshipit-source-id: c6349f8fa9fcefcf03681d92c085a21265d1e690
2020-08-26 18:54:46 -07:00
28be3ef2f2 Fix hipify script for pytorch extensions (#43528)
Summary:
PyTorch extensions can have .cpp or .h files which contain CUDA code that needs to be hipified. The current hipify script logic has overly strict conditions to determine which files get considered for hipification: https://github.com/pytorch/pytorch/blob/master/torch/utils/hipify/hipify_python.py#L146

These conditions might apply well to pytorch/caffe2 source code, but are overconstrained for third-party extensions.
`is_pytorch_file` conditions: https://github.com/pytorch/pytorch/blob/master/torch/utils/hipify/hipify_python.py#L549
`is_caffe2_gpu_file` conditions: https://github.com/pytorch/pytorch/blob/master/torch/utils/hipify/hipify_python.py#L561

This PR relaxes these conditions if we're hipifying a pytorch extension (specified by `is_pytorch_extension=True`) and considers all the file extensions specified using the `extensions` parameter: https://github.com/pytorch/pytorch/blob/master/torch/utils/hipify/hipify_python.py#L820
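A hedged usage sketch (`is_pytorch_extension` and `extensions` come from the PR description; the other argument names and the paths are assumed for illustration):

```
from torch.utils.hipify import hipify_python

hipify_python.hipify(
    project_directory="/path/to/my_extension",
    output_directory="/path/to/my_extension",
    extensions=(".cu", ".cuh", ".cpp", ".h"),
    is_pytorch_extension=True,  # consider every listed extension for hipification
)
```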

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43528

Reviewed By: mruberry

Differential Revision: D23328272

Pulled By: ngimel

fbshipit-source-id: 1e9c3a54ae2da65ac596a7ecd5539f3e14eeed88
2020-08-26 18:41:48 -07:00
c4e5ab6ff2 [TensorExpr] Disable a flaky test. (#43678)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43678

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D23363651

Pulled By: ZolotukhinM

fbshipit-source-id: 9557fbfda28633cea169836b02d034e9c950bc71
2020-08-26 18:35:24 -07:00
00c1501bc0 [JIT] Cast return values of functions returning Any (#42259)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42259

**Summary**
This commit modifies IR generation to insert explicit casts that cast
each return value to `Any` when a function is annotated as returning `Any`.
This precludes the failure in type unification (see below) that caused
this issue.
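A sketch of the previously-crashing pattern (adapted from the linked issue) that should now script cleanly:

```
import torch
from typing import Any

@torch.jit.script
def f(x: bool) -> Any:
    if x:
        return 1      # int on one path
    return "one"      # str on the other; both are cast to Any

print(f(True), f(False))  # 1 one
```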

Issue #41962 reported that the use of an `Any` return type in
combination with different code paths returning values of different
types causes a segmentation fault. This is because the exit transform
pass tries to unify the different return types, fails, but silently sets
the type of the if node to c10::nullopt. This causes problems later in
shape analysis when that type object is dereferenced.

**Test Plan**
This commit adds a unit test that checks that a function similar to the
one in #41962 can be scripted and executed.

**Fixes**
This commit fixes #41962.

Differential Revision: D22883244

Test Plan: Imported from OSS

Reviewed By: eellison, yf225

Pulled By: SplitInfinity

fbshipit-source-id: 523d002d846239df0222cd07f0d519956e521c5f
2020-08-26 18:24:11 -07:00
f73e32cd04 Reduce amount of work done within a global lock within ParallelLoadOp (#43508)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43508

Differential Revision: D22952007

fbshipit-source-id: 11e28d20175271e6068edce8cb36f9fcf867a02a
2020-08-26 18:19:40 -07:00
0bf27d64f4 Fix NaN propagation in fuser's min/max implementation (#43590)
Summary:
fmax/fmin return the non-NaN operand when one argument is NaN, which doesn't match the eager mode behavior (eager propagates the NaN).
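A minimal reproduction of the mismatch, using NumPy's fmax as a stand-in for the C fmax semantics the fuser was emitting:

```
import numpy as np
import torch

a = torch.tensor([1.0, float("nan")])
b = torch.tensor([float("nan"), 2.0])
print(torch.max(a, b))                # tensor([nan, nan]): eager propagates NaN
print(np.fmax(a.numpy(), b.numpy()))  # [1. 2.]: fmax returns the non-NaN operand
```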

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43590

Reviewed By: mruberry

Differential Revision: D23338664

Pulled By: bertmaher

fbshipit-source-id: b0316a6f01fcf8946ba77621efa18f339379b2d0
2020-08-26 17:31:06 -07:00
033b7ae3ef implement NumPy-like functionality maximum, minimum (#42579)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349

Implement NumPy-like functions `maximum` and `minimum`.
The `maximum` and `minimum` functions compare two input tensors element-wise, returning a new tensor with the element-wise maxima/minima.

If one of the elements being compared is NaN, then that element is returned. Neither `maximum` nor `minimum` supports complex inputs.

This PR also promotes the overloaded versions of torch.max and torch.min by re-dispatching binary `torch.max` and `torch.min` to `torch.maximum` and `torch.minimum`.
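The NaN-propagation rule in action (a quick sketch):

```
import torch

a = torch.tensor([1.0, float("nan"), 3.0])
b = torch.tensor([2.0, 0.0, float("nan")])
print(torch.maximum(a, b))  # tensor([2., nan, nan])
print(torch.minimum(a, b))  # tensor([1., nan, nan])
```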

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42579

Reviewed By: mrshenli

Differential Revision: D23153081

Pulled By: mruberry

fbshipit-source-id: 803506c912440326d06faa1b71964ec06775eac1
2020-08-26 16:56:12 -07:00
9ca338a9d4 [ONNX] Modified slice node in inplace ops pass (#43275)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43275

Reviewed By: hl475

Differential Revision: D23352540

Pulled By: houseroad

fbshipit-source-id: 7fce3087c333efe3db4b03e9b678d0bee418e93a
2020-08-26 16:51:20 -07:00
1bda5e480c Add Python code coverage (#43600)
Summary:
- Replace the `test` stage with a `coverage_test` stage for the `pytorch-linux-bionic-py3.8-gcc9` configuration
- Add `coverage.xml` to the list of ignored files
- Add `codecov.yml` that maps installed pytorch folders back to their original locations
- Clean up coverage option utilization in `run_test.py` and adapt it towards combining coverage reports across runs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43600

Reviewed By: seemethere

Differential Revision: D23351877

Pulled By: malfet

fbshipit-source-id: acf78ae4c8f3e23920a76cce1d50f2821b83eb06
2020-08-26 16:16:03 -07:00
88e35fb8bd Skip SVD tests when no lapack (#43566)
Summary:
These tests fail on one of my systems that does not have LAPACK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43566

Reviewed By: ZolotukhinM

Differential Revision: D23325378

Pulled By: mruberry

fbshipit-source-id: 5d795e460df0a2a06b37182d3d4084d8c5c8e751
2020-08-26 15:58:31 -07:00
cf26050e29 [pytorch] Move TensorIteratorConfig method implementation to cpp file (#43554)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43554

Move function implementations in the TensorIteratorConfig Class from TensorIterator.h to TensorIterator.cpp to avoid this issue: https://github.com/pytorch/pytorch/issues/43300

Reviewed By: malfet

Differential Revision: D23319007

fbshipit-source-id: 6cc3474994ea3094a294f795ac6998c572d6fb9b
2020-08-26 15:18:37 -07:00
6c28df7ceb [fx] add test for args/kwargs handling (#43640)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43640

+ added a `self.checkGraphModule` utility function to wrap the common
test assert pattern.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D23356262

Pulled By: suo

fbshipit-source-id: a50626dcb01246d0dbd442204a8db5958cae23ab
2020-08-26 14:39:25 -07:00
5a15f56668 match batchmatmul on 1.0.0.6 (#43559)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43559

- remove mkl strided gemm since it was acting weird in some cases; use the plain for-loop gemm for now. This has performance implications, but it closes the gap for the ctr_instagram_5x model
- reproduced the failure scenario of batchmatmul on ctr_instagram_5x by increasing the dimensions of the inputs
- added an option in netrunner to skip bmm if needed

Test Plan:
- net runner passes with ctr_instagram 5x
- bmm unit test repros the discrepancy fixed

Reviewed By: amylittleyang

Differential Revision: D23320857

fbshipit-source-id: 7d5cfb23c1b0d684e1ef766f1c1cd47bb86c9757
2020-08-26 14:35:31 -07:00
769b9381fc DDP Communication hook: Fix the way we pass future result to buckets. (#43307)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43307

I identified a bug with the DDP communication hook while running accuracy benchmarks: I was getting `loss=nan`.

It looks like when we re-`initialize_bucketviews` with the value of `future_work`, the `bucket_view.copy_(grad)` call in `Reducer::mark_variable_ready_dense` wasn't copying the `grads` back into the contents, since `bucket_view` no longer had any relationship with `contents` after being re-initialized with something else. Across multiple iterations this was causing problems.
I solved this by adding two states for `bucket_view`:
```
    // bucket_views_in[i].copy_(grad) and
    // grad.copy_(bucket_views_out[i])
    // provide convenient ways to move grad data in/out of contents.
    std::vector<at::Tensor> bucket_views_in;
    std::vector<at::Tensor> bucket_views_out;
```

I included two additional unit tests where we run multiple iterations for better test coverage:
1) `test_accumulate_gradients_no_sync_allreduce_hook`
2) `test_accumulate_gradients_no_sync_allreduce_with_then_hook`.

ghstack-source-id: 110728299

Test Plan:
Run `python test/distributed/test_c10d.py`, some perf&accuracy benchmarks.

New tests:
`test_accumulate_gradients_no_sync_allreduce_hook`
`test_accumulate_gradients_no_sync_allreduce_with_then_hook`

Acc benchmark results look okay:
f214188350

Reviewed By: agolynski

Differential Revision: D23229309

fbshipit-source-id: 329470036cbc05ac12049055828495fdb548a082
2020-08-26 14:22:09 -07:00
0521c71241 [D23047144 Duplicate][2/3][lite interpreter] add metadata when saving and loading models for mobile (#43584)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43584

1. add `metadata.pkl` to the `.bc` file, which includes the model info that we are interested in
2. load `metadata.pkl` as an attribute `unordered_map<string, string>` in the module
ghstack-source-id: 110730013

Test Plan:
- CI
```buck build //xplat/caffe2:jit_module_saving
```
```buck build //xplat/caffe2:torch_mobile_core
```

Reviewed By: xcheng16

Differential Revision: D23330080

fbshipit-source-id: 5d65bd730b4b566730930d3754fa1bf16aa3957e
2020-08-26 14:07:49 -07:00
306eb3def7 Additional error checking for torch.cuda.nccl APIs. (#43247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43247

`torch.cuda.nccl` APIs didn't throw appropriate errors when called
with inputs/outputs that were of the wrong type and it resulted in some cryptic
errors instead.

Adding some error checks with explicit error messages for these APIs.
ghstack-source-id: 110683546

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D23206069

fbshipit-source-id: 8107b39d27f4b7c921aa238ef37c051a9ef4d65b
2020-08-26 13:50:00 -07:00
db1fbc5729 [OACR][NLU] Add aten::str operator (#43573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43573

We recently updated the Stella NLU model in D23307228, and the App started to crash with `Following ops cannot be found:{aten::str, }`.

Test Plan: Verified by installing the assistant-playground app on Android.

Reviewed By: czlx0701

Differential Revision: D23325409

fbshipit-source-id: d670242868774bb0aef4be5c8212bc3a3f2f667c
2020-08-26 13:27:11 -07:00
6459f0a077 added rocm 3.7 docker image (#43576)
Summary:
Added bionic rocm 3.7 docker image

- jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43576

Reviewed By: malfet

Differential Revision: D23352310

Pulled By: seemethere

fbshipit-source-id: fd544b3825d8c25587f5765332c0a8ed1fa63c6e
2020-08-26 12:39:46 -07:00
a91e1cedc5 Reduce number of hypothesis tests in CI (#43591)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43591

100 randomized inputs vs 50 doesn't change the balance that much, but it speeds up test runtime.

Test Plan: CI

Reviewed By: orionr, seemethere

Differential Revision: D23332393

fbshipit-source-id: 7a8ff9127ee3e045a83658a7a670a844f3862987
2020-08-26 11:54:49 -07:00
2a4d312027 Allow GPU skip decorators to report the right number of GPUs required in (#43468)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43468

Closes https://github.com/pytorch/pytorch/issues/41378.
https://github.com/pytorch/pytorch/pull/41973 enhanced the skip decorators to
report the right no. of GPUs required, but this information was not passed to
the main process where the message is actually displayed. This PR uses a
`multiprocessing.Manager()` so that the dictionary modification is reflected
correctly in the main process.
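A standalone sketch of why the Manager is needed: updates a child process makes to a plain dict are invisible to the parent, while a Manager dict is shared across processes.

```
import multiprocessing as mp

def worker(d):
    d["needed_gpus"] = 4  # recorded in the child process

if __name__ == "__main__":
    mgr = mp.Manager()
    shared = mgr.dict()
    p = mp.Process(target=worker, args=(shared,))
    p.start()
    p.join()
    print(dict(shared))  # {'needed_gpus': 4}: the update is visible in the parent
```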
ghstack-source-id: 110684228

Test Plan:
With this diff, we can run a test in such as in https://github.com/pytorch/pytorch/pull/42577 that requires 4 GPUs on a 2 GPU machine, and we get the expected message:

```
test_ddp_uneven_inputs_replicated_error (test_distributed.TestDistBackend) ... skipped 'Need at least 4 CUDA devices'
```

Reviewed By: mrshenli

Differential Revision: D23285790

fbshipit-source-id: ac32456ef3d0b1d8f1337a24dba9f342c736ca18
2020-08-26 11:44:13 -07:00
25dcc28cd6 [jit][static] Replace deepcopy with copy (#43182)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43182

We should avoid using `deepcopy` on the module because it involves copying the weights.

Comparing the implementation of `c10::ivalue::Object::copy()` vs `c10::ivalue::Object::deepcopy()`, the only difference is `deepcopy` copies the attributes (slots) while `copy` does not.

Reviewed By: bwasti

Differential Revision: D23171770

fbshipit-source-id: 3cd711c6a2a19ea31d1ac1ab2703a0248b5a4ef3
2020-08-26 11:15:49 -07:00
51861cc9b1 .circleci: Add CUDA 11 to nightly binary builds (#43366)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43366

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D23348556

Pulled By: seemethere

fbshipit-source-id: 0cd129c5c27ffceec80636384762c3ff7bf74fdc
2020-08-26 10:11:01 -07:00
42f6c3b1f4 Raise error on device mismatch in addmm (#43505)
Summary:
Fixes gh-42282

This adds a device-mismatch check to `addmm` on CPU and CUDA, although it seems like the dispatcher always selects the CUDA version here if any of the inputs are on GPU. So in theory the CPU check is unnecessary, but it's probably better to err on the side of caution.
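A sketch of the guarded case (requires a CUDA device; the exact message wording may differ):

```
import torch

if torch.cuda.is_available():
    m = torch.randn(2, 2)                 # CPU
    a = torch.randn(2, 2, device="cuda")  # CUDA
    try:
        torch.addmm(m, a, a)
    except RuntimeError as e:
        print(e)  # a clear device-mismatch error instead of a cryptic failure
```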

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43505

Reviewed By: mruberry

Differential Revision: D23331651

Pulled By: ngimel

fbshipit-source-id: 8eb2f64f13d87e3ca816bacec9d91fe285d83ea0
2020-08-26 09:37:57 -07:00
7beeef2c69 .jenkins: Remove openssh installs (#43597)
Summary:
openssh should be installed either by the CircleCI machines or by the Jenkins
workers, so we shouldn't need to install it ourselves in order to get ssh
functionality

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43597

Reviewed By: ezyang

Differential Revision: D23333479

Pulled By: seemethere

fbshipit-source-id: 17a1ad0200a9df7d4818ab1ed44c8488ec8888fb
2020-08-26 09:36:53 -07:00
573940f8d7 Fix type annotation errors in torch.functional (#43446)
Summary:
Closes gh-42968

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43446

Reviewed By: albanD

Differential Revision: D23280962

Pulled By: malfet

fbshipit-source-id: de5386a95a20ecc814c39cbec3e4252112340b3a
2020-08-26 08:27:59 -07:00
2b70f82737 fix typo in test_dataloader test_multiprocessing_contexts (take 2) (#43588)
Summary:
2nd attempt to land https://github.com/pytorch/pytorch/pull/43343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43588

Reviewed By: seemethere

Differential Revision: D23332284

Pulled By: malfet

fbshipit-source-id: d78faf468c56af2f176dbdd2ce4bd51f0b5df6fd
2020-08-25 21:11:53 -07:00
c1553ff94b Benchmarks: temporarily disable profiling-te configuration. (#43603)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43603

We are in the midst of landing a big reword of profiling executor and
benchmarks are expected to fail while we are in the transitional state.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D23334818

Pulled By: ZolotukhinM

fbshipit-source-id: 99ff17c6f8ee18d003f6ee76ff0e719cea68c170
2020-08-25 21:00:10 -07:00
3ec24f02af [TensorExpr] Start using typecheck in the fuser. (#43173)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43173

With this change the fuser starts to generate typechecks for inputs of
fusion group. For each fusion group we generate a typecheck and an if
node: the true block contains the fused subgraph, the false block
contains unoptimized original subgraph.

Differential Revision: D23178230

Test Plan: Imported from OSS

Reviewed By: eellison

Pulled By: ZolotukhinM

fbshipit-source-id: f56e9529613263fb3e6575869fdb49973c7a520b
2020-08-25 18:13:32 -07:00
b763666f9f [JIT] Subgraph utils: add an optional vmap argument to the API to allow retrieving value mappings. (#43235)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43235

This functionality is needed when we want to not lose track of
nodes/values as we merge and unmerge them into other nodes. For
instance, if we have a side data structure with some meta information
about values or nodes, this new functionality would allow to keep that
metadata up to date after merging and unmerging nodes.

Differential Revision: D23202648

Test Plan: Imported from OSS

Reviewed By: eellison

Pulled By: ZolotukhinM

fbshipit-source-id: 350d21a5d462454166f8a61b51d833551c49fcc9
2020-08-25 18:13:29 -07:00
d18566c617 [TensorExpr] Fuser: disallow aten::slice nodes. (#43365)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43365

We don't have shape inference for them yet.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23253418

Pulled By: ZolotukhinM

fbshipit-source-id: 9c38778b8a616e70f6b2cb5aab03d3c2013b34b0
2020-08-25 18:13:27 -07:00
8dc4b415eb [TensorExpr] Fuser: only require input shapes to be known (output shapes can be inferred). (#43171)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43171

Differential Revision: D23178228

Test Plan: Imported from OSS

Reviewed By: eellison

Pulled By: ZolotukhinM

fbshipit-source-id: e3465066e0cc4274d28db655de274a51c67594c4
2020-08-25 18:13:25 -07:00
f6b7c6da19 [TensorExpr] Fuser: move canHandle and some other auxiliary functions into TensorExprFuser class. (#43170)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43170

Differential Revision: D23178227

Test Plan: Imported from OSS

Reviewed By: eellison

Pulled By: ZolotukhinM

fbshipit-source-id: 3c3a0215344fb5942c4f3078023fef32ad062fe9
2020-08-25 18:12:01 -07:00
f35e069622 Back out "Make grad point to bucket buffer in DDP to save memory usage" (#43557)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43557

backout the diff that caused some errors in pytext distributed training

Test Plan: Tested by rayhou who verified reverting the diff works

Differential Revision: D23320238

fbshipit-source-id: caa0fe74404059e336cd95fdb41373f58ecf486e
2020-08-25 18:04:39 -07:00
58666982fb check in intel nnpi 1007 into fbcode/tp2
Summary: As title

Test Plan:
* Details of conducted tests can be found in https://fb.workplace.com/groups/527892364588452/permalink/615694119141609/
* Sandcastle

Reviewed By: arunm-git

Differential Revision: D23198458

fbshipit-source-id: dd8d34a985dced66a5624a21e5d4a7e9a499ce39
2020-08-25 17:59:11 -07:00
b3f8834033 Batching rule for torch.pow, torch.result_type (#43515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43515

This PR adds a batching rule for torch.pow. This required adding a
batching rule for torch.result_type.

Test Plan: - added new tests: `pytest test/test_vmap.py -v`

Reviewed By: cpuhrsch

Differential Revision: D23302737

Pulled By: zou3519

fbshipit-source-id: 2cade358750f6cc3abf45f81f2394900600927cc
2020-08-25 17:55:53 -07:00
c9f125bf70 Black to Block for various files (#42913)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41735, #41736, https://github.com/pytorch/pytorch/issues/41737, and #41738: all areas where "black" is mentioned are replaced with "block".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42913

Reviewed By: houseroad

Differential Revision: D23112873

Pulled By: malfet

fbshipit-source-id: a515b56dc2ed20aa75741c577988d95f750b364c
2020-08-25 17:43:31 -07:00
348e78b086 Evenly distribute output grad into all matching inputs for min/max/median (#43519)
Summary:
cc: ngimel mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43519

Reviewed By: albanD

Differential Revision: D23312235

Pulled By: ngimel

fbshipit-source-id: 678bda54996df7f29acf96add928bb7042fc2069
2020-08-25 16:36:33 -07:00
be637fd5f6 Revert D23306683: [quant][graphmode][fx] Testing torchvision
Test Plan: revert-hammer

Differential Revision:
D23306683 (62dcd253e3)

Original commit changeset: 30d27e225d45

fbshipit-source-id: e661334d187d3d6756facd36f2ebdb3ab2cd2e26
2020-08-25 15:24:02 -07:00
05f27b18fb Back out D23047144 "[2/3][lite interpreter] add metadata when saving and loading models for mobile"
Summary:
Original commit changeset: f368d00f7bae

Back out "[2/3][lite interpreter] add metadata when saving and loading models for mobile"

D23047144 (e37f871e87)

Pull Request: https://github.com/pytorch/pytorch/pull/43516

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: xcheng16

Differential Revision: D23304639

fbshipit-source-id: 970ca3438c1858f8656cbcf831ffee2c4a551110
2020-08-25 14:58:38 -07:00
5ca6cbbd93 Remove unnecessary copies in ProcessGroupGloo for multiple inputs allreduce (#43543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43543

Closes https://github.com/pytorch/pytorch/issues/14691. These copies are not needed in the multiple-outputs case, because gloo allreduce will broadcast the result tensor to all the outputs. See https://github.com/facebookincubator/gloo/issues/152 and commit 9cabb5aaa4 for more details. Came across this when debugging https://github.com/pytorch/pytorch/pull/42577.

This effectively reverts https://github.com/pytorch/pytorch/pull/14688 while still keeping the tests.

Tested by ensuring `test_allreduce_basics` in `test_c10d.py` still works as expected.
ghstack-source-id: 110636498

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23173945

fbshipit-source-id: d1ae08f84b4ac9919c53080949b8fffcb2fe63a8
2020-08-25 14:01:26 -07:00
9b05fbd92e Correct the windows docs (#43479)
Summary:
Fixes https://discuss.pytorch.org/t/i-cannot-use-the-pytorch-that-was-built-successfully-from-source-dll-initialization-routine-failed-error-loading-caffe2-detectron-ops-gpu-dll/93243/5?u=peterjc123.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43479

Reviewed By: mrshenli, ngimel

Differential Revision: D23294211

Pulled By: ezyang

fbshipit-source-id: d67df7d0355c2783153d780c94f959758b246d36
2020-08-25 13:41:24 -07:00
3df398a3a8 Update the QR documentation to include a warning about when the QR.backward is well-defined. (#43547)
Summary:
As per title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43547

Reviewed By: mruberry

Differential Revision: D23318829

Pulled By: albanD

fbshipit-source-id: 4764ebe1ad440e881b1c4c88b16fb569ef8eb0fa
2020-08-25 13:19:25 -07:00
62dcd253e3 [quant][graphmode][fx] Testing torchvision (#43526)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43526

Add tests for graph mode quantization on torchvision and make sure it matches
current eager mode quantization

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23306683

fbshipit-source-id: 30d27e225d4557bfc1d9aa462086e416aa9a9c0e
2020-08-25 13:02:14 -07:00
9420c773d0 Revert D23299452: [pytorch][PR] fix typo in test_dataloader test_multiprocessing_contexts
Test Plan: revert-hammer

Differential Revision:
D23299452 (6a2d7a05c4)

Original commit changeset: 9489c48b83bc

fbshipit-source-id: e8c15d338dd89d8e92f3710e9cf149149bd2e763
2020-08-25 12:34:49 -07:00
ebc0fc4dfc Polish the nightly.py docs in CONTRIBUTING a little (#43494)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43494

Reviewed By: mruberry

Differential Revision: D23296032

Pulled By: ngimel

fbshipit-source-id: c85a6d4c39cbb60644f79136a6f21fd49c813b61
2020-08-25 12:13:27 -07:00
3dcfe84861 Grammatical corrections (#43473)
Summary:
**Few documentation corrections.**

1. [...] If there is hard-to-debug error in one of your TorchScript **models**, you can use this flag [...]
2. [...] Since TorchScript (scripting and tracing) **is** disabled with this flag [...]

**Before corrections (as of now):**
![before-fix](https://user-images.githubusercontent.com/45713346/90977203-d8bc2580-e543-11ea-9609-fbdf5689dcb9.jpg)

**After corrections:**
![after-fix](https://user-images.githubusercontent.com/45713346/90977209-dbb71600-e543-11ea-8259-011618efd95b.jpg)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43473

Reviewed By: mruberry

Differential Revision: D23296167

Pulled By: ngimel

fbshipit-source-id: 932c9b25cc79d6e266e5ddb3744573b0bd63d925
2020-08-25 12:09:14 -07:00
f32ca57c5e Fix typo in LSTMCell document (#43395)
Summary:
Fixes typo in document

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43395

Reviewed By: mruberry

Differential Revision: D23312561

Pulled By: ngimel

fbshipit-source-id: 28340c96faf52c17acfe9f6b1dd94b71ea4d60ce
2020-08-25 12:04:59 -07:00
f8e9e7ad4a Allocating warp to an input index in compute_cuda_kernel (#43354)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43354

Instead of assigning a thread to an input index for repeating that index, we assign a warp to an index. This helps us avoid the costly uncoalesced memory accesses and branch divergence which occur when each thread repeats the index.

Test Plan: Run trainer to test

Reviewed By: ngimel

Differential Revision: D23230917

fbshipit-source-id: 731e912c844f1d859b0384fcaebafe69cb4ab56a
2020-08-25 10:47:50 -07:00
76894062dc move wholearchive to link option (#43485)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43216

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43485

Reviewed By: glaringlee

Differential Revision: D23318735

Pulled By: malfet

fbshipit-source-id: 90c316d3d5ed51afcff356e6d9219950f119a902
2020-08-25 10:36:10 -07:00
1089ff404c Refactored the duplicate code into a function in _ConvNd (#43525)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43525

Reviewed By: ngimel

Differential Revision: D23306593

Pulled By: jerryzh168

fbshipit-source-id: 3427cd2b9132a203858477b6c858d59b00e1282e
2020-08-25 10:00:07 -07:00
8ecfa9d9a2 [cmake] End support for python3.5 for pytorch (#43105)
Summary:
PyTorch uses f-strings in its Python code.
Python support for f-strings started with version 3.6, so building with Python 3.5 or older fails on the latest release/master.
This patch checks the Python version used for the build and mandates that it be 3.6 or higher.

Signed-off-by: Parichay Kapoor <kparichay@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43105

Reviewed By: glaringlee

Differential Revision: D23301481

Pulled By: malfet

fbshipit-source-id: e9b4f7bffce7384c8ade3b7d131b10cf58f5e8a0
2020-08-25 09:42:42 -07:00
6a2d7a05c4 fix typo in test_dataloader test_multiprocessing_contexts (#43343)
Summary:
https://github.com/pytorch/pytorch/issues/22990 added a multiprocessing_context argument to DataLoader, but a typo in the test causes the wrong DataLoader class to be used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43343

Reviewed By: glaringlee

Differential Revision: D23299452

Pulled By: malfet

fbshipit-source-id: 9489c48b83bce36f46d350cad902f7ad96e1eec4
2020-08-25 09:36:56 -07:00
b430347a60 Address JIT/Mypy issue with torch._VF (#43454)
Summary:
- `torch._VF` is a hack to work around the lack of support for `torch.functional` in the JIT
- that hack hides `torch._VF` functions from Mypy
- could be worked around by re-introducing a stub file for `torch.functional`, but that's undesirable
- so instead try to make both happy at the same time: the type ignore comments are needed for Mypy, and don't seem to affect the JIT after excluding them from the `get_type_line()` logic

Encountered this issue while trying to make `mypy` run on `torch/functional.py` in gh-43446.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43454

Reviewed By: glaringlee

Differential Revision: D23305579

Pulled By: malfet

fbshipit-source-id: 50e490693c1e53054927b57fd9acc7dca57e88ca
2020-08-25 09:23:54 -07:00
f02753fabb Support AMP in nn.parallel (#43102)
Summary:
Take care of the state of autocast in `parallel_apply`, so there is no need to decorate model implementations.
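A sketch of the usage this enables (requires CUDA): autocast entered in the main thread now also applies inside the worker threads that `parallel_apply` spawns.

```
import torch
import torch.nn as nn

if torch.cuda.is_available():
    model = nn.DataParallel(nn.Linear(4, 4).cuda())
    with torch.cuda.amp.autocast():
        out = model(torch.randn(8, 4, device="cuda"))
    print(out.dtype)  # torch.float16 under autocast
```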

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43102

Reviewed By: ngimel

Differential Revision: D23294610

Pulled By: mrshenli

fbshipit-source-id: 0fbe0c79de976c88cadf2ceb3f2de99d9342d762
2020-08-25 08:38:49 -07:00
cbdaa20c88 [serialize] Expose zip file alignment calculation functions (#43531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43531

It's useful for building some tooling out of tree to manipulate zip files in a PyTorch-y way

Test Plan: contbuild

Reviewed By: houseroad

Differential Revision: D23277361

fbshipit-source-id: e15fad20e792d1e41018d32fd48295cfe74bea8c
2020-08-25 02:32:58 -07:00
d1d32003bb force pytorch tensors to contiguous before calling c2 ops
Summary: per title, makes c2 wrappers safer as contiguity of torch inputs is not guaranteed

Test Plan: covered by existing tests

Reviewed By: dzhulgakov

Differential Revision: D23310137

fbshipit-source-id: 3fe12abc7e394b8762098d032200778018e5b591
2020-08-24 23:04:13 -07:00
675f3f0482 Fix "save binary size" steps (#43529)
Summary:
The `pip3` alias might not be available, so call `python3 -mpip` to be on the safe side.
Should fix failures like this one:
https://app.circleci.com/pipelines/github/pytorch/pytorch/203448/workflows/3837b2d6-b089-4a19-b797-38bdf989c82e/jobs/6913032/parallel-runs/0/steps/0-109

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43529

Reviewed By: seemethere

Differential Revision: D23307306

Pulled By: malfet

fbshipit-source-id: b55e6782b29f1a1f56787902cbb85b3c3d20370c
2020-08-24 19:25:33 -07:00
f80b695a75 Properly format db.h and db.cc (#43027)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43027

Format db.h and db.cc using the default formatter.

This change was split off of D22705434.

Test Plan: Wait for sandcastle.

Reviewed By: rohithmenon, marksantaniello

Differential Revision: D23113765

fbshipit-source-id: 3f02d55bfb055bda0fcba5122336fa001562d42e
2020-08-24 18:29:45 -07:00
7b243a4d46 [quant][graphmode[fx][test][refactor] Refactor tests for graph mode quantization on fx (#43445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43445

changed the interface for checkGraphModule to make the arguments more explicit
as requested in https://github.com/pytorch/pytorch/pull/43437

Test Plan:
TestQuantizeFx

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23280586

fbshipit-source-id: 5b5859e326d149a5aacb1d15cbeee69667cc9109
2020-08-24 17:58:55 -07:00
87905b5856 [pytorch] add option to include autograd for code analyzer (#43155)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43155

Update the code_analyzer build.sh script to be able to take additional build flags in the mobile build/analysis

Test Plan:
Checkout associated PR or copy contents of build.sh into PyTorch repo (must be run from root of PyTorch repo)

To run with inclusion of autograd dependencies (note BUILD_MOBILE_AUTOGRAD is still an experimental build flag): `ANALYZE_TORCH=1 DEPLOY=1 BASE_OPS_FILE=/path/to/baseopsfile MOBILE_BUILD_FLAGS='-DBUILD_MOBILE_AUTOGRAD=ON' tools/code_analyzer/build.sh`

Reviewed By: ljk53

Differential Revision: D23065754

fbshipit-source-id: d83a7ad62ad366a84725430ed020adf4d56687bd
2020-08-24 15:04:43 -07:00
284ff04792 [quant] Support set API for EmbeddingBag quantization (#43433)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43433

Add support for torch.quint8 dtype

Test Plan: Imported from OSS

Reviewed By: radkris-git

Differential Revision: D23277002

fbshipit-source-id: 4204bc62f124b4fd481aaa6aa47b9437978c43ee
2020-08-24 14:33:35 -07:00
e37f871e87 [2/3][lite interpreter] add metadata when saving and loading models for mobile
Summary:
1. add `metadata.pkl` to the `.bc` file, which includes the model info that we are interested in
2. load `metadata.pkl` as an attribute `unordered_map<string, string>` in the module

Test Plan:
- CI
```buck build //xplat/caffe2:jit_module_saving
```
```buck build //xplat/caffe2:torch_mobile_core
```

Reviewed By: xcheng16

Differential Revision: D23047144

fbshipit-source-id: f368d00f7baef2d3d15f89473cdb146467aa1e0b
2020-08-24 13:40:52 -07:00
ed8b08a3ba Update quantize_jit to handle new upsample overloads (#43407)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43407

ghstack-source-id: 110404846

Test Plan:
test_general_value_ops passes with D21209991 applied.
(Without this diff D21209991 breaks that test.)

Reviewed By: jerryzh168

Differential Revision: D23256503

fbshipit-source-id: 0f75e50a9f7fccb5b4325604319a5f76b42dfe5e
2020-08-24 13:33:47 -07:00
e08e93f946 Reland of benchmark code (#43428)
Summary:
Reland of the benchmark code that broke the slow tests because the GPU were running out of memory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43428

Reviewed By: ngimel

Differential Revision: D23296136

Pulled By: albanD

fbshipit-source-id: 0002ae23dc82f401604e33d0905d6b9eedebc851
2020-08-24 13:27:26 -07:00
4cfac34075 [ROCm] allow .jenkins/pytorch/test.sh to run on centos (#42197)
Summary:
This doesn't fix any reported issue. We validate ROCm PyTorch on Ubuntu and CentOS. For CentOS, we must modify the test.sh script to let it run there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42197

Reviewed By: ezyang, ngimel

Differential Revision: D23175669

Pulled By: malfet

fbshipit-source-id: 0da435de6fb17d2ca48e924bec90ef61ebbb5042
2020-08-24 13:12:49 -07:00
35a36c1280 Implement JIT Enum type serialization and deserialization (#43460)
Summary:
[Re-review tips: nothing changed other than a type in python_ir.cpp to fix a windows build failure]

- Adds code printing for enum type
- Enhances enum type to include all contained enum names and values
- Adds code parsing for enum type in deserialization
- Enables serialization/deserialization tests in most TestCases (with a few dangling issues to be addressed in later PRs, to avoid growing this PR too large)
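A sketch of the kind of program this work supports end-to-end (assuming enum scripting is fully enabled):

```
from enum import Enum

import torch

class Color(Enum):
    RED = 1
    GREEN = 2

@torch.jit.script
def pick(c: Color) -> int:
    if c == Color.RED:
        return 1
    return 2

print(pick(Color.RED))  # 1
```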

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43460

Reviewed By: albanD

Differential Revision: D23284929

Pulled By: gmagogsfm

fbshipit-source-id: e3e81d6106f18b7337ac3ff5cd1eeaff854904f3
2020-08-24 12:04:31 -07:00
0fa99d50bc Enable torch.cuda.memory typechecking (#43444)
Summary:
Add a number of function prototypes defined in torch/csrc/cuda/Module.cpp to `__init__.pyi.in`

Fixes https://github.com/pytorch/pytorch/issues/43442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43444

Reviewed By: ezyang

Differential Revision: D23280221

Pulled By: malfet

fbshipit-source-id: 7d67dff7b24c8d7b7e72c919e6e7b847f242ef83
2020-08-24 11:46:04 -07:00
7024ce8a2c [quant] Add benchmarks for quantized embeddingbag module (#43296)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43296

Use common config for float and quantized embedding_bag modules

Test Plan:
```
python -m pt.qembeddingbag_test

 Benchmarking PyTorch: qEmbeddingBag
 Mode: Eager
 Name: qEmbeddingBag_embeddingbags10_dim4_modesum_input_size8_offset0_sparseTrue_include_last_offsetTrue_cpu
 Input: embeddingbags: 10, dim: 4, mode: sum, input_size: 8, offset: 0, sparse: True, include_last_offset: True, device: cpu
Forward Execution Time (us) : 35.738

 Benchmarking PyTorch: qEmbeddingBag
 Mode: Eager
 Name: qEmbeddingBag_embeddingbags10_dim4_modesum_input_size8_offset0_sparseTrue_include_last_offsetFalse_cpu
 Input: embeddingbags: 10, dim: 4, mode: sum, input_size: 8, offset: 0, sparse: True, include_last_offset: False, device: cpu
Forward Execution Time (us) : 62.708

python -m pt.embeddingbag_test

 Benchmarking PyTorch: embeddingbag
 Mode: Eager
 Name: embeddingbag_embeddingbags10_dim4_modesum_input_size8_offset0_sparseTrue_include_last_offsetTrue_cpu
 Input: embeddingbags: 10, dim: 4, mode: sum, input_size: 8, offset: 0, sparse: True, include_last_offset: True, device: cpu
Forward Execution Time (us) : 46.878

 Benchmarking PyTorch: embeddingbag
 Mode: Eager
 Name: embeddingbag_embeddingbags10_dim4_modesum_input_size8_offset0_sparseTrue_include_last_offsetFalse_cpu
 Input: embeddingbags: 10, dim: 4, mode: sum, input_size: 8, offset: 0, sparse: True, include_last_offset: False, device: cpu
Forward Execution Time (us) : 103.904

```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23245531

fbshipit-source-id: 81b44fde522238d3eef469434e93dd7f94b528a8
2020-08-24 09:51:03 -07:00
7cc1efec13 Add lite SequentialSampler to torch mobile (#43299)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43299

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23228415

Pulled By: ann-ss

fbshipit-source-id: eebe54353a128783f039c7dac0e2dd765a61940d
2020-08-24 09:45:24 -07:00
c972e6232a Implement batching rules for basic arithmetic ops (#43362)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43362

Batching rules implemented for addition, subtraction, multiplication, and division.

I refactored the original `mul_batching_rule` into a templated function
so that one can insert arbitrary binary operations into it.

add, sub, rsub, mul, and div all work the same way. However, other
binary operations work slightly differently (I'm still figuring out the
differences and why they're different) so those may need a different
implementation.
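Conceptually, a binary batching rule must reproduce per-example application of the op along the batch dimension. A plain-PyTorch sketch of those semantics (not the internal API):

```
import torch

x = torch.randn(5, 3)  # batch of 5 examples
y = torch.randn(3)
expected = torch.stack([x[i] + y for i in range(5)])
assert torch.allclose(x + y, expected)
```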

Test Plan: - "pytest test/test_vmap.py -v": new tests

Reviewed By: ezyang

Differential Revision: D23252317

Pulled By: zou3519

fbshipit-source-id: 6d36cd837a006a2fd31474469323463c1bd797fc
2020-08-24 08:43:36 -07:00
db78c07ced Enable torch.cuda.nvtx typechecking (#43443)
Summary:
Add pyi file covering torch._C.nvtx submodule

Fixes https://github.com/pytorch/pytorch/issues/43436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43443

Reviewed By: ezyang

Differential Revision: D23280188

Pulled By: malfet

fbshipit-source-id: 882860cce9feb0b5307c8b7c887f4a2f2c1548a2
2020-08-24 08:20:12 -07:00
2f9c9796f1 [Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D23290730

fbshipit-source-id: ee3ffbd6f9c0fade4586d8f4f8c8dd3d310d1f33
2020-08-24 05:36:38 -07:00
c4e841654d Add alias torch.negative to torch.neg. (#43400)
Summary:
xref https://github.com/pytorch/pytorch/issues/42515
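
A one-line sanity check of the alias (minimal sketch):

```python
import torch

x = torch.tensor([1.0, -2.0, 3.0])
assert torch.equal(torch.negative(x), torch.neg(x))  # alias matches original
```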

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43400

Reviewed By: albanD

Differential Revision: D23266011

Pulled By: mruberry

fbshipit-source-id: ca20b30d99206a255cf26438b09c3ca1f99445c6
2020-08-24 01:15:04 -07:00
1f0cfbaaad [fx] add type annotations (#43083)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43083

This adds type annotations to all classes, arguments, and returns
for fx. This should make it easier to understand the code, and
encourage users of the library to also write typed code.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23145853

Pulled By: zdevito

fbshipit-source-id: 648d91df3f9620578c1c51408003cd5152e34514
2020-08-23 15:38:33 -07:00
b349f58c21 [fx] enabling typechecking of fx files (#43082)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43082

Fixes all present errors in mypy. Does not try to add annotations everywhere.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23145854

Pulled By: zdevito

fbshipit-source-id: 18e483ed605e89ed8125971e84da1a83128765b7
2020-08-23 15:37:29 -07:00
a97ca93c0e remove prim::profile and special-casing (#43160)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43160

Reviewed By: ZolotukhinM

Differential Revision: D23284421

Pulled By: Krovatkin

fbshipit-source-id: 35e97aad299509a682ae7e95d7cef53301625309
2020-08-22 23:52:36 -07:00
d70b263e3a [DPER3] Separate user embeddings and ad embeddings in blob reorder
Summary:
Separate user embeddings and ad embeddings in blobsOrder. New order:
1. meta_net_def
2. preload_blobs
3. user_embeddings (embeddings in remote request only net)
4. ad_embeddings (embeddings in remote other net)

Add a field requestOnlyEmbeddings in meta_net_def to record user_embeddings.

This is for flash verification.

Test Plan:
buck test dper3/dper3_backend/delivery/tests:blob_reorder_test

Run a flow with canary package f211282476
Check the net: n326826, request_only_embeddings are recorded as expected

Reviewed By: ipiszy

Differential Revision: D23008305

fbshipit-source-id: 9360ba3d078f205832821005e8f151b8314f0cf2
2020-08-22 23:40:04 -07:00
4dc8f3be8c Creates test_tensor_creation_ops.py test suite (#43104)
Summary:
As part of our continued refactoring of test_torch.py, this takes tests for tensor creation ops like torch.eye, torch.randint, and torch.ones_like and puts them in test_tensor_creation_ops.py. There are three test classes in the new test suite: TestTensorCreation, TestRandomTensorCreation, and TestLikeTensorCreation. TestViewOps and tests for construction of tensors from NumPy arrays have been left in test_torch.py. These might be refactored separately into test_view_ops.py and test_numpy_interop.py in the future.

Most of the tests ported from test_torch.py were left as is or received a signature change to make them nominally "device generic." Future work will need to review test coverage and update the tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43104

Reviewed By: ngimel

Differential Revision: D23280358

Pulled By: mruberry

fbshipit-source-id: 469325dd1a734509dd478cc7fe0413e276ffb192
2020-08-22 23:18:54 -07:00
35351ff409 Fix ToC Link (#43427)
Summary:
CC ezyang - no code here

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43427

Reviewed By: albanD

Differential Revision: D23273866

Pulled By: mrshenli

fbshipit-source-id: ca07d286410f367cc78549828e517510a86d63ec
2020-08-22 19:51:24 -07:00
e4af45f3aa Fix bugs in vec256_float_neon.h (#43321)
Summary:
Fixes NEON vector conversion problems.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43321

Reviewed By: pbelevich

Differential Revision: D23241536

Pulled By: kimishpatel

fbshipit-source-id: 37a4e10989c9342ae5e8c78f6875b7aad785dd76
2020-08-22 17:27:18 -07:00
b003f2cc28 Enable input pointer caching in XNNPACK integration. (#42840)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42840

By caching input/output pointers and input parameters, we enable the use of
the caching allocator and check whether we get the same input/output pointers.
If so, we skip the setup steps.

Test Plan:
python test/test_xnnpack_integration.py

Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D23044585

fbshipit-source-id: ac676cff77f264d8ccfd792d1a540c76816d5359
2020-08-22 16:50:17 -07:00
b52e6d00f9 Change quantizer to account for input tensor's memory format. (#42178)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42178

This otherwise introduces unnecessary calls to contiguous() in the rest of
the network, where certain ops want channels-last format.

Test Plan:
Quantization tests.

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D22796479

fbshipit-source-id: f1ada1c2eeed84991b9b195120699b943ef6e421
2020-08-22 16:48:50 -07:00
b1d31428e7 Reduce number of prim::profile (#43147)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43147

Reviewed By: colesbury

Differential Revision: D23190137

Pulled By: Krovatkin

fbshipit-source-id: bf5f29a76e5ebfb5b9d3b6adee424e213c25891b
2020-08-22 16:06:30 -07:00
8efa898349 [ONNX] Export split_to_sequence as slice when output number is static (#42744)
Summary:
Optimize the exported graph to emit slice nodes for aten::split when the number of split outputs is fixed. Previously, in some cases these were exported as onnx::SplitToSequence, which is dynamic in its tensor output count.
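
A minimal sketch of the case this optimizes (default export settings assumed):

```python
import io
import torch

class Split(torch.nn.Module):
    def forward(self, x):
        # The number of outputs is statically known here, so the exporter
        # can emit onnx::Slice nodes instead of onnx::SplitToSequence.
        a, b, c = torch.split(x, 2)
        return a, b, c

torch.onnx.export(Split(), torch.randn(6, 4), io.BytesIO())
```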

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42744

Reviewed By: houseroad

Differential Revision: D23172465

Pulled By: bzinodev

fbshipit-source-id: 11e432b4ac1351f17e48356c16dc46f877fdf7da
2020-08-22 09:11:25 -07:00
ec9e6e07bc [quant][graphmode][fx] Add support for general value ops (#43439)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43439

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23278585

fbshipit-source-id: ad29f39482cf4909068ce29555470ef430ea17f6
2020-08-22 08:52:28 -07:00
47e1b7a8f1 Set CONSTEXPR_EXCEPT_WIN_CUDA as const while it is not constexpr (#43380)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42467

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43380

Reviewed By: albanD

Differential Revision: D23278930

Pulled By: pbelevich

fbshipit-source-id: 6ce0bc9fd73cd0ead46c414fdea5f6fb7e9fec3e
2020-08-22 03:25:37 -07:00
d94b10a832 Revert D23223281: Add Enum TorchScript serialization and deserialization support
Test Plan: revert-hammer

Differential Revision:
D23223281 (f269fb83c1)

Original commit changeset: 716d1866b777

fbshipit-source-id: da1ad8387b7d7aad9ff69e1ebeb5cd0b9394c2df
2020-08-22 02:38:12 -07:00
915fd1c8fc centralize autograd dispatch key set (#43387)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43387

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D23258687

Pulled By: bhosmer

fbshipit-source-id: 3718f74fc7324db027f87eda0b90893a960aa56e
2020-08-22 00:46:02 -07:00
88b564ce39 [quant][graphmode][fx] Add support for general shape ops (#43438)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43438

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23278583

fbshipit-source-id: 34b73390d47c7ce60528444da77c4096432ea2cb
2020-08-21 23:07:20 -07:00
192c4b0050 [quant][graphmode][fx] Add support for clamp (#43437)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43437

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23278584

fbshipit-source-id: 266dc68c9ca30d9160a1dacf28dc7781b3d472c2
2020-08-21 20:21:50 -07:00
40c77f926c Add prim::TypeCheck operation (#43026)
Summary:
TypeCheck is a new operation that checks the shapes of tensors against
expected shapes. TypeCheck is a variadic operation. An example:

 %t0 : Tensor = ...
 %t1 : Tensor = ...
 %2 : FLOAT(20, 20), %3 : FLOAT(30, 30), %1 : bool =
   prim::TypeCheck(%t0, %t1)
 prim::If(%1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43026

Reviewed By: ZolotukhinM

Differential Revision: D23115830

Pulled By: bzinodev

fbshipit-source-id: fbf142126002173d2d865cf4b932dea3864466b4
2020-08-21 20:03:24 -07:00
98307a2821 Fix bfloat16 erfinv get incorrect value problem for cpu path (#43399)
Summary:
Fix https://github.com/pytorch/pytorch/issues/43344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43399

Reviewed By: albanD

Differential Revision: D23264789

Pulled By: pbelevich

fbshipit-source-id: 8b77c0f6ca44346e44599844fb1e172fdbd9df6c
2020-08-21 19:59:37 -07:00
5e04bb2c1c caffe2: expose CPUContext RandSeed for backwards compatibility with external RNG (#43239)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43239

This is an incremental step as part of the process to migrate caffe2 random number generator off of std::mt19937 and to instead use at::mt19937+at::CPUGeneratorImpl. The ATen variants are much more performant (10x faster).

This adds a way to get the CPUContext RandSeed for tail use cases that require a std::mt19937 and borrow the CPUContext one.

Test Plan: This isn't used anywhere within the caffe2 codebase. Compile should be sufficient.

Reviewed By: dzhulgakov

Differential Revision: D23203280

fbshipit-source-id: 595c1cb447290604ee3ef61d5b5fc079b61a4e14
2020-08-21 19:36:38 -07:00
fb12992b5d Call qnnpack's conv setup only if input pointer has changed. (#42008)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42008

With the caching allocator we have increased the likelihood of getting the
same input pointer. Given that, we can cache the QNNPACK operator and input
pointer and check whether the input pointer is the same. If so, we can skip
the setup step.

Test Plan:
Ran one of the quantized models to observe
1. No pagefaults due to indirection buffer reallocation.
2. Much less time spent in indirection buffer population.

Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22726973

fbshipit-source-id: 2dd2a6a6ecf1b5cfa7dde65e384b36a6eab052d7
2020-08-21 19:10:40 -07:00
04aa42a073 Refactor qconv to reduce allocations. (#42007)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42007

The zero buffer and indirection pointers were allocated on every iteration.
With this refactor we create the op once for the qnnpack conv struct and keep
repopulating the indirection pointer as necessary.

For deconv, much of the op creation was moved outside so that we avoid
creating and destroying ops every time.

Test Plan:
CI quantization tests.
deconvolution-test

Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22726972

fbshipit-source-id: 07c03a4e90b397c36aae537ef7c0b7d81d4adc1a
2020-08-21 19:10:37 -07:00
2a08566b8f Simple caching allocator for CPU. (#42006)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42006

This PR introduces a simple CPU caching allocator. It is specifically
intended for mobile use cases and for inference. Nothing in the
implementation prevents other use cases, but its simplicity may not be
suitable everywhere.
It simply tracks allocations by size and relies on deterministic,
repeatable behavior where allocations of the same sizes are made on every
inference.
Thus, after the first allocation, when the pointer is returned it is cached
for subsequent use instead of being released to the system.
Memory is freed automatically at the end of the process, or it can be
explicitly freed.
At the moment this is enabled only in DefaultMobileCPUAllocator.
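
A minimal Python sketch of the caching-by-size idea described above (illustrative only; the real allocator is C++ behind DefaultMobileCPUAllocator):

```python
class CPUCachingAllocator:
    """Caches freed blocks by size instead of returning them to the system."""

    def __init__(self):
        self.free_by_size = {}  # size -> list of cached blocks

    def allocate(self, size):
        cached = self.free_by_size.get(size)
        if cached:
            return cached.pop()  # reuse: the same sizes recur every inference
        return bytearray(size)   # first time: fall back to a fresh allocation

    def free(self, block):
        # Cache for subsequent use rather than releasing to the system.
        self.free_by_size.setdefault(len(block), []).append(block)
```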

Test Plan:
android test: cpu_caching_allocator_test

Imported from OSS

Reviewed By: dreiss

Differential Revision: D22726976

fbshipit-source-id: 9a38b1ce34059d5653040a1c3d035bfc97609e6c
2020-08-21 19:09:22 -07:00
abe878ce96 Allow Freezing of Module containing interface attribute (#41860)
Summary:
This patch allows freezing a model that utilizes interfaces. Freezing works
under the user's assumption that the interface module does not alias any
value used in the model.

To enable freezing of such modules, an extra parameter was added:

torch._C._freeze_module(module, ignoreInterfaces=True)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41860

Reviewed By: eellison

Differential Revision: D22670566

Pulled By: bzinodev

fbshipit-source-id: 41197a724bc2dca2e8495a0924c224dc569f62a4
2020-08-21 18:57:13 -07:00
490d41aaa6 [quant][graphmode][fx] Add support for instance_norm (#43377)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43377

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23257045

fbshipit-source-id: 7f4ad5d81f21bf0b8b9d960b054b20dc889e6c3b
2020-08-21 18:32:50 -07:00
a5a6a3e633 add support for optional int list with scalar fill (#43262)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43262

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23212049

Pulled By: bhosmer

fbshipit-source-id: c7ceb2318645c07d36c3f932c981c9ee3c414f82
2020-08-21 18:24:36 -07:00
f269fb83c1 Add Enum TorchScript serialization and deserialization support (#42963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42963

* Adds code printing for enum type
* Enhance enum type to include all contained enum names and values
* Adds code parsing for enum type in deserialization
* Enabled serialization/deserialization tests in most TestCases. (With a few dangling issues to be addressed in later PRs to avoid this PR growing too large)

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D23223281

Pulled By: gmagogsfm

fbshipit-source-id: 716d1866b7770dfb7bd8515548cfe7dc4c4585f7
2020-08-21 18:13:27 -07:00
aa53b2d427 Workaround bugs in user side embedding meta info and better msgs (#43355)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43355

There seem to be some bugs where we cannot guarantee that blobs in `PARAMETERS_BLOB_TYPE_FULLY_REMOTE_REQUEST_ONLY` and `PARAMETERS_BLOB_TYPE_DISAGG_ACC_REMOTE_OTHER` are disjoint. Hence we need to work around this.

Also make the message more informative.

Test Plan:
```
flow-cli test-locally --mode opt dper.workflows.evaluation.eval_workflow --parameters-file=/mnt/shared/yinghai/v0_ctr_mbl_feed_1120_onnx.json
```

Reviewed By: ehsanardestani

Differential Revision: D23141538

fbshipit-source-id: 8e311f8fc0e40eff6eb2c778213f78592e6bf079
2020-08-21 17:18:51 -07:00
aec917a408 [quant][graphmode][fx] Add support for layer_norm (#43376)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43376

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23257048

fbshipit-source-id: 47a04a5221bcaf930d574f879d515e3dff2d1f6d
2020-08-21 16:38:16 -07:00
089bb1a8e4 [quant][graphmode][fx] Add support for elu (#43375)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43375

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23257043

fbshipit-source-id: 22360610d87ef98d25871daff3fdc3dbb3ec5bdb
2020-08-21 16:07:36 -07:00
5a02c6b158 [quant][graphmode][fx] Add support for hardswish (#43374)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43374

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23257044

fbshipit-source-id: 2cdf12e104db6e51ffa0324eb602e68132a646ef
2020-08-21 16:06:32 -07:00
93f1b5c8da Mobile backward compatibility (#42413)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42413

When a default argument is added, it does not break backward compatibility (BC) for full-jit, but does break BC for mobile bytecode. For example, https://github.com/pytorch/pytorch/pull/40737. To make bytecode BC in this case, we

1. Introduce kMinSupportedBytecodeVersion. The loaded model version should be between kMinSupportedBytecodeVersion and kProducedBytecodeVersion.
2. If an operator is updated, and we can handle BC, bump the kProducedBytecodeVersion (for example, from 3 to 4).
3. If the model version is at the older version of the operator, add an adapter function at load time. For the added default arg, we push this default arg onto the stack before calling the actual operator function (see the sketch below).
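
A hypothetical sketch of this scheme (all names here are illustrative, not the actual mobile loader API):

```python
K_MIN_SUPPORTED_BYTECODE_VERSION = 3
K_PRODUCED_BYTECODE_VERSION = 4

def check_bytecode_version(model_version):
    # Step 1: the loaded model version must fall in the supported window.
    if not (K_MIN_SUPPORTED_BYTECODE_VERSION <= model_version
            <= K_PRODUCED_BYTECODE_VERSION):
        raise RuntimeError(f"Unsupported bytecode version {model_version}")

def call_op(op, stack, model_version, new_default):
    # Step 3: for models produced before the default argument existed,
    # push the default onto the stack before calling the updated operator.
    if model_version < K_PRODUCED_BYTECODE_VERSION:
        stack.append(new_default)
    return op(*stack)
```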

Test Plan: Imported from OSS

Reviewed By: xcheng16

Differential Revision: D22898314

Pulled By: iseeyuan

fbshipit-source-id: 90d339f8e1365f4bb178db8db7c147390173372b
2020-08-21 15:45:52 -07:00
e96871ea46 [quant][graphmode][fx] Add support for mul and mul relu (#43373)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43373

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23257047

fbshipit-source-id: b7f9fcef965d6368018e05cff09260f0eb6f3b50
2020-08-21 15:31:00 -07:00
6c772515ed Revert D23252335: Refactor Vulkan context into its own files. Use RAII.
Test Plan: revert-hammer

Differential Revision:
D23252335 (054073c60d)

Original commit changeset: 43144446f2f3

fbshipit-source-id: 442b914f47a82efee18cfd84aab893e22d1defdd
2020-08-21 15:10:06 -07:00
8eb3de76ba Fix enum constant printing and add FileCheck to all Enum tests (#42874)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42874

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D23222894

Pulled By: gmagogsfm

fbshipit-source-id: 86495a350d388c82276933d24a2ca3c0f59af8da
2020-08-21 14:55:46 -07:00
ff454cc429 [quant][grapphmode][fx][test][refactor] Refactor quantized add test (#43372)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43372

So that adding more binary op tests is easier

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23257046

fbshipit-source-id: 661acd4c38abdc892c9db8493b569226b13e0d0d
2020-08-21 14:53:23 -07:00
109ea59afc [quant][graphmode][fx] Add support for batchnorm relu (#43335)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43335

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23243563

fbshipit-source-id: 3c562f519b90e0157761a00c89eca63af8b909f2
2020-08-21 14:32:51 -07:00
9e87a8ddf4 [quant][graphmode][fx] Add support for batchnorm (#43334)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43334

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23243560

fbshipit-source-id: 0a7bc331293bbc3db85616bf43a995d3b112beb6
2020-08-21 14:31:49 -07:00
054073c60d Refactor Vulkan context into its own files. Use RAII. (#42273)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42273

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252335

Pulled By: AshkanAliabadi

fbshipit-source-id: 43144446f2f3530e6cb2a85706a9afc60771347d
2020-08-21 14:28:38 -07:00
3d76f7065e [quant][graphmode][fx] Add support for cat (#43333)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43333

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23243562

fbshipit-source-id: 5c8eab2af592a9ea4afa713fb884e34e0ffd82b1
2020-08-21 12:54:50 -07:00
26be4dcfa1 [quant][graphmode][fx] Add support for add relu (#43332)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43332

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D23243564

fbshipit-source-id: 3cd1786c6356aaa234d31b50f12ad6ddc38d5664
2020-08-21 12:54:41 -07:00
452a473729 [quant][graphmode][fx] Add support for add (#43331)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43331

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23243561

fbshipit-source-id: 5a6399d25cc881728cf298c77570ce2aaf3ca22e
2020-08-21 12:52:37 -07:00
6e48c88e09 .circleci: Prefer using env-file for docker run (#43293)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43293

'docker run' has the capability to use a file for environment variables;
we should prefer to use that instead of having variables sourced per command
in the docker container.

Also opens the door for cutting down on the total number of commands we
need to echo into a script to then execute as a 'docker exec' command.

Plus side of this approach is that the BASH_ENV is persisted through all
of the steps so there's no need to do any exports / worry about
environment variables not persisting through jobs.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23227059

Pulled By: seemethere

fbshipit-source-id: be425aa21b420b9c6e96df8b2177f508ee641a20
2020-08-21 12:48:35 -07:00
100649d6a9 Normalize loops with non-zero start. (#43179)
Summary:
This diff normalizes for-loops that have non-zero loop starts to always start from 0. Given a for-loop, this normalization changes the loop start to 0 and adjusts the loop end and all accesses to the index variable within the loop body appropriately.

This diff also adds tests for several cases of normalization and also tests normalization in conjunction with `splitwithTail` transformation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43179

Reviewed By: nickgg

Differential Revision: D23220534

Pulled By: navahgar

fbshipit-source-id: 64be0c72e4dbc76906084f7089dea81ae07d6020
2020-08-21 12:37:27 -07:00
74781ab5b8 Revert D23242101: [pytorch][PR] Implement first draft of autograd benchmark.
Test Plan: revert-hammer

Differential Revision:
D23242101 (c2511bdfa4)

Original commit changeset: a2b92d5a4341

fbshipit-source-id: bda562d15565f074b448022d180ec8f959c6ecc9
2020-08-21 12:22:57 -07:00
650590da0d [quant][graphmode][fx] Add support for conv module + relu (#43287)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43287

Porting op tests from test_quantize_jit.py

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23221735

fbshipit-source-id: 2513892a1928f92c09d7e9a24b2ea12b00de218d
2020-08-21 12:13:02 -07:00
3293fdfa80 [quant] Enable from_float for quantized Embedding_Bag (#43176)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43176

Convert floating point nn.EmbeddingBag module to
nn.quantized.dynamic.EmbeddingBag module

Test Plan:
python test/test_quantization.py TestDynamicQuantizedModule.test_embedding_bag_api
python test/test_quantization.py TestPostTrainingDynamic.test_embedding_quantization

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23200196

fbshipit-source-id: 090f47dbf7aceab9c719cbf282fad20fe3e5a983
2020-08-21 11:46:03 -07:00
b354b422ee [quant] Make offsets an optional argument (#43090)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43090

To match the floating point module

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23167518

fbshipit-source-id: 29db596e10731be4cfed7efd18f33a0b3dbd0ca7
2020-08-21 11:46:00 -07:00
4db8ca1129 [quant] Create nn.quantized.dynamic.EmbeddingBag (#43088)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43088

Create a quantized module that the user can use to perform embedding bag quantization.
The module uses EmbeddingPackedParams to store the weights, which can be serialized/deserialized
using TorchBind custom classes (C++ get/setstate code).
A following PR will add support for `from_float` to convert from the float to the quantized module.

Test Plan:
python test/test_quantization.py TestDynamicQuantizedModule.test_embedding_bag_api

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23167519

fbshipit-source-id: 029d7bb44debf78c4ef08bfebf267580ed94d033
2020-08-21 11:45:02 -07:00
f20a04fa2d [TensorExpr] Simplify conditional select (#43350)
Summary:
Fold conditional select when both sides are constant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43350

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.ConditionalSelectFold*

Reviewed By: pbelevich

Differential Revision: D23256602

Pulled By: asuhan

fbshipit-source-id: ec04b1e4ae64f59fa574047f2d7af55a717a5262
2020-08-21 11:15:48 -07:00
743cff4a1a Fix PackedGemmMatrixFP16 repacking (#43320)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43320

The previous implementation seems to be buggy, although I don't know why. The new implementation is copied from https://fburl.com/diffusion/cing6mxv

Reviewed By: jianyuh

Differential Revision: D23235964

fbshipit-source-id: 780b6e388ef895232e3ba34b125c2492b1cee60c
2020-08-21 10:58:18 -07:00
e57b89c8dc Adds arccos, arcsin, arctan aliases (#43319)
Summary:
These aliases are consistent with NumPy (see, for example, https://numpy.org/doc/stable/reference/generated/numpy.arccos.html?highlight=acos).

Note that PyTorch's existing names are consistent with Python (see https://docs.python.org/3.10/library/math.html?highlight=acos#math.acos) and C++ (see, for example, https://en.cppreference.com/w/cpp/numeric/math/acos).
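
A quick check of the new aliases (minimal sketch):

```python
import torch

x = torch.tensor([0.0, 0.5, 1.0])
# The NumPy-style aliases agree with the existing Python/C++-style names:
assert torch.equal(torch.arccos(x), torch.acos(x))
assert torch.equal(torch.arcsin(x), torch.asin(x))
assert torch.equal(torch.arctan(x), torch.atan(x))
```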

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43319

Reviewed By: pbelevich

Differential Revision: D23260426

Pulled By: mruberry

fbshipit-source-id: 98a6c97f69d1f718a396c2182e938a7a260c0889
2020-08-21 10:53:17 -07:00
3aec1185e0 Enables bfloat16 x [float16, complex64, complex128] type promotion (#43324)
Summary:
Implements bfloat16 type promotion consistent with JAX (see https://jax.readthedocs.io/en/latest/type_promotion.html), addressing issue https://github.com/pytorch/pytorch/issues/43049.

- bfloat16 x float16 -> float32
- bfloat16 x complex64 -> complex64
- bfloat16 x complex128 -> complex128

Existing tests, after updates, are sufficient to validate the new behavior.
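
The rules listed above, expressed as dtype checks (a minimal sketch):

```python
import torch

bf = torch.tensor(1.0, dtype=torch.bfloat16)
assert (bf + torch.tensor(1.0, dtype=torch.float16)).dtype == torch.float32
assert (bf + torch.tensor(1 + 1j, dtype=torch.complex64)).dtype == torch.complex64
assert (bf + torch.tensor(1 + 1j, dtype=torch.complex128)).dtype == torch.complex128
```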

cc xuhdev

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43324

Reviewed By: albanD

Differential Revision: D23259823

Pulled By: mruberry

fbshipit-source-id: ca9c2c7d0325faced1f884f3c37edf8fa8c8b089
2020-08-21 10:48:04 -07:00
478fb925e6 [jit] PyTorchStreamReader::getAllRecord should omit archive name prefix (#43317)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43317

The previous version returned the path with a prefix, so a subsequent `getRecord` would fail.

There's only one place in PyTorch codebase that uses this function (introduced in https://github.com/pytorch/pytorch/pull/29339 ) and it's unlikely that anyone else is using it - it's not a public API anyway.

Test Plan: unittest

Reviewed By: houseroad

Differential Revision: D23235241

fbshipit-source-id: 6f7363e6981623aa96320f5e39c54e65d716240b
2020-08-21 10:39:57 -07:00
0bd35de30e Add Enum convert back to Python object support (#43121)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43121

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D23222628

Pulled By: gmagogsfm

fbshipit-source-id: 6850c56ced5b52943a47f627b2d1963cc9239408
2020-08-21 10:36:51 -07:00
f4b6ef9c56 Do not define the macro "isnan" (#43242)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43242

This causes "std::isnan" to produce confusing error messages (std::std has not been declared).
Instead, simply let isnan be exposed in the global namespace.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23214374

Pulled By: ezyang

fbshipit-source-id: 9615116a980340e36376a20f2e546e4d36839d4b
2020-08-21 10:08:38 -07:00
7b520297dc Remove erroneous trailing backslashes (#43318)
Summary:
They were likely copied from some macro definition, but they are not part of
any macro definition here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43318

Reviewed By: pbelevich

Differential Revision: D23241526

Pulled By: mrshenli

fbshipit-source-id: e0b5eddfde2c882bb67f56d84ee79281cc5fc941
2020-08-21 08:21:56 -07:00
c2511bdfa4 Implement first draft of autograd benchmark. (#40586)
Summary:
It is quite a lot of code because I pulled some code from torchaudio and torchvision to avoid the issues I had getting the latest versions with PyTorch built from source, since I can't build those libs from source (a dependency is missing for torchaudio).

The compare script generates a table as follows:
| model | task | speedup | mean (before) | var (before) | mean (after) | var (after) |
| -- | -- | -- | -- | -- | -- | -- |
| resnet18 | vjp | 1.021151844124464 | 1.5627719163894653 | 0.005164200905710459 | 1.5304011106491089 | 0.003979875706136227 |
| resnet18 | vhp | 0.9919114430761606 | 6.8089728355407715 | 0.019538333639502525 | 6.86449670791626 | 0.014775685034692287 |
| resnet18 | jvp | 0.9715963084255123 | 5.720699310302734 | 0.08197150379419327 | 5.887938499450684 | 0.018408503383398056 |
| ppl_simple_reg | vjp | 0.9529183269165618 | 0.000362396240234375 | 7.526952949810095e-10 | 0.00038030146970413625 | 7.726220357939795e-11 |
| ppl_simple_reg | vhp | 0.9317708619586977 | 0.00048058031825348735 | 5.035701855504726e-10 | 0.0005157709238119423 | 3.250243477137538e-11 |
| ppl_simple_reg | jvp | 0.8609755877018406 | 0.00045447348384186625 | 9.646707044286273e-11 | 0.0005278587341308594 | 1.4493808930815533e-10 |
| ppl_simple_reg | hvp | 0.9764100147808232 | 0.0005881547695025802 | 7.618464747949361e-10 | 0.0006023645401000977 | 6.370915461850757e-10 |
| ppl_simple_reg | jacobian | 1.0019173715134297 | 0.0003612995205912739 | 2.2979899233499523e-11 | 0.0003606081008911133 | 1.2609764794835332e-11 |
| ppl_simple_reg | hessian | 1.0358429970264393 | 0.00206911563873291 | 2.590938796842579e-09 | 0.0019975185859948397 | 2.8916853356264482e-09 |
| ppl_robust_reg | vjp | 1.0669910916521521 | 0.0017304659122601151 | 3.1047047155396967e-09 | 0.0016218185191974044 | 4.926861585374809e-09 |
| ppl_robust_reg | vhp | 1.0181130455462972 | 0.0029563189018517733 | 2.6359153082466946e-08 | 0.0029037236236035824 | 1.020585038702393e-08 |
| ppl_robust_reg | jvp | 0.9818360373406179 | 0.0026934861671179533 | 6.981357714153091e-09 | 0.00274331565015018 | 3.589908459389335e-08 |
| ppl_robust_reg | hvp | 1.0270848910527002 | 0.005576515104621649 | 3.2798087801211295e-08 | 0.005429458804428577 | 6.438724398094564e-08 |
| ppl_robust_reg | jacobian | 1.0543611284155785 | 0.00167675013653934 | 2.3236829349571053e-08 | 0.001590299652889371 | 1.2011492245278532e-08 |
| ppl_robust_reg | hessian | 1.0535378727082656 | 0.01643357239663601 | 1.8450685956850066e-06 | 0.015598463825881481 | 2.1876705602608126e-07 |
| wav2letter | vjp | 1.0060408105086573 | 0.3516994118690491 | 1.4463969819189515e-05 | 0.349587619304657 | 9.897866402752697e-05 |
| wav2letter | vhp | 0.9873655295086051 | 1.1196287870407104 | 0.00474404776468873 | 1.133955717086792 | 0.009759620763361454 |
| wav2letter | jvp | 0.9741820317882822 | 0.7888165712356567 | 0.0017476462526246905 | 0.8097219467163086 | 0.0018235758179798722 |
| transfo | vjp | 0.9883954031921641 | 2.8865864276885986 | 0.008410997688770294 | 2.9204773902893066 | 0.006901870481669903 |
| transfo | vhp | 1.0111290842971339 | 8.374398231506348 | 0.014904373325407505 | 8.282224655151367 | 0.04449500888586044 |
| transfo | jvp | 1.0080534543381963 | 6.293097972869873 | 0.03796082362532616 | 6.24282169342041 | 0.010179692879319191 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40586

Reviewed By: pbelevich

Differential Revision: D23242101

Pulled By: albanD

fbshipit-source-id: a2b92d5a4341fe1472711a685ca425ec257d6384
2020-08-21 07:36:26 -07:00
0cb52cb458 Autograd better error (#43308)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/5025

Thanks for the conversation in the issue thread. Hopefully this fixes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43308

Reviewed By: ezyang

Differential Revision: D23241918

Pulled By: suraj813

fbshipit-source-id: e1efac13f5ce590196f227149f011c973c2bbdde
2020-08-21 05:50:33 -07:00
da036250cd Add benchmark for performance comparison (#43221)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43221

Test Plan: Example: https://www.internalfb.com/intern/paste/P139226521/

Reviewed By: kimishpatel

Differential Revision: D23197567

Pulled By: kimishpatel

fbshipit-source-id: 7d0f8e653c62f0bee5795618e712d07effbd460a
2020-08-20 23:11:40 -07:00
da70976e66 [ONNX] Add support for operator add between tensor list (#41888)
Summary:
E.g.
```python
outs = []
outs += [torch.randn(3,4)]
outs = outs + [torch.randn(4,5), torch.randn(5,6)]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41888

Reviewed By: houseroad

Differential Revision: D23172880

Pulled By: bzinodev

fbshipit-source-id: 93865106e3de5908a993e0cfa82f626ba94dab7e
2020-08-20 22:38:23 -07:00
c64594f5cc Extends test_unary_ufunc.py with numerics, contiguity, domain tests (#42965)
Summary:
This PR:

- ports the tests in TestTorchMathOps to test_unary_ufuncs.py
- removes duplicative tests for the tested unary ufuncs from test_torch.py
- adds a new test, test_reference_numerics, that validates the behavior of our unary ufuncs vs. reference implementations on empty, scalar, 1D, and 2D tensors that are contiguous, discontiguous, and that contain extremal values, for every dtype the unary ufunc supports
- adds support for skipping tests by regex; this behavior is used to make the test suite pass on Windows, MacOS, and ROCm builds, which have a variety of issues, and on Linux builds (see https://github.com/pytorch/pytorch/issues/42952)
- adds a new OpInfo helper, `supports_dtype`, to facilitate test writing
- extends unary ufunc op info to include reference, domain, and extremal value handling information
- adds OpInfos for `torch.acos` and `torch.sin`

These improvements reveal that our testing has been incomplete on several systems, especially with larger float values and complex values, and several TODOs have been added for follow-up investigations. Luckily when writing tests that cover many ops we can afford to spend additional time crafting the tests and ensuring coverage.

Follow-up PRs will:

- refactor TestTorchMathOps into test_unary_ufuncs.py
- continue porting tests from test_torch.py to test_unary_ufuncs.py (where appropriate)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42965

Reviewed By: pbelevich

Differential Revision: D23238083

Pulled By: mruberry

fbshipit-source-id: c6be317551453aaebae9d144f4ef472f0b3d08eb
2020-08-20 22:02:00 -07:00
e31cd46278 Add alias torch.fix for torch.trunc to be compatible with NumPy. (#43326)
Summary:
xref https://github.com/pytorch/pytorch/issues/42515
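
A quick check of the alias (minimal sketch):

```python
import torch

x = torch.tensor([-1.7, -0.2, 2.9])
assert torch.equal(torch.fix(x), torch.trunc(x))  # NumPy-compatible alias
```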

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43326

Reviewed By: pbelevich

Differential Revision: D23249089

Pulled By: mruberry

fbshipit-source-id: 6afa9eb20493983d084e0676022c6245e7463e05
2020-08-20 21:47:39 -07:00
17f9edda42 Bias Correction Implementation (#41845)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41845

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D22661503

Pulled By: edmundw314

fbshipit-source-id: a88c349c6cc15b1c66aa6dee7593ef3df588eb85
2020-08-20 21:40:33 -07:00
665da61d2b Replace Conv1d with Conv2d (#42867)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42867

Test Plan: Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D23177916

Pulled By: kimishpatel

fbshipit-source-id: 68cc40cf42d03e5b8432dc08f9933a4409c76e25
2020-08-20 21:36:51 -07:00
e8139624f2 Search on system path for Vulkan headers and libraries as a last resort. (#43301)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43301

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D23252338

Pulled By: AshkanAliabadi

fbshipit-source-id: 8eefe98eedf9dbeb570565bfb13ab61b1d6bca0e
2020-08-20 21:14:09 -07:00
217ddea93a [quant] Make OP_LIST_TO_FUSER_METHOD public (#43286)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43286

We need to use this in graph mode quantization on fx

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23221734

fbshipit-source-id: 7c3c3840ce5bdc185b962e081aff1618f4c58e85
2020-08-20 20:19:13 -07:00
844d469ae7 Remove proprietary notices
Summary:
These were added accidentally (probably by an IDE) during a refactor.
These files have always been Open Source.

Test Plan: CI

Reviewed By: xcheng16

Differential Revision: D23250761

fbshipit-source-id: 4974430c0e28dd3269424d38edb36f4f71508157
2020-08-20 20:14:59 -07:00
9984d33542 [quant][graphmode][fx] Add support for conv module (#43285)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43285

Porting op tests from test_quantize_jit.py

(Note: this ignores all push blocking failures!)

Test Plan:
TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23221733

fbshipit-source-id: c1f0f7ae0c82379143aa33fc1af7284d8303174b
2020-08-20 19:53:30 -07:00
7c50c2f79e Reimplement per-operator selective build (#39401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39401

This uses the technique proposed by smessmer in D16451848 to selectively
register operators without codegen.  See the Note inside for more
details.

This PR has feature parity with the old selective build apparatus:
it can whitelist schema def()s, impl()s, and on a per dispatch key
basis.  It has expanded dispatch key whitelisting, whereas previously
manually written registrations were not whitelisted at all.  (This
means we may be dropping dispatch keys where we weren't previously!)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D21905593

Pulled By: ezyang

fbshipit-source-id: d4870f800c66be5ce57ec173c9b6e14a52c4a48b
2020-08-20 19:10:02 -07:00
e32d014f46 remove empty override pretty_print (#43341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43341

This is to remove the empty pretty_print() since it overrides the impl within Module base which is not as designed here.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D23244616

Pulled By: glaringlee

fbshipit-source-id: 94b8dfd3697dfc450f53b3b4eee6e9c13cafba7b
2020-08-20 18:48:29 -07:00
ad8294d35b [vulkan][ci] Vulkan tests running on linux build via swiftshader (added to docker) (#42614)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42614

Vulkan backend linux build (USE_VULKAN=1) and running Vulkan tests using software Vulkan implementation via [swiftshader](https://github.com/google/swiftshader)

The Vulkan Linux build needs the Vulkan SDK, and running the tests needs SwiftShader.
SwiftShader needs to be compiled using the clang toolchain; both were added to the bionic-clang-9 docker image.

The Vulkan SDK will be downloaded from AWS;
SwiftShader is cloned from GitHub, and as it has many submodules, the commit hash is fixed in the install_swiftshader script.

To pass all the tests:
Disabled adaptive_avg_pool2d_2, as it needs at::view, which will land in https://github.com/pytorch/pytorch/pull/42676; after that it can be re-enabled.

Changed the strides, padding, and dilation params in the tests to vectors.

Docker image rebuild:
https://app.circleci.com/pipelines/github/pytorch/pytorch/200251/workflows/465f911f-f170-47e1-954e-b9605d91abd8/jobs/6700311
Vulkan Linux Build:
https://app.circleci.com/pipelines/github/pytorch/pytorch/200251/workflows/465f911f-f170-47e1-954e-b9605d91abd8/jobs/6701604
Vulkan Linux Test:
https://app.circleci.com/pipelines/github/pytorch/pytorch/200251/workflows/465f911f-f170-47e1-954e-b9605d91abd8/jobs/6703026

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D23174038

Pulled By: IvanKobzarev

fbshipit-source-id: 431c72e31743ca0c0b82a497420f6330a311b35b
2020-08-20 18:40:32 -07:00
5cf8592663 Fix backward compatibility test (#43371)
Summary:
Drop `.out` suffix from allow_list pattern added by https://github.com/pytorch/pytorch/issues/43272

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43371

Reviewed By: pbelevich

Differential Revision: D23256914

Pulled By: malfet

fbshipit-source-id: 10168b55b98c24c84ac2676963049d1eca5c182d
2020-08-20 18:29:10 -07:00
9a1f2b3617 .circleci: Use dynamic docker image for android (#43356)
Summary:
We recently upgraded to a dynamic docker image and this android build
job was missed during that transition

Fixes https://github.com/pytorch/pytorch/issues/43338

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43356

Reviewed By: pbelevich

Differential Revision: D23253175

Pulled By: seemethere

fbshipit-source-id: 4831d4fe554a126e202e788444a63516d34b3d72
2020-08-20 17:42:26 -07:00
e10aa47615 Fix at::native::view_as_real() for ComplexHalf Tensors (#43279)
Summary:
Add a ComplexHalf case to toValueType, which fixes the logic for how view_as_real and view_as_complex slice a complex tensor into a floating-point one, as it is used to generate tensors of random complex values; see:
018b4d7abb/aten/src/ATen/native/DistributionTemplates.h (L200)
Also add the ability to convert a Python complex object to `c10::complex<at::Half>`

Add `torch.half` and `torch.complex32` to the list of `test_randn` dtypes

Fixes https://github.com/pytorch/pytorch/issues/43143
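
Per the test plan above, random ComplexHalf tensors and their real views now work (minimal sketch):

```python
import torch

z = torch.randn(4, dtype=torch.complex32)  # ComplexHalf, enabled by this fix
r = torch.view_as_real(z)
assert r.dtype == torch.float16 and r.shape == (4, 2)
```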

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43279

Reviewed By: mrshenli

Differential Revision: D23230296

Pulled By: malfet

fbshipit-source-id: b4bb66c4c81dd867e72ab7c4563d73f6a4d80a44
2020-08-20 17:38:06 -07:00
b0ec336477 [quant][graphmode][fx][test] Add per op test for graph mode quant on fx (#43229)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43229

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D23201692

fbshipit-source-id: 37fa54dcf0a9d5029f1101e11bfd4ca45b422641
2020-08-20 17:32:02 -07:00
2b7108a96f Update hardcoded pytorch_android_gradle_custom_build_single hash (#43340)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43340

This doesn't fix https://github.com/pytorch/pytorch/issues/43338 but
it gets us a little more up to date.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D23243933

Pulled By: ezyang

fbshipit-source-id: ce2773c55864d1a6f6628ba60bb9ad6aee4aba14
2020-08-20 15:37:43 -07:00
97d594b9f7 Make grad point to bucket buffer in DDP to save memory usage (#41954)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41954
Make both variable.grad() and the grad in the dist autograd context point to the bucket buffer in DDP to save memory.
In this case, grad will be a view of the bucket buffer tensors; to make this compatible with optimizer.zero_grad(), we
made changes in https://github.com/pytorch/pytorch/pull/41283.

Also note that we cannot make variable.grad() point to the bucket buffer at construction time, because we want to
keep grad undefined for unused parameters.
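
An illustration of the invariant this relies on (a minimal sketch, independent of DDP internals): zeroing a grad that is a view of the bucket buffer must happen in place, so the view into the bucket survives.

```python
import torch

bucket = torch.zeros(10)  # stand-in for a flat DDP bucket buffer
grad_view = bucket[:5]    # grad stored as a view into the bucket
grad_view.zero_()         # in-place zero, as optimizer.zero_grad() now does
assert grad_view.data_ptr() == bucket.data_ptr()  # still views the bucket
```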
ghstack-source-id: 110260297

Test Plan:
unit tests,

For roberta_base model with ~1GB parameters, peak memory dropped ~1GB (8250MB-7183MB).  Per iteration latency (0.982s ->0.909s), 8% speed up
https://www.internalfb.com/intern/fblearner/details/211713882?tab=operator_details
https://www.internalfb.com/intern/fblearner/details/211772923?tab=operator_details

For resnet model with ~97M parameters, peak memory dropped ~100MB (3089MB -> 2988MB). Per iteration latency has no change (0.122s -> 0.123s)
https://www.internalfb.com/intern/fblearner/details/211713577?tab=operator_details
https://www.internalfb.com/intern/fblearner/details/211712582?tab=operator_details

accuracy benchmark is expected as well
https://www.internalfb.com/intern/fblearner/details/213237067?tab=Outputs

Reviewed By: mrshenli

Differential Revision: D22707857

fbshipit-source-id: b5e767cfb34ccb3d067db2735482a86d59aea7a4
2020-08-20 15:33:44 -07:00
51bab0877d Fix torch.hub for new zipfile format. (#42333)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42239

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42333

Reviewed By: VitalyFedyunin

Differential Revision: D23215210

Pulled By: ailzhang

fbshipit-source-id: 161ead8b457c11655dd2cab5eecfd0edf7ae5c2b
2020-08-20 14:54:02 -07:00
dae2973fae [quant][graphmode][fx] Add graph mode quantization on fx (#43175)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43175

This PR added graph mode quantization on fx: https://github.com/pytorch/pytorch/pull/42741
Currently it matches eager mode quantization for torchvision with static/dynamic/qat;
the ddp/syncbn test is still WIP

Test Plan:
python test/test_quantization.py TestQuantizeFx

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23178602

fbshipit-source-id: 8e7e0322846fbda2cfa79ad188abd7235326f879
2020-08-20 14:50:09 -07:00
c89d2c6bf2 Replace black_list with block_list (#42088)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42088

Reviewed By: pbelevich

Differential Revision: D22794582

Pulled By: SplitInfinity

fbshipit-source-id: e256353befefa2630b99f9bcf0b79df3a7a8dcbd
2020-08-20 14:34:02 -07:00
a12fe1a242 Minor RPC doc fixes (#43337)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43337

Test Plan: Imported from OSS

Reviewed By: osalpekar

Differential Revision: D23242698

Pulled By: osalpekar

fbshipit-source-id: 7757fc43824423e3a6efd4da44c69995f64a6015
2020-08-20 14:17:07 -07:00
5006d24302 Make TensorPipe the default backend for RPC (#43246)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43246

Test Plan: Imported from OSS

Reviewed By: osalpekar

Differential Revision: D23206042

Pulled By: osalpekar

fbshipit-source-id: 258481ea9e753cd36c2787183827ca3b81d678e3
2020-08-20 14:17:02 -07:00
d0a6819b0e [ROCm] skip test_rpc in .jenkins/pytorch/test.sh (#43305)
Summary:
https://github.com/pytorch/pytorch/issues/42636 added test_rpc, but this test binary is not built for ROCm.  Skip this test for ROCm builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43305

Reviewed By: pbelevich

Differential Revision: D23233087

Pulled By: mrshenli

fbshipit-source-id: 29cd81e88a543c922a988e09d5f789becf4b74e4
2020-08-20 14:15:27 -07:00
c66ca7a48d vmap: Fix bug with x * 0.1 (#43218)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43218

Previously, `vmap(lambda x: x * 0.1)(torch.ones(3))` would return a
float64 tensor(!!). This is because there is a subtle bug in the
batching rule: the batching rule receives:
- A batched tensor for x
- a scalar tensor: tensor(0.1, dtype=torch.float64).
The batching rule decides to expand the scalar tensor to be the same
size as x and then multiplies the two tensors, promoting the output to
be a float64 tensor. However, this isn't correct: we should treat the
scalar tensor like a scalar tensor. When adding a FloatTensor to a
Double scalar tensor, we don't promote the type usually.

Another example of a bug this PR fixes is the following:
`vmap(torch.mul)(torch.ones(3), torch.ones(3, dtype=torch.float64))`
Multiplying a scalar float tensor with a scalar double tensor produces a
float tensor, but the above produced a float64 before this PR due to
mistakenly type-promoting the tensors.
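
The corrected promotion behavior, shown without vmap (minimal sketch):

```python
import torch

a = torch.ones(3)                           # float32, 1-d
s = torch.tensor(0.1, dtype=torch.float64)  # 0-dim "scalar" tensor
assert (a * s).dtype == torch.float32       # scalar tensors don't drive promotion
assert (a * torch.ones(3, dtype=torch.float64)).dtype == torch.float64
```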

Test Plan:
- new test: `pytest test/test_vmap.py -v`
- I refactored some tests a bit.

Reviewed By: cpuhrsch

Differential Revision: D23195418

Pulled By: zou3519

fbshipit-source-id: 33b7da841e55b47352405839f1f9445c4e0bc721
2020-08-20 13:44:31 -07:00
0dc41ff465 [pytorch] add flag for autograd ops to mobile builds (#43154)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43154

Adds the build flag `BUILD_MOBILE_AUTOGRAD` which toggles whether autograd files should be included for a PyTorch mobile build (default off).
ghstack-source-id: 110369406

Test Plan: CI

Reviewed By: ljk53

Differential Revision: D23061913

fbshipit-source-id: bc3d6683ab17f158990d83e4fae0a011d5adeca1
2020-08-20 12:39:55 -07:00
4fc9e958c4 [quant] Add benchmarks for embedding_bag conversion ops (#43291)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43291

Test Float2Fused and Fused2Float conversion operators for embedding_bag byte and 4-bit ops

Test Plan:
```
python -m pt.qembedding_pack_test
```

Imported from OSS

Reviewed By: radkris-git

Differential Revision: D23231641

fbshipit-source-id: a2afe51bba52980d2e96dfd7dbc183327e9349fd
2020-08-20 11:26:20 -07:00
c8bc298d6c streamline stride propagation logic in TensorIterator (#42922)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41314 among other things.
This PR streamlines layout propagation logic in TensorIterator and removes almost all cases of channels-last hardcoding. The new rules and changes are as follows:
1) behavior of undefined `output` and defined output of the wrong (e.g. 0) size is always the same (before this PR the behavior was divergent)
2) in obvious cases (unary operation on memory-dense tensors, binary operations on memory-dense tensors with the same layout) strides are propagated (before propagation was inconsistent) (see footnote)
3) in other cases the output permutation is obtained as inverse permutation of sorting inputs by strides. Sorting is done with comparator obeying the following rules: strides of broadcasted dimensions are set to 0, and 0 compares equal to anything. Strides of not-broadcasted dimensions (including dimensions of size `1`) participate in sorting. Precedence is given to the first input, in case of a tie in the first input, first the corresponding dimensions are considered, and if that does not indicate that swap is needed, strides of the same dimension in subsequent inputs are considered. See changes in `reorder_dimensions` and `compute_strides`. Note that first inspecting dimensions of the first input allows us to better recover it's permutation (and we select this behavior because it more reliably propagates channels-last strides) but in some rare cases could result in worse traversal order for the second tensor.

These rules are enough to recover previously hard-coded behavior related to channels last, so all existing tests are passing.
In general, these rules will produce intuitive results, and in most cases permutation of the full size input (in case of broadcasted operation) will be recovered, or permutation of the first input (in case of same sized inputs) will be recovered, including cases with trivial (1) dimensions. As an example of the latter, the following tensor
```
x=torch.randn(2,1,3).permute(1,0,2)
```
will produce output with the same stride (3,3,1) in binary operations with 1d tensor. Another example is a tensor of size N1H1 that has strides `H,H,1,1` when contiguous and `H, 1, 1, 1` when channels-last. The output retains these strides in binary operations when another 1d tensor is broadcasted on this one.

Footnote: for ambiguous cases where all inputs are memory dense and have the same physical layout that nevertheless can correspond to different permutations, such as e.g. NC11-sized physically contiguous tensors, regular contiguous tensor is returned, and thus permutation information of the input is lost (so for NC11 channels-last input had the strides `C, 1, C, C`, but output will have the strides `C, 1, 1, 1`). This behavior is unchanged from before and consistent with numpy, but it still makes sense to change it. The blocker for doing it currently is performance of `empty_strided`. Once we make it on par with `empty` we should be able to propagate layouts in these cases. For now, to not slow down common contiguous case, we default to contiguous.
The table below shows how in some cases current behavior loses permutation/stride information, whereas new behavior propagates permutation.
| code                                                                                                                                                                                           | old                                                   | new                                                  |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------|------------------------------------------------------|
| #strided tensors<br>a=torch.randn(2,3,8)[:,:,::2].permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride()) | (2, 24, 8) <br>(6, 3, 1) <br>(1, 12, 4) <br>(6, 3, 1) | (2, 24, 8)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) |
| #memory dense tensors<br>a=torch.randn(3,1,1).as_strided((3,1,1), (1,3,3))<br>print(a.stride(), (a+torch.randn(1)).stride())<br>a=torch.randn(2,3,4).permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride())                                                                                                                                                                                               |  (1, 3, 3) (1, 1, 1)<br>(1, 12, 4)<br>(6, 3, 1)<br>(1, 12, 4)<br>(6, 3, 1)                                                       |  (1, 3, 3) (1, 3, 3)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42922

Reviewed By: ezyang

Differential Revision: D23148204

Pulled By: ngimel

fbshipit-source-id: 670fb6188c7288e506e5ee488a0e11efc8442d1f
2020-08-20 10:50:35 -07:00
ca9d4401d4 .circleci: Remove manual docker installation (#43277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43277

Docker added native support for GPUs with the release of 19.03 and
CircleCI's infrastructure is all on Docker 19.03 as of now.

This also removes all references to `nvidia-docker` in the `.circleci` folder.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23217570

Pulled By: seemethere

fbshipit-source-id: af297c7e82bf264252f8ead10d1a154354b24689
2020-08-20 10:36:03 -07:00
66a79bf114 .circleci: Don't quote glob for conda upload (#43297)
Summary:
Globs don't get expanded if you quote them in a bash script...
apparently.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43297

Reviewed By: malfet

Differential Revision: D23227626

Pulled By: seemethere

fbshipit-source-id: d124025cfcaacbfb68167a062ca487c08f7f6bc9
2020-08-20 10:24:27 -07:00
397325a109 Make _compute_linear_combination.out a true out function (#43272)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43272

Was missing kwarg-onlyness.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D23215506

Pulled By: ezyang

fbshipit-source-id: 2c282c9a534fa8ea1825c31a24cb2441f0d6b234
2020-08-20 09:00:17 -07:00
f9a766bb39 Increase deadline time for load_save tests (#43205)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43205

A number of tests that forward to `TestLoadSaveBase.load_save` are all marked as flaky due to them regularly taking much longer to start up than hypothesis' default timeout of 200ms. This diff fixes the problem by removing the timeout for `load_save`. This is alright as these tests aren't meant to be testing the performance of these operators.

I would set the deadline to 60s if I could, however it appears that the caffe2 GitHub CI uses a different version of hypothesis that doesn't allow using `datetime.timedelta`, so instead of trying to figure out an approach that works on both I've just removed the deadline time.

I've also tagged all existing tasks WRT these failures.

Differential Revision: D23175752

fbshipit-source-id: 324f9ff034df1ac4874797f04f50067149a6ba48
2020-08-20 08:41:24 -07:00
a2ae2d3203 Nightly Pull (#43294)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40829

This addresses remaining issues/improvements in https://github.com/pytorch/pytorch/issues/40829 that were brought up prior to https://github.com/pytorch/pytorch/issues/42635 being merged.  Namely, this changes the name of the script and adds separate `checkout` and `pull` subcommands. I have tested it locally and everything appears to work.  Please let me know if you encounter any issues. I hope that this supports a more natural workflow.

CC ezyang rgommers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43294

Reviewed By: pbelevich

Differential Revision: D23241849

Pulled By: ezyang

fbshipit-source-id: c24556024d7e5d14b9a5006e927819d4ad370dd7
2020-08-20 08:34:18 -07:00
6a09df99e1 Fix ASAN error in QNNPACK's integration of qlinear_dynamic. (#41967)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41967

Test Plan: `buck test fbandroid/mode/asan xplat/assistant/oacr/nlu/tests:nlu_testsAndroid` no longer reports an error.

Reviewed By: kimishpatel, xuwenfang

Differential Revision: D22715307

Pulled By: AshkanAliabadi

fbshipit-source-id: bec7296b345125ec5243ee6e6c484246ecfca3b7
2020-08-20 07:46:34 -07:00
60b524f271 Update torch.Tensor.is_set_to documentation (#43052)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/30350

Preview:

![image](https://user-images.githubusercontent.com/5676233/90250018-69d72200-de09-11ea-8984-7401cfd6c719.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43052

Reviewed By: mrshenli

Differential Revision: D23173066

Pulled By: suraj813

fbshipit-source-id: d90a11490739068ea448d975548a71e07180bd77
2020-08-20 07:40:00 -07:00
4e964f3b97 Make Windows CUDA-11 tests master only (#43234)
Summary:
According to the correlation analysis, CUDA-10.1 and CUDA-11 test failures are strongly correlated with each other

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43234

Reviewed By: ezyang, seemethere

Differential Revision: D23204289

Pulled By: malfet

fbshipit-source-id: c53c5f87e55f2dabbb6735a0566c314c204ebc69
2020-08-19 21:05:46 -07:00
3eb31325fc refactor torch/cuda/nccl.h to remove direct dependency on NCCL in libtorch_python (#42687)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42687

Reviewed By: malfet

Differential Revision: D23145834

Pulled By: walterddr

fbshipit-source-id: c703a953a54a638852f6e5a1479ca95ae6a10529
2020-08-19 20:16:53 -07:00
6e1127ea3f [NCCL] Changed FutureNCCL's then callback logic for better efficiency. (#42869)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42869

We realized that when we invoke a simple callback that divides the tensors by `world_size` after `allreduce`, the performance was almost 50% lower in terms of QPS compared to the case where a simple `allreduce` hook is used with no `then` callback.

The main problem was that, because we called `work.wait()` before invoking the `then` callback, we were synchronizing `work`'s stream with the default PyTorch stream inside [`runHook`](https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/reducer.cpp#L609) and stalling the backward computation.

In this PR, we ensure that FutureNCCL's `then` callback does not stall the backward computation. Assuming single-process single-device, `FutureNCCL` gets a new stream from the device's pool using `at::cuda::getStreamFromPool` to run the `callback`, and before invoking the `callback` inline it synchronizes `WorkNCCL`'s stream with the callback's stream rather than the default stream.
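As a rough illustration (not the benchmark code itself; the hook shape and the availability of `get_future()`/`then()` at this point are assumptions), the pattern whose QPS regressed looks like:

```python
import torch.distributed as dist

def allreduce_then_average(grad_tensor):
    # Kick off an async allreduce and obtain a future for its completion.
    fut = dist.all_reduce(grad_tensor, async_op=True).get_future()
    # The `then` callback divides by world_size; with this change it runs on
    # a side stream synchronized with the allreduce stream, instead of
    # stalling the default stream used by backward computation.
    return fut.then(lambda f: [t / dist.get_world_size() for t in f.value()])
```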

ghstack-source-id: 110208431

Test Plan: Run performance benchmark tests to validate performance issue is resolved. Also, `python test/distributed/test_c10d.py` to avoid any odd issues.

Reviewed By: pritamdamania87

Differential Revision: D23055807

fbshipit-source-id: 60e50993f1ed97497514eac5cb1018579ed2a4c5
2020-08-19 19:42:22 -07:00
97d62bcd19 Modify Circle CI script to upload test report for analysis. (#43180)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43180

Reviewed By: VitalyFedyunin

Differential Revision: D23195934

Pulled By: walterddr

fbshipit-source-id: 5b9b411c3ea769951b5b1a456b5f7696b8ba0a92
2020-08-19 19:38:25 -07:00
0617156f0e [vulkan] fix invalid memory op and tests (#43312)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43312

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D23232809

Pulled By: IvanKobzarev

fbshipit-source-id: 11b070b6e082bac72e21dd4c25c9c675bbc8c4a3
2020-08-19 19:34:08 -07:00
aad1ff9f18 [quant][cleanup]test_qlinear_legacy should be under TestDynamicQuantizedLinear. (#40084)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40084

This is just a nit diff (hit a merge conflict while writing some unit tests).
This move was a nit left over from D21628596 (655f1ea176).

Test Plan: buck test test:quantization -- test_qlinear_legacy

Reviewed By: supriyar

Differential Revision: D22065463

fbshipit-source-id: 96ceaa53355349af7157f38b3a6366c550eeec6f
2020-08-19 18:50:46 -07:00
410d5b95b2 [jit] fix str -> Device implicit conversions (#43213)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43213

A reversed isSubtypeOf caused erroneous conversions to be inserted.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23192787

Pulled By: zdevito

fbshipit-source-id: 4a90b19d99a4fc889e55568ced850f08dadbc3fe
2020-08-19 16:05:11 -07:00
018b4d7abb Automated submodule update: FBGEMM (#43251)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 685149bbc0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43251

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: YazhiGao

Differential Revision: D23207016

fbshipit-source-id: 54e13b246bb5189260ed11316ddf3d26d52c6b24
2020-08-19 11:42:16 -07:00
eb7fc2e98f .circleci: Simplify binary upload process (#43159)
Summary:
Binary uploads were split across 3 separate scripts, making it difficult to
actually contribute changes. This simplifies that by consolidating all 3
scripts into a single script, and further by running them all in the same
job.

This also further simplifies things by separating upload jobs into their
own function under binary_build_definitions.py, since following the
conditional logic tree under the generic function was too difficult.

Testing this change here: https://github.com/pytorch/pytorch/pull/43161

Proof of success:
* [libtorch](https://app.circleci.com/pipelines/github/pytorch/pytorch/201868/workflows/54ce962f-f35b-4d97-93a7-bee186b14ead/jobs/6791347)
* [conda](https://app.circleci.com/pipelines/github/pytorch/pytorch/201868/workflows/54ce962f-f35b-4d97-93a7-bee186b14ead/jobs/6794359)
* [manywheel](https://app.circleci.com/pipelines/github/pytorch/pytorch/201868/workflows/54ce962f-f35b-4d97-93a7-bee186b14ead/jobs/6794253)

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43159

Reviewed By: malfet

Differential Revision: D23175174

Pulled By: seemethere

fbshipit-source-id: a2de64c033df99b03a124d3a0a2c92560af62c37
2020-08-19 11:34:14 -07:00
d467ac8ff0 [GLOO] handle empty split size (#43256)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43256

* Handle empty split size by moving to a call to computeLengthsAndOffsets()
* Enable GLOO alltoall python tests
ghstack-source-id: 109292763

Test Plan:
buck build mode/dev-nosan caffe2/torch/lib/c10d:ProcessGroupGlooTest

./trainer_cmd.sh -p 16 -n 8 -d gloo (modify ./trainer_cmd.sh a bit)

Reviewed By: mingzhe09088

Differential Revision: D22961600

fbshipit-source-id: b9e90dadf7b45323b8af2e6cab2e156043b7743b
2020-08-19 11:14:06 -07:00
7d10298067 Implement Tensor.to batching rule (#43206)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43206

The batching rule is the same as the unary pointwise batching rules:
given a BatchedTensor, we unwrap it, call Tensor.to, and then re-wrap
it.

Test Plan: - `pytest test/test_vmap.py -v -k`

Reviewed By: ezyang

Differential Revision: D23189053

Pulled By: zou3519

fbshipit-source-id: 51b4e41b1cd34bd082082ec4fff3c643002edbaf
2020-08-19 10:54:26 -07:00
1e248caba8 [CircleCI] Use canary images until VC++ 14.27 issue is resolved (#43220)
Summary:
Should fix binary build issue on Windows, and promptly error out if images are updated to a different version of VC++

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43220

Reviewed By: ezyang

Differential Revision: D23198530

Pulled By: malfet

fbshipit-source-id: 0c80361ad7dcfb7aaffccc306b7d741671bedc11
2020-08-19 10:28:19 -07:00
bc0e1e8ed2 Add dataclasses to base Docker images. (#43217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43217

Dataclasses is part of standard library in Python 3.7 and there
is a backport for it in Python 3.6.  Our code generation will
start using it, so add it to the default library set.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D23214028

Pulled By: ezyang

fbshipit-source-id: a2ae20b9fa8f0b22966ae48506d4ddea203e7459
2020-08-19 09:56:23 -07:00
06d43dc69a default ice-ref to c-step (#4812)
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/4812

if no compilation options are passed, default to c-step

fixed the FC and batchmatmul implementations to match C-step
fixed the fakelowp map calling to make sure we use the fp32 substitution of operators
updated the accumulator test to make it pass with fp32

Test Plan:
fakelowp tests
glow/test/numerics
net_runner

Reviewed By: jfix71

Differential Revision: D23086534

fbshipit-source-id: 3fbb8c4055bb190becb39ce8cdff6671f8558734
2020-08-19 09:50:34 -07:00
fa6b34b54c 2 Bit Embedding Conversion Operator support. (#43077)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43077

2 Bit Embedding weight conversion operation is quite similar to
4 bit embedding weight conversion.

The diff contains both the
1. 2bit packing op `embedding_bag_2bit_prepack`.
2. 2bit unpacking op `embedding_bag_2bit_unpack`.

Comments about the op are inline with the op definition.

Test Plan: buck test caffe2/test:quantization -- test_embedding_bag_2bit_unpack

Reviewed By: supriyar

Differential Revision: D23143262

fbshipit-source-id: fd8877f049ac1f7eb4bc580e588dc95f8b1edef0
2020-08-18 23:20:30 -07:00
ab366d0f5f Fix some mistakes in native_functions.yaml (#43156)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43156

- supports_named_tensor no longer does anything, so I have removed
  it.  I'm guessing these were cargo culted from some old occurrences
  of it in native_functions.yaml

- comma, not period, in variants

In my upcoming codegen rewrite, there will be strict error checking
for these cases (indeed, that is how I found these problems), so
I do not add error testing here.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23183977

Pulled By: ezyang

fbshipit-source-id: a47d342152badfb8aea248a819ad94fd93dd6ab2
2020-08-18 23:13:20 -07:00
27ec91b0c9 remove thunk fix now that ROCm CI images are >= ROCm 3.5 (#43226)
Summary:
Also, relax BUILD_ENVIRONMENT exact match to rocm when installing pip packages for tests.

CC ezyang xw285cornell sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43226

Reviewed By: colesbury

Differential Revision: D23200460

Pulled By: xw285cornell

fbshipit-source-id: 11cd889cc320d0249d7ebea4da261bfe779e82ac
2020-08-18 23:10:15 -07:00
8094228f26 update path in CI script to access ninja (#43236)
Summary:
This relaxes the assumption that test.sh will be run in the CI environment by the CI user.

CC ezyang xw285cornell sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43236

Reviewed By: colesbury

Differential Revision: D23205981

Pulled By: ezyang

fbshipit-source-id: 302743cb03c9e9c6bfcdd478a6cd920b536dc29b
2020-08-18 21:43:41 -07:00
7c923a1025 Optimize linux CI build/test matrix (#43240)
Summary:
Make CUDA-10.1 configs build-only, as the CUDA-10.1 and CUDA-10.2 test matrices are almost identical; and now that CUDA-11 is out, perhaps it's time to stop testing CUDA-10.1.
Make CUDA-9.2+GCC-5.4 an important (i.e. running on PRs) build-only config, because of the big overlap between the CUDA-9.2+GCC7 and CUDA-9.2+GCC5.4 test coverage.
Make the CUDA-11 libtorch tests important rather than the CUDA-10.2 ones.

As a result of this change, every PR will be built against CUDA-9.2, CUDA-10.2 and CUDA-11, and tested against CUDA-10.2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43240

Reviewed By: ezyang

Differential Revision: D23205129

Pulled By: malfet

fbshipit-source-id: 70932e8b2167cce9fd621115c8bf24b1c81ed621
2020-08-18 20:39:32 -07:00
e41ca2d9fa In copy_weights_to_flat_buf_views() explicitly construct tuple (#43244)
Summary:
In some versions of GCC, tuple constructor from initializer list  is marked as explicit, which results in the following compilation error:
```
/var/lib/jenkins/workspace/aten/src/ATen/native/cudnn/RNN.cpp: In function 'std::tuple<at::Tensor, std::vector<at::Tensor, std::allocator<at::Tensor> > > at::native::cudnn_rnn::copy_weights_to_flat_buf_views(at::TensorList, int64_t, int64_t, int64_t, int64_t, int64_t, bool, bool, cudnnDataType_t, const c10::TensorOptions&, bool, bool, bool)':
/var/lib/jenkins/workspace/aten/src/ATen/native/cudnn/RNN.cpp:687:35: error: converting to 'std::tuple<at::Tensor, std::vector<at::Tensor, std::allocator<at::Tensor> > >' from initializer list would use explicit constructor 'constexpr std::tuple<_T1, _T2>::tuple(_U1&&, _U2&&) [with _U1 = at::Tensor&; _U2 = std::vector<at::Tensor>&; <template-parameter-2-3> = void; _T1 = at::Tensor; _T2 = std::vector<at::Tensor>]'
     return {weight_buf, params_arr};
```
This regression was introduced by https://github.com/pytorch/pytorch/pull/42385

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43244

Reviewed By: pbelevich

Differential Revision: D23205656

Pulled By: malfet

fbshipit-source-id: 51470386ad95290c7c99d733fc1fe655aa27d009
2020-08-18 19:31:51 -07:00
d06f1818ad Fix codegen/cuda gcc-5.4 compilation issues (#43223)
Summary:
Most of the fixes address the same old enum-is-not-hashable error.
In manager.cpp, use std::unordered_map::emplace rather than `insert` to avoid an error triggered by missed copy elision.
This regression was introduced by https://github.com/pytorch/pytorch/pull/43129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43223

Reviewed By: albanD, seemethere

Differential Revision: D23198330

Pulled By: malfet

fbshipit-source-id: 576082f7a4454dd29182892c9c4e0b51a967d456
2020-08-18 17:19:07 -07:00
d5bc2a8058 Remove std::complex from c10::Half (#39833)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39833

Reviewed By: mrshenli

Differential Revision: D22644987

Pulled By: anjali411

fbshipit-source-id: 5ae5db10b12d410560eca43234efa04b711a639c
2020-08-18 15:22:36 -07:00
6c99d5611d [tensorexpr] Fix promotion of booleans (#43097)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43097

Boolean arguments weren't promoted, so if you tried to write a comparison with
types such as `Tensor(Bool) == Int` you'd fail typechecking inside the TE
engine.

Test Plan: Imported from OSS

Reviewed By: protonu, zheng-xq

Differential Revision: D23167926

Pulled By: bertmaher

fbshipit-source-id: 47091a815d5ae521637142a5c390e8a51a776906
2020-08-18 15:19:38 -07:00
da5df7e2d2 Remove use of term "blacklist" from tools/autograd/gen_python_functions.py (#42047)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41720

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42047

Reviewed By: colesbury

Differential Revision: D23197785

Pulled By: SplitInfinity

fbshipit-source-id: 8ef38518f479e5e96b6a51bc420b0df5b35b447c
2020-08-18 15:11:22 -07:00
3951457ca5 [FX] Add in resnet + quantization tests (#43157)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43157

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23173327

Pulled By: jamesr66a

fbshipit-source-id: 724d0f5399d389cdaa53917861b2113c33b9b5f9
2020-08-18 15:00:18 -07:00
dd194c1612 add _save_parameters to serialize map (#43163)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43163

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23175287

Pulled By: ann-ss

fbshipit-source-id: ddfd734513c07e8bdbec108f26d1ca1770d098a6
2020-08-18 14:58:04 -07:00
2e6e295ecc refactor _save_parameters to _save_data (#43162)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43162

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23175286

Pulled By: ann-ss

fbshipit-source-id: 6f930b98c367242fd4efbf51cb1d09995f7c4b40
2020-08-18 14:57:03 -07:00
888ae1b3d8 Introducing Matrix exponential (#40161)
Summary:
Implements (batched) matrix exponential. Fixes [https://github.com/pytorch/pytorch/issues/9983](https://github.com/pytorch/pytorch/issues/9983).

The algorithm follows:
```
 Bader, P.; Blanes, S.; Casas, F.
 Computing the Matrix Exponential with an Optimized Taylor Polynomial Approximation.
 Mathematics 2019, 7, 1174.
```
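For context, a quick usage sketch (the op landed as `torch.matrix_exp`):

```python
import torch

A = torch.randn(4, 3, 3)  # a batch of four 3x3 matrices
E = torch.matrix_exp(A)   # batched matrix exponential, same shape as A
print(E.shape)            # torch.Size([4, 3, 3])
```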

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40161

Reviewed By: zhangguanheng66

Differential Revision: D22951372

Pulled By: ezyang

fbshipit-source-id: aa068cb76d5cf71696b333d3e72cee287b3089e3
2020-08-18 14:15:10 -07:00
dfdd797723 Replace all AT_ASSERTM under ATen CUDA kernels. (#42989)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42989

Test Plan: Imported from OSS

Reviewed By: colesbury

Differential Revision: D23190011

Pulled By: ezyang

fbshipit-source-id: 7489598d7d920f32334943c1bf12bba74208a96c
2020-08-18 13:50:49 -07:00
493b3c2c7c Replace all AT_ASSERTM under ATen CPU kernels. (#41876)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41876

Test Plan: Imported from OSS

Reviewed By: colesbury

Differential Revision: D23190010

Pulled By: ezyang

fbshipit-source-id: 238f1cd8db283805d6e892de7549763d0aa13316
2020-08-18 13:49:15 -07:00
0744dd6166 Fix shapes in the MarginRankingLoss docs (#43131)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42884

I did some additional research and, considering the first few lines of the docs (`Creates a criterion that measures the loss given inputs x1, x2, two 1D mini-batch Tensors, and a label 1D mini-batch tensor y (containing 1 or -1)`) and the provided tests, this loss should be used primarily with 1-D tensors. More advanced users (who may use this loss in non-standard ways) can easily check the source and see that the definition accepts inputs/targets of arbitrary dimension as long as they match in shape or are broadcastable.
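A minimal sketch of the documented 1-D usage:

```python
import torch
import torch.nn as nn

loss = nn.MarginRankingLoss(margin=0.5)
x1, x2 = torch.randn(8), torch.randn(8)          # two 1-D mini-batches
y = (torch.randint(0, 2, (8,)) * 2 - 1).float()  # labels in {1, -1}
print(loss(x1, x2, y))
```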

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43131

Reviewed By: colesbury

Differential Revision: D23192011

Pulled By: mrshenli

fbshipit-source-id: c412c28daf9845c0142ea33b35d4287e5b65fbb9
2020-08-18 13:44:16 -07:00
fbf274f5a7 Autocast support for cudnn RNNs (#42385)
Summary:
Should close https://github.com/pytorch/pytorch/issues/36428.

The cudnn RNN API expects weights to occupy a flat buffer in memory with a particular layout.  This PR implements a "speed of light" fix:  [`_cudnn_rnn_cast_reflatten`](https://github.com/pytorch/pytorch/pull/42385/files#diff-9ef93b6a4fb5a06a37c562b83737ac6aR327) (the autocast wrapper assigned to `_cudnn_rnn`) copies weights to the right slices of a flat FP16 buffer with a single read/write per weight (as opposed to casting them to FP16 individually then reflattening the individual FP16 weights, which would require 2 read/writes per weight).
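A minimal sketch of the use case this enables (FP32 LSTM weights cast and re-flattened into a single FP16 buffer under autocast; the printed dtype assumes RNNs are on the FP16 list as this PR describes):

```python
import torch

rnn = torch.nn.LSTM(input_size=10, hidden_size=20, num_layers=2).cuda()
x = torch.randn(5, 3, 10, device="cuda")
with torch.cuda.amp.autocast():
    out, _ = rnn(x)   # weights go through the autocast wrapper for _cudnn_rnn
print(out.dtype)      # torch.float16
```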

It isn't pretty but IMO it doesn't make rnn bindings much more tortuous than they already are.

The [test](https://github.com/pytorch/pytorch/pull/42385/files#diff-e68a7bc6ba14f212e5e7eb3727394b40R2683) tries a forward under autocast and a backward for the full cross product of RNN options and input/weight/hidden dtypes.  As for all FP16list autocast tests, forward output and backward grads are checked against a control where inputs (including RNN module weights in this case) are precasted to FP16 on the python side.

Not sure who to ask for review, tagging ezyang and ngimel because Ed wrote this file (almost 2 years ago) and Natalia did the most recent major [surgery](https://github.com/pytorch/pytorch/pull/12600).

Side quests discovered:
- Should we update [persistent RNN heuristics](dbdd28207c/aten/src/ATen/native/cudnn/RNN.cpp (L584)) to include compute capability 8.0?  Could be another PR but seems easy enough to include.
- Many (maybe all?!) the raw cudnn API calls in [RNN.cpp](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cudnn/RNN.cpp) are deprecated in cudnn 8.  I don't mind taking the AI to update them since my mental cache is full of rnn stuff, but that would be a substantial separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42385

Reviewed By: zhangguanheng66

Differential Revision: D23077782

Pulled By: ezyang

fbshipit-source-id: a2afb1bdab33ba0442879a703df13dc87f03ec2e
2020-08-18 13:37:42 -07:00
0a9c35aba3 maybe minor fix to dispatch/backend_fallback_test.cpp? (#42990)
Summary:
I think you want to push rewrapped `rets`, not `args`, back to the stack.

Doesn't matter for test purposes because tests only check if/when fallbacks were called, they don't check outputs for correctness.  But it avoids reader confusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42990

Reviewed By: mrshenli

Differential Revision: D23168277

Pulled By: ezyang

fbshipit-source-id: 2559f0707acdca2e3deac09006bc66ce3c788ea3
2020-08-18 13:01:35 -07:00
e39b43fd76 Issue 43057 (#43063)
Summary:
A small change that adds a docstring that can be found with
`getattr(nn.Module, nn.Module.forward.__name__, None).__doc__`

Fixes https://github.com/pytorch/pytorch/issues/43057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43063

Reviewed By: mrshenli

Differential Revision: D23161782

Pulled By: ezyang

fbshipit-source-id: 95456f858e2b6a0e41ae551ea4ec2e78dd35ee3f
2020-08-18 12:50:53 -07:00
5d608d45cf Added Encoder Layer constructor with default parameters (#43130)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37756

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43130

Reviewed By: colesbury

Differential Revision: D23189803

Pulled By: mrshenli

fbshipit-source-id: 53f3fca838828ddd728d8b44c36745bab5acee1f
2020-08-18 11:09:49 -07:00
53bbf5a48b Update README.md (#43100)
Summary:
The changes are minor.
1. Add back the external links so that readers can find out more about external tools on how to accelerate PyTorch.
2. Fix typo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43100

Reviewed By: colesbury

Differential Revision: D23192251

Pulled By: mrshenli

fbshipit-source-id: dde54b7942ebff5bbe3d58ad95744c6d95fe60fe
2020-08-18 11:04:36 -07:00
ee74c2e5be Compress fatbin to fit into 32bit indexing (#43074)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39968

Tested with `TORCH_CUDA_ARCH_LIST='3.5 5.2 6.0 6.1 7.0 7.5 8.0+PTX'`: before this PR the build was failing, and with this PR it succeeds.

With `TORCH_CUDA_ARCH_LIST='7.0 7.5 8.0+PTX'`, the size of `libtorch_cuda.so` with symbols drops from 2.9GB to 2.2GB.

cc: ptrblck mcarilli jjsjann123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43074

Reviewed By: mrshenli

Differential Revision: D23176095

Pulled By: malfet

fbshipit-source-id: 7b3e6d049fc080e519f21e80df05ef68e7bea57e
2020-08-18 09:48:54 -07:00
b92b556a12 Add shape inference to SparseLengthsSumSparse ops (#43181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43181

att

Test Plan:
```
buck test caffe2/caffe2/opt:bound_shape_inference_test
```

Reviewed By: ChunliF

Differential Revision: D23097145

fbshipit-source-id: 3e4506308446f28fbeb01dcac97dce70c0443975
2020-08-18 09:36:53 -07:00
b3bda94393 [NVFuser] Enable E2E BCast-PWise-Reduction fusions (#43129)
Summary:
Had a bunch of merged commits that shouldn't have been there, reverted them to prevent conflicts. Lots of new features, highlights listed below.

**Overall:**

- Enables pointwise fusion, single (but N-D) broadcast -- pointwise fusion, single (but N-D) broadcast -- pointwise -- single (but N-D) reduction fusion.

**Integration:**

- Separate "magic scheduler" logic that takes a fusion and generates code generator schedule
- Reduction fusion scheduling with heuristics closely matching eagermode (unrolling supported, but no vectorize support)
- 2-stage caching mechanism: one keyed on contiguity, device, type, and operations; the other mapping input size to reduction heuristic

**Code Generation:**

- More generic support in code generation for computeAt
- Full rework of loop nest generation and Indexing to more generically handle broadcast operations
- Code generator has automatic kernel launch configuration (including automatic allocation of grid reduction buffers)
- Symbolic (runtime) tiling on grid/block dimensions is supported
- Simplified index generation based on user-defined input contiguity
- Automatic broadcast support (similar to numpy/pytorch semantics)
- Support for compile time constant shared memory buffers
- Parallelized broadcast support (i.e. block reduction -> block broadcast support)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43129

Reviewed By: mrshenli

Differential Revision: D23162207

Pulled By: soumith

fbshipit-source-id: 16deee4074c64de877eed7c271d6a359927111b2
2020-08-18 09:10:08 -07:00
c44b1de54e Pin VC++ version to 14.26 (#43184)
Summary:
VC++14.27 fails to compile mkl-dnn, see oneapi-src/oneDNN#812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43184

Reviewed By: glaringlee

Differential Revision: D23181803

Pulled By: malfet

fbshipit-source-id: 9861c6243673c775374d77d2f51b45a42791b475
2020-08-17 22:17:06 -07:00
e8db0425b5 remove dot from TH (#43148)
Summary:
small cleanup of dead code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43148

Reviewed By: mruberry

Differential Revision: D23175571

Pulled By: ngimel

fbshipit-source-id: b1b0ae9864d373c75666b95c589d090a9ca791b2
2020-08-17 21:40:44 -07:00
aef2890a75 Improve zero sized input for addmv (#41824)
Summary:
fixes https://github.com/pytorch/pytorch/issues/41340

Unfortunately, I still can not get a K80 to verify the fix, but it should be working.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41824

Reviewed By: mruberry

Differential Revision: D23172775

Pulled By: ngimel

fbshipit-source-id: aa6af96fe74e3bb07982c006cb35ecc7f18181bc
2020-08-17 20:05:31 -07:00
3c5e3966f4 [ONNX] Squeeze operator should give an error when trying to apply to a dimension with shape > 1 (#38476)
Summary:
The ONNX spec for the Squeeze operator:

> Remove single-dimensional entries from the shape of a tensor. Takes a parameter axes with a list of axes to squeeze. If axes is not provided, all the single dimensions will be removed from the shape. If an axis is selected with shape entry not equal to one, an error is raised.

Currently, as explained in issue https://github.com/pytorch/pytorch/issues/36796, it is possible to export such a model to ONNX, and this results in an exception from ONNX runtime.

Fixes https://github.com/pytorch/pytorch/issues/36796.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/38476

Reviewed By: hl475

Differential Revision: D22158024

Pulled By: houseroad

fbshipit-source-id: bed625f3c626eabcbfb2ea83ec2f992963defa19
2020-08-17 17:41:46 -07:00
cd96dfd44b Delete accidentally committed file errors.txt. (#43164)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43164

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23175392

Pulled By: gchanan

fbshipit-source-id: 0d2d918fdf4a94361cdc3344bf1bc89dd0286ace
2020-08-17 17:37:48 -07:00
57af1ec145 observers: use torch.all to check for valid min and max values (#43151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43151

Using `torch.all` instead of `torch.sum` and length check.
It's unclear whether the increase in perf (~5% for small inputs) is
real, but should be a net benefit, especially for larger channel inputs.

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23170426

fbshipit-source-id: ee5c25eb93cee1430661128ac9458a9c525df8e5
2020-08-17 17:08:57 -07:00
3264ba065c observers: use clamp instead of min/max in calculate_qparams (#43150)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43150

The current logic was expensive because it created tensors on CUDA.
Switching to clamp since it can work without needing to create tensors.
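A sketch of the change (the qmin/qmax values here are illustrative):

```python
import torch

qmin, qmax = 0, 255
zero_point = torch.randint(-50, 300, (4,))
# Before: torch.min(torch.max(...)) against freshly created tensors,
# which is expensive on CUDA. clamp takes plain Python scalars instead.
zero_point = zero_point.clamp(qmin, qmax)
```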

Test Plan:
benchmarks

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23170427

fbshipit-source-id: 6fe3a728e737aca9f6c2c4d518c6376738577e21
2020-08-17 17:08:54 -07:00
a5dfba0a6e observers: make eps a buffer (#43149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43149

This value doesn't change, so make it a buffer so we only pay
the cost of creating a tensor once.
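A minimal sketch of the pattern (the module name is illustrative):

```python
import torch

class MyObserver(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Created once at construction instead of on every forward call.
        self.register_buffer("eps", torch.tensor(torch.finfo(torch.float32).eps))
```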

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23170428

fbshipit-source-id: 6b963951a573efcc5b5a57649c814590b448dd72
2020-08-17 17:08:51 -07:00
5aa61afbfb quant bench: update observer configs (#42956)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42956

In preparation for observer perf improvement, cleans up the
micro benchmarks:
* disable CUDA for histogram observers (it's too slow)
* add larger shapes for better representation of real workloads

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qobserver_test
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D23093996

fbshipit-source-id: 5dc477c9bd5490d79d85ff8537270cd25aca221a
2020-08-17 17:07:56 -07:00
1f6e6a1166 Remove unused variable vecVecStartIdx (#42257)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42257

Reviewed By: gchanan

Differential Revision: D23109328

Pulled By: ezyang

fbshipit-source-id: dacd438395fedd1050ad3ffb81327bbb746c776c
2020-08-17 15:41:07 -07:00
133e9f96e1 Use c10 threadpool for GPU to CPU distributed autograd continuations. (#42511)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42511

DistEngine currently only has a single thread to execute GPU to CPU
continuations as part of the backward pass. This would be a significant
performance bottleneck in cases where we have such continuations and would like
to execute these using all CPU cores.

To alleviate this, in this PR we have the single thread in DistEngine only
dequeue work from the global queue, but then hand off execution of that work to
the c10 threadpool where we call "execute_graph_task_until_ready_queue_empty".

For more context please see:
https://github.com/pytorch/pytorch/issues/40255#issuecomment-663298062.
ghstack-source-id: 109997718

Test Plan: waitforbuildbot

Reviewed By: albanD

Differential Revision: D22917579

fbshipit-source-id: c634b6c97f3051f071fd7b994333e6ecb8c54155
2020-08-17 15:04:19 -07:00
825ec18eed [jit] better error message (#43093)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43093

without this it's hard to tell which module is going wrong

Test Plan:
```
> TypeError:
> 'numpy.int64' object in attribute 'Linear.in_features' is not a valid constant.
> Valid constants are:
> 1. a nn.ModuleList
> 2. a value of type {bool, float, int, str, NoneType, torch.device, torch.layout, torch.dtype}
> 3. a list or tuple of (2)
```

Reviewed By: eellison

Differential Revision: D23148516

fbshipit-source-id: b86296cdeb7b47c9fd69b5cfa479914c58ef02e6
2020-08-17 14:57:56 -07:00
864f0cfb2d Fix type annotations for torch.sparse, enable in CI (#43108)
Summary:
Closes gh-42982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43108

Reviewed By: malfet

Differential Revision: D23167560

Pulled By: ezyang

fbshipit-source-id: 0d660ca686ada2347bf440c6349551d1539f99ef
2020-08-17 14:40:11 -07:00
6db0b8785d Adds movedim method, fixes movedim docs, fixes view doc links (#43122)
Summary:
This PR:

- Adds a method variant to movedim (see the sketch after this list)
- Fixes the movedim docs so it will actually appear in the documentation
- Fixes three view doc links which were broken
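A quick sketch of the method variant:

```python
import torch

t = torch.randn(2, 3, 4)
print(torch.movedim(t, 0, -1).shape)  # torch.Size([3, 4, 2])
print(t.movedim(0, -1).shape)         # same result via the new method
```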

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43122

Reviewed By: ngimel

Differential Revision: D23166222

Pulled By: mruberry

fbshipit-source-id: 14971585072bbc04b5366d4cc146574839e79cdb
2020-08-17 14:24:52 -07:00
37252e8f00 Implement batching rules for some unary ops (#43059)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43059

This PR implements batching rules for some unary ops. In particular, it
implements the batching rules for the unary ops that take a single
tensor as input (and nothing else).

The batching rule for a unary op (see the Python sketch after this list) is:
(1) grab the physical tensor straight out of the BatchedTensor
(2) call the unary op
(3) rewrap the physical tensor in a BatchedTensor
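A Python sketch of the rule (the real implementation is in C++; `Batched` here is a stand-in for the internal BatchedTensor):

```python
from dataclasses import dataclass
import torch

@dataclass
class Batched:
    value: torch.Tensor  # physical tensor, vmap batch dim included
    bdim: int            # which physical dim is the batch dim

def unary_batching_rule(op, b: Batched) -> Batched:
    # (1) unwrap the physical tensor, (2) call the op, (3) rewrap unchanged
    return Batched(op(b.value), b.bdim)

out = unary_batching_rule(torch.exp, Batched(torch.randn(3, 4), bdim=0))
```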

Test Plan: - new tests `pytest test/test_vmap.py -v -k "Operators"`

Reviewed By: ezyang

Differential Revision: D23132277

Pulled By: zou3519

fbshipit-source-id: 24b9d7535338207531d767155cdefd2c373ada77
2020-08-17 13:38:10 -07:00
768c2a8c25 vmap: fixed to work with functools.partial (#43028)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43028

There was a bug where we always tried to grab the `__name__` attribute of
the function passed in by the user. Not all Callables have the
`__name__` attribute, an example being a Callable produced by
functools.partial.

This PR modifies the error-checking code to use `repr` if `__name__` is
not available. Furthermore, it moves the "get the name of this function"
functionality to the actual error sites as an optimization so we don't
spend time trying to compute `__repr__` for the Callable if there is no
error.
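The failure mode, in short:

```python
import functools
import torch

f = functools.partial(torch.add, alpha=2)
print(hasattr(f, "__name__"))  # False: error paths now fall back to repr(f)
```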

Test Plan: - `pytest test/test_vmap.py -v`, added new tests.

Reviewed By: yf225

Differential Revision: D23130235

Pulled By: zou3519

fbshipit-source-id: 937f3640cc4d759bf6fa38b600161f5387a54dcf
2020-08-17 13:36:49 -07:00
9c3f579528 .circleci: Copy LLVM from pre-built image (#43038)
Summary:
LLVM builds took a large amount of time and bogged down docker builds in
general. Since we build it the same for everything let's just copy it
from a pre-built image instead of building it from source every time.

Builds are defined in https://github.com/pytorch/builder/pull/491

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43038

Reviewed By: malfet

Differential Revision: D23119513

Pulled By: seemethere

fbshipit-source-id: f44324439d45d97065246caad07c848e261a1ab6
2020-08-17 11:04:35 -07:00
7cb8d68ae1 Rename XLAPreAutograd to AutogradXLA. (#43047)
Summary:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43047

Reviewed By: ezyang

Differential Revision: D23134326

Pulled By: ailzhang

fbshipit-source-id: 5fcbc23755daa8a28f9b03af6aeb3ea0603b5c9a
2020-08-17 10:47:43 -07:00
034e6727e7 Set default ATen threading backend to native if USE_OPENMP is false (#43067)
Summary:
Since OpenMP is not available on some platforms, or might be disabled by the user, set the default `ATEN_THREADING` based on the USE_OPENMP and USE_TBB options.

Fixes https://github.com/pytorch/pytorch/issues/43036

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43067

Reviewed By: houseroad

Differential Revision: D23138856

Pulled By: malfet

fbshipit-source-id: cc8f9ee59a5559baeb3f19bf461abbc08043b71c
2020-08-17 10:33:31 -07:00
aab66602c4 Add torch.dot for complex tensors (#42745)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42745

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D23056382

Pulled By: anjali411

fbshipit-source-id: c97f15e057095f78069844dbe0299c14104d2fce
2020-08-17 09:05:41 -07:00
472f291375 Fix freeze_module pass for sharedtype (#42457)
Summary:
During the cleanup phase, calling recordReferencedAttrs records
the attributes which are referenced and hence kept.
However, if you have two instances of the same type which are preserved
through the freezing process, as the added testcase shows, then while
recording the referenced attributes we iterate through the type
INSTANCES seen so far and record those.
Thus if we have another instance of the same type, we will just look at
the first instance in the list and record that instance's attributes.
This PR fixes that by traversing the getattr chains and getting the
actual instance of the getattr output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42457

Test Plan:
python test/test_jit.py TestFreezing

Reviewed By: gchanan

Differential Revision: D23106921

Pulled By: kimishpatel

fbshipit-source-id: ffff52876938f8a1fedc69b8b24a3872ea66103b
2020-08-17 08:27:31 -07:00
269fdb5bb2 prepare to split transformer header file (#43069)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43069

The transformer C++ impl needs to put TransformerEncoderLayer/DecoderLayer and TransformerEncoder/TransformerDecoder in different headers, since TransformerEncoder/Decoder's options class needs TransformerEncoderLayer/DecoderLayer as an input parameter. Split the header files to avoid cyclic inclusion.

Test Plan: Imported from OSS

Reviewed By: yf225

Differential Revision: D23139437

Pulled By: glaringlee

fbshipit-source-id: 3c752ed7702ba18a9742e4d47d049e62d2813de0
2020-08-17 07:54:05 -07:00
248b6a30f4 add training mode to mobile::Module (#42880)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42880

Enable switching between and checking for training and eval mode for torch::jit::mobile::Module using train(), eval(), and is_training(), like exists for torch::jit::Module.

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23063006

Pulled By: ann-ss

fbshipit-source-id: b79002148c46146b6e961cbef8aaf738bbd53cb2
2020-08-17 00:20:03 -07:00
e2eb0cb1a9 Adds arccosh alias for acosh and adds an alias consistency test (#43107)
Summary:
This adds the torch.arccosh alias and updates alias testing to validate the consistency of the aliased and original operations. The alias testing is also updated to run on CPU and CUDA, which revealed a memory leak when tracing (see https://github.com/pytorch/pytorch/issues/43119).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43107

Reviewed By: ngimel

Differential Revision: D23156472

Pulled By: mruberry

fbshipit-source-id: 6155fac7954fcc49b95e7c72ed917c85e0eabfcd
2020-08-16 22:12:25 -07:00
4ae832e106 Optimize SiLU (Swish) op in PyTorch (#42976)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42976

Optimize SiLU (Swish) op in PyTorch.

Some benchmark result

input = torch.rand(1024, 32768, dtype=torch.float, device="cpu")
forward: 221ms -> 133ms
backward: 600ms -> 170ms

input = torch.rand(1024, 32768, dtype=torch.double, device="cpu")
forward: 479ms -> 297ms
backward: 1438ms -> 387ms

input = torch.rand(8192, 32768, dtype=torch.float, device="cuda")
forward: 24.34ms -> 9.83ms
backward: 97.05ms -> 29.03ms

input = torch.rand(4096, 32768, dtype=torch.double, device="cuda")
forward: 44.24ms -> 30.15ms
backward: 126.21ms -> 49.68ms

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "SiLU"

Reviewed By: houseroad

Differential Revision: D23093593

fbshipit-source-id: 1ba7b95d5926c4527216ed211a5ff1cefa3d3bfd
2020-08-16 13:21:57 -07:00
d4c5f561ec Updates torch.clone documentation to be consistent with other functions (#43098)
Summary:
`torch.clone` exists but was undocumented, and the method incorrectly listed `memory_format` as a positional argument. This:

- documents `torch.clone`
- lists `memory_format` as a keyword-only argument
- wordsmiths the documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43098

Reviewed By: ngimel

Differential Revision: D23153397

Pulled By: mruberry

fbshipit-source-id: c2ea781cdcb8b5ad3f04987c2b3a2f1fe0eaf18b
2020-08-16 04:18:49 -07:00
5bcf9b017a Implement hstack, vstack, dstack (#42799)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349
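A quick usage sketch of the new NumPy-style helpers:

```python
import torch

a, b = torch.tensor([1, 2]), torch.tensor([3, 4])
print(torch.hstack((a, b)))        # tensor([1, 2, 3, 4])
print(torch.vstack((a, b)))        # tensor([[1, 2], [3, 4]])
print(torch.dstack((a, b)).shape)  # torch.Size([1, 2, 2])
```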

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42799

Reviewed By: izdeby

Differential Revision: D23140704

Pulled By: mruberry

fbshipit-source-id: 6a36363562c50d0abce87021b84b194bb32825fb
2020-08-15 20:39:14 -07:00
8864148823 [jit] DeepAndWide benchmark (#43096)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43096

Add benchmark script for deep and wide model.

Reviewed By: bwasti, yinghai

Differential Revision: D23099925

fbshipit-source-id: aef09d8606eba1eccc0ed674dfea59b890d3648b
2020-08-15 01:27:12 -07:00
91f3114fc1 [JIT] Represent profiled types as a node attribute (#43035)
Summary:
This changes profiled types from being represented as:
`%23 : Float(4:256, 256:1, requires_grad=0, device=cpu) = prim::profile(%0)`
->
`%23 : Tensor = prim::profile[profiled_type=Float(4:256, 256:1, requires_grad=0, device=cpu)](%0)`

Previously, by representing the profiled type in the IR directly it was very easy for optimizations to accidentally use profiled types without inserting the proper guards that would ensure that the specialized type would be seen.

It would be a nice follow up to extend this to prim::Guard as well, however we have short term plans to get rid of prim::Guard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43035

Reviewed By: ZolotukhinM

Differential Revision: D23120226

Pulled By: eellison

fbshipit-source-id: c78d7904edf314dd65d1a343f2c3a947cb721b32
2020-08-14 20:17:46 -07:00
19902f6c0e Document unavailable reduction ops with NCCL backend (#42822)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42822

These ops aren't supported with the NCCL backend and used to silently error.
We disabled them as part of addressing https://github.com/pytorch/pytorch/issues/41362, so
document that here.
ghstack-source-id: 109957761

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23023046

fbshipit-source-id: 45d69028012e0b6590c827d54b35c66cd17e7270
2020-08-14 19:08:28 -07:00
06aaf8c20d Add set_device_map to TensorPipeOptions to support GPU args (#42637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42637

This commit enables sending non-CPU tensors through RPC using the
TensorPipe backend. Users can configure device mappings by calling
set_map_location on `TensorPipeRpcBackendOptions`. Internally,
the `init_rpc` API verifies the correctness of the device mappings; it
will shut down RPC if the check fails, or proceed and pass the global
mappings to `TensorPipeAgent` if the check succeeds. For serde,
we added a device-indices field to the TensorPipe read and write buffers,
which should be either empty (all tensors must be on CPU) or match
the tensors in order and number in the RPC message. This commit
does not yet achieve zero-copy: the tensor is always moved to CPU
on the sender and then moved to the specified device on the receiver.

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D23011572

Pulled By: mrshenli

fbshipit-source-id: 62b617eed91237d4e9926bc8551db78b822a1187
2020-08-14 18:46:55 -07:00
c84f78470b Fix type annotations for a number of torch.utils submodules (#42711)
Summary:
Related issue on `torch.utils` type annotation hiccups: gh-41794

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42711

Reviewed By: mrshenli

Differential Revision: D23005434

Pulled By: malfet

fbshipit-source-id: 151554b1e7582743f032476aeccdfdad7a252095
2020-08-14 18:12:48 -07:00
bcf54f9438 Stop treating ASAN as special case (#43048)
Summary:
Add "asan" node to a `CONFIG_TREE_DATA` rather than hardcoded that non-xla clang-5 is ASAN

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43048

Reviewed By: houseroad

Differential Revision: D23126296

Pulled By: malfet

fbshipit-source-id: 22f02067bb2f5435a0e963a6c722b9c115ccfea4
2020-08-14 17:24:05 -07:00
0cf4a5bccb Add GCC codecoverage flags (#43066)
Summary:
Rename the `CLANG_CODE_COVERAGE` option to `CODE_COVERAGE` and add compiler-specific flags for GCC and Clang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43066

Reviewed By: scintiller

Differential Revision: D23137488

Pulled By: malfet

fbshipit-source-id: a89570469692f878d84f7da6f9d5dc01df423e80
2020-08-14 17:16:18 -07:00
91b090ceaf Add polygamma where n >= 2 (#42499)
Summary:
https://github.com/pytorch/pytorch/issues/40980

I had a few questions come up while implementing the polygamma function,
so I made this PR prior to completing it.

1. Some code blocks were brought in from the cephes library (and I did the same):
```
/*
 * The following function comes with the following copyright notice.
 * It has been released under the BSD license.
 *
 * Cephes Math Library Release 2.8:  June, 2000
 * Copyright 1984, 1987, 1992, 2000 by Stephen L. Moshier
 */
```
Is it okay for me to use cephes code with this same copyright notice (already present in the PyTorch codebase)?

2. There is no linting in the internal ATen library (as far as I know; I read https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md). How do I make sure my code follows the appropriate guidelines of this library?

3. Actually, there are already digamma and trigamma functions.
digamma is needed; however, trigamma becomes redundant once polygamma is added.
Is it okay for trigamma to stay, or should it be removed?

Btw, the CPU version now works fine with 3rd-order polygamma (which is what we need to play with variational inference on beta/gamma distributions), and I'm going to finish the GPU version soon.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42499

Reviewed By: gchanan

Differential Revision: D23110016

Pulled By: albanD

fbshipit-source-id: 246f4c2b755a99d9e18a15fcd1a24e3df5e0b53e
2020-08-14 17:00:24 -07:00
4011685a8b [fx] split Node into Node/Proxy (#42991)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42991

Having Node be both a record of the operator in the graph and the way we
_build_ the graph made it difficult to keep the IR data structure separate
from the proxying logic in the builder.

Among other issues this means that typos when using nodes would add
things to the graph:
```
    for node in graph.nodes:
        node.grph # does not error, returns an node.Attribute object!
```

This separates the builder into a Proxy object. Graph/Node no longer
need to understand `delegate` objects since they are now just pure IR.
This separates the `symbolic_trace` (proxy.py/symbolic_trace.py) from
the IR (node.py, graph.py).

This also allows us to add `create_arg` to the delegate object,
allowing the customization of how aggregate arguments are handled
when converting to a graph.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23099786

Pulled By: zdevito

fbshipit-source-id: 6f207a8c237e5eb2f326b63b0d702c3ebcb254e4
2020-08-14 16:45:21 -07:00
a1a6e1bc91 Fix warning: dynamic initialization in unreachable code. (#43065)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43065

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D23136883

Pulled By: ZolotukhinM

fbshipit-source-id: 878f6af13ff8df63fef5f34228f7667ee452dd95
2020-08-14 16:08:32 -07:00
66b3382c5b [quant] Add torchbind support for embedding_bag packed weights (#42881)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42881

This enables serialization/de-serialization of embedding packed params using getstate/setstate calls.
Added version number to deal with changes to serialization formats in future.

This can be extended in the future to support 4-bit/2-bit once we add support for that.

Test Plan:
python test/test_quantization.py TestQuantizedEmbeddingBag

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23070634

fbshipit-source-id: 2ca322ab998184c728be6836f9fd12cec98b2660
2020-08-14 16:05:27 -07:00
7632a9b090 [quant] Add embeddingbag_prepack function that works on quantized tensor. (#42762)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42762

Use a prepack function that accepts qtensor as an input. The output is a byte tensor with packed data.
This is currently implemented only for 8-bit. In the future once we add 4-bit support this function will be extended to support that too.

Note: in the following change I will add TorchBind support for this, to enable serialization of packed weights.

Test Plan:
python test/test_quantization.py TestQuantizedEmbeddingBag

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23070632

fbshipit-source-id: 502aa1302dffec1298cdf52832c9e2e5b69e44a8
2020-08-14 16:02:57 -07:00
450315198a Fix a casting warning (#42451)
Summary:
Fix an annoying casting warning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42451

Reviewed By: yf225

Differential Revision: D22993194

Pulled By: ailzhang

fbshipit-source-id: f317a212d4e768d49d24f50aeff9c003be2fd30a
2020-08-14 15:47:02 -07:00
3d8c144400 Implemented torch::nn::Unflatten in libtorch (#42613)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42613

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D23030302

Pulled By: heitorschueroff

fbshipit-source-id: 954f1cdfcbd3a62a7f0e887fcf5995ef27222a87
2020-08-14 15:32:13 -07:00
33c5fe3c1d Enable test_logit FakeLowP test. (#43073)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43073

Enable test_logit FakeLowP test.

Test Plan: test_op_nnpi_fp16.py

Reviewed By: hyuen

Differential Revision: D23141375

fbshipit-source-id: cb7e7879487e33908b14ef401e1ab05fda193d28
2020-08-14 14:49:29 -07:00
5014cf4a4d Export MergeIdLists Caffe2 Operator to PyTorch
Summary: As titled.

Test Plan: buck test //caffe2/caffe2/python/operator_test:torch_integration_test -- test_merge_id_lists

Reviewed By: yf225

Differential Revision: D23076951

fbshipit-source-id: c37dfd93003590eed70b0d46e0151397a402dde6
2020-08-14 14:46:17 -07:00
c8e789e06e add fake fp16 fusions to net transforms (#42927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42927

Added fp16 fusion to net transforms.
Refactored the transforms as well as glow_transform to move them out of opt/custom so that the OSS builds pass.

Test Plan: added net runner tests for this

Reviewed By: yinghai

Differential Revision: D23080881

fbshipit-source-id: ee6451811fedfd07c6560c178229854bca29301f
2020-08-14 13:30:27 -07:00
1c6ace87d1 Embed torch.nn typing annotations (#43044)
Summary:
Delete several .pyi files and embed annotations from those files in respective .py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43044

Reviewed By: ezyang

Differential Revision: D23123234

Pulled By: malfet

fbshipit-source-id: 4ba361cc84402352090523924b0035e100ba48b1
2020-08-14 13:24:58 -07:00
fcc10d75e1 [JIT] Add property support to TorchScript classes (#42389)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42389

**Summary**
This commit adds support for properties to TorchScript classes,
specifically for getters and setters. They are implemented essentially
as pointers to the methods that the corresponding decorators decorate,
which are treated like regular class methods. Deleters for properties
are considered to be out of scope (and probably useless for TorchScript
anyway).
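A minimal sketch of what this enables (the class body is illustrative):

```python
import torch

@torch.jit.script
class Pair(object):
    def __init__(self):
        self._x = 0

    @property
    def x(self):          # getter, compiled like a regular method
        return self._x

    @x.setter
    def x(self, v: int):  # setter, likewise a pointer to a plain method
        self._x = v
```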

**Test Plan**
This commit adds a unit test for a class with a property that has both
getter and setter and one that has only a getter.

`python test/test_jit.py TestClassType.test_properties`

Test Plan: Imported from OSS

Reviewed By: eellison, ppwwyyxx

Differential Revision: D22880232

Pulled By: SplitInfinity

fbshipit-source-id: 4828640f4234cb3b0d4f3da4872a75fbf519e5b0
2020-08-14 12:56:57 -07:00
64a7684219 Enable typechecking of collect_env.py during CI (#43062)
Summary:
No type annotations can be added to the script, as it still has to be Python-2 compliant.
Make changes to avoid variable type redefinition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43062

Reviewed By: zou3519

Differential Revision: D23132991

Pulled By: malfet

fbshipit-source-id: 360c02e564398f555273e5889a99f834a5467059
2020-08-14 12:46:42 -07:00
1f6d0985d7 fix searchsorted output type (#42933)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41389
Make sure that searchsorted, which returns an integer type, does not mark its outputs as requiring gradients.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42933

Reviewed By: gchanan

Differential Revision: D23109583

Pulled By: albanD

fbshipit-source-id: 5af300b2f7f3c140d39fd7f7d87799f7b93a79c1
2020-08-14 12:34:51 -07:00
059aa34b12 Clip Binomial results for different endpoints in curand_uniform (#42702)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42153

As [documented](https://docs.nvidia.com/cuda/curand/device-api-overview.html) (search for `curand_uniform` on the page), `curand_uniform` returns "from 0.0 to 1.0, where 1.0 is included and 0.0 is excluded." These endpoints are different from the CPU equivalent, which makes the calculation in this PR fail when the value is 1.0.

The test from the issue is added, it failed for me consistently before the PR even though I cut the number of samples by 10.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42702

Reviewed By: gchanan

Differential Revision: D23107451

Pulled By: ngimel

fbshipit-source-id: 3575d5b8cd5668e74b5edbecd95154b51aa485a1
2020-08-14 12:01:17 -07:00
71bbd5f1d4 Add back Tensor.nonzero type annotation (#43053)
Summary:
Closes gh-42998

The issue is marked for 1.6.1, if there's anything I need to do for a backport please tell me what that is.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43053

Reviewed By: izdeby

Differential Revision: D23131708

Pulled By: malfet

fbshipit-source-id: 2744bacce6bdf6ae463c17411b672f09707e0887
2020-08-14 11:41:19 -07:00
75dfa5a459 Remove itruediv because it's already defined in torch/tensor.py (#42962)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42955

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42962

Reviewed By: mruberry

Differential Revision: D23111523

Pulled By: malfet

fbshipit-source-id: ecab7a4aae1fe556753b8d6528cae1ae201beff3
2020-08-14 11:36:23 -07:00
1c616c5ab7 Add complex tensor dtypes for the __cuda_array_interface__ spec (#42918)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42860

The `__cuda_array_interface__` tensor specification is missing the appropriate datatypes for the newly merged complex64 and complex128 tensors. This PR addresses this issue by casting (see the quick check after the list):

* `torch.complex64` to 'c8'
* `torch.complex128` to 'c16'
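A quick check of the new mapping (requires a CUDA device):

```python
import torch

t = torch.zeros(4, dtype=torch.complex64, device="cuda")
print(t.__cuda_array_interface__["typestr"])  # '<c8'
```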

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42918

Reviewed By: izdeby

Differential Revision: D23130219

Pulled By: anjali411

fbshipit-source-id: 5f8ee8446a71cad2f28811afdeae3a263a31ad11
2020-08-14 10:26:23 -07:00
c3fb152274 Test the type promotion between every two dtypes thoroughly (#42585)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41842

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42585

Reviewed By: izdeby

Differential Revision: D23126759

Pulled By: mruberry

fbshipit-source-id: 8337e02f23a4136c2ba28c368f8bdbd28400de44
2020-08-14 10:05:10 -07:00
ff6a2b0b7a Add inplace option for torch.nn.Hardsigmoid and torch.nn.Hardswish layers (#42346)
Summary:
**`torch.nn.Hardsigmoid`** and **`torch.nn.Hardswish`** classes currently do not support `inplace` operations, as they use the `torch.nn.functional.hardsigmoid` and `torch.nn.functional.hardswish` functions with their default `inplace` argument, which is `False`.

So, I added an `inplace` argument to the `torch.nn.Hardsigmoid` and `torch.nn.Hardswish` classes so that the forward operation can also be done in place while using these layers.
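
A minimal sketch of the new flag:

```python
import torch

m = torch.nn.Hardswish(inplace=True)
x = torch.randn(8)
y = m(x)
# the result aliases the input buffer instead of allocating a new tensor
assert y.data_ptr() == x.data_ptr()
```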

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42346

Reviewed By: izdeby

Differential Revision: D23108487

Pulled By: albanD

fbshipit-source-id: 0767334fa10e5ecc06fada2d6469f3ee1cacd957
2020-08-14 10:01:31 -07:00
2f9fd8ad29 Build test_e2e_tensorpipe only if Gloo is enabled (#43041)
Summary:
test_e2e_tensorpipe depends on ProcessGroupGloo and therefore cannot be tested with Gloo disabled.
Otherwise, it re-introduces https://github.com/pytorch/pytorch/issues/42776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43041

Reviewed By: lw

Differential Revision: D23122101

Pulled By: malfet

fbshipit-source-id: a8a088b6522a3bc888238ede5c2d589b83c6ea94
2020-08-14 09:24:47 -07:00
31788ae151 Trim trailing whitespace
Test Plan: CI

Reviewed By: linbinyu

Differential Revision: D23108919

fbshipit-source-id: 913c982351a94080944f350641d7966c6c2cc508
2020-08-14 09:18:40 -07:00
a2b86d95d1 Make Mish support large inputs. (#43037)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43037

In the previous version of mish_op.cc, the output would be 'nan' for large inputs. We rewrite mish_op.cc to solve this problem.

Test Plan:
Unit test
buck test //dper3/dper3/modules/tests:core_modules_test -- test_linear_compress_embedding_with_attention_with_activation_mish
{F284052906}

buck test mode/opt //dper3/dper3_models/ads_ranking/tests:model_paradigm_e2e_tests -- test_sparse_nn_with_mish
{F284224158}

## Workflow
f212113434

{F285281318}

Differential Revision: D23102644

fbshipit-source-id: 98f1ea82f8c8e05b655047b4520c600fc1a826f4
2020-08-14 08:53:16 -07:00
c7d2774d20 Fix typo in collect_env.py (#43050)
Summary:
Minor fix for a typo introduced in yesterday's PR: https://github.com/pytorch/pytorch/pull/42961

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43050

Reviewed By: ezyang, malfet

Differential Revision: D23130936

Pulled By: zou3519

fbshipit-source-id: e8fa2bf155ab6a5988c74e8345278d8d70855894
2020-08-14 08:33:35 -07:00
d60d6d0d7b Automated submodule update: FBGEMM (#42834)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 29d5eb9f3c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42834

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D23040145

fbshipit-source-id: 1d7209ea1910419b7837703122b8a4c76380ca4a
2020-08-14 05:43:20 -07:00
ed242cbec5 Guard TensorPipe agent by USE_TENSORPIPE (#42682)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42682

ghstack-source-id: 109834351

Test Plan: CI

Reviewed By: malfet

Differential Revision: D22978717

fbshipit-source-id: 18b7cbdb532e78ff9259e82f0f92ad279124419d
2020-08-14 02:57:36 -07:00
ccd9f3244b Get, save, and load module information for each operator (#42133)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42133

Test Plan:
We save a module with module debugging information as follows.
```
import torch
m = torch.jit.load('./detect.pt')
# Save module without debug info
m._save_for_lite_interpreter('./detect.bc')
# Save module with debug info
m._save_for_lite_interpreter('./detect.bc', _save_debug_info_in_bytecode=True)
```
Size of the file without module debugging information: 4.508 MB
Size of the file with module debugging information: 4.512 MB

Reviewed By: kimishpatel

Differential Revision: D22803740

Pulled By: taivu1998

fbshipit-source-id: c82ea62498fde36a1cfc5b073e2cea510d3b7edb
2020-08-14 01:25:27 -07:00
e182ec97b3 Fix illegal memory access issue for CUDA version of SplitByLengths operator.
Summary:
1. Fix illegal memory access issue for the SplitByLengths operator in the CUDA context.
2. Add support for scaling lengths vectors in the SplitByLengths operator.
3. Add support for testing the SplitByLengths operator in the CUDA context.

Example of the SplitByLengths operator processing a scaling lengths vector:
value vector A = [1, 2, 3, 4, 5, 6]
lengths vector B = [1, 2]
The lengths in B sum to 3, so each is scaled by 2 to cover all 6 values; after execution of the SplitByLengths operator,
the output should be [1, 2] and [3, 4, 5, 6].
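
A sketch of the example above using the Caffe2 Python frontend (hedged: the exact operator arguments needed to enable length scaling may differ from what is shown here):

```python
import numpy as np
from caffe2.python import core, workspace

workspace.FeedBlob("A", np.array([1, 2, 3, 4, 5, 6], dtype=np.float32))
workspace.FeedBlob("B", np.array([1, 2], dtype=np.int32))
op = core.CreateOperator("SplitByLengths", ["A", "B"], ["out0", "out1"])
workspace.RunOperatorOnce(op)
# with scaling, lengths [1, 2] are scaled by 2 to cover all 6 elements
print(workspace.FetchBlob("out0"), workspace.FetchBlob("out1"))
```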

Test Plan: buck test mode/dev-nosan caffe2/caffe2/python/operator_test:concat_split_op_test

Reviewed By: kennyhorror

Differential Revision: D23079841

fbshipit-source-id: 3700e7f2ee0a5a2791850071fdc16e5b054f8400
2020-08-14 01:04:08 -07:00
b8102b1550 Implement torch.nextafter (#42580)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349.
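
A quick illustration of the new op:

```python
import torch

a = torch.tensor([1.0], dtype=torch.float64)
b = torch.tensor([2.0], dtype=torch.float64)
# the next representable double after 1.0, in the direction of 2.0
print(torch.nextafter(a, b))  # tensor([1.0000000000000002], dtype=torch.float64)
```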

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42580

Reviewed By: smessmer

Differential Revision: D23012260

Pulled By: mruberry

fbshipit-source-id: ce82a63c4ad407ec6ffea795f575ca7c58cd6137
2020-08-14 00:35:30 -07:00
e4373083a2 torch.complex and torch.polar (#39617)
Summary:
For https://github.com/pytorch/pytorch/issues/35312 and https://github.com/pytorch/pytorch/issues/38458#issuecomment-636066256.
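
A short sketch of the two constructors:

```python
import math
import torch

real = torch.tensor([1.0, -2.0])
imag = torch.tensor([3.0, 4.0])
z = torch.complex(real, imag)    # tensor([ 1.+3.j, -2.+4.j])

# polar(abs, angle) = abs * (cos(angle) + i * sin(angle))
w = torch.polar(torch.tensor([2.0]), torch.tensor([math.pi / 2]))
print(w)                         # approximately tensor([0.+2.j])
```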

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39617

Reviewed By: zhangguanheng66

Differential Revision: D23083926

Pulled By: anjali411

fbshipit-source-id: 1874378001efe2ff286096eaf1e92afe91c55b29
2020-08-14 00:30:11 -07:00
b9a105bcc0 [TensorExpr] Cleanup logic in the TensorExpr fuser pass. (#42938)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42938

1. Structure the logic in a more straightforward way: instead of magic
   tricks with node iterators in a block, we now have a function that
   tries to create a fusion group starting from a given node (and pulls
   everything it can into it).
2. The order in which we're pulling nodes into a fusion group is now
   more apparent.
3. The new pass structure automatically allows us to support fusion
   groups of size=1.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23084409

Pulled By: ZolotukhinM

fbshipit-source-id: d59fc00c06af39a8e1345a4aed8d829494db084c
2020-08-13 23:49:42 -07:00
fc304bec9f [TensorExpr] Remove redundant checks from canHandle in TE fuser. (#42937)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42937

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23084408

Pulled By: ZolotukhinM

fbshipit-source-id: 8e562e25ecc73b4e7b01e30f8b282945b96b4871
2020-08-13 23:49:40 -07:00
48c183af3d [TensorExpr] Wrap fuser in a class. (#42936)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42936

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D23084407

Pulled By: ZolotukhinM

fbshipit-source-id: f622874efbcbf8d4e49c8fa519a066161ebe4877
2020-08-13 23:48:16 -07:00
02c8ad70f2 Reconstruct scopes (#41615)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41615

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D22611331

Pulled By: taivu1998

fbshipit-source-id: d4ed4cf6360bc1f72ac9fa24bb4fcf6b7d9e7576
2020-08-13 22:38:16 -07:00
3dc845319f Add more verbose error message about PackedSequence lengths argument (#42891)
Summary:
Add the given tensor's dimensionality, device, and dtype to the error message

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42891

Reviewed By: ezyang

Differential Revision: D23068769

Pulled By: malfet

fbshipit-source-id: e49d0a5d0c10918795c1770b4f4e02494d799c51
2020-08-13 22:33:34 -07:00
b992a927a9 Clearer Semantics and Naming for Customized Quantization Range Initialization in Observer (#42602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42602

In this diff, clearer semantics and naming are introduced by splitting the original `init_dynamic_qrange` into 2 separate `Optional[int]` parameters, `qmin` and `qmax`, to avoid confusing these parameters with dynamic quantization.

The `qmin` and `qmax` parameters allow users to specify a custom quantization range and enable specific use cases for lower-bit quantization.

Test Plan:
To assert the correctness and compatibility of the changes with existing observers, on a devvm, execute the following command to run the unit tests:

`buck test //caffe2/test:quantization -- observer`

Reviewed By: vkuzo, raghuramank100

Differential Revision: D22948334

fbshipit-source-id: 275bc8c9b5db4ba76fc2e79ed938376ea4f5a37c
2020-08-13 21:15:23 -07:00
a55b7e2a6d [reland][quant][fix] Remove activation_post_process in qat modules (#42343) (#43015)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43015

Currently, activation_post_process modules are inserted by default in QAT modules, which is not
friendly to automatic quantization tools; this PR removes them.

Test Plan:
Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23105059

fbshipit-source-id: 3439ac39e718ffb0390468163bcbffd384802b57
2020-08-13 20:44:14 -07:00
8cf01c5c35 Back out "change pt_defs.bzl to python file"
Summary: Original commit changeset: d720fe2e684d

Test Plan: CIs

Reviewed By: linbinyu

Differential Revision: D23114839

fbshipit-source-id: fda570b5e989a51936a6c5bc68f0e60c6f6b4b82
2020-08-13 20:33:12 -07:00
830423b80b Python/C++ API Parity: TransformerDecoderLayer (#42717)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37756

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42717

Reviewed By: zhangguanheng66

Differential Revision: D23095841

Pulled By: glaringlee

fbshipit-source-id: 327a5a23c9a3cca05e422666a6d7d802a7e8c468
2020-08-13 20:31:13 -07:00
85752b989d [quant][doc] Print more info for fake quantize module (#43031)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43031

fixes: https://github.com/pytorch/pytorch/issues/43023

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23116200

fbshipit-source-id: faa90ce8711da0785d635aacd0362c45717cfacc
2020-08-13 20:27:36 -07:00
523b2ce9c6 [jit][static runtime] Simplify the graph and add operator whitelist (#43024)
Summary:
This PR whitelists and simplifies graphs to help with development later on.  Key to note in this PR is the use of both a pattern substitution and the registration of custom operators.  This will likely be one of the main optimization types done in this folder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43024

Reviewed By: hlu1

Differential Revision: D23114262

Pulled By: bwasti

fbshipit-source-id: e25aa3564dcc8a2b48cfd1561b3ee2a4780ae462
2020-08-13 20:19:55 -07:00
89b0b3bc8c Allow RPC to be initialized again after shutdown. (#42723)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42723

This PR is addressing https://github.com/pytorch/pytorch/issues/39340
and allows users to initialize RPC again after shutdown. Major changes in the
PR include:

1. Change to DistAutogradContainer to support this.
2. Ensure PythonRpcHandler is reinitialized appropriately.
3. Use PrefixStore in RPC initialization to ensure each new `init_rpc` uses a
different prefix.
ghstack-source-id: 109805368

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D22993909

fbshipit-source-id: 9f1c1e0a58b58b97125f41090601e967f96f70c6
2020-08-13 20:18:34 -07:00
21823aa680 Nightly checkout tool (#42635)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40829

This is cross-platform but I have only tried it on linux, personally. Also, I am not fully certain of the usage pattern, so if there are any additional features / adjustments / tests that you want me to add, please just let me know!

CC ezyang rgommers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42635

Reviewed By: zhangguanheng66

Differential Revision: D23078663

Pulled By: ezyang

fbshipit-source-id: 5c8c8abebd1d462409c22dc4301afcd8080922bb
2020-08-13 20:07:18 -07:00
a6b69fdd33 Add DDP+RPC tutorial to RPC docs page. (#42828)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42828

ghstack-source-id: 109855425

Test Plan: waitforbuildbot

Reviewed By: jlin27

Differential Revision: D23037016

fbshipit-source-id: 250f322b652b86257839943309b8f0b8ce1bb25b
2020-08-13 19:41:06 -07:00
3544f60f76 make deadline=None for all numerics tests (#43014)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43014

Changing this behavior mimics the behavior of the old hypothesis
testing library.
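
For reference, a sketch of the hypothesis knob involved (an illustrative test, not the actual change):

```python
from hypothesis import given, settings, strategies as st

@settings(deadline=None)  # do not fail a test because one example ran slowly
@given(st.integers())
def test_roundtrip(x):
    assert int(str(x)) == x
```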

Test Plan: ran all tests on devserver

Reviewed By: hl475

Differential Revision: D23085949

fbshipit-source-id: 433fdfbb04b6a609b738eb7c319365049a49579b
2020-08-13 16:48:31 -07:00
8b5642a786 Fix to Learnable Fake Quantization Op Benchmarking (#43018)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43018

In this diff, a fix is added: the original non-learnable fake quantize was provided with trainable scale and zero point, whereas requires_grad for both parameters should be completely disabled.

Test Plan:
Use the following command to execute the benchmark test:

`buck test mode/dev-nosan pt:quantization_test`

Reviewed By: vkuzo

Differential Revision: D23107846

fbshipit-source-id: d2213983295f69121e9e6ae37c84d1f37d78ef39
2020-08-13 16:32:13 -07:00
6753157c5a Enable torch.utils typechecks (#42960)
Summary:
Fix typos in torch.utils/_benchmark/README.md
Add empty __init__.py to examples folder to make example invocations from README.md correct
Fix uniform distribution logic generation when minval and maxval are None

Fixes https://github.com/pytorch/pytorch/issues/42984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42960

Reviewed By: seemethere

Differential Revision: D23095399

Pulled By: malfet

fbshipit-source-id: 0546ce7299b157d9a1f8634340024b10c4b7e7de
2020-08-13 15:24:56 -07:00
eb47940c0a Add executor and fuser options to the fastrnn test fixture (#42946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42946

There are 3 options for the executor and fuser and some of them aren't
super interesting so I've combined the options into a single parameter, but
made it fairly easy to expand the set if there are other configs we might care
about.

Test Plan:
Benchmark it

Imported from OSS

Reviewed By: zheng-xq

Differential Revision: D23090177

fbshipit-source-id: bd93a93c3fc64e5a4a847d1ce7f42ce0600a586e
2020-08-13 12:45:37 -07:00
fd5ed4b6d6 Update ort-nightly version to dev202008122 (#43019)
Summary:
Fixes caffe2_onnx_ort1_py3_6_clang7_ubuntu16_04 test failures

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43019

Reviewed By: gchanan

Differential Revision: D23108767

Pulled By: malfet

fbshipit-source-id: 0131cf4ac0bf93d3d93cb0c97a888f1524e87472
2020-08-13 11:40:16 -07:00
816d37b1d8 [quant] Make PerChannel Observer work with float qparams (#42690)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42690

Add implementation for new qscheme per_channel_affine_float_qparams in observer

Test Plan:
python test/test_quantization.py TestObserver.test_per_channel_observers

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23070633

fbshipit-source-id: 84d348b0ad91e9214770131a72f7adfd3970349c
2020-08-13 11:22:19 -07:00
6f8446840e [quant] Create PerRowQuantizer for floating point scale and zero_point (#42612)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42612

Add a new Quantizer that supports an input zero point (bias) that can be float.
The quantization equation in this case is

Xq = (Xf - bias) * inv_scale, where bias is the float zero_point value.
We start with a per-row implementation and can extend to per-tensor in the future, if necessary.
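
A tiny numeric sketch of the equation above (illustrative values only):

```python
import torch

xf = torch.tensor([0.5, 1.5, 2.5])
bias, scale = 0.5, 0.25               # float zero_point and scale
xq = torch.round((xf - bias) / scale)
print(xq)                             # tensor([0., 4., 8.])
```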

Test Plan:
python test/test_quantization.py TestQuantizedTensor

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D22960142

fbshipit-source-id: ca9ab6c5b45115d3dcb1c4358897093594313706
2020-08-13 11:20:53 -07:00
0ff51accd8 collect_env.py: Print CPU architecture after Linux OS name (#42961)
Summary:
Missed this case in https://github.com/pytorch/pytorch/pull/42887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42961

Reviewed By: zou3519

Differential Revision: D23095264

Pulled By: malfet

fbshipit-source-id: ff1fb0eba9ecd29bfa3d8f5e4c3dcbcb11deefcb
2020-08-13 10:49:15 -07:00
ebc7ebc74e Do not ignore torch/__init__.pyi (#42958)
Summary:
Delete the abovementioned entry from .gitignore, as the file is gone since https://github.com/pytorch/pytorch/issues/42908 and should no longer be autogenerated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42958

Reviewed By: seemethere

Differential Revision: D23094391

Pulled By: malfet

fbshipit-source-id: af303477301ae89d6f283e34d7aeddeda7a9260f
2020-08-13 10:29:58 -07:00
6fb5ce5569 [NNC] Fix some bugs in Round+Mod simplification (#42934)
Summary:
When working on the CUDA codegen, I found that running the IRSimplifier before generating code led to test failures. This was due to a bug in Round+Mod simplification (e.g. `(x / y * y) + (x % y) => x`) related to the order in which the terms appeared. After fixing it and writing a few tests around those cases, I found another bug in simplification of the same pattern and fixed that one as well (with some more test coverage).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42934

Reviewed By: zhangguanheng66

Differential Revision: D23085548

Pulled By: nickgg

fbshipit-source-id: e780967dcaa7a5fda9f6d7d19a6b7e7b4e94374b
2020-08-13 09:47:21 -07:00
f03f9ad621 update clone doc (#42931)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42931

Reviewed By: zhangguanheng66

Differential Revision: D23083000

Pulled By: albanD

fbshipit-source-id: d76d90476ca294763f204c185a62ff6484381c67
2020-08-13 08:45:46 -07:00
ba9025bc1a [tensorexpr] Autograd for testing (#42548)
Summary:
A simple differentiable abstraction to allow testing of full training graphs.

Included in this 1st PR is an example of trivial differentiation.

If approved, I can add a full MLP and demonstrate convergence using purely NNC (for performance testing) in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42548

Reviewed By: ZolotukhinM

Differential Revision: D23057920

Pulled By: bwasti

fbshipit-source-id: 4a239852c5479bf6bd20094c6c35f066a81a832e
2020-08-13 07:58:06 -07:00
607e49cc83 Revert D22856816: [quant][fix] Remove activation_post_process in qat modules
Test Plan: revert-hammer

Differential Revision:
D22856816 (8cb42fce17)

Original commit changeset: 988a43bce46a

fbshipit-source-id: eff5b9abdfc15b21c02c61eefbda38d349173436
2020-08-13 07:22:20 -07:00
8493b0d5d6 Enroll TensorPipe agent in C++-only E2E test (#42680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42680

ghstack-source-id: 109544678

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D22978714

fbshipit-source-id: 04d6d190c240c6ead9bd9f3b7f3a5f964d7451e8
2020-08-13 07:07:30 -07:00
c88d3a5e76 Remove Python dependency from TensorPipe RPC agent (#42678)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42678

ghstack-source-id: 109544679

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D22978716

fbshipit-source-id: 31f91d35e9538375b047184cf4a735e4b8809a15
2020-08-13 07:06:10 -07:00
d39cb84f1f [Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D23102075

fbshipit-source-id: afb89e061bb9c290df7cf4c58157fc8d67fe78ad
2020-08-13 05:14:21 -07:00
c9dcc833bc [quant][pyper] Make offsets an optional parameter in the qembedding_bag op (#42924)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42924

offsets is currently an optional parameter in the Python module, so we update the operator to follow suit
in order to avoid a bad optional access.

Test Plan:
python test/test_quantization.py TestQuantizeDynamicJitOps.test_embedding_bag

Imported from OSS

Reviewed By: radkris-git

Differential Revision: D23081152

fbshipit-source-id: 847b58f826f5a18e8d4978fc4afc6f3a96dc4230
2020-08-12 20:25:44 -07:00
8cb42fce17 [quant][fix] Remove activation_post_process in qat modules (#42343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42343

Currently, activation_post_process modules are inserted by default in QAT modules, which is not
friendly to automatic quantization tools; this PR removes them.

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D22856816

fbshipit-source-id: 988a43bce46a992b38fd0d469929f89e5b046131
2020-08-12 20:14:23 -07:00
7a7424bf91 Remove impl_unboxedOnlyKernel (#42841)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42841

There is nothing using those APIs anymore. While we still have ops that require an unboxedOnly implementation (i.e. that aren't c10-full yet), those are all already migrated to the new op registration API and use `.impl_UNBOXED()`.
ghstack-source-id: 109693705

Test Plan: waitforsandcastle

Reviewed By: bhosmer

Differential Revision: D23045335

fbshipit-source-id: d8e15cea1888262135e0d1d94c515d8a01bddc45
2020-08-12 17:35:09 -07:00
20e0e54dbe Allow Tensor& in the unboxing logic (#42712)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42712

Previously, operators taking Tensor& as arguments or returning it couldn't be c10-full because the unboxing logic didn't support it.
This adds temporary support for that. We're planning to remove this again later, but for now we need it to make those ops c10-full.
See https://docs.google.com/document/d/19thMVO10yMZA_dQRoB7H9nTPw_ldLjUADGjpvDmH0TQ for the full plan.

This PR also makes some ops c10-full that now can be.
ghstack-source-id: 109693706

Test Plan: unit tests

Reviewed By: bhosmer

Differential Revision: D22989242

fbshipit-source-id: 1bd97e5fa2b90b0860784da4eb772660ca2db5a3
2020-08-12 17:33:23 -07:00
5d2e9b6ed9 Add missing type annotation for Tensor.ndim (#42909)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42908

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42909

Reviewed By: zhangguanheng66

Differential Revision: D23090364

Pulled By: malfet

fbshipit-source-id: 44457fddc86f6abde635aa671e7611b405780ab9
2020-08-12 17:14:20 -07:00
b8ae563ce6 Add a microbenchmark for LSTM elementwise portion (#42901)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42901

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23079714

Pulled By: bertmaher

fbshipit-source-id: 28f8c3b5019ee898e82e64a0a674da1b4736d252
2020-08-12 17:11:47 -07:00
33d209b5f4 Fix TE microbenchmark harness to use appropriate fuser/executor (#42900)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42900

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23079715

Pulled By: bertmaher

fbshipit-source-id: 6aa2b08a550835b7737e355960a16a7ca83878ea
2020-08-12 17:11:44 -07:00
1adeed2720 Speed up CUDA kernel launch when block/thread extents are statically known (#42899)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42899

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23078708

Pulled By: bertmaher

fbshipit-source-id: 237404b47a31672d7145d70996868a3b9b97924e
2020-08-12 17:10:30 -07:00
f373cda021 Revert D22994446: [pytorch][PR] CUDA reduction: allow outputs to have different strides
Test Plan: revert-hammer

Differential Revision:
D22994446 (7f3f5020e6)

Original commit changeset: cc60beebad2e

fbshipit-source-id: f4635deac386db0c161f910760cace09f15a1ff9
2020-08-12 17:05:04 -07:00
86841f5f61 Update cuda init docstring to improve clarity (#42923)
Summary:
A small clarity improvement to the cuda init docstring

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42923

Reviewed By: zhangguanheng66

Differential Revision: D23080693

Pulled By: mrshenli

fbshipit-source-id: aad5ed9276af3b872c1def76c6175ee30104ccb2
2020-08-12 15:41:28 -07:00
0134deda0f [FX] Add interface to reject nodes (#42865)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42865

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23056584

Pulled By: jamesr66a

fbshipit-source-id: 02db08165ab41be5f3c4b5ff253cbb444eb9a7b8
2020-08-12 14:30:06 -07:00
92885ebe16 Implement hypot (#42291)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349
Closes https://github.com/pytorch/pytorch/issues/22764
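
A quick illustration of the new op:

```python
import torch

a = torch.tensor([3.0])
b = torch.tensor([4.0])
# elementwise hypotenuse, i.e. sqrt(a**2 + b**2)
print(torch.hypot(a, b))  # tensor([5.])
```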

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42291

Reviewed By: malfet

Differential Revision: D22951859

Pulled By: mruberry

fbshipit-source-id: d0118f2b6437e5c3f775f699ec46e946a8da50f0
2020-08-12 13:18:26 -07:00
62bd2ddec7 Implemented non-named version of unflatten (#42563)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42563

Moved the logic for non-named unflatten from the Python nn module to aten/native so it can be reused by the nn module later. Fixed some inconsistencies between the docs and the code logic.
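
A minimal sketch of the non-named form:

```python
import torch

x = torch.randn(2, 12)
y = x.unflatten(1, (3, 4))   # split dim 1 (size 12) into sizes (3, 4)
assert y.shape == (2, 3, 4)
```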

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23030301

Pulled By: heitorschueroff

fbshipit-source-id: 7c804ed0baa5fca960a990211b8994b3efa7c415
2020-08-12 13:14:28 -07:00
7f3f5020e6 CUDA reduction: allow outputs to have different strides (#42649)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42364

Benchmark:
https://github.com/zasdfgbnm/things/blob/master/2020Q3/min-benchmark.ipynb
```python
import torch

print(torch.__version__)
print()

for i in range(100):
    torch.randn(1000, device='cuda')

for e in range(7, 15):
    N = 2 ** e
    input_ = torch.randn(N, N, device='cuda')
    torch.cuda.synchronize()
    %timeit input_.min(dim=0); torch.cuda.synchronize()
    input_ = torch.randn(N, N, device='cuda').t()
    torch.cuda.synchronize()
    %timeit input_.min(dim=0); torch.cuda.synchronize()
    print()
```
Before
```
1.7.0a0+5d7c3f9

21.7 µs ± 1.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.6 µs ± 773 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

22.5 µs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.2 µs ± 250 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

26.4 µs ± 67 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.9 µs ± 316 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

33 µs ± 474 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
21.1 µs ± 218 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

84.2 µs ± 691 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
50.3 µs ± 105 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

181 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
145 µs ± 149 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

542 µs ± 753 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
528 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

2.04 ms ± 9.74 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.01 ms ± 22.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
After
```
1.7.0a0+9911817

21.4 µs ± 695 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.6 µs ± 989 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

22.4 µs ± 153 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.5 µs ± 58.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

26.6 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.9 µs ± 675 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

35.4 µs ± 560 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
21.7 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

86.5 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
52.2 µs ± 1.57 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

195 µs ± 2.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
153 µs ± 4.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

550 µs ± 7.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
527 µs ± 3.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

2.05 ms ± 7.87 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2 ms ± 4.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42649

Reviewed By: ezyang

Differential Revision: D22994446

Pulled By: ngimel

fbshipit-source-id: cc60beebad2e04c26ebf3ca702a6cb05846522c9
2020-08-12 13:09:36 -07:00
ada8404f2d [jit] Scaffold a static runtime (#42753)
Summary:
The premise of this approach is that a small subset of neural networks is well represented by a data flow graph.  The README contains more information.

The name is subject to change, but I thought it was a cute reference to fire.

suo let me know if you'd prefer this in a different spot.  Since it lowers a JIT'd module directly, I assumed the JIT folder would be appropriate.  There is no exposed Python interface yet (but one is mocked up in `test_accelerant.py`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42753

Reviewed By: zou3519

Differential Revision: D23043771

Pulled By: bwasti

fbshipit-source-id: 5353731e3aae31c08b5b49820815da98113eb551
2020-08-12 13:05:27 -07:00
59f8692350 [pytorch] BUCK build for Vulkan backend
Summary:
Introducing `//xplat/caffe2:aten_vulkan` target which contains pytorch Vulkan backend and its ops.

 `//xplat/caffe2:aten_vulkan` depends on ` //xplat/caffe2:aten_cpu`

Simply including it in the link step registers the Vulkan backend and its ops.

**Code generation:**
1. `VulkanType.h`, `VulkanType.cpp`
Tensor Types for Vulkan backend are generated by `//xplat/caffe2:gen_aten_vulkan` which runs aten code generation (`aten/src/ATen/gen.py`) with `--vulkan` argument.

2. Shaders compilation
`//xplat/caffe2:gen_aten_vulkan_spv`  genrule runs `//xplat/caffe2:gen_aten_vulkan_spv_bin` which is a wrapper on `aten/src/ATen/native/vulkan/gen_spv.py`

GLSL files are listed in `aten/src/ATen/native/vulkan/glsl/*`, and compiling them requires `glslc` (the GLSL compiler).

`glslc` is open source (https://github.com/google/shaderc), but it has a few dependencies on other libraries, so porting its build to BUCK would take a significant amount of time.

To use `glslc` in BUCK, this change introduces the

dotslash `xplat/caffe2/fb/vulkan/dotslash/glslc`, which points to the latest prebuilt `glslc` binaries from the Android NDK for Linux, macOS, and Windows, stored on Manifold.

Not using it from the Android NDK directly allows updating it without a dependency on the NDK.

Test Plan:
Building aten_vulkan target:
```
buck build //xplat/caffe2:aten_vulkan
```

Building vulkan_test that contains vulkan unittests for android:
```
buck build //xplat/caffe2:pt_vulkan_test_binAndroid#android-armv7
```
And running it on the device with vulkan support.

Reviewed By: iseeyuan

Differential Revision: D22770299

fbshipit-source-id: 843af8df226d4b5395b8e480eb47b233d57201df
2020-08-12 10:34:41 -07:00
ea65a56854 Use string(APPEND FOO " bar") instead of `set(FOO "${FOO} bar") (#42844)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42844

Reviewed By: scintiller

Differential Revision: D23067577

Pulled By: malfet

fbshipit-source-id: e4380ce02fd6aca37c955a7bc24435222c5d8b19
2020-08-12 10:33:11 -07:00
3d3752d716 Revert D22898051: [pytorch][PR] Fix freeze_module pass for sharedtype
Test Plan: revert-hammer

Differential Revision:
D22898051 (4665f3fc8d)

Original commit changeset: 8b1d80f0eb40

fbshipit-source-id: 4dc0ba274282a157509db16df13269eed6cd5be9
2020-08-12 10:28:03 -07:00
bda0007620 Improve calling backward() and grad() inside vmap error messages (#42876)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42876

Previously, the error messages were pretty bad. This PR adds nice
error messages for the following cases:
- user attempts to call .backward() inside vmap for any reason
whatsoever
- user attempts to call autograd.grad(outputs, inputs, grad_outputs),
where outputs or inputs is being vmapped over (so they are
BatchedTensors).

The case we do support is calling autograd.grad(outputs, inputs,
grad_outputs) where `grad_outputs` is being vmapped over. This is the
case for batched gradient support (e.g., user passes in a batched
grad_output).

Test Plan: - new tests: `pytest test/test_vmap.py -v`

Reviewed By: ezyang

Differential Revision: D23059836

Pulled By: zou3519

fbshipit-source-id: 2fd4e3fd93f558e67e2f0941b18f0d00d8ab439f
2020-08-12 10:05:31 -07:00
5c39146c34 Fix get_writable_path (#42895)
Summary:
As the name suggests, this function should always return a writable path.
It now calls `mkdtemp` to create a temp folder if the path is not writable.

This fixes `TestNN.test_conv_backcompat` if PyTorch is installed in a non-writable location
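
A Python sketch of the fallback described above (the actual implementation is in C++):

```python
import os
import tempfile

def get_writable_path(path: str) -> str:
    # illustrative only: fall back to a fresh temp directory
    # when the requested path is not writable
    if os.access(path, os.W_OK):
        return path
    return tempfile.mkdtemp()
```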

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42895

Reviewed By: dzhulgakov

Differential Revision: D23070320

Pulled By: malfet

fbshipit-source-id: ed6a681d46346696a0de7e71f0b21cba852a964e
2020-08-12 09:38:24 -07:00
5157afcf59 fix int8 FC (#42691)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42691

Fix quantization of FC bias to match NNPI;
quantize biases to fp16.

Test Plan: improved the unit test to have input tensors in fp32

Reviewed By: tracelogfb

Differential Revision: D22941521

fbshipit-source-id: 00afb70610f8a149110344d52595c39e3fc988ab
2020-08-12 09:30:34 -07:00
686705c98b Optimize LayerNorm performance on CPU both forward and backward (#35750)
Summary:
This PR aims at improving `LayerNorm` performance on CPU for both forward and backward.

Results on Xeon 6248:
1. single socket inference **1.14x** improvement
2. single core inference **1.77x** improvement
3. single socket training **6.27x** improvement

Fine-tuning GPT2 on the WikiText-2 dataset: time per iteration on dual socket reduced from **4.69s/it** to **3.16s/it**, a **1.48x** improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/35750

Reviewed By: zhangguanheng66

Differential Revision: D20810026

Pulled By: glaringlee

fbshipit-source-id: c5801bd76eb944f2e46c2fe4991d9ad4f40495c3
2020-08-12 09:17:20 -07:00
75a15d3d01 Follow-up for pytorch/pytorch#37091. (#42806)
Summary:
This is a follow-up PR for https://github.com/pytorch/pytorch/issues/37091, fixing some of the quirks of that PR as that one was landed early to avoid merge conflicts.

This PR addresses the following action items:

- [x] Use error-handling macros instead of a `try`-`catch`.
- [x] Renamed and added comments to clarify the use of `HANDLED_FUNCTIONS_WRAPPERS` in tests. `HANDLED_FUNCTIONS_NAMESPACES` was already removed in the last PR as we had a way to test for methods.

This PR does NOT address the following action item, as it proved to be difficult:

- [ ] Define `__module__`  for whole API.

Single-line repro-er for why this is hard:

```python
>>> torch.Tensor.grad.__get__.__module__ = "torch.Tensor.grad"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'method-wrapper' object has no attribute '__module__'
```

Explanation: Methods defined in C and properties don't always have a `__dict__` attribute or a mutable `__module__` slot for us to modify.

The documentation action items were addressed in the following commit, with the additional future task of adding the rendered RFCs to the documentation: 552ba37c05

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42806

Reviewed By: smessmer

Differential Revision: D23031501

Pulled By: ezyang

fbshipit-source-id: b781c97f7840b8838ede50a0017b4327f96bc98a
2020-08-12 09:11:33 -07:00
2878efb35d Use C10_API_ENUM to fix invalid attribute warnings (#42464)
Summary:
Using the macro added in https://github.com/pytorch/pytorch/issues/38988 to fix more attribute warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42464

Reviewed By: malfet

Differential Revision: D22916943

Pulled By: ezyang

fbshipit-source-id: ab9ca8755cd8b89aaf7f8718b4107b4b94d95005
2020-08-12 09:02:49 -07:00
2f1baf6c25 Fix coding style and safety issues in CuBLAS nondeterministic unit test (#42627)
Summary:
Addresses some comments that were left unaddressed after PR https://github.com/pytorch/pytorch/issues/41377 was merged:

* Use `check_output` instead of `Popen` to run each subprocess sequentially
* Use f-strings rather than old python format string style
* Provide environment variables to subprocess through the `env` kwarg
* Check for correct error behavior inside the subprocess, and raise another error if incorrect. Then the main process fails the test if any error is raised

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42627

Reviewed By: malfet

Differential Revision: D22969231

Pulled By: ezyang

fbshipit-source-id: 38d5f3f0d641c1590a93541a5e14d90c2e20acec
2020-08-12 08:54:28 -07:00
77bd4d3426 MAINT: speed up istft by using col2im (the original python code used … (#42826)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42213

The [original python code](https://github.com/pytorch/audio/blob/v0.5.0/torchaudio/functional.py#L178) from `torchaudio` was converted to a native function, but used `eye` to  allocate a Tensor and was much slower.
Using `at::col2im` (which is the equivalent of `torch.nn.functional.fold`) solved the slowdown.
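
For intuition, `at::col2im` is the C++ counterpart of `torch.nn.functional.fold`; a small sketch:

```python
import torch
import torch.nn.functional as F

# nine sliding 2x2 blocks over 3 channels, reassembled into a 4x4 map
cols = torch.randn(1, 3 * 2 * 2, 9)
out = F.fold(cols, output_size=(4, 4), kernel_size=(2, 2))
assert out.shape == (1, 3, 4, 4)
```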

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42826

Reviewed By: smessmer

Differential Revision: D23043673

Pulled By: mthrok

fbshipit-source-id: 3f5d0779a87379b002340ea19c9ae5042a43e94e
2020-08-12 08:39:12 -07:00
4665f3fc8d Fix freeze_module pass for sharedtype (#42457)
Summary:
During the cleanup phase, calling recordReferencedAttrs records
the attributes which are referenced and hence kept.
However, if you have two instances of the same type which are preserved
through the freezing process, as the added testcase shows, then while
recording the referenced attributes we iterate through the
type INSTANCES that we have seen so far and record those.
Thus if we have another instance of the same type, we will just look at
the first instance in the list and record that instance's attributes.
This PR fixes that by traversing the getattr chains and getting the
actual instance of the getattr output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42457

Test Plan:
python test/test_jit.py TestFreezing
Fixes #{issue number}

Reviewed By: zou3519

Differential Revision: D22898051

Pulled By: kimishpatel

fbshipit-source-id: 8b1d80f0eb40ab99244f931d4a1fdb28290a4683
2020-08-12 08:35:05 -07:00
ecb9e790ed Remove excessive logging in plan_executor (#42888)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42888

as title

Test Plan: flow-cli test-locally dper.workflows.evaluation.eval_workflow --parameters-file /mnt/public/ehsanardestani/temp/quant_eval_inputs_all.json

Reviewed By: amylittleyang

Differential Revision: D23066529

fbshipit-source-id: f925afd1734e617e412b0f171e16c781d13272d9
2020-08-11 23:57:17 -07:00
a346e90c49 Update to NNP-I v1.0.0.5 (#4770)
Summary:
Align code to NNP-I v1.0.0.5 (glow tracing changes).

Pull Request resolved: https://github.com/pytorch/glow/pull/4770

Reviewed By: arunm-git

Differential Revision: D22927904

Pulled By: hl475

fbshipit-source-id: 3746a6b07f3fcffc662d80a95513427cfccac7a5
2020-08-11 23:53:23 -07:00
ab0a04dc9c Add torch.nansum (#38628)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/38349
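
A quick illustration of the new reduction:

```python
import torch

x = torch.tensor([1.0, float('nan'), 2.0])
print(torch.nansum(x))  # tensor(3.), NaNs are treated as zero
```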

Pull Request resolved: https://github.com/pytorch/pytorch/pull/38628

Reviewed By: VitalyFedyunin

Differential Revision: D22860549

Pulled By: mruberry

fbshipit-source-id: 87fcbfd096d83fc14b3b5622f2301073729ce710
2020-08-11 22:26:04 -07:00
38c7b9a168 avoid redundant isCustomClassRegistered() checks (#42852)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42852

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D23048381

Pulled By: bhosmer

fbshipit-source-id: 40b71670a84cb6f7e5a03279f58ce227d676aa03
2020-08-11 21:53:19 -07:00
bee174dc3f Adds linalg.det alias, fixes outer alias, updates alias testing (#42802)
Summary:
This PR:

- updates test_op_normalization.py, which verifies that aliases are correctly translated in the JIT
- adds torch.linalg.det as an alias for torch.det
- moves the torch.linalg.outer alias to torch.outer (to be consistent with NumPy)

The torch.linalg.outer alias was put in the linalg namespace erroneously as a placeholder: it's a "linear algebra op" according to NumPy, but it actually still lives in the main NumPy namespace.
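
A quick check of the alias behavior (a sketch, assuming a build that includes this PR):

```python
import torch

x = torch.randn(3, 3)
assert torch.allclose(torch.linalg.det(x), torch.det(x))

u, v = torch.arange(3.0), torch.arange(4.0)
assert torch.equal(torch.outer(u, v), torch.ger(u, v))
```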

The updates to test_op_normalization are necessary. Previously it was using method_tests to generate tests, and method_tests assumes test suites using it also use the device generic framework, which test_op_normalization did not. For example, some ops require decorators like `skipCPUIfNoLapack`, which only works in device generic test classes. Moving test_op_normalization to the device generic framework also lets these tests run on CPU and CUDA.

Continued reliance on method_tests() is excessive since the test suite is only interested in testing aliasing, and a simpler and more readable `AliasInfo` class is used for the required information. An example impedance mismatch between method_tests and the new tests was how to handle ops in namespaces like torch.linalg.det. In the future this information will likely be folded into a common 'OpInfo' registry in the test suite.

The actual tests performed are similar to what they were previously: a scripted and traced version of the op is run and the test verifies that both graphs do not contain the alias name and do contain the aliased name.

The guidance for adding an alias has been updated accordingly.

cc mattip

Note:

ngimel suggests:
- deprecating and then removing the `torch.ger` name
- reviewing the implementation of `torch.outer`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42802

Reviewed By: zou3519

Differential Revision: D23059883

Pulled By: mruberry

fbshipit-source-id: 11321c2a7fb283a6e7c0d8899849ad7476be42d1
2020-08-11 21:48:31 -07:00
cd756ee3d4 Support boolean key in dictionary (#42833)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41449.
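
A minimal sketch of what now compiles (hypothetical function):

```python
import torch

@torch.jit.script
def pick(flag: bool) -> int:
    d = {True: 1, False: 0}  # boolean keys are now supported in TorchScript dicts
    return d[flag]

assert pick(True) == 1
```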

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42833

Test Plan: `python test/test_jit.py TestDict`

Reviewed By: zou3519

Differential Revision: D23056250

Pulled By: asuhan

fbshipit-source-id: 90dabe1490c99d3e57a742140a4a2b805f325c12
2020-08-11 21:37:37 -07:00
ac93d45906 [quant] Attach qconfig to all modules (#42576)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42576

Previously we had a qconfig propagation list and we only attached qconfig to modules
in the list; this works when everything is quantized in the form of modules.
But now we are expanding quantization to functional/torch ops, so we'll need to attach qconfig
to all modules.

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D22939453

fbshipit-source-id: 7d6a1f73ff9bfe461b3afc75aa266fcc8f7db517
2020-08-11 20:34:34 -07:00
e845b0ab51 [Resending] [ONNX] Add eliminate_unused_items pass (#42743)
Summary:
This PR:

- Adds eliminate_unused_items pass that removes unused inputs and initializers.
- Fixes run_embed_params function so it doesn't export unnecessary parameters.
- Removes test_modifying_params in test_verify since it's no longer needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42743

Reviewed By: hl475

Differential Revision: D23058954

Pulled By: houseroad

fbshipit-source-id: cd1e81463285a0bf4e60766c8c87fc9a350d9c7e
2020-08-11 20:30:50 -07:00
a846ed5ce7 [quant] Reduce number of variants of add/mul (#42769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42769

Some of the quantized add and mul variants can share the same name.

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D23054822

fbshipit-source-id: c1300f3f0f046eaf0cf767d03b957835e22cfb4b
2020-08-11 20:01:06 -07:00
5edd9aa95a Fix manual seed to unpack unsigned long (#42206)
Summary:
`torch.manual_seed` was unpacking its argument as an `int64_t`. This fix changes it to a `uint64_t`.
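
An illustration of the widened range (a sketch; seeds above 2**63 - 1 previously failed to unpack):

```python
import torch

torch.manual_seed(2**64 - 1)  # accepted as an unsigned 64-bit value after this fix
```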

Fixes https://github.com/pytorch/pytorch/issues/33546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42206

Reviewed By: ezyang

Differential Revision: D22822098

Pulled By: albanD

fbshipit-source-id: 97c978139c5cb2d5b62cc2c963550c758ee994f7
2020-08-11 18:05:34 -07:00
b0b8340065 Collect more data in collect_env (#42887)
Summary:
Collect Python runtime bitness (32 vs 64 bit)
Collect Mac/Linux OS machine type (x86_64, arm, Power, etc.)
Collect Clang version

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42887

Reviewed By: seemethere

Differential Revision: D23064788

Pulled By: malfet

fbshipit-source-id: df361bdbb79364dc521b8e1ecbed1b4bd08f9742
2020-08-11 18:01:14 -07:00
7a9ae52550 [hypothesis] Deadline followup (#42842)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42842

Test Plan: `buck test`

Reviewed By: thatch

Differential Revision: D23045269

fbshipit-source-id: 8a3f4981869287a0f5fb3f0009e13548b7478086
2020-08-11 15:33:23 -07:00
eeb43ffab9 format for readability (#42851)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42851

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D23048382

Pulled By: bhosmer

fbshipit-source-id: 55d84d5f9c69be089056bf3e3734c1b1581dc127
2020-08-11 14:46:42 -07:00
3bf2978497 remove deadline enforcement for hypothesis (#42871)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42871

The old version of the hypothesis testing library was not enforcing deadlines.
After the library got updated, the default became deadline=200ms, but even with 1s or
more, tests are flaky. Changing the deadline to non-enforced, which is the same
behavior as the old version.

Test Plan: tested fakelowp/tests

Reviewed By: hl475

Differential Revision: D23059033

fbshipit-source-id: 79b6aec39a2714ca5d62420c15ca9c2c1e7a8883
2020-08-11 14:28:53 -07:00
0ff0fea42b [FX] fix lint (#42866)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42866

Test Plan: Imported from OSS

Reviewed By: zdevito

Differential Revision: D23056813

Pulled By: jamesr66a

fbshipit-source-id: d30cdffe6f0465223354dec00f15658eb0b08363
2020-08-11 14:01:26 -07:00
43613b4236 Fix incorrect aten::sorted.str return type (#42853)
Summary:
aten::sorted.str output type was incorrectly set to bool[] due to a copy-paste error. This PR fixes it.

Fixes https://fburl.com/0rv8amz7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42853

Reviewed By: yf225

Differential Revision: D23054907

Pulled By: gmagogsfm

fbshipit-source-id: a62968c90f0301d4a5546e6262cb9315401a9729
2020-08-11 14:01:23 -07:00
71dbfc79b3 Export BatchBucketOneHot Caffe2 Operator to PyTorch
Summary: As titled.

Test Plan:
```
buck test caffe2/caffe2/python/operator_test:torch_integration_test -- test_batch_bucket_one_hot_op
```

Reviewed By: yf225

Differential Revision: D23005981

fbshipit-source-id: 1daa8d3e7d6ad75e97e94964db95ccfb58541672
2020-08-11 14:00:19 -07:00
4afbf39737 Add nn.functional.adaptive_avg_pool size empty tests (#42857)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42857

Reviewed By: seemethere

Differential Revision: D23053677

Pulled By: malfet

fbshipit-source-id: b3d0d517cddc96796461332150e74ae94aac8090
2020-08-11 12:59:58 -07:00
9c8f5cb61d Ensure IDEEP transpose operator works correctly
Summary: I found out that without exporting to public format, the IDEEP transpose operator in the middle of a convolution net produces incorrect results (probably reading some out-of-bounds memory). Exporting to public format might not be the most efficient solution, but at least it ensures correct behavior.

Test Plan: Running ConvFusion followed by transpose should give identical results on CPU and IDEEP

Reviewed By: bwasti

Differential Revision: D22970872

fbshipit-source-id: 1ddca16233e3d7d35a367c93e72d70632d28e1ef
2020-08-11 12:58:31 -07:00
c660d2a9ae Initial quantile operator implementation (#42755)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42755

Attempting to land quantile again after being landed here https://github.com/pytorch/pytorch/pull/39417 and reverted here https://github.com/pytorch/pytorch/pull/41616.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23030338

Pulled By: heitorschueroff

fbshipit-source-id: 124a86eea3aee1fdaa0aad718b04863935be26c7
2020-08-11 12:08:17 -07:00
6471b5dc66 Correct the type of some floating point literals in calc_digamma (#42846)
Summary:
They are double, but they are supposed to be of type `accscalar_t` or a faster type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42846

Reviewed By: zou3519

Differential Revision: D23049405

Pulled By: mruberry

fbshipit-source-id: 29bb5d5419dc7556b02768f0ff96dfc28676f257
2020-08-11 11:39:06 -07:00
4bafca1a69 Adds list of operator-related information for testing (#41662)
Summary:
This PR adds:

- an "OpInfo" class in common_method_invocations that can contain useful information about an operator, like what dtypes it supports
- a more specialized "UnaryUfuncInfo" class designed to help test the unary ufuncs
- the `ops` decorator, which can generate test variants from lists of OpInfos
- test_unary_ufuncs.py, a new test suite stub that shows how the `ops` decorator and operator information can be used to improve the thoroughness of our testing

The single test in test_unary_ufuncs.py simply ensures that the dtypes associated with a unary ufunc operator in its OpInfo entry are correct. Writing a test like this previously, however, would have required manually constructing test-specific operator information and writing a custom test generator. The `ops` decorator and a common place to put operator information make writing tests like this easier and allows what would have been test-specific information to be reused.

The `ops` decorator extends and composes with the existing device generic test framework, allowing its decorators to be reused. For example, the `onlyOnCPUAndCUDA` decorator works with the new `ops` decorator. This should keep the tests readable and consistent.

Future PRs will likely:

- continue refactoring the too large test_torch.py into more verticals (unary ufuncs, binary ufuncs, reductions...)
- add more operator information to common_method_invocations.py
- refactor tests for unary ufuncs into test_unary_ufunc

Examples of possible future extensions are [here](616747e50d), where an example unary ufunc test is added, and [here](d0b624f110), where example autograd tests are added. Both tests leverage the operator info in common_method_invocations to simplify testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41662

Reviewed By: ngimel

Differential Revision: D23048416

Pulled By: mruberry

fbshipit-source-id: ecce279ac8767f742150d45854404921a6855f2c
2020-08-11 11:34:53 -07:00
aabdef51f9 [NNC] Registerizer for GPU [1/x] (#42606)
Summary:
Adds a new optimization pass, the Registerizer, which looks for common Stores and Loads to a single item in a buffer and replaces them with a local temporary scalar which is cheaper to write.

For example it can replace:
```
A[0] = 0;
for (int x = 0; x < 10; x++) {
  A[0] = (A[0]) + x;
}
```

with:
```
int A_ = 0;
for (int x = 0; x < 10; x++) {
  A_ = x + A_;
}
A[0] = A_;
```

This is particularly useful on GPUs when parallelizing, since after replacing loops with metavars we have a lot of accesses like this. Early tests of simple reductions on a V100 indicate this can speed them up by ~5x.

This diff got a bit unwieldy with the integration code so that will come in a follow up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42606

Reviewed By: bertmaher

Differential Revision: D22970969

Pulled By: nickgg

fbshipit-source-id: 831fd213f486968624b9a4899a331ea9aeb40180
2020-08-11 11:17:50 -07:00
57b056b5f2 align qlinear benchmark to linear benchmark (#42767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42767

Same as previous PR, forcing the qlinear benchmark to follow the fp one

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.linear_test
python -m pt.qlinear_test
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23013937

fbshipit-source-id: fffaa7cfbfb63cea41883fd4d70cd3f08120aaf8
2020-08-11 10:35:16 -07:00
a7bdf575cb align qconv benchmark to conv benchmark (#42761)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42761

Makes the qconv benchmark follow the conv benchmark exactly. This way
it will be easy to compare q vs fp with the same settings.

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qconv_test
python -m pt.conv_test
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23012533

fbshipit-source-id: af30ee585389395569a6322f5210828432963077
2020-08-11 10:33:19 -07:00
2c8cbd78bd Fix orgqr input size conditions (#42825)
Summary:
* Adds support for `n > k`
* Throw error if `m >= n >= k` is not true
* Updates existing error messages to match argument names shown in public docs
* Adds error tests

Fixes https://github.com/pytorch/pytorch/issues/41776
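
A sketch of the relaxed shape condition (m >= n >= k), using `geqrf` to produce the Householder representation; slicing `tau` is just an illustrative way to obtain k < n:

```python
import torch

a = torch.randn(5, 3)        # m = 5, n = 3
h, tau = torch.geqrf(a)      # tau has k = min(m, n) = 3 elements
q = torch.orgqr(h, tau[:2])  # k = 2 < n = 3: previously rejected, now supported
assert q.shape == (5, 3)
```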

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42825

Reviewed By: smessmer

Differential Revision: D23038916

Pulled By: albanD

fbshipit-source-id: e9bec7b11557505e10e0568599d0a6cb7e12ab46
2020-08-11 10:17:39 -07:00
575e7497f6 Introduce experimental FX library (#42741)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42741

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Differential Revision: D23006383

Pulled By: jamesr66a

fbshipit-source-id: 6cb6d921981fcae47a07df581ffcf900fb8a7fe8
2020-08-11 10:01:47 -07:00
7524699d58 Modify clang code coverage to CMakeList.txt (for MacOS) (#42837)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42837

Originally we use
```
list(APPEND CMAKE_C_FLAGS  -fprofile-instr-generate -fcoverage-mapping)
list(APPEND CMAKE_CXX_FLAGS  -fprofile-instr-generate -fcoverage-mapping)
```
But when compiling the project on Mac with coverage on, it fails with the error:
```
clang: error: no input files
/bin/sh: -fprofile-instr-generate: command not found
/bin/sh: -fcoverage-mapping: command not found
```

The reason behind it is that `list(APPEND CMAKE_CXX_FLAGS ...)` will add an additional `;` to the variable. This means, if we do `list(APPEND foo a)` and then `list(APPEND foo b)`, then `foo` will be `a;b`, with an additional `;`. Since we have `CMAKE_CXX_FLAGS` defined earlier in the `CMakeList.txt`, we can only use `set(...)` here.
After changing it to
```
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fprofile-instr-generate -fcoverage-mapping")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fprofile-instr-generate -fcoverage-mapping")
```
Tested successfully on a local Mac machine.

Test Plan: Test locally on mac machine

Reviewed By: malfet

Differential Revision: D23043057

fbshipit-source-id: ff6f4891b35b7f005861ee2f8e4c550c997fe961
2020-08-11 09:57:55 -07:00
42114a0154 Update the documentation for scatter to include streams parameter. (#42814)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41827

![Screenshot from 2020-08-10 13-41-20](https://user-images.githubusercontent.com/46765601/89813181-41041380-db0f-11ea-88c2-a97d7b994ac5.png)

Current:
https://pytorch.org/docs/stable/cuda.html#communication-collectives

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42814

Reviewed By: smessmer

Differential Revision: D23033544

Pulled By: mrshenli

fbshipit-source-id: 88747fbb06e88ef9630c042ea9af07dafd422296
2020-08-11 09:28:14 -07:00
1041bdebb0 Fix a typo in EmbeddingBag.cu (#42742)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42742

Reviewed By: smessmer

Differential Revision: D23011029

Pulled By: mrshenli

fbshipit-source-id: 615f8b876ef1881660af71b6e145fb4ca97d2ebb
2020-08-11 09:24:38 -07:00
916235284c [JIT] Fix typing.Final for python 3.8 (#39568)
Summary:
fixes https://github.com/pytorch/pytorch/issues/39566

`typing.Final` exists since Python 3.8, and on Python 3.8 `typing_extensions.Final` is an alias of `typing.Final`; therefore `ann.__module__ == 'typing_extensions'` becomes False when running on 3.8 with `typing_extensions` installed.

~~I don't know why the test is skipped, seems like due to historical reasons when Python 2.7 was still a thing?~~ Edit: I know now; the `Final` for `<3.7` doesn't have `__origin__`.
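
A minimal sketch of the aliasing fact behind the fix, assuming Python 3.8 with `typing_extensions` installed:
```python
import sys
import typing
import typing_extensions

if sys.version_info >= (3, 8):
    # On 3.8+, typing_extensions re-exports typing.Final, so a check
    # against ann.__module__ == 'typing_extensions' no longer matches.
    assert typing_extensions.Final is typing.Final
    print(typing.Final.__module__)  # 'typing'
```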

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39568

Reviewed By: smessmer

Differential Revision: D23043388

Pulled By: malfet

fbshipit-source-id: cc87a9e4e38090d784e9cea630e1c543897a1697
2020-08-11 08:51:46 -07:00
d28639a080 Optimization with Backward Implementation of Learnable Fake Quantize Per Channel Kernel (CPU and GPU) (#42810)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42810

In this diff, the original backward pass implementation is sped up by merging the 3 separate iterations computing dX, dScale, and dZeroPoint. A native loop is used directly at the byte level (via `strides`). In addition, vectorization is used: scale and zero point are expanded to share the same shape as X, with values corresponding element-wise along the channel axis.

In the benchmark test on the operators, for an input of shape `3x3x256x256`, we have observed the following improvement in performance:
**Speedup from python operator**: ~10x
**Speedup from original learnable kernel**: ~5.4x
**Speedup from non-backprop kernel**: ~1.8x

Test Plan:
To assert correctness of the new kernel, on a devvm, enter the command

`buck test //caffe2/test:quantization -- learnable_backward_per_channel`

To benchmark the operators, on a devvm, enter the command
1. Set the kernel size to 3x3x256x256 or a reasonable input size.
2. Run `buck test //caffe2/benchmarks/operator_benchmark/pt:quantization_test`
3. The relevant outputs for CPU are as follows:

```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cpu_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 989024.686

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cpu_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 95654.079

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cpu_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 176948.970
```
4. The relevant outputs for GPU are as follows:

**Pre-optimization**:

```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 6795.173

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 4321.351

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 1052.066
```

**Post-optimization**:
```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 6737.106

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 2112.484

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 1078.79
```

Reviewed By: vkuzo

Differential Revision: D22946853

fbshipit-source-id: 1a01284641480282b3f57907cc7908d68c68decd
2020-08-11 08:41:53 -07:00
42b4a7132e Raise error if at::native::embedding is given 0-D weight (#42550)
Summary:
Previously, `at::native::embedding` implicitly assumed that the `weight` argument would be 1-D or greater. Given a 0-D tensor, it would segfault. This change makes it throw a RuntimeError instead.

Fixes https://github.com/pytorch/pytorch/issues/41780
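
A minimal repro sketch of the guarded case:
```python
import torch
import torch.nn.functional as F

weight = torch.tensor(1.0)        # 0-D weight, which previously segfaulted
indices = torch.tensor([0])
try:
    F.embedding(indices, weight)
except RuntimeError as e:
    print(e)                      # now a RuntimeError instead
```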

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42550

Reviewed By: smessmer

Differential Revision: D23040744

Pulled By: albanD

fbshipit-source-id: d3d315850a5ee2d2b6fcc0bdb30db2b76ffffb01
2020-08-11 08:26:45 -07:00
d396d135db Added torch::cuda::manual_seed(_all) to mirror torch.cuda.manual_seed(_all) (#42638)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42638

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D23030317

Pulled By: heitorschueroff

fbshipit-source-id: b0d7bdf0bc592a913ae5b1ffc14c3a5067478ce3
2020-08-11 08:22:20 -07:00
e8f4b04d9a vmap: temporarily disable support for random functions (#42617)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42617

While we figure out the random plan, I want to initially disable
support for random operations. This is because there is an ambiguity in
what randomness means. For example,

```
tensor = torch.zeros(B0, 1)
vmap(lambda t: t.normal_())(tensor)
```

In the above example, should tensor[0] and tensor[1] be equal (i.e., use the same random seed), or should they be different?

The mechanism for disabling random support is as follows:
- We add a new dispatch key called VmapMode
- Whenever we're inside vmap, we enable VmapMode for all tensors.
This is done via at::VmapMode::increment_nesting and
at::VmapMode::decrement_nesting.
- DispatchKey::VmapMode's fallback kernel is the fallthrough kernel.
- We register kernels that raise errors for all random functions on
DispatchKey::VmapMode. This way, whenever someone calls a random
function on any tensor (not just BatchedTensors) inside of a vmap block,
an error gets thrown.

Test Plan: - pytest test/test_vmap.py -v -k "Operators"

Reviewed By: ezyang

Differential Revision: D22954840

Pulled By: zou3519

fbshipit-source-id: cb8d71062d4087e10cbf408f74b1a9dff81a226d
2020-08-11 07:19:51 -07:00
ffc3da35f4 Don't materialize output grads (#41821)
Summary:
Added a new option in AutogradContext to tell autograd to not materialize output grad tensors, that is, don't expand undefined/None tensors into tensors full of zeros before passing them as input to the backward function.

This PR is the second part that closes https://github.com/pytorch/pytorch/issues/41359. The first PR is https://github.com/pytorch/pytorch/pull/41490.
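
For reference, a minimal Python-side sketch of the same option as it surfaces on `torch.autograd.Function` (`set_materialize_grads`, the counterpart to the C++ AutogradContext change here; assumes a build that includes both parts):
```python
import torch

class TwoOutputs(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.set_materialize_grads(False)
        return x.clone(), x.clone()

    @staticmethod
    def backward(ctx, g1, g2):
        # With materialization off, the grad of an unused output arrives
        # as None instead of a tensor full of zeros.
        return g1 if g2 is None else g1 + g2

x = torch.randn(3, requires_grad=True)
a, _ = TwoOutputs.apply(x)
a.sum().backward()
print(x.grad)
```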

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41821

Reviewed By: albanD

Differential Revision: D22693163

Pulled By: heitorschueroff

fbshipit-source-id: a8d060405a17ab1280a8506a06a2bbd85cb86461
2020-08-11 04:27:07 -07:00
ddcf3ded3e Revert D23002043: add net transforms for fusion
Test Plan: revert-hammer

Differential Revision:
D23002043 (a4b763bc2c)

Original commit changeset: f0b13d51d68c

fbshipit-source-id: d43602743af35db825e951358992e979283a26f6
2020-08-10 21:22:57 -07:00
59b10f7929 [quant] Sorting the list of dispatches (#42758)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42758

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23011764

Pulled By: z-a-f

fbshipit-source-id: df87acdcf77ae8961a109eaba20521bc4f27ad0e
2020-08-10 21:05:30 -07:00
dedcc30c84 Fix ROCm CI by increasing test timeout (#42827)
Summary:
ROCm is failing to run this test in the allotted time. See, for example, https://app.circleci.com/pipelines/github/pytorch/pytorch/198759/workflows/f6066acf-b289-46c5-aad0-6f4f663ce820/jobs/6618625.

cc jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42827

Reviewed By: pbelevich

Differential Revision: D23042220

Pulled By: mruberry

fbshipit-source-id: 52b426b0733b7b52ac3b311466d5000334864a82
2020-08-10 20:26:20 -07:00
a4b763bc2c add net transforms for fusion (#42763)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42763

add the fp16 fusions as net transforms:
- layernorm fused with mul+add
- swish int8

Test Plan: added unit test, ran flows

Reviewed By: yinghai

Differential Revision: D23002043

fbshipit-source-id: f0b13d51d68c240b05d2a237a7fb8273e996328b
2020-08-10 20:16:14 -07:00
103887892c Fix "non-negative integer" error messages (#42734)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42662

Use "positive integer" error message for consistency with: 17f76f9a78/torch/optim/lr_scheduler.py (L958-L959)
ad7133d3c1/torch/utils/data/sampler.py (L102-L104)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42734

Reviewed By: zdevito

Differential Revision: D23039575

Pulled By: smessmer

fbshipit-source-id: 1be1e0caa868891540ecdbe6f471a6cd51c40ede
2020-08-10 19:39:37 -07:00
c14a7f6808 adaptive_avg_pool[23]d: check output_size.size() (#42831)
Summary:
Return an error if output_size is unexpected

Fixes https://github.com/pytorch/pytorch/issues/42578
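
A minimal sketch of the now-rejected call (the exact exception type surfaced to Python is an assumption):
```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
try:
    # output_size has 3 elements, but adaptive_avg_pool2d expects 2
    F.adaptive_avg_pool2d(x, (4, 4, 4))
except RuntimeError as e:
    print(e)  # a clean error instead of unexpected behavior
```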

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42831

Reviewed By: ezyang

Differential Revision: D23039295

Pulled By: malfet

fbshipit-source-id: d14a5e6dccdf785756635caee2c87151c9634872
2020-08-10 19:27:18 -07:00
c9e825640a [c10d] Template computeLengthsAndOffsets() (#42706)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42706

Different backends accept different length types, e.g. MPI_Alltoallv(), ncclSend/Recv(), and gloo::alltoallv(), so make computeLengthsAndOffsets() a template.

Test Plan:
Sandcastle
CI
HPC: ./trainer_cmd.sh -p 16 -n 8 -d nccl

Reviewed By: osalpekar

Differential Revision: D22961459

fbshipit-source-id: 45ec271f8271b96f2dba76cd9dce3e678bcfb625
2020-08-10 19:21:46 -07:00
a414bd69de Skip test_c10d.ProcessGroupNCCLTest under TSAN (#42750)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42750

All of these tests fail under TSAN since we fork in a multithreaded
environment.
ghstack-source-id: 109566396

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D23007746

fbshipit-source-id: 65571607522b790280363882d61bfac8a52007a1
2020-08-10 19:13:52 -07:00
a2559652ab Rename some BatchedTensorImpl APIs (#42700)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42700

I was about to use `isBatched` somewhere not in the files used to
implement vmap but then realized how silly that sounds due to
ambiguity. This PR renames some of the BatchedTensor APIs to make a bit
more sense to onlookers.

- isBatched(Tensor) -> isBatchedTensor(Tensor)
- unsafeGetBatched(Tensor) -> unsafeGetBatchedImpl(Tensor)
- maybeGetBatched(Tensor) -> maybeGetBatchedImpl(Tensor)

Test Plan: - build Pytorch, run tests.

Reviewed By: ezyang

Differential Revision: D22985868

Pulled By: zou3519

fbshipit-source-id: b8ed9925aabffe98085bcf5c81d22cd1da026f46
2020-08-10 17:43:20 -07:00
8f67c7a624 BatchedTensor fallback: extended to support ops with multiple Tensor returns (#42628)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42628

This PR extends the BatchedTensor fallback to support operators with
multiple Tensor returns. If an operator has multiple returns, we stack
shards of each return to create the full outputs.

Test Plan:
- `pytest test/test_vmap.py -v`. Added a new test for an operator with
multiple returns (torch.var_mean).

Reviewed By: izdeby

Differential Revision: D22957095

Pulled By: zou3519

fbshipit-source-id: 5c0ec3bf51283cc4493b432bcfed1acf5509e662
2020-08-10 17:42:03 -07:00
64a7939ee5 test_cpp_rpc: Build test_e2e_process_group.cpp only if USE_GLOO is true (#42836)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42836

Reviewed By: seemethere

Differential Revision: D23041274

Pulled By: malfet

fbshipit-source-id: 8605332701271bea6d9b3a52023f548c11d8916f
2020-08-10 16:54:26 -07:00
8718524571 [vulkan] cat op (concatenate) (#41434)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41434

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22754941

Pulled By: IvanKobzarev

fbshipit-source-id: cd03577e1c2f639b2592d4b7393da4657422e23c
2020-08-10 16:24:20 -07:00
3cf2551f2f Fix torch.nn.functional.grid_sample crashes if grid has NaNs (#42703)
Summary:
In `clip_coordinates` replace `minimum(maximum(in))` composition with `clamp_max(clamp_min(in))`
Swap order of `clamp_min` operands to clamp NaNs in grid to 0

Fixes https://github.com/pytorch/pytorch/issues/42616

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42703

Reviewed By: ezyang

Differential Revision: D22987447

Pulled By: malfet

fbshipit-source-id: a8a2d6de8043d6b77c8707326c5412d0250efae6
2020-08-10 16:20:09 -07:00
e06b4be5ae change pt_defs.bzl to python file (#42725)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42725

This diff changes pt_defs.bzl to pt_defs.py, so that it can be included as a Python source file.

The reason is that if we remove base ops, pt_defs.bzl becomes too big (8k lines) and we cannot pass its content to gen_oplist (a Python library). The easy solution is to change it to a Python source file so that it can be used in gen_oplist.

Test Plan: sandcastle

Reviewed By: ljk53, iseeyuan

Differential Revision: D22968258

fbshipit-source-id: d720fe2e684d9a2bf5bd6115b6e6f9b812473f12
2020-08-10 16:12:43 -07:00
752f433a24 DDP communication hook: skip dividing grads by world_size if hook registered. (#42400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42400

mcarilli spotted that in the original DDP communication hook design described in [39272](https://github.com/pytorch/pytorch/issues/39272), the hooks receive grads that are already predivided by world size.

It makes sense to skip the division completely if a hook is registered. The hook is meant for the user to completely override DDP communication. For example, if the user would like to implement something like GossipGrad, always dividing by the world_size would not be a good idea (see the sketch below).

We also included a warning in the register_comm_hook API as:
> GradBucket bucket's tensors will not be predivided by world_size. User is responsible to divide by the world_size in case of operations like allreduce.
ghstack-source-id: 109548696

**Update:** We discovered and fixed a bug with the sparse tensors case. See new unit test called `test_ddp_comm_hook_sparse_gradients` and changes in `reducer.cpp`.
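
A sketch of a hook under the new contract, as mentioned above. The `get_tensors` accessor and the `register_comm_hook` invocation follow the prototype GradBucket API of this period and are best-effort assumptions:
```python
import torch.distributed as dist

def allreduce_hook(state, bucket):
    # GradBucket accessor name is an assumption from the prototype API
    tensor = bucket.get_tensors()[0]
    # The hook now owns the division; DDP no longer predivides by world_size
    tensor.div_(dist.get_world_size())
    return dist.all_reduce(tensor, async_op=True).get_future()

# Registered on a DistributedDataParallel model, e.g.:
#   ddp_model.register_comm_hook(state=None, hook=allreduce_hook)
```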

Test Plan: python test/distributed/test_c10d.py and perf benchmark tests.

Reviewed By: ezyang

Differential Revision: D22883905

fbshipit-source-id: 3277323fe9bd7eb6e638b7ef0535cab1fc72f89e
2020-08-10 13:55:42 -07:00
d7aaa3327b .circleci: Only do comparisons when available (#42816)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42816

Comparisons were being done on branches where '<< pipeline.git.base_revision >>' didn't exist before, so let's just move things so that the comparison code branch is only run when that variable is available.

Example: https://app.circleci.com/pipelines/github/pytorch/pytorch/198611/workflows/8a316eef-d864-4bb0-863f-1454696b1e8a/jobs/6610393

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23032900

Pulled By: seemethere

fbshipit-source-id: 98a49c78b174d6fde9c6b5bd3d86a6058d0658bd
2020-08-10 12:33:37 -07:00
d83cc92948 [ONNX] Add support for scalar src in torch.scatter ONNX export. (#42765)
Summary:
`torch.scatter` supports two overloads – one where the `src` input tensor is the same size as the `index` tensor, and a second where `src` is a scalar. Currently, the ONNX exporter only supports the first overload. This PR adds export support for the second overload of `torch.scatter`.
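
A minimal sketch of the two overloads (the second is the one gaining export support):
```python
import torch

x = torch.zeros(2, 3)
idx = torch.tensor([[0, 1, 0]])
src = torch.ones(1, 3)
print(x.scatter(0, idx, src))   # overload 1: tensor `src`
print(x.scatter(0, idx, 1.5))   # overload 2: scalar `src`
```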

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42765

Reviewed By: hl475

Differential Revision: D23025189

Pulled By: houseroad

fbshipit-source-id: 5c2a3f3ce3b2d69661a227df8a8e0ed7c1858dbf
2020-08-10 11:45:42 -07:00
e7b5a23607 include missing settings import
Summary: from hypothesis import given, settings

Test Plan: test_op_nnpi_fp16.py

Differential Revision: D23031038

fbshipit-source-id: 751547e6a6e992d8816d4cc2c5a699ba19a97796
2020-08-10 10:45:34 -07:00
77305c1e44 Automated submodule update: FBGEMM (#42781)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42781

This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: fbd813e29f

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42771

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: dskhudia

Differential Revision: D23015890

Pulled By: jspark1105

fbshipit-source-id: f0f62969f8744df96a4e7f5aff2ce95baabb2f76
2020-08-10 10:14:56 -07:00
e5adf45dde Add python unittest target to caffe2/test/TARGETS (#42766)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42766

**Summary**
Some Python tests are missing in `caffe2/test/TARGETS`; add them to make the target list more comprehensive.

According to [run_test.py](https://github.com/pytorch/pytorch/blob/master/test/run_test.py#L125), some tests are slower. Slow tests are added as independent targets and the others are put together into one `others` target. The reason is that we want to reduce overhead, especially for code coverage collection. Tests in one target can be run as a bundle, and then coverage can be collected together. The coverage collection procedure is typically time-expensive, so this helps us save time.

Test Plan:
Run all the new test targets locally in dev server and record the time they cost.
**Statistics**

```
# jit target
real    33m7.694s
user    653m1.181s
sys     58m14.160s

--------- Compare to Initial Jit Target runtime: ----------------

real    32m13.057s
user    613m52.843s
sys     54m58.678s

```

```
# others target
real    9m2.920s
user    164m21.927s
sys     12m54.840s
```

```
# serialization target
real    4m21.090s
user    23m33.501s
sys     1m53.308s

```

```
# tensorexpr
real    11m28.187s
user    33m36.420s
sys     1m15.925s
```

```
# type target
real    3m36.197s
user    51m47.912s
sys     4m14.149s
```

Reviewed By: malfet

Differential Revision: D22979219

fbshipit-source-id: 12a30839bb76a64871359bc024e4bff670c5ca8b
2020-08-10 09:48:59 -07:00
bc779667d6 generalize circleci docker build.sh and add centos support (#41255)
Summary:
Add a CentOS Dockerfile and support for it to the CircleCI docker builds, and allow generic image names to be parsed by build.sh, so both hardcoded images and custom images can be built.

Currently this only adds a ROCm CentOS Dockerfile.

CC ezyang xw285cornell sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41255

Reviewed By: mrshenli

Differential Revision: D23003218

Pulled By: malfet

fbshipit-source-id: 562c53533e7fb9637dc2e81edb06b2242afff477
2020-08-10 09:42:05 -07:00
05f00532f5 Fix TensorPipe submodule (#42789)
Summary:
Not sure what happened, but I possibly landed a PR on PyTorch that updated the TensorPipe submodule to a commit hash from a *PR* of TensorPipe. Now that the latter PR has been merged, that same commit has a different hash, so the commit referenced by PyTorch has become orphaned. This is causing some issues.

Hence here I am updating the commit, which however does not change a single line of code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42789

Reviewed By: houseroad

Differential Revision: D23023238

Pulled By: lw

fbshipit-source-id: ca2dcf6b7e07ab64fb37e280a3dd7478479f87fd
2020-08-10 02:15:44 -07:00
55ac240589 [ONNX] Fix scalar type cast for comparison ops (#37787)
Summary:
Always promote type casts for comparison operators, regardless of whether the input is a tensor or a scalar. This is unlike arithmetic operators, where scalars are implicitly cast to the same type as tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/37787

Reviewed By: hl475

Differential Revision: D21440585

Pulled By: houseroad

fbshipit-source-id: fb5c78933760f1d1388b921e14d73a2cb982b92f
2020-08-09 23:00:57 -07:00
162972e980 Fix op benchmark (#42757)
Summary:
A benchmark relies on abs_ having a functional variant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42757

Reviewed By: ngimel

Differential Revision: D23011037

Pulled By: mruberry

fbshipit-source-id: c04866015fa259e4c544e5cf0c33ca1e11091d92
2020-08-09 17:31:51 -07:00
87970b70a7 Adds 'clip' alias for clamp (#42770)
Summary:
Per title. Also updates our guidance for adding aliases to clarify interned_string and method_test requirements. The alias is tested by extending test_clamp to also test clip.
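
A minimal sketch of the alias in use:
```python
import torch

x = torch.tensor([-2.0, 0.5, 3.0])
# clip is an alias of clamp; all three calls produce tensor([-1.0, 0.5, 1.0])
print(torch.clamp(x, min=-1, max=1))
print(torch.clip(x, min=-1, max=1))
print(x.clip(-1, 1))
```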

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42770

Reviewed By: ngimel

Differential Revision: D23020655

Pulled By: mruberry

fbshipit-source-id: f1d8e751de9ac5f21a4f95d241b193730f07b5dc
2020-08-09 02:46:02 -07:00
b6810c1064 Include/ExcludeDispatchKeySetGuard API (#42658)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42658

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22971426

Pulled By: bhosmer

fbshipit-source-id: 4d63e0cb31745e7b662685176ae0126ff04cdece
2020-08-08 16:27:05 -07:00
79b8328aaf optimize_for_mobile: bring packed params to root module (#42740)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42740

Adds a pass to hoist conv packed params to root module.
The benefit is that if there is nothing else in the conv module,
subsequent passes will delete it, which will reduce module size.

For context, freezing does not handle this because conv packed
params is a custom object.

Test Plan:
```
PYTORCH_JIT_LOG_LEVEL=">hoist_conv_packed_params.cpp" python test/test_mobile_optimizer.py TestOptimizer.test_hoist_conv_packed_params
```

Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D23005961

fbshipit-source-id: 31ab1f5c42a627cb74629566483cdc91f3770a94
2020-08-08 15:53:20 -07:00
d8801f590c fix asan failure for module freezing in conv bn folding (#42739)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42739

This adds a test case which fails under ASAN at the module freezing step.

Test Plan:
```
USE_ASAN=1 USE_CUDA=0 python setup.py develop
LD_PRELOAD=/usr/lib64/libasan.so.4 python test/test_mobile_optimizer.py TestOptimizer.test_optimize_for_mobile_asan

// output tail: https://gist.github.com/vkuzo/7a0018b9e10ffe64dab0ac7381479f23
```

Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D23005962

fbshipit-source-id: b7d4492e989af7c2e22197c16150812bd2dda7cc
2020-08-08 15:51:59 -07:00
5cd0f5e8ec [PyFI] Update hypothesis and switch from tp2 (#41645)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41645

Pull Request resolved: https://github.com/facebookresearch/pytext/pull/1405

Test Plan: buck test

Reviewed By: thatch

Differential Revision: D20323893

fbshipit-source-id: 54665d589568c4198e96a27f0ed8e5b41df7b86b
2020-08-08 12:13:04 -07:00
b7a9bc0802 Revert D22217029: Add fake quantize operator that works in backward pass
Test Plan: revert-hammer

Differential Revision:
D22217029 (48e978ba18)

Original commit changeset: 7055a2cdafcf

fbshipit-source-id: f57a27be412c6fbfd5a5b07a26f758ac36be3b67
2020-08-07 23:04:40 -07:00
18ca999e1a integrate int8 swish with net transformer
Summary:
add a fuse path for deq->swish->quant
update swish fake op interface to take arguments accordingly

Test Plan:
net_runner passes
unit tests need to be updated

Reviewed By: venkatacrc

Differential Revision: D22962064

fbshipit-source-id: cef79768db3c8af926fca58193d459d671321f80
2020-08-07 23:01:06 -07:00
c889de7e25 update DispatchKey::toString() (#42619)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42619

Added missing entries to `DispatchKey::toString()` and reordered to match declaration order in `DispatchKey.h`

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22963407

Pulled By: bhosmer

fbshipit-source-id: 34a012135599f497c308ba90ea6e8117e85c74ac
2020-08-07 22:39:23 -07:00
5dd230d6a2 [vulkan] inplace add_, relu_ (#41380)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41380

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22754939

Pulled By: IvanKobzarev

fbshipit-source-id: 19b0bbfc5e1f149f9996b5043b77675421ecb2ed
2020-08-07 21:18:17 -07:00
6755e49cad Set proper return type (#42454)
Summary:
This function was always expected to return a `size_t` value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42454

Reviewed By: ezyang

Differential Revision: D22993168

Pulled By: ailzhang

fbshipit-source-id: 044df8ce17983f04681bda8c30cd742920ef7b1e
2020-08-07 19:22:35 -07:00
e95fbaaba3 Adding Peter's Swish Op ULP analysis. (#42573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42573

* Generate the ULP png files for different ranges.

Test Plan: test_op_ulp_error.py

Reviewed By: hyuen

Differential Revision: D22938572

fbshipit-source-id: 6374bef6d44c38e1141030d44029dee99112cd18
2020-08-07 19:13:01 -07:00
0a804be47d [NCCL] DDP communication hook: getFuture() without cudaStreamAddCallback (#42335)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42335

**Main goal:** For DDP communication hook, provide an API called "get_future" to retrieve a future associated with the completion of c10d.ProcessGroupNCCL.work. Enable NCCL support for this API in this diff.

We add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`. This API will only be supported by NCCL in the first version; the default implementation will throw UnsupportedOperation.
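
A minimal Python-side usage sketch, assuming an initialized NCCL process group and that the binding is exposed as `work.get_future()`:
```python
import torch
import torch.distributed as dist

t = torch.ones(4, device="cuda")
work = dist.all_reduce(t, async_op=True)
fut = work.get_future()   # new API; NCCL-only in this first version
fut.wait()                # completes once the allreduce has finished
print(t)
```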

We no longer consider a design that involves cudaStreamAddCallback which potentially was causing performance regression in [#41596](https://github.com/pytorch/pytorch/pull/41596).

ghstack-source-id: 109461507

Test Plan:
```(pytorch) [sinannasir@devgpu017.ash6 ~/local/pytorch] python test/distributed/test_c10d.py
Couldn't download test skip set, leaving all tests enabled...
..............................s.....................................................s................................
----------------------------------------------------------------------
Ran 117 tests in 298.042s

OK (skipped=2)
```
### Facebook Internal:
2\. HPC PT trainer run to validate no regression. Check the QPS number:
**Master:** QPS after 1000 iters: around ~34100
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER"testvideo_master" --trainers 16 --trainer-version 1c53912
```
```
[0] I0806 142048.682 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963963 0.950479 0.953704], lifetime NE: [0.963963 0.950479 0.953704], loss: [0.243456 0.235225 0.248375], QPS: 34199
```
[detailed logs](https://www.internalfb.com/intern/tupperware/details/task/?handle=priv3_global%2Fmast_hpc%2Fhpc.sinannasirtestvideo_mastwarm.trainer.trainer%2F0&ta_tab=logs)

**getFuture/new design:** QPS after 1000 iters: around ~34030
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER"testvideo_getFutureCyclicFix" --trainers 16 --trainer-version 8553aee
```
```
[0] I0806 160149.197 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963959 0.950477 0.953704], lifetime NE: [0.963959 0.950477 0.953704], loss: [0.243456 0.235225 0.248375], QPS: 34018
```
[detailed logs](https://www.internalfb.com/intern/tupperware/details/task/?handle=priv3_global%2Fmast_hpc%2Fhpc.sinannasirtestvideo_getFutureCyclicFix.trainer.trainer%2F0&ta_tab=logs)
**getFuture/new design Run 2:** QPS after 1000 iters: around ~34200
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER"test2video_getFutureCyclicFix" --trainers 16 --trainer-version 8553aee
```
```
[0] I0806 160444.650 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963963 0.950482 0.953706], lifetime NE: [0.963963 0.950482 0.953706], loss: [0.243456 0.235225 0.248375], QPS: 34201
```
[detailed logs](https://www.internalfb.com/intern/tupperware/details/task/?handle=priv3_global%2Fmast_hpc%2Fhpc.sinannasirtest2video_getFutureCyclicFix.trainer.trainer%2F0&ta_tab=logs)
**getFuture/old design (Regression):** QPS after 1000 iters: around ~31150
```
hpc_dist_trainer --fb-data=none --mtml-fusion-level=1 --target-model=ifr_video --max-ind-range=1000000 --embedding-partition=row-wise mast --domain $USER”testvideo_OLDgetFutureD22583690 (d904ea5972)" --trainers 16 --trainer-version 1cb5cbb
```
```
priv3_global/mast_hpc/hpc.sinannasirtestvideo_OLDgetFutureD22583690 (d904ea5972).trainer.trainer/0 [0] I0805 101320.407 metrics_publishers.py:50] Finished iter 999, Local  window NE: [0.963964 0.950482 0.953703], lifetime NE: [0.963964 0.950482 0.953703], loss: [0.243456 0.235225 0.248375], QPS: 31159
```
3\. `flow-cli` tests; roberta_base; world_size=4:
**Master:** f210039922
```
total:
  32 GPUs -- 32 GPUs: p25:  0.908    35/s  p50:  1.002    31/s  p75:  1.035    30/s  p90:  1.051    30/s  p95:  1.063    30/s
forward:
  32 GPUs -- 32 GPUs: p25:  0.071   452/s  p50:  0.071   449/s  p75:  0.072   446/s  p90:  0.072   445/s  p95:  0.072   444/s
backward:
  32 GPUs -- 32 GPUs: p25:  0.821    38/s  p50:  0.915    34/s  p75:  0.948    33/s  p90:  0.964    33/s  p95:  0.976    32/s
optimizer:
  32 GPUs -- 32 GPUs: p25:  0.016  2037/s  p50:  0.016  2035/s  p75:  0.016  2027/s  p90:  0.016  2019/s  p95:  0.016  2017/s
```
**getFuture new design:** f210285797
```
total:
  32 GPUs -- 32 GPUs: p25:  0.952    33/s  p50:  1.031    31/s  p75:  1.046    30/s  p90:  1.055    30/s  p95:  1.070    29/s
forward:
  32 GPUs -- 32 GPUs: p25:  0.071   449/s  p50:  0.072   446/s  p75:  0.072   445/s  p90:  0.072   444/s  p95:  0.072   443/s
backward:
  32 GPUs -- 32 GPUs: p25:  0.865    37/s  p50:  0.943    33/s  p75:  0.958    33/s  p90:  0.968    33/s  p95:  0.982    32/s
optimizer:
  32 GPUs -- 32 GPUs: p25:  0.016  2037/s  p50:  0.016  2033/s  p75:  0.016  2022/s  p90:  0.016  2018/s  p95:  0.016  2017/s

```

Reviewed By: ezyang

Differential Revision: D22833298

fbshipit-source-id: 1bb268d3b00335b42ee235c112f93ebe2f25b208
2020-08-07 18:48:35 -07:00
d4a4c62df3 [caffe2] Fix the timeout (stuck) issues of dedup SparseAdagrad C2 kernel
Summary:
Back out D22800959 (f30ac66e79). It is causing the timeout (machine stuck) issues for the dedup kernels; reverting it makes the unit test pass. We still need to investigate why this is the culprit...

Original commit changeset: 641d52a51070

Test Plan:
```
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_sum_gradient \(caffe2\.caffe2\.fb\.net_transforms\.tests\.fuse_sparse_ops_test\.TestFuseSparseOps\)' --print-passing-details
```

Reviewed By: jspark1105

Differential Revision: D23008389

fbshipit-source-id: 4f1b9a41c78eaa5541d57b9d8aa12401e1d495f2
2020-08-07 18:42:36 -07:00
3fa0581cf2 [fbgemm] use new more general depthwise 3d conv interface (#42697)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42697

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/401

As title

Test Plan: CI

Reviewed By: dskhudia

Differential Revision: D22972233

fbshipit-source-id: a2c8e989dee84b2c0587faccb4f8e3bcb05c797c
2020-08-07 18:30:56 -07:00
13bc542829 Fix lite trainer unit test submodule registration (#42714)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42714

Change two unit tests for the lite trainer to register two instances/objects of the same submodule type instead of the same submodule object twice.

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D22990736

Pulled By: ann-ss

fbshipit-source-id: 2bf56b5cc438b5a5fc3db90d3f30c5c431d3ae77
2020-08-07 18:26:56 -07:00
48e978ba18 Add fake quantize operator that works in backward pass (#40532)
Summary:
This diff adds FakeQuantizeWithBackward. This works the same way as the regular FakeQuantize module, allowing QAT to occur in the forward pass, except it has an additional quantize_backward parameter. When quantize_backward is enabled, the gradients are fake quantized as well (dynamically, using hard-coded values). This allows the user to see whether there would be a significant loss of accuracy if the gradients were quantized in their model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40532

Test Plan: The relevant test for this can be run using `python test/test_quantization.py TestQATBackward.test_forward_and_backward`

Reviewed By: supriyar

Differential Revision: D22217029

Pulled By: durumu

fbshipit-source-id: 7055a2cdafcf022f1ea11c3442721ae146d2b3f2
2020-08-07 17:47:01 -07:00
2b04712205 Exposing Percentile Caffe2 Operator in PyTorch
Summary: As titled.

Test Plan:
```
buck test caffe2/caffe2/python/operator_test:torch_integration_test -- test_percentile
```

Reviewed By: yf225

Differential Revision: D22999896

fbshipit-source-id: 2e3686cb893dff1518d533cb3d78c92eb2a6efa5
2020-08-07 16:22:37 -07:00
55b1706775 Skips some complex tests on ROCm (#42759)
Summary:
Fixes ROCm build on OSS master.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42759

Reviewed By: ngimel

Differential Revision: D23011560

Pulled By: mruberry

fbshipit-source-id: 3339ecbd5a0ca47aede6f7c3f84739af1ac820d5
2020-08-07 16:12:32 -07:00
95f4f67552 Restrict conversion to SmallVector (#42694)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42694

The old implementation allowed calling the SmallVector constructor and operator= for any type without restriction,
but then failed with a compiler error when the type wasn't a collection.

Instead, we should only enable these overloads if Container satisfies a container concept, and simply not match the constructor otherwise.

This fixes an issue kimishpatel was running into.
ghstack-source-id: 109370513

Test Plan: unit tests

Reviewed By: kimishpatel, ezyang

Differential Revision: D22983020

fbshipit-source-id: c31264f5c393762d822f3d64dd2a8e3279d8da44
2020-08-07 15:47:29 -07:00
faca3c43e6 fix celu in quantized benchmark (#42756)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42756

Similar to ELU, CELU was also broken in the quantized benchmark; this fixes it.

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qactivation_test
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23010863

fbshipit-source-id: 203e63f9cff760af6809f6f345b0d222dc1e9e1b
2020-08-07 15:23:50 -07:00
4eb66b814e Automated submodule update: FBGEMM (#42713)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: a989b99279

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42713

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: amylittleyang

Differential Revision: D22990108

Pulled By: jspark1105

fbshipit-source-id: 3252a0f5ad9546221ef2fe908ce6b896252e1887
2020-08-07 13:41:54 -07:00
02f58bdbd7 [caffe2] add type annotations for caffe2.distributed.python
Summary: Add Python type annotations for the `caffe2.distributed.python` module.

Test Plan: Will check sandcastle results.

Reviewed By: jeffdunn

Differential Revision: D22994012

fbshipit-source-id: 30565cc41dd05b5fbc639ae994dfe2ddd9e56cb1
2020-08-07 13:12:53 -07:00
6ebc0504ca BAND, BOR and BXOR for NCCL (all_)reduce should throw runtime errors (#42669)
Summary:
cc rohan-varma
Fixes https://github.com/pytorch/pytorch/issues/41362 #39708

# Description
NCCL doesn't support `BAND, BOR, BXOR`. Since the [current mapping](0642d17efc/torch/lib/c10d/ProcessGroupNCCL.cpp (L39)) doesn't contain any of the mentioned bitwise operators, a default value of `ncclSum` is used instead.

This PR should provide the expected behaviour where a runtime exception is thrown.

# Notes
- The way I'm throwing exceptions is derived from [ProcessGroupGloo.cpp](0642d17efc/torch/lib/c10d/ProcessGroupGloo.cpp (L101))
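
A minimal sketch of the resulting behaviour, assuming an initialized NCCL process group:
```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl", ...) has already run.
t = torch.ones(1, dtype=torch.uint8, device="cuda")
try:
    dist.all_reduce(t, op=dist.ReduceOp.BOR)  # unsupported by NCCL
except RuntimeError as e:
    print(e)  # a runtime error instead of silently using ncclSum
```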

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42669

Reviewed By: ezyang

Differential Revision: D22996295

Pulled By: rohan-varma

fbshipit-source-id: 83a9fedf11050d2890f9f05ebcedf53be0fc3516
2020-08-07 13:09:07 -07:00
7332c21f7a Speed up HistogramObserver by vectorizing critical path (#41041)
Summary:
A 22x speedup over the code it replaces. Tested on ResNet18 on a devvm using CPU only, with default parameters for HistogramObserver (i.e., 2048 bins).
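
For reference, a minimal sketch of the code path being benchmarked (observer defaults as described above):
```python
import torch
from torch.quantization import HistogramObserver

obs = HistogramObserver()            # default 2048 bins
for _ in range(8):
    obs(torch.randn(3, 512, 512))    # accumulate the histogram
scale, zero_point = obs.calculate_qparams()  # the vectorized hot path
print(scale, zero_point)
```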

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41041

Test Plan:
To run the test against the reference (old) implementation, you can use `python test/test_quantization.py TestRecordHistogramObserver.test_histogram_observer_against_reference`.

To run the benchmark, while in the folder `benchmarks/operator_benchmark`, you can use `python -m benchmark_all_quantized_test --operators HistogramObserverCalculateQparams`.

Benchmark results before speedup:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_affine
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_affine
Forward Execution Time (us) : 185818.566

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_symmetric
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_symmetric
Forward Execution Time (us) : 165325.916
```

Benchmark results after speedup:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_affine
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_affine
Forward Execution Time (us) : 12242.241

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_symmetric
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_symmetric
Forward Execution Time (us) : 12655.354
```

Reviewed By: raghuramank100

Differential Revision: D22400755

Pulled By: durumu

fbshipit-source-id: 639ac796a554710a33c8a930c1feae95a1148718
2020-08-07 12:29:23 -07:00
98de150381 C++ API TransformerEncoderLayer (#42633)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42633

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22994332

Pulled By: glaringlee

fbshipit-source-id: 873abdf887d135fb05bde560d695e2e8c992c946
2020-08-07 11:49:42 -07:00
eba35025e0 [JIT] Exclude staticmethods from TS class compilation (#42611)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42611

**Summary**
This commit modifies the Python frontend to ignore static functions on
Torchscript classes when compiling them. They are currently included
along with methods, which causes the first argument of the
staticfunction to be unconditionally inferred to be of the type of the
class it belongs to (regardless of how it is annotated or whether it is
annotated at all). This can lead to compilation errors depending on
how that argument is used in the body of the function.

Static functions are instead imported and scripted as if they were
standalone functions.
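
A minimal sketch of the pattern this fixes (class and method names are hypothetical):
```python
import torch

@torch.jit.script
class Counter(object):
    def __init__(self, start: int):
        self.value = start

    @staticmethod
    def add(a: int, b: int) -> int:
        # 'a' keeps its annotated type instead of being inferred as Counter
        return a + b

print(Counter.add(1, 2))  # 3
```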

**Test Plan**
This commit augments the unit test for static methods in `test_class_types.py`
to test that static functions can call each other and the class
constructor.

**Fixes**
This commit fixes #39308.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D22958163

Pulled By: SplitInfinity

fbshipit-source-id: 45c3c372792299e6e5288e1dbb727291e977a2af
2020-08-07 11:22:04 -07:00
9f88bcb5a2 Minor typo fix (#42731)
Summary:
Just fixed a typo in test/test_sparse.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42731

Reviewed By: ezyang

Differential Revision: D22999930

Pulled By: mrshenli

fbshipit-source-id: 1b5b21d7cb274bd172fb541b2761f727ba06302c
2020-08-07 11:17:51 -07:00
04c62d4a06 [vulkan] Fix warnings: static_cast, remove unused (#42195)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42195

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22803035

Pulled By: IvanKobzarev

fbshipit-source-id: d7bf256437eccb5c421a7fd0aa8ec23a8fec0470
2020-08-07 11:12:54 -07:00
586399c03f Remove duplicate definitions of CppTypeToScalarType (#42640)
Summary:
I noticed that `TensorIteratorDynamicCasting.h` defines a helper meta-function `CPPTypeToScalarType` which does exactly the same thing as the `c10::CppTypeToScalarType` meta-function I added in gh-40927. No need for two identical definitions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42640

Reviewed By: malfet

Differential Revision: D22969708

Pulled By: ezyang

fbshipit-source-id: 8303c7f4a75ae248f393a4811ae9d2bcacab44ff
2020-08-07 11:02:42 -07:00
944ac133d0 [NNC] Remove VarBinding and go back to Let stmts (#42634)
Summary:
A while back, when commonizing the Let and LetStmt nodes, I ended up removing both and adding a separate VarBinding section to the Block. At the time I couldn't find a counterexample, but I found one today: local var and allocation dependencies may go in either direction, so we need to support interleaving of those statements.

So, I've removed all the VarBinding logic and reimplemented Let statements. ZolotukhinM I think you get to say "I told you so". No new tests, existing tests should cover this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42634

Reviewed By: mruberry

Differential Revision: D22969771

Pulled By: nickgg

fbshipit-source-id: a46c5193357902d0f59bf30ab103fe123b1503f1
2020-08-07 10:50:38 -07:00
2971bc23a6 Handle fused scale and bias in fake fp16 layernorm
Summary: Allow passing scale and bias to fake fp16 layernorm.

Test Plan: net_runner. Now matches glow's fused layernorm.

Reviewed By: hyuen

Differential Revision: D22952646

fbshipit-source-id: cf9ad055b14f9d0167016a18a6b6e26449cb4de8
2020-08-07 10:48:33 -07:00
dcee8933fb Fix some linking rules to allow path with whitespaces (#42718)
Summary:
Essentially, replace `-Wl,--whole-archive,$<TARGET_FILE:FOO>` with `-Wl,--whole-archive,\"$<TARGET_FILE:FOO>\"`, as TARGET_FILE might return a path containing whitespace.

Fixes https://github.com/pytorch/pytorch/issues/42657

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42718

Reviewed By: ezyang

Differential Revision: D22993568

Pulled By: malfet

fbshipit-source-id: de878b17d20e35b51dd350f20d079c8b879f70b5
2020-08-07 10:23:23 -07:00
9c8021c0b1 Adds torch.linalg namespace (#42664)
Summary:
This PR adds the `torch.linalg` namespace as part of our continued effort to be more compatible with NumPy. The namespace is tested by adding a single function, `torch.linalg.outer`, and testing it in a new test suite, test_linalg.py. It follows the same pattern that https://github.com/pytorch/pytorch/pull/41911, which added the `torch.fft` namespace, did.
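
A minimal usage sketch of the single function added here, per this PR:
```python
import torch

a = torch.arange(1., 4.)
b = torch.arange(1., 3.)
# Equivalent to the legacy torch.ger:
print(torch.linalg.outer(a, b))
print(torch.ger(a, b))
```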

Future PRs will likely:

- add more functions to torch.linalg
- expand the testing done in test_linalg.py, including legacy functions, like torch.ger
- deprecate existing linalg functions outside of `torch.linalg` in preference to the new namespace

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42664

Reviewed By: ngimel

Differential Revision: D22991019

Pulled By: mruberry

fbshipit-source-id: 39258d9b116a916817b3588f160b141f956e5d0b
2020-08-07 10:18:30 -07:00
c9346ad3b8 [CPU] Added torch.bmm for complex tensors (#42383)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42383

Test Plan - Updated existing tests to run for complex dtypes as well.

Also added tests for `torch.addmm`, `torch.badmm`

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22960339

Pulled By: anjali411

fbshipit-source-id: 0805f21caaa40f6e671cefb65cef83a980328b7d
2020-08-07 10:04:20 -07:00
31ed468905 Fix cmake warning (#42707)
Summary:
If arguments in set_target_properties are not separated by whitespace, cmake raises a warning:
```
CMake Warning (dev) at cmake/public/cuda.cmake:269:
  Syntax Warning in cmake code at column 54

  Argument not separated from preceding token by whitespace.
```

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42707

Reviewed By: ailzhang

Differential Revision: D22988055

Pulled By: malfet

fbshipit-source-id: c3744f23b383d603788cd36f89a8286a46b6c00f
2020-08-07 09:57:21 -07:00
3c66a3795a [vulkan] Ops registration to TORCH_LIBRARY_IMPL (#42194)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42194

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22803036

Pulled By: IvanKobzarev

fbshipit-source-id: 2f402541aecf887d78f650bf05d758a0e403bc4d
2020-08-07 09:06:22 -07:00
4eb02add51 Blacklist to Blocklist in onnxifi_transformer (#42590)
Summary:
Fixes issues in https://github.com/pytorch/pytorch/issues/41704 and https://github.com/pytorch/pytorch/issues/41705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42590

Reviewed By: ailzhang

Differential Revision: D22977357

Pulled By: malfet

fbshipit-source-id: ab61b964cfdf8bd2b469f4ff8f6486a76bc697de
2020-08-07 08:05:32 -07:00
fb8aa0046c Add use_glow_aot, and include ONNX again as a backend for onnxifiGlow (#4787)
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/4787

Resurrect ONNX as a backend through onnxifiGlow (was killed as part of D16215878). Then look for the `use_glow_aot` argument in the Onnxifi op. If it's there and true, then we override whatever `backend_id` is set and use the ONNX backend.

Reviewed By: yinghai, rdzhabarov

Differential Revision: D22762123

fbshipit-source-id: abb4c3458261f8b7eeae3016dda5359fa85672f0
2020-08-07 04:31:24 -07:00
73642d9425 Updates alias pattern (and torch.absolute to use it) (#42586)
Summary:
This PR canonicalizes our (current) pattern for adding aliases to PyTorch. That pattern is:

- Copy the original functions native_functions.yaml entry, but replace the original function's name with their own.
- Implement the corresponding functions and have them redispatch to the original function.
- Add docstrings to the new functions that reference the original function.
- Update the alias_map in torch/csrc/jit/passes/normalize_ops.cpp.
- Update the op_alias_mappings in torch/testing/_internal/jit_utils.py.
- Add a test validating the alias's behavior is the same as the original function's.

An alternative pattern would be to use Python and C++ language features to alias ops directly. For example in Python:

```
torch.absolute = torch.abs
```

Let the pattern in this PR be the "native function" pattern, and the alternative pattern be the "language pattern." There are pros/cons to both approaches:

**Pros of the "Language Pattern"**
- torch.absolute is torch.abs.
- no (or very little) overhead for calling the alias.
- no native_functions.yaml redundancy or possibility of "drift" between the original function's entries and the alias's.

**Cons of the "Language Pattern"**
- requires manually adding doc entries
- requires updating Python alias and C++ alias lists
- requires hand writing alias methods on Tensor (technically this should require a C++ test to validate)
- no single list of all PyTorch ops -- have to check native_functions.yaml and one of the separate alias lists

**Pros of the "Native Function" pattern**

- alias declarations stay in native_functions.yaml
- doc entries are written as normal

**Cons of the "Native Function" pattern**

- aliases redispatch to the original functions
- torch.absolute is not torch.abs (requires writing test to validate behavior)
- possibility of drift between original's and alias's native_functions.yaml entries

While either approach is reasonable, I suggest the "native function" pattern since it preserves "native_functions.yaml" as a source of truth and minimizes the number of alias lists that need to be maintained. In the future, entries in native_functions.yaml may support an "alias" argument and replace whatever pattern we choose now.
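
A minimal sketch of the observable consequence of the chosen pattern:
```python
import torch

x = torch.tensor([-1.0, 2.0])
# The alias redispatches, so results match...
assert torch.equal(torch.absolute(x), torch.abs(x))
# ...but the two functions are distinct objects under this pattern.
print(torch.absolute is torch.abs)  # False
```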

Ops that are likely to use aliasing are:

- div (divide, true_divide)
- mul (multiply)
- bucketize (digitize)
- cat (concatenate)
- clamp (clip)
- conj (conjugate)
- rad2deg (degrees)
- trunc (fix)
- neg (negative)
- deg2rad (radians)
- round (rint)
- acos (arccos)
- acosh (arcosh)
- asin (arcsin)
- asinh (arcsinh)
- atan (arctan)
- atan2 (arctan2)
- atanh (arctanh)
- bartlett_window (bartlett)
- hamming_window (hamming)
- hann_window (hanning)
- bitwise_not (invert)
- gt (greater)
- ge (greater_equal)
- lt (less)
- le (less_equal)
- ne (not_equal)
- ger (outer)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42586

Reviewed By: ngimel

Differential Revision: D22991086

Pulled By: mruberry

fbshipit-source-id: d6ac96512d095b261ed2f304d7dddd38cf45e7b0
2020-08-07 00:24:06 -07:00
cb1ac94069 [blob reorder] Separate user embeddings and ad embeddings in large model loading script
Summary: Put user embeddings before ads embeddings in blobReorder, for flash verification reasons.

Test Plan:
```
buck run mode/opt-clang -c python.package_style=inplace sigrid/predictor/scripts:enable_large_model_loading -- --model_path_src="/home/$USER/models/" --model_path_dst="/home/$USER/models_modified/" --model_file_name="182560549_0.predictor"
```
https://www.internalfb.com/intern/anp/view/?id=320921 to check blobsOrder

Reviewed By: yinghai

Differential Revision: D22964332

fbshipit-source-id: 78b4861476a3c889a5ff62492939f717c307a8d2
2020-08-06 23:54:03 -07:00
9597af01ca Support iterating through an Enum class (#42661)
Summary:
[5/N] Implement Enum JIT support

Implement Enum class iteration
Add aten.ne for EnumType

Supported:
- Enum-typed function arguments
- using Enum type and comparing them
- getting name/value attrs of enums
- using Enum value as constant
- Enum-typed return values
- iterating through an Enum class (enum value list; see the sketch below)

TODO:
Support serialization and deserialization
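
A sketch of the iteration pattern enabled by this stack (enum name hypothetical; serialization is still TODO per above):
```python
import torch
from enum import Enum

class Color(Enum):
    RED = 1
    GREEN = 2

@torch.jit.script
def count_others(x: Color) -> int:
    n = 0
    for c in Color:   # iterating through the Enum class
        if c != x:    # aten.ne on EnumType
            n += 1
    return n

print(count_others(Color.RED))  # 1
```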

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42661

Reviewed By: SplitInfinity

Differential Revision: D22977364

Pulled By: gmagogsfm

fbshipit-source-id: 1a0216f91d296119e34cc292791f9aef1095b5a8
2020-08-06 22:56:34 -07:00
952526804c Print TE CUDA kernel (#42692)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42692

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D22986112

Pulled By: bertmaher

fbshipit-source-id: 52ec3389535c8b276858bef8c470a59aeba4946f
2020-08-06 20:42:04 -07:00
a6c8730045 [ONNX] Add preprocess pass for onnx export (#41832)
Summary:
In `_jit_pass_onnx`, symbolic functions are called for each node for conversion. However, some nodes cannot be converted without additional context. For example, the number of outputs from split (and whether it is static or dynamic) is unknown until the point where it is unpacked by the listUnpack node. This pass does a preprocessing step and prepares the nodes such that enough context can be received by the symbolic function.
* After preprocessing, `_jit_pass_onnx` should have enough context to produce valid ONNX nodes, instead of half-baked nodes that rely on fixes from later post-passes.
* `_jit_pass_onnx_peephole` should be a pass that does ONNX specific optimizations instead of ONNX specific fixes.
* Producing more valid ONNX nodes in `_jit_pass_onnx` enables better utilization of the ONNX shape inference https://github.com/pytorch/pytorch/issues/40628.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41832

Reviewed By: ZolotukhinM

Differential Revision: D22968334

Pulled By: bzinodev

fbshipit-source-id: 8226f03c5b29968e8197d242ca8e620c6e1d42a5
2020-08-06 20:34:12 -07:00
9152f2f73a Optimization of Backward Implementation for Learnable Fake Quantize Per Tensor Kernels (CPU and GPU) (#42384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42384

In this diff, the original backward pass implementation is sped up by merging the 3 separate iterations computing dX, dScale, and dZeroPoint. A native loop is used directly at the byte level (via `strides`).

In the benchmark test on the operators, for an input of shape `3x3x256x256`, we have observed the following improvement in performance:
- original python operator: 1021037 microseconds
- original learnable kernel: 407576 microseconds
- optimized learnable kernel: 102584 microseconds
- original non-backprop kernel: 139806 microseconds

**Speedup from python operator**: ~10x
**Speedup from original learnable kernel**: ~4x
**Speedup from non-backprop kernel**: ~1.2x

Test Plan:
To assert correctness of the new kernel, on a devvm, enter the command

`buck test //caffe2/test:quantization -- learnable_backward_per_tensor`

To benchmark the operators, on a devvm, enter the command
1. Set the kernel size to 3x3x256x256 or a reasonable input size.
2. Run `buck test //caffe2/benchmarks/operator_benchmark/pt:quantization_test`
3. The relevant outputs are as follows:

(CPU)
```
# Benchmarking PyTorch: FakeQuantizePerTensorOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerTensorOpBenchmark_N3_C3_H256_W256_nbits4_cpu_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 1021036.957

# Benchmarking PyTorch: FakeQuantizePerTensorOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerTensorOpBenchmark_N3_C3_H256_W256_nbits4_cpu_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 102583.693

# Benchmarking PyTorch: FakeQuantizePerTensorOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerTensorOpBenchmark_N3_C3_H256_W256_nbits4_cpu_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 139806.086
```

(GPU)
```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cuda, op_type: py_module
Backward Execution Time (us) : 6548.350

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cuda, op_type: learnable_kernel
Backward Execution Time (us) : 1340.724

# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cuda, op_type: original_kernel
Backward Execution Time (us) : 656.863
```

Reviewed By: vkuzo

Differential Revision: D22875998

fbshipit-source-id: cfcd62c327bb622270a783d2cbe97f00508c4a16
2020-08-06 19:54:17 -07:00
4959981cff [ONNX] Export tensor (#41872)
Summary:
Adds the `tensor` symbolic for opset 9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41872

Reviewed By: houseroad

Differential Revision: D22968426

Pulled By: bzinodev

fbshipit-source-id: 70e1afc7397e38039e2030e550fd72f09bac7c7c
2020-08-06 19:33:11 -07:00
40ac95dd3c [ONNX] Update ONNX export of torch.where to support ByteTensor as input. (#42264)
Summary:
`torch.where` supports `ByteTensor` and `BoolTensor` types for the first input argument (the `condition` predicate). Currently, the ONNX exporter assumes that the first argument is a `BoolTensor`. This PR updates the export of `torch.where` to correctly support the case where the first argument is a `ByteTensor`.
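
As a rough sketch of the case this enables (a hypothetical module; byte conditions for `torch.where` may be deprecated in later releases):

```
import torch

class WhereModel(torch.nn.Module):
    def forward(self, cond, x, y):
        return torch.where(cond, x, y)

x, y = torch.randn(3, 4), torch.randn(3, 4)
cond = (x > 0).byte()  # ByteTensor condition rather than BoolTensor
torch.onnx.export(WhereModel(), (cond, x, y), "where.onnx", opset_version=9)
```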

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42264

Reviewed By: houseroad

Differential Revision: D22968473

Pulled By: bzinodev

fbshipit-source-id: 7306388c8446ef3faeb86dc89d72d1f72c1c2314
2020-08-06 19:16:39 -07:00
f9a6c14364 Fix sequence numbers in profiler output (#42565)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42565

After recent changes to the record function we record more
ranges in profiler output and also keep emitting sequence numbers for
all ranges.

Sequence numbers are used by external tools to correlate forward
and autograd ranges and with many ranges having the same sequence number
it becomes impossible to do this.

This PR ensures that we set sequence numbers only for the top-level
ranges, and only when autograd is enabled.
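
A minimal sketch of producing the correlated ranges (assumes a CUDA device; run under nvprof/nsys with profiling started on demand, as in the test plan below):

```
import torch

x = torch.randn(64, 64, device="cuda", requires_grad=True)
with torch.cuda.profiler.profile():          # start/stop CUDA profiling
    with torch.autograd.profiler.emit_nvtx():
        y = (x * x).sum()
        y.backward()  # forward and backward ranges share a sequence number
```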

Test Plan:
nvprof -fo trace.nvvp --profile-from-start off python test_script.py
test_script
https://gist.github.com/ilia-cher/2baffdd98951ee2a5f2da56a04fe15d0
then examining ranges in nvvp

Reviewed By: ngimel

Differential Revision: D22938828

Pulled By: ilia-cher

fbshipit-source-id: 9a5a076706a6043dfa669375da916a1708d12c19
2020-08-06 19:12:05 -07:00
dab9bbfce7 Move jit_profiling tests into test1 on Windows (#42650)
Summary:
The test takes 5 min to finish and 5 min to spin up the environment, so it doesn't make much sense to keep it as a separate config.
Limit these tests to run only when the `USE_CUDA` environment variable is set to true.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42650

Reviewed By: ailzhang

Differential Revision: D22967817

Pulled By: malfet

fbshipit-source-id: c6c26df140059491e7ff53ee9cbbc93433d2f36f
2020-08-06 16:16:40 -07:00
33519e19ab Fix 64-bit indexing in GridSampler (#41923)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41656

For the CPU version, this is a regression introduced in https://github.com/pytorch/pytorch/issues/10980 which vectorized the `grid_sampler_2d` implementation. It uses the AVX2 gather intrinsic which for `float` requires 32-bit indexing to match the number of floats in the AVX register. There is also an `i64gather_ps` variant but this only utilizes half of the vector width so would be expected to give worse performance in the more likely case where 32-bit indexing is acceptable. So, I've left the optimised AVX version as-is and reinstated the old non-vectorized version as a fallback.

For the CUDA version, this operation has never supported 64-bit indexing, so this isn't a regression. I've templated the kernel on the index type and added 64-bit variants. I gather that in some places a simple `TORCH_CHECK(canUse32BitIndexMath(...))` is used instead, so there is a decision to be made here.
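
For context, a small sketch of the affected op (an identity grid here; a tensor with more than 2^31 elements is what would need the 64-bit path):

```
import torch
import torch.nn.functional as F

inp = torch.randn(1, 1, 8, 8)
# Identity sampling grid in normalized [-1, 1] coordinates.
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 8), torch.linspace(-1, 1, 8))
grid = torch.stack((xs, ys), dim=-1).unsqueeze(0)  # (N, H, W, 2), (x, y) order
out = F.grid_sample(inp, grid, align_corners=True)
print(torch.allclose(out, inp))  # True: the identity grid reproduces the input
```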

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41923

Reviewed By: glaringlee

Differential Revision: D22925931

Pulled By: zou3519

fbshipit-source-id: 920816107aae26360c5e7f4e9c729fa9057268bb
2020-08-06 16:08:09 -07:00
eaace3e10e Skip CUDA benchmarks on nogpu configs (#42704)
Summary:
Avoids timeouts when the benchmark is launched on nogpu configs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42704

Reviewed By: mruberry

Differential Revision: D22987725

Pulled By: malfet

fbshipit-source-id: aa9aece16557c0af8e05e612277ae1d9e0173a51
2020-08-06 15:47:48 -07:00
6cb0807f88 Fixes ROCm CI (#42701)
Summary:
Per title. ROCm CI doesn't have MKL, so this adds a couple of missing test annotations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42701

Reviewed By: ngimel

Differential Revision: D22986273

Pulled By: mruberry

fbshipit-source-id: efa717e2e3771562e9e82d1f914e251918e96f64
2020-08-06 15:24:50 -07:00
cc596ac3a8 [JIT] Add debug dumps in between passes in graph executor. (#42688)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42688

Both the profiling executor and the legacy executor have the debug
logging now.

Ideally, if we had a pass manager, this could be done as a part of it,
but since we have none, I had to insert the debug statements manually.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D22981675

Pulled By: ZolotukhinM

fbshipit-source-id: 22b8789e860aa90d5802fc72a4113b22c6fc4da5
2020-08-06 15:16:35 -07:00
cdd7db1ffc Bound shape inferencer: fix int8fc scale and bias
Summary:
Previously, when inferring Int8FC, we failed to carry over the scale and zero point properly.

Also fixed the int8 FC weight data type to be int8 instead of uint8, as that's what C2 actually uses.

Test Plan: Used net_runner to lower a single Int8Dequantize op. Previously, scale and bias would always be 1 and 0; now the proper values are set.

Reviewed By: yinghai

Differential Revision: D22912186

fbshipit-source-id: a6620c3493e492bdda91da73775bfc9117db12d1
2020-08-06 14:40:25 -07:00
b44a10c179 List[index]::toOptionalStringRef (#42263)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42263

Allows a way to get a reference to a stored string in a `List<optional<string>>` without having to copy it.
This improves the perf of the map_lookup op by 3x, for example.
ghstack-source-id: 109162026

Test Plan: unit tests

Reviewed By: ezyang

Differential Revision: D22830381

fbshipit-source-id: e6af2bc8cebd6e68794eb18daf183979bc6297ae
2020-08-06 13:44:33 -07:00
f22aa601ce All Gather and gather APIs for Python Objects (#42189)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42189

Rehash of https://github.com/pytorch/pytorch/pull/28811, which was several months old.

As part of addressing https://github.com/pytorch/pytorch/issues/23232, this PR adds support for the following APIs:

`allgather_object` and `gather_object` to support gather/allgather of generic, picklable Python objects. This has been a long-requested feature, so PyTorch should provide these helpers built-in.

The methodology is what is proposed in the original issue:
1) Pickle object to ByteTensor using torch.save
2) Comm. tensor sizes
3) Copy local ByteTensor into a tensor of maximal size
4) Call tensor-based collectives on the result of (3)
5) Unpickle back into object using torch.load

Note that the API is designed to match the tensor-based collectives, except that `async_op` is not supported. For now, these are blocking calls. If we see demand for `async_op`, we will have to make more progress on merging work/future to support it.

If this is a suitable approach, we can support `scatter`, `broadcast` in follow up PRs.
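
A usage sketch (a hypothetical two-process gloo setup; the names follow the collective API these helpers landed as):

```
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                            rank=rank, world_size=world_size)
    obj = {"rank": rank, "data": list(range(rank + 1))}  # any picklable object
    out = [None] * world_size
    dist.all_gather_object(out, obj)  # blocking; async_op is not supported
    print(rank, out)

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```
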
ghstack-source-id: 109322433

Reviewed By: mrshenli

Differential Revision: D22785387

fbshipit-source-id: a265a44ec0aa3aaffc3c6966023400495904c7d8
2020-08-06 13:30:25 -07:00
1f689b6ef9 suppress all Autograd keys in AutoNonVariableTypeMode (#42610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42610

Fix for https://github.com/pytorch/pytorch/issues/42609: `AutoNonVariableTypeMode` should suppress all autograd dispatch keys, not just `Autograd` (e.g. `XLAPreAutograd`, `PrivateUse<N>_PreAutograd`)

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22963408

Pulled By: bhosmer

fbshipit-source-id: 2f3516580ce0c9136aff5e025285d679394f2f18
2020-08-06 13:15:42 -07:00
85a00c4c92 Skips spectral tests to prevent ROCm build from timing out (#42667)
Summary:
Per title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42667

Reviewed By: ailzhang

Differential Revision: D22978531

Pulled By: mruberry

fbshipit-source-id: 0c3ba116836ed6c433e2c6a0e1a0f2e3c94c7803
2020-08-06 12:41:32 -07:00
40b6dacb50 Delete dead is_named_tensor_only (#42672)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42672

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D22978389

Pulled By: ezyang

fbshipit-source-id: ef1302c57fe26a58a46ca1f4a4a7c3e2cdbfdc5d
2020-08-06 12:19:44 -07:00
5ca08b8891 Add benchmark for calculate_qparams (#42138)
Summary:
Adds a benchmark for `HistogramObserver.calculate_qparams` to the quantized op benchmarks. The next diff in this stack adds a ~15x speedup for this benchmark.
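
The call being benchmarked looks roughly like this (illustrative shapes, not the benchmark's exact config):

```
import torch
from torch.quantization import HistogramObserver

obs = HistogramObserver(dtype=torch.quint8, qscheme=torch.per_tensor_affine)
obs(torch.randn(3, 512, 512))                # observe data to build the histogram
scale, zero_point = obs.calculate_qparams()  # the call being timed
print(scale, zero_point)
```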

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42138

Test Plan:
While in the folder `benchmarks/operator_benchmark`, the benchmark can be run using `python -m benchmark_all_quantized_test --operators HistogramObserverCalculateQparams`.

Benchmark results before speedup:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_affine
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_affine
Forward Execution Time (us) : 185818.566

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_symmetric
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_symmetric
Forward Execution Time (us) : 165325.916
```

Benchmark results after speedup:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_affine
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_affine
Forward Execution Time (us) : 12242.241

# Benchmarking PyTorch: HistogramObserverCalculateQparams
# Mode: Eager
# Name: HistogramObserverCalculateQparams_C3_M512_N512_dtypetorch.quint8_cpu_qschemetorch.per_tensor_symmetric
# Input: C: 3, M: 512, N: 512, dtype: torch.quint8, device: cpu, qscheme: torch.per_tensor_symmetric
Forward Execution Time (us) : 12655.354
```

Reviewed By: supriyar

Differential Revision: D22779291

Pulled By: durumu

fbshipit-source-id: 1fe17d20eda5dd99e0e2590480142034c3574d4e
2020-08-06 11:10:12 -07:00
79de9c028a Remove VS2017 workaround for autocasting (#42352)
Summary:
Because VS2017 is no longer supported after https://github.com/pytorch/pytorch/pull/42144
cc: mcarilli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42352

Reviewed By: malfet

Differential Revision: D22962809

Pulled By: ngimel

fbshipit-source-id: 0346cde87bf5d617dfc0d7b34c92ac6ec5bbf568
2020-08-06 11:03:34 -07:00
e28a98a904 Turn on non ASCII string literals serialization (#40719)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40719

This is a follow-up patch that turns on this feature in order to handle breaking
forward compatibility.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D22457952

Pulled By: bzinodev

fbshipit-source-id: fac0dfed8b8b5fa2d52d342ee8cf06742959b3c5
2020-08-06 10:47:09 -07:00
57854e7f08 [JIT] Clone runOptimizations and similar functions for profiling executor. (#42656)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42656

This change will allow us to more freely experiment with pass pipelines
in the profiling executor without affecting passes in the legacy
executor. Also, it somewhat helps to keep all passes in one place to be
able to tell what's going on.

Currently this change should not affect any behavior as I copied the
passes exactly as they've been invoked before, but we will probably want
to change these pipelines in the near future.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D22971050

Pulled By: ZolotukhinM

fbshipit-source-id: f5bb60783a553c7b51c5343eec7f8fe40037ff99
2020-08-06 10:43:28 -07:00
a4dbc64800 Add documentation for PYTORCH_JIT_TYPE_VERBOSITY (#42241)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42241

that's it

Test Plan: docs only

Reviewed By: SplitInfinity

Differential Revision: D22818705

fbshipit-source-id: 22cdf4f23c3ed0a15c23f116457fc842d7f7b520
2020-08-06 10:39:39 -07:00
65066d779b Add fastrnns benchmark to CI and upload data to scribe (#42030)
Summary:
Runs the fastrnns benchmark using the pytest-benchmark infra, then parses its JSON output and uploads it to Scribe.
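
The parsing step looks roughly like this (field layout per the pytest-benchmark JSON schema, as written by `--benchmark-json`; the file name and the Scribe upload itself are elided assumptions):

```
import json

with open("benchmark.json") as f:
    report = json.load(f)
for bench in report["benchmarks"]:
    # mean runtime per iteration, in seconds
    print(bench["name"], bench["stats"]["mean"])
```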

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42030

Reviewed By: malfet

Differential Revision: D22970270

Pulled By: wconstab

fbshipit-source-id: 87da9b7ddf741da14b80d20779771d19123be3c5
2020-08-06 10:30:27 -07:00
a5af2434fe NVMified NE Eval
Summary:
This diff NVMifies the NE Eval Flow.
- It defines a `LoadNVM` operator which either
  - receives a list of NVM blobs, or
  - extracts the blobs that could be NVMified from the model.
- It dumps NVMified blobs into NVM
- and deallocates them from DRAM.
- It NVMifies the Eval net on the dper and C2 backends.

The specific NVM op for SLS is pushed in separate diffs.

Test Plan: flow-cli test-locally dper.workflows.evaluation.eval_workflow --parameters-file=/mnt/public/ehsaardestani/temp/small_model.json 2>&1 | tee log

Reviewed By: yinghai, amylittleyang

Differential Revision: D22469973

fbshipit-source-id: ed8379ad404e96d04ac05e580176d3aca984575b
2020-08-06 10:25:31 -07:00
049c1b97be pin numpy version to 1.18.5 (#42670)
Summary:
Using numpy 1.19.x instead of 1.18.x breaks certain unit tests.
Fixes https://github.com/pytorch/pytorch/issues/42561.  Likely also fixes https://github.com/pytorch/pytorch/issues/42583.

CC ezyang xw285cornell sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42670

Reviewed By: ezyang

Differential Revision: D22978369

Pulled By: malfet

fbshipit-source-id: ce1f35c7ba620c2b9dd10613f39354cebee8b87d
2020-08-06 10:01:56 -07:00
bcab2d6848 Add type annotations for cpp_extension, utils.data, signal_handling (#42647)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42647

Reviewed By: ezyang

Differential Revision: D22967041

Pulled By: malfet

fbshipit-source-id: 35e124da0be56934faef56834a93b2b400decf66
2020-08-06 09:42:07 -07:00
608f99e4ea Fix cudnn version on build_environment of Windows CI (#42615)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42615

Reviewed By: mrshenli

Differential Revision: D22958660

Pulled By: malfet

fbshipit-source-id: 97a6a0e769143bd161667d0ee081ea0751995775
2020-08-06 09:36:24 -07:00
576aab5084 Bump up NCCL to 2.7.6 (#42645)
Summary:
Because 2.7.3 has a bug on GA100 that is fixed in 2.7.6.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42645

Reviewed By: malfet

Differential Revision: D22977280

Pulled By: mrshenli

fbshipit-source-id: 74779eff90d7d660a988ff33659f3a2237ca7e29
2020-08-06 08:45:59 -07:00
0642d17efc Enable C++ RPC tests (#42636)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42636

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D22967777

Pulled By: mrshenli

fbshipit-source-id: 8816c190a4ead7d7f906c140c8a4e76b992f5502
2020-08-06 07:15:02 -07:00
c30bc6d4d7 Update TensorPipe submodule (#42522)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42522

Main changes:
- Consolidated CMake files to have a single entry point, rather than having a specialized one for PyTorch.
- Changed the way the preprocessor flags are provided, and changed their name.

There were a few instances in PyTorch's CMake files where we were directly adding TensorPipe's source directory as an include path, which doesn't contain the auto-generated header we now added. We fix that by linking those targets against the `tensorpipe` CMake target, so that the include paths defined by TensorPipe, which contain that auto-generated header, are picked up.

I'm turning off SHM and CMA for now because they have never been covered by the CI. I'll enable them in a separate PR so that if they turn out to be flaky we can revert that change without reverting this one.

Test Plan: CI

Reviewed By: malfet

Differential Revision: D22959472

fbshipit-source-id: 1959a41c4a66ef78bf0f3bd5e3964969a2a1bf67
2020-08-06 02:14:58 -07:00
bd458b7d02 Don't reference TensorPipe headers in our headers (#42521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42521

PyTorch's usage of TensorPipe is entirely wrapped within the RPC agent, which means we only need access to TensorPipe within the implementation (the .cpp file) and not in the interface (the .h file). We were however including the TensorPipe headers from the public PyTorch headers, which meant that PyTorch's downstream users had to have the TensorPipe include directories for that to work. By forward-declaring the symbols we need in the PyTorch header, and then including the TensorPipe header in the PyTorch implementation, we avoid "leaking" the dependency on TensorPipe, thus effectively keeping it private.

Test Plan: Imported from OSS

Reviewed By: beauby

Differential Revision: D22944238

Pulled By: lw

fbshipit-source-id: 2b12d59bd5beeaa439e50f9088a792c9d9bae9e8
2020-08-06 02:14:00 -07:00
a53fdaa23f Remove ProfiledType (#42570)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42570

ProfiledType doesn't do anything and is not used at the moment; removing it.

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D22938664

Pulled By: ilia-cher

fbshipit-source-id: 037c512938028f44258b702bbcde3f8c144f4aa0
2020-08-06 01:52:08 -07:00
ccfce9d4a9 Adds fft namespace (#41911)
Summary:
This PR creates a new namespace, torch.fft (torch::fft), and puts a single function, fft, in it. This function is a simplified version of NumPy's [numpy.fft.fft](https://numpy.org/doc/1.18/reference/generated/numpy.fft.fft.html?highlight=fft#numpy.fft.fft) that accepts no optional arguments. It is intended to demonstrate how to add and document functions in the namespace, and is not intended to deprecate the existing torch.fft function.

Adding this namespace was complicated by the existence of the torch.fft function in Python. Creating a torch.fft Python module makes this name ambiguous: does it refer to a function or module? If the JIT didn't exist, a solution to this problem would have been to make torch.fft refer to a callable class that mimicked both the function and module. The JIT, however, cannot understand this pattern. As a workaround it's required to explicitly `import torch.fft` to access the torch.fft.fft function in Python:

```
import torch.fft

t = torch.randn(128, dtype=torch.cdouble)
torch.fft.fft(t)
```

See https://github.com/pytorch/pytorch/issues/42175 for future work. Another possible future PR is to get the JIT to understand torch.fft as a callable class so it need not be imported explicitly to be used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41911

Reviewed By: glaringlee

Differential Revision: D22941894

Pulled By: mruberry

fbshipit-source-id: c8e0b44cbe90d21e998ca3832cf3a533f28dbe8d
2020-08-06 00:20:50 -07:00
644d787cd8 find rccl properly (#42072)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42072

Reviewed By: malfet

Differential Revision: D22969778

Pulled By: ezyang

fbshipit-source-id: 509178775d4d99460bcb147bcfced29f04cabdc4
2020-08-05 21:46:38 -07:00
23607441c2 Create CuBLAS PointerModeGuard (#42639)
Summary:
Adds an RAII guard for `cublasSetPointerMode()`.
Updates `dot_cuda` to use the guard, rather than exception catching.

Addresses this comment: https://github.com/pytorch/pytorch/pull/41377#discussion_r465754082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42639

Reviewed By: malfet

Differential Revision: D22969985

Pulled By: ezyang

fbshipit-source-id: b05c35d1884bb890f8767d6a4ef8b4724a329471
2020-08-05 21:40:42 -07:00
eb9ae7c038 Implement gpu_kernel_multiple_outputs (#37969)
Summary:
This PR introduces a variant of `gpu_kernel` for functions that return multiple values with `thrust::tuple`.
With this I simplified `prelu_cuda_backward_share_weights_kernel`.

### Why using `thrust::tuple`?
Because `std::tuple` does not support `operator=` in device code, which makes the implementation complicated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/37969

Reviewed By: paulshaoyuqiao

Differential Revision: D22868670

Pulled By: ngimel

fbshipit-source-id: eda0a29ac0347ad544b24bf60e3d809a7db1a929
2020-08-05 21:17:08 -07:00
1848b43c4d [NNC] Add loop unroll transformation (#42465)
Summary:
Unroll a loop with constant boundaries, replacing it with multiple
instances of the loop body. For example:

```
for x in 0..3:
  A[x] = x*2
```

becomes:

```
A[0] = 0
A[1] = 2
A[2] = 4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42465

Test Plan: `test_tensorexpr` unit tests.

Reviewed By: agolynski

Differential Revision: D22914418

Pulled By: asuhan

fbshipit-source-id: 72ca10d7c0b1ac7f9a3688ac872bd94a1c53dc51
2020-08-05 20:46:32 -07:00
3d46e02ea1 Add __torch_function__ for methods (#37091)
Summary:
According to pytorch/rfcs#3

From the goals in the RFC:

1. Support subclassing `torch.Tensor` in Python (done here)
2. Preserve `torch.Tensor` subclasses when calling `torch` functions on them (done here)
3. Use the PyTorch API with `torch.Tensor`-like objects that are _not_ `torch.Tensor`
   subclasses (done in https://github.com/pytorch/pytorch/issues/30730)
4. Preserve `torch.Tensor` subclasses when calling `torch.Tensor` methods. (done here)
5. Propagating subclass instances correctly also with operators, using
   views/slices/indexing/etc. (done here)
6. Preserve subclass attributes when using methods or views/slices/indexing. (done here)
7. A way to insert code that operates on both functions and methods uniformly
   (so we can write a single function that overrides all operators). (done here)
8. The ability to give external libraries a way to also define
   functions/methods that follow the `__torch_function__` protocol. (will be addressed in a separate PR)

This PR makes the following changes:

1. Adds the `self` argument to the arg parser.
2. Dispatches on `self` as well if `self` is not `nullptr`.
3. Adds a `torch._C.DisableTorchFunction` context manager to disable `__torch_function__`.
4. Adds a `torch::torch_function_enabled()` and `torch._C._torch_function_enabled()` to check the state of `__torch_function__`.
5. Dispatches all `torch._C.TensorBase` and `torch.Tensor` methods via `__torch_function__`.

TODO:

- [x] Sequence Methods
- [x] Docs
- [x] Tests

Closes https://github.com/pytorch/pytorch/issues/28361

Benchmarks in https://github.com/pytorch/pytorch/pull/37091#issuecomment-633657778
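
A minimal sketch of goals 1, 2, and 4 (a hypothetical subclass; `DisableTorchFunction` is the context manager added in this PR):

```
import torch

class LoggingTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print(f"called {func.__name__}")
        with torch._C.DisableTorchFunction():  # run the real op un-overridden
            out = func(*args, **kwargs)
        return out.as_subclass(cls) if isinstance(out, torch.Tensor) else out

t = torch.randn(3).as_subclass(LoggingTensor)
s = t.sum()        # Tensor *methods* now dispatch via __torch_function__ too
print(type(s))     # <class '__main__.LoggingTensor'>
```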

Pull Request resolved: https://github.com/pytorch/pytorch/pull/37091

Reviewed By: ngimel

Differential Revision: D22765678

Pulled By: ezyang

fbshipit-source-id: 53f8aa17ddb8b1108c0997f6a7aa13cb5be73de0
2020-08-05 20:44:13 -07:00
92b7347fd7 Enforce counter value to double type in rowwise_counter
Summary:
Enforce counter value to double type in rowwise_counter.

**Context:**
The existing implementation uses float type for the counter value. But due to the precision limit of single-precision floating point [1], we observed that the counter value can't increment beyond 16777216.0 (i.e., 2^24) in our earlier experiments. We decided to enforce double type to avoid this issue.

[1] https://stackoverflow.com/questions/12596695/why-does-a-float-variable-stop-incrementing-at-16777216-in-c
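
The limit is easy to reproduce (a quick numpy illustration):

```
import numpy as np

c = np.float32(16777216.0)            # 2**24
print(c + np.float32(1.0))            # 16777216.0: float32 can't represent 2**24 + 1
print(np.float64(16777216.0) + 1.0)   # 16777217.0 with double precision
```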

Test Plan:
op test
```
ruixliu@devvm1997:~/fbsource/fbcode/caffe2/caffe2/python/operator_test(f0b0b48c)$ buck test :rowwise_counter_test
Trace available for this run at /tmp/testpilot.20200728-083200.729292.log
TestPilot test runner for Facebook. See https://fburl.com/testpilot for details.
Testpilot build revision cd2638f1f47250eac058b8c36561760027d16add fbpkg f88726c8ebde4ba288e1172a348c7f46 at Mon Jul 27 18:11:43 2020 by twsvcscm from /usr/local/fbprojects/packages/testinfra.testpilot/887/t.par
Discovering tests
Running 1 test
Started new test run: https://our.intern.facebook.com/intern/testinfra/testrun/7881299364977047
      ✓ caffe2/caffe2/python/operator_test:rowwise_counter_test - test_rowwise_counter (caffe2.caffe2.python.operator_test.rowwise_counter_test.TestRowWiseCounter) 0.265 1/1 (passed)
      ✓ caffe2/caffe2/python/operator_test:rowwise_counter_test - main 14.414 (passed)
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/7881299364977047
Summary (total time 18.51s):
  PASS: 2
  FAIL: 0
  SKIP: 0
  FATAL: 0
  TIMEOUT: 0
  OMIT: 0
```

optimizer test
```
ruixliu@devvm1997:~/fbsource/fbcode/caffe2/caffe2/python(7d66fbb9)$ buck test :optimizer_test
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/7036874434841896
Summary (total time 64.87s):
  PASS: 48
  FAIL: 0
  SKIP: 24
    caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestMomentumSgd)
    caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestGFtrl)
    caffe2/caffe2/python:optimizer_test - test_caffe2_cpu_vs_numpy (caffe2.caffe2.python.optimizer_test.TestYellowFin)
    caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestSparseRAdam)
    caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestRowWiseAdagradWithCounter)
    caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestAdagrad)
    caffe2/caffe2/python:optimizer_test - test_caffe2_gpu_vs_numpy (caffe2.caffe2.python.optimizer_test.TestYellowFin)
    caffe2/caffe2/python:optimizer_test - testDense (caffe2.caffe2.python.optimizer_test.TestRowWiseAdagrad)
    caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestFtrl)
    caffe2/caffe2/python:optimizer_test - testSparse (caffe2.caffe2.python.optimizer_test.TestRmsProp)
    ...and 14 more not shown...
  FATAL: 0
  TIMEOUT: 0
  OMIT: 0
```

param download test
```
ruixliu@devvm1997:~/fbsource/fbcode/caffe2/caffe2/fb/net_transforms/tests(7ef20a38)$ sudo buck test :param_download_test
Finished test run: Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/6473924481526935
```

e2e flow:
f208394929
f207991149
f207967273

ANP notebook to check the counter value loaded from the flows
https://fburl.com/anp/5fdcbnoi

screenshot of the loaded counter (note that counter max is larger than 16777216.0)

{F250926501}

Reviewed By: ellie-wen

Differential Revision: D22711514

fbshipit-source-id: 426fed7415270aa3f276dda8141907534734337f
2020-08-05 20:40:51 -07:00
c14fbc36ed Update docs about CUDA stream priority (#41364)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41364

Reviewed By: malfet

Differential Revision: D22962856

Pulled By: ngimel

fbshipit-source-id: 47f65069516cb555579455e8680deb937fc1f544
2020-08-05 20:03:18 -07:00
ddb8849ffc Fix method stub used for fixing mypy issue to work with pylint (#42356)
Summary:
Make `_forward_unimplemented` a free function instead of a method.

Since `_forward_unimplemented` was defined within the nn.Module class,
pylint (correctly) complained about subclasses not implementing this method.

Fixes https://github.com/pytorch/pytorch/issues/42305

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42356

Reviewed By: mruberry

Differential Revision: D22867255

Pulled By: ezyang

fbshipit-source-id: ccf3e45e359d927e010791fadf70b2ef231ddb0b
2020-08-05 19:57:38 -07:00
04d7e1679d [quant] Quantized Average Pool Refactoring (#42009)
Summary:
**cc** z-a-f. Refactor `qavg_pool(2,3)d_nhwc_kernel` as mentioned in https://github.com/pytorch/pytorch/issues/40316.

# Benchmarks
## Python
Before | After
![before_after](https://user-images.githubusercontent.com/37529096/88401550-fea7ba80-ce1d-11ea-81c5-3ae912e81e8f.png)
## C++
![before_after_cpp](https://user-images.githubusercontent.com/37529096/88401845-5ba37080-ce1e-11ea-9bf2-3c95ac2b4b49.png)
## Notes
- It does seem that for `qint8` and `quint8` there is a noticeable 2x increase in speed, at least when `channels > 64` in the benchmarks.
## Reproduce
### Python
```
import time
import numpy as np
import torch
from termcolor import colored
def time_avg_pool2d(X, kernel, stride, padding, ceil_mode, count_include_pad, divisor_override, iterations):
    X, (scale, zero_point, torch_type) = X
    qX_nchw = torch.quantize_per_tensor(torch.from_numpy(X), scale=scale,
                                    zero_point=zero_point, dtype=torch_type)
    qX_nhwc = qX_nchw.contiguous(memory_format=torch.channels_last)
    assert(qX_nhwc.stride() != sorted(qX_nhwc.stride()))
    assert(qX_nchw.is_contiguous(memory_format=torch.contiguous_format))
    assert(qX_nhwc.is_contiguous(memory_format=torch.channels_last))
    start = time.time()
    for _ in range(iterations):
        X_hat = torch.nn.quantized.functional.avg_pool2d(qX_nchw, kernel_size=kernel, stride=stride, padding=padding, ceil_mode=ceil_mode,
                count_include_pad=count_include_pad, divisor_override=divisor_override)
    qnchw_end = time.time() - start
    start = time.time()
    for _ in range(iterations):
        X_hat = torch.nn.quantized.functional.avg_pool2d(qX_nhwc, kernel_size=kernel, stride=stride, padding=padding, ceil_mode=ceil_mode,
                count_include_pad=count_include_pad, divisor_override=divisor_override)
    qnhwc_end = time.time() - start
    return qnchw_end*1000/iterations, qnhwc_end*1000/iterations

def time_avg_pool3d(X, kernel, stride, padding, ceil_mode, count_include_pad, divisor_override,  iterations):
    X, (scale, zero_point, torch_type) = X
    qX_ncdhw = torch.quantize_per_tensor(torch.from_numpy(X), scale=scale,
                                    zero_point=zero_point, dtype=torch_type)
    qX_ndhwc = qX_ncdhw.contiguous(memory_format=torch.channels_last_3d)
    assert(qX_ndhwc.stride() != sorted(qX_ndhwc.stride()))
    assert(qX_ncdhw.is_contiguous(memory_format=torch.contiguous_format))
    assert(qX_ndhwc.is_contiguous(memory_format=torch.channels_last_3d))
    start = time.time()
    for _ in range(iterations):
        X_hat = torch.nn.quantized.functional.avg_pool3d(qX_ncdhw, kernel_size=kernel, stride=stride, padding=padding, ceil_mode=ceil_mode,
                count_include_pad=count_include_pad, divisor_override=divisor_override)
    qncdhw_end = time.time() - start
    start = time.time()
    for _ in range(iterations):
        X_hat = torch.nn.quantized.functional.avg_pool3d(qX_ndhwc, kernel_size=kernel, stride=stride, padding=padding, ceil_mode=ceil_mode,
                count_include_pad=count_include_pad, divisor_override=divisor_override)
    qndhwc_end = time.time() - start
    return qncdhw_end*1000/iterations, qndhwc_end*1000/iterations

iterations = 10000
print("iterations = {}".format(iterations))
print("Benchmark", "Time(ms)", sep="\t\t\t\t\t")
for torch_type in (torch.qint8, torch.quint8, torch.qint32):
    for channel in (4,8,64,256):
        X = np.random.rand(1, channel, 56, 56).astype(np.float32), (0.5, 1, torch_type)
        ts = time_avg_pool2d(X, 4, None, 0, True, True, None, iterations)
        print(colored("avg_pool2d({}, {}, {})".format(str(torch_type), channel, "nchw"), 'green'), colored(ts[0], 'yellow'), sep="\t")
        print(colored("avg_pool2d({}, {}, {})".format(str(torch_type), channel, "nhwc"), 'green'), colored(ts[1], 'yellow'), sep="\t")
for torch_type in (torch.qint8, torch.quint8, torch.qint32):
    for channel in (4,8,64,256):
        X = np.random.rand(1, channel, 56, 56, 4).astype(np.float32), (0.5, 1, torch_type)
        ts = time_avg_pool3d(X, 4, None, 0, True, True, None, iterations)
        print(colored("avg_pool3d({}, {}, {})".format(str(torch_type), channel, "ncdhw"), 'green'), colored(ts[0], 'yellow'), sep="\t")
        print(colored("avg_pool3d({}, {}, {})".format(str(torch_type), channel, "ndhwc"), 'green'), colored(ts[1], 'yellow'), sep="\t")
```
### C++
1. `git clone https://github.com/google/benchmark.git`
2. `git clone https://github.com/google/googletest.git benchmark/googletest`

```
# CMakeLists.txt
cmake_minimum_required(VERSION 3.10 FATAL_ERROR)
project(time_avg_pool VERSION 0.1.0)

find_package(Torch REQUIRED)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")
add_subdirectory(benchmark)

add_executable(time_average_pool time_average_pool.cpp)
target_link_libraries(time_average_pool ${TORCH_LIBRARIES})
set_property(TARGET time_average_pool PROPERTY CXX_STANDARD 14)
target_link_libraries(time_average_pool benchmark::benchmark)
```

```
// time_average_pool.cpp
#include <benchmark/benchmark.h>
#include <torch/torch.h>

torch::Device device(torch::kCPU);

static void BM_TORCH_QAVG_POOL2D_NCHW_SINGLE_THREADED(benchmark::State& state) {
  torch::init_num_threads();
  torch::set_num_threads(1);
  auto x_nchw = torch::rand({1, state.range(0), 56, 56}, device);
  auto qx_nchw = torch::quantize_per_tensor(x_nchw, 0.5, 1, torch::kQUInt8);
  torch::Tensor X_hat;
  for (auto _ : state)
    X_hat = torch::nn::functional::avg_pool2d(
        qx_nchw,
        torch::nn::AvgPool2dOptions({4, 4}).ceil_mode(true).count_include_pad(
            true));
}

static void BM_TORCH_QAVG_POOL2D_NHWC_SINGLE_THREADED(benchmark::State& state) {
  torch::init_num_threads();
  torch::set_num_threads(1);
  auto x_nchw = torch::rand({1, state.range(0), 56, 56}, device);
  auto qx_nchw = torch::quantize_per_tensor(x_nchw, 0.5, 1, torch::kQUInt8);
  auto qx_nhwc = qx_nchw.contiguous(torch::MemoryFormat::ChannelsLast);
  torch::Tensor X_hat;
  for (auto _ : state)
    X_hat = torch::nn::functional::avg_pool2d(
        qx_nhwc,
        torch::nn::AvgPool2dOptions({4, 4}).ceil_mode(true).count_include_pad(
            true));
}

static void BM_TORCH_QAVG_POOL2D_NCHW(benchmark::State& state) {
  auto x_nchw = torch::rand({1, state.range(0), 56, 56}, device);
  auto qx_nchw = torch::quantize_per_tensor(x_nchw, 0.5, 1, torch::kQUInt8);
  torch::Tensor X_hat;
  for (auto _ : state)
    X_hat = torch::nn::functional::avg_pool2d(
        qx_nchw,
        torch::nn::AvgPool2dOptions({4, 4}).ceil_mode(true).count_include_pad(
            true));
}

static void BM_TORCH_QAVG_POOL2D_NHWC(benchmark::State& state) {
  auto x_nchw = torch::rand({1, state.range(0), 56, 56}, device);
  auto qx_nchw = torch::quantize_per_tensor(x_nchw, 0.5, 1, torch::kQUInt8);
  auto qx_nhwc = qx_nchw.contiguous(torch::MemoryFormat::ChannelsLast);
  torch::Tensor X_hat;
  for (auto _ : state)
    X_hat = torch::nn::functional::avg_pool2d(
        qx_nhwc,
        torch::nn::AvgPool2dOptions({4, 4}).ceil_mode(true).count_include_pad(
            true));
}

static void BM_TORCH_QAVG_POOL3D_NCDHW_SINGLE_THREADED(
    benchmark::State& state) {
  torch::init_num_threads();
  torch::set_num_threads(1);
  auto x_ncdhw = torch::rand({1, state.range(0), 56, 56, 4}, device);
  auto qx_ncdhw = torch::quantize_per_tensor(x_ncdhw, 0.5, 1, torch::kQUInt8);
  torch::Tensor X_hat;
  for (auto _ : state)
    X_hat = torch::nn::functional::avg_pool3d(
        qx_ncdhw,
        torch::nn::AvgPool3dOptions({5, 5, 5})
            .ceil_mode(true)
            .count_include_pad(true));
}

static void BM_TORCH_QAVG_POOL3D_NDHWC_SINGLE_THREADED(
    benchmark::State& state) {
  torch::init_num_threads();
  torch::set_num_threads(1);
  auto x_ncdhw = torch::rand({1, state.range(0), 56, 56, 4}, device);
  auto qx_ncdhw = torch::quantize_per_tensor(x_ncdhw, 0.5, 1, torch::kQUInt8);
  auto qx_ndhwc = qx_ncdhw.contiguous(torch::MemoryFormat::ChannelsLast3d);
  torch::Tensor X_hat;
  for (auto _ : state)
    X_hat = torch::nn::functional::avg_pool3d(
        qx_ndhwc,
        torch::nn::AvgPool3dOptions({5, 5, 5})
            .ceil_mode(true)
            .count_include_pad(true));
}

static void BM_TORCH_QAVG_POOL3D_NCDHW(benchmark::State& state) {
  auto x_ncdhw = torch::rand({1, state.range(0), 56, 56, 4}, device);
  auto qx_ncdhw = torch::quantize_per_tensor(x_ncdhw, 0.5, 1, torch::kQUInt8);
  torch::Tensor X_hat;
  for (auto _ : state)
    X_hat = torch::nn::functional::avg_pool3d(
        qx_ncdhw,
        torch::nn::AvgPool3dOptions({5, 5, 5})
            .ceil_mode(true)
            .count_include_pad(true));
}

static void BM_TORCH_QAVG_POOL3D_NDHWC(benchmark::State& state) {
  auto x_ncdhw = torch::rand({1, state.range(0), 56, 56, 4}, device);
  auto qx_ncdhw = torch::quantize_per_tensor(x_ncdhw, 0.5, 1, torch::kQUInt8);
  auto qx_ndhwc = qx_ncdhw.contiguous(torch::MemoryFormat::ChannelsLast3d);
  torch::Tensor X_hat;
  for (auto _ : state)
    X_hat = torch::nn::functional::avg_pool3d(
        qx_ndhwc,
        torch::nn::AvgPool3dOptions({5, 5, 5})
            .ceil_mode(true)
            .count_include_pad(true));
}

BENCHMARK(BM_TORCH_QAVG_POOL2D_NCHW)->RangeMultiplier(8)->Range(4, 256);
BENCHMARK(BM_TORCH_QAVG_POOL2D_NHWC)->RangeMultiplier(8)->Range(4, 256);
BENCHMARK(BM_TORCH_QAVG_POOL3D_NCDHW)->RangeMultiplier(8)->Range(4, 256);
BENCHMARK(BM_TORCH_QAVG_POOL3D_NDHWC)->RangeMultiplier(8)->Range(4, 256);
BENCHMARK(BM_TORCH_QAVG_POOL2D_NCHW_SINGLE_THREADED)
    ->RangeMultiplier(8)
    ->Range(4, 256);
BENCHMARK(BM_TORCH_QAVG_POOL2D_NHWC_SINGLE_THREADED)
    ->RangeMultiplier(8)
    ->Range(4, 256);
BENCHMARK(BM_TORCH_QAVG_POOL3D_NCDHW_SINGLE_THREADED)
    ->RangeMultiplier(8)
    ->Range(4, 256);
BENCHMARK(BM_TORCH_QAVG_POOL3D_NDHWC_SINGLE_THREADED)
    ->RangeMultiplier(8)
    ->Range(4, 256);
BENCHMARK_MAIN();
```

3. `mkdir build && cd build`
4. ```cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_PREFIX_PATH=`python -c 'import torch;print(torch.utils.cmake_prefix_path)'` .. ```
5. `cmake --build . --config Release`
6. `./time_average_pool`

# Further notes
- I've used `istrideB, istrideD, istrideH, strideW, strideC` to match `_qadaptive_avg_pool_kernel` since there's some code duplication there as mentioned in https://github.com/pytorch/pytorch/issues/40316.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42009

Reviewed By: pbelevich

Differential Revision: D22794441

Pulled By: z-a-f

fbshipit-source-id: 16710202811a1fbe1c99ea4d9b45876d6d28a8da
2020-08-05 19:44:42 -07:00
9add11ffc1 Fix IS_SPMM_AVAILABLE macro definition (#42643)
Summary:
This should fix CUDA-11 on Windows build issue

`defined` is not a function, so it cannot be used in macro substitution.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42643

Reviewed By: pbelevich, xw285cornell

Differential Revision: D22963420

Pulled By: malfet

fbshipit-source-id: cccf7db0d03cd62b655beeb154db9e628aa749f0
2020-08-05 18:56:23 -07:00
509fb77b70 Adjust bound_shape_inferencer to take 4 inputs for FCs (#41934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41934

The model exported from the online training workflow with int8 quantization contains FCs with 4 inputs; the extra input is the quant_param blob. This diff adjusts the bound_shape_inferencer and the int8 op schema to get shape info for the quant_param input.

Test Plan:
```
buck test caffe2/caffe2/opt:bound_shape_inference_test
```

Reviewed By: yinghai

Differential Revision: D22683554

fbshipit-source-id: 684d1433212a528120aba1c37d27e26b6a31b403
2020-08-05 18:44:48 -07:00
9ea9d1b52e [fbs][2/n] Remove .python3 markers
Test Plan:
`xbgr '\.python3'` shows only one (dead) usage of this file:
https://www.internalfb.com/intern/diffusion/FBS/browse/master/fbcode/python/repo_stats/buck.py?commit=9a8dd3243207819325d520c208218f6ab69e4e49&lines=854

Reviewed By: lisroach

Differential Revision: D22955631

fbshipit-source-id: e686d9157c08c347d0ce4acdd05bd7ab29ff7df5
2020-08-05 18:25:50 -07:00
5d7c3f92b9 Issue warning instead of error when parsing Enum while enum support is not enabled (#42623)
Summary:
Returning None rather than raising an error matches the previous behavior better.

Fixes https://fburl.com/yrrvtes3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42623

Reviewed By: ajaech

Differential Revision: D22957498

Pulled By: gmagogsfm

fbshipit-source-id: 61dabc6d23ad44e75bd35d837768bdb6fe71eece
2020-08-05 17:55:29 -07:00
50f0d2b97d quant: add q_batchnorm_1d op (#42491)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42491

Hooks up quantized batchnorm_1d to the quantized_bn kernel. Eager mode
hookup will be in a future PR, and graph mode should work after this PR.

Note: currently the implementation is ~2x slower on the benchmark than q_batch_norm2d
because we convert back to contiguous memory format at the end, since
channels_last is only defined for rank >= 4. If further optimization is
needed, that can be a separate PR (will need the NHWC folks to see if
there is a workaround).  Meanwhile, having this is better than not having anything.

Context: There have been both internal and external requests for various
quantized BN1d use cases.
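
A rough sketch of calling the new kernel directly through the dispatcher (the op name and the signature mirroring `quantized::batch_norm2d` are assumptions about how this landed):

```
import torch

x = torch.randn(2, 4, 8)                                  # (N, C, L)
qx = torch.quantize_per_tensor(x, 0.1, 0, torch.quint8)
weight, bias = torch.ones(4), torch.zeros(4)
mean, var = torch.zeros(4), torch.ones(4)
qy = torch.ops.quantized.batch_norm1d(qx, weight, bias, mean, var,
                                      1e-5, 0.1, 0)       # eps, out_scale, out_zp
print(qy.shape)  # torch.Size([2, 4, 8])
```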

Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_batch_norm_1d_2d_3d
python test/test_quantization.py TestQuantizedOps.test_batch_norm_1d_2d_3d_relu
python test/test_quantization.py TestQuantizeJitOps.test_qbatch_norm

// performance:
// https://gist.github.com/vkuzo/73a07c0f24c05f5804990d9ebfaecf5e

```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D22926254

fbshipit-source-id: 2780e6a81cd13a7455f6ab6e5118c22850a97a12
2020-08-05 17:20:18 -07:00
54ffb05eff better error message between C2 and glow (#41603)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41603

Pull Request resolved: https://github.com/pytorch/glow/pull/4704

Previously, in the glow onnxifi path, when an error was encountered, we logged it to stderr and just returned ONNXIFI_STATUS_INTERNAL_ERROR to C2. C2 then did CAFFE2_ENFORCE_EQUAL(return_code, ONNXIFI_STATUS_SUCCESS). The error message that eventually went to the user was something like

   [enforce fail at onnxifi_op.cc:545] eventStatus == ONNXIFI_STATUS_SUCCESS. 1030 vs 0

This diff adds plumbing to get human readable error message out of glow into C2.

Test Plan:
Ran the ads replayer and overloaded it with traffic. The error message sent back to the client used to be

  E0707 00:57:45.697196 3709559 Caffe2DisaggAcceleratorTask.cpp:493] During running REMOTE_OTHER net: [enforce fail at onnxifi_op.cc:545] eventStatus == ONNXIFI_STATUS_SUCCESS. 1030 vs 0 (Error from operator:....

Now it's

```
E0707 16:46:48.366263 1532943 Client.cpp:966] Exception when calling caffe2_run_disagg_accelerator on remote predictor for model 190081310_0 : apache::thrift::TApplicationException: c10::Error: [enforce fail at onnxifi_op.cc:556] .
Error code: RUNTIME_REQUEST_REFUSED
Error message: The number of allowed queued requests has been exceeded. queued requests: 100 allowed requests: 100
Error return stack:
glow/glow/lib/Runtime/HostManager/HostManager.cpp:673
glow/glow/lib/Onnxifi/HostMana (Error from operator:...
```

Reviewed By: gcatron, yinghai

Differential Revision: D22416857

fbshipit-source-id: 564bc7644d9666eb660725c2dca5637affae9b73
2020-08-05 16:25:13 -07:00
aa4e91a6dc Fix TestSparse.test_bmm_windows_error when CUDA is not available (#42626)
Summary:
Refactor the common pattern of `(torch.version.cuda and [int(x) for x in torch.version.cuda.split(".")] >= [a, b])` into a `_get_torch_cuda_version()` function.
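
The helper presumably looks something like this (a sketch of the refactoring; the real function lives in the test utilities):

```
import torch

def _get_torch_cuda_version():
    if torch.version.cuda is None:   # CPU-only build
        return [0, 0]
    return [int(x) for x in torch.version.cuda.split(".")]

# Replaces the repeated inline pattern:
if _get_torch_cuda_version() >= [10, 1]:
    pass  # run the CUDA-version-gated check
```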

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42626

Reviewed By: seemethere

Differential Revision: D22956149

Pulled By: malfet

fbshipit-source-id: 897c55965e53b477cd20f69e8da15d90489035de
2020-08-05 16:07:35 -07:00
5023995292 fix output size adjustment for onnxifi_op
Summary: this breaks if we cut the net at certain int8 op boundaries.

Test Plan: Used net_runner to lower a single Int8Quantize op. It used to break; now it works.

Reviewed By: yinghai

Differential Revision: D22912178

fbshipit-source-id: ca306068c9768df84c1cfa8b34226a1330e19912
2020-08-05 15:55:46 -07:00
102abb877c Reland D22939119: "[TensorExpr] Fix a way we were creating np arrays in tests." (#42608)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42608

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D22952745

Pulled By: ZolotukhinM

fbshipit-source-id: fd6a3efbfcaa876a2f4d27b507fe0ccdcb55a002
2020-08-05 15:14:23 -07:00
2501e2b12d [RPC tests] Run DdpUnderDistAutogradTest and DdpComparisonTest with fork too (#42528)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42528

It seems it was an oversight that they weren't run. This allows us to simplify our auto-generation logic, as now all test suites are run in both modes.
ghstack-source-id: 109229969

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D22922151

fbshipit-source-id: 0766a6970c927efb04eee4894b73d4bcaf60b97f
2020-08-05 15:10:29 -07:00
4da602b004 [RPC tests] Generate test classes automatically (#42527)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42527

ghstack-source-id: 109229468

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D22864698

fbshipit-source-id: 6a55f3201c544f0173493b38699a2c7e95ac1bbc
2020-08-05 15:10:26 -07:00
d7516ccfac [RPC tests] Enroll TensorPipe in missing test suites (#40823)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40823

Summary of the entire stack:
--

This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.

This refactoring aims to address these problems by:
- Avoiding global state, defaults/override, traps, if/elses, ... and having a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).

It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...

Summary of this commit
--
As it is now easier to spot that the TensorPipe agent wasn't being run on some test suite, we fix that. We keep this change for last so that if those tests turn out to be flaky and must be reverted this won't affect the rest of the stack.
ghstack-source-id: 109229469

Test Plan: Sandcastle and CircleCI

Reviewed By: pritamdamania87

Differential Revision: D22309432

fbshipit-source-id: c433a6a49a7b6737e0df4cd953f3dfde290f20b8
2020-08-05 15:10:23 -07:00
2e7b464c43 [RPC tests] Remove global TEST_CONFIG (#40822)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40822

Summary of the entire stack:
--

This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.

This refactoring aims to address these problems by:
- Avoiding global state, defaults/override, traps, if/elses, ... and having a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).

It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...

Summary of this commit
--
This is the last step of removing TEST_CONFIG. As there was no one left using it, there is really not much to it.
ghstack-source-id: 109229471

Test Plan: Sandcastle and CircleCI

Reviewed By: pritamdamania87

Differential Revision: D22307778

fbshipit-source-id: 0d9498d9367eec671e0a964ce693015f73c5638c
2020-08-05 15:10:20 -07:00
e7c7eaab82 [RPC tests] Move some functions to methods of fixture (#40821)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40821

Summary of the entire stack:
--

This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.

This refactoring aims to address these problems by:
- Avoiding global state, defaults/override, traps, if/elses, ... and having a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).

It provides further advantages:
- It puts all the agents on an equal footing, with none of them being the default, thus making it easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without further duplication of entry points, boilerplate, and so on.

Summary of this commit
--
This change continues the work towards removing TEST_CONFIG by taking a few functions that accepted the agent name (as obtained from TEST_CONFIG) and then did a bunch of if/elses on it, and replacing them with new abstract methods on the fixtures, so that these functions become "decentralized".
ghstack-source-id: 109229472

Test Plan: Sandcastle and CircleCI

Reviewed By: pritamdamania87

Differential Revision: D22307776

fbshipit-source-id: 9e1f6edca79aacf0bcf9d83d50ce9e0d2beec0dd
2020-08-05 15:10:17 -07:00
2acef69ce3 [RPC tests] Make generic fixture an abstract base class (#40820)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40820

Summary of the entire stack:
--

See the stack summary in #40821 above.

Summary of this commit
--
Now that no one is using the generic fixture anymore (i.e., the fixture that looks up the agent's name in the global TEST_CONFIG) we can make it abstract, i.e., have its methods become no-ops and add decorators that will require all subclasses to provide new implementations of those methods. This is a first step towards removing TEST_CONFIG.
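
A minimal sketch of what that looks like, with hypothetical names (the real fixture lives in the RPC testing utilities):

```python
import abc

# Hypothetical sketch: the generic fixture no longer supplies defaults;
# every concrete agent fixture must fill in these methods itself.
class RpcAgentTestFixture(abc.ABC):
    @property
    @abc.abstractmethod
    def rpc_backend_name(self):
        """Concrete agent fixtures must name their backend."""

    @property
    @abc.abstractmethod
    def rpc_backend_options(self):
        """...and supply the options used to construct it."""
```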
ghstack-source-id: 109229475

Test Plan: Sandcastle and CircleCI

Reviewed By: pritamdamania87

Differential Revision: D22307777

fbshipit-source-id: e52abd915c37894933545eebdfdca3ecb9559926
2020-08-05 15:10:14 -07:00
a94039fce5 [RPC tests] Avoid decorators to skip tests (#40819)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40819

Summary of the entire stack:
--

See the stack summary in #40821 above.

Summary of this commit
--
This diff removes the two decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which were used to skip tests. They were only used to prevent the TensorPipe agent from running tests that were using the process group agent's options. The converse (preventing the PG agent from using the TP options) is achieved by having those tests live in a `TensorPipeAgentRpcTest` class. So here we're doing the same for process group, by moving those tests to a `ProcessGroupAgentRpcTest` class.
ghstack-source-id: 109229473

Test Plan: Sandcastle and CircleCI

Reviewed By: pritamdamania87

Differential Revision: D22283179

fbshipit-source-id: b9315f9fd67f35e88fe1843faa161fc53a4133c4
2020-08-05 15:10:11 -07:00
935fcc9580 [RPC tests] Merge process group tests into single entry point (#40818)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40818

Summary of the entire stack:
--

See the stack summary in #40821 above.

Summary of this commit
--
This diff does the changes described above for the process group agent. It defines a fixture for it (instead of using the generic fixture in its default behavior) and then merges all the entry points into a single script. Note that after this change there will no longer be a "vanilla" RPC test: all test scripts now specify which agent they are using. This puts all agents on an equal footing.
ghstack-source-id: 109229474

Test Plan: Sandcastle and CircleCI

Reviewed By: pritamdamania87

Differential Revision: D22283182

fbshipit-source-id: 7e3626bbbf37d88b892077a03725f0598576b370
2020-08-05 15:10:07 -07:00
b93c7c54eb [RPC tests] Merge tests for faulty agent into single script (#40817)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40817

Summary of the entire stack:
--

See the stack summary in #40821 above.

Summary of this commit
--
This diff does the changes described above for the faulty agent, which is its own strange beast. It merges all the test entry points (i.e., the combinations of agent, suite and fork/spawn) into a single file. It also modifies the test suites that are intended to be run only on the faulty agent, which used to inherit from its fixture, to inherit from the generic fixture, as they will be mixed in with the faulty fixture at the very end, inside the entry point script.
ghstack-source-id: 109229477

Test Plan: Sandcastle and CircleCI

Reviewed By: pritamdamania87

Differential Revision: D22283178

fbshipit-source-id: 72659efe6652dac8450473642a578933030f2c74
2020-08-05 15:10:04 -07:00
edf6c4bc4d [RPC tests] Merge TensorPipe tests into single entry point (#40816)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40816

Summary of the entire stack:
--

See the stack summary in #40821 above.

Summary of this commit
--
This diff does the changes described above for the TensorPipe agent. It fixes its fixture (making it inherit from the generic fixture) and merges all the entry point scripts into a single one, so that it's easier to have a clear overview of all the test suites which we run on TensorPipe (you'll notice that many are missing: the JIT ones, the remote module one, ...).
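
Putting the pieces together, a minimal sketch (hypothetical names) of the single-entry-point pattern this stack converges on:

```python
import unittest

class RpcAgentTestFixture:  # stands in for the abstract base sketched earlier
    pass

class TensorPipeFixture(RpcAgentTestFixture):
    # Concrete fixture: everything agent-specific lives here.
    rpc_backend_name = "tensorpipe"
    rpc_backend_options = {}

class RpcTest:
    # Agent-agnostic suite; relies on whichever fixture gets mixed in.
    def test_something(self):
        assert self.rpc_backend_name

class DistAutogradTest:
    def test_something_else(self):
        assert self.rpc_backend_options is not None

# The one TensorPipe entry point lists every suite to run on this agent:
class TensorPipeRpcTest(RpcTest, TensorPipeFixture, unittest.TestCase):
    pass

class TensorPipeDistAutogradTest(DistAutogradTest, TensorPipeFixture, unittest.TestCase):
    pass

if __name__ == "__main__":
    unittest.main()
```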
ghstack-source-id: 109229476

Test Plan: Sandcastle and CircleCI

Reviewed By: pritamdamania87

Differential Revision: D22283180

fbshipit-source-id: d5e9f9f4e6d4bfd6fbcae7ae56eed63d2567a02f
2020-08-05 15:08:32 -07:00
73351ee91d [TensorExpr] Disallow fallback to JIT interpreter from TensorExprKernel (flip the default). (#42568)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42568

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D22936175

Pulled By: ZolotukhinM

fbshipit-source-id: 62cb505acb77789ed9f483842a8b31eb245697b3
2020-08-05 14:13:49 -07:00
ef50694d44 [TensorExpr] Apply GenericIntrinsicExpander recursively. (#42567)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42567

Before this change we didn't expand arguments, and thus in an expression
like `sigmoid(sigmoid(x))` only the outer call was expanded.
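
Schematically (a self-contained Python illustration, not the actual C++ expander), the fix amounts to recursing into the arguments before expanding the node itself:

```python
from dataclasses import dataclass

@dataclass
class Call:                  # stand-in for a TE intrinsic-call expression
    name: str
    args: list

INTRINSICS = {"sigmoid"}

def expand(node):
    if not isinstance(node, Call):
        return node
    args = [expand(a) for a in node.args]   # the fix: recurse into args first
    if node.name in INTRINSICS:
        return Call("expanded_" + node.name, args)  # stand-in for the lowering
    return Call(node.name, args)

# Both the inner and the outer sigmoid are now expanded:
print(expand(Call("sigmoid", [Call("sigmoid", ["x"])])))
```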

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D22936177

Pulled By: ZolotukhinM

fbshipit-source-id: 9c05dc96561225bab9a90a407d7bcf9a89b078a1
2020-08-05 14:13:46 -07:00
ea9053b86d [TensorExpr] Handle constant nodes in shape inference. (#42566)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42566

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D22936176

Pulled By: ZolotukhinM

fbshipit-source-id: 69d0f9907de0e98f1fbd56407df235774cb5b788
2020-08-05 14:13:44 -07:00
b9c49f0e69 [TensorExpr] Support shape inference in TE for aten::cat. (#42387)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42387

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D22879281

Pulled By: ZolotukhinM

fbshipit-source-id: 775e46a4cfd91c63196b378ee587cc4434672c89
2020-08-05 14:11:24 -07:00
feeb515ad5 add Quantizer support to IValue (#42438)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42438

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D22894190

Pulled By: bhosmer

fbshipit-source-id: b2d08abd6f582f29daa6cc7ebf05bb1a99f7514b
2020-08-05 12:56:18 -07:00
24e2a8a171 Revert D22780307: Fix illegal memory access issue for CUDA version of SplitByLengths operator.
Test Plan: revert-hammer

Differential Revision:
D22780307 (76905527fe)

Original commit changeset: c5ca60ae16b2

fbshipit-source-id: f3c99eec5f05121e2bed606fe2ba84a0be0cdf16
2020-08-05 12:47:56 -07:00
df7c059428 Throw error if torch.set_deterministic(True) is called with nondeterministic CuBLAS config (#41377)
Summary:
For CUDA >= 10.2, the `CUBLAS_WORKSPACE_CONFIG` environment variable must be set to either `:4096:8` or `:16:8` for cuBLAS to behave deterministically. This PR adds logic inside `torch.set_deterministic()` to raise an error if this environment variable is not set properly and CUDA >= 10.2.

Issue https://github.com/pytorch/pytorch/issues/15359
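
A minimal usage sketch of the behavior this adds (assuming a CUDA >= 10.2 build; the exact error message is up to the implementation):

```python
import os
import torch

# One of the two allowed values; must be set before cuBLAS is initialized.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # or ":16:8"

# On CUDA >= 10.2 this now raises if the variable is unset or invalid.
torch.set_deterministic(True)
```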

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41377

Reviewed By: malfet

Differential Revision: D22758459

Pulled By: ezyang

fbshipit-source-id: 4b96f1e9abf85d94ba79140fd927bbd0c05c4522
2020-08-05 12:42:24 -07:00
7221a3d1aa enable torch.optim.swa_utils.SWALR (#42574)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42574

Reviewed By: zou3519

Differential Revision: D22949369

Pulled By: vincentqb

fbshipit-source-id: f2f319ec94a97e0afe4d4327c866504ae632a986
2020-08-05 12:37:45 -07:00
18a32b807b Add API to collect output_col_minmax_histogram
Summary:
Add an API to collect output_col_minmax_histogram. This is used to implement input_equalization.

Rolled back the revised collect_single_histogram in the new version to make sure it does not affect the product.
The newly added API can collect the activation histogram and the output column max histogram at the same time.

Test Plan:
Add a unit test, and pass it.
https://our.intern.facebook.com/intern/testinfra/testrun/2251799847601374
After updating the dump API, it passed the updated unit test
https://our.intern.facebook.com/intern/testinfra/testrun/844425097716401

Integrated the output_col_minmax_histogram into collect_single_histogram, and made it backward compatible
https://our.intern.facebook.com/intern/testinfra/testrun/8162774342207893

I added different cases to test the newly added function. It passed the unit test  https://our.intern.facebook.com/intern/testinfra/testrun/4503599658969000

Tested after new revision: https://our.intern.facebook.com/intern/testinfra/testrun/5348024589078557

Reviewed By: hx89

Differential Revision: D22919913

fbshipit-source-id: c9cb05e0cf14af0dfde3d22921abb42f97a61df2
2020-08-05 12:33:10 -07:00
7c33225c72 Add strict mypy type checking and update code_template.py (#42322)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42322

Our current type checking rules are rather lax, and for
example don't force users to make sure they annotate all functions
with types.  For code generation code, it would be better to force
100% typing.  This PR introduces a new mypy configuration
mypy-strict.ini which applies rules from --strict.  We extend
test_type_hints.py to test for this case.  It only covers
code_template.py, which I have made strict clean in this PR.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D22846120

Pulled By: ezyang

fbshipit-source-id: 8d253829223bfa0d811b6add53b7bc2d3a4356b0
2020-08-05 12:28:15 -07:00
5c5d7a9dca Freeze dynamic (re)quantization ops into standard ones (#42591)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42591

We don't support lowering with 2-input Int8Quantize and 4-input Int8FC. Just do a conversion to absorb the quantization params into the op itself.

Test Plan:
```
buck test caffe2/caffe2/quantization/server:quantize_dnnlowp_op_test
```

Reviewed By: benjibc

Differential Revision: D22942673

fbshipit-source-id: a392ba2afdfa39c05c5adcb6c4dc5f814c95e449
2020-08-05 11:53:09 -07:00
6d1e43c5a6 Release the GIL before invokeOperator (#42341)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41865

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42341

Reviewed By: ezyang

Differential Revision: D22928622

Pulled By: wconstab

fbshipit-source-id: 8fa41277c9465f816342db6ec0e6cd4b30095c5c
2020-08-05 11:51:39 -07:00
76905527fe Fix illegal memory access issue for CUDA version of SplitByLengths operator.
Summary:
1. Fix illegal memory access issue for SplitByLengths operator in the CUDA context.
2. Add support for scaling lengths vectors in the SplitByLengths operator.
3. Add tests for the SplitByLengths operator in the CUDA context.

Example of the SplitByLengths operator processing a scaling lengths vector:
value vector A = [1, 2, 3, 4, 5, 6]
length vector B = [1, 2]
After execution of the SplitByLengths operator,
the output should be [1, 2] and [3, 4, 5, 6] (the lengths sum to 3, so they are scaled by 2 to cover all 6 values).
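
A hedged sketch of exercising that example through the Caffe2 Python frontend (blob and output names here are illustrative):

```python
import numpy as np
from caffe2.python import core, workspace

workspace.FeedBlob("values", np.array([1, 2, 3, 4, 5, 6], dtype=np.float32))
workspace.FeedBlob("lengths", np.array([1, 2], dtype=np.int32))

# lengths sum to 3 while values has 6 elements, so each length is scaled by 2.
op = core.CreateOperator("SplitByLengths", ["values", "lengths"], ["out0", "out1"])
workspace.RunOperatorOnce(op)

print(workspace.FetchBlob("out0"))  # [1. 2.]
print(workspace.FetchBlob("out1"))  # [3. 4. 5. 6.]
```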

Test Plan: buck test mode/dev-nosan caffe2/caffe2/python/operator_test:concat_split_op_test

Reviewed By: kennyhorror

Differential Revision: D22780307

fbshipit-source-id: c5ca60ae16b24032cedfa045a421503b713daa6c
2020-08-05 11:46:00 -07:00
06d978a9ad [c10/cuda] Reorganize device_count() and robustly surface ASAN warnings (#42249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42249

Main change is to bring Caffe2's superior error messages for cuda initialization into c10 and use them in all code paths.

Basic logic:

| Case | Call to device_count() | init_cuda, e.g. allocating tensor |
| -- | -- | -- |
| all good | non-zero | just works |
| no gpus | 0, no warning | throw exception with good message |
| driver issues | 0, produce warning | throw exception with good message |
| out of memory with ASAN | 0, produce warning| throw exception with ASAN message |

Previously, the error thrown from init_cuda was very generic and the ASAN warning (if any) was buried in the logs.

Other clean up changes:
* always cache device_count() in a static variable
* move all ASAN macros into c10

Test Plan:
Hard to unittest because of build modes. Verified manually that the behavior from the table above holds by running the following script in different modes (ASAN/no-ASAN, CUDA_VISIBLE_DEVICES=):

```
print('before import')
import torch
print('after import')
print('devices: ', torch.cuda.device_count())
x = torch.tensor([1,2,3])
print('tensor creation')
x = x.cuda()
print('moved to cuda')
```

Reviewed By: ngimel

Differential Revision: D22824329

fbshipit-source-id: 5314007313a3897fc955b02f8b21b661ae35fdf5
2020-08-05 11:39:31 -07:00
27e8dc78ca [vulkan] VulkanTensor lazy buffer allocation (#42569)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42569

We do not need to allocate buffers for Vulkan tensors unless they are the forward input or output.
This change removes the default allocate_storage() call for the outputs of operations; their image representation will hold the result.
A buffer is allocated only when an operation requests it (as some ops like concatenate and transpose do) or when copying to the host.

If the buffer was not allocated, `VulkanTensor.image()` just allocates the texture, skipping the copy from buffer to texture.
As allocate_storage used to run for all operations, we save a buffer allocation and a buffer_to_image call.

MobileNetV2 on my Pixel 4:
```
flame:/data/local/tmp $ ./speed_benchmark_torch  --model=mnfp32-vopt.pt --input_type=float --input_dims=1,3,224,224 --warmup=3 --iter=20 --vulkan=true
Starting benchmark.
Running warmup runs.
Main runs.
Main run finished. Microseconds per iter: 305818. Iters per second: 3.26991
Segmentation fault
```
```
139|flame:/data/local/tmp $ ./speed_benchmark_torch_noas  --model=mnfp32-vopt.pt --input_type=float --input_dims=1,3,224,224 --warmup=3 --iter=20 --vulkan=true
Starting benchmark.
Running warmup runs.
Main runs.
Main run finished. Microseconds per iter: 236768. Iters per second: 4.22355
Segmentation fault
```

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22946552

Pulled By: IvanKobzarev

fbshipit-source-id: ac0743bb316847632a22cf9aafb8938e50b2fb7b
2020-08-05 10:54:41 -07:00
dae94ed022 Keep manual_kernel_registration only effective in aten codegen. (#42386)
Summary:
This PR removes manual registration from the aten/native codebase
and separates manual device/catchall kernel registration from manual VariableType kernel registration.
The former remains as manual_kernel_registration in native_functions.yaml;
the latter is moved to the tools/ codegen.

Difference in generated TypeDefault.cpp: https://gist.github.com/ailzhang/897ef9fdf0c834279cd358febba07734
No difference in generated VariableType_X.cpp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42386

Reviewed By: agolynski

Differential Revision: D22915649

Pulled By: ailzhang

fbshipit-source-id: ce93784b9b081234f05f3343e8de3c7a704a5783
2020-08-05 10:31:35 -07:00
b08347fd7b Add CUDA 11 builds for Windows CI (#42420)
Summary:
Stacked on https://github.com/pytorch/pytorch/pull/42410.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42420

Reviewed By: seemethere

Differential Revision: D22917230

Pulled By: malfet

fbshipit-source-id: 6ad394f7f8c430c587e0b0d9c5a5e7b7bcd85bfe
2020-08-05 09:40:33 -07:00
db52cd7322 .circleci: Hardcode rocm image to previous tag (#42603)
Summary:
There were some inconsistencies with the newer docker images, so it'd be
best to stick with something that works without reverting the entire
docker builder PR.

This was made after the previous efforts to disable the tests that were failing:
* https://github.com/pytorch/pytorch/pull/42583
* https://github.com/pytorch/pytorch/pull/42561

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42603

Reviewed By: ezyang

Differential Revision: D22948743

Pulled By: seemethere

fbshipit-source-id: cc8b834e0c8a6a4763f5ba07ce220a9c192ea6eb
2020-08-05 09:23:21 -07:00
eb8a5fed38 Automated submodule update: FBGEMM (#42584)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 4abc34af1a

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42584

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: dskhudia

Differential Revision: D22941475

fbshipit-source-id: 29863cad7f77939edb44d337918693879b35cfaa
2020-08-05 09:19:27 -07:00
924a1dbe9b Revert D22939119: [TensorExpr] Fix a way we were creating np arrays in tests.
Test Plan: revert-hammer

Differential Revision:
D22939119 (882ad117cf)

Original commit changeset: 3388270af8ea

fbshipit-source-id: 7c8d159586ce2c4c21184fd84aa6da5183bc71ea
2020-08-05 08:25:47 -07:00
0cf71eb547 Unconditionally use typing_extensions in jit_internal (#42538)
Summary:
Since https://github.com/pytorch/pytorch/issues/38221 is closed now, the `typing_extensions` module should always be available.
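
A minimal sketch of the kind of simplification this allows in `jit_internal` (the exact names imported may differ):

```python
# Before: guard against typing_extensions being absent.
try:
    from typing_extensions import Final
except ImportError:
    Final = None  # some fallback definition

# After: typing_extensions is a hard requirement, so import unconditionally.
from typing_extensions import Final
```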

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42538

Reviewed By: ezyang

Differential Revision: D22942153

Pulled By: malfet

fbshipit-source-id: edabbadde13800a3412d14c19ca55ef206ada5e1
2020-08-05 08:22:59 -07:00
b85216887b [vulkan] max_pool2d (#41379)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41379

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22754944

Pulled By: IvanKobzarev

fbshipit-source-id: 5261337bb731a207a1532e6423c0d33f1307e413
2020-08-05 01:53:52 -07:00
0f358fab6b Hide cudnn symbols in libtorch_cuda.so when statically linking cudnn (#41986)
Summary:
This PR intends to fix https://github.com/pytorch/pytorch/issues/32983.

The initial (one-line) diff causes statically linked cudnn symbols in `libtorch_cuda.so` to have local linkage (such that they shouldn't be visible to external libraries during dynamic linking at load time), at least in my source build on Ubuntu 20.04.

Procedure I used to verify:
```
export USE_STATIC_CUDNN=ON
python3 setup.py install
...
```
then
```
mcarilli@mcarilli-desktop:~/Desktop/mcarilli_github/pytorch/torch/lib$ nm libtorch_cuda.so | grep cudnnCreate
00000000031ff540 t cudnnCreate
00000000031fbe70 t cudnnCreateActivationDescriptor
```
Before the diff they were marked with capital `T`s indicating external linkage.

Caveats:
- The fix is gcc-specific afaik.  I have no idea how to enable it for Windows or other compilers.
- Hiding the cudnn symbols will break external C++ applications that rely on linking `libtorch.so` to supply cudnn symbol definitions.  IMO this is "off menu" usage so I don't think it's a major concern.  Hiding the symbols _won't_ break applications that call cudnn indirectly through torch functions, which IMO is the "on menu" way.
- I know _very little_ about the build system.  The diff's intent is to add a link option that applies to any Pytorch `.so`s that statically link cudnn, and does so on Linux only.  I'm blindly following soumith 's recommendation https://github.com/pytorch/pytorch/issues/32983#issuecomment-662056151, and post-checking the built libs (I also added `set(CMAKE_VERBOSE_MAKEFILE ON)` to the top-level CMakeLists.txt at one point to confirm `-Wl,--exclude-libs,libcudnn_static.a` was picked up by the command that linked `libtorch_cuda.so`).
- https://github.com/pytorch/pytorch/issues/32983 (which used a Pytorch 1.4 binary build) complained about `libtorch.so`, not `libtorch_cuda.so`:
    ```
    nvpohanh@ubuntu:~$ nm /usr/local/lib/python3.5/dist-packages/torch/lib/libtorch.so | grep ' cudnnCreate'
    000000000f479c30 T cudnnCreate
    000000000f475ff0 T cudnnCreateActivationDescriptor
    ```
  In my source build, `libtorch.so` ends up small, containing no cudnn symbols (this is true with or without the PR's diff), which contradicts https://github.com/pytorch/pytorch/issues/32983.  Maybe the symbol organization (what goes in   `libtorch.so` vs `libtorch_cuda/cpu/whatever.so`) changed since 1.4.  Or maybe the symbol organization is different for source vs binary builds, in which case I have no idea if this PR's diff has the same effect for a binary build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41986

Reviewed By: glaringlee

Differential Revision: D22934926

Pulled By: malfet

fbshipit-source-id: 711475834e0f8148f0e5f2fe28fca5f138ef494b
2020-08-04 22:59:40 -07:00
882ad117cf [TensorExpr] Fix a way we were creating np arrays in tests. (#42575)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42575

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D22939119

Pulled By: ZolotukhinM

fbshipit-source-id: 3388270af8eae9fd4747f06202f366887aaf5f36
2020-08-04 21:24:25 -07:00
3c7fccc1c2 Reenable cusparse SpMM on cuda 10.2 (#42556)
Summary:
This fixes a feature regression introduced by https://github.com/pytorch/pytorch/issues/42412, which limited all use of the API to CUDA 11.0+.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42556

Reviewed By: ngimel

Differential Revision: D22932129

Pulled By: malfet

fbshipit-source-id: 2756e0587456678fa1bc7deaa09d0ea482dfd19f
2020-08-04 19:02:34 -07:00
78f4cff8fe handle multiple returns properly in boxing wrappers (#42437)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42437

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D22894191

Pulled By: bhosmer

fbshipit-source-id: fd4c7bc605a4b20bb3882f71e3b8874150671324
2020-08-04 18:27:25 -07:00
d45e2d3ef9 Reduce the output overhead of OutputColumnMaxHistogramObserver by making bin_nums configurable; update observer_test.py
Summary: The current OutputColumnMaxHistogramObserver outputs 2048 bins for each column, so the dump file is extremely large and dumping takes quite long, even though in the end we only use the min and max. This diff makes bin_nums configurable via an argument, with the default set to 16 to reduce dumping overhead. When more bins are needed to analyze the results, only this argument has to change.

Test Plan:
buck run caffe2/caffe2/quantization/server:observer_test

{F263843430}

Reviewed By: hx89

Differential Revision: D22918202

fbshipit-source-id: bda34449355b269b24c55802012450ebaa4d280c
2020-08-04 17:07:25 -07:00
61027a1a59 Install typing_extensions in PyTorch CI (#42551)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42551

Reviewed By: seemethere

Differential Revision: D22929256

Pulled By: malfet

fbshipit-source-id: 9a6f8c56ca1c0fb8a8569614a34a12f2769755f3
2020-08-04 17:03:44 -07:00
29700c0092 [JIT] Fix torch.jit.is_tracing() (#42486)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42486

**Summary**
This commit fixes a small bug in which `torch.jit.is_tracing()` returns
`torch._C.is_tracing`, the function object, instead of calling the
function and returning the result.
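
The bug class, distilled into a self-contained example (stand-in names, not the actual diff):

```python
def _tracing_state():
    return False  # stand-in for the torch._C tracing query

def is_tracing_buggy():
    return _tracing_state    # bug: missing (), returns the function object

def is_tracing_fixed():
    return _tracing_state()  # fix: call it and return the result

assert bool(is_tracing_buggy()) is True   # a function object is always truthy
assert is_tracing_fixed() is False
```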

**Test Plan**
Continuous integration?

**Fixes**
This commit fixes #42448.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D22911062

Pulled By: SplitInfinity

fbshipit-source-id: b94eca0c1c65ca6f22acc6c5542af397f2dc37f0
2020-08-04 16:57:36 -07:00
afa489dea9 [ONNX] Enable lower_tuple pass for custom layer (#41548)
Summary:
A custom layer defined via `torch.autograd.Function` appears in the lower_tuple pass as `prim::PythonOp`. This PR adds that op type to the allowed list to enable the lower_tuple pass, which helps with exporting custom layers with tuple outputs.

E.g.
```python
import torch
class CustomFunction(torch.autograd.Function):
    @staticmethod
    def symbolic(g, input):
        return g.op('CustomNamespace::Custom', input, outputs=2)
    @staticmethod
    def forward(ctx, input):
        return input, input
class Custom(torch.nn.Module):
    def forward(self, input):
        return CustomFunction.apply(input)

model = Custom()
batch = torch.FloatTensor(1, 3)
torch.onnx.export(model, batch, "test.onnx", verbose=True)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41548

Reviewed By: glaringlee

Differential Revision: D22926143

Pulled By: bzinodev

fbshipit-source-id: ce14d1d3c70a920154a8235d635ab31ddf0c46f3
2020-08-04 16:22:39 -07:00
ccc831ae35 test: Disable test_strided_grad_layout on ROCM (#42561)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42561

Regression was introduced as part of 5939d8a3e0, logs: https://app.circleci.com/pipelines/github/pytorch/pytorch/196558/workflows/9a2dd56e-86af-4d0f-9fb9-b205dcd12f93/jobs/6502042

Going to go ahead and disable the test to give rocm folks time to investigate what's going on

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D22932615

Pulled By: seemethere

fbshipit-source-id: 41150f3085f848cce75990716362261fea9391a0
2020-08-04 16:20:44 -07:00
c3e2ee725f Automated submodule update: FBGEMM (#42496)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 87c378172a

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42496

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: dskhudia

Differential Revision: D22911638

fbshipit-source-id: f20c83908b51ff56d8bf1d8b46961f70d023c81a
2020-08-04 16:15:26 -07:00
b9e68e03c4 Fix the bug in THCTensor_(baddbmm) and ATen's addmm_cuda for strided view inputs (#42425)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42418.

The problem was that the non-contiguous batched matrices were passed to `gemmStridedBatched`.

The following code fails on master and works with the proposed patch:
```python
import torch
x = torch.tensor([[1., 2, 3], [4., 5, 6]], device='cuda:0')
c = torch.as_strided(x, size=[2, 2, 2], stride=[3, 1, 1])
torch.einsum('...ab,...bc->...ac', c, c)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42425

Reviewed By: glaringlee

Differential Revision: D22925266

Pulled By: ngimel

fbshipit-source-id: a72d56d26c7381b7793a047d76bcc5bd45a9602c
2020-08-04 16:11:07 -07:00
317b9d3bfc Implement sort for string in aten (#42398)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42375

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42398

Reviewed By: ailzhang

Differential Revision: D22884849

Pulled By: gmagogsfm

fbshipit-source-id: e53386949f0a5e166f3d1c2aa695294340bd1440
2020-08-04 15:25:35 -07:00
56fc7d0345 Fix doc build (#42559)
Summary:
Add a space between the double backquotes and the left curly bracket.

Otherwise doc generation failed with `Inline literal start-string without end-string.`

This regression was introduced by b56db305cf

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42559

Reviewed By: glaringlee

Differential Revision: D22931527

Pulled By: malfet

fbshipit-source-id: 11c04a92dbba48592505f704d77222cf92a81055
2020-08-04 15:15:15 -07:00
e995c3d21e Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar) (#41554)
Summary:
Initial PR for the Tensor List functionality.

**Motivation**
[GitHub issue](https://github.com/pytorch/pytorch/issues/38655)
Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors: launching a lot of kernels slows down the whole process, so we need to reduce the number of kernels that we start.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**In this PR**
- Adding `multi_tensor_apply` mechanism which will help to efficiently apply passed functor on a given list of tensors on CUDA.
- Adding a first private API - `std::vector<Tensor> _foreach_add(TensorList tensors, Scalar scalar)` (usage sketch below)
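
A usage sketch of the new private API, per the signature above:

```python
import torch

tensors = [torch.randn(2, 2), torch.randn(5), torch.randn(3, 1)]

# One fused launch (on CUDA) instead of one kernel launch per tensor:
results = torch._foreach_add(tensors, 1.0)

# Semantically equivalent, but with len(tensors) separate launches:
expected = [t + 1.0 for t in tensors]
```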

**Tests**
Tested via unit tests

**Plan for the next PRs**

1. Cover these ops with `multi_tensor_apply` support
- exponent
- division
- mul_
- add_
- addcmul_
- addcdiv_
- sqrt

2. Rewrite PyTorch optimizers to use for-each operators in order to get performance gains.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41554

Reviewed By: cpuhrsch

Differential Revision: D22829724

Pulled By: izdeby

fbshipit-source-id: 47febdbf7845cf931958a638567b7428a24782b1
2020-08-04 15:01:09 -07:00
a0695b34cd .circleci: Have python docs always push to site (#42552)
Summary:
We were getting an error when attempting to push to master for
pytorch/pytorch.github.io, since the main branch on that repository is
actually site, not master.

Also get rid of the loop, since it wasn't going to work with a
conditional, and a conditional inside a two-variable loop just isn't
worth the readability cost.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42552

Reviewed By: malfet

Differential Revision: D22929503

Pulled By: seemethere

fbshipit-source-id: acdd26b86718304eac9dcfc81761de0b3e609004
2020-08-04 14:44:42 -07:00
91d87292a6 [vulkan][asan] Fix Invalid Memory ops (#41224)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41224

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22754940

Pulled By: IvanKobzarev

fbshipit-source-id: f012b78a57f5f88897b2b6b91713090c8984a0bc
2020-08-04 14:33:49 -07:00
0d1a689764 [vulkan] reshape op (#41223)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41223

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22754942

Pulled By: IvanKobzarev

fbshipit-source-id: 99fc5888803d6afe2a73bb5bbed6651d2ea98313
2020-08-04 14:32:06 -07:00
e97e87368e Clean up CUDA Sleep and Tensor Initialization in ProcessGroupNCCLTest (#42211)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42211

Helper functions for launching CUDA Sleep and Tensor Value Initialization for the collective test functions.

This is more of a code cleanup fix compared to the previous diffs.
ghstack-source-id: 109097243

Test Plan: working on devGPU and devvm

Reviewed By: jiayisuse

Differential Revision: D22782671

fbshipit-source-id: 7d88f568a4e08feae778669affe69c8d638973db
2020-08-04 12:36:27 -07:00
3ca361791f TearDown function for ProcessGroupNCCLTest Initializer (#42209)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42209

This PR adds a TearDown function to the testing superclass to ensure that the NCCL_BLOCKING_WAIT environment variable is reset after each test case.
ghstack-source-id: 109097247

Test Plan: Working on devGPU and devvm.

Reviewed By: jiayisuse

Differential Revision: D22782672

fbshipit-source-id: 8f919a96d7112f9f167e90ce3df59886c88f3514
2020-08-04 12:36:24 -07:00
2b8e7e2f2d Moving ProcessGroupNCCLTest to Gtest (#42208)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42208

ProcessGroupNCCLTest is currently written without any testing framework, and all tests are simply called from the main function and throw exceptions upon failure. As a result, it is hard to debug and pinpoint which tests have succeeded/failed.

This PR moves ProcessGroupNCCLTest to gtest with appropriate setup and skipping functionality in the test superclass.
ghstack-source-id: 109097246

Test Plan: Working Correctly on devGPU and devvm.

Reviewed By: jiayisuse

Differential Revision: D22782673

fbshipit-source-id: 85bd407f4534f3d339ddcdd65ef3d2022aeb7064
2020-08-04 12:34:09 -07:00
b3ffebda7a [TensorExpr] Properly handle all dtypes of the condition in evaluation of IfThenElse exprs. (#42495)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42495

Test Plan: Imported from OSS

Reviewed By: nickgg

Differential Revision: D22910753

Pulled By: ZolotukhinM

fbshipit-source-id: f9ffd3dc4c50fb3fb84ce6d6916c1fbfd3201c8f
2020-08-04 12:25:56 -07:00
c334ebf1aa [TensorExpr] Properly handle all dtypes in evaluation of Intrinsics exprs. (#42494)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42494

Note that we're currently assuming that the dtypes of all the arguments
and of the return value are the same.

Test Plan: Imported from OSS

Reviewed By: nickgg

Differential Revision: D22910755

Pulled By: ZolotukhinM

fbshipit-source-id: 7f899692065428fbf2ad05d22b4ca39cab788ae5
2020-08-04 12:25:54 -07:00
38a9984451 [TensorExpr] Properly handle all dtypes in evaluation of CompareSelect exprs. (#42493)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42493

Test Plan: Imported from OSS

Reviewed By: nickgg

Differential Revision: D22910754

Pulled By: ZolotukhinM

fbshipit-source-id: cf7073d6ea792998a9fa3989c7ec486419476de0
2020-08-04 12:24:03 -07:00
5939d8a3e0 Revert "Revert D22360735: .circleci: Build docker images as part of C… (#40950)
Summary:
…I workflow"

This reverts commit 3c6b8a64964b0275884359dd6a5bf484655d8c7c.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40950

Reviewed By: malfet

Differential Revision: D22909883

Pulled By: seemethere

fbshipit-source-id: 93c070400d7fbe1753f88c3291ab5eba4ab237fa
2020-08-04 12:12:17 -07:00
4b42a5b5a1 Remove redundant kernels calling TypeDefault in VariableType codegen. (#42031)
Summary:
We have code snippets like the one below in VariableType_X.cpp:
```
Tensor __and___Scalar(const Tensor & self, Scalar other) {
  auto result = TypeDefault::__and___Scalar(self, other);
  return result;
}
TORCH_LIBRARY_IMPL(aten, Autograd, m) {
  m.impl("__and__.Scalar",
         c10::impl::hacky_wrapper_for_legacy_signatures(TORCH_FN(VariableType::__and___Scalar))
  );
}
```
We already register TypeDefault kernels as catchAll, so they don't need to be wrapped and registered to the Autograd key in VariableType.cpp. This PR removes the wrapper and registration in VariableType.cpp. (The ones in other files like TracedType.cpp remain the same.)
Here's a [diff in generated VariableTypeEverything.cpp](https://gist.github.com/ailzhang/18876edec4dad54e43a1db0c127c5707)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42031

Reviewed By: agolynski

Differential Revision: D22903507

Pulled By: ailzhang

fbshipit-source-id: 04e6672b6c79e079fc0dfd95c409ebca7f9d76fc
2020-08-04 11:56:15 -07:00
94e8676a70 Initialize uninitialized variable (#42419)
Summary:
Fixes internal T70924595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42419

Reviewed By: allwu, Krovatkin

Differential Revision: D22889325

Pulled By: wconstab

fbshipit-source-id: 108b6a6c6bb7c98d77e22bae9974a6c00bc296f0
2020-08-04 11:35:54 -07:00
d2a2ac4eea Fix read/write bulk data (#42504)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42504

Reviewed By: glaringlee

Differential Revision: D22922750

Pulled By: mrshenli

fbshipit-source-id: 9008fa22c00513bd75c3cf88a3081184cd72b0e3
2020-08-04 11:30:53 -07:00
ec898b1ab5 fix discontiguous inputs/outputs for cummin/cummax (#42507)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42363

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42507

Reviewed By: mruberry

Differential Revision: D22917876

Pulled By: ngimel

fbshipit-source-id: 05f3f4a55bcddf6a853552184c9fafcef8d36270
2020-08-04 10:12:07 -07:00
ecb88c5d11 Add NCCL Alltoall to PT NCCL process group (#42514)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42514

Add Alltoall and Alltoallv to PT NCCL process group using NCCL Send/Recv.

Reviewed By: mrshenli

Differential Revision: D22917967

fbshipit-source-id: 402f2870915bc237845864a4a27c97df4351d975
2020-08-04 08:39:28 -07:00
b56db305cf Improve the documentation of DistributedDataParallel (#42471)
Summary:

Simply stating that 'gradients from each node are averaged' in the documentation of DistributedDataParallel is not clear. Many people, including me, have had a totally wrong understanding of this part. I added a note to the documentation to make it more straightforward and user friendly.

Here is some toy code to illustrate my point:

* non-DistributedDataParallel version
    ```python
    import torch
    import torch.nn as nn

    x = torch.tensor([-1, 2, -3, 4], dtype=torch.float).view(-1, 1)
    print("input:", x)

    model = nn.Linear(in_features=1, out_features=1, bias=False)
    model.weight.data.zero_()
    model.weight.data.add_(1.0)

    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()

    y = model(x)

    label = torch.zeros(4, 1, dtype=torch.float)
    loss = torch.sum((y - label)**2)

    loss.backward()
    opti.step()

    print("grad:", model.weight.grad)
    print("updated weight:\n", model.weight)

    # OUTPUT
    # $ python test.py
    # input: tensor([[-1.],
    #         [ 2.],
    #         [-3.],
    #         [ 4.]])
    # grad: tensor([[60.]])
    # updated weight:
    #  Parameter containing:
    # tensor([[0.9400]], requires_grad=True)
    ```

* DistributedDataParallel version
    ```python
    import os
    import torch
    import torch.nn as nn
    import torch.distributed as dist
    from torch.multiprocessing import Process

    def run(rank, size):
        x = torch.tensor([-(1 + 2 * rank), 2 + 2 * rank], dtype=torch.float).view(-1, 1)
        print("input:", x)

        model = nn.Linear(in_features=1, out_features=1, bias=False)
        model.weight.data.zero_()
        model.weight.data.add_(1.0)
        model = torch.nn.parallel.DistributedDataParallel(model)

        opti = torch.optim.SGD(model.parameters(), lr=0.001)
        opti.zero_grad()

        y = model(x)

        label = torch.zeros(2, 1, dtype=torch.float)
        loss = torch.sum((y.view(-1, 1) - label)**2)

        loss.backward()
        opti.step()

        if rank == 0:
            print("grad:", model.module.weight.grad)
            print("updated weight:\n", model.module.weight)

    def init_process(rank, size, fn, backend="gloo"):
        os.environ['MASTER_ADDR'] = '127.0.0.1'
        os.environ['MASTER_PORT'] = '29500'
        dist.init_process_group(backend, rank=rank, world_size=size)
        fn(rank, size)

    if __name__ == "__main__":
        size = 2
        process = []
        for rank in range(size):
            p = Process(target=init_process, args=(rank, size, run))
            p.start()
            process.append(p)

        for p in process:
            p.join()

    # OUTPUT
    # $ python test_d.py
    # input: tensor([[-3.],
    #         [ 4.]])input: tensor([[-1.],
    #         [ 2.]])

    # grad: tensor([[30.]])
    # updated weight:
    #  Parameter containing:
    # tensor([[0.9700]], requires_grad=True)
    ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42471

Reviewed By: glaringlee

Differential Revision: D22923340

Pulled By: mrshenli

fbshipit-source-id: 40b8c8ba63a243f857cd5976badbf7377253ba82
2020-08-04 08:36:42 -07:00
f3e8fff0d2 Batching rules for: chunk, split, unbind (#42480)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42480

These are grouped together because they all return a tuple of multiple
tensors.

This PR implements batching rules for chunk, split, and unbind. It also
updates the testing logic: previously, reference_vmap was not able to
handle multiple outputs; now it can.

Test Plan: - `pytest test/test_vmap.py -v -k "Operators"`

Reviewed By: ezyang

Differential Revision: D22905401

Pulled By: zou3519

fbshipit-source-id: 9963c943d035e9035c866be74dbdf7ab1989f8c4
2020-08-04 08:33:43 -07:00
f1d7f001b9 Batching rules for: torch.movedim, torch.narrow, Tensor.unfold (#42474)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42474

Test Plan: - `pytest test/test_vmap.py -v -k "Operators"`

Reviewed By: ezyang

Differential Revision: D22903513

Pulled By: zou3519

fbshipit-source-id: 06b3fb0c7d12b9a045c73a5c5a4f4e3207e07b02
2020-08-04 08:33:41 -07:00
01cd613e7e Batching rules for: T, view, view_as, reshape, reshape_as (#42458)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42458

Test Plan: - `pytest test/test_vmap.py -v -k "Operators"`

Reviewed By: ezyang

Differential Revision: D22898715

Pulled By: zou3519

fbshipit-source-id: 47f374962697dcae1d5aec80a41085679d016f92
2020-08-04 08:31:33 -07:00
0c48aa1e07 Add typing annotations to hub.py and _jit_internal.py (#42252)
Summary:
xref: https://github.com/pytorch/pytorch/wiki/Guide-for-adding-type-annotations-to-PyTorch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42252

Reviewed By: malfet

Differential Revision: D22916480

Pulled By: ezyang

fbshipit-source-id: 392ab805b0023640a3b5cdf600f70638b375f84f
2020-08-04 08:20:44 -07:00
d21e345ef0 Fix segfault in THPGenerator_dealloc (take 2) (#42510)
Summary:
A segfault happens when one tries to deallocate an uninitialized generator.
Make `THPGenerator_dealloc` UBSAN-safe by moving the implicit cast in the struct definition to a reinterpret_cast.

Add `TestTorch.test_invalid_generator_raises` that validates that Generator created on invalid device is handled correctly

Fixes https://github.com/pytorch/pytorch/issues/42281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42510

Reviewed By: pbelevich

Differential Revision: D22917469

Pulled By: malfet

fbshipit-source-id: 5eaa68eef10d899ee3e210cb0e1e92f73be75712
2020-08-04 08:06:08 -07:00
8850fd1952 Add python interface to create OfflineTensor (#42516)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42516

As titled. We need it for some scripts.

Reviewed By: houseroad

Differential Revision: D22918112

fbshipit-source-id: 8a1696ceeeda67a34114bc57cb52c925711cfb4c
2020-08-04 01:31:34 -07:00
ae67f4c8b8 Revert D22845258: [pytorch][PR] [ONNX] Enable scripting tests and update jit passes
Test Plan: revert-hammer

Differential Revision:
D22845258 (04e55d69f9)

Original commit changeset: d57fd4086f27

fbshipit-source-id: 15aa5cdae496a5e8ce2d8739a06dd4a7edc2200c
2020-08-03 23:15:06 -07:00
842759591d [ONNX] Refactor ONNX fixup for Loop and If (#40943)
Summary:
* move both under new file `fixup_onnx_controlflow`
* move the fixup to where the ONNX loop/if node is created, as opposed to running the fixup as a postpass. This will help with enabling ONNX shape inference later.
* move `fuseSequenceSplitConcat` to `Peephole`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40943

Reviewed By: mrshenli

Differential Revision: D22709999

Pulled By: bzinodev

fbshipit-source-id: 51d316991d25dc4bb4047a6bb46ad1e2401d3d2d
2020-08-03 22:33:17 -07:00
55d2a732cd Skip part of test_figure[_list] if Matplotlib-3.3.0 is installed (#42500)
Summary:
See https://github.com/matplotlib/matplotlib/issues/18163 for more details
Fixes https://github.com/pytorch/pytorch/issues/41680

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42500

Reviewed By: ezyang

Differential Revision: D22915857

Pulled By: malfet

fbshipit-source-id: 4f8858b7b0018c6958a49f908de81a13a29e6046
2020-08-03 21:43:22 -07:00
49e06e305f [ONNX] Updating input node removal in ONNX function_substitution pass. (#42146)
Summary:
The ONNX pass `torch._C._jit_pass_onnx_function_substitution(graph)` inlines the function into the compiled torch graph. But while it removes all connections with the compiled function node (e.g. see below - `%6 : Function = prim::Constant[name="f"]()`), it does not remove the function node itself. For example, if the input graph is:
```
graph(%0 : Long(requires_grad=0, device=cpu),
      %1 : Long(requires_grad=0, device=cpu)):
  %6 : Function = prim::Constant[name="f"]()
  %7 : Tensor = prim::CallFunction(%6, %0, %1)
  return (%7)
```
The output graph is:
```
graph(%0 : Long(requires_grad=0, device=cpu),
      %1 : Long(requires_grad=0, device=cpu)):
  %6 : Function = prim::Constant[name="f"]()
  %8 : int = prim::Constant[value=1]()
  %z.1 : Tensor = aten::sub(%0, %1, %8) # test/onnx/test_utility_funs.py:790:20
  %10 : Tensor = aten::add(%0, %z.1, %8) # test/onnx/test_utility_funs.py:791:23
  return (%10)
```
Note that the `%6 : Function = prim::Constant[name="f"]()` has not been removed (though it is not being used).

This PR updates the pass to remove the function node completely. The updated graph looks as follows:
```
graph(%0 : Long(requires_grad=0, device=cpu),
      %1 : Long(requires_grad=0, device=cpu)):
  %8 : int = prim::Constant[value=1]()
  %z.1 : Tensor = aten::sub(%0, %1, %8) # test/onnx/test_utility_funs.py:790:20
  %10 : Tensor = aten::add(%0, %z.1, %8) # test/onnx/test_utility_funs.py:791:23
  return (%10)
```

A test point has also been added for this scenario.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42146

Reviewed By: VitalyFedyunin

Differential Revision: D22845314

Pulled By: bzinodev

fbshipit-source-id: 81fb351f0a36f47204e5327b60b84d7a91d3bcd9
2020-08-03 21:31:19 -07:00
0cb86afd72 Revert D22908795: [pytorch][PR] Fix segfault in THPGenerator_dealloc
Test Plan: revert-hammer

Differential Revision:
D22908795 (d3acfe3ba8)

Original commit changeset: c5b6a35db381

fbshipit-source-id: c7559c382fced23cef683c8c90cff2d6012801ec
2020-08-03 21:03:44 -07:00
dc1f87c254 Add typing_extensions as a dependency. (#42431)
Summary:
Closes gh-38221.

The related pytorch/builder PR: https://github.com/pytorch/builder/pull/475

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42431

Reviewed By: malfet

Differential Revision: D22916499

Pulled By: ezyang

fbshipit-source-id: c8fe9413b62fc7a6b829fc82aaf32531b55994d1
2020-08-03 20:06:16 -07:00
c8cb5e5bcb Relax cusparse windows guard on cuda 11 (#42412)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42406

### cusparse Xcsrmm2 API:

(https://github.com/pytorch/pytorch/issues/37202)

- new: https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spmm
- old (deprecated in cuda 11): https://docs.nvidia.com/cuda/archive/10.2/cusparse/index.html#csrmm2

Before:

|cuda ver | windows | linux |
|--|--|--|
| 10.1 | old api | old api  |
| 10.2 | old api | new api |
| 11    | old api (build error claimed in https://github.com/pytorch/pytorch/issues/42406) | new api |

After:

|cuda ver | windows | linux |
|--|--|--|
| 10.1 | old api | old api  |
| 10.2 | old api | **old api** |
| 11    | **new api** | new api |

### cusparse bmm-sparse-dense API

<details><summary>reverted, will be revisited in the future</summary>
(cc kurtamohler https://github.com/pytorch/pytorch/issues/33430)

- new: https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spmm

Before:

|cuda ver | windows | linux |
|--|--|--|
| 10.1 | not supported | new api  |
| 10.2 | not supported | new api |
| 11    | not supported | new api |

After:

|cuda ver | windows | linux |
|--|--|--|
| 10.1 | not supported | new api  |
| 10.2 | not supported | new api |
| 11    | **new api** | new api |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42412

Reviewed By: agolynski

Differential Revision: D22892032

Pulled By: ezyang

fbshipit-source-id: cded614af970f0efdc79c74e18e1d9ea8a46d012
2020-08-03 19:59:59 -07:00
24199e0768 tuple_map / tuple_concat (#42326)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42326

ghstack-source-id: 108868289

Test Plan: Unit tests

Reviewed By: smessmer

Differential Revision: D22846504

fbshipit-source-id: fa9539d16e21996bbd80db3e3c524b174b22069e
2020-08-03 19:19:47 -07:00
1b18adb7e8 [ONNX] Export static as_strided (#41569)
Summary:
`as_strided` creates a view of an existing tensor with specified `sizes`, `strides`, and `storage_offsets`. This PR supports the export of `as_strided` with static argument `strides`. The following scenarios will not be supported:
* Calling on tensor of dynamic shape, i.e. the tensor shape differs between model runs and different model inputs.
* In-place operations, i.e. updates to the original tensor that are expected to reflect in the `as_strided` output, and vice versa.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41569

Reviewed By: VitalyFedyunin

Differential Revision: D22845295

Pulled By: bzinodev

fbshipit-source-id: 7d1aa88a810e6728688491478dbf029f17ae7201
2020-08-03 18:56:40 -07:00
04e55d69f9 [ONNX] Enable scripting tests and update jit passes (#41413)
Summary:
This PR initiates the process of updating the torchsciprt backend interface used by ONNX exporter.

- Replace jit lower graph pass by freeze module pass

- Enable ScriptModule tests for ONNX operator tests (ORT backend) and model tests by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41413

Reviewed By: VitalyFedyunin

Differential Revision: D22845258

Pulled By: bzinodev

fbshipit-source-id: d57fd4086f27bd0c3bf5f70af7fd0daa39a2814a
2020-08-03 18:51:19 -07:00
c000b890a8 [ONNX] Export torch.eye to ONNX::EyeLike (#41357)
Summary:
Exports dynamic torch.eye, i.e. commonly created from another tensor, where the shape for torch.eye is not known at export time.
Static torch.eye, where n and m are constants, is exported directly as a constant tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41357

Reviewed By: VitalyFedyunin

Differential Revision: D22845220

Pulled By: bzinodev

fbshipit-source-id: 6e5c331fa28ca542022ea16f9c88c69995a393b2
2020-08-03 18:51:17 -07:00
fb56299d4a Fix check highlight in filecheck. (#42417)
Summary:
* It originally failed to check for cases where the highlight token appears more than once.
* Now it repeatedly tries to find the highlight token, up to the end of the error message, if one doesn't appear correctly highlighted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42417

Reviewed By: SplitInfinity

Differential Revision: D22889411

Pulled By: gmagogsfm

fbshipit-source-id: 994835db32849f3d7e98ab7f662bd5c6b8a1662e
2020-08-03 18:49:22 -07:00
7a5708832f fix masked_select for discontiguous outputs (#41841)
Summary:
This fixes https://github.com/pytorch/pytorch/issues/41473 for discontiguous input, mask and out. Tests to follow. Reverting https://github.com/pytorch/pytorch/issues/33269 is not a great solution because I'm told masked_select was needed for printing complex tensors.
cc gchanan , zou3519, ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41841

Reviewed By: mruberry

Differential Revision: D22706943

Pulled By: ngimel

fbshipit-source-id: 413d7fd3f3308b184de04fd56b8a9aaabcad22fc
2020-08-03 18:43:45 -07:00
d707d4bf6d Implement a light SGD optimizer (#42137)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42137

This PR implements an SGD optimizer class similar to torch::optim::SGD, but it doesn't inherit from torch::optim::Optimizer, for use on mobile devices (or other lightweight use case).

Adding Martin's comment for visibility: "SGD may be the only optimizer used in near future. If more client optimizers are needed, refactoring the full optim codes and reusing the existing code would be an option."

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D22846514

Pulled By: ann-ss

fbshipit-source-id: f5f46804aa021e7ada7c0cd3f16e24404d10c7eb
2020-08-03 17:27:53 -07:00
934b68f866 ecr_gc: Iterate through all tags, reduce prints (#42492)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42492

There's a potential for multiple tags to be created for the same digest
so we should iterate through all potential tags so that we're not
deleting digests that are associated with tags that we actually want.

Also, reduced the number of prints in this script to only the absolutely
necessary prints (i.e. only the deleted images).

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22909248

Pulled By: seemethere

fbshipit-source-id: 7f2e540d133485ed6464e413b01ef67aa73df432
2020-08-03 16:59:56 -07:00
d3acfe3ba8 Fix segfault in THPGenerator_dealloc (#42490)
Summary:
A segfault happens when one tries to deallocate an uninitialized generator

Add `TestTorch.test_invalid_generator_raises` that validates that Generator created on invalid device is handled correctly

Fixes https://github.com/pytorch/pytorch/issues/42281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42490

Reviewed By: seemethere

Differential Revision: D22908795

Pulled By: malfet

fbshipit-source-id: c5b6a35db381738c0fc984aa54e5cab5ef2cbb76
2020-08-03 16:28:34 -07:00
dbdd28207c Expose a generic shape info struct for ONNXIFI Python interface (#42421)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42421

Previously, we could only feed shape info from Python with float dtype and batch-based dim type when doing onnxifi from Python. This diff removes that limitation and uses the TensorBoundShapes protobuf as a generic shape info struct. This will make the onnxifi interface in Python more flexible.

Reviewed By: ChunliF

Differential Revision: D22889781

fbshipit-source-id: 1a89f3a68c215a0409738c425b4e0d0617d58245
2020-08-03 16:10:05 -07:00
f0fd1cc873 Calculate inverse of output scale first. (#41342)
Summary:
This is to unify how output scale calculation is to be done between
fbgemm and qnnpack (servers vs mobile).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41342

Test Plan: Quantization tests.

Reviewed By: vkuzo

Differential Revision: D22506347

Pulled By: kimishpatel

fbshipit-source-id: e14d22f13c6e751cafa3e52617e76ecd9d39dad5
2020-08-03 14:45:08 -07:00
c3236b6649 [quant] Expose register activation post process hook function to user (#42342)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42342

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D22856711

fbshipit-source-id: d6ad080c82b744ae1147a656c321c448ac5e7f10
2020-08-03 12:28:42 -07:00
1b9cd747cf Revert "Conda build (#38796)" (#42472)
Summary:
This reverts commit 9c7ca89ae637a9cea52b4fee0877adc7485f4eb7.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42472

Reviewed By: ezyang, agolynski

Differential Revision: D22903382

Pulled By: seemethere

fbshipit-source-id: e2b01537bcdf6c50d967329833cb6450a75b8247
2020-08-03 12:08:13 -07:00
0eb513beef Set a proper type for a variable (#42453)
Summary:
`ninputs` variable was always used as a `size_t` but declared as an `int32_t`

Now, some annoying warnings are fixed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42453

Reviewed By: agolynski

Differential Revision: D22898282

Pulled By: mrshenli

fbshipit-source-id: b62d6b07f0bc3717482906df6010d88762ae0ccd
2020-08-03 11:44:37 -07:00
34025eb826 Vectorize arange (#38697)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38697

Benchmark (gcc 8.3, Debian Buster, turbo off, Release build, Intel(R)
Xeon(R) E-2136, Parallelization using OpenMP):

```python
import timeit
for dtype in ('torch.double', 'torch.float', 'torch.uint8', 'torch.int8', 'torch.int16', 'torch.int32', 'torch.int64'):
    for n, t in [(40_000, 50000),
                (400_000, 5000)]:
        print(f'torch.arange(0, {n}, dtype={dtype}) for {t} times')
        print(timeit.timeit(f'torch.arange(0, {n}, dtype={dtype})', setup=f'import torch', number=t))
```

Before:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
1.587841397995362
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.47885190199303906
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.5519152240012772
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.4733216500026174
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
1.426058754004771
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.43596178699226584
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
1.4289699140063021
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.43451592899509706
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.5714442400058033
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.14837959500437137
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.5964003179979045
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.15676555599202402
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8390555799996946
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23184613398916554
```

After:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
0.6895066159922862
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.16820953000569716
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.3640095089940587
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.39255041000433266
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
0.3422072059911443
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.0605111670010956
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
0.3449254590086639
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.06115841199061833
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.7745441729930462
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.22106765500211623
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.720475220005028
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.20230313099455088
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8144655400101328
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23762561299372464
```

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22291236

Pulled By: VitalyFedyunin

fbshipit-source-id: 134dd08b77b11e631d914b5500ee4285b5d0591e
2020-08-03 11:14:57 -07:00
fa6e900e8c Let TensorIterator::nullary_op support check_mem_overlap option (#38693)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38693

Test Plan: Imported from OSS

Differential Revision: D22291237

Pulled By: VitalyFedyunin

fbshipit-source-id: 5bc96e617ed36ed076da73e3d019699f2efd6e4e
2020-08-03 11:13:04 -07:00
ed44269edc Add missing space after -> for topk.values (#42321)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42321

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D22846520

Pulled By: ezyang

fbshipit-source-id: 7c0ab0b019d05a13309c3b8d770582414795799f
2020-08-03 10:10:20 -07:00
326d777e53 Convert _wait_all_workers to _all_gather (#42276)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42276

This commit converts `_wait_all_workers()` to `_all_gather()` by
allowing each worker to provide its own data object. The `_all_gather()`
function blocks and returns the gathered results. This API can be
converted to `rpc.barrier()` latter.

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D22853480

Pulled By: mrshenli

fbshipit-source-id: 9d506813b9fd5b7c144885e2b76a863cbd19466a
2020-08-03 08:48:45 -07:00
ebde590864 Remove debug vestige (#42277)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42277

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D22853481

Pulled By: mrshenli

fbshipit-source-id: 74e58c532d8f872c1dd830573b2a4c4c86410de2
2020-08-03 08:46:38 -07:00
4cdbe5c495 Implement batching rules for some view ops (#42248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42248

Including:
- torch.diagonal
- torch.t
- torch.select
- Tensor.expand_as
- Tensor slicing.

Please let me know in the future if it would be easier to review these
separately (I put five operators into this PR because each
implementation is relatively simple).

Test Plan:
- new tests in `test/test_vmap.py`.
- I would like to have a more structured/automated way of testing but
my previous attempts at making something resulted in something very
complicated.

Reviewed By: ezyang

Differential Revision: D22846273

Pulled By: zou3519

fbshipit-source-id: 8e45ebe11174512110faf1ee0fdc317a25e8b7ac
2020-08-03 08:01:48 -07:00
2f8d5b68fa vmap fallback kernel (#41943)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41943

If an operator doesn't have a batching rule implemented, then we fall back
to this implementation. The fallback only works on out-of-place operators
that return only tensors with new memory (e.g., no in-place operators,
no view operations).

The fallback effectively takes all of the BatchedTensors in `stack`,
slices them, and runs `op` on all of the corresponding slices to produce slices
of the outputs. The output slices then get `torch.stack`ed to create the
final returns.

The performance of the fallback is not very good because it introduces
an extra copy from stacking the sliced outputs. Because of this, we prefer
to write batching rules for operators whenever possible.

In the future, I'd like to disable the fallback kernel for random
functions until we have a better random story for vmap. I will probably
add a blocklist of operators to support that.

Test Plan: - `pytest test/test_vmap.py -v`

Reviewed By: ezyang

Differential Revision: D22764103

Pulled By: zou3519

fbshipit-source-id: b235833f7f27e11fb76a8513357ac3ca286a638b
2020-08-03 07:59:33 -07:00
192487d716 Update MAGMA to 2.5.3 for Windows (#42410)
Summary:
In order to introduce CUDA 11 build jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42410

Reviewed By: malfet

Differential Revision: D22892025

Pulled By: ezyang

fbshipit-source-id: 11bd7507f623d654a589ba00a138f6b947990f4c
2020-08-03 07:43:09 -07:00
ebfff31e19 [distributedhogwild] Introducing new tags for distributed hogwild. (#42381)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42381

Introduce new tag to support distributed hogwild.

Reviewed By: boryiingsu

Differential Revision: D20484099

fbshipit-source-id: 5973495589e0a7ab185d3867b37437aa747f408a
2020-08-03 07:10:44 -07:00
bfa94487b9 Remove register_mobile_autograd.cpp. (#42397)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42397

Since the autograd registration is unified to code-gen, we don't need to keep a manual registration file for mobile.
Remove it to avoid extra maintenance.

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D22883153

Pulled By: iseeyuan

fbshipit-source-id: 6db0bd89369beab9eed6e9a9692dd46f5bd1ff48
2020-08-02 14:14:33 -07:00
91c80d122a torch.gcd: Do not use std::abs() because it does not have an unsigned integer overload (#42254)
Summary:
`abs` doesn't have an unsigned overload across all compilers, so applying abs on uint8_t can be ambiguous: https://en.cppreference.com/w/cpp/numeric/math/abs

This may cause unexpected issues when the input is uint8 and is greater
than 128. For example, on MSVC, applying `std::abs` on an unsigned char
variable

```c++
#include <cmath>

unsigned char a(unsigned char x) {
    return std::abs(x);
}
```

gives the following warning:

    warning C4244: 'return': conversion from 'int' to 'unsigned char',
    possible loss of data

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42254

Reviewed By: VitalyFedyunin

Differential Revision: D22860505

Pulled By: mruberry

fbshipit-source-id: 0076d327bb6141b2ee94917a1a21c22bd2b7f23a
2020-08-01 23:03:33 -07:00
4cbf18ccc3 Enables integer -> float type promotion in TensorIterator (#42359)
Summary:
Many ufuncs (mostly unary ufuncs) in NumPy promote integer inputs to float. This typically occurs when the results of the function are not representable as integers.

For example:

```
a = np.array([1, 2, 3], dtype=np.int64)
np.sin(a)
: array([0.84147098, 0.90929743, 0.14112001])
```

In PyTorch we only have one function, `torch.true_divide`, which exhibits this behavior today, and it did it by explicitly pre-casting its inputs to the default (float) scalar type where necessary before calling TensorIterator.

This PR lets TensorIterator understand and implement this behavior directly, and it updates `torch.true_divide` to verify the behavior is properly implemented. This will be convenient when implementing more integer->float promotions later (like with `torch.sin`), and also saves copies on CUDA, where the cast from one dtype to another is fused with the computation.

The mechanism for this change is simple. A new flag, `promote_integer_inputs_to_float_`, is added to TensorIteratorConfig. When the new flag is set, after the TensorIterator's "common dtype" (AKA "computation type") is computed, it's checked for being an integral (boolean included) type and, if it is, changed to the default (float) scalar type instead. Only `torch.true_divide` sets this flag (for now).
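
For illustration, a small sketch of the promotion this enables through `torch.true_divide`:

```python
import torch

a = torch.tensor([1, 2, 3], dtype=torch.int64)
b = torch.tensor([2, 2, 2], dtype=torch.int64)
c = torch.true_divide(a, b)
print(c, c.dtype)  # tensor([0.5000, 1.0000, 1.5000]) torch.float32
```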

In the future we'll likely...
- provide helpers (`binary_float_op`, `unary_float_op`) to more easily construct functions that promote int->float instead of requiring they build their own TensorIteratorConfigs.
- update torch.atan2 to use `binary_float_op`
- update many unary ufuncs, like `torch.sin` to use `unary_float_op` and support unary ops having different input and result type (this will also require a small modification to some of the "loops" code)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42359

Reviewed By: ngimel

Differential Revision: D22878394

Pulled By: mruberry

fbshipit-source-id: b8de01e46be859321522da411aed655e2c40e5b9
2020-08-01 22:41:00 -07:00
d403983695 Support List[str].index (#39210) (#40348)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40348
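
For illustration, a minimal sketch of what now compiles (the function name is hypothetical):

```python
from typing import List

import torch

@torch.jit.script
def position(words: List[str], target: str) -> int:
    # List[str].index is now supported in TorchScript
    return words.index(target)

print(position(["a", "b", "c"], "b"))  # 1
```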

Test Plan: Imported from OSS

Reviewed By: wanchaol

Differential Revision: D22757035

Pulled By: firstprayer

fbshipit-source-id: 4fadf8beabf8d5bdfa5b0a185075f7caf9ba8b02
2020-08-01 13:47:25 -07:00
bdcf320bed Support custom exception message (#41907)
Summary:
Raise and assert used to have a hard-coded error message "Exception". User provided error message was ignored. This PR adds support to represent user's error message in TorchScript.
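
For illustration, a minimal sketch (hypothetical function) of a user-provided message that is now preserved:

```python
import torch

@torch.jit.script
def checked_sqrt(x: float) -> float:
    if x < 0.0:
        # previously reported as just "Exception"; the message is now kept
        raise ValueError("input must be non-negative")
    return x ** 0.5
```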

This breaks backward compatibility because now we actually need to script the user's error message, which can potentially contain unscriptable expressions. Such programs can break when scripting, but saved models can still continue to work.

Increased an op count in test_mobile_optimizer.py because now we need aten::format to form the actual exception message.

This is built upon a WIP PR: https://github.com/pytorch/pytorch/pull/34112 by driazati

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41907

Reviewed By: ngimel

Differential Revision: D22778301

Pulled By: gmagogsfm

fbshipit-source-id: 2b94f0db4ae9fe70c4cd03f4048e519ea96323ad
2020-08-01 13:03:45 -07:00
5769b06ab5 [Caffe2] Remove explicitly divide by zero in SpatialBN training mode (#42380)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42380

[Caffe2] Remove explicitly divide by zero in SpatialBN training mode

Test Plan: buck test mode/dev-nosan //caffe2/caffe2/python/operator_test:spatial_bn_op_test

Reviewed By: houseroad

Differential Revision: D22873214

fbshipit-source-id: 70b505391b5db02b45fc46ecd7feb303e50c6280
2020-08-01 11:54:58 -07:00
115d226498 Pin NumPy version on MacOS testers to 1.18.5 (#42409)
Summary:
Otherwise numba linking by clang-9 fails with:
```
ld: in /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/numpy/core/lib/libnpymath.a(npy_math.o), could not parse object file /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/numpy/core/lib/libnpymath.a(npy_math.o): 'Unknown attribute kind (61) (Producer: 'LLVM10.0.0' Reader: 'LLVM APPLE_1_902.0.39.2_0')', using libLTO version 'LLVM version 9.1.0, (clang-902.0.39.2)' for architecture x86_64
```
Because conda's numpy-1.19.1 is compiled with clang-10.
This should fix macOS regressions in CircleCI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42409

Reviewed By: xw285cornell

Differential Revision: D22887683

Pulled By: malfet

fbshipit-source-id: d58ee9bf53772b57c59e18f71151916d4f0a3c7d
2020-08-01 09:22:23 -07:00
2912390662 Limits cpu scalar error message to where it's appropriate (#42360)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40986.

TensorIterator's test for a CUDA kernel getting too many CPU scalar inputs was too permissive. This update limits the check to not consider outputs and to only be performed if the kernel can support CPU scalars.

A test is added to verify the appropriate error message is thrown in a case where the old error message was thrown previously.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42360

Reviewed By: ngimel

Differential Revision: D22868536

Pulled By: mruberry

fbshipit-source-id: 2bc8227978f8f6c0a197444ff0c607aeb51b0671
2020-08-01 02:04:30 -07:00
206db5c127 Improve torch.norm functionality, errors, and tests (#41956)
Summary:
**BC-Breaking Note:**
BC breaking changes in the case where keepdim=True. Before this change, when calling `torch.norm` with keepdim=True and p='fro' or p=number, leaving all other optional arguments as their default values, the keepdim argument would be ignored. Also, any time `torch.norm` was called with p='nuc', the result would have one fewer dimension than the input, and the dimensions could be out of order depending on which dimensions were being reduced. After the change, for each of these cases, the result has the same number and order of dimensions as the input.
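
For illustration, a small sketch of the keepdim change (shapes per the note above):

```python
import torch

x = torch.randn(2, 3)
# keepdim=True used to be ignored for a full 'fro' reduction;
# the result now keeps both dimensions
print(torch.norm(x, p='fro', keepdim=True).shape)  # torch.Size([1, 1])
```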

**PR Summary:**

* Fix keepdim behavior
* Throw descriptive errors for unsupported sparse norm args
* Increase unit test coverage for these cases and for complex inputs

These changes were taken from part of PR https://github.com/pytorch/pytorch/issues/40924. That PR is not going to be merged because it overrides `torch.norm`'s interface, which we want to avoid. But these improvements are still useful.

Issue https://github.com/pytorch/pytorch/issues/24802

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41956

Reviewed By: albanD

Differential Revision: D22837455

Pulled By: mruberry

fbshipit-source-id: 509ecabfa63b93737996f48a58c7188b005b7217
2020-08-01 01:55:12 -07:00
44b018ddeb Convert ProcessGroupNCCLTest.cpp to gtest unittest (#42365)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42365

Converting the test

Reviewed By: malfet

Differential Revision: D22855087

fbshipit-source-id: dc917950dcf99ec7036e48aaa4264d2c455cb19e
2020-07-31 20:34:11 -07:00
f47e00bdc3 [NNC] Bounds Inference: make inferred bounds respect gaps (#42185)
Summary:
A heavy refactor of bounds inference to fix some issues and bugs blocking using it to analyze cross thread interactions:
* We were merging all accesses to a Buf into a single bounds info entry, even if they did not overlap. E.g. if we accessed a[0:2] and a[5:6] we would merge that into a bound of a[0:6]. I've changed this behaviour to merge only overlapping bounds.
* We were not separating bounds of different kinds (e.g. Load vs Store) and would merge a Store bounds into a Load bounds, losing the information about what kind of access it was. E.g. this loop would produce bounds: [{Load, 0, 10}] and now produces bounds [{Load, 0, 9}, {Store, 1, 10}]:
```
for i in 1 to 10...
  x[i] = x[i-1]
```
* Both ComputeAt and Rfactor relied on the overzealous merging and only used a single entry in the bounds list to determine the bounds of temporary buffers they created, which could result in temporary buffers allocated smaller than accesses to them. I've fixed Rfactor, but *not* ComputeAt - however all ComputeAt tests still pass (may require loop fusion to trigger this issue) - I will come back to it.

Being more precise about bounds is more complex, rather than taking the minimum of starts and maximum of stops we now need to determine if two bounds overlap or are adjacent. There are many edge cases and so I've added a bunch of test coverage of the merging method.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42185

Reviewed By: mruberry

Differential Revision: D22870391

Pulled By: nickgg

fbshipit-source-id: 3ee34fcbf0740a47259defeb44cba783b54d0baa
2020-07-31 20:22:04 -07:00
dcc4d11ffa [TensorExpr] Make tensorOrConstant non-templatized function. (#42202)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42202

Currently we use the template in order to be able to take both
`std::vector<ExprHandle>` and `std::vector<VarHandle>`. However, the
semantics of this function tell us that the only allowed option should be
the former one: we're specifying indices for the tensor access we want
to generate. While it could be convenient to avoid conversion from
vector of vars to a vector of exprs at the callsites, it makes the code
less explicit and thus more difficult to reason about.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D22806429

Pulled By: ZolotukhinM

fbshipit-source-id: 8403af5fe6947c27213050a033e79a09f7075d4c
2020-07-31 20:05:24 -07:00
2decccea2e [TensorExpr] Implement shape inference for TE. (#41451)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41451

Since TE operates on a limited subset of ops with a well-defined
semantics, we can easily infer shapes of intermediate and output tensors
given shapes of the inputs.

There is a couple of ops that are not yet supported in the shape
inference, once we add them we could relax the shape info requirements
in the TE fuser: currently it requires all values in the fusion group to
have shapes known and we can change it to only inputs.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D22543470

Pulled By: ZolotukhinM

fbshipit-source-id: 256bae921028cb6ec3af91977f12bb870c385f40
2020-07-31 20:05:21 -07:00
f41bb1f92b [TensorExpr] Explicitly cast to bool results of comparison ops in kernel.cpp. (#42201)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42201

Previously, we've been using operators <, >, ==, et al. and relied on
the dtype to be picked automatically. It led to a wrong dtype being
picked for the result, but that choice was overwritten by the type
explicitly specified in JIT IR, which we were lowering. Now we are
moving towards using shape inference instead of relying on all types
being specified in the IR, and that made this issue immediately pop up.

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D22806428

Pulled By: ZolotukhinM

fbshipit-source-id: 89d2726340efa2bb3da45d1603bedc53955e14b9
2020-07-31 20:05:19 -07:00
f8c5800bb5 [TensorExpr] Add debug dumps to kernel.cpp. (#42196)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42196

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D22803676

Pulled By: ZolotukhinM

fbshipit-source-id: 109372ca45d86478826190b868d005d2fb2c9ba7
2020-07-31 20:02:21 -07:00
655f376460 Implement Enum sugared value and Enum constant support (#42085)
Summary:
[3/N] Implement Enum JIT support

* Add enum value as constant support
* Add sugared value for EnumClass

Supported (see the sketch below):
* Enum-typed function arguments
* Using Enum types and comparing them
* Getting name/value attrs of enums
* Using Enum values as constants

TODO:
* Add Python sugared value for Enum
* Support Enum-typed return values
* Support serialization and deserialization
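
For illustration, a minimal sketch of the supported usage (the enum is hypothetical):

```python
from enum import Enum

import torch

class Color(Enum):
    RED = 1
    GREEN = 2

@torch.jit.script
def is_red(c: Color) -> bool:
    # Enum-typed argument, comparison, and value attribute access
    return c == Color.RED and c.value == 1

print(is_red(Color.RED))  # True
```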

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42085

Reviewed By: eellison

Differential Revision: D22758042

Pulled By: gmagogsfm

fbshipit-source-id: 5c6e571686c0b60d7fbad59503f5f94b3b3cd125
2020-07-31 17:29:55 -07:00
ff91b169c7 Changes to match Fused Op: Dequantize->Swish->Quantize (#42255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42255

Changes to match Fused Op: Dequantize->Swish->Quantize
* Changes to scale handling

Results showing matching intermediate and final Swish_Int8 Op.
P137389801

Test Plan: test case test_deq_swish_quant_nnpi.py

Reviewed By: hyuen

Differential Revision: D22827499

fbshipit-source-id: b469470ca66f6405ccc89696694af372ce6ce89e
2020-07-31 16:54:39 -07:00
1542c41a67 Change C++ frontend to take optional<Tensor> arguments (#41947)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41947

Previously, if an op took an optional `Tensor?` argument, the C++ frontend (i.e. `at::op()` and `Tensor::op()`)
were generated to take `Tensor`. A previous PR (https://github.com/pytorch/pytorch/pull/41610) changed the kernels
to be written with `c10::optional<Tensor>` instead of `Tensor`, but that did not touch the C++ frontend yet.

This PR changes the C++ frontend API to take `c10::optional<Tensor>` instead of `Tensor` as well.
This should be mostly bc conserving. Since `Tensor` implicitly converts to `c10::optional<Tensor>`, any old code
calling an op with a `Tensor` would still work. There are likely corner cases that get broken though.
For example, C++ only ever does *one* implicit conversion. So if you call an op with a non-tensor object
that gets implicitly converted to a `Tensor`, then that previously worked since the API took a `Tensor` and
C++ allows one implicit conversion. Now it wouldn't work anymore because it would require two implicit conversions
(to `Tensor` and then to `c10::optional<Tensor>`) and C++ doesn't do that.

The main reasons for doing this are
- Make the C++ API more sane. Those arguments are optional and that should be visible from the signature.
- Allow easier integration for XLA and Autocast. Those backends generate code to wrap operators and forward
  operator arguments to calls to at::op(). After https://github.com/pytorch/pytorch/pull/41610, there was
  a mismatch because they had to implement operators with `optional<Tensor>` but call `at::op()` with `Tensor`,
  so they had to manually convert between those. After this PR, they can just forward the `optional<Tensor>`
  in their call to `at::op()`.
ghstack-source-id: 108873705

Test Plan: unit tests

Reviewed By: bhosmer

Differential Revision: D22704832

fbshipit-source-id: f4c00d457b178fbc124be9e884a538a3653aae1f
2020-07-31 16:11:55 -07:00
3a19af2427 Make operators with optional Tensor? arguments c10-full (#41610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41610

Previously, operators that have a `Tensor?` (i.e. optional tensor) in their schema implemented it using `Tensor` in C++ and filled in an undefined tensor for the None case.
The c10 operator library, however, expects `Tensor?` to be represented as `optional<Tensor>`, so those operators couldn't be c10-full yet and still had to use codegenerated unboxing instead of templated unboxing.

This PR changes that. It extends the `hacky_wrapper_for_legacy_signatures` to not only take case of TensorOptions, but now also map between signatures taking `Tensor` and `optional<Tensor>`.
For this, it requires an additional template parameter, the expected signature, and it uses that to go argument-by-argument and unwrap any optionals it finds.
ghstack-source-id: 108873701

Test Plan: waitforsandcastle

Reviewed By: bhosmer

Differential Revision: D22607879

fbshipit-source-id: 57b2fb01a294b804f82cd55cd70f0ef4a478e14f
2020-07-31 16:09:08 -07:00
f502290e91 [JIT] Make create autodiff subgraphs do in place updates to aliasDb (#42141)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42141

Update alias db in-place instead of having to construct alias db from scratch on each change, causing O(n^2) behavior.

Description from https://github.com/pytorch/pytorch/pull/37106 holds pretty well:
"""
Recomputing the aliasdb on every fusion iteration + in every subblock
is hugely expensive. Instead, update it in-place when doing fusion.

The graph fuser pass operates by pushing nodes into a fusion group. So
we start with

`x, y = f(a, b, c)`

and end with:
```
x_out, y_out = prim::fusionGroup(a, b, c)
   x_in, y_in = f(a_in, b_in, c_in)
   -> x_in, y_in
```

We destroy the x and y Value*s in the process. This operation is
easy to express as an update to the aliasDb--x_out just takes on all
the aliasing information x used to have. In particular, since we know
f and prim::fusionGroup are purely functional, we don't have to mess
with any write information.
"""

The one difficulty here is that mapping x, y to x_out, y_out is not trivial when merging nodes into the autodiff subgraph node.
There are a few options:
- attempt to make all subgraph utils & ir cloning logic update a map
- mirror the subgraph utils implementation in create_autodiff_subgraph
- uniquely map x, y and x_in, y_in so you can back out the correspondence.

I went with the third option.

This shouldn't affect the results of the pass at all. LMK if you think there's anything else I should be doing to test, I was thinking about maybe exposing an option to run create autodiff subgraphs without the post processor and check that the alias db was correctly updated.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D22798377

Pulled By: eellison

fbshipit-source-id: 9a133bcaa3b051c0fb565afb23a3eed56dbe71f9
2020-07-31 15:13:32 -07:00
2285a2fc11 refactor canonical ordering to also be able to do isAfter checks (#42140)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42140

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D22798378

Pulled By: eellison

fbshipit-source-id: d1a549f43b28fe927729597818a46674c58fe81d
2020-07-31 15:11:40 -07:00
4fc525e729 [Dper3] Implementation of squeezed input to DC++
Summary:
This Diff provides an option for DC++ module to use the squeezed sparse feature embeddings to generate attention weights, with the purpose of reducing the network size to achieve QPS gains. There are 3 squeeze options: sum, max, and mean, along the embedding dimension and are provided for both the attention weights and resnet generation.
Example workflow: f208474456

{F257199459}

Test Plan:
1. Test single ops
buck test dper3/dper3/modules/low_level_modules/tests:single_operators_test -- test_reduce_back_mean
buck test dper3/dper3/modules/low_level_modules/tests:single_operators_test -- test_reduce_back_max
2. Test DC++ module
buck test dper3/dper3/modules/tests:core_modules_test -- test_dc_pp_arch_one_layer_compressed_embeddings_only_squeeze_input
buck test dper3/dper3/modules/tests:core_modules_test -- test_dc_pp_arch_shared_input_squeeze_input
buck test dper3/dper3/modules/tests:core_modules_test -- test_dc_pp_input_compress_embeddings_squeeze_input
3. Test Arch
buck test dper3/dper3_models/ads_ranking/model_impl/sparse_nn/tests:sparse_nn_lib_test -- test_dense_sparse_interaction_compress_dot_arch_dot_compress_pp_squeezed_input
4. e2e test
buck test dper3/dper3_models/ads_ranking/tests:model_paradigm_e2e_tests -- test_sparse_nn_compress_dot_attention_fm_max_fc_size_squeeze_input

Reviewed By: taiqing

Differential Revision: D22825069

fbshipit-source-id: 29269ea22cb47d487a1c92a1f6daae1055f54cfc
2020-07-31 14:31:43 -07:00
a01e91e6b2 [pytorch] include all overloads for OSS custom build
Summary:
For mobile custom build, we only generate code for ops that are used by
specific models to reduce binary size.

There are multiple places where we apply the op filtering:
- generated_unboxing_wrappers_*.cpp
- autograd/VariableType*.cpp
- c10 op registration (in aten/gen.py)

For c10 op registration, we filter by the main op name - all overloads
that match the main op name part will be kept.

For generated_unboxing_wrappers_*, we filter by the full op name - only
those having exactly the same overload name will be kept.

This PR changes generated_unboxing_wrappers_* and autograd/VariableType*.cpp
codegen to also filter by the main op name.

The reasons are:
- keeping all overloads can have better backward compatibility;
- generated_unboxing_wrappers_* are relatively small as it only contains
  thin wrappers for root ops.
- generated_unboxing_wrappers_* will be replaced by c10 op registration
  soon anyway.
- autograd/VariableType*.cpp are not included in OSS build.

Why does this offer better backward compatibility? #40737 is an example:
it introduced a new `_convolution` overload and renamed the original one
to `_convolution.deprecated`. Before this PR, a model prepared by an
old version of PyTorch wouldn't be able to run on the custom mobile build
generated on the PR because `_convolution.deprecated` won't be kept in
the custom build due to full op name matching policy. By relaxing it to
partial matching policy, the mobile custom build CI on the PR can pass.

Will test the size impact for FB production build before landing.

Differential Revision: D22809564

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Pulled By: ljk53

fbshipit-source-id: e2fc017da31f38b9430cc2113f33e6d21a0eaf0b
2020-07-31 12:43:31 -07:00
38bf5be24f [quant] Use PlaceholderObserver instead of Fp16Observer and NoopObserver (#42348)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42348

Use the dtype info in placeholderObserver to decide what ops to insert in the graph
In the next PR we can delete NoopObserver

Test Plan:
python test/test_quantization.py

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D22859457

fbshipit-source-id: a5c618f22315534ebd9a2df77b14a0aece196989
2020-07-31 12:33:56 -07:00
6bd46b583e [quant][graph] Add support for FP16 dynamic quant (#42222)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42222

This change adds the necessary passes to perform FP16 dynamic quantization.
We skip inserting observers for activations based on the dtype (torch.float16) and only insert the Fp16Observer for weights

Test Plan:
python test/test_quantization.py TestQuantizeJitOps

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D22849220

fbshipit-source-id: 2c53594ecd2485e9e3dd0b380eceaf7c5ab5fc50
2020-07-31 12:33:53 -07:00
8c5bf10264 [quant] Add FP16Observer for fp16 quant support (#42221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42221

Adds a new observer that emits a warning if the range of the tensor is beyond the fp16 range. This will be further used in graph mode quantization to insert the cast-to-fp16 ops into the graph.

Test Plan:
python test/test_quantizaton.py TestObserver.test_fp16_observer

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D22849222

fbshipit-source-id: a301281ce38ba4d4e7a009308400d34a08c113d2
2020-07-31 12:33:51 -07:00
a9eebaf693 [quant] Add saturate_to_fp16 op for FP16 quant support (#42147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42147

Op to check the range of a tensor and clamp the values to fp16 range
This operator will be inserted into the graph in subsequent diffs.

Test Plan:
python test/test_quantization.py TestQuantizedTensor.test_fp16_saturate_op

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D22849221

fbshipit-source-id: 0da3298e179750f6311e3a09596a7b8070509096
2020-07-31 12:31:07 -07:00
bdd9ef1981 Support RowWiseSparseAdam on GPU (#35404)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35404

Implement RowWiseSparseAdam on CUDA

Reviewed By: xw285cornell

Differential Revision: D20650225

fbshipit-source-id: 5f871e2f259e362b713c9281b4d94534453995cf
2020-07-31 10:47:29 -07:00
a9e7e787f8 [jit] make clone works for interface type (#42121)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42121

This PR changes the Module API to allow registering a module with a module
interface type, and therefore allows Module::clone to work in the case
where a module interface type is shared by two submodules.

The interface type will be shared by the new cloned instance in the same
compilation unit because it only
contains a list of functionSchema, which does not involve any
attributes, compared to classType.

fixes https://github.com/pytorch/pytorch/issues/41882

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D22781205

Pulled By: wanchaol

fbshipit-source-id: f97f4b75970f0b434e38b5a1f778eda2c4e5109b
2020-07-31 10:24:27 -07:00
352e15f1a2 Revert D22812445: Update TensorPipe submodule
Test Plan: revert-hammer

Differential Revision:
D22812445 (2335430086)

Original commit changeset: e6d824bb28f5

fbshipit-source-id: 606632a9aaf2513b5ac949e4d6687aa7563eae5d
2020-07-31 10:16:48 -07:00
832b1659e7 Fix missing attribute when loading model from older version (#42242) (#42290)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42242

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42290

Reviewed By: VitalyFedyunin

Differential Revision: D22844096

Pulled By: albanD

fbshipit-source-id: 707e552e0ed581fbe00f1527ab7426880edaed64
2020-07-31 09:03:07 -07:00
4c6878c97d [gloo] change ProcessGroupGlooAsyncTest to use gtest (#42313)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42313

Changes the tests in `ProcessGroupGlooAsyncTest.cpp` to use the Gtest testing framework.

Reviewed By: malfet

Differential Revision: D22821577

fbshipit-source-id: 326b24a334ae84a16434d0d5ef27d16ba4b90d5d
2020-07-31 08:54:50 -07:00
0adb584376 Make resize_ use normal device dispatch (#42240)
Summary:
`resize_` only requires manual registration to `Autograd` key and its device kernels can safely live together with our normal device dispatch in `native_functions.yaml`.
But currently we do manual registration for `CPU/CUDA` kernels (and leave no dispatch in native_functions.yaml), which makes `resize_` non-overrideable from a backend point of view. While it indeed should dispatch at the device level, this caused xla to whitelist `resize_` and register a lowering to the XLA key. This PR moves the device dispatch of `resize_` back to `native_functions.yaml` so that it properly shows up as an `abstract` method for downstream extensions.
Note that we also do manual registration for `copy_/detach_/resize_as_/etc` in aten but they are slightly different than `resize_` since for them we only register `catchAll` kernels instead of device kernels. I'll need to investigate and send a followup PR for those ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42240

Reviewed By: VitalyFedyunin

Differential Revision: D22846311

Pulled By: ailzhang

fbshipit-source-id: 10b6cf99c4ed3d62fc4e1571f4a2a463d1b88c81
2020-07-31 02:15:27 -07:00
2f840b1662 Warns when TensorIterator would resize its output (#42079)
Summary:
See https://github.com/pytorch/pytorch/issues/41027.

This adds a helper to resize output to ATen/native/Resize.* and updates TensorIterator to use it. The helper throws a warning if a tensor with one or more elements needs to be resized. This warning indicates that these resizes will become an error in a future PyTorch release.
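
For illustration, a minimal sketch of a resize that now warns:

```python
import torch

out = torch.empty(3)
# out has the wrong shape, so TensorIterator resizes it and emits a
# UserWarning that such resizes are deprecated
torch.add(torch.ones(5), torch.ones(5), out=out)
```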

 There are many functions in PyTorch that will resize their outputs and don't use TensorIterator. For example,

985fd970aa/aten/src/ATen/native/cuda/NaiveConvolutionTranspose2d.cu (L243)

And these functions will need to be updated to use this helper, too. This PR avoids their inclusion since the work is separable, and this should let us focus on the function and its behavior in review. A TODO appears in the code to reflect this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42079

Reviewed By: VitalyFedyunin

Differential Revision: D22846851

Pulled By: mruberry

fbshipit-source-id: d1a413efb97e30853923bce828513ba76e5a495d
2020-07-30 22:39:16 -07:00
e54f268a7a Enables torch.full bool and integer type inference (#41912)
Summary:
After being deprecated in 1.5 and throwing a runtime error in 1.6, we can now enable torch.full inferring its dtype when given bool and integer fill values. This PR enables that inference and updates the tests and docs to reflect this.
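
For illustration, a short sketch of the newly inferred dtypes:

```python
import torch

print(torch.full((2,), 7).dtype)     # torch.int64, previously an error
print(torch.full((2,), True).dtype)  # torch.bool
print(torch.full((2,), 1.5).dtype)   # torch.float32 (default dtype, unchanged)
```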

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41912

Reviewed By: albanD

Differential Revision: D22836802

Pulled By: mruberry

fbshipit-source-id: 33dfbe4d4067800c418b314b1f60fab8adcab4e7
2020-07-30 22:39:13 -07:00
31d41f987a torch.where : Scalar Support (#40336)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/38349 #9190

TODO
* [x] Add Tests
* [x] Update Docs
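
For illustration, a minimal sketch of the scalar support:

```python
import torch

x = torch.randn(4)
# 'other' may now be a Python scalar instead of a tensor
print(torch.where(x > 0, x, 0.0))
```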

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40336

Reviewed By: albanD

Differential Revision: D22813834

Pulled By: mruberry

fbshipit-source-id: 67c1693c059a301b249213afee3c25cea9f64fec
2020-07-30 22:36:53 -07:00
1c8217a7a6 Abstract cuda calls made from torch_python (#42251)
Summary:
* Make c10::cuda functions regular non-inlined functions
* Add driver_version() and device_synchronize() functions

With this change I no longer see direct calls to the CUDA API when looking at Modules.cpp.obj

FYI malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42251

Reviewed By: malfet

Differential Revision: D22826505

Pulled By: ziab

fbshipit-source-id: 8dc2f3e209d3710e2ce78411982a10e8c727573c
2020-07-30 19:18:33 -07:00
fbb052c2cc BlackList to BlockList (#42279)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41701 blackList convention to blockList convention

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42279

Reviewed By: VitalyFedyunin

Differential Revision: D22843178

Pulled By: malfet

fbshipit-source-id: c9be5a5f084dfd0e46545d4a3d1124ef59277604
2020-07-30 18:06:49 -07:00
27c22b9b3c Modify function to take dtype as argument
Summary: To avoid repeating to() casts for every argument of the function

Test Plan: CI

Reviewed By: malfet

Differential Revision: D22833521

fbshipit-source-id: ae0a8f70339cd6adfeea2f552d35bbcd48b11cf7
2020-07-30 16:27:55 -07:00
b5fcd89479 Add tests to sigmoid_backward and fmod (#42289)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42289

`sigmoid_backward` and `fmod` are covered neither in `test/cpp/api` nor in `ATen/test`. Add test functions to cover them.

Test Plan:
1. Test locally and check new lines are covered
2. CI

Reviewed By: malfet

Differential Revision: D22804912

fbshipit-source-id: ea50ef0ef3dcf3940ac950d74f6f1cb38d8547a7
2020-07-30 16:26:13 -07:00
7d6c4f62ef Remove 4 unused variables in lp_pool_op.cc (#42329)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42329

Reviewed By: VitalyFedyunin

Differential Revision: D22850894

Pulled By: mrshenli

fbshipit-source-id: 1e91380a432525b83c0bb0bfef0d5067c767cb67
2020-07-30 15:50:17 -07:00
153673c33b fix quantized elu benchmark (#42318)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42318

We forgot to update this benchmark when quantized elu's signature
changed to require observation, fixing.

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qactivation_test
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D22845251

fbshipit-source-id: 1443f6f0deac695715b1f2bd47f0f22b96dc72ca
2020-07-30 14:57:12 -07:00
5ff54ff4ff import freeze (#42319)
Summary:
torch.jit.freeze was broken by https://github.com/pytorch/pytorch/pull/41154/files#diff-9084cd464651f7fa1ff030d2edd9eb55R1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42319

Reviewed By: ZolotukhinM

Differential Revision: D22845476

Pulled By: eellison

fbshipit-source-id: bc9e50678d0e0ffca4062854ccc71bbef2e1a97b
2020-07-30 13:00:11 -07:00
344defc973 Let bfloat16 support promotion with other types (#41698)
Summary:
Fix https://github.com/pytorch/pytorch/issues/40580
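
A short sketch of the promotion this enables, assuming the standard type-promotion rules apply to bfloat16 as they do to other floating types:

```python
import torch

a = torch.tensor([1.0], dtype=torch.bfloat16)
b = torch.tensor([2.0], dtype=torch.float32)
(a + b).dtype                                     # torch.float32
(a + torch.tensor([1], dtype=torch.int64)).dtype  # torch.bfloat16
```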

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41698

Reviewed By: albanD

Differential Revision: D22824042

Pulled By: mruberry

fbshipit-source-id: 7dad9c12dc51d8f88c3ca963ae9c5f8aa2f72277
2020-07-30 12:28:09 -07:00
c489bbe122 Add typing support to torch._six (#42232)
Summary:
Also add a `__prepare__` method to the metaclass created by `with_metaclass` to conform with PEP 3115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42232

Reviewed By: ezyang

Differential Revision: D22816936

Pulled By: malfet

fbshipit-source-id: a47d054b2f061985846d0db6b407f4e5df97b0d4
2020-07-30 12:12:46 -07:00
26d58503c2 Implementing NumPy-like function torch.signbit() (#41589)
Summary:
- Related to https://github.com/pytorch/pytorch/issues/38349
- Implementing the NumPy-like function `torch.signbit()`.
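
A minimal usage sketch; note that `signbit` distinguishes `-0.` from `0.`:

```python
import torch

torch.signbit(torch.tensor([0.7, -1.2, 0., -0.]))
# tensor([False,  True, False,  True])
```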

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41589

Reviewed By: albanD

Differential Revision: D22835249

Pulled By: mruberry

fbshipit-source-id: 7988f7fa8f591ce4b6a23ac884ee7b3aa718bcfd
2020-07-30 11:21:15 -07:00
c35faae10d [pytorch][ci] install nightly instead of stable libtorch for mobile CIs (#42220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42220

Mobile custom build CI jobs need the desktop version of libtorch to
prepare models and dump root ops.

Ideally we should use the libtorch built on the PR so that backward
incompatible changes won't break this script - but it will significantly
slow down mobile CI jobs.

This PR changed it to install nightly instead of stable so that we have
an option to temporarily skip mobile CI jobs on BC-breaking PRs until
they are in nightly.

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D22810484

Pulled By: ljk53

fbshipit-source-id: eb5f7b762a969d1cfeeac2648816be546bd291b6
2020-07-30 11:07:14 -07:00
ce546328a3 Const-correctness, variable initialization, and error checking. (#42124)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42124

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D22835543

Pulled By: AshkanAliabadi

fbshipit-source-id: 29b7619b7bc6dd346eec91b8a2b6cc6a76769bcf
2020-07-30 11:04:24 -07:00
d0ed1e303f Add missing header guards. (#42272)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42272

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D22835546

Pulled By: AshkanAliabadi

fbshipit-source-id: c880199acaf0ad11c3db4ac9f9f2d000038f98f1
2020-07-30 11:04:21 -07:00
ee2150370e Add Vulkan Test to ATen Mobile Tests. (#42123)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42123

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D22835544

Pulled By: AshkanAliabadi

fbshipit-source-id: 08bce5d94ed8c966d25707f69e51b16d5b45febd
2020-07-30 11:04:19 -07:00
7cd92aaa6b Disable validation layers in non-debug builds. (#42122)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42122

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D22835545

Pulled By: AshkanAliabadi

fbshipit-source-id: b0eee550c8d727c79b5d45a7e1d603379ae3af5c
2020-07-30 11:01:51 -07:00
8e3d1908b6 Fix minor typo in comment (#42184)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42184

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D22809375

Pulled By: ezyang

fbshipit-source-id: 322a4c2059b612a10c6257013bbf2fd207e75df7
2020-07-30 09:48:22 -07:00
86b2faeb53 Automated submodule update: FBGEMM (#42302)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: e04b9ce034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42302

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: efiks

Differential Revision: D22841424

fbshipit-source-id: 211463b0207da986fc5b451242ae99edf32b9f68
2020-07-30 08:56:34 -07:00
f15af2fe4f Remove unused variable "schema" (#42245)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42245

Reviewed By: albanD

Differential Revision: D22835223

Pulled By: mrshenli

fbshipit-source-id: 94f0cbddb36feefc8a136ef38b0a74d22b305680
2020-07-30 08:40:36 -07:00
547bbdac86 Add MSFT Owners to the Windows Maintainership (#42280)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42280

Reviewed By: albanD

Differential Revision: D22836782

Pulled By: soumith

fbshipit-source-id: a38f91e381abc0acf3ab41e05ff70611926091ac
2020-07-30 08:22:13 -07:00
269ec767ca [Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D22838806

fbshipit-source-id: 29039585c82bb214db860d582cc4e269ab990c85
2020-07-30 04:01:20 -07:00
2335430086 Update TensorPipe submodule (#42225)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42225

Main changes:
- Consolidated CMake files to have a single entry point, rather than having a specialized one for PyTorch.
- Changed the way the preprocessor flags are provided, and changed their name.

There were a few instances in PyTorch's CMake files where we were directly adding TensorPipe's source directory as an include path, which however doesn't contain the auto-generated header we now added. We fix that by adding the `tensorpipe` CMake target as a dependency, so that the include paths defined by TensorPipe are used, which contain that auto-generated header.

I'm turning off SHM and CMA for now because they have never been covered by the CI. I'll enable them in a separate PR so that if they turn out to be flaky we can revert that change without reverting this one.

Test Plan: CircleCI is all green.

Reviewed By: beauby

Differential Revision: D22812445

fbshipit-source-id: e6d824bb28f5afe75fd765de0430968174f3531f
2020-07-30 02:32:52 -07:00
4f163df41a [caffe2] Special handling of If/AsyncIf op in RemoveOpsByType (#42286)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42286

One more bug to fix. Operators such as If and AsyncIf need special treatment not just in `onnx::SsaRewrite`, but also in `RemoveOpsByType`. The solution needs two steps:
1) add external inputs/outputs of the subnets of If/AsyncIf op to the inputs/outputs of the op
2) if the inputs/outputs of the If/AsyncIf op need to be renamed as a result, the same inputs/outputs of the subnets need to be renamed as well.

I also added unit tests to cover this corner case.

Test Plan:
```
buck test //caffe2/caffe2/fb/predictor:black_box_predictor_test

mkdir /tmp/models
rm -rf /tmp/$USER/snntest
rm -rf /tmp/snntest
buck run mode/opt admarket/lib/ranking/prediction_replayer/snntest_replayer_test/tools:snntest_replay_test -- --serving_paradigm=USER_AD_PRECOMPUTATION_DSNN
```

Differential Revision: D22834028

fbshipit-source-id: c070707316cac694f452a96e5c80255abf4014bc
2020-07-30 02:02:20 -07:00
f30ac66e79 [caffe2] Fix a performance bug in Dedup SparseAdagrad op (#42287)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42287

We shouldn't use block_size for thread dimensions in linear_index_weight_offsets_dedup_kernel, since the kernel doesn't iterate the embedding dimensions.
ghstack-source-id: 108834058

Test Plan:
```
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_sum_gradient \(caffe2\.caffe2\.fb\.net_transforms\.tests\.fuse_sparse_ops_test\.TestFuseSparseOps\)' --print-passing-details
```

Reviewed By: jspark1105

Differential Revision: D22800959

fbshipit-source-id: 641d52a51070715c04f9fd286e7e22ac62001f61
2020-07-30 01:00:59 -07:00
0444bac940 Add test to cross function
Summary: Function `cross_kernel_scalar` is not covered in `Aten/native/cpu/CrossKernel.cpp`; add tests to cover it

Test Plan:
1. Test locally to check new lines are covered
2. CI

https://pxl.cl/1fZjG

Reviewed By: malfet

Differential Revision: D22834122

fbshipit-source-id: 0d50f3a3e6aee52cb6fdee2b9f5883f542c7b6e2
2020-07-29 22:48:52 -07:00
9ea7476d9c Add test to lerp function (#42266)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42266

Functions `lerp_kernel_scalar` and `lerp_kernel_tensor` are not covered in `Aten/native/cpu/LerpKernel.cpp`; add tests to cover them

Test Plan:
1. Test locally to check new lines are covered
2. CI

https://pxl.cl/1fXPd

Reviewed By: malfet

Differential Revision: D22832164

fbshipit-source-id: b1eaabbf8bfa08b4dedc1a468abfdfb619a50e3c
2020-07-29 22:47:37 -07:00
7459da268e Add typing annotations to torch.random (#42234)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42234

Reviewed By: ezyang

Differential Revision: D22816933

Pulled By: malfet

fbshipit-source-id: 9e2124ad16fed339abd507f6e474cb63feb7eada
2020-07-29 22:16:08 -07:00
872237c1f2 Output to stderr in distributed tests. (#42139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42139

A bunch of tests were failing with buck since we would output to
stdout and buck would fail parsing stdout in some cases.

Moving these print statements to stderr fixes this issue.
ghstack-source-id: 108606579

Test Plan: Run the offending unit tests.

Reviewed By: mrshenli

Differential Revision: D22779135

fbshipit-source-id: 789af3b16a03b68a6cb12377ed852e5b5091bbad
2020-07-29 19:23:34 -07:00
fe4f19e164 [CUDA] max_pool2d NCHW performance improvement (#42182)
Summary:
Fix the regression introduced in https://github.com/pytorch/pytorch/issues/38953.

Please see https://github.com/xwang233/code-snippet/blob/master/max-pool2d-nchw-perf/max-pool2d.ipynb for detailed before & after performance comparisons.

Performance improvement for backward max_pool2d before and after this PR (negative value means speed up)

![image](https://user-images.githubusercontent.com/24860335/88712204-363c8e00-d0ce-11ea-8586-057e09b16103.png)

The modulo in the forward kernel doesn't seem to benefit much from a similar change, so I did not change the forward pass. 1718f0ccfd

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42182

Reviewed By: albanD

Differential Revision: D22829498

Pulled By: ngimel

fbshipit-source-id: 4c81968fe072f4e264e70c70ade4c32d760a3af4
2020-07-29 19:01:31 -07:00
c18223f9ef add Dimname support to IValue (#42054)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42054

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D22750398

Pulled By: bhosmer

fbshipit-source-id: 7028268093f86b33c4117868b0edcb9e1ca6f7ee
2020-07-29 16:30:26 -07:00
6c251f74b2 replace black_list/blacklist with blocklist/block_list (#42089)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41734

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42089

Reviewed By: pbelevich

Differential Revision: D22794556

Pulled By: SplitInfinity

fbshipit-source-id: 4404845b6293b076b3c8cc02b135b20c91397a79
2020-07-29 16:26:02 -07:00
27b03d62de [HT] Clear the device placement tag for the auto gen sum so that we could break the component for FC sharing the same input (#42219)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42219

Introduce a new extra info that is tagged on the forward net for operators sharing the same input. The effect is that the auto-generated gradient sum for the input will not follow the operator tags in the forward net. This allows more flexible device allocation.

Test Plan:
# unit test
`./buck-out/gen/caffe2/caffe2/python/core_gradients_test#binary.par -r  testMultiUseInputAutoGenSumDevice`

Reviewed By: xianjiec, boryiingsu

Differential Revision: D22609080

fbshipit-source-id: d558145e5eb36295580a70e1ee3a822504dd439a
2020-07-29 15:21:27 -07:00
7cdf786a07 fix typo in GradScaler docstring (#42236)
Summary:
Closes https://github.com/pytorch/pytorch/issues/42226.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42236

Reviewed By: albanD

Differential Revision: D22817980

Pulled By: ngimel

fbshipit-source-id: 4326fe028dba1dbeed454edc4e4d4fffa56f51d6
2020-07-29 13:14:57 -07:00
79cfd85987 grad detach_ only when it has grad_fn in zero_grad call (#41283)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41283

In optimizer.zero_grad(), detach_ is useful for avoiding a memory leak only when the grad has a grad_fn, so add a check that calls grad.detach_ only when the grad has a grad_fn.
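
A hedged Python sketch of the guarded logic (illustrative, not the exact source):

```python
def zero_grad(params):
    # detach_ breaks the autograd graph, which is only needed (and only
    # avoids a leak) when a graph is actually attached to the grad.
    for p in params:
        if p.grad is not None:
            if p.grad.grad_fn is not None:
                p.grad.detach_()
            else:
                p.grad.requires_grad_(False)
            p.grad.zero_()
```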
ghstack-source-id: 108702289

Test Plan: unit test

Reviewed By: mrshenli

Differential Revision: D22487315

fbshipit-source-id: 861909b15c8497f1da57f092d8963d4920c85e38
2020-07-29 11:40:13 -07:00
4b6e5f42a4 Creates spectral ops test suite (#42157)
Summary:
In preparation for creating the new torch.fft namespace and NumPy-like fft functions, as well as supporting our goal of refactoring and reducing the size of test_torch.py, this PR creates a test suite for our spectral ops.

The existing spectral op tests from test_torch.py and test_cuda.py are moved to test_spectral_ops.py and updated to run under the device generic test framework.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42157

Reviewed By: albanD

Differential Revision: D22811096

Pulled By: mruberry

fbshipit-source-id: e5c50f0016ea6bb8b093cd6df2dbcef6db9bb6b6
2020-07-29 11:36:18 -07:00
029007c8b6 Improved coverage for unboxed->boxed kernel wrappers (#38999)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38999

Adds boxing for inplace and outplace kernels, itemizes
remaining unsupported cases, and fails compilation when
new unsupported types are introduced in op signatures.

Test Plan: Imported from OSS

Differential Revision: D21718547

Pulled By: bhosmer

fbshipit-source-id: 03295128b21d1843e86789fb474f38411b26a8b6
2020-07-29 11:31:16 -07:00
60f51542dc [Caffe2] Fix spatial_bn bug for computing running_var on CPU or on CUDA without CuDNN (#42151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42151

Previously our Caffe2 SpatialBN op implementation computed running_var incorrectly, without the unbiased coefficient. This should have failed the tests because the output differs from CuDNN's output; however, our tests were too weak to find this bug. This diff fixes all of them.

Test Plan: buck test mode/dev-nosan //caffe2/caffe2/python/operator_test:spatial_bn_op_test

Reviewed By: houseroad

Differential Revision: D22786127

fbshipit-source-id: db80becb67d60c44faae180c7e4257cb136a266d
2020-07-29 11:20:03 -07:00
91546a4b0f Environment variable for controlling type verbosity in debug output (#41906)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41906

Fixes #41770

Test Plan:
Example:
```
import torch
def bar():
    def test(a):
        return a
    x = torch.ones(10,10, device='cpu')
    print(torch.jit.trace(test, (x)).graph)
bar()
```

Bash:
```
for i in 0 1 2 3; do
  PYTORCH_JIT_TYPE_VERBOSITY=$i python test.py
done
```

Output:
```
graph(%0):
  return (%0)

graph(%0 : Float(10, 10)):
  return (%0)

graph(%0 : Float(10:10, 10:1)):
  return (%0)

graph(%0 : Float(10:10, 10:1, requires_grad=0, device=cpu)):
  return (%0)
```

Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D22687966

fbshipit-source-id: cd395257d79a4baa35245c778a74a55d1ea2a842
2020-07-29 11:17:24 -07:00
01b794f169 Operator-level Benchmark Test for Per Tensor and Per Channel Fake Quantization (#41974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41974

In this diff, 2 new sets of benchmark tests are added to the `quantization` benchmark suite where operator-level benchmarking is conducted for the learnable Python operators, the learnable c++ kernels, and the original non-backprop c++ kernels.

Test Plan:
Inside the path `torch/benchmarks/operator_benchmark` (The root directory will be `caffe2` inside `fbcode` if working on a devvm):
- On a devvm, run the command `buck run pt:fake_quantize_learnable_test`
- On a personal laptop, run the command `python3 -m pt.fake_quantize_learnable_test`

Benchmark Results (On devGPU with 0% volatile utilization -- all GPUs are free):
Each sample has dimensions **3x256x256**;

### In **microseconds** (`1e-6` second),

|                           | Python Module | C++ Kernel | Non-backprop C++ Kernel |
|---------------------------|---------------|------------|-------------------------|
| Per Tensor CPU Forward    | 3112.666      | 3270.740   | 3596.864                |
| Per Tensor Cuda Forward   | 797.258       | 258.961    | 133.953                 |
| Per Channel CPU Forward   | 6587.693      | 6931.461   | 6352.417                |
| Per Channel Cuda Forward  | 1579.576      | 555.723    | 479.016                 |
| Per Tensor CPU Backward   | 72278.390     | 22466.648  | 12922.195               |
| Per Tensor Cuda Backward  | 6512.280      | 1546.218   | 652.942                 |
| Per Channel CPU Backward  | 74138.545     | 41212.777  | 14131.576               |
| Per Channel Cuda Backward | 6795.173      | 4321.351   | 1052.066                |

Reviewed By: z-a-f

Differential Revision: D22715683

fbshipit-source-id: 8be528b790663413cbeeabd4f68bbca00be052dd
2020-07-29 11:12:17 -07:00
48acdfd505 add tests to BinaryOpsKernel -- max/min kernel (#42198)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42198

1. add tests to max/min kernel

Test Plan:
1. Run locally to check cover the corresponding code part in BinaryOpsKernel.cpp.
2. CI

Reviewed By: malfet

Differential Revision: D22796019

fbshipit-source-id: 84c8d7df509de453c4ec3c5e38977733b0ef3457
2020-07-29 10:35:40 -07:00
382781221d Extending Learnable Fake Quantize module to support gradient scaling and factory (partial) construction (#41969)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41969

In this diff, the `_LearnableFakeQuantize` module is extended to provide support for gradient scaling, where the gradients for both scale and zero point are multiplied by a constant `g` (which, in some cases, can help with quicker convergence). In addition, it is also augmented to provide a factory method via `_with_args` such that a partial constructor of the module can be built.

Test Plan:
For correctness of the fake quantizer operators, on a devvm, enter the following command:
```
buck test //caffe2/torch:quantization -- learnable_py_module
```

Reviewed By: z-a-f

Differential Revision: D22715629

fbshipit-source-id: ff8e5764f81ca7264bf9333789f57e0b0cec7a72
2020-07-29 10:22:26 -07:00
0a64f99162 [JIT] Dont include view ops in autodiff graphs (#42027)
Summary:
View ops as outputs of differentiable subgraphs can cause incorrect differentiation. For now, do not include them in the subgraph. This was observed with our autograd tests for MultiheadAttention and nn.Transformer, which currently fail with the legacy executor. This commit fixes those test failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42027

Reviewed By: pbelevich

Differential Revision: D22798133

Pulled By: eellison

fbshipit-source-id: 2f6c08953317bbe013933c6faaad20100376c039
2020-07-29 10:17:33 -07:00
b45b82b006 Fix type annotation for DistributedDataParallel (#42231)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42231

Reviewed By: albanD

Differential Revision: D22816589

Pulled By: mrshenli

fbshipit-source-id: a355f7e2fa895617bf81ef681b051f074d39ab8c
2020-07-29 10:12:20 -07:00
c8e15842aa Automated submodule update: FBGEMM (#42205)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: cad1c21404

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42205

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: dskhudia

Differential Revision: D22806731

Pulled By: efiks

fbshipit-source-id: 779a9f7f00645e7e65f183e2832dc79117eae5fd
2020-07-29 09:26:18 -07:00
460970483d Revert D22790718: [pytorch][PR] Enables torch.full bool and integer type inference
Test Plan: revert-hammer

Differential Revision:
D22790718 (6b3f335641)

Original commit changeset: 8d1eb01574b1

fbshipit-source-id: c321177cce129a6c83f1a7b26bd5ed94a343ac0f
2020-07-29 07:52:04 -07:00
90074bbfa6 implement numpy-like functionality isposinf, isneginf (#41588)
Summary:
Related https://github.com/pytorch/pytorch/issues/38349

Numpy-like functionalities `isposinf` and `isneginf` are implemented.
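
A minimal usage sketch of the two new functions:

```python
import torch

t = torch.tensor([float('inf'), -float('inf'), 1.0])
torch.isposinf(t)  # tensor([ True, False, False])
torch.isneginf(t)  # tensor([False,  True, False])
```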

Test-Plan:
- pytest test/test_torch.py -k "test_isposinf_isneginf"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41588

Reviewed By: ngimel

Differential Revision: D22770732

Pulled By: mruberry

fbshipit-source-id: 7448653e8fb8df6b9cd4604a4739fe18a1135578
2020-07-29 03:29:31 -07:00
1c5c289b62 [pt] Add incude_last_offset option to EmbeddingBag mean and max (#42215)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42215

Specifically on https://github.com/pytorch/pytorch/pull/27477#discussion_r371402079

We would like include_last=True to be supported for the other reduction types like mean and max as well. The current gap causes further code fragmentation in DPER (https://www.internalfb.com/intern/diff/D22794469/).

More details: https://www.internalfb.com/intern/diff/D22794469/?dest_fbid=309597093427021&transaction_id=631457624153457
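
A hedged sketch of what this enables (include_last_offset previously worked only for mode='sum'):

```python
import torch

bag = torch.nn.EmbeddingBag(10, 3, mode='mean', include_last_offset=True)
input = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9])
# offsets carries one extra trailing element equal to len(input), so bag i
# covers input[offsets[i]:offsets[i + 1]]:
offsets = torch.tensor([0, 4, 8])
out = bag(input, offsets)  # shape (2, 3): bags input[0:4] and input[4:8]
```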

ghstack-source-id: 108733009

Test Plan:
```
buck test mode/dev-nosan //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu"
```

```
(base) [jianyuhuang@devbig281.ftw3.facebook.com: ~/fbsource/fbcode/caffe2/test] $ TORCH_SHOW_CPP_STACKTRACES=1 buck test mode/dev-nosan //caffe2/test:
nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu" --print-passing-details
Parsing buck files: finished in 1.2 sec
Building: finished in 5.5 sec (100%) 10130/10130 jobs, 2 updated
  Total time: 6.7 sec
More details at https://www.internalfb.com/intern/buck/build/dbdc2063-69d8-45cb-9146-308a9e8505ef
First unknown argument: --print-passing-details.
Falling back to TestPilot classic.
Trace available for this run at /tmp/testpilot.20200728-195414.1422748.log
TestPilot test runner for Facebook. See https://fburl.com/testpilot for details.
Testpilot build revision cd2638f1f47250eac058b8c36561760027d16add fbpkg f88726c8ebde4ba288e1172a348c7f46 at Mon Jul 27 18:11:43 2020 by twsvcscm from /usr/local/fbprojects/packages/testinfra.testpilot/887/t.par
Discovering tests
Running 1 test
Started new test run: https://our.intern.facebook.com/intern/testinfra/testrun/844425097242375
      ✓ caffe2/test:nn - test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu (test_nn.TestNNDeviceTypeCPU) 0.162 1/1 (passed)
Test output:
> /data/users/jianyuhuang/fbsource/fbcode/buck-out/dev/gen/caffe2/test/nn#binary,link-tree/torch/_utils_internal.py:103: DeprecationWarning: This is a NOOP in python >= 3.7, its just too dangerous with how we write code at facebook. Instead we patch os.fork and multiprocessing which can raise exceptions if a deadlock would happen.
>   threadSafeForkRegisterAtFork()
> /usr/local/fbcode/platform007/lib/python3.7/importlib/_bootstrap.py:219: ImportWarning: can't resolve package from __spec__ or __package__, falling back on __name__
and __path__
>   return f(*args, **kwds)
> test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu (test_nn.TestNNDeviceTypeCPU) ... Couldn't download test skip set, leaving all tests enabled...
> ok
>
> ----------------------------------------------------------------------
> Ran 1 test in 0.162s
>
> OK
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/844425097242375
Summary (total time 5.54s):
  PASS: 1
  FAIL: 0
  SKIP: 0
  FATAL: 0
  TIMEOUT: 0
  OMIT: 0
Did _not_ run with tpx. See https://fburl.com/tpx for details.
```

Reviewed By: dzhulgakov

Differential Revision: D22801881

fbshipit-source-id: 80a624465727081bb9bf55c28419695a3d79c6e5
2020-07-29 01:20:00 -07:00
6b3f335641 Enables torch.full bool and integer type inference (#41912)
Summary:
After being deprecated in 1.5 and throwing a runtime error in 1.6, torch.full can now infer its dtype from bool and integer fill values. This PR enables that inference and updates the tests and docs to reflect it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41912

Reviewed By: pbelevich

Differential Revision: D22790718

Pulled By: mruberry

fbshipit-source-id: 8d1eb01574b1977f00bc0696974ac38ffdd40d9e
2020-07-28 23:11:08 -07:00
8c653e05ff DOC: fail to build if there are warnings (#41335)
Summary:
Merge after gh-41334 and gh-41321 (EDIT: both are merged).
Closes gh-38011

This is the last in a series of PRs to build documentation without warnings. It adds `-WT --keepgoing` to the sphinx build, which will [fail the build if there are warnings](https://www.sphinx-doc.org/en/master/man/sphinx-build.html#cmdoption-sphinx-build-W), print a [traceback on error](https://www.sphinx-doc.org/en/master/man/sphinx-build.html#cmdoption-sphinx-build-T) and [finish the build](https://www.sphinx-doc.org/en/master/man/sphinx-build.html#cmdoption-sphinx-build-keep-going) even when there are warnings.

It should fail now, but pass once the PRs mentioned at the top are merged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41335

Reviewed By: pbelevich

Differential Revision: D22794425

Pulled By: mruberry

fbshipit-source-id: eb2903e50759d1d4f66346ee2ceebeecfac7b094
2020-07-28 22:33:44 -07:00
4b108ca763 refactor save_data as non member function (#42045)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42045

This PR changes the save_data() member function of torch::jit::mobile::Module, which was introduced in #41403, to be the non-member function torch::jit::mobile::_save_parameters() (taking a mobile Module as its first argument).

In addition, this PR:
* adds a getter function _ivalue() for the mobile::Module object
* renames torch::jit::mobile::_load_mobile_data() to torch::jit::mobile::_load_parameters()
* refactors the import.h header file into import.h and import_data.h

Test Plan: Imported from OSS

Reviewed By: kwanmacher, iseeyuan

Differential Revision: D22766781

Pulled By: ann-ss

fbshipit-source-id: 5cabae31927187753a958feede5e9a28d71d9e92
2020-07-28 21:52:32 -07:00
8fc5adc88e Remove dead named_tensors_unsupported_error definitions. (#42171)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42171

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D22794980

Pulled By: ezyang

fbshipit-source-id: 250b6566270e19240361d758db55101d6fcb33e9
2020-07-28 21:40:28 -07:00
8deb4fe809 Fix flaky NCCL error handling tests. (#42149)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42149

Some of these tests were flaky since we could kill the process in some
way without cleaning up the ProcessGroup. This resulted in issues where the
FileStore didn't clean up appropriately, causing other processes in the
group to crash.

Fixed this by explicitly deleting the process_group before we bring a process
down forcibly.
ghstack-source-id: 108629057

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D22785042

fbshipit-source-id: c31d0f723badbc23b7258e322f75b57e0a1a42cf
2020-07-28 18:38:26 -07:00
b6a9f42758 Add appropriate error messages for ProcessGroupNCCLTest (#42143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42143

Replaces the original makeshift error messages in ProcessGroupNCCLTest
with more appropriate ones.
ghstack-source-id: 108711579

Test Plan: Ran the tests on DevGPU

Reviewed By: mrshenli

Differential Revision: D22778505

fbshipit-source-id: 27109874f0b474a74b09f588cf6e7528d2069702
2020-07-28 18:31:23 -07:00
e4c3f526c8 Fixed Skipping Logic in ProcessGroupNCCLErrors tests (#42192)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42192

This PR fixes the complicated skipping logic for ProcessGroupNCCLErrors Tests - it correctly logs the reason for skipping tests when GPUs are not available or the NCCL version is too old.

This is part of a broader effort to improve the testing of the ProcessGroup and Collectives tests.
ghstack-source-id: 108620568

Test Plan: Tested on devGPU and devvm. Tests are run correctly on GPU and skipped on CPU as expected.

Reviewed By: mrshenli

Differential Revision: D22782856

fbshipit-source-id: 6071dfdd9743f45e59295e5cee09e89c8eb299c9
2020-07-28 16:59:40 -07:00
b2ef7fa359 Add a flag to enforce fp32 to fp16 conversion for all inputs of the onnxifi net. (#39931)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39931

ATT.

Reviewed By: yinghai, ChunliF

Differential Revision: D21993492

fbshipit-source-id: ff386e6e9b95a783906fc1ae6a62462e6559a20b
2020-07-28 16:48:43 -07:00
8a644f0c13 [Shape Inference] Fix InferFC
Summary: Sometimes the first dim of X in FC is BATCH_OF_FEATURE_MAX instead of BATCH. This caused an issue in f207899183 (when the first dim of X is 64 but is set to 1 in inferFC). Change the check from `!= BATCH` to `== UNKNOWN`

Test Plan: unit test

Reviewed By: yinghai

Differential Revision: D22784691

fbshipit-source-id: eb66ba361d6fe75672b13edbac2fbd269a7e7a00
2020-07-28 16:43:19 -07:00
30eacb5fb6 [quant][graphmode] Support stack (#42187)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42187

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D22801229

fbshipit-source-id: 7d1758c4fb1c8f742a275c3a631605f0f0d08e44
2020-07-28 16:35:34 -07:00
deac621ae2 Stop building PyTorch for VS2017 (#42144)
Summary:
And since CUDA-9.2 is incompatible with VS2019, disable CUDA-9.2 for Windows as well

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42144

Reviewed By: pbelevich

Differential Revision: D22794475

Pulled By: malfet

fbshipit-source-id: 24fc980e6fc75240664b9de8a4a63b1153f8d8ee
2020-07-28 16:09:21 -07:00
3c084fd358 Dequant => Swish => Quant Test case. (#41976)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41976

Dequant => Swish => Quant Test case.

(Note: this ignores all push blocking failures!)

Test Plan: test_deq_swish_quant_nnpi.py.

Reviewed By: hyuen

Differential Revision: D22718593

fbshipit-source-id: 1cee503a27e339af6d89c819007511b90bb6610c
2020-07-28 16:05:12 -07:00
e2344db886 Use Python3.7 when running OSX builds/tests (#42191)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42191

Reviewed By: seemethere

Differential Revision: D22801091

Pulled By: malfet

fbshipit-source-id: b589343ef1bc6896d3d6d8d863f75aa3a102d985
2020-07-28 16:00:54 -07:00
4c7fb8c2b6 make FusionCallback refer to specified GraphFuser context (#41560)
Summary:
Fixes an issue where
 - the top-level fuser's block_ was captured by the callback due to the [&] capture,
 - recursive/nested fusers would compare erroneously against the top-level block_ instead of their own block_

Closes (https://github.com/pytorch/pytorch/issues/39810)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41560

Reviewed By: Krovatkin

Differential Revision: D22583196

Pulled By: wconstab

fbshipit-source-id: 8f543cd9ea00e116cf3e776ab168cdd9fed69632
2020-07-28 15:01:24 -07:00
8ddd2c4e1b [pytorch] fix code analyzer for LLVM 9 & 10 (#42135)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42135

Tested the code analyzer with LLVM 9 & 10 and fixed a couple issues:
- Rename local demangle() which is available as public API since LLVM 9;
- Fix falsely associated op registrations due to the `phi` instruction;

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D22795508

Pulled By: ljk53

fbshipit-source-id: 2d47af088acd3312a7ea5fd9361cdccd48940fe6
2020-07-28 14:57:07 -07:00
fd9205e14b Enable caffe2 tests for RocM jobs (#41604)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41604

Reviewed By: ezyang

Differential Revision: D22603703

Pulled By: malfet

fbshipit-source-id: 789ccf2bb79668a5a68006bb877b2d88fb569809
2020-07-28 14:21:42 -07:00
4d17ecb071 Changed Blacklisted to Blocklisted (#42100)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42100

Reviewed By: ngimel

Differential Revision: D22780380

Pulled By: SplitInfinity

fbshipit-source-id: d465c41f1d4951ab6de55cb827c7ef53975209af
2020-07-28 13:21:26 -07:00
030ab2bda5 Replaced whitelist reference with allowlist (#42071)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41741

Replaced whitelist reference with allowlist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42071

Reviewed By: pbelevich

Differential Revision: D22795176

Pulled By: SplitInfinity

fbshipit-source-id: bcf1b8afe516b9684ce0298bc257ef81152ba20c
2020-07-28 12:29:33 -07:00
64965c4572 Replaced blacklist with blocklist (#42097)
Summary:
Closes https://github.com/pytorch/pytorch/issues/41726

Fixes https://github.com/pytorch/pytorch/issues/41726

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42097

Reviewed By: ngimel

Differential Revision: D22779535

Pulled By: SplitInfinity

fbshipit-source-id: 1d414af22a1b3e856a11d64cff4b4d33160d957b
2020-07-28 12:08:54 -07:00
5ed7cd0025 Allow drop_last option in DistributedSampler (#41171)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41171

DistributedSampler allows data to be split evenly across workers in
DDP, but it has always added additional samples in order for the data to be
evenly split in the case that the # of samples is not evenly divisible by the
number of workers. This can cause issues, such as when computing distributed
validation accuracy, where some samples could be counted twice.

This PR adds a drop_last option where the tail of the data is dropped such that
the effective dataset size is still evenly divisible across the workers. This
ensures that DDP can train fine (there is no uneven inputs) and each replica
gets an equal number of data indices.
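
A minimal usage sketch of the new option (the tiny dataset is illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(10).float())  # 10 samples, 4 replicas
# drop_last=True trims the tail so every rank sees the same number of
# samples with no duplicates (2 per rank here, instead of a padded 3):
sampler = DistributedSampler(dataset, num_replicas=4, rank=0, drop_last=True)
loader = DataLoader(dataset, sampler=sampler)
```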
ghstack-source-id: 108617516

Test Plan: Added unittest

Reviewed By: mrshenli

Differential Revision: D22449974

fbshipit-source-id: e3156b751f5262cc66437b9191818b78aee8ddea
2020-07-28 11:33:08 -07:00
48ae5945de Skip TestExtractPredictorNet if compiled without OpenCV (#42168)
Summary:
Found while trying to get RocM Caffe2 CI green

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42168

Reviewed By: seemethere

Differential Revision: D22791879

Pulled By: malfet

fbshipit-source-id: 8f7ef9711bdc5941b2836e4c8943bb95c72ef8af
2020-07-28 11:26:55 -07:00
f666be7bc1 [vulkan] support add for dim < 4 (#41222)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41222

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22754937

Pulled By: IvanKobzarev

fbshipit-source-id: f8c5e55c965c0a805e75c63b21f410fb0c323515
2020-07-28 11:15:37 -07:00
b3a9e21a29 [vulkan] mm op through addmm (#41221)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41221

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22754938

Pulled By: IvanKobzarev

fbshipit-source-id: f9a0f48d7943a85b7dbb3fc9edf9e214ba07543b
2020-07-28 11:13:48 -07:00
b0424a895c Raise RuntimeError for zero stride pooling (#41819)
Summary:
Close https://github.com/pytorch/pytorch/issues/41767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41819

Reviewed By: mrshenli

Differential Revision: D22780634

Pulled By: ngimel

fbshipit-source-id: 376ce5229ad5bd60804d839340d2c6505cf3288d
2020-07-28 11:07:12 -07:00
5aa2b572ff replace black list with block (#42091)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41729

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42091

Reviewed By: pbelevich

Differential Revision: D22792096

Pulled By: ezyang

fbshipit-source-id: caafa42d12cbad377b67ddbaba8f84a2b8c98066
2020-07-28 10:23:51 -07:00
2f61aca17b Skip DataIO tests relying on LevelDB if compiled without it (#42169)
Summary:
Found while trying to get RocM Caffe2 job green

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42169

Reviewed By: seemethere

Differential Revision: D22791896

Pulled By: malfet

fbshipit-source-id: 9df6233876aec5ead056365499bab970aa7e8bdc
2020-07-28 10:18:26 -07:00
73ff252913 Back out "[NCCL] DDP communication hook: getFuture()" (#42152)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42152

Original commit changeset: 8c059745261d

Test Plan: .

Reviewed By: ajtulloch, jianyuh

Differential Revision: D22786183

fbshipit-source-id: 51155389d37dc82ccb4d2fa20d350f9d14abeaca
2020-07-28 10:05:35 -07:00
2de549518e Make fmod work with zero divisors consistently (#41948)
Summary:
Currently `torch.tensor(1, dtype=torch.int).fmod(0)` crashes (floating point exception).

This PR should fix this issue.
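
A hedged sketch of the intended behavior after the fix (raising a catchable error instead of crashing the process; the exact error type is not stated in the summary):

```python
import torch

try:
    torch.tensor(1, dtype=torch.int).fmod(0)  # previously died with SIGFPE
except RuntimeError as e:
    print("caught:", e)

torch.tensor(1.0).fmod(0.0)  # float fmod follows IEEE semantics: tensor(nan)
```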

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41948

Reviewed By: ngimel

Differential Revision: D22771081

Pulled By: ezyang

fbshipit-source-id: a94dd35d6cd85daa2d51cae8362004e31f97989e
2020-07-28 08:58:39 -07:00
e7ed0b3fae Avoid zero division in _cubic_interpolate (#42093)
Summary:
I encountered a zero division problem when using LBFGS:

```
File "/home/yshen/anaconda3/lib/python3.7/site-packages/torch/optim/lbfgs.py", line 118, in _strong_wolfe
    bracket[1], bracket_f[1], bracket_gtd[1])
File "/home/yshen/anaconda3/lib/python3.7/site-packages/torch/optim/lbfgs.py", line 21, in _cubic_interpolate
    d1 = g1 + g2 - 3 * (f1 - f2) / (x1 - x2)
ZeroDivisionError: float division by zero
```

My solution is to check whether the line-search bracket is too small before calling _cubic_interpolate.
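
A hedged Python sketch of the guard (names and the epsilon are illustrative, not the exact patch):

```python
import math

def cubic_interpolate(x1, f1, g1, x2, f2, g2, eps=1e-10):
    # Fall back to bisection when the line-search bracket is too small,
    # avoiding the division by (x1 - x2) shown in the traceback above.
    if abs(x1 - x2) < eps:
        return (x1 + x2) / 2.0
    d1 = g1 + g2 - 3 * (f1 - f2) / (x1 - x2)
    d2 = math.sqrt(max(d1 ** 2 - g1 * g2, 0.0))
    lo, hi = min(x1, x2), max(x1, x2)
    if x1 <= x2:
        t = x2 - (x2 - x1) * ((g2 + d2 - d1) / (g2 - g1 + 2 * d2))
    else:
        t = x1 - (x1 - x2) * ((g1 + d2 - d1) / (g1 - g2 + 2 * d2))
    return min(max(t, lo), hi)
```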

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42093

Reviewed By: pbelevich

Differential Revision: D22770667

Pulled By: mrshenli

fbshipit-source-id: f8fdfcbd3fd530235901d255208fef8005bf898c
2020-07-28 08:32:00 -07:00
f0c46878c6 Fix the GPU skip message issue (#41378) (#41973)
Summary:
Related to https://github.com/pytorch/pytorch/issues/41378

Fix the GPU skip message issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41973

Reviewed By: pbelevich

Differential Revision: D22753459

Pulled By: mrshenli

fbshipit-source-id: d24b531926e28b860ae90b9ae07e8ca3438d21db
2020-07-28 08:28:31 -07:00
3acd6b7359 Document formatting (#42065)
Summary:
Apply syntax highlighting to the command in `README.md`. This makes `README.md` easier to read.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42065

Reviewed By: pbelevich

Differential Revision: D22753418

Pulled By: mrshenli

fbshipit-source-id: ebfa90fdf60478c34bc8a7284d163e0254cfbe3b
2020-07-28 08:27:42 -07:00
14e75fbdb9 Remove py2 specific code from test_utils.py (#42105)
Summary:
As mentioned in https://github.com/pytorch/pytorch/issues/23795, drop Python 2 support. albanD
Fixes https://github.com/pytorch/pytorch/issues/31796

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42105

Reviewed By: ngimel

Differential Revision: D22765768

Pulled By: mrshenli

fbshipit-source-id: bae114a21cd5598004c7f92d313938ad826b4a24
2020-07-28 08:25:40 -07:00
86492410bc Don't run tests with custom arguments with pytest (#41397)
Summary:
This patch basically removes the `-m pytest` parameters when `extra_unittest_args` is used (e.g. `--subprocess`)

Fixes https://github.com/pytorch/pytorch/issues/41393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41397

Reviewed By: pbelevich

Differential Revision: D22792133

Pulled By: ezyang

fbshipit-source-id: 29930d703666f4ecc0d727356bbab4a5f7ed4860
2020-07-28 08:17:36 -07:00
672ed3c06b replace onnx producer_version when updating results (#41910)
Summary:
xref gh-39002 which handled the reading but not the writing of the onnx expect files, and the last comment in that PR which points out `XXX` was suboptimal.
xref [this comment](https://github.com/pytorch/pytorch/pull/37091#discussion_r456460168) which pointed out the problem.

This PR:
- replaces `XXX` with `CURRENT_VERSION` in the stored files
- ensures that updating the results with the `--accept` flag will maintain the change

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41910

Reviewed By: pbelevich

Differential Revision: D22758671

Pulled By: ezyang

fbshipit-source-id: 47c345c66740edfc8f0fb9ff358047a41e19b554
2020-07-28 08:15:01 -07:00
b282297559 Replace whitelist with allowlist (#42067)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41757

I've replaced all instances of whitelist with allowlist for this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42067

Reviewed By: pbelevich

Differential Revision: D22791690

Pulled By: malfet

fbshipit-source-id: 638c13cf49915f5c83bd79c7f4a39b8390cc15b4
2020-07-28 08:01:16 -07:00
1a8269a566 Replace blacklist with blocklist in test/run_test.py file. (#42011)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41716
test/run_test.py file updated with an appropriate replacement for blacklist and whitelist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42011

Reviewed By: pbelevich

Differential Revision: D22791836

Pulled By: malfet

fbshipit-source-id: 8139649c5b70c876b711e25c33f3051ea8461063
2020-07-28 07:56:01 -07:00
e179966248 [caffe2][tpx] log to stderr (#42162)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42162

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D22791440

fbshipit-source-id: 14f16cd7a94a57161c5724177b518527f486232d
2020-07-28 07:50:27 -07:00
0571cfd875 Implement MultiBatchVmapTransform::logicalToPhysical(TensorList) (#41942)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41942

This function:
- permutes all batch dims to the front of the tensors
- aligns all the batch dims to the collective levels of all the tensors
- expands all of the batch dims such that they are present in each of
the result tensors

This function is useful for the next diff up on the stack (which is
implementing a fallback kernel for BatchedTensor). It's also useful in
general for implementing batching rules on operators that take in
multiple batch dimensions at the front of each tensor (but we don't have
too many of those in PyTorch).

Test Plan: - `./build/bin/vmap_test`

Reviewed By: ezyang

Differential Revision: D22764104

Pulled By: zou3519

fbshipit-source-id: d42cc8824a1bcf258687de164b7853af52852f53
2020-07-28 07:45:25 -07:00
1994ab1473 Optimize alignBatchDimsAtFront (#41941)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41941

If we know that the tensor already has the desired aligned size, we
don't need to put in the effort to align it.

Test Plan: - `./build/bin/vmap_test`, `pytest test/test_vmap.py -v`

Reviewed By: albanD

Differential Revision: D22764101

Pulled By: zou3519

fbshipit-source-id: a2ab7ce7b98d405ae905f7fd98db097210bfad65
2020-07-28 07:45:23 -07:00
5124436af4 Fix const correctness for VmapPhysicalView struct methods (#41940)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41940

See title. I marked methods that don't mutate the VmapPhysicalView as
`const`.

Test Plan: - wait for tests

Reviewed By: albanD

Differential Revision: D22764102

Pulled By: zou3519

fbshipit-source-id: 40f957ad61c85f0e5684357562a541a2712b1f38
2020-07-28 07:43:09 -07:00
2bc7dae2fc Use new sccache for RocM builds (#42134)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42134

Reviewed By: seemethere

Differential Revision: D22782146

Pulled By: malfet

fbshipit-source-id: 85ba69a705600e30ae0eddbf654298b3dc6f96ed
2020-07-28 07:15:56 -07:00
6bd88f581a Revert D22790238: [caffe2][tpx] Use logger instead of print
Test Plan: revert-hammer

Differential Revision:
D22790238 (3c6fae6567)

Original commit changeset: c0a801cdf7f0

fbshipit-source-id: cadfbd22f7d3ce656624483c9a19062f7c9a5b61
2020-07-28 06:11:30 -07:00
3c6fae6567 [caffe2][tpx] Use logger instead of print
Test Plan: CI?

Differential Revision: D22790238

fbshipit-source-id: c0a801cdf7f0da489c67708a0eb1b498ff104c64
2020-07-28 04:26:51 -07:00
5336ccc1b2 [BugFix] Fix bug in onnx::SsaRewrite (#42148)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42148

Differential Revision: D22687388

fbshipit-source-id: facf7a186dd48d6f919d0ff5d42f756977c3f9f4
2020-07-28 01:44:47 -07:00
4f723825b4 [vulkan] adaptive_avg_pool2d (#41220)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41220

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22754943

Pulled By: IvanKobzarev

fbshipit-source-id: 91a94f32db005ebb693384f4d27efe66e2c33a14
2020-07-27 23:24:14 -07:00
0a0960126c If we don't collect tracing, always free the trace data (#42118)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42118

We toggle the trace on with a certain probability. In the case of three inferences with trace on/off/on, we leak the trace from the first inference. Always cleaning up the trace fixes it.

Test Plan:
predictor

I created a tiny repro here: D22786551

With this fix, this issue is gone.

Reviewed By: gcatron

Differential Revision: D22768382

fbshipit-source-id: 9ee0bbcb2bc5f76107dae385759fe578909a683d
2020-07-27 21:49:30 -07:00
83762844e5 Make run_binary_ops_test function generic and Add tests to add_kernel function (#42101)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42101

1. Add test fixture `atest` class to store global variables
2. Make `run_binary_ops_test` function generic: it can handle different dtypes and different numbers of parameters
3. Add test to `add_kernel`

Test Plan:
Run locally to check cover the corresponding code part in `BinaryOpsKernel.cpp`.
CI

Reviewed By: malfet

Differential Revision: D22760015

fbshipit-source-id: 95b47732f661124615c0856efa827445dd714125
2020-07-27 21:03:00 -07:00
c062cdbd90 Log the net if blob doesn't exist when setting output record (#41971)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41971

Reviewed By: wx1988

Differential Revision: D22490309

fbshipit-source-id: d967ee211b610f5523a307b5266b9fcb0277a21c
2020-07-27 19:13:50 -07:00
f805184165 onnxifi: make it work with AsyncIf
Summary:
the onnxifi path didn't handle the input/output name rewrite for SSA correctly for the AsyncIf op. Add support for it.

Also fixed a place where we lose the net type while doing onnxifi transform.

Test Plan: Load 163357582_593 which is a multi feed model that uses AsyncIf. This used to fail with c2 not finding some blobs in workspace. Now it works.

Reviewed By: dhe95

Differential Revision: D21268230

fbshipit-source-id: ce7ec0e952513d0f251df1bfcfb2b0250f51fd94
2020-07-27 18:27:35 -07:00
c76fada4a8 Let DDP.train() return self to stay consistent with nn.Module (#42131)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42131

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D22775311

Pulled By: mrshenli

fbshipit-source-id: ac9e6cf8b2381036a2b6064bd029dca361a81777
2020-07-27 18:22:13 -07:00
bcd75bd683 [ModelLints] Refine dropout lint message. (#42046)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42046

Refine dropout lint message as we have enabled dropout operator removal in optimize_for_mobile method.
ghstack-source-id: 108607182

Test Plan: buck test ai_infra/ai_mobile_infra/tests:mobile_model_util_tests

Reviewed By: kimishpatel

Differential Revision: D22741132

fbshipit-source-id: 8f87356aae2bd9c89d1cad0d7be7286278bb14ad
2020-07-27 18:15:30 -07:00
d5de616a4a Enable c10d Store tests in CI (#42128)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42128

Reviewed By: pritamdamania87

Differential Revision: D22774445

Pulled By: mrshenli

fbshipit-source-id: 6e5e56f42833414ef375b6cd23fdb3260cb07be9
2020-07-27 18:12:37 -07:00
509c18a096 Documentation for torch.optim.swa_utils (#41228)
Summary:
This PR adds a description of `torch.optim.swa_utils` added in https://github.com/pytorch/pytorch/pull/35032 to the docs at `docs/source/optim.rst`. Please let me know what you think!
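
A minimal sketch of the documented workflow (the model, data, and schedule are illustrative placeholders):

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(5)]
loss_fn = torch.nn.MSELoss()

swa_model = AveragedModel(model)        # running average of the weights
swa_scheduler = SWALR(optimizer, swa_lr=0.05)

for epoch in range(10):
    for x, y in loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    if epoch >= 5:                      # start averaging partway through
        swa_model.update_parameters(model)
        swa_scheduler.step()

update_bn(loader, swa_model)            # refresh BatchNorm statistics, if any
```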

vincentqb andrewgordonwilson

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41228

Reviewed By: ngimel

Differential Revision: D22609451

Pulled By: vincentqb

fbshipit-source-id: 8dd98102c865ae4a074a601b047072de8cc5a5e3
2020-07-27 17:52:16 -07:00
646042e0fb Add suggestion to enumerate ModuleDict in error message (#41946)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41946

Reviewed By: ngimel

Differential Revision: D22774243

Pulled By: wconstab

fbshipit-source-id: 5cfbe52b5b1c540f824593e67ae6ba4973458bb5
2020-07-27 16:24:00 -07:00
1df35ba61e Back out "Support aarch32 neon backend for Vec256"
Summary: Original commit changeset: 1c22cf67ec35

Test Plan: sandcastle, testing on Portal

Reviewed By: currybeef

Differential Revision: D22774614

fbshipit-source-id: 8897aec5df32092c4df86c0d54b0d2fe58d66e66
2020-07-27 16:09:05 -07:00
d198fb3efe changed white-allowlisted (#41796)
Summary:
closes https://github.com/pytorch/pytorch/issues/41749

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41796

Reviewed By: gmagogsfm

Differential Revision: D22718991

Pulled By: SplitInfinity

fbshipit-source-id: 6c2d2b0e3b1e79fd515f9bdd395335a32f525a26
2020-07-27 16:01:45 -07:00
cb9c2049cd replace blacklist in aten/src/ATen/native/cudnn/Conv.cpp (#41627)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41700.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41627

Reviewed By: gmagogsfm

Differential Revision: D22678492

Pulled By: SplitInfinity

fbshipit-source-id: 75b82bd10059754d8e6c25fc20e9dde775d54698
2020-07-27 15:56:36 -07:00
6ca5421a8f Enable non-synchronizing cub scan for cum* operations (#42036)
Summary:
This uses cub for cum* operations, because, unlike thrust, cub is non-synchronizing.
Cub does not support more than `2**31`-element tensors out of the box (in fact, due to cub bugs the cutoff point is even smaller),
so to support that I split the tensor into `2**30`-element chunks, and modify the first value of the second and subsequent chunks to contain the cumsum result of the previous chunks. Since the modification is done in place on the source tensor, if something goes wrong and we error out before the source tensor is reverted back to its original state, the source tensor will be corrupted, but in most cases errors will invalidate the full cuda context.
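
A hedged Python sketch of the chunking scheme (the real code is a CUDA kernel using cub, and it modifies the source in place rather than cloning):

```python
import torch

def chunked_cumsum(t, chunk=2 ** 30):
    # Carry the running total by adding it into the first element of each
    # subsequent chunk before scanning that chunk (1-D input assumed).
    out = torch.empty_like(t)
    carry = torch.zeros((), dtype=t.dtype)
    for start in range(0, t.numel(), chunk):
        piece = t[start:start + chunk].clone()
        piece[0] += carry
        out[start:start + piece.numel()] = torch.cumsum(piece, 0)
        carry = out[start + piece.numel() - 1]
    return out
```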

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42036

Reviewed By: ajtulloch

Differential Revision: D22749945

Pulled By: ngimel

fbshipit-source-id: 9fc9b54d466df9c8885e79c4f4f8af81e3f224ef
2020-07-27 15:44:03 -07:00
330a107199 Refactor lite serializer dependencies from full jit (#42127)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42127

This diff renames core_autograd_sources to core_trainer_sources and moves/adds dependencies for the lite trainer in order to build the serializer functionality internally.
ghstack-source-id: 108589416

Test Plan: Manually tested serializer functionality from the internal lite trainer and verified that data is written correctly.

Reviewed By: iseeyuan

Differential Revision: D22738293

fbshipit-source-id: 992beb0c4368b2395f5bd5563fb2bc12ddde39a1
2020-07-27 15:38:54 -07:00
f7d50f50b9 .circleci: Prefer netrc for docs push (#42136)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42136

Expect was giving weird issues so let's just use netrc since it doesn't
rely on janky expect behavior

Another follow up for: https://github.com/pytorch/pytorch/pull/41964

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: yns88

Differential Revision: D22778940

Pulled By: seemethere

fbshipit-source-id: 1bdf879a5cfbf68a7d2d34b6966c20f95bd0a3b5
2020-07-27 15:28:46 -07:00
ed822de0fc change 2 instances of blacklist to blocklist in tools/pyi/gen_pyi.py (#41979)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41979

Reviewed By: ngimel

Differential Revision: D22764112

Pulled By: zou3519

fbshipit-source-id: 3f8580c96cf45078a9df3cd9ca6fdb10d58e143f
2020-07-27 14:12:32 -07:00
5246bc4e87 register parameters correctly in c++ MultiheadAttention (#42037)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42037

This is to fix #41951

Test Plan: Imported from OSS

Reviewed By: yf225

Differential Revision: D22764717

Pulled By: glaringlee

fbshipit-source-id: e6da0aeb05a2356f52446e6d5fad391f2cd1cf6f
2020-07-27 13:58:11 -07:00
e59db43313 Find hip properly (#42064)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41886

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42064

Reviewed By: seemethere

Differential Revision: D22757115

Pulled By: malfet

fbshipit-source-id: 9c8805e6eb0b7d7defe0ecb08c1e45dcc775a237
2020-07-27 13:47:01 -07:00
d6f1346c37 Add a new op for converting the dense feature to sparse representation
Summary: We need this op to avoid splicing a dense tensor and then using the Mergesinglescaler op

Test Plan: integrated test with dper2

Differential Revision: D22677523

fbshipit-source-id: f4f9a1f06841b0906ec8cbb435482ae0a89e1721
2020-07-27 12:45:37 -07:00
4281240cb5 Raise error for duplicate params in param group #40967 (#41597)
Summary:
This PR fixes an issue in https://github.com/pytorch/pytorch/issues/40967 where duplicate parameters across different parameter groups are not allowed, but duplicates inside the same parameter group are accepted. After this PR, both cases are treated equally and raise `ValueError`.
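
A minimal sketch of the behavior after this PR:

```python
import torch

w = torch.nn.Parameter(torch.randn(2))
# Duplicates across groups already raised; now duplicates within a
# single group raise ValueError as well:
try:
    torch.optim.SGD([{'params': [w, w], 'lr': 0.1}], lr=0.1)
except ValueError as e:
    print(e)
```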

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41597

Reviewed By: zou3519

Differential Revision: D22608019

Pulled By: vincentqb

fbshipit-source-id: 6df41dac62b80db042cfefa6e53fb021b49f4399
2020-07-27 12:25:52 -07:00
6367a9d2b0 [vulkan] Shaders caching (#39384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39384

Introducing `ComputeUnitFactory`, which is responsible for providing `ComputeUnit`s (shaders).
It caches them, using shader name (glsl file name) + workGroupSize as the cache key, in a plain `std::map<string, std::shared_ptr>`.

Macro GLSL_SPV changed to have literal name for cache key as a first argument.

All constructors of ComputeUnit are changed to use `ComputeUnitFactory`

Ownership model:
`ComputeUnitFactory` also owns `vkPipelineCache`, which is an internal Vulkan cache object ( https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VkPipelineCache.html ).

`VContext` (a global object) owns `ComputeUnitFactory`, which owns the `ComputeUnit`s and `vkPipelineCache`. Destroying them requires a valid `VkDevice`, so they must be destructed before `vkDestroyDevice` in `~VContext`. Since class members are destructed only after the destructor body runs, we force destruction of `ComputeUnitFactory` before `vkDestroyDevice` by calling `unique_ptr<ComputeUnitFactory>.reset()`.

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D21962430

Pulled By: IvanKobzarev

fbshipit-source-id: effe60538308805f317c11448b31dbcf670487e8
2020-07-27 11:57:07 -07:00
d4735ff490 Avoid refcount bump in IValue::toStringRef() (#42019)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42019

According to benchmarks, this makes IValue::toStringRef() 3-4x as fast.
ghstack-source-id: 108451154

Test Plan: unit tests

Reviewed By: ezyang

Differential Revision: D22731354

fbshipit-source-id: 3ca3822ea7310d8593e38b1d3e6014d6d80963db
2020-07-27 11:44:27 -07:00
5a6d88d503 Updates to Scale and Zero Point Gradient Calculation (#42034)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42034

In this diff, scale and zero point gradient calculations are updated to correctly reflect the actual backpropagation equation (instead of `dScale * dX`, the near-final output should be `dScale * dY`; the same applies to zero point).

Test Plan:
To execute the unit tests for all affected learnable fake quantize modules and kernels, on a devvm, execute the following command:

`buck test //caffe2/test:quantization -- learnable`

To enable the `cuda` tests, execute the following command:

`buck test mode/dev-nosan //caffe2/test:quantization -- learnable`

Reviewed By: jerryzh168

Differential Revision: D22735668

fbshipit-source-id: 45c1e0fd38cbb2d8d5e60be4711e1e989e9743b4
2020-07-27 11:18:49 -07:00
c261a894d1 Updates to Python Module for Calculation of dX and Addition of Unit Tests (#42033)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42033

In this diff, the Python `_LearnableFakeQuantize` module is updated so that the gradient with respect to the input `x` is actually computed instead of passed through. Argument naming is also updated for better clarity, and unit tests on the `PerTensor` and `PerChannel` operators are added to assert correctness.

Test Plan:
On a devvm, execute the command:

`buck test //caffe2/test:quantization -- learnable_py_module`

To include `cuda` tests as well, run:

`buck test mode/dev-nosan //caffe2/test:quantization -- learnable_py_module`

Reviewed By: jerryzh168

Differential Revision: D22735580

fbshipit-source-id: 66bea7e9f8cb6422936e653500f917aa597c86de
2020-07-27 11:18:47 -07:00
e62bf89273 Renaming variables from dX to dY in Learnable Fake Quantize kernels for Better Clarity (#42032)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42032

In this diff, the `dX` arguments within the C++ kernels are renamed to `dY` for clarity and to avoid confusion, since they don't represent the gradient with respect to the input.

Test Plan:
To test all related fake quantize kernel operators, on a devvm, run the command:

`buck test //caffe2/test:quantization -- learnable`

Reviewed By: z-a-f, jerryzh168

Differential Revision: D22735429

fbshipit-source-id: 9d6d967f08b98a720eca39a4d2280ca8109dcdd6
2020-07-27 11:17:26 -07:00
3e121d9688 Amend docstring and add test for Flatten module (#42084)
Summary:
I noticed that when PR https://github.com/pytorch/pytorch/issues/22245 introduced `nn.Flatten`, the docstring had a bug that kept it from rendering properly on the web; this PR addresses that. Additionally, it adds a unit test for this module.

**Actual**
![image](https://user-images.githubusercontent.com/13088001/88483672-cf896a00-cf3f-11ea-8b1b-a30d152e1368.png)

**Expected**
![image](https://user-images.githubusercontent.com/13088001/88483642-86391a80-cf3f-11ea-8333-0964a027a172.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42084

Reviewed By: mrshenli

Differential Revision: D22756662

Pulled By: ngimel

fbshipit-source-id: 60c58c18c9a68854533196ed6b9e9fb0d4f83520
2020-07-27 11:04:28 -07:00
4290d0be60 Remove settings for the logit test case. (#42114)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42114

Remove settings for the logit test case.

(Note: this ignores all push blocking failures!)

Test Plan: test_op_nnpi_fp16.py test case.

Reviewed By: hyuen

Differential Revision: D22766728

fbshipit-source-id: 2fe8404b103c613524cf1beddf1a0eb9068caf8a
2020-07-27 10:59:23 -07:00
11e5174926 Added support for Huber Loss (#37599)
Summary:
Current losses in PyTorch only include a (partial) implementation of Huber loss through `smooth l1`, based on Fast RCNN, which essentially fixes the delta value at 1. Renaming the [`_smooth_l1_loss()`](3e1859959a/torch/nn/functional.py (L2487)) and refactoring it to take a delta parameter enables use of the actual function.

Supplementary to this, I have also added functional and criterion versions for anyone who wants to set delta explicitly, based on the functional `smooth_l1_loss()` and the criterion `Smooth_L1_Loss()`
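For reference, a minimal sketch of Huber loss with an explicit delta (a hypothetical helper, not the exact functional added by this PR); with delta = 1.0 it reduces to the classic smooth-L1 shape:

```
import torch

def huber_loss(input, target, delta=1.0):
    # Quadratic near zero, linear in the tails; the pieces meet at |diff| == delta.
    diff = input - target
    abs_diff = diff.abs()
    quadratic = 0.5 * diff.pow(2)
    linear = delta * (abs_diff - 0.5 * delta)
    return torch.where(abs_diff <= delta, quadratic, linear).mean()
```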

Pull Request resolved: https://github.com/pytorch/pytorch/pull/37599

Differential Revision: D21559311

Pulled By: vincentqb

fbshipit-source-id: 34b2a5a237462e119920d6f55ba5ab9b8e086a8c
2020-07-27 10:42:30 -07:00
fbdaa555a2 Enable ProcessGroupGlooTest in CI (take 2) (#42086)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42086

Reviewed By: ngimel

Differential Revision: D22765777

Pulled By: malfet

fbshipit-source-id: ebbcd44f448a1e7f9a3d18fa9967461129dd1dcd
2020-07-27 10:21:59 -07:00
96aaa311c0 Grammar Changes (#42076)
Summary:
Small grammatical updates.
![Screenshot (188)](https://user-images.githubusercontent.com/56619747/88471271-02723480-cf25-11ea-8fd1-ae98d5ebcc86.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42076

Reviewed By: mrshenli

Differential Revision: D22756651

Pulled By: ngimel

fbshipit-source-id: e810eb7397a5831d801348c8fff072854658830e
2020-07-26 13:53:41 -07:00
b7bda236d1 DOC: split quantization.rst into smaller pieces (#41321)
Summary:
xref gh-38010 and gh-38011.

After this PR, there should be only two warnings:
```
pytorch/docs/source/index.rst:65: WARNING: toctree contains reference to nonexisting \
      document 'torchvision/index'
WARNING: autodoc: failed to import class 'tensorboard.writer.SummaryWriter' from module \
     'torch.utils'; the following exception was raised:
No module named 'tensorboard'
```

If tensorboard and torchvision are prerequisites to building docs, they should be added to the `requirements.txt`.

As for breaking up quantization into smaller pieces: I split out the list of supported operations and the list of modules to separate documents. I think this makes the page flow better, makes it much "lighter" in terms of page cost, and also removes some warnings since the same class names appear in multiple sub-modules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41321

Reviewed By: ngimel

Differential Revision: D22753099

Pulled By: mruberry

fbshipit-source-id: d504787fcf1104a0b6e3d1c12747ec53450841da
2020-07-25 23:59:40 -07:00
6af659629a DOC: fix two build warnings (#41334)
Summary:
xref gh-38011.

Fixes two warnings when building documentation by
- using the external link to torchvision
- installing tensorboard before building documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41334

Reviewed By: ngimel

Differential Revision: D22753083

Pulled By: mruberry

fbshipit-source-id: 876377e9bd09750437fbfab0378664b85701f827
2020-07-25 23:38:33 -07:00
47e6d4b3c8 Revert D22741514: [pytorch][PR] Enable ProcessGroupGlooTest in CI
Test Plan: revert-hammer

Differential Revision:
D22741514 (45e6f2d600)

Original commit changeset: 738d2e27f523

fbshipit-source-id: 0381105ed0ab676b0abd1927f602a35b1b264a6a
2020-07-25 18:19:17 -07:00
b00c05c86c update cub submodule (#42042)
Summary:
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42042

Reviewed By: mruberry

Differential Revision: D22752345

Pulled By: ngimel

fbshipit-source-id: 363735bfe3d49bab12fedef43b68c9dc9e372815
2020-07-25 17:52:45 -07:00
c5b4f60fc2 Move qconfig removal into convert() (#41930)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41930

As title
ghstack-source-id: 108517079

Test Plan: CI

Reviewed By: jerryzh168

Differential Revision: D22698386

fbshipit-source-id: 4f748c9bae4a0b615aa69c7cc8d8e451e5d26863
2020-07-25 13:27:13 -07:00
12cd083fd7 Updates torch.tensor, torch.as_tensor, and sparse ctors to use the device of inputs tensors they're given, by default (#41984)
Summary:
**BC-Breaking Note**

This PR changes the behavior of the torch.tensor, torch.as_tensor, and sparse constructors. When given a tensor as input and a device is not explicitly specified, these constructors now always infer their device from the tensor. Historically, if the optional dtype kwarg was provided then these constructors would not infer their device from tensor inputs. Additionally, for the sparse ctor a runtime error is now thrown if the indices and values tensors are on different devices and the device kwarg is not specified.

**PR Summary**
This PR's functional change is a single line:

```
auto device = device_opt.has_value() ? *device_opt : (type_inference ? var.device() : at::Device(computeDeviceType(dispatch_key)));
```
=>
```
auto device = device_opt.has_value() ? *device_opt : var.device();
```

in `internal_new_from_data`. This line entangled whether the function was performing type inference with whether it inferred its device from an input tensor, and in practice meant that

```
t = torch.tensor((1, 2, 3), device='cuda')
torch.tensor(t, dtype=torch.float64)
```

would return a tensor on the CPU, not the default CUDA device, while

```
t = torch.tensor((1, 2, 3), device='cuda')
torch.tensor(t)
```

would return a tensor on the device of `t`!

This behavior is niche and odd, but came up while aocsa was fixing https://github.com/pytorch/pytorch/issues/40648.

An additional side effect of this change is that the indices and values tensors given to a sparse constructor must be on the same device, or the sparse ctor must specify the device kwarg. The tests in test_sparse.py have been updated to reflect this behavior.
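A sketch of the new sparse-ctor check described above (shapes are illustrative; requires a CUDA-capable build):

```
import torch

i = torch.tensor([[0, 1]], device='cuda')
v = torch.tensor([1.0, 2.0])  # on the CPU
torch.sparse_coo_tensor(i, v, (2,))                 # now raises a RuntimeError
torch.sparse_coo_tensor(i, v, (2,), device='cuda')  # OK: device is explicit
```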

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41984

Reviewed By: ngimel

Differential Revision: D22721426

Pulled By: mruberry

fbshipit-source-id: 909645124837fcdf3d339d7db539367209eccd48
2020-07-25 02:49:45 -07:00
366c014a77 [Resubmit #41318] NCCL backend support for torch bool (#41959)
Summary:
Resubmit of https://github.com/pytorch/pytorch/issues/41318 pushed to ci-all branch.

Original description:
Closes https://github.com/pytorch/pytorch/issues/24137.
This PR adds support for the torch.bool tensor type to ProcessGroupNCCL. For most types we use the existing mapping, but since bool is not supported as a native ncclDataType_t, we add the following logic:

Map at::kBool to ncclUint8
During reduction (allreduce for example), if the operation is SUM, we instead override it to a MAX to avoid overflow issues. The rest of the operations work with no changes. In the boolean case, changing sum to max makes no correctness difference since they both function as a bitwise OR.
The reduction logic (for example for reduce/allreduce) is as follows:
sum, max = bitwise or
product, min = bitwise and

Note that this PR doesn't add support for BAND/BOR/BXOR. That is because these reduction ops currently are not supported by NCCL backend, see https://github.com/pytorch/pytorch/issues/41362
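A hedged usage sketch of what this enables (assumes an NCCL process group has already been initialized):

```
import torch
import torch.distributed as dist

# dist.init_process_group("nccl", ...) is assumed to have run already.
flag = torch.tensor([True, False], device="cuda")
dist.all_reduce(flag, op=dist.ReduceOp.SUM)  # SUM over bools behaves as a bitwise OR
```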

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41959

Reviewed By: mrshenli

Differential Revision: D22719665

Pulled By: rohan-varma

fbshipit-source-id: 8bc4194a8d1268589640242277124f277d2ec9f1
2020-07-24 23:44:29 -07:00
38580422bb Allow specifying PYTHON executable to build_android (#41927)
Summary:
build_android.sh should check the PYTHON environment variable before trying to use the default python executable.
Even in that case, it should prefer python3 over python2 when available.

Closes https://github.com/pytorch/pytorch/issues/41795

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41927

Reviewed By: seemethere

Differential Revision: D22696850

Pulled By: malfet

fbshipit-source-id: be236c2baf54a1cd111e55ee7743cdc93cb6b9d7
2020-07-24 18:34:42 -07:00
8e03c38a4f Add prim::EnumName and prim::EnumValue ops (#41965)
Summary:
[2/N] Implement Enum JIT support

Add prim::EnumName and prim::EnumValue and their lowerings to support getting `name` and `value` attribute of Python enums.

Supported:
- Enum-typed function arguments
- Using the Enum type and comparing enum values
- Getting the name/value attrs of enums

TODO:
- Add Python sugared value for Enum
- Support Enum-typed return values
- Support enum values of different types in the same Enum class
- Support serialization and deserialization
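A hedged sketch of what the name attribute support enables in TorchScript (exact scripting coverage at this commit may differ; see the TODO list above):

```
from enum import Enum
import torch

class Color(Enum):
    RED = 1
    GREEN = 2

@torch.jit.script
def enum_name(c: Color) -> str:
    return c.name  # lowered to the new prim::EnumName op
```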

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41965

Reviewed By: eellison

Differential Revision: D22714446

Pulled By: gmagogsfm

fbshipit-source-id: db8c4e26b657e7782dbfc2b58a141add1263f76e
2020-07-24 18:33:18 -07:00
6287f9ed65 Remove AllGatherTestWithTimeout (#41945)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41945

This test previously did a thread sleep before launching the allgather operation, and then waited on the work object. Since the sleep was done before the work object was created, it did not affect the allgather call, and thus, did not test work-level timeouts as intended.

I am removing this test for now. In the future we can add this test back, but would need to somehow inject a `cudaSleep` call before the  allgather (so the collective operation itself is delayed). This may require overriding the `ProcessGroupNCCL::collective`, so it's a bit more heavy-weight.

In the meantime, we can remove this test - work-level timeouts are still thoroughly tested with Gloo.
ghstack-source-id: 108370178

Test Plan: Ran ProcessGroupNCCL tests on devGPU

Reviewed By: jiayisuse

Differential Revision: D22702291

fbshipit-source-id: a36ac3d83abfab6351c0476046a2f3b04a80c44d
2020-07-24 18:17:48 -07:00
45e6f2d600 Enable ProcessGroupGlooTest in CI (#41985)
Summary:
Partially addresses https://github.com/pytorch/pytorch/issues/41143

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41985

Reviewed By: rohan-varma

Differential Revision: D22741514

Pulled By: malfet

fbshipit-source-id: 738d2e27f52334e402b65b724b8ba3b0b41372ee
2020-07-24 17:44:00 -07:00
cf7e7909d5 NCCL must depend on librt (#41978)
Summary:
Since NCCL makes calls to shm_open/shm_close, it must depend on librt on Linux.

This should fix the `DSO missing from command line` error on some platforms

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41978

Reviewed By: colesbury

Differential Revision: D22721430

Pulled By: malfet

fbshipit-source-id: d2ae08ce9da3979daaae599e677d5e4519b080f0
2020-07-24 16:47:19 -07:00
dede71d6e3 Support aarch32 neon backend for Vec256 (#41267)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41267

Due to an LLVM bug and some unsupported intrinsics, we could not directly
use intrinsics to implement the aarch32 NEON backend for Vec256.
Instead we resort to inline assembly.

Test Plan:
vec256_test run on android phone.

Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D22482196

fbshipit-source-id: 1c22cf67ec352942c465552031e9329550b27b3e
2020-07-24 15:49:26 -07:00
976e614915 caffe2: add PIPELINE tag (#41482)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41482

This adds a new tag for use with pipeline parallelism.

Test Plan: CI

Reviewed By: heslami

Differential Revision: D22551487

fbshipit-source-id: 90910f458a9bce68f7ef684773322a49aa24494a
2020-07-24 15:25:14 -07:00
0c0864c6be update tests to run back-compat check using new binary (#41949)
Summary:
Instead of exporting schemas using the current binary under test, install the nightly build and export its schemas; these are then used in a back-compat test run by the current binary under test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41949

Reviewed By: houseroad

Differential Revision: D22731054

Pulled By: bradleyhd

fbshipit-source-id: 68a7e7637b9be2604c0ffcde2a40dd208057ba72
2020-07-24 15:20:05 -07:00
42a0b51f71 Easier english updated tech docs (#42016)
Summary:
Just added an easier way to understand the tech docs.

![Screenshot from 2020-07-24 21-48-07](https://user-images.githubusercontent.com/55920093/88412562-6991cb00-cdf7-11ea-9612-5f69146ea233.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42016

Reviewed By: colesbury

Differential Revision: D22735752

Pulled By: mrshenli

fbshipit-source-id: 8e3dfb721f51ee0869b0df66bf856d9949553453
2020-07-24 14:36:17 -07:00
becc1b26dd updated white list/allow list (#41789)
Summary:
closes https://github.com/pytorch/pytorch/issues/41758

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41789

Reviewed By: izdeby

Differential Revision: D22648038

Pulled By: SplitInfinity

fbshipit-source-id: 5abc895789d8803ca542dfc0c62069350c6977c4
2020-07-24 14:26:16 -07:00
7e84913233 .circleci: Make sure to install expect for docs push (#41964)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41964

Since we're not executing this in a docker container, we should go ahead
and install expect explicitly

This is a follow up PR to #41871

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D22736738

Pulled By: seemethere

fbshipit-source-id: a56e19c1ee13c2f6e2750c2483202c1eea3b558a
2020-07-24 14:19:23 -07:00
d4736ef95f Add done() API to Future (#42013)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42013

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D22729596

Pulled By: mrshenli

fbshipit-source-id: ed31021a35af6e2c3393b9b14e4572cf51013bc0
2020-07-24 14:13:41 -07:00
890b52e09f Reduce instability in runCleanUpPasses by reordering passes. (#41891)
Summary:
Currently constant pooling runs before const propagation, which can create more constants that need pooling. This gets in the way of serialization/deserialization stability, because each time a user serializes and deserializes a module, runCleanUpPasses is called on it; doing so multiple times would lead to a different saved module each round trip.

This PR moves constant pooling after const propagation, which may slow down const propagation a little bit, but would otherwise side-step the aforementioned problem.

test_constant_insertion in test_jit.py is also updated because, after fixing the pass ordering, the number of constants is no longer fixed, and it is extremely difficult to get the exact number with the current convoluted test structure. So for now, I changed the test to check only that CSE doesn't change the number of "prim::constant" nodes rather than comparing against a known number. Also left a TODO to improve this test.

The ConstantPropagation pass is replaced by ConstantPropagationImmutableTypes because the latter is what runCleanUpPasses uses. If not replaced, the former would create new CSE opportunities by folding more constants, which defeats the purpose of the test case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41891

Reviewed By: colesbury

Differential Revision: D22701540

Pulled By: gmagogsfm

fbshipit-source-id: 8e60dbdcc54a93dac111d81b8d88fb39387224f5
2020-07-24 11:39:20 -07:00
d904ea5972 [NCCL] DDP communication hook: getFuture() (#41596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41596

We've modified the previous design of `convert_dist_work_to_future` API in the GH Issue [#39272](https://github.com/pytorch/pytorch/issues/39272).

1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object.
2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`.
3. This API will only be supported by NCCL in the first version; the default implementation will throw UnsupportedOperation.
4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function.

`cudaStreamAddCallback` is marked as deprecated. An alternative is `cudaLaunchHostFunc`, but it is only supported for CUDA > 10, and `cudaStreamAddCallback` may not actually be removed until a reasonable alternative is available, according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr).
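A hedged sketch of the kind of chaining this enables (assumes the C++ `getFuture()` is exposed to Python as `Work.get_future()` and an NCCL process group is initialized):

```
import torch
import torch.distributed as dist

t = torch.ones(4, device="cuda")
work = dist.all_reduce(t, async_op=True)
# Chain a callback instead of blocking on work.wait(); the future's value
# for allreduce is a list containing the reduced tensor.
fut = work.get_future().then(lambda f: f.value()[0] / dist.get_world_size())
averaged = fut.wait()
```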
ghstack-source-id: 108409748

Test Plan:
Run old  python test/distributed/test_c10d.py.
Some additional tests:
`test_ddp_comm_hook_allreduce_hook_nccl`: This unit test verifies that a DDP communication hook that just calls allreduce gives the same result as the case of no hook registered. Without the then callback, the future_value in the reducer is no longer a PyObject, and this unit test verifies that future_value is properly checked.
`test_ddp_comm_hook_allreduce_then_mult_ten_hook_nccl`: This unit test verifies whether a DDP communication hook that calls allreduce and then multiplies the result by ten gives the expected result.

As of v10:
```
........................s.....s.....................................................s...............................
----------------------------------------------------------------------
Ran 116 tests

OK (skipped=3)
```
`flow-cli` performance validation using a stacked diff where `bucket.work` is completely replaced with `bucket.future_work` in `reducer`. See PR [#41840](https://github.com/pytorch/pytorch/pull/41840) [D22660198](https://www.internalfb.com/intern/diff/D22660198/).

Reviewed By: izdeby

Differential Revision: D22583690

fbshipit-source-id: 8c059745261d68d543eaf21a5700e64826e8d94a
2020-07-24 11:22:44 -07:00
2e95b29988 restore at::Half support for caffe2 SumOp (#41952)
Summary:
PR https://github.com/pytorch/pytorch/issues/40379 added long support but removed at::Half support.  Restore at::Half support.

CC ezyang xw285cornell neha26shah

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41952

Reviewed By: colesbury

Differential Revision: D22720656

Pulled By: xw285cornell

fbshipit-source-id: be83ca7fe51fc43d81bc0685a3b658353d42f8ea
2020-07-24 10:49:06 -07:00
e9e6cc8c83 Added Prehook option to prepare method (#41863)
Summary:
Added logic so that if a prehook is passed into the prepare method during quantization, the hook is added as a prehook to all leaf nodes (and to modules specified in the non_leaf_module_list).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41863

Test Plan:
Small demo, made simple module then called prepare with prehook parameter set to the numeric suite logger, printed the results to verify its what we wanted
{F245156246}

Reviewed By: jerryzh168

Differential Revision: D22671288

Pulled By: edmundw314

fbshipit-source-id: ce65a00830ff03360a82c0a075b3b6d8cbc4362e
2020-07-24 10:26:39 -07:00
1b55e2b043 add prefetch_factor for multiprocessing prefetching process (#41130)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40604
Adds a parameter to DataLoader to configure the per-worker prefetch count.
Before this edit, the prefetching always fetched 2 * num_workers data items ahead; this commit makes that configurable, e.g. you can specify prefetching 10 * num_workers data items.
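A brief usage sketch of the new argument (dataset and sizes are illustrative):

```
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(100, dtype=torch.float32))
# Each worker now keeps 5 batches in flight instead of the fixed default of 2,
# so 5 * num_workers items are prefetched in total.
loader = DataLoader(ds, batch_size=10, num_workers=2, prefetch_factor=5)
```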

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41130

Reviewed By: izdeby

Differential Revision: D22705288

Pulled By: albanD

fbshipit-source-id: 2c483fce409735fef1351eb5aa0b033f8e596561
2020-07-24 08:38:13 -07:00
79cdd84c81 Downloading different sccache binary in case of ROCm build (#41958)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41958

Reviewed By: colesbury

Differential Revision: D22717509

Pulled By: malfet

fbshipit-source-id: 96c94512f12193fa549ec84cd51f17978f221bc6
2020-07-24 08:04:25 -07:00
c0bfa45f9d Enable typechecking for torch.futures (#41675)
Summary:
Add typing declarations for torch._C.Future and torch._C._collect_all

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41675

Reviewed By: izdeby

Differential Revision: D22627539

Pulled By: malfet

fbshipit-source-id: 29b87685d65dd24ee2094bae8a84a0fe3787e7f8
2020-07-23 23:06:45 -07:00
750d9dea49 move min/max tests to TestTorchDeviceType (#41908)
Summary:
so that testing _min_max on the different devices is easier, and min/max operations have better CUDA test coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41908

Reviewed By: mruberry

Differential Revision: D22697032

Pulled By: ngimel

fbshipit-source-id: a796638fdbed8cda90a23f7ff4ee167f45530914
2020-07-23 22:49:30 -07:00
6a8c9f601f Removed whitelist references from test/backward_compatibility/check_b… (#41691)
Summary:
Removed whitelist reference
Fixes https://github.com/pytorch/pytorch/issues/41733.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41691

Reviewed By: houseroad

Differential Revision: D22641467

Pulled By: SplitInfinity

fbshipit-source-id: 72899b7410d4fc8454d87ca0c042f1ede7cf73de
2020-07-23 21:36:14 -07:00
e42eab4b1c Update PULL_REQUEST_TEMPLATE.md (#41812)
Summary:
**Summary**
This commit updates the repository's pull request template to remind contributors to tag the issue that their pull request addresses.

**Fixes**
This commit fixes https://github.com/pytorch/pytorch/issues/35319.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41812

Reviewed By: gmagogsfm

Differential Revision: D22667902

Pulled By: SplitInfinity

fbshipit-source-id: cda5ff7cbbbfeb89c589fd0dfd378bf73a59d77b
2020-07-23 21:30:43 -07:00
2da69081d7 Fix one error message format of torch.dot() (#41963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41963

the error message of dot (CUDA) was copied from dot (CPU); however, both messages are prone to causing confusion

Test Plan: wait for unittests

Reviewed By: ngimel

Differential Revision: D22710822

fbshipit-source-id: 565b51149ff4bee567ef0775e3f8828579565f8a
2020-07-23 20:47:11 -07:00
f00a37dd71 Make setup.py Python-2 syntactically correct (#41960)
Summary:
Import from __future__ to make `print(*args)` a syntactically correct statement under Python 2.
Otherwise, if one accidentally invokes setup.py using a Python 2 interpreter, they will be greeted by:
```
  File "setup.py", line 229
    print(*args)
          ^
SyntaxError: invalid syntax
```
instead of:
```
Python 2 has reached end-of-life and is no longer supported by PyTorch.
```
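A minimal sketch of the fix (the exact guard in setup.py may differ):

```
from __future__ import print_function  # must precede any other statement

import sys

if sys.version_info < (3,):
    print("Python 2 has reached end-of-life and is no longer supported by PyTorch.")
    sys.exit(1)
```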

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41960

Reviewed By: orionr, seemethere

Differential Revision: D22710174

Pulled By: malfet

fbshipit-source-id: ffde3ddd585707ba1d39e57e0c6bc9c4c53f8004
2020-07-23 19:10:20 -07:00
97ab33d47c Fix memory leak in XNNPACK/MaxPool2D. (#41874)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41874

Test Plan: Imported from OSS

Reviewed By: ann-ss

Differential Revision: D22699598

Pulled By: AshkanAliabadi

fbshipit-source-id: fec59ed3d5d23bd9197349057fcf2ce56a2b278b
2020-07-23 18:59:53 -07:00
36fb14b68b [quant] Add Graph Mode Passes to quantize EmbeddingBag operators (#41612)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41612

This change adds preliminary support to quantize the EmbeddingBag operators. We currently support 4-bit and 8-bit quantization+packing of the weights.

To quantize these operators, specify the operator name in the `custom_op_name` field of the NoopObserver. Based on the op name (4bit or 8bit) we call the corresponding quantization functions.
Refer to the testplan for how to invoke the qconfig for the embedding_bag ops.

Future versions of this will support 4-bit and 2-bit qtensors with native support to observe and quantize it.

NB - This version assumes that the weights in the EmbeddingBag Module reside on the same device.
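A hedged sketch of how such a qconfig might be wired up via the NoopObserver's custom_op_name field (the op-name string here is illustrative; the exact names recognized by the pass live in the quantization code and are exercised by the test plan below):

```
from torch.quantization import QConfig, NoopObserver

embedding_bag_4bit_qconfig = QConfig(
    activation=NoopObserver.with_args(custom_op_name="embedding_bag_4bit"),
    weight=NoopObserver.with_args(custom_op_name="embedding_bag_4bit"),
)
```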

Test Plan:
python test/test_quantization.py TestQuantizeDynamicJitOps.test_embedding_bag

Imported from OSS

Reviewed By: vkuzo, jerryzh168

Differential Revision: D22609342

fbshipit-source-id: 23e33f44a451c26719e6e283e87fbf09b584c0e6
2020-07-23 18:54:59 -07:00
401ac2dd39 Replaced whitelisted with allowed (#41867)
Summary:
Closes https://github.com/pytorch/pytorch/issues/41746
Closes https://github.com/pytorch/pytorch/issues/41745

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41867

Reviewed By: izdeby

Differential Revision: D22703533

Pulled By: mrshenli

fbshipit-source-id: 915895463a92e18f36db93b8884d9fd432c0997d
2020-07-23 16:53:51 -07:00
a1cfcd4d22 Change whitelist to another context in binary_smoketest.py (#41822)
Summary:
Fix https://github.com/pytorch/pytorch/issues/41740

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41822

Reviewed By: izdeby

Differential Revision: D22703682

Pulled By: mrshenli

fbshipit-source-id: 1df82fd43890142dfd261eb7bf49dbd128295e03
2020-07-23 16:14:54 -07:00
b6690eb29a Might be good for newcomers to read what N means (#41851)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41851

Reviewed By: izdeby

Differential Revision: D22703602

Pulled By: mrshenli

fbshipit-source-id: 44905f43cdf53b38e383347e5002a28c9363a446
2020-07-23 16:10:38 -07:00
7646f3c77f Fix type annotation for CosineAnnealingLR (#41866)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41866

Reviewed By: izdeby

Differential Revision: D22703576

Pulled By: mrshenli

fbshipit-source-id: 10a0f593ffaaae82a2923a42815c36793a9043d5
2020-07-23 15:56:50 -07:00
cyy
c5fdcd85c7 check pruned attributes before deleting (#41913)
Summary:
I copied a pruned model after deleting the derived tensors. In order to be able to re-parametrize the model, we should check for the existence of the tensors here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41913

Reviewed By: izdeby

Differential Revision: D22703248

Pulled By: mrshenli

fbshipit-source-id: f5274d2c634a4c9a038100d8a6e837f132eabd34
2020-07-23 15:56:48 -07:00
183b43f323 Clarify Python 3.5 is the minimum supported version in the installation section. (#41937)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41937

Reviewed By: izdeby

Differential Revision: D22702924

Pulled By: mrshenli

fbshipit-source-id: 67306435e80f80236b585f1d5406444daec782d6
2020-07-23 15:54:56 -07:00
a4b831a86a Replace if(NOT ${var}) by if(NOT var) (#41924)
Summary:
As explained in https://github.com/pytorch/pytorch/issues/41922, using `if(NOT ${var})` is usually wrong and can lead to issues like https://github.com/pytorch/pytorch/issues/41922, where the condition is wrongly evaluated to FALSE instead of TRUE. Instead, the unevaluated variable name should be used in all cases; see the CMake documentation for details.

This fixes the `NOT ${var}` cases with a simple regexp replacement. It seems `pybind11_PREFER_third_party` is the only variable really prone to causing an issue, as all others are set. However, because CMake evaluates unquoted strings in `if` conditions as variable names, I recommend never using an unquoted `${var}` in an if condition. A similar regexp-based replacement could be done on the whole codebase, but as that touches a lot of code I didn't include it now. Also, `if(${var})` will likely lead to a parser error, rather than a wrong result, if `var` is unset.

Fixes https://github.com/pytorch/pytorch/issues/41922

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41924

Reviewed By: seemethere

Differential Revision: D22700229

Pulled By: mrshenli

fbshipit-source-id: e2b3466039e4312887543c2e988270547a91c439
2020-07-23 15:49:20 -07:00
dbe6bfbd7e Revert D22496604: NCCL Backend support for torch.bool
Test Plan: revert-hammer

Differential Revision:
D22496604 (3626473105)

Original commit changeset: a1a15381ec41

fbshipit-source-id: 693c2f9fd1df568508cbcf8c734c092cec3b0a72
2020-07-23 15:33:58 -07:00
b898bdd4d3 [JIT] Don't re run CSE on every block (#41479)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41479

Previously we were re-running CSE every time we recursed into a new block, which in turn created a new Alias Db for the whole graph. This was O(# Nodes * # Blocks).

For graphs which don't have any autodiff opportunities, such as Densenet,  create_autodiff_subgraphs is now linear in number of nodes. For Densenet this pass was measured at ~.1 seconds.

This pass is still non-linear for models which actually do create autodiff subgraphs, because in the
```
      bool any_changed = true;
      while (any_changed) {
        AliasDb aliasDb(graph_);
        any_changed = false;
        for (auto it = workblock.end()->reverseIterator();
             it != workblock.begin()->reverseIterator();) {
          bool changed;
          std::tie(it, changed) = scanNode(*it, aliasDb);
          any_changed |= changed;
        }
      }
```
loop we recreate the AliasDb (which is O(N)) every time we merge something and scanNode returns. I will make that linear in the next PR in the stack.

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D22600606

Pulled By: eellison

fbshipit-source-id: b08abfde2df474f168104c5b477352362e0b7b16
2020-07-23 14:50:04 -07:00
25b6e2e5ee [JIT] optimize autodiff subgraph slicing (#41437)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41437

[copied from commented code]
The IR has many nodes which can never be reordered around, such as a prim::Bailout. If a node N is surrounded by two nodes which cannot be reordered, A and B, then a differentiable subgraph created from N can only contain nodes from [A, B]. The nodes from A to B represent one work block for the subgraph slicer to work on. By creating these up front, we avoid retraversing the whole graph block any time scanNode returns, and we can also avoid attempting to create differentiable subgraphs in work blocks that do not contain a minimum number of differentiable nodes.

This improved the compilation time of densenet (the model with the slowest compilation time we're tracking) from 56s -> 28s, and of mobilenet from 8s -> 6s.

Test Plan: Imported from OSS

Reviewed By: Krovatkin, ZolotukhinM

Differential Revision: D22600607

Pulled By: eellison

fbshipit-source-id: e5ab6ed87bf6820b4e22c86eabafd9d17bf7cedc
2020-07-23 14:49:57 -07:00
da3ff5e473 [JIT] dont count constants in subgraph size (#41436)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41436

Constants are not executed as instructions, so we should ignore them when counting subgraph size, just as we ignore them when counting block size for loop unrolling.

Test Plan: Imported from OSS

Reviewed By: Krovatkin, ZolotukhinM

Differential Revision: D22600608

Pulled By: eellison

fbshipit-source-id: 9770b21c936144a3d6a1df89cf3be5911095187e
2020-07-23 14:48:25 -07:00
dfe7d27d0e implement lite parameter serializer (#41403)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41403

Test Plan: Imported from OSS

Reviewed By: kwanmacher

Differential Revision: D22611633

Pulled By: ann-ss

fbshipit-source-id: b391e8c96234b2e69f350119a11f688e920c7817
2020-07-23 14:25:44 -07:00
b85df3709a Add __main__ entrypoint to test_futures.py (#41826)
Summary:
Per comment in run_test.py, every test module must have a __main__ entrypoint:
60e2baf5e0/test/run_test.py (L237-L238)
Also disable test_wait_all on Windows, as it fails with an uncaught exception:
```
  test_wait_all (__main__.TestFuture) ... Traceback (most recent call last):
  File "run_test.py", line 744, in <module>
    main()
  File "run_test.py", line 733, in main
    raise RuntimeError(err)
RuntimeError: test_futures failed!
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41826

Reviewed By: seemethere, izdeby

Differential Revision: D22654899

Pulled By: malfet

fbshipit-source-id: ab7fdd7adce3f32c53034762ae37cf35ce08cafc
2020-07-23 12:56:03 -07:00
3626473105 NCCL Backend support for torch.bool (#41318)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41318

Closes https://github.com/pytorch/pytorch/issues/24137.

This PR adds support for the `torch.bool` tensor type to ProcessGroupNCCL. For most types we use the existing mapping, but since `bool` is not supported as a native `ncclDataType_t`, we add the following logic:
1) Map `at::kBool` to `ncclUint8`
2) During reduction (allreduce for example), if the operation is SUM, we instead override it to a MAX to avoid overflow issues. The rest of the operations work with no changes. In the boolean case, changing sum to max makes no correctness difference since they both function as a bitwise OR.

The reduction logic (for example for reduce/allreduce) is as follows:
sum, max = bitwise or
product, min = bitwise and

Tests are added to ensure that the reductions work as expected.
ghstack-source-id: 108315417

Test Plan: Added unittests

Reviewed By: mrshenli

Differential Revision: D22496604

fbshipit-source-id: a1a15381ec41dc59923591885d40d966886ff556
2020-07-23 12:33:39 -07:00
01c406cc22 [pytorch] bump up variable version regardless of differentiability (#41269)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41269

The ultimate goal is to move things that are not gated with `if (compute_requires_grad(...))`
or `if (grad_fn)` out from VariableType so that VariableType kernels can be enabled/disabled
based upon `GradMode`. Then we can merge `AutoNonVariableTypeMode` and `NoGradGuard`.

We've moved profiling / tracing logic out from VariableType. One remaining thing that's
not gated with the if-statement is the `increment_version` call.

However, `gen_variable_type.py` does use bits from `derivatives.yaml` to determine whether
to emit the `increment_version` call. If an output is never going to be differentiable (based
not on a runtime property of the variable but on a static property, e.g. it has an integral type),
then the `increment_version` call is never emitted.

Hypothetically, increment_version for a tensor can be orthogonal to its differentiability.

This PR is to make the change and test its impact. Making this logical simplification would
allow us to move this out from VariableType to aten codegen.
ghstack-source-id: 108318746

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D22471643

fbshipit-source-id: 3e3a442c7fd851641eb4a9c4f024d1f5438acdb8
2020-07-23 12:07:32 -07:00
1978188639 Remove two "return"s that return "void" (#41811)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41811

Reviewed By: izdeby

Differential Revision: D22673690

Pulled By: ezyang

fbshipit-source-id: 10d4aff90e2e051116e682fa51fb9494af8482c1
2020-07-23 10:17:29 -07:00
77db93228b Temporary fix for determinant bug on CPU (#35136)
Summary:
Changelog:
- Make diagonal contiguous

Temporarily Fixes https://github.com/pytorch/pytorch/issues/34061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/35136

Reviewed By: izdeby

Differential Revision: D22673153

Pulled By: ezyang

fbshipit-source-id: 850f537483f929fcb43bcdef9d4ec264a7c3d354
2020-07-23 10:12:06 -07:00
17f76f9a78 Verbose param for schedulers that don't have it #38726 (#41580)
Summary:
Verbose param for schedulers that don't have it https://github.com/pytorch/pytorch/issues/38726

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41580

Reviewed By: izdeby

Differential Revision: D22671163

Pulled By: vincentqb

fbshipit-source-id: 53a6c9e929141d411b6846bc25f3fe7f46fdf3be
2020-07-23 09:57:33 -07:00
37e7f0caf6 Fix docstring in Unflatten (#41835)
Summary:
I'd like to amend the docstring introduced in https://github.com/pytorch/pytorch/issues/41564. It's not rendering correctly on the web, and this should fix it.

cc albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41835

Reviewed By: izdeby

Differential Revision: D22672368

Pulled By: albanD

fbshipit-source-id: f0b03c2b2a4c79b790d54f7c8f2ae28ef9d76a75
2020-07-23 09:55:11 -07:00
fab1795577 move benchmark utils into torch namespace (#41506)
Summary:
Move the timing utils to `torch.utils._benchmark`. I couldn't figure out how to get setuptools to pick it up and put it under `torch` unless it is in the `torch` directory. (And I think it has to be for `setup.py develop` anyway.)

I also modified the record function benchmark since `Timer` and `Compare` should always be available now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41506

Reviewed By: ngimel

Differential Revision: D22601460

Pulled By: robieta

fbshipit-source-id: 9cea7ff1dcb0bb6922c15b99dd64833d9631c37b
2020-07-23 09:48:39 -07:00
266657182a Add torch.movedim (#41480)
Summary:
https://github.com/pytorch/pytorch/issues/38349 #36048

TODO:
* [x] Tests
* [x] Docs
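For context, a brief usage sketch of the new op (it mirrors NumPy's moveaxis semantics):

```
import torch

x = torch.zeros(2, 3, 4)
torch.movedim(x, 0, -1).shape           # torch.Size([3, 4, 2])
torch.movedim(x, (0, 1), (1, 2)).shape  # torch.Size([4, 2, 3])
```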

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41480

Reviewed By: zhangguanheng66

Differential Revision: D22649917

Pulled By: zou3519

fbshipit-source-id: a7f3920a24bae16ecf2ad731698ca65ca3e8c1ce
2020-07-23 09:41:01 -07:00
c0e3839845 fix #36801 (#41607)
Summary:
unittest actually writes the test name to stdout, e.g. `test_accumulate_grad (__main__.TestAutograd) ...`, before the test starts running. Exporting PYTHONUNBUFFERED=1 or running with `python -u` lets this message be recorded. ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41607

Reviewed By: izdeby

Differential Revision: D22673930

Pulled By: ezyang

fbshipit-source-id: 18512b6f5f80485c2b0d812f2ebdecc1fdc4b4ec
2020-07-23 09:32:46 -07:00
272fb3635f Add regression test for ONNX exports of modules that embed an Embedding layer inside a Sequential (#32598)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/19227

This PR adds a regression test for ONNX exports where a module has a sequential that references an Embedding layer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/32598

Reviewed By: izdeby

Differential Revision: D22672790

Pulled By: ezyang

fbshipit-source-id: c88beb29a36b07378c28b0e4546efe887fcbc3be
2020-07-23 09:32:44 -07:00
e831299bae Fix typing error of torch/optim/lr_scheduler.pyi (#41775)
Summary:
* Add a `_LRScheduler.get_last_lr` type stub.
* Remove `CosineAnnealingWarmRestarts.step` because its signature is the same as `_LRScheduler`'s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41775

Reviewed By: izdeby

Differential Revision: D22649350

Pulled By: vincentqb

fbshipit-source-id: 5355dd062a5af437f4fc153244dda793a2382e7e
2020-07-23 09:30:32 -07:00
4b4273a04e Update Adam documentation (#41679)
Summary:
This PR fixes https://github.com/pytorch/pytorch/issues/41477

The Adam implementation performs L2 regularization, not decoupled weight decay. However, the change mentioned in https://github.com/pytorch/pytorch/issues/41477 was motivated by line 12 of Algorithm 2 in the [Decoupled Weight Decay Regularization](https://arxiv.org/pdf/1711.05101.pdf) paper.

Please let me know if you have other suggestions about how to deliver this info in the docs.
cc ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41679

Reviewed By: izdeby

Differential Revision: D22671329

Pulled By: vincentqb

fbshipit-source-id: 2caf60e4f62fe31f29aa35a9532d1c6895a24224
2020-07-23 09:25:41 -07:00
30ce7b3740 Fix bug when compiling with caffe2 (#41868)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41868

Fix bug when compiling with caffe2

Reviewed By: jianyuh

Differential Revision: D22670707

fbshipit-source-id: aa654d7b9004257e0288c8ae8819ca5752eea443
2020-07-23 09:11:05 -07:00
0ec7ba4088 [iOS] Bump up the cocoapods version (#41895)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41895

### Summary

The iOS binary for 1.6.0 has been uploaded to AWS. This PR bumps up the version for cocoapods.

### Test Plan

- Check CI

Test Plan: Imported from OSS

Reviewed By: husthyc

Differential Revision: D22683787

Pulled By: xta0

fbshipit-source-id: bb95b670a7945d823d55e9c65b357765753f295a
2020-07-22 22:03:40 -07:00
2a3ab71f28 [quant][graphmode][fix] Remove useQuantizable check for dynamic quant (#41892)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41892

Currently the input of batch_norm is considered dynamically quantizable, but it shouldn't be; this PR fixes that.

Test Plan:
internal models

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D22681423

fbshipit-source-id: 7f428751de0c4af0a811b9c952e1d01afda42d85
2020-07-22 21:06:48 -07:00
ca3ba1095e Do not chown files inside docker for pytorch-job-tests (#41884)
Summary:
They are already owned by `jenkins` user after the build

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41884

Reviewed By: orionr

Differential Revision: D22682441

Pulled By: malfet

fbshipit-source-id: daf99532d300d30a5de591ad03af4597e145fdfc
2020-07-22 19:53:59 -07:00
586b7f991c Enable skipped tests from test_torch on ROCm (#41611)
Summary:
This pull request enables the following tests from test_torch, previously skipped on ROCm:
test_pow_-2_cuda_float32/float64
test_sum_noncontig_cuda_float64
test_conv_transposed_large

The first two tests experienced precision issues on earlier ROCm versions, whereas the conv_transposed test was hitting a bug in MIOpen that is fixed in the version shipping with ROCm 3.5.

ezyang jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41611

Reviewed By: xw285cornell

Differential Revision: D22672690

Pulled By: ezyang

fbshipit-source-id: 5585387c048f301a483c4c0566eb9665555ef874
2020-07-22 19:49:17 -07:00
7fefa46820 scatter/gather - check that inputs are of the same dimensionality (#41672)
Summary:
As per title.
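A hedged illustration of the new check (after this change, gather/scatter expect the index to have the same number of dimensions as the input):

```
import torch

x = torch.randn(3, 4)
bad = torch.zeros(3, dtype=torch.long)      # 1-D index for a 2-D input
# torch.gather(x, 1, bad)                   # now raises a RuntimeError
good = torch.zeros(3, 1, dtype=torch.long)
torch.gather(x, 1, good)                    # OK: dimensionalities match
```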

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41672

Reviewed By: malfet, ngimel

Differential Revision: D22678302

Pulled By: gchanan

fbshipit-source-id: 95a1bde81e660b8963e5914d5348fd4fbff1338e
2020-07-22 18:51:51 -07:00
b40ef422d3 .circleci: Separate out docs build from push (#41871)
Summary:
Separates out the docs build from the push and limits when the push
actually happens.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41871

Reviewed By: yns88

Differential Revision: D22673716

Pulled By: seemethere

fbshipit-source-id: fff8b35ba8465dc15832214c4c9ef03ce12faa48
2020-07-22 17:01:24 -07:00
4e16be9073 [MemLeak] Fix memory leak from releasing unique ptr (#41883)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41883

Fix memory leak from releasing unique ptr

Test Plan:
Tested serialization with and without the change.

Heap profile without change:
```
Welcome to jeprof!  For help, type 'help'.
(jeprof) top
Total: 7298.4 MB
  4025.2  55.2%  55.2%   4025.2  55.2% c10::alloc_cpu (inline)
  3195.3  43.8%  98.9%   3195.3  43.8% caffe2::SerializeUsingBytesOrInt32
    63.6   0.9%  99.8%     63.6   0.9% __gnu_cxx::new_allocator::allocate (inline)
     5.0   0.1%  99.9%      5.0   0.1% google::protobuf::RepeatedField::Reserve
     2.5   0.0%  99.9%      2.5   0.0% folly::aligned_malloc (inline)
     1.2   0.0%  99.9%      1.2   0.0% caffe2::detail::CopyFromProtoWithCast (inline)
     1.0   0.0%  99.9%      1.0   0.0% __new_exitfn
     1.0   0.0% 100.0%      1.0   0.0% std::_Function_base::_Base_manager::_M_init_functor (inline)
     0.5   0.0% 100.0%      0.5   0.0% folly::HHWheelTimerBase::newTimer (inline)
     0.5   0.0% 100.0%      0.5   0.0% std::__detail::_Hashtable_alloc::_M_allocate_node
```

Heap profile with change:
```
Welcome to jeprof!  For help, type 'help'.
(jeprof) top
Total: 6689.2 MB
  4025.2  60.2%  60.2%   4025.2  60.2% c10::alloc_cpu (inline)
  2560.0  38.3%  98.4%   2560.0  38.3% caffe2::::HugePagesArena::alloc_huge (inline)
    90.9   1.4%  99.8%     90.9   1.4% __gnu_cxx::new_allocator::allocate (inline)
     5.0   0.1%  99.9%      5.0   0.1% google::protobuf::RepeatedField::Reserve
     2.0   0.0%  99.9%      2.0   0.0% prof_backtrace_impl (inline)
     1.0   0.0%  99.9%     20.3   0.3% std::__cxx11::basic_string::_M_construct (inline)
     1.0   0.0%  99.9%      1.0   0.0% std::_Function_base::_Base_manager::_M_init_functor (inline)
     0.5   0.0%  99.9%      0.5   0.0% folly::UnboundedQueue::allocNextSegment (inline)
     0.5   0.0% 100.0%      0.5   0.0% folly::aligned_malloc (inline)
     0.5   0.0% 100.0%      0.5   0.0% __new_exitfn
```

Reviewed By: yinghai

Differential Revision: D22662093

fbshipit-source-id: d0b8ff1ed26c72b14bb02fb1146c51ef11a7e519
2020-07-22 16:54:19 -07:00
dbc6a2904b [quant][graphmode][fix] Remove assert for uses == 1 in remove dequantize pass (#41859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41859

A value can be used multiple times in the same node, so we don't really need to assert that dequantize has exactly one use.

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D22673525

fbshipit-source-id: 2c4a770e0ddee722ca54e68d310c395e7f418b3b
2020-07-22 15:58:11 -07:00
dfa914a90c Modify lazy_dyndep loading to trigger inside workspace. (#41687)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41687

Specifically, this makes a new library (lazy), which can be used from both core
and workspace.

This allows workspace.CreateNet to trigger lazy loading of dyndep dependencies.

Test Plan: Added a unit test specifically for workspace.CreateNet

Reviewed By: dzhulgakov

Differential Revision: D22441877

fbshipit-source-id: 3a9d1af9962585d08ea2566c9c85bec7377d39f2
2020-07-22 15:36:43 -07:00
af5d0bff00 [ONNX] Add pass that fuses Conv and BatchNormalization (#40547)
Summary:
Adds a pass that fuses Conv and BatchNormalization nodes into a single Conv node.
This pass is only applied in inference mode (training is None or TrainingMode.Eval).
Since this pass needs access to param_dict, it is written outside the peephole file where these kinds of passes (fusing multiple nodes into one) are usually placed.

This PR also adds a skipIfNoEmbed wrapper to skip the debug_embed_params test:
the pass that fuses Conv and BatchNorm changes the params of the ResNet model, so the parameters of the ONNX and PyTorch models won't match. Since the parameters don't match, the debug_embed_params test for test_resnet will fail, and that is expected; therefore it should be skipped.
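For intuition, a minimal sketch of the folding math such a fusion performs (a hedged standalone helper, not the pass itself; groups and dilation are ignored for brevity):

```
import torch

def fuse_conv_bn(conv, bn):
    # Absorb the BN affine transform into the conv weight and bias (inference only).
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused = torch.nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                            stride=conv.stride, padding=conv.padding, bias=True)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias
    return fused
```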

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40547

Reviewed By: gchanan

Differential Revision: D22631687

Pulled By: bzinodev

fbshipit-source-id: fe45812400398a32541e797f727fd8697eb6d8c0
2020-07-22 14:59:27 -07:00
ad7133d3c1 Patch for #40026 RandomSampler generates samples one at a time when replacement=True (#41682)
Summary:
Fix https://github.com/pytorch/pytorch/issues/32530
Fix/Patch https://github.com/pytorch/pytorch/pull/40026

Resubmit this patch and fix the type error.

Force the input type to `manual_seed()` in `sampler.py` to be `int`.

ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41682

Reviewed By: izdeby

Differential Revision: D22665477

Pulled By: ezyang

fbshipit-source-id: 1725c8aa742c31e74321f20448f4b6a392afb38d
2020-07-22 13:45:09 -07:00
2d15b39745 [Onnxifi] Support running with quantized int8 inputs (#41820)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41820

Pull Request resolved: https://github.com/pytorch/glow/pull/4721

In order to support int8 quantized tensors as inputs to OnnxifiOp, we need to:
- Add support to recognize and extract shape meta from int8 tensors at the input of OnnxifiOp.
- Make a copy of the input data and shift it by 128 in Glow if the input is a uint8 quantized tensor, because Glow uses int8 to represent quantized data regardless.
- Propagate correct quantization parameters through shape info in C2.

This diff implements the above.

Test Plan:
```
buck test caffe2/caffe2/contrib/fakelowp/test:test_int8_quantnnpi
```

Reviewed By: jackm321

Differential Revision: D22650584

fbshipit-source-id: 5e867f7ec7ce98bb066ec4128ceb7cad321b3392
2020-07-22 13:42:34 -07:00
47c57e8804 rename TestFuser to TestTEFuser (#41542)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41542

Reviewed By: jamesr66a

Differential Revision: D22579606

Pulled By: Krovatkin

fbshipit-source-id: f65b2cae996b42d55ef864bc0b424d9d43d8a2e2
2020-07-22 13:37:27 -07:00
6ceb65f98c Document default dim for cross being None (#41850)
Summary:
The function torch.cross is a bit confusing, in particular the defaulting of the dim argument.

The default `dim` has been documented as -1 but it is actually `None`. This adds to the confusion in two possible ways, depending on how carefully you read the rest of the docs. I also add a warning as the final sentence.

This partially addresses https://github.com/pytorch/pytorch/issues/39310.
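A short illustration of the documented default (with `dim=None`, the cross product is taken over the first dimension of size 3, not over dim -1):

```
import torch

a = torch.randn(3, 5)
b = torch.randn(3, 5)
c = torch.cross(a, b)  # computed over dim 0 here, the first size-3 dimension
```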

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41850

Reviewed By: izdeby

Differential Revision: D22664625

Pulled By: albanD

fbshipit-source-id: b8669e026fd01de9e4ec16da1414b9edfaa76bdd
2020-07-22 13:31:47 -07:00
b80ffd44b0 Revert D20781624: Add NCCL Alltoall to PT NCCL process group
Test Plan: revert-hammer

Differential Revision:
D20781624 (b87f0e5085)

Original commit changeset: 109436583ff6

fbshipit-source-id: 03f6ee4d56baea93a1cf795d26dd92b7d6d1df28
2020-07-22 13:22:17 -07:00
ec683299eb Reland Add non-deterministic alert to CUDA operations that use atomicAdd() (#41538)
Summary:
Reland PR https://github.com/pytorch/pytorch/issues/40056

A new overload of upsample_linear1d_backward_cuda was added in a recent commit, so I had to add the nondeterministic alert to it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41538

Reviewed By: zou3519

Differential Revision: D22608376

Pulled By: ezyang

fbshipit-source-id: 54a2aa127e069197471f1feede6ad8f8dc6a2f82
2020-07-22 13:12:29 -07:00
aa91a65b59 [TensorExpr] Fix propagation of loop options when splitting loops (#40035)
Summary:
Fix a bug in SplitWithTail and SplitWithMask where loop_options such as Cuda block/thread bindings are overwritten by the split. This PR fixes this bug by propagating the loop options to the outer loop, which for axis bindings should be equivalent.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40035

Reviewed By: ZolotukhinM

Differential Revision: D22080263

Pulled By: nickgg

fbshipit-source-id: b8a9583fd90f69319fc4bb4db644e91f6ffa8e67
2020-07-22 11:49:07 -07:00
9c7ca89ae6 Conda build (#38796)
Summary:
closes gh-37584. ~I think I need to do more to generate an image, but the `.circleci/README.md` is vague in the details. The first commit reflows and updates that document a bit, I will continue to update it as the PR progresses :)~ Dropped updating `.circleci/README.md`, will do that in a separate PR once this is merged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/38796

Reviewed By: gchanan

Differential Revision: D22627522

Pulled By: ezyang

fbshipit-source-id: 99d5c19e942f15b9fc10f0de425790474a4242ab
2020-07-22 11:42:39 -07:00
61511aa1d6 Remove zmath_std.h (#39835)
Summary:
std::complex is gone

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39835

Reviewed By: gchanan

Differential Revision: D22639834

Pulled By: anjali411

fbshipit-source-id: 57da43d4e6c82261b1f9e5b876f1bbbdf9ae56ca
2020-07-22 11:08:17 -07:00
ca68dc7fa2 replace std::clamp with shim (#41855)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41855

replace std::clamp with shim

Test Plan: test_op_nnpi_fp16.py covers the testing.

Reviewed By: hyuen

Differential Revision: D22667645

fbshipit-source-id: 5e7c94b499f381bde73f1984a6f0d01fb962a671
2020-07-22 11:06:36 -07:00
b87f0e5085 Add NCCL Alltoall to PT NCCL process group (#39984)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39984

Add Alltoall and Alltoallv to PT NCCL process group using NCCL Send/Recv.

Reviewed By: jiayisuse

Differential Revision: D20781624

fbshipit-source-id: 109436583ff69a3fea089703d32cfc5a75f973e0
2020-07-22 10:55:51 -07:00
2da8c8df08 [quant] Reaname from quantized... to ...quantized_cpu in the native_functions.yaml (#41071)
Summary:
Issue https://github.com/pytorch/pytorch/issues/40315

Rename from `quantized...` to `...quantized_cpu` in the native_functions.yaml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41071

Reviewed By: z-a-f

Differential Revision: D22487087

Pulled By: jerryzh168

fbshipit-source-id: f0d12907967739794839c1ffea44e78957f50b9b
2020-07-22 10:45:41 -07:00
f03156f9df replace blacklist in caffe2/python/onnx/frontend.py (#41777)
Summary:
Close https://github.com/pytorch/pytorch/issues/41712

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41777

Reviewed By: izdeby

Differential Revision: D22648532

Pulled By: yinghai

fbshipit-source-id: 7f4c9f313e2887e70bb4eb1ab037aea6b549cec7
2020-07-22 10:02:16 -07:00
5152633258 [ROCm] update hip library name (#41813)
Summary:
With the transition to hipclang, the HIP runtime library name was changed. A symlink was added to ease the transition, but it is going to be removed. Conditionally set the library name based on the HIP compiler used. Patch the gloo submodule as part of the build_amd.py script until its associated fix is available.

CC ezyang xw285cornell sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41813

Reviewed By: zhangguanheng66

Differential Revision: D22660077

Pulled By: xw285cornell

fbshipit-source-id: c538129268d9947535b34523201f655b13c9e0a3
2020-07-22 09:42:45 -07:00
9fbcfe848b Automated submodule update: FBGEMM (#41814)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 139c6f2292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41814

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: dskhudia

Differential Revision: D22648844

fbshipit-source-id: 4cfa8d83585407f870ea2bdee74e1c1f371082eb
2020-07-22 09:38:15 -07:00
71aad6ea66 Revert "port masked_select from TH to ATen and optimize perf on CPU (#33269)" (#41828)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41828

This reverts commit fe66bdb498efe912d8b9c437a14efa4295c04fdd.

This also makes changes to THTensorEvenMoreMath because sumall was removed; see THTensor_wrap.

Test Plan: Imported from OSS

Reviewed By: orionr

Differential Revision: D22657473

Pulled By: malfet

fbshipit-source-id: 95a806cedf1a3f4df91e6a21de1678252b117489
2020-07-22 09:28:04 -07:00
fd62847eb2 cross_layer_equalization (#41685)
Summary:
The goal is to implement cross layer equalization as described in section 4.1 in this paper: https://arxiv.org/pdf/1906.04721.pdf
Given two adjacent submodules A and B in a trained model, quantization might hurt one of the submodules more than the other. The paper poses the idea that a loss in accuracy from quantizing can be due to a difference in the channel ranges between the two submodules (the output channel range of A can be small, while the input channel range of B can be large). To minimize this source of error, we want to scale the tensors of A and B such that their channel ranges are equal, eliminating the disparity.

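To make the rescaling concrete, here is a minimal sketch for a fully connected pair, assuming `w1`/`b1` are A's weight and bias and `w2` is B's weight (the actual pass also handles conv modules; these names are illustrative):

```python
import torch

def equalize_pair(w1, b1, w2, eps=1e-8):
    # w1: [out, in] weight of A; b1: [out] bias of A; w2: [out2, out] weight of B
    r1 = w1.abs().max(dim=1).values       # per-output-channel range of A
    r2 = w2.abs().max(dim=0).values       # per-input-channel range of B
    s = torch.sqrt(r1 * r2) / (r2 + eps)  # s_i = sqrt(r1_i * r2_i) / r2_i
    # After scaling, channel i has range sqrt(r1_i * r2_i) on both sides.
    return w1 / s.unsqueeze(1), b1 / s, w2 * s.unsqueeze(0)
```
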
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41685

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D22630219

Pulled By: edmundw314

fbshipit-source-id: ccc91ba12c10b652d7275222da8b85455b8a7cd5
2020-07-22 08:39:23 -07:00
fced54aa67 [RPC tests] Fix test_init_(rpc|pg)_then_(rpc|pg) not shutting down RPC (#41558)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41558

The problem was due to the non-deterministic destruction order of two global static variables: the mutexes used by glog and the RPC agent (which was still set because we didn't call `rpc.shutdown()`). When the TensorPipe RPC agent shuts down, some callbacks may fire with an error and thus attempt to log something. If the mutexes have already been destroyed, this causes a SIGABRT.

Fixes https://github.com/pytorch/pytorch/issues/41474
ghstack-source-id: 108231453

Test Plan: Verified in https://github.com/pytorch/pytorch/issues/41474.

Reviewed By: fmassa

Differential Revision: D22582779

fbshipit-source-id: 63e34d8a020c4af996ef079cfb7041b2474e27c9
2020-07-22 06:33:19 -07:00
e17e55831d [pytorch] disable per-op profiling for internal mobile build (#41825)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41825

Add flag to gate D21374246 (e7a09b4d17) to mitigate mobile size regression.
ghstack-source-id: 108212047

Test Plan: CI

Reviewed By: linbinyu

Differential Revision: D22650708

fbshipit-source-id: ac9318af824ac31f519b7d5b4fe72df892d8d3f9
2020-07-22 03:02:21 -07:00
825a387ea2 Fix bug on the backpropagation of LayerNorm when create_graph=True (#41595)
Summary:
Solve an issue https://github.com/pytorch/pytorch/issues/41332

I found the bug at https://github.com/pytorch/pytorch/issues/41332 is caused by LayerNorm.

Current implementations of LayerNorm have a disparity between
1. [`create_graph=False` CUDA implementation](dde3d5f4a8/aten/src/ATen/native/cuda/layer_norm_kernel.cu (L145))
2. [`create_graph=True` implementation](dde3d5f4a8/tools/autograd/templates/Functions.cpp (L2536))

With this bug-fix, https://github.com/pytorch/pytorch/issues/41332 is solved.

Ailing BIT-silence

Signed-off-by: Vinnam Kim <vinnamkim@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41595

Reviewed By: houseroad

Differential Revision: D22598415

Pulled By: BIT-silence

fbshipit-source-id: 63e390724bd935dc8e028b4dfb75d34a80558c3a
2020-07-22 00:19:12 -07:00
5c9918e757 Fix row-wise sparse SparseLengthSum and sparse adagrad fused operator (#41818)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41818

Fix row-wise sparse SparseLengthSum and sparse adagrad fused operator

Reviewed By: jianyuh

Differential Revision: D22345013

fbshipit-source-id: 7c2d6c506b404f15a7aa8f1d0ccadb82e515a4c3
2020-07-21 19:32:16 -07:00
a0f2a5625f [quant][graphmode][fix] Make it work with CallMethod on non-Module objects (#41576)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41576

Previously we are assuming CallMethod only happens on module instances,
but it turns out this is not true, this PR fixes this issue.

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D22592789

fbshipit-source-id: 48217626d9ea8e82536f00a296b8f9a471582ebe
2020-07-21 19:03:40 -07:00
ce8c7185de Add unittests to Comparison Operator Kernels in BinaryOpsKernel.cpp (#41809)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41809

Add new unittests to Operator Kernels.
Explicitly announce function type in tests because it can't be inferred.

Test Plan: CI

Reviewed By: malfet

Differential Revision: D22647221

fbshipit-source-id: ef2f0e8c847841e90aa26d028753f23c8c53d6b0
2020-07-21 18:26:53 -07:00
302e566205 add max_and_min function and cpu kernel to speed up observers (#41570)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41570

For min/max based quantization observers, calculating min and max of a tensor
takes most of the runtime. Since the calculation of min and max is done
on the same tensor, we can speed this up by only reading the tensor
once, and reducing with two outputs.

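The single-read idea, sketched in plain Python for a 1-D tensor (the PR implements this as a vectorized CPU reduction kernel; this loop only illustrates the access pattern):

```python
import torch

def min_and_max(x):
    # One pass over the data with two accumulators, versus x.min()
    # plus x.max(), which reads the tensor twice.
    vals = x.view(-1).tolist()
    lo = hi = vals[0]
    for v in vals[1:]:
        if v < lo:
            lo = v
        elif v > hi:
            hi = v
    return lo, hi

x = torch.randn(1000)
assert min_and_max(x) == (x.min().item(), x.max().item())
```
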
One question I had is whether we should put this into the quantization
namespace, since the use case is pretty specific.

This PR implements the easier CPU path to get an initial validation.
There is some needed additional work in future PRs, which durumu will
take a look at:
* CUDA kernel and tests
* making this work per channel
* benchmarking on observer
* benchmarking impact on QAT overhead

Test Plan:
```
python test/test_torch.py TestTorch.test_min_and_max
```

quick bench (not representative of real world use case):
https://gist.github.com/vkuzo/7fce61c3456dbc488d432430cafd6eca
```
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=1 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.0390) tensor(-5.4485) tensor([-5.4485,  5.0390])
min and max separate 11.90243935585022
min and max combined 6.353186368942261
% decrease 0.466228209277153
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=4 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.5586) tensor(-5.3983) tensor([-5.3983,  5.5586])
min and max separate 3.468616485595703
min and max combined 1.8227086067199707
% decrease 0.4745142294372342
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=8 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.2146) tensor(-5.2858) tensor([-5.2858,  5.2146])
min and max separate 1.5707778930664062
min and max combined 0.8645427227020264
% decrease 0.4496085496757899
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D22589349

fbshipit-source-id: c2e3f1b8b5c75a23372eb6e4c885f842904528ed
2020-07-21 18:16:22 -07:00
9e0c746b15 Augmenting Concrete Observer Constructors to Support Dynamic Quantization Range; Modifying Utility Functions in _LearnableFakeQuantize Module for Better Logging and Baseline Construction. (#41815)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41815

**All are minor changes to enable better simulations.**

The constructors of MinMaxObserver, MovingAverageMinMaxObserver, PerChannelMinMaxObserver, and MovingAveragePerChannelMinMaxObserver are augmented so they can utilize the dynamic quantization range support in the _ObserverBase class.

In addition, minor adjustments are made to the enable_static_observation function, which allows the observer to update parameters without fake quantizing the output (for constructing a baseline).

Test Plan:
To ensure this modification is still backward compatible with past usages, numerics are verified by running the quantization unit test suite, which contains various observer tests. The following command executes the test suite, which also verifies the observer numerics:
```
buck test //caffe2/test:quantization -- observer
```

Reviewed By: z-a-f

Differential Revision: D22649128

fbshipit-source-id: 32393b706f9b69579dc2f644fb4859924d1f3773
2020-07-21 17:59:40 -07:00
60e2baf5e0 [doc] Add LSTM non-deterministic workaround (#40893)
Summary:
Related: https://github.com/pytorch/pytorch/issues/35661

Preview
![image](https://user-images.githubusercontent.com/24860335/86861581-4b4c7100-c07c-11ea-950a-3145bfae9af9.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40893

Reviewed By: vincentqb

Differential Revision: D22535418

Pulled By: ngimel

fbshipit-source-id: f194ddaff8ec6d03a3616c87466e2cbbe7e429a9
2020-07-21 16:20:02 -07:00
941069ca09 [tensorexpr][trivial] Remove debug printing from test (#41806)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41806

Generally a good practice not to have tests spew output.

Test Plan:
`build/bin/test_tensorexpr`

Imported from OSS

Reviewed By: zheng-xq

Differential Revision: D22646833

fbshipit-source-id: 444e883307d058fe77e7550d436fa61b7d91a701
2020-07-21 15:54:31 -07:00
7ffdd765c8 [TensorExpr] more convenient outer Rfactor output (#40050)
Summary:
Automatically fuse the output loops of outer Rfactors so the result is in a more convenient format for binding GPU axes.

An example:
```
  Tensor* c = Reduce("sum", {}, Sum(), b, {{m, "m"}, {n, "n"}, {k, "k"}});
  LoopNest loop({c});
  std::vector<For*> loops = loop.getLoopStmtsFor(c);
  auto v = loops.at(0)->var();
  loop.rfactor(c->body(), v);
```
Before:
```
{
  Allocate(tmp_buf, float, {m});
  sum[0] = 0.f;
  for (int m_1 = 0; m_1 < m; m_1++) {
    tmp_buf[m_1] = 0.f;
  }
  for (int m_1 = 0; m_1 < m; m_1++) {
    for (int n = 0; n < n_1; n++) {
      for (int k = 0; k < k_1; k++) {
        tmp_buf[m_1] = (tmp_buf[m_1]) + (b[((n_1 * m_1) * k_1 + k) + k_1 * n]);
      }
    }
  }
  for (int m_1 = 0; m_1 < m; m_1++) {
    sum[0] = (sum[0]) + (tmp_buf[m_1]);
  }
  Free(tmp_buf);
}
```

After:
```
{
  sum[0] = 0.f;
  for (int m = 0; m < m_1; m++) {
    Allocate(tmp_buf, float, {m_1});
    tmp_buf[m] = 0.f;
    for (int n = 0; n < n_1; n++) {
      for (int k = 0; k < k_1; k++) {
        tmp_buf[m] = (tmp_buf[m]) + (b[((n_1 * m) * k_1 + k) + k_1 * n]);
      }
    }
    sum[0] = (sum[0]) + (tmp_buf[m]);
    Free(tmp_buf);
  }
}
```

The existing Rfactor tests cover this case, although I did rename a few for clarity. This change broke the LLVMRFactorVectorizedReduction test because it now does what it's intended to (vectorize a loop with a reduction in it) rather than nothing, and since that doesn't work, it correctly fails. I've disabled it for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40050

Reviewed By: ZolotukhinM

Differential Revision: D22605639

Pulled By: nickgg

fbshipit-source-id: e359be53ea62d9106901cfbbc42d55d0e300e8e0
2020-07-21 14:44:26 -07:00
dac393fa24 [PT] enforce duplicate op name check on mobile
Summary: Enforce duplicate op name check on mobile

Test Plan: run full/lite predictor

Reviewed By: iseeyuan

Differential Revision: D22639758

fbshipit-source-id: 2993c4bc1b14c833b273183f4f343ffad62121b3
2020-07-21 13:14:17 -07:00
62f4f87914 Removed whitelist reference from tools/clang_format_ci.sh (#41636)
Summary:
Removed whitelist and blacklist references
Fixes https://github.com/pytorch/pytorch/issues/41753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41636

Reviewed By: SplitInfinity

Differential Revision: D22648632

Pulled By: suo

fbshipit-source-id: d22130a7cef96274f3fc73d00b50327dfcae332c
2020-07-21 12:32:14 -07:00
1ad7160a59 fix backward compat (#41810)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41810

Reviewed By: malfet

Differential Revision: D22647763

Pulled By: albanD

fbshipit-source-id: 8ce70ecb706bb98ed24b0b3e7e9ebf3d4c270964
2020-07-21 12:14:55 -07:00
03186a86d9 Add test dependencies to CONTRIBUTING.md (#41799)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41799

Reviewed By: zhangguanheng66

Differential Revision: D22645323

Pulled By: zou3519

fbshipit-source-id: 0a695bffb57b29024461472dd1c8518a9a0d1d3b
2020-07-21 11:29:38 -07:00
341c4045df replaced blacklist with blocklist in test/test_type_hints.py (#41644)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41719.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41644

Reviewed By: zhangguanheng66

Differential Revision: D22645479

Pulled By: zou3519

fbshipit-source-id: 82710acae96ab508b8e9198dadb7d7911cb97235
2020-07-21 11:23:19 -07:00
46808b49a8 Change whitelist to allow in file test_quantized_op.py (#41771)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41771

Reviewed By: zhangguanheng66

Differential Revision: D22641463

Pulled By: SplitInfinity

fbshipit-source-id: 1a60af8d43ccdf1f35dc84dbf4a7bc64965eb44a
2020-07-21 11:08:07 -07:00
72a1146339 Skip warning 4522 with MSVC (#41648)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41648

Reviewed By: zhangguanheng66

Differential Revision: D22644623

Pulled By: malfet

fbshipit-source-id: 7fb86f05b3d8cd6a4c7c0e3fdfd651b70a5094c9
2020-07-21 09:47:30 -07:00
2da2b5c081 update CONTRIBUTING.md for ccache (#41619)
Summary:
ccache now use cmake for building, update installation script.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41619

Reviewed By: zhangguanheng66

Differential Revision: D22644594

Pulled By: malfet

fbshipit-source-id: f894dd408822231f8aab36efbce188f06f004057
2020-07-21 09:43:30 -07:00
523f80e894 .circleci: Remove docker_hub_index_job, wasn't used (#41800)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41800

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: soumith

Differential Revision: D22645363

Pulled By: seemethere

fbshipit-source-id: 35ed43ed5fb4053f71dc9525c4ed62f1c60eacc1
2020-07-21 09:16:02 -07:00
1f11e930d0 [ROCm] skip test_streams on rocm. (#41697)
Summary:
Skipping the test test_streams as it is flaky on rocm.
cc: jeffdaily  sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41697

Reviewed By: zhangguanheng66

Differential Revision: D22644600

Pulled By: malfet

fbshipit-source-id: b1b16d496e58a91c44c40d640851fd62a5d7393d
2020-07-21 08:55:07 -07:00
48569cc330 Reland split (#41567)
Summary:
Take 3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41567

Reviewed By: zou3519

Differential Revision: D22586331

Pulled By: albanD

fbshipit-source-id: ca08199da716d64a335455610edbce752fee224b
2020-07-21 08:06:27 -07:00
c89c294ef9 Add Unflatten Module (#41564)
Summary:
This PR implements a feature extension discussed in https://github.com/pytorch/pytorch/issues/41516.

I followed PR https://github.com/pytorch/pytorch/issues/22245 to add this module. While I was at it, I also added the `extra_repr()` method in `Flatten`, which was missing.

I see there are no unit tests for these modules. Should I add those too? If so, what is the best place I should place these?

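A quick usage sketch of the new module (the shape arguments here are arbitrary):

```python
import torch
import torch.nn as nn

m = nn.Sequential(nn.Linear(16, 64), nn.Unflatten(1, (4, 4, 4)))
x = torch.randn(2, 16)
print(m(x).shape)  # torch.Size([2, 4, 4, 4])
```
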
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41564

Reviewed By: gchanan

Differential Revision: D22636766

Pulled By: albanD

fbshipit-source-id: f9efdefd3ffe7d9af9482087625344af8f990943
2020-07-21 07:43:02 -07:00
fe415589a9 disable mkl for expm1 (#41654)
Summary:
On some systems/mkl versions it produces expm1(nan)=-1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41654

Reviewed By: mruberry

Differential Revision: D22621333

Pulled By: ngimel

fbshipit-source-id: 84544679fe96aed7de6873dce6f31f488e5e35dd
2020-07-20 23:40:17 -07:00
65bd38127a GLOO process group GPU alltoall (#41690)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41690

Gloo alltoall for GPU

Test Plan: buck test mode/dev-nosan caffe2/torch/lib/c10d:ProcessGroupGlooTest

Reviewed By: osalpekar

Differential Revision: D22631554

fbshipit-source-id: 4b126d9d991a118f3925c005427f399fc60f92f7
2020-07-20 19:01:12 -07:00
5c50cb567c Generalized Learnable Fake Quantizer Module (#41535)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41535

A generalized fake quantization module is built to support lower-bit fake quantization with back propagation on the scale and zero point. The module supports both per tensor and per channel fake quantization.

Test Plan:
Please see diff D22337313 for a related experiment performed on the fake quantizer module.

The `_LearnableFakeQuantize` module supports the following use cases:
- Per Tensor Fake Quantization or Per Channel Fake Quantization
- Static Estimation from Observers or Quantization Parameter Learning through Back Propagation

By default, the module assumes per tensor affine fake quantization. To switch to per channel, during initialization, declare `channel_size` with the appropriate length. To toggle between utilizing static estimation and parameter learning with back propagation, you can invoke the call `enable_param_learning` or `enable_static_estimate`. For more information on the flags that support these operations, please see the doc string of the `_LearnableFakeQuantize` module.

The `_LearnableFakeQuantize` module relies on 2 operators for its forward and backward paths: `_LearnableFakeQuantizePerTensorOp` and `_LearnableFakeQuantizePerChannelOp`. The backpropagation routine is developed based on the following literature (a sketch follows the references below):
- Learned Step Size Quantization: https://openreview.net/pdf?id=rkgO66VKDS
- Trained Quantization Thresholds: https://arxiv.org/pdf/1903.08066.pdf

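A minimal eager-mode sketch of the idea for the per-tensor case, assuming a straight-through estimator around the rounding so gradients reach the scale and zero point (illustrative only; it does not reproduce the diff's dedicated operators):

```python
import torch

def learnable_fake_quant(x, scale, zero_point, qmin, qmax):
    inv = x / scale + zero_point
    q = inv + (torch.round(inv) - inv).detach()  # STE around round()
    q = torch.clamp(q, qmin, qmax)
    return (q - zero_point) * scale

scale = torch.tensor(0.1, requires_grad=True)
zp = torch.tensor(0.0, requires_grad=True)
x = torch.randn(8, requires_grad=True)
learnable_fake_quant(x, scale, zp, 0, 255).sum().backward()
print(scale.grad, zp.grad)  # both parameters receive gradients
```
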
Reviewed By: z-a-f

Differential Revision: D22573645

fbshipit-source-id: cfd9ece8a959ae31c00d9beb1acf9dfed71a7ea1
2020-07-20 18:24:21 -07:00
3a9a64a4da Add non zero offset test cases for Quantize and Dequantize Ops. (#41693)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41693

Add non zero offset test cases for Quantize and Dequantize Ops.

Test Plan: Added new test case test_int8_non_zero_offset_quantize part of the test_int8_ops_nnpi.py test file.

Reviewed By: hyuen

Differential Revision: D22633796

fbshipit-source-id: be17ee7a0caa6e9bc7b175af539be2e6625ad47a
2020-07-20 16:03:32 -07:00
1039bbf4eb add named parameters to mobile module (#41376)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41376

torch::jit::mobile::Module does not currently support accessing parameters via their attribute names, but torch::jit::Module does. This diff adds equivalent functionality to mobile::Module.

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D22609142

Pulled By: ann-ss

fbshipit-source-id: 1a5272ff336f99a3c0bb6194c6a6384754f47846
2020-07-20 15:57:49 -07:00
30551ea7b2 Update NCCL from 2.4.8 to 2.7.3 (#41608)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41608

Reviewed By: mrshenli, ngimel

Differential Revision: D22604953

Pulled By: malfet

fbshipit-source-id: 28151e2d5b6ea360b79896cb79c761756687d121
2020-07-20 13:21:47 -07:00
f07816003a [2/n][Compute Meta] support analysis for null flag features
Summary:
## TLDR
Support using NaN default value for missing dense features in RawInputProcessor for DPER2. In preparation for subsequent support for null flag features in compute meta. For train_eval this is already supported in DPER3 and we do not plan to support this in DPER2 train eval.

Differential Revision: D22439142

fbshipit-source-id: 99ae9755bd41a5d5f43bf5a9a2819d64f3883005
2020-07-20 13:13:45 -07:00
897cabc081 Add operators for smart keyboard to lite interpreter (#41539)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41539

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D22574746

Pulled By: ann-ss

fbshipit-source-id: 3e2b78385149d7bde2598c975e60845a766ef86a
2020-07-20 12:08:58 -07:00
de400fa5ac [JIT] handle specially mapped ops (#41503)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41503

Fix for https://github.com/pytorch/pytorch/issues/41192

We can map fill_ and zero_ to their functional equivalents full_like and zeros_like

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D22629269

Pulled By: eellison

fbshipit-source-id: f1c62684dc55682c0b3845022e0461ec77d07179
2020-07-20 12:03:31 -07:00
6161730174 [JIT] move remove mutation to its own test file (#41502)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41502

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D22629270

Pulled By: eellison

fbshipit-source-id: fcec6ae4ff8f108164539d67427ef3d72fa07494
2020-07-20 12:03:28 -07:00
cfcee816f1 .circleci: Prefix docker jobs with docker- (#41689)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41689

It's annoying not to know which jobs are actually related to docker
builds so let's just add the prefix.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22631578

Pulled By: seemethere

fbshipit-source-id: ac0cdd983ccc3bebcc360ba479b378d8f0eaa9c0
2020-07-20 12:00:53 -07:00
cc3c18edbc More LayerNorm Vectorization in calcMeanStd function. (#41618)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41618

More LayerNorm Vectorization in calcMeanStd function.

Test Plan: test covered in test_layernorm_nnpi_fp16.py

Reviewed By: hyuen

Differential Revision: D22606585

fbshipit-source-id: be773e62f0fc479dbc2d6735f60c2e98441916e9
2020-07-20 11:55:54 -07:00
26bbbeaea4 [DOCS] Fix the docs for the inputs arg of trace_module func (#41586)
Summary:
Fix the docs for the `inputs` arg of `trace_module` func.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41586

Reviewed By: ezyang

Differential Revision: D22598453

Pulled By: zou3519

fbshipit-source-id: c2d182238b5a51f6d0a7d0683372d72a239146c5
2020-07-20 10:57:56 -07:00
ce443def01 Grammar patch 1 (.md) (#41599)
Summary:
A minor spell check!
I have gone through a dozen .md files to fix typos.
zou3519 take a look!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41599

Reviewed By: ezyang

Differential Revision: D22601629

Pulled By: zou3519

fbshipit-source-id: 68d8f77ad18edc1e77874f778b7dadee04b393ef
2020-07-20 10:19:08 -07:00
6769b850b2 Remove needless test duplication (#41583)
Summary:
The test loops over `upper` but does not use it, effectively running the same test twice, which increases test times for no gain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41583

Reviewed By: soumith, seemethere, izdeby

Differential Revision: D22598475

Pulled By: zou3519

fbshipit-source-id: d100f20143293a116ff3ba08b0f4eaf0cc5a8099
2020-07-20 10:14:11 -07:00
16dde6e3a0 Augmenting Observers to Support Dynamic Quantization Range (#41113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41113

In this diff, the `ObserverBase` class is augmented with 2 additional optional arguments qmin and qmax. Correspondingly the calculation of qmin and qmax and the related quantization parameters are modified to accommodate this additional flexibility should the number of bits for quantization be lower than 8 (the default value).

Additional logic in the base class `_calculate_qparams` function has also been modified to provide support for dynamic quantization range.

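A minimal sketch of the affine qparams math with a caller-supplied range (names and defaults here are illustrative, e.g. (0, 15) for unsigned 4-bit, not the exact diff):

```python
def calculate_qparams(min_val, max_val, qmin=0, qmax=15):
    # qmin/qmax now come from the caller instead of fixed
    # 8-bit defaults; the range must always include zero.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    scale = max((max_val - min_val) / float(qmax - qmin), 1e-8)
    zero_point = int(round(qmin - min_val / scale))
    return scale, min(max(zero_point, qmin), qmax)
```
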
Test Plan:
To ensure this modification is still backward compatible with past usages, numerics are verified by running the quantization unit test suite, which contains various observer tests. The following command executes the test suite, which also verifies the observer numerics:

`buck test //caffe2/test:quantization -- observer`

This modified observer script can be tested within the experiments for lower bit fake quantization. Please see the following diffs for reference.
- Single Fake Quantizer: D22337447
- Single Conv Layer: D22338532

Reviewed By: z-a-f

Differential Revision: D22427134

fbshipit-source-id: f405e633289322078b0f4a417f54b684adff2549
2020-07-20 08:51:31 -07:00
9600ed9af3 typo fixes (#41632)
Summary:
typo fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41632

Reviewed By: ezyang

Differential Revision: D22617827

Pulled By: mrshenli

fbshipit-source-id: c2bfcb7cc36913a8dd32f13fc9adc3aa0a9b682f
2020-07-20 07:23:00 -07:00
bd42e1a082 Doc language fixes (#41643)
Summary:
Updates doc for abs, acos, and isinf for clarity and consistency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41643

Reviewed By: ngimel

Differential Revision: D22622957

Pulled By: mruberry

fbshipit-source-id: 040f01b4e101153098577bf10dcd569b679aae2c
2020-07-19 21:31:51 -07:00
a69a262810 workaround segfault in deviceGuard construction (#41621)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41621

Per title. In some situations, the deviceGuard constructor in mul_kernel_cuda segfaults, so construct the deviceGuard conditionally, only when the first argument is a scalar.
This does not root-cause why the deviceGuard constructor segfaults, so the issue might come back.

Test Plan: pytorch oss CI

Reviewed By: jianyuh

Differential Revision: D22616460

fbshipit-source-id: b91bbe55c6eb0bbe80b8d6a61c41f09288752658
2020-07-18 23:41:43 -07:00
4a3aad354a [1/N] Implement Enum JIT support (#41390)
Summary:
* Add EnumType and AnyEnumType as first-class jit type
* Add Enum-typed IValue
* Enhanced aten::eq to support Enum

Supported:
Enum-typed function arguments
Using Enum types and comparing them (see the sketch below)

TODO:
Add Python sugared value for Enum
Support getting name/value attrs of enums
Support Enum-typed return values
Support enum values of different types in same Enum class
Support serialization and deserialization

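A minimal sketch of what the stack enables once complete, assuming ordinary `enum.Enum` classes can be scripted directly:

```python
from enum import Enum
import torch

class Color(Enum):
    RED = 1
    GREEN = 2

@torch.jit.script
def same_color(x: Color, y: Color) -> bool:
    return x == y  # aten::eq enhanced to compare Enum values
```
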
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41390

Reviewed By: eellison

Differential Revision: D22524388

Pulled By: gmagogsfm

fbshipit-source-id: 1627154a64e752d8457cd53270f3d14aea4b1150
2020-07-18 22:15:06 -07:00
46eb8d997c Revert D22533824: [PT] add check for duplicated op names in JIT
Test Plan: revert-hammer

Differential Revision:
D22533824 (d72c9f4200)

Original commit changeset: b36884531d41

fbshipit-source-id: 8bf840a09b4001cc68858a5dc3540505a0e1abdc
2020-07-18 17:26:42 -07:00
c7bcb285f3 Makes elementwise comparison docs more consistent (#41626)
Summary:
- Removes outdated language like "BoolTensor"
- Consistently labels keyword arguments, like out
- Uses a more natural string to describe their return type
- A few bonus fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41626

Reviewed By: ngimel

Differential Revision: D22617322

Pulled By: mruberry

fbshipit-source-id: 03cc3562b78a07ed30bd1dc7936d7a4f4e31f01d
2020-07-18 16:30:59 -07:00
e7a09b4d17 RecordFunction in Dispatcher (#37587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37587

Lifting RecordFunction up into the dispatcher code

Test Plan: Imported from OSS

Differential Revision: D21374246

fbshipit-source-id: 19f9c1719e6fd3990e451c5bbd771121e91128f7
2020-07-17 22:20:05 -07:00
c6d0fdd215 torch.isreal (#41298)
Summary:
https://github.com/pytorch/pytorch/issues/38349

mruberry
Not entirely sure if all the changes are necessary in how functions are added to Pytorch.

Should it throw an error when called with a non-complex tensor? NumPy allows non-complex arrays in its imag() function, which is used in its isreal() function, but PyTorch's imag() throws an error for non-complex tensors.

Where does assertONNX() get its expected output to compare to?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41298

Reviewed By: ngimel

Differential Revision: D22610500

Pulled By: mruberry

fbshipit-source-id: 817d61f8b1c3670788b81690636bd41335788439
2020-07-17 22:07:24 -07:00
581e9526bb [GradualGating] support better k value change (#41557)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41557

 - add new learning rate functor "slope"
 - use "slope" learning rate in gated_sparse_feature module

Test Plan:
buck test dper3/dper3/modules/tests:core_modules_test -- test_gated_sparse_features_shape_num_warmup_tensor_k
buck test caffe2/caffe2/python/operator_test:learning_rate_op_test -- test_slope_learning_rate_op

Reviewed By: huayuli00

Differential Revision: D22544628

fbshipit-source-id: f2fcae564e79e1d8bcd3a2305d0c11ca7c0d3b3c
2020-07-17 20:44:28 -07:00
d72c9f4200 [PT] add check for duplicated op names in JIT (#41549)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41549

D22467871 (a548c6b18f) was reverted due to double linking torch_mobile_train.
Re-do this change after D22531358 (7a33d8b001).

Test Plan:
buck install fb4a
Train mnist in Internal Settings.

Reviewed By: iseeyuan

Differential Revision: D22533824

fbshipit-source-id: b36884531d41cea2e76b7fb1a567f21106c612b6
2020-07-17 20:26:48 -07:00
96ac12fdf4 [PT] add overload name for int prim ops (#41578)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41578

A new op aten::gcd(Tensor...) was added while the duplicated op name check was disabled. It's not a prim op, but it has the same name as the prim op aten::gcd(int, int).

It will be safer to enforce that all prim ops have overload names, even if there are no duplicated names right now. People may add tensor ops without overload names in the future.

This diff added the overload name for all ops defined using "DEFINE_INT_OP".

```
aten::__and__
aten::__or__
aten::__xor__
aten::__lshift__
aten::__rshift__
aten::__round_to_zero_floordiv
aten::gcd
```

Test Plan: run full JIT predictor

Reviewed By: iseeyuan

Differential Revision: D22593689

fbshipit-source-id: b3335d356a774d33450a09d0a43ff947197f9b8a
2020-07-17 18:18:38 -07:00
445e7eb01b Add quantized CELU operator by adding additional parameters to quantized ELU (#39199)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39199

Test Plan: Imported from OSS

Differential Revision: D21771202

Pulled By: durumu

fbshipit-source-id: 910de6202fa3d5780497c5bf85208568a09297dd
2020-07-17 17:56:33 -07:00
1734f24276 Revert D22525217: [pytorch][PR] Initial implementation of quantile operator
Test Plan: revert-hammer

Differential Revision:
D22525217 (c7798ddf7b)

Original commit changeset: 27a8bb23feee

fbshipit-source-id: 3beb3d4f8a4d558e993fbdfe977af12c7153afc8
2020-07-17 17:22:48 -07:00
b774ce54f8 remediation of S205607
fbshipit-source-id: 798decc90db4f13770e97cdce3c0df7d5421b2a3
2020-07-17 17:19:47 -07:00
8fdea489af remediation of S205607
fbshipit-source-id: 5113fe0c527595e4227ff827253b7414abbdf7ac
2020-07-17 17:17:03 -07:00
39b4701d31 [caffe2][redo] Reimplement RemoveOpsByType with SSA (#41606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41606

The previous diff (D22220798 (59294fbbb9) and D22220797) was recently reverted (D22492356 (28291d3cf8), D22492355) because of a bug associated with the op AsyncIf. The AsyncIf op has net_defs as args and the SSA rewriting didn't take that into account. It has a special path for the op If, but not for AsyncIf. Several changes I made to fix the bug:
1) Add op AsyncIf to the special path for If op in SSA rewriting
2) clear inputs/outputs of the netdefs that are args in If/AsyncIf ops because they're no longer valid
3) revert renamed inputs/outputs in the arg netdefs that are in the external_outputs in the parent netdef

2) and 3) are existing bugs in the `SsaRewrite` function that were just never exposed before.

The algorithm for `RemoveOpsByType` is the same as in my previous diff D22220798 (59294fbbb9). The only new changes in this diff are in `onnx::SsaRewrite` and a few newly added unit tests.

(Note: this ignores all push blocking failures!)

Reviewed By: yinghai

Differential Revision: D22588652

fbshipit-source-id: ebb68ecd1662ea2bae14d4be8f61a75cd8b7e3e6
2020-07-17 16:06:43 -07:00
349c40507c Revert "[CircleCI] Delete docker image after testing" (#41601)
Summary:
Per AMD request, this reverts commit 1e64bf4c40ef82d6bc3dcc42b3874353f7632be0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41601

Reviewed By: ezyang

Differential Revision: D22603147

Pulled By: malfet

fbshipit-source-id: f423d406601383f26ea83a51f1de37e60b53810e
2020-07-17 14:42:27 -07:00
92b95e5243 Fix NCCL version check when nccl.h in non-standard location. (#40982)
Summary:
The NCCL discovery process fails to compile detect_nccl_version.cc when nccl.h resides in a non-standard location.
Pass `NCCL_INCLUDE_DIRS` to `try_run(... detect_nccl_version.cc)` to fix this.

Can reproduce with Dockerfile ..
```Dockerfile
FROM nvidia/cuda:10.2-cudnn7-devel-ubuntu18.04 as build
WORKDIR /stage

# install conda
ARG CONDA_VERSION=4.7.10
ARG CONDA_URL=https://repo.anaconda.com/miniconda/Miniconda3-${CONDA_VERSION}-Linux-x86_64.sh
RUN cd /stage && curl -fSsL --insecure ${CONDA_URL} -o install-conda.sh &&\
    /bin/bash ./install-conda.sh -b -p /opt/conda &&\
    /opt/conda/bin/conda clean -ya
ENV PATH=/opt/conda/bin:${PATH}

# install prerequisites
RUN conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi

# attempt compile
ENV CUDA_HOME="/usr/local/cuda" \
    CUDNN_LIBRARY="/usr/lib/x86_64-linux-gnu" \
    NCCL_INCLUDE_DIR="/usr/local/cuda/include" \
    NCCL_LIB_DIR="/usr/local/cuda/lib64" \
    USE_SYSTEM_NCCL=1
RUN apt-get -y update &&\
    apt-get -y install git &&\
    cd /stage && git clone https://github.com/pytorch/pytorch.git &&\
    cd pytorch &&\
    git submodule update --init --recursive &&\
    python setup.py bdist_wheel
```

This generates the following error ..
```
-- Found NCCL: /usr/local/cuda/include
-- Determining NCCL version from /usr/local/cuda/include/nccl.h...
-- Looking for NCCL_VERSION_CODE
-- Looking for NCCL_VERSION_CODE - found
CMake Error at cmake/Modules/FindNCCL.cmake:78 (message):
  Found NCCL header version and library version do not match! (include:
  /usr/local/cuda/include, library: /usr/local/cuda/lib64/libnccl.so) Please
  set NCCL_INCLUDE_DIR and NCCL_LIB_DIR manually.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40982

Reviewed By: zou3519

Differential Revision: D22603911

Pulled By: malfet

fbshipit-source-id: 084870375a270fb9c7daf3c2e731992a03614ad6
2020-07-17 13:54:17 -07:00
cf811d2fb3 retain undefined tensors in backward pass (#41490)
Summary:
Leave undefined tensors / None returned from custom backward functions as undefined/None instead of creating a tensor full of zeros. This change improves performance in some cases.

**This is BC-Breaking:** Custom backward functions that return None will now see it potentially being propagated all the way up to AccumulateGrad nodes. Potential impact is that .grad field of leaf tensors as well as the result of autograd.grad may be undefined/None where it used to be a tensor full of zeros. Also, autograd.grad may raise an error, if so, consider using allow_unused=True ([see doc](https://pytorch.org/docs/stable/autograd.html?highlight=autograd%20grad#torch.autograd.grad)) if it applies to your case.

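A short sketch of the new behavior for a custom function with an unused input (all names here are illustrative):

```python
import torch
from torch.autograd import Function

class Double(Function):
    @staticmethod
    def forward(ctx, a, b):  # b is intentionally unused
        return a * 2.0

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out * 2.0, None  # None now stays None, not zeros

a = torch.randn(3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
out = Double.apply(a, b).sum()
ga, gb = torch.autograd.grad(out, (a, b), allow_unused=True)
print(ga, gb)  # gb is None instead of a tensor full of zeros
```
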
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41490

Reviewed By: albanD

Differential Revision: D22578241

Pulled By: heitorschueroff

fbshipit-source-id: f4966f4cb520069294f8c5c1691eeea799cc0abe
2020-07-17 12:42:50 -07:00
a874c1e584 Adds missing abs to lcm (#41552)
Summary:
lcm was missing an abs. This adds it plus extends the test for NumPy compliance. Also includes a few doc fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41552

Reviewed By: ngimel

Differential Revision: D22580997

Pulled By: mruberry

fbshipit-source-id: 5ce1db56f88df4355427e1b682fcf8877458ff4e
2020-07-17 12:29:50 -07:00
0f78e596ba ROCm: Fix linking of custom ops in load_inline (#41257)
Summary:
Previously we did not link against amdhip64 (roughly equivalent to cudart). Apparently, the recent RTDL_GLOBAL fixes prevent the extensions from finding the symbols needed for launching kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41257

Reviewed By: zou3519

Differential Revision: D22573288

Pulled By: ezyang

fbshipit-source-id: 89f9329b2097df26785e2f67e236d60984d40fdd
2020-07-17 12:14:50 -07:00
3c862c80cf Move list size constants for profiler::Event and profiler::ProfilerConfig into (#40474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40474

These constants are unnecessary since there is an enum, and we can add
the size at the end of the enum and it will be equal to the list size. I
believe that this is the typical pattern used to represent enum sizes.
ghstack-source-id: 107969012

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D22147754

fbshipit-source-id: 7064a897a07f9104da5953c2f87b58179df8ea84
2020-07-17 12:00:18 -07:00
fbd960801a [JIT] Replace use of "whitelist" in lower_tuples pass (#41460)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41460

**Test Plan**
Continuous integration.

**Fixes**
This commit partially addresses #41443.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D22544272

Pulled By: SplitInfinity

fbshipit-source-id: b46940d1e24f81756daaace260bad7a1feda1e8f
2020-07-17 11:33:14 -07:00
c2c2c1c106 [JIT] Remove use of "whitelist" in quantization/helper.cpp (#41459)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41459

**Test Plan**
Continuous integration.

**Fixes**
This commit partially addresses #41443.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D22544269

Pulled By: SplitInfinity

fbshipit-source-id: d4bb7c0c9c71e953677a34f0530b66e5119447d0
2020-07-17 11:33:12 -07:00
4f4e3a0f15 [JIT] Replace uses of "whitelist" in jit/_script.py (#41458)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41458

**Test Plan**
Continuous integration.

**Fixes**
This commit partially fixes #41443.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D22544273

Pulled By: SplitInfinity

fbshipit-source-id: 8148e5338f90a5ef19177cf68bf36b56926d5a6c
2020-07-17 11:33:10 -07:00
bf0d0900a7 [JIT] Replace uses of "blacklist" in jit/_recursive.py (#41457)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41457

**Test Plan**
Continuous integration.

**Fixes**
This commit partially addresses #41443.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D22544274

Pulled By: SplitInfinity

fbshipit-source-id: ee74860c48d85d819d46c8b8848960e77bb5013e
2020-07-17 11:33:07 -07:00
758edcd7df [JIT] Replace use of "blacklist" in python/init.cpp (#41456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41456

**Test Plan**
Continuous integration.

**Fixes**
This commit partially addresses #41443.

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D22544270

Pulled By: SplitInfinity

fbshipit-source-id: 649b30e1fcc6516a4def6b148a1da07bc3ce941d
2020-07-17 11:33:05 -07:00
c9bdf474d7 [JIT] Replace use of "blacklist" in xnnpack_rewrite (#41455)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41455

**Test Plan**
Continuous integration.

**Fixes**
This commit partially addresses #41443.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D22544275

Pulled By: SplitInfinity

fbshipit-source-id: 5037b16e6ebc9e3b40dd03d2ce5a0671d7867892
2020-07-17 11:33:03 -07:00
3b7c05b11b [JIT] Replace uses of "blacklist" in gen_unboxing_wrappers.py (#41454)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41454

**Test Plan**
Continuous integration (if this file is still used).

**Fixes**
This commit partially addresses #41443.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D22544271

Pulled By: SplitInfinity

fbshipit-source-id: 84a4d552745fe5163b2e3200103c3b1f2a9ffb2a
2020-07-17 11:33:01 -07:00
f85a27e100 [JIT] Replace "blacklist" in test_jit.py (#41453)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41453

**Test Plan**
`python test/test_jit.py`

**Fixes**
This commit partially addresses #41443.

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D22544268

Pulled By: SplitInfinity

fbshipit-source-id: 8b6b94211a626209c3960fda6c860593148dcbf2
2020-07-17 11:30:27 -07:00
43b1923d98 Enable SLS FP32 accumulation SparseLengthsWeightedSumFused8BitRowwiseFakeFP32NNPI Op. (#41577)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41577

* Remove skipping test
* Use fma_avx_emulation
* Increase test examples to 100

(Note: this ignores all push blocking failures!)

Test Plan: Tests are covered in test_sls_8bit_nnpi.py

Reviewed By: hyuen

Differential Revision: D22585742

fbshipit-source-id: e1f62f47eb10b402b11893ffca7a6786e31daa79
2020-07-17 11:19:47 -07:00
319b20b7db [ONNX] Update ORT version (#41372)
Summary:
Update ORT version [1.4 candidate].

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41372

Reviewed By: houseroad

Differential Revision: D22580050

Pulled By: bzinodev

fbshipit-source-id: c66e3bab865b3221d52eea30db48e0870ae5b681
2020-07-17 11:17:17 -07:00
346c69a626 [ONNX] Export embedding_bag (#41234)
Summary:
Enable export of embedding_bag op to ONNX

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41234

Reviewed By: houseroad

Differential Revision: D22567470

Pulled By: bzinodev

fbshipit-source-id: 2fcf74e54f3a9dee4588d7877a4ac9eb6c2a3629
2020-07-17 11:11:43 -07:00
7eb71b4beb Profiler: Do not record zero duration kernel events (#41540)
Summary:
Changes in the ROCm runtime have improved hipEventRecord.  The events no longer take ~4 usec to execute on the gpu stream; instead they appear instantaneous.  If you record two events with no other activity in between, they will have the same timestamp and the elapsed duration will be 0.

The profiler uses hip/cuda event pairs to infer gpu execution times.  It wraps functions whether they send work to the gpu or not.  Functions that send no gpu work will show as having zero duration.  Also they will show as running at the same time as neighboring functions.  On a trace, all those functions combine into a 'call stack' that can be tens of functions tall (when indeed they should be sequential).

This patch suppresses recording the zero duration 'kernel' events, leaving only the CPU execution part.  This means functions that do not use the GPU do not get an entry for how long they were using the GPU, which seems reasonable.  This fixes the 'stacking' on traces.  It also improves the signal-to-noise ratio of the GPU trace beyond what was available previously.

This patch will not affect CUDA or legacy ROCm, as those are not able to 'execute' eventRecord markers instantaneously.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41540

Reviewed By: zou3519

Differential Revision: D22597207

Pulled By: albanD

fbshipit-source-id: 5e89de2b6d53888db4f9dbcb91a94478cde2f525
2020-07-17 11:03:43 -07:00
324c18fcad fix division by low precision scalar (#41446)
Summary:
Before, the inverse for division by a scalar was calculated in the precision of the non-scalar operands, which can lead to underflow:
```
>>> x = torch.tensor([3388.]).half().to(0)
>>> scale = 524288.0
>>> x.div(scale)
tensor([0.], device='cuda:0', dtype=torch.float16)
>>> x.mul(1. / scale)
tensor([0.0065], device='cuda:0', dtype=torch.float16)
```
This PR makes results of multiplication by inverse and division the same.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41446

Reviewed By: ezyang

Differential Revision: D22542872

Pulled By: ngimel

fbshipit-source-id: b60e3244809573299c2c3030a006487a117606e9
2020-07-17 10:41:28 -07:00
5d7046522b [JIT] Teach IRPrinter and IRParser to handle 'requires_grad' and 'device' as a part of type info. (#41507)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41507

These fields have always been a part of tensor types; this change just
makes them serializable through IR dumps.

Test Plan: Imported from OSS

Reviewed By: Krovatkin, ngimel

Differential Revision: D22563661

Pulled By: ZolotukhinM

fbshipit-source-id: f01aaa130b7e0005bf1ff21f65827fc24755b360
2020-07-17 10:27:04 -07:00
241bc648c9 Adding missing setting state_.ptr() and hook_.ptr() to nullptr. (#41537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41537

Explicitly setting PyObject* state_ and hook_ to nullptr to prevent py::object's dtor to decref on the PyObject again.
Reference PR [#40848](https://github.com/pytorch/pytorch/pull/40848).
ghstack-source-id: 107959254

Test Plan: `python test/distributed/test_c10d.py`

Reviewed By: zou3519

Differential Revision: D22573858

fbshipit-source-id: 84cc5949a370ffdb4ac3ca7a16a6f0f136563c1c
2020-07-17 10:21:03 -07:00
c7798ddf7b Initial implementation of quantile operator (#39417)
Summary:
Implementing the quantile operator similar to [numpy.quantile](https://numpy.org/devdocs/reference/generated/numpy.quantile.html).

For this implementation I'm reducing it to existing torch operators to get free CUDA implementation. It is more efficient to implement multiple quickselect algorithm instead of sorting but this can be addressed in a future PR.

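Reduced to a flattened input with linear interpolation (NumPy's default), the sort-based approach looks roughly like this:

```python
import torch

def quantile_1d(x, q):
    vals, _ = x.flatten().sort()
    pos = q * (vals.numel() - 1)  # fractional index into the sorted values
    lo = int(pos)
    hi = min(lo + 1, vals.numel() - 1)
    frac = pos - lo
    return vals[lo] * (1.0 - frac) + vals[hi] * frac

x = torch.randn(101)
print(quantile_1d(x, 0.5) == x.median())  # True: element 50 of 101 sorted values
```
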
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39417

Reviewed By: mruberry

Differential Revision: D22525217

Pulled By: heitorschueroff

fbshipit-source-id: 27a8bb23feee24fab7f8c228119d19edbb6cea33
2020-07-17 10:15:57 -07:00
71fdf748e5 Add torch.atleast_{1d/2d/3d} (#41317)
Summary:
https://github.com/pytorch/pytorch/issues/38349

TODO:
 * [x] Docs
 * [x] Tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41317

Reviewed By: ngimel

Differential Revision: D22575456

Pulled By: mruberry

fbshipit-source-id: cc79f4cd2ca4164108ed731c33cf140a4d1c9dd8
2020-07-17 10:10:41 -07:00
840ad94ef5 Add reference documentation for torch/library.h (#41470)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41470

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D22577426

Pulled By: ezyang

fbshipit-source-id: 4bfe5806061e74181a74d161c868acb7c1ecd1e4
2020-07-17 10:05:16 -07:00
1e230a5c52 rewrite C++ __torch_function__ handling to work with TensorList operands (#41575)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41575

Fixes https://github.com/pytorch/pytorch/issues/34294

This updates the C++ argument parser to correctly handle `TensorList` operands. I've also included a number of updates to the testing infrastructure, this is because we're now doing a much more careful job of testing the signatures of aten kernels, using the type information about the arguments as read in from `Declarations.yaml`. The changes to the tests are required because we're now only checking for `__torch_function__` attributes on `Tensor`, `Optional[Tensor]` and elements of `TensorList` operands, whereas before we were checking for `__torch_function__` on all operands, so the relatively simplistic approach the tests were using before -- assuming all positional arguments might be tensors -- doesn't work anymore. I now think that checking for `__torch_function__` on all operands was a mistake in the original design.

The updates to the signatures of the `lambda` functions are to handle this new, more stringent checking of signatures.

I also added override support for `torch.nn.functional.threshold` `torch.nn.functional.layer_norm`, which did not yet have python-level support.

Benchmarks are still WIP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/34725

Reviewed By: mruberry

Differential Revision: D22357738

Pulled By: ezyang

fbshipit-source-id: 0e7f4a58517867b2e3f193a0a8390e2ed294e1f3
2020-07-17 08:54:29 -07:00
cb9029df9d Assert valid inner type for OptionalType creation (#41509)
Summary:
Assert in OptionalType::create that the inner TypePtr is valid, to catch all uses; also assert in the Python resolver to propagate a slightly more helpful error message.

Closes https://github.com/pytorch/pytorch/issues/40713.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41509

Reviewed By: suo

Differential Revision: D22563710

Pulled By: wconstab

fbshipit-source-id: ee6314b1694a55c1ba7c8251260ea120be148b17
2020-07-17 07:22:41 -07:00
e3e58e20cd enable jit profiling tests on macos (#41550)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41550

Reviewed By: SplitInfinity

Differential Revision: D22579593

Pulled By: Krovatkin

fbshipit-source-id: 3e67bcf418ef266d5416b7fac413e94b1ac1ec7e
2020-07-16 22:55:24 -07:00
eb3bf96f95 During inbatch broadcast, move Tile op after Fused8BitRowwiseQuantizedToFloat if applicable (#41464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41464

If the input is int8 rowwise quantized, we currently cannot lower it to Glow, and previously we hit an error when running with in-batch broadcast. The main issue is that the Tile op doesn't support the uint8_t type, which is easily added here. However, that alone leaves Tile -> Fused8BitRowwiseQuantizedToFloat on the host side, which probably hurts memory bandwidth a lot. Even if we later add Fused8BitRowwiseQuantizedToFloat support to Glow, it's still not ideal because we would be doing redundant compute on identical columns. So the solution here is to swap the order of Fused8BitRowwiseQuantizedToFloat and Tile to make it Tile -> Fused8BitRowwiseQuantizedToFloat. This immediately resolves the error we saw; in the short term we can still run Tile on the card, and in the longer term things run faster on the card.

The optimization is a heuristic: if the net doesn't contain this pattern, in-batch broadcast works as it did before.

(Note: this ignores all push blocking failures!)

Test Plan:
```
buck test caffe2/caffe2/opt/custom:in_batch_broadcast_test
```

Reviewed By: benjibc

Differential Revision: D22544162

fbshipit-source-id: b6dd36a5925a9c8103b80f034e7730a7a085a6ff
2020-07-16 21:25:18 -07:00
5376785a70 Run NO_AVX jobs on CPU (#41565)
Summary:
Delete "nogpu" job since both "AVX" and "AVX2" jobs already act like one
Fix naming problem when NO_AVX_NO_AVX2 job and NO_AVX2 jobs were semantically identical, due to the following logic in test.sh:
```
if [[ "${BUILD_ENVIRONMENT}" == *-NO_AVX-* ]]; then
  export ATEN_CPU_CAPABILITY=default
elif [[ "${BUILD_ENVIRONMENT}" == *-NO_AVX2-* ]]; then
  export ATEN_CPU_CAPABILITY=avx
fi
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41565

Reviewed By: seemethere

Differential Revision: D22584743

Pulled By: malfet

fbshipit-source-id: 783cce60f35947b5d1e8b93901db36371ef78243
2020-07-16 21:21:48 -07:00
728fd37d92 [JIT] make fastrnns runnable on cpu (#41483)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41483

Reviewed By: gmagogsfm

Differential Revision: D22580275

Pulled By: eellison

fbshipit-source-id: f2805bc7fa8037cfde7862b005d2940add3ac864
2020-07-16 15:53:39 -07:00
b1d4e33c8b Revert D22552377: [pytorch][PR] Reland split unsafe version
Test Plan: revert-hammer

Differential Revision:
D22552377 (5bba973afd)

Original commit changeset: 1d1b713d2429

fbshipit-source-id: 8194458f99bfd5f077b7daa46ca3e81b549adc1b
2020-07-16 15:24:19 -07:00
415ff0bceb Create lazy_dyndeps to avoid caffe2 import costs. (#41343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41343

Currently caffe2.InitOpLibrary does the dll import unilaterally. Instead, if we make a lazy version and use it, then many pieces of code which do not need the caffe2 operators get a lot faster.

On a real test, the import time went from 140s to 68s.

This also cleans up the algorithm slightly (although it makes a very minimal
difference), by parsing the list of operators once, rather than every time a
new operator is added, since we defer the RefreshCall until after we've
imported all the operators.

The key way we maintain safety, is that as soon as someone does an operation
which requires a operator (or could), we force importing of all available
operators.

Future work could include trying to identify which code is needed for which
operator and only import the needed ones. There may also be wins available by
playing with dlmopen (which opens within a namespace), or seeing if the dl
flags have an impact (I tried this and didn't see an impact, but dlmopen may
make it better).

Note that this was previously landed and reverted. The issue was that if a import failed and raised an exception, the specific library would not be removed from the lazy imports. This caused our tests which had libraries that failed to poison all other tests that ran after it. This has been fixed and a unit test has been added for this case (to help make it obvious what failed).

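A minimal sketch of the lazy-registration pattern, including the pop-before-import detail that fixes the poisoning issue (`_import_op_library` is a hypothetical stand-in for the real eager import):

```python
def _import_op_library(path):
    print(f"importing {path}")  # stand-in for the real dlopen/import

_pending_libraries = []

def lazy_init_op_library(path):
    # cheap call at import time: just remember the library
    _pending_libraries.append(path)

def ensure_all_loaded():
    # invoked lazily, e.g. the first time an operator is actually needed
    while _pending_libraries:
        path = _pending_libraries.pop(0)  # remove *before* importing so one
        _import_op_library(path)          # failing library can't poison later calls
```
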
Test Plan:
I added a new test a lazy_dyndep_test.py (copied from all_compare_test.py).
I'm a little concerned that I don't see any explicit tests for dyndep, but this
should provide decent coverage.

I've added a specific test to handle the poisoning issues mentioned above, which caused the previous version to get reverted.

Differential Revision: D22506369

fbshipit-source-id: 7395df4778e8eb0220630c570360b99a7d60eb83
2020-07-16 15:17:41 -07:00
9ed825746a Use c10::cuda:: primitives rather than make CUDA runtime calls directly (#41405)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41405

Test Plan:
**Imported from GitHub: all checks have passed**

{F244195355}

**The Intern Builds & Tests have 127 success, 5 no signals, and 1 failure. Double check the failed test log file, the failure is result differences:**
- AssertionError: 0.435608434677124 != 0.4356083869934082
- AssertionError: 0.4393022060394287 != 0.4393021583557129
- AssertionError: 0.44707541465759276 != 0.44707536697387695

These are all very small numerical errors (within 0.0000001).

Reviewed By: malfet

Differential Revision: D22531486

Pulled By: threekindoms

fbshipit-source-id: 21543ec76bb9b502885b5146c8ba5ede719be9ff
2020-07-16 15:11:57 -07:00
a0e58996fb Makes the use of the term "module" consistent through the serialization note (#41563)
Summary:
module -> torch.nn.Module or ScriptModule, as appropriate. + bonus grammar fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41563

Reviewed By: gchanan

Differential Revision: D22584173

Pulled By: mruberry

fbshipit-source-id: 8c90f1f9a194bfdb277c97cf02c9b8c1c6ddc601
2020-07-16 14:59:49 -07:00
454cd3ea2e Fix RocM resource class allocation (#41553)
Summary:
Add Conf.is_test_stage() method to avoid duplicating state in ['test', 'test1', 'test2'] throughout the code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41553

Test Plan: Make sure that in modified config.yml ROCM tests jobs are assigned `pytorch/amd-gpu` resource class

Reviewed By: yns88

Differential Revision: D22580471

Pulled By: malfet

fbshipit-source-id: 514555f0c0ac94c807bf837ba209560055335587
2020-07-16 14:13:25 -07:00
e324ea85ea Add tests to logical operation in BinaryOpsKernel.cpp (#41515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41515

Add tests in atest.cpp to cover logical_and_kernel, logical_or_kernel, and logical_xor_kernel in Aten/native/cpu/BinaryOpsKernel.cpp.

https://pxl.cl/1drmV

Test Plan: CI

Reviewed By: malfet

Differential Revision: D22565235

fbshipit-source-id: 7ad9fd8420d7fdd23fd9a703c75da212f72bde2c
2020-07-16 13:21:57 -07:00
f49d97a848 Notes for lcm and gcd, formatting doc fixes (#41526)
Summary:
A small PR fixing some formatting in lcm, gcd, and the serialization note. Adds a note to lcm and gcd explaining behavior that is not always defined.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41526

Reviewed By: ngimel

Differential Revision: D22569341

Pulled By: mruberry

fbshipit-source-id: 5f5ff98c0831f65e82b991ef444a5cee8e3c8b5a
2020-07-16 13:15:29 -07:00
86590f226e Revert D22519869: [pytorch][PR] RandomSampler generates samples one at a time when replacement=True
Test Plan: revert-hammer

Differential Revision:
D22519869 (09647e1287)

Original commit changeset: be6585002586

fbshipit-source-id: 31ca5ceb24dd0b291f46f427a6f30f1037252a5d
2020-07-16 12:59:10 -07:00
ba6b235461 [RocM] Switch to rocm-3.5.1 image (#41273)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41273

Reviewed By: seemethere

Differential Revision: D22575277

Pulled By: malfet

fbshipit-source-id: 6f43654c8c8c33adbc1de928dd43911931244978
2020-07-16 12:52:17 -07:00
09647e1287 RandomSampler generates samples one at a time when replacement=True (#40026)
Summary:
Fix https://github.com/pytorch/pytorch/issues/32530

I used the next() function to generate samples one at a time. To accommodate replacement=False, I added a variable called "sample_list" to RandomSampler that holds a random permutation.
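
A minimal sketch of the idea (illustrative only; names and details are not the exact PR code):

```python
# Illustrative sketch: with replacement=True, draw one index at a time
# instead of materializing all num_samples indices up front.
import torch

def lazy_random_indices(n, num_samples, replacement=True):
    if replacement:
        for _ in range(num_samples):
            yield int(torch.randint(high=n, size=(1,)))
    else:
        # replacement=False still needs a full permutation (the "sample_list").
        yield from torch.randperm(n).tolist()
```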

cc SsnL

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40026

Reviewed By: zhangguanheng66

Differential Revision: D22519869

Pulled By: ezyang

fbshipit-source-id: be65850025864d659a713b3bc461b25d6d0048a2
2020-07-16 11:42:32 -07:00
6f5f455c54 [Gloo] alltoall to ProcessGroupGloo (#41424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41424

Adding alltoall to Gloo process group

Test Plan:
buck test caffe2/torch/lib/c10d:ProcessGroupGlooTest

Verified on TSC as well D22141532

Reviewed By: osalpekar

Differential Revision: D22451929

fbshipit-source-id: 695c4655c894c85229b16097fa63352ed04523ef
2020-07-16 11:27:26 -07:00
1ac4692489 Remove unnecessary test in rpc_test.py (#41218)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41218

This test doesn't assert anything and was accidentally committed as
part of a larger diff a few months ago.
ghstack-source-id: 107882848

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D22469852

fbshipit-source-id: 0baa23da56b08200e16cf66df514566223dd9b15
2020-07-16 11:23:52 -07:00
b5e32528d0 Fix flaky test_udf_remote_message_delay_timeout_to_self (#41217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41217

Fixes this flaky test. Because the callback finishCreatingOwnerRRef can run
after request_callback has already processed and created the owner RRef, we
could end up with 0 owners on the node, since the callback removes the RRef
from the owners_ map. In that case, shutdown is fine because there are no
owners. If the callback runs first instead, there will be 1 owner, which we
delete during shutdown once we detect it has no forks. Either way shutdown
works fine, so we don't need to enforce that there is exactly 1 owner.
ghstack-source-id: 107883497

Test Plan: Ran the test 500 times with TSAN.

Reviewed By: ezyang

Differential Revision: D22469806

fbshipit-source-id: 02290d6d5922f91a9e2d5ede21d1cf1c4598cb46
2020-07-16 11:20:56 -07:00
94e4248d80 Split ASAN and ROCM tests into test1 and test2 (#41520)
Summary:
This should reduce end-to-end test runtime for the 2 slowest configs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41520

Reviewed By: seemethere

Differential Revision: D22575028

Pulled By: malfet

fbshipit-source-id: a65bfa5932fcda3cf0f4fdd97bcc7ebb3f54c281
2020-07-16 11:15:03 -07:00
81e964904e [Gloo] Tests for Gloo Async Work Wait-level Timeouts (#41265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41265

This PR adds tests for the Async Work wait-level timeouts that were added in the previous PR
ghstack-source-id: 107835732

Test Plan: New tests are in this diff - Running on local machine and Sandcastle

Reviewed By: jiayisuse

Differential Revision: D22470084

fbshipit-source-id: 5552e384d384962e359c5f665e6572df03b6aa63
2020-07-16 10:59:01 -07:00
b979129cba [Gloo] Support work-level timeouts in ProcessGroupGloo (#40948)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40948

Add work-level timeouts to ProcessGroupGloo. This uses the timeout support in `waitSend` and `waitRecv` functions from Gloo's `unbound_buffer` construct.
ghstack-source-id: 107835738

Test Plan: Tests are in the last PR in this stack

Reviewed By: jiayisuse

Differential Revision: D22173763

fbshipit-source-id: e0493231a23033464708ee2bc0e295d2b087a1c9
2020-07-16 10:58:59 -07:00
01dcef2e15 [NCCL] Tests for WorkNCCL::wait with Timeouts (#40947)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40947

This PR adds tests for work-level timeouts in WorkNCCL objects. We kick off an allgather operation that waits for 1000ms before actually starting computation. We wait on completion of this allgather op with a timeout of 250ms, expecting the operation to time out and throw a runtime error.
ghstack-source-id: 107835734

Test Plan: This diff added tests - checking CI/Sandcastle for correctness. These are NCCL tests so they require at least 2 GPUs to run.

Reviewed By: jiayisuse

Differential Revision: D22173101

fbshipit-source-id: 8595e4b67662cef781b20ced0befdcc53d157c39
2020-07-16 10:58:56 -07:00
edf3dc73f2 [NCCL] Support Wait Timeout in ProcessGroupNCCL (#40946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40946

Adds timeout to ProcessGroupNCCL::wait. Currently, WorkNCCL objects already have a timeout set during ProcessGroupNCCL construction. The new wait function will override the existing timeout with the user-defined timeout if one is provided. Timed out operations result in NCCL communicators being aborted and an exception being thrown.
ghstack-source-id: 107835739

Test Plan: Test added to `ProcessGroupNCCLTest` in the next PR in this stack.

Reviewed By: jiayisuse

Differential Revision: D22127898

fbshipit-source-id: 543964855ac5b41e464b2df4bb6c211ef053e73b
2020-07-16 10:58:54 -07:00
9d92fa2679 [NCCL] Add timeout to ProcessGroup Work Wait (#40944)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40944

This stack adds Work-level timeout for blocking wait.

This PR just changes the API to accept a default timeout arg for the wait function in each ProcessGroup backend. The ProcessGroup superclass correctly waits for the given timeout by changing the condition-variable wait to wait_for.
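
A hedged usage sketch of the resulting behavior, assuming the `timeout` kwarg is surfaced on `Work.wait` in Python as described:

```python
# Sketch, not the PR's test code: wait on an async collective with a
# work-level timeout; a timed-out wait raises instead of blocking forever.
from datetime import timedelta
import torch
import torch.distributed as dist

# assumes dist.init_process_group(...) has already been called
t = torch.ones(4)
work = dist.all_reduce(t, async_op=True)
work.wait(timeout=timedelta(milliseconds=500))
```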

Closes: https://github.com/pytorch/pytorch/issues/37571
ghstack-source-id: 107835735

Test Plan: Tests in 4th PR in this stack

Reviewed By: jiayisuse

Differential Revision: D22107135

fbshipit-source-id: b38c07cb5e79e6c86c205e580336e7918ed96501
2020-07-16 10:56:58 -07:00
fef30220fd Runs CUDA test_istft_of_sine on CUDA (#41523)
Summary:
The test was always running on the CPU. On non-MKL builds this caused an error: the CUDA variant (which actually ran on the CPU) tried to execute even though the test requires MKL, a requirement only checked for the CPU variant of the test.

Fixes https://github.com/pytorch/pytorch/issues/41402.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41523

Reviewed By: ngimel

Differential Revision: D22569344

Pulled By: mruberry

fbshipit-source-id: e9908c0ed4b5e7b18cc7608879c6213fbf787da2
2020-07-16 10:43:51 -07:00
b2b8af9645 Removes assertAlmostEqual (#41514)
Summary:
This test function is confusing: `assertEqual` already allows a tolerance to be specified, making `assertAlmostEqual` a redundant mechanism.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41514

Reviewed By: ngimel

Differential Revision: D22569348

Pulled By: mruberry

fbshipit-source-id: 2b2ff8aaa9625a51207941dfee8e07786181fe9f
2020-07-16 10:35:12 -07:00
58244a9586 Automated submodule update: FBGEMM (#40332)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 73ea1f5828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40332

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: gchanan, yns88

Differential Revision: D22150737

fbshipit-source-id: fe7e6787adef9e2fedee5d1a0a1e57bc4760b88c
2020-07-16 10:32:39 -07:00
2b14f2d368 [reland][DNNL]:enable max_pool3d and avg_pool3d (#40996)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40996

Test Plan: Imported from OSS

Differential Revision: D22440766

Pulled By: VitalyFedyunin

fbshipit-source-id: 242711612920081eb4a7e5a7e80bc8b2d4c9f978
2020-07-16 10:26:45 -07:00
45c5bac870 [WIP] Fix cpp grad accessor API (#40887)
Summary:
Update the API for accessing grad in C++ to avoid unexpected thread-safety issues.
In particular, with the current API, a check like `t.grad().defined()` is not thread safe.

- This introduces `t.mutable_grad()` that should be used when getting a mutable version of the saved gradient. This function is **not** thread safe.
- The `Tensor& grad()` API is now removed. We could not do a deprecation cycle because most of our call sites use non-const Tensors and would therefore hit the non-const overload, making the warning fire for almost every call; this would be too verbose for users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40887

Reviewed By: ezyang

Differential Revision: D22343932

Pulled By: albanD

fbshipit-source-id: d5eb909bb743bc20caaf2098196e18ca4110c5d2
2020-07-16 09:11:12 -07:00
5bba973afd Reland split unsafe version (#41484)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/39299

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41484

Reviewed By: glaringlee

Differential Revision: D22552377

Pulled By: albanD

fbshipit-source-id: 1d1b713d2429ae162e04bda845ef0838c52df789
2020-07-16 09:01:45 -07:00
b9442bb03e Doc note for complex (#41252)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41252

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D22553266

Pulled By: anjali411

fbshipit-source-id: f6dc409da048496d72b29b0976dfd3dd6645bc4d
2020-07-16 08:53:27 -07:00
d80e0c62be fix dequantization to match nnpi (#41505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41505

fix the dequantization to match the fixes from quantization

Test Plan:
The test is not conclusive, since it only compares the emulation against a reference collected from Amy's run.

running an evaluation workflow at the moment

Reviewed By: venkatacrc

Differential Revision: D22558092

fbshipit-source-id: 3ff00ea15eac76007e194659c3b4949f07ff02a4
2020-07-16 00:40:57 -07:00
26790fb26d fix quantization mechanism to match nnpi (#41494)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41494

Revert to the changes from amylittleyang to make quantization work

Test Plan:
Ran against a dump from ctr_instagram, and verified that:
- nnpi and fakelowp match bitwise
- nnpi differs by at most 1 vs fbgemm, most likely due to the type of rounding

Reviewed By: venkatacrc

Differential Revision: D22555276

fbshipit-source-id: 7074521d181f15ef6270985bb71c4b44d25d1c30
2020-07-16 00:40:55 -07:00
e6859ec78f resurrect single quantization op test (#41476)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41476

This test was deleted by default; re-adding it in its own file to make it
more explicit.

Test Plan: ran the test

Reviewed By: yinghai

Differential Revision: D22550217

fbshipit-source-id: 758e279b2bab3b23452a3d0ce75fb366f7afb7be
2020-07-16 00:37:46 -07:00
04c0f2e3cc enable TE on windows (#41501)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41501

Reviewed By: ZolotukhinM

Differential Revision: D22563872

Pulled By: Krovatkin

fbshipit-source-id: 2b5730017b34af27800cc03f3ba62f1cc8b4f240
2020-07-15 23:00:05 -07:00
b2e52186b9 Rename capacity to nbytes in ShareExternalPointer to avoid confusion in future (#41461)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41461

capacity is misleading, and we have many incorrect uses internally. Let's rename it to nbytes to avoid confusion in the future. Ultimately, we could remove this parameter entirely;
so far I haven't seen any case where this capacity is necessary.

Test Plan: oss ci

Differential Revision: D22544189

fbshipit-source-id: f310627f2ab8f4ebb294e0dd5eabc380926991eb
2020-07-15 22:04:18 -07:00
702140758f Move GLOG_ constants into c10 namespace (#41504)
Summary:
Declaring GLOG_ constants in the google namespace causes a conflict in C++ projects that use GLOG and link with LibPyTorch compiled without GLOG.
For example, see https://github.com/facebookresearch/ReAgent/issues/288

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41504

Reviewed By: kaiwenw

Differential Revision: D22564308

Pulled By: malfet

fbshipit-source-id: 2167bd2c6124bd14a67cc0a1360521d3c375e3c2
2020-07-15 21:56:00 -07:00
f27e395a4a [Gloo] update gloo submodule for PyTorch (#41462)
Summary:
To include alltoall

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41462

Test Plan: CI

Reviewed By: osalpekar

Differential Revision: D22544255

Pulled By: jiayisuse

fbshipit-source-id: ad55a50a31e5e5affaf3e14e2401d38f99657dc9
2020-07-15 21:50:08 -07:00
1fb2a7e5a2 onnx export of fake quantize functions (#39738)
Summary:
As discussed in https://github.com/pytorch/pytorch/issues/39502.

This PR adds support for exporting  `fake_quantize_per_tensor_affine` to a pair of `QuantizeLinear` and `DequantizeLinear`.

Exporting `fake_quantize_per_channel_affine` to ONNX depends on https://github.com/onnx/onnx/pull/2772; will file another PR once ONNX merges the change.
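
A hedged sketch of the export path this enables (opset 10 is assumed here, since that is where QuantizeLinear/DequantizeLinear were introduced):

```python
# Sketch: exporting a module that calls fake_quantize_per_tensor_affine
# should now emit a QuantizeLinear + DequantizeLinear pair in the graph.
import torch

class FakeQuant(torch.nn.Module):
    def forward(self, x):
        return torch.fake_quantize_per_tensor_affine(
            x, scale=0.1, zero_point=0, quant_min=0, quant_max=255)

torch.onnx.export(FakeQuant(), torch.randn(1, 3, 8, 8), "fq.onnx",
                  opset_version=10)
```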

It will generate ONNX graph like this:
![image](https://user-images.githubusercontent.com/1697840/84180123-ddd90080-aa3b-11ea-81d5-eaf6f5f26715.png)

jamesr66a

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39738

Reviewed By: hl475

Differential Revision: D22517911

Pulled By: houseroad

fbshipit-source-id: e998b4012e11b0f181b193860ff6960069a91d70
2020-07-15 21:20:23 -07:00
7a33d8b001 [PyTorch Mobile] Modularize the autograd source files shared by mobile and full-jit (#41430)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41430

To avoid duplication at compile time, modularize the common autograd files used by both mobile and full-jit.
ghstack-source-id: 107742889

Test Plan: CI

Reviewed By: kwanmacher

Differential Revision: D22531358

fbshipit-source-id: 554f10be89b7ed59c9bde13387a0e1b08000c116
2020-07-15 21:14:47 -07:00
23174ca71b [reland] Enable TF32 support for cuBLAS (#41498)
Summary:
fix rocm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41498

Reviewed By: mruberry

Differential Revision: D22560572

Pulled By: ngimel

fbshipit-source-id: 5ee79e96cb29e70d9180830d058efb53d1c6c041
2020-07-15 21:00:55 -07:00
200c343184 Implement gcd, lcm (#40651)
Summary:
Resolves https://github.com/pytorch/pytorch/issues/40018.
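
A quick usage sketch of the new ops (elementwise on integer tensors, with the usual broadcasting):

```python
import torch

a = torch.tensor([4, 6, 9])
b = torch.tensor([6, 4, 3])
print(torch.gcd(a, b))  # tensor([2, 2, 3])
print(torch.lcm(a, b))  # tensor([12, 12, 9])
```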

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40651

Reviewed By: ezyang

Differential Revision: D22511828

Pulled By: mruberry

fbshipit-source-id: 3ef251e45da4688b1b64c79f530fb6642feb63ab
2020-07-15 20:56:23 -07:00
e44f460079 [jit] Fix jit not round to even if const is folded (#40897)
Summary:
Fixed https://github.com/pytorch/pytorch/issues/40771

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40897

Reviewed By: Krovatkin

Differential Revision: D22543261

Pulled By: gmagogsfm

fbshipit-source-id: 0bd4b1d910a42d5aa87e120c81acfdfb7ca895fa
2020-07-15 20:13:12 -07:00
1770937c9c Restore the contiguity preprocessing of linspace (#41286)
Summary:
The contiguity preprocessing was mistakenly removed in
cd48fb503088af2c00884f1619db571fffbcdafa . It causes erroneous output
when the output tensor is not contiguous. Here we restore this
preprocessing.
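
A hedged repro sketch of the affected case, a non-contiguous `out` tensor:

```python
import torch

# Every other element of a larger tensor: a non-contiguous view.
out = torch.empty(10, 2)[:, 0]
torch.linspace(0, 1, steps=10, out=out)
print(out)  # correct again with the contiguity preprocessing restored
```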

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41286

Reviewed By: zou3519

Differential Revision: D22550822

Pulled By: ezyang

fbshipit-source-id: ebad4e2ba83d2d808e3f958d4adc9a5513a95bec
2020-07-15 20:02:16 -07:00
d90fb72b5a remove use of the term "blacklist" from docs/cpp/source/Doxyfile (#41450)
Summary:
As requested in https://github.com/pytorch/pytorch/issues/41443

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41450

Reviewed By: ezyang

Differential Revision: D22561782

Pulled By: SplitInfinity

fbshipit-source-id: b38ab5e2725735d1f0c70a4d0012678636e992c3
2020-07-15 19:45:53 -07:00
404799d43f Disable failed caffe2 tests for BoundShapeInference on Windows (#41472)
Summary:
Related:
https://github.com/pytorch/pytorch/issues/40861
https://github.com/pytorch/pytorch/issues/41471

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41472

Reviewed By: yns88

Differential Revision: D22562385

Pulled By: malfet

fbshipit-source-id: aebc600915342b984f4fc47cef0a1e79d8965c10
2020-07-15 19:39:45 -07:00
60f2fa6a84 Updates serialization note to explain versioned symbols and dynamic versioning (#41395)
Summary:
Doc update intended to clarify and expand our current serialization behavior, including explaining the difference between torch.save/torch.load, torch.nn.Module.state_dict/torch.nn.Module.load_state_dict, and torch.jit.save/torch.jit.load. Also explains, for the first time, when historic serialized TorchScript behavior is preserved, and our recommendation for preserving behavior (using the same PyTorch version to consume a model as the one that produced it).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41395

Reviewed By: ngimel

Differential Revision: D22560538

Pulled By: mruberry

fbshipit-source-id: dbc2f1bb92ab61ff2eca4888febc21f7dda76ba1
2020-07-15 19:05:19 -07:00
488ee3790e Support @torch.jit.unused on a @torch.no_grad decorated function (#41496)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41496

use the wrapped function (instead of the wrapper) to obtain argument names
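
A sketch of the pattern this fixes, modeled loosely on test_unused_decorator:

```python
# Argument names must be read from the wrapped function, not from the
# no_grad wrapper, for scripting to succeed.
import torch

class MyMod(torch.nn.Module):
    @torch.jit.unused
    @torch.no_grad()
    def fn(self, x):
        return x + 2

    def forward(self, x):
        return self.fn(x)

torch.jit.script(MyMod())  # no longer fails with "does not have a self argument"
```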

Test Plan:
```
buck test mode/dev-nosan //caffe2/test:jit -- 'test_unused_decorator \(test_jit\.TestScript\)'
```

Before:
```
> Traceback (most recent call last):
>   File "/data/users/yuxinwu/fbsource2/fbcode/buck-out/dev/gen/caffe2/test/jit#binary,link-tree/test_jit.py", line 3014, in test_unused_decorator
>     torch.jit.script(MyMod())
>   File "/data/users/yuxinwu/fbsource2/fbcode/buck-out/dev/gen/caffe2/test/jit#binary,link-tree/torch/jit/_script.py", line 888, in script
>     obj, torch.jit._recursive.infer_methods_to_compile
>   File "/data/users/yuxinwu/fbsource2/fbcode/buck-out/dev/gen/caffe2/test/jit#binary,link-tree/torch/jit/_recursive.py", line 317, in create_script_module
>     return create_script_module_impl(nn_module, concrete_type, stubs_fn)
>   File "/data/users/yuxinwu/fbsource2/fbcode/buck-out/dev/gen/caffe2/test/jit#binary,link-tree/torch/jit/_recursive.py", line 376, in create_script_module_impl
>     create_methods_from_stubs(concrete_type, stubs)
>   File "/data/users/yuxinwu/fbsource2/fbcode/buck-out/dev/gen/caffe2/test/jit#binary,link-tree/torch/jit/_recursive.py", line 292, in create_methods_from_stubs
>     concrete_type._create_methods(defs, rcbs, defaults)
> RuntimeError:
> Non-static method does not have a self argument:
>   File "/data/users/yuxinwu/fbsource2/fbcode/buck-out/dev/gen/caffe2/test/jit#binary,link-tree/test_jit.py", line 3012
>             def forward(self, x):
>                 return self.fn(x)
>                        ~~~~~~~ <--- HERE
>
```

Reviewed By: eellison

Differential Revision: D22554479

fbshipit-source-id: 03e432ea92ed973cc57ff044da80ae7a36f6af4c
2020-07-15 16:54:43 -07:00
71c3b397a6 Reduce Image Size (2) (#41301)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41301

Reviewed By: malfet

Differential Revision: D22559626

Pulled By: ssylvain

fbshipit-source-id: 32da88b7efe2e8d134f74b6ff2dff0bffede012c
2020-07-15 16:47:15 -07:00
5bd71259ed remove blacklist reference (#41447)
Summary:
Reference : issue https://github.com/pytorch/pytorch/issues/41443
Removed blacklist reference

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41447

Reviewed By: ezyang

Differential Revision: D22542428

Pulled By: SplitInfinity

fbshipit-source-id: 09728c7718bb99ff56b16fda6971ebd887a99c97
2020-07-15 16:25:12 -07:00
b7147fe6d7 Learnable Fake Quantizer Benchmark Test (#41429)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41429

This diff contains the benchmark test to evaluate the speed of executing the learnable fake quantization operator, both in the forward path and the backward path, with respect to both per tensor and per channel usages.

Test Plan:
Inside the path `torch/benchmarks/operator_benchmark` (The root directory will be `caffe2` inside `fbcode` if working on a devvm):
- On a devvm, run the command `buck run pt:fake_quantize_learnable_test`
- On a personal laptop, run the command `python3 -m pt.fake_quantize_learnable_test`

Benchmark Results (Locally on CPU):
Each sample has dimensions **3x256x256**; Each batch has 16 samples (`N=16`)
- Per Tensor Forward: 0.023688 sec/sample
- Per Tensor Backward: 0.165926 sec/sample
- Per Channel Forward: 0.040432 sec/sample
- Per Channel Backward: 0.173528 sec/sample

Reviewed By: vkuzo

Differential Revision: D22535252

fbshipit-source-id: e8e953ff2de2107c6f2dde4c8d5627bdea67ef7f
2020-07-15 14:00:20 -07:00
2b8db35c7e [reland][DNNL]:enable batchnorm3d (#40995)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40995

Test Plan: Imported from OSS

Differential Revision: D22440765

Pulled By: VitalyFedyunin

fbshipit-source-id: b4bf427bbb7010ee234a54e81ade371627f9e82c
2020-07-15 13:56:47 -07:00
b48ee175e6 [reland][DNNL]:enable conv3d (#40691)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40691

Test Plan: Imported from OSS

Differential Revision: D22296548

Pulled By: VitalyFedyunin

fbshipit-source-id: 8e2a7cf14e8bdfa2f29b735a89e8c83f6119e68d
2020-07-15 13:54:41 -07:00
ff6e560301 Add C++ end to end test for RPC and distributed autograd. (#36893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36893

Adding an end to end test for running a simple training loop in C++
for the distributed RPC framework.

The goal of this change is to enable LeakSanitizer and potentially catch memory
leaks in the Future. Enabling LSAN with python multiprocessing is tricky and we
haven't found a solution for this. As a result, adding a C++ test that triggers
most of the critical codepaths would be good for now.

As an example, this unit test would've caught the memory leak fixed by:
https://github.com/pytorch/pytorch/pull/31030
ghstack-source-id: 107781167

Test Plan:
1) Verify the test catches memory leaks.
2) waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D21112208

fbshipit-source-id: 4eb2a6b409253108f6b6e14352e593d250c7a64d
2020-07-15 12:59:19 -07:00
8940a4e684 Pull upstream select_compute_arch from cmake for Ampere (#41133)
Summary:
This pulls the following merge requests from CMake upstream:
- https://gitlab.kitware.com/cmake/cmake/-/merge_requests/4979
- https://gitlab.kitware.com/cmake/cmake/-/merge_requests/4991

The above two merge requests improve the Ampere build:
- If `TORCH_CUDA_ARCH_LIST` is not set, it can now automatically pickup 8.0 as its part of its default value
- If `TORCH_CUDA_ARCH_LIST=Ampere`, it no longer fails with `Unknown CUDA Architecture Name Ampere in CUDA_SELECT_NVCC_ARCH_FLAGS`

Code related to architectures < 3.5 is manually removed because PyTorch no longer supports them.

cc: ngimel ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41133

Reviewed By: malfet

Differential Revision: D22540547

Pulled By: ezyang

fbshipit-source-id: 6e040f4054ef04f18ebb7513497905886a375632
2020-07-15 12:53:32 -07:00
c62550e3f4 Cuda Support for Learnable Fake Quantize Per Channel (GPU) (#41262)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41262

In this diff, an implementation is provided to support the GPU kernel running the learnable fake quantize per channel kernels.

Test Plan: On a devvm, run `buck test //caffe2/test:quantization -- learnable` to test both the forward and backward for the learnable per channel fake quantize kernels. The test will test the `cuda` version if a gpu is available.

Reviewed By: vkuzo

Differential Revision: D22478832

fbshipit-source-id: 2731bd8b57bc83416790f6d65ef42d450183873c
2020-07-15 12:23:43 -07:00
4367a73399 Cuda Support for Learnable Fake Quantize Per Tensor (GPU) (#41127)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41127

In this diff, implementation is provided to support the GPU kernel running the learnable fake quantize per tensor kernels.

Test Plan: On a devvm, run `buck test //caffe2/test:quantization -- learnable` to test both the forward and backward for the learnable per tensor fake quantize kernels. The test will test the `cuda` version if a gpu is available.

Reviewed By: z-a-f

Differential Revision: D22435037

fbshipit-source-id: 515afde13dd224d21fd47fb7cb027ee8d704cbdd
2020-07-15 12:21:48 -07:00
225289abc6 Adding epsilon input argument to the Logit Op
Summary: Adding epsilon input argument to the Logit Op

Test Plan: Added test_logit test case.

Reviewed By: hyuen

Differential Revision: D22537133

fbshipit-source-id: d6f89afd1589fda99f09550a9d1b850cfc0b9ee1
2020-07-15 12:16:19 -07:00
954c260061 Revert D22480638: [pytorch][PR] Add non-deterministic alert to CUDA operations that use atomicAdd()
Test Plan: revert-hammer

Differential Revision:
D22480638 (6ff306b8b5)

Original commit changeset: 4cc913cb3ca6

fbshipit-source-id: e47fa14b5085bb2b74a479bd0830efc2d7604eea
2020-07-15 12:10:05 -07:00
008ab27b22 [quant][pyper] Add embedding_bag weight quantize and dequantize ops (#41293)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41293

Add new operators that do quantization and packing for the 8-bit and 4-bit embedding bag operators.
This is an initial change to help unblock testing. It will be followed by graph mode passes to enable quantization of the embedding_bag module

Note to reviewers: Future PRs will replace this op with a separate quantize and pack operator and add support for floating point scale and zero point.

Test Plan:
python test/test_quantization.py TestQuantizedEmbeddingBag

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D22506700

fbshipit-source-id: 090cc85a8f56da417e4b7e45818ea987ae97ca8a
2020-07-15 11:34:53 -07:00
d5ae4a07ef DDP Communication Hook Main Structure (#40848)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40848

Sub-tasks 1 and 2 of [39272](https://github.com/pytorch/pytorch/issues/39272)
ghstack-source-id: 107787878
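
A hedged sketch of the hook interface this introduces; the exact bucket/future method names are assumptions based on the tests listed below:

```python
# Sketch only: a hook receives (state, bucket) and returns a Future that
# DDP awaits in place of its built-in allreduce.
import torch.distributed as dist

def allreduce_hook(state, bucket):
    tensor = bucket.get_tensors()[0]        # bucketed gradient (assumed API)
    work = dist.all_reduce(tensor, async_op=True)
    return work.get_future()                # Future handed back to DDP

# Registered at most once per DDP instance, per
# test_ddp_comm_hook_register_just_once:
# ddp_model.register_comm_hook(state=None, hook=allreduce_hook)
```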

Test Plan:
1\. Perf tests to validate that the new code (the `if` conditions before `allreduce`) doesn't slow down today's DDP. Execute the following command with the diff patched/unpatched (with V25):

* **Unpatched Runs:**
```
hg checkout D22514243
flow-cli canary pytorch.benchmark.main.workflow --parameters-json '{"model_arch": "resnet50", "batch_size": 32, "world_size": 1, "use_fp16": false, "print_percentile": true, "backend": "gloo"}' --entitlement pytorch_ftw_gpu --name test_torchelastic_gloo_masterD22514243 --run-as-secure-group pytorch_distributed
```
* **Run 1 (unpatched):** `elastic_gang:benchmark_single.elastic_operator` Ran for 2 mins 59 s
f204539235
```
sum:
8 GPUs: p25:  0.156   205/s  p50:  0.160   200/s  p75:  0.164   194/s  p90:  0.169   189/s  p95:  0.173   185/s
fwds:
8 GPUs: p25:  0.032  1011/s  p50:  0.032  1006/s  p75:  0.032  1000/s  p90:  0.032   992/s  p95:  0.033   984/s
bwds:
8 GPUs: p25:  0.121   265/s  p50:  0.125   256/s  p75:  0.129   248/s  p90:  0.134   239/s  p95:  0.137   232/s
opts:
8 GPUs: p25:  0.003  11840/s  p50:  0.003  11550/s  p75:  0.004  8037/s  p90:  0.006  5633/s  p95:  0.007  4631/s
```
* **Run 2 (unpatched):** `elastic_gang:benchmark_single.elastic_operator` Ran for 3 mins 1 s
f204683840
```
sum:
8 GPUs: p25:  0.145   220/s  p50:  0.147   217/s  p75:  0.150   213/s  p90:  0.154   207/s  p95:  0.157   204/s
fwds:
8 GPUs: p25:  0.032  1015/s  p50:  0.032  1009/s  p75:  0.032  1002/s  p90:  0.032   994/s  p95:  0.032   990/s
bwds:
8 GPUs: p25:  0.107   297/s  p50:  0.111   288/s  p75:  0.115   278/s  p90:  0.119   268/s  p95:  0.122   262/s
opts:
8 GPUs: p25:  0.003  11719/s  p50:  0.004  9026/s  p75:  0.006  5160/s  p90:  0.009  3700/s  p95:  0.010  3184/s
```

* **Patched Runs:**
```
hg checkout D22328310
flow-cli canary pytorch.benchmark.main.workflow --parameters-json '{"model_arch": "resnet50", "batch_size": 32, "world_size": 1, "use_fp16": false, "print_percentile": true, "backend": "gloo"}' --entitlement pytorch_ftw_gpu --name test_torchelastic_gloo_localD22328310 --run-as-secure-group pytorch_distributed
```
* **Run 1 (patched):** `elastic_gang:benchmark_single.elastic_operator` Ran for 3 mins 30 s
f204544541
```
sum:
8 GPUs: p25:  0.148   216/s  p50:  0.152   210/s  p75:  0.156   205/s  p90:  0.160   200/s  p95:  0.163   196/s
fwds:
8 GPUs: p25:  0.032  1011/s  p50:  0.032  1005/s  p75:  0.032   999/s  p90:  0.032   991/s  p95:  0.033   984/s
bwds:
8 GPUs: p25:  0.112   286/s  p50:  0.116   275/s  p75:  0.120   265/s  p90:  0.125   256/s  p95:  0.128   250/s
opts:
8 GPUs: p25:  0.003  11823/s  p50:  0.003  10948/s  p75:  0.004  7225/s  p90:  0.007  4905/s  p95:  0.008  3873/s
```
* **Run 2 (patched):** `elastic_gang:benchmark_single.elastic_operator` Ran for 3 mins 14 s
f204684520
```
sum:
8 GPUs: p25:  0.146   219/s  p50:  0.147   217/s  p75:  0.150   214/s  p90:  0.152   210/s  p95:  0.153   208/s
fwds:
8 GPUs: p25:  0.032  1013/s  p50:  0.032  1008/s  p75:  0.032  1002/s  p90:  0.032   996/s  p95:  0.032   990/s
bwds:
8 GPUs: p25:  0.107   299/s  p50:  0.110   290/s  p75:  0.114   280/s  p90:  0.117   274/s  p95:  0.119   269/s
opts:
8 GPUs: p25:  0.003  11057/s  p50:  0.005  6490/s  p75:  0.008  4110/s  p90:  0.010  3309/s  p95:  0.010  3103/s
```
* **Run 3 (patched):** `elastic_gang:benchmark_single.elastic_operator` Ran for 2 mins 54 s
f204692872
```
sum:
8 GPUs: p25:  0.145   220/s  p50:  0.147   217/s  p75:  0.150   213/s  p90:  0.154   207/s  p95:  0.156   204/s
fwds:
8 GPUs: p25:  0.032  1001/s  p50:  0.032   995/s  p75:  0.032   988/s  p90:  0.033   980/s  p95:  0.033   973/s
bwds:
8 GPUs: p25:  0.108   295/s  p50:  0.111   287/s  p75:  0.114   280/s  p90:  0.119   269/s  p95:  0.121   264/s
opts:
8 GPUs: p25:  0.003  11706/s  p50:  0.003  9257/s  p75:  0.005  6333/s  p90:  0.008  4242/s  p95:  0.009  3554/s
```

* **Memory:**
   * Unpatched:
```
CUDA Memory Summary After first iteration:
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| Active memory         |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    3490 MB |    3490 MB |    3490 MB |       0 B  |
|       from large pool |    3434 MB |    3434 MB |    3434 MB |       0 B  |
|       from small pool |      56 MB |      56 MB |      56 MB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |  315332 KB |  343472 KB |    2295 MB |    1987 MB |
|       from large pool |  311166 KB |  340158 KB |    2239 MB |    1935 MB |
|       from small pool |    4166 KB |    4334 KB |      56 MB |      52 MB |
|---------------------------------------------------------------------------|
| Allocations           |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| Active allocs         |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| GPU reserved segments |     102    |     102    |     102    |       0    |
|       from large pool |      74    |      74    |      74    |       0    |
|       from small pool |      28    |      28    |      28    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      34    |      54    |     430    |     396    |
|       from large pool |      15    |      48    |     208    |     193    |
|       from small pool |      19    |      19    |     222    |     203    |
|===========================================================================|

```
   * Patched:
```
CUDA Memory Summary After first iteration:
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| Active memory         |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    3490 MB |    3490 MB |    3490 MB |       0 B  |
|       from large pool |    3434 MB |    3434 MB |    3434 MB |       0 B  |
|       from small pool |      56 MB |      56 MB |      56 MB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |  315332 KB |  343472 KB |    2295 MB |    1987 MB |
|       from large pool |  311166 KB |  340158 KB |    2239 MB |    1935 MB |
|       from small pool |    4166 KB |    4334 KB |      56 MB |      52 MB |
|---------------------------------------------------------------------------|
| Allocations           |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| Active allocs         |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| GPU reserved segments |     102    |     102    |     102    |       0    |
|       from large pool |      74    |      74    |      74    |       0    |
|       from small pool |      28    |      28    |      28    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      34    |      54    |     431    |     397    |
|       from large pool |      15    |      48    |     208    |     193    |
|       from small pool |      19    |      19    |     223    |     204    |
|===========================================================================|

```

2\. As of v18: `python test/distributed/test_c10d.py`
```
....................s.....s.....................................................s................................
----------------------------------------------------------------------
Ran 114 tests in 215.983s

OK (skipped=3)

```

3\. Additional tests in `python test/distributed/test_c10d.py`:
* `test_ddp_comm_hook_future_passing_cpu`: This unit test verifies whether the Future object is passed properly. The callback function creates a Future object and sets a value to it.
* `_test_ddp_comm_hook_future_passing_gpu`: This unit test verifies whether the Future object is passed properly. The callback function creates a Future object and sets a value to it.
* `test_ddp_comm_hook_future_passing_gpu_gloo`: This unit test executes _test_ddp_comm_hook_future_passing_gpu using gloo backend.
* `test_ddp_comm_hook_future_passing_gpu_nccl`: This unit test executes _test_ddp_comm_hook_future_passing_gpu using nccl backend.
* `test_ddp_invalid_comm_hook_init`: This unit test makes sure that register_comm_hook properly checks the format of the hook defined by the user. The Python hook must be callable. This test also checks whether the bucket annotation is checked properly if defined.
* `test_ddp_invalid_comm_hook_return_type`: This test checks whether the return annotation is checked properly if defined. It also checks whether an internal error is thrown if the return type is incorrect and the user hasn't specified any return type annotation.
* `test_ddp_comm_hook_register_just_once`: DDP communication hook can only be registered once. This test validates whether the error is thrown properly when register_comm_hook is called more than once.

Reviewed By: ezyang

Differential Revision: D22328310

fbshipit-source-id: 77a6a71808e7b6e947795cb3fcc68c8c8f024549
2020-07-15 11:25:29 -07:00
c86699d425 [cmake] Use PROJECT_SOURCE_DIR instead of CMAKE_* (#41387)
Summary:
Add support for including pytorch via an add_subdirectory()
This requires using PROJECT_* instead of CMAKE_* which refer to
the top-most project including pytorch.

TEST=add_subdirectory() a pytorch checkout into your project and build.
There are still some hardcoded references to TORCH_SRC_DIR; I will
fix those in a follow-on commit. For now you can create a symlink to
<pytorch>/torch/ in your project.

Change-Id: Ic2a8aec3b08f64e2c23d9e79db83f14a0a896abc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41387

Reviewed By: zhangguanheng66

Differential Revision: D22539944

Pulled By: ezyang

fbshipit-source-id: b7e9631021938255f0a6ea897a7abb061759093d
2020-07-15 11:09:05 -07:00
563b60b890 Fix flaky test_stream_event_nogil due to missing event sync (#41398)
Summary:
The test asserts that the stream is "ready" but doesn't wait for the
event to be "executed", which makes it fail on some platforms where the
`query` call occurs "soon enough".

Fixes https://github.com/pytorch/pytorch/issues/38807

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41398

Reviewed By: zhangguanheng66

Differential Revision: D22540012

Pulled By: ezyang

fbshipit-source-id: 6f56d951e48133ce4f6a9a54534298b7d2877c80
2020-07-15 11:03:35 -07:00
6ff306b8b5 Add non-deterministic alert to CUDA operations that use atomicAdd() (#40056)
Summary:
Issue https://github.com/pytorch/pytorch/issues/15359

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40056

Differential Revision: D22480638

Pulled By: ezyang

fbshipit-source-id: 4cc913cb3ca6d4206de80f4665bbc9031aa3ca01
2020-07-15 10:57:32 -07:00
dddac948a3 Add CUDA to pooling benchmark configs (#41438)
Summary:
Related to https://github.com/pytorch/pytorch/issues/41368

These benchmarks support CUDA already so there is no reason for it not to be in the benchmark config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41438

Reviewed By: zhangguanheng66

Differential Revision: D22540756

Pulled By: ezyang

fbshipit-source-id: 621eceff37377c1ab06ff7483b39fc00dc34bd46
2020-07-15 10:51:43 -07:00
3971777ebb Krovatkin/reenable test tensorexpr (#41445)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41445

Reviewed By: ZolotukhinM

Differential Revision: D22543075

Pulled By: Krovatkin

fbshipit-source-id: fd8c0a94f5b3aff34d2b444dbf551425fdc1df04
2020-07-15 10:42:40 -07:00
04320a47d7 Add optimizer_for_mobile doc into python api root doc (#41211)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41211

Test Plan: Imported from OSS

Reviewed By: xta0

Differential Revision: D22543608

fbshipit-source-id: bf522a6c94313bf2696eca3c5bb5812ea98998d0
2020-07-15 09:57:40 -07:00
3a63a939d4 Revert D22517785: [pytorch][PR] Enable TF32 support for cuBLAS
Test Plan: revert-hammer

Differential Revision:
D22517785 (288ece89e1)

Original commit changeset: 87334c893561

fbshipit-source-id: 0a0674f49c1bcfc98f7f88af5a8c7de93b76e458
2020-07-15 08:15:48 -07:00
8548a21c00 Revert D22543215: Adjust bound_shape_inferencer to take 4 inputs for FCs
Test Plan: revert-hammer

Differential Revision:
D22543215 (86a2bdc35e)

Original commit changeset: 0977fca06630

fbshipit-source-id: b440f9b1eaeb35ec8b08e899890691e7a77a9f6d
2020-07-15 08:10:39 -07:00
f153b35b9b Shape inference for SparseToDense in ExpertCombiner
Summary: Adding shape inference for SparseToDense. The proposed shape inference only works when data_to_infer_dim is given; otherwise the SparseToDense output dimension depends on the max value of the input tensor.

Test Plan:
buck test //caffe2/caffe2/python:sparse_to_dense_test
buck test //caffe2/caffe2/python:hypothesis_test -- test_sparse_to_dense

Dper3 Changes:
f204594813
buck test dper3/dper3_models/ads_ranking/model_impl/sparse_nn/tests:sparse_nn_lib_test

Reviewed By: zhongyx12, ChunliF

Differential Revision: D22479511

fbshipit-source-id: 8983a9baea8853deec53ad6f795c874c3fb93de0
2020-07-15 08:04:48 -07:00
86a2bdc35e Adjust bound_shape_inferencer to take 4 inputs for FCs (#41452)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41452

The model exported from online training workflow with int8 quantization contains FCs with 4 inputs. The extra input is the quant_param blob. This diff is to adjust the bound_shape_inferencer to get shape info for the quant_param input.

Test Plan:
```
buck test caffe2/caffe2/opt:bound_shape_inference_test
```

Reviewed By: anurag16

Differential Revision: D22543215

fbshipit-source-id: 0977fca06630e279d47292e6b44f3d8180a767a5
2020-07-15 01:43:39 -07:00
14f19ab833 Port index_select to ATen (CUDA) (#39946)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39946

Reviewed By: ngimel

Differential Revision: D22520160

Pulled By: mruberry

fbshipit-source-id: 7eb3029e3917e793f3c020359acb0989d5deb61e
2020-07-15 01:11:32 -07:00
9552ec787c Revert D22516606: [pytorch][PR] Temporary fix for determinant bug on CPU
Test Plan: revert-hammer

Differential Revision:
D22516606 (fcd6d91045)

Original commit changeset: 7ea8299b9d2c

fbshipit-source-id: 41e19d5e1ba843cd70dce677869892f2e33fac09
2020-07-14 23:44:32 -07:00
921d2a164f SparseAdagrad/RowWiseSparseAdagrad mean fusion on CPU & GPU and dedup version for RowWiseSparse mean fusion on GPU
Summary:
1. Support SparseAdagradFusedWithSparseLengthsMeanGradient and RowWiseSparseAdagradFusedWithSparseLengthsMeanGradient on CPU and GPU
2. Add the dedup implementation of fused RowWiseAdagrad op on GPUs for mean pooling

Reviewed By: xianjiec

Differential Revision: D22165603

fbshipit-source-id: 743fa55ed5893c34bc6406ddfbbbb347b88091d1
2020-07-14 22:36:16 -07:00
44b9306d0a Export replaceAllUsesAfterNodeWith for PythonAPI (#41414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41414

This diff exports replaceAllUsesAfterNodeWith to PythonAPI.

Test Plan: Tested locally. Please let me know if there is a set of unit tests to be passed outside of the default ones triggered by Sandcastle.

Reviewed By: soumith

Differential Revision: D22523211

fbshipit-source-id: 3f075bafa6208ada462abc57d495c15179a6e53d
2020-07-14 22:20:19 -07:00
20f3051f7d [adaptive_]max_pool{1,2,3}d: handle edge case when input is filled with -inf (#40665)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40131
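
A hedged repro sketch for the all -inf edge case handled here:

```python
import torch

x = torch.full((1, 1, 4, 4), float("-inf"))
out, idx = torch.nn.functional.max_pool2d(x, kernel_size=2,
                                          return_indices=True)
print(out)  # all -inf, and idx now holds valid window indices
```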

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40665

Differential Revision: D22463538

Pulled By: ezyang

fbshipit-source-id: 7e08fd0205926911d45aa150012154637e64a8d4
2020-07-14 21:51:40 -07:00
fcd6d91045 Temporary fix for determinant bug on CPU (#35136)
Summary:
Changelog:
- Make diagonal contiguous

Temporarily Fixes https://github.com/pytorch/pytorch/issues/34061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/35136

Reviewed By: vincentqb

Differential Revision: D22516606

Pulled By: ezyang

fbshipit-source-id: 7ea8299b9d2c1c244995955b333a1dffb0cdff73
2020-07-14 21:20:50 -07:00
f074994a31 vectorize rounding ops (#41439)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41439

use RoundToFloat16 on arrays

Test Plan: layernorm unittest

Reviewed By: venkatacrc

Differential Revision: D22540118

fbshipit-source-id: dc84fd22b5dc6a3bd15ad4ec1eecb9db13d64e97
2020-07-14 20:59:39 -07:00
96f124e623 remove template arguments of layernorm
Summary:
Remove layernorm templates and make them float, since that's the only variant.
Minor fixes in logging and testing.

Test Plan: ran the test

Reviewed By: venkatacrc

Differential Revision: D22527359

fbshipit-source-id: d6eec362a6e88e1c12fddf820ae629ede13fb2b8
2020-07-14 20:56:23 -07:00
0b73ea0ea2 Change BCELoss size mismatch warning into an error (#41426)
Summary:
BCELoss currently uses different broadcasting semantics than numpy. Since previous versions of PyTorch have thrown a warning in these cases telling the user that input sizes should match, and since the CUDA and CPU results differ when sizes do not match, it makes sense to upgrade the size mismatch warning to an error.

We can consider supporting numpy broadcasting semantics in BCELoss in the future if needed.
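
A hedged illustration of the new behavior:

```python
# Mismatched input/target sizes in BCELoss now raise instead of warning.
import torch

loss = torch.nn.BCELoss()
x = torch.rand(4, 1)
y = torch.rand(4)
# loss(x, y)             # previously warned on the size mismatch; now errors
loss(x, y.unsqueeze(1))  # matching shapes work as before
```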

Closes https://github.com/pytorch/pytorch/issues/40023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41426

Reviewed By: zou3519

Differential Revision: D22540841

Pulled By: ezyang

fbshipit-source-id: 6c6d94c78fa0ae30ebe385d05a9e3501a42b3652
2020-07-14 20:34:06 -07:00
fd0329029f Fix flaky profiler and test_callback_simple RPC tests (#41287)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41287

The profiler tests that exercise profiling with builtin functions, and the `test_callback_simple` test, have been broken for a while. This diff fixes that by preferring c10 ops to non-c10 ops in our operation matching logic.

The result of this is that these ops go through the c10 dispatch and thus have profiling enabled. For `test_callback_simple` this results in the effect that we choose `aten::add.Tensor` over `aten::add.Int` which fixes the type issue.

Test Plan:
Ensured that the tests are no longer flaky by running them a bunch
of times.

Reviewed By: vincentqb

Differential Revision: D22489197

fbshipit-source-id: 8452b93e4d45703453f77d968350c0d32f3f63fe
2020-07-14 19:26:44 -07:00
0d4a110c28 [JIT] Fix dead stores in JIT (#41202)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41202

This commit fixes dead stores in JIT surfaced by the Quality Analyzer.

Test Plan: Continuous integration.

Reviewed By: jerryzh168

Differential Revision: D22461492

fbshipit-source-id: c587328f952054fb9449848e90b7d28a20aed4af
2020-07-14 17:59:50 -07:00
4ddf27ba48 [op-bench] check device attribute in user inputs
Summary: The device attribute in the op benchmark can only be 'cpu' or 'cuda', so this diff adds a check.

Test Plan: buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --warmup_iterations 1 --iterations 1

Reviewed By: ngimel

Differential Revision: D22538252

fbshipit-source-id: 3e5af72221fc056b8d867321ad22e35a2557b8c3
2020-07-14 17:17:59 -07:00
a0f110190c clamp Categorical logit from -inf to finfo.min when calculating entropy (#41002)
Summary:
Fixes gh-40553 by clamping logit values when calculating Categorical.entropy
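
A hedged sketch of the effect: a zero-probability event (logit = -inf) should contribute 0 * log(0) = 0 to the entropy rather than poisoning it with NaN:

```python
import torch

d = torch.distributions.Categorical(logits=torch.tensor([0.0, float("-inf")]))
print(d.entropy())  # tensor(0.) after the fix, instead of nan
```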

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41002

Reviewed By: mruberry

Differential Revision: D22436432

Pulled By: ngimel

fbshipit-source-id: 08b7c7b0c15ab4e5a56b3a8ec0d0237ad360202e
2020-07-14 16:21:12 -07:00
359cdc20e2 Revert D22432885: [pytorch][PR] unsafe_split, unsafe_split_with_sizes, unsafe_chunk operations
Test Plan: revert-hammer

Differential Revision:
D22432885 (c17670ac50)

Original commit changeset: 324aef091b32

fbshipit-source-id: 6b7c52bde46932e1cf77f61e7035d8a641b0beb6
2020-07-14 16:06:42 -07:00
144f04e7ef Fix qobserver test
Summary: Change the device config in qobserver test to a string to honor --device flag.

Test Plan: buck run caffe2/benchmarks/operator_benchmark/pt:qobserver_test  -- --iterations 1 --device cpu

Reviewed By: ngimel

Differential Revision: D22536379

fbshipit-source-id: 8926b2393be1f52f9183f8205959a3ff18e3ed2a
2020-07-14 15:47:03 -07:00
c68c5ea0e6 Upgrade cpp docs Sphinx/breathe/exhale to latest version (#41312)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41312

I was hoping that exhale had gotten incremental recompilation
in its latest version, but experimentally this does not seem
to have been the case.  Still, I had gotten the whole shebang
to be working on the latest version of these packages, so might
as well land the upgrade.  There was one bug in Optional.h that
I had to fix; see the cited bug report.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D22526349

Pulled By: ezyang

fbshipit-source-id: d4169c2f48ebd8dfd8a593cc8cd232224d008ae9
2020-07-14 15:35:43 -07:00
05207b7371 .circleci: Re-split postnightly into its own thing (#41354)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41354

The nightly pipeline has the potential to be flaky and thus the html
pages have the potential not to be updated.

This should actually be done as an automatic lambda job that runs
whenever the S3 bucket updates, but this is an intermediate step in order to
get there.

Closes https://github.com/pytorch/pytorch/issues/40998

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22530283

Pulled By: seemethere

fbshipit-source-id: 0d80b7751ede83e6dd466690cc0a0ded68f59c5d
2020-07-14 14:49:01 -07:00
c17670ac50 unsafe_split, unsafe_split_with_sizes, unsafe_chunk operations (#39299)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/36403

Copy-paste of the issue description:

* Escape hatch: Introduce unsafe_* versions of the three functions above that keep the current behavior (outputs not tracked as views); a usage sketch appears after the notes below. The documentation will explain in detail why they are unsafe and when it is safe to use them (basically, only the outputs OR the input can be modified inplace, but not both; otherwise you will get wrong gradients).
* Deprecation: Use the CreationMeta on views to track views created by these three ops and throw a warning when any of the views is modified inplace, saying that this is deprecated and will raise an error soon. For users that really need to modify these views inplace, they should look at the doc of the unsafe_* version to make sure their usecase is valid:
  * If it is not, then pytorch is computing wrong gradients for their use case and they should not do inplace anymore.
  * If it is, then they can use the unsafe_* version to keep the current behavior.
* Removal: Use the CreationMeta on view to prevent any inplace on these views (like we do for all other views coming from multi-output Nodes). The users will still be able to use the unsafe_ versions if they really need to do this.

Note about BC-breaking:
- This PR changes the behavior of the regular functions by making them return proper views now. This is a modification that the user will be able to see.
- We skip all the view logic for these views and so the code should behave the same as before (except the change in the `._is_view()` value).
- Even though the view logic is not performed, we do raise deprecation warnings for the cases where doing these ops would throw an error.
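
A hedged illustration of the escape hatch (subject to the safety caveat in the escape-hatch bullet above):

```python
# unsafe_split outputs are not tracked as views, so in-place writes to them
# are allowed; this is only correct if the input is not also modified.
import torch

x = torch.arange(6.0)
a, b = torch.unsafe_split(x, 3)
a.add_(1)  # fine here; on the regular split this in-place op is deprecated
```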

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39299

Differential Revision: D22432885

Pulled By: albanD

fbshipit-source-id: 324aef091b32ce69dd067fe9b13a3f17d85d0f12
2020-07-14 14:15:41 -07:00
e2c4c2f102 addmm: Reduce constant time overhead (#41374)
Summary:
Fixes the overhead reported by ngimel in https://github.com/pytorch/pytorch/pull/40927#issuecomment-657709646

As it turns out, `Tensor.size(n)` has more overhead than `Tensor.sizes()[n]`. Since addmm does a lot of introspection of the input matrix sizes and strides, this added up to a noticeable (~1 us) constant time overhead.

With this change, a 1x1 matmul takes 2.85 us on my machine compared to 2.90 us on pytorch 1.5.
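
A rough timing sketch for the constant-overhead claim (a sanity check, not the PR's benchmark; absolute numbers vary by machine):

```python
import timeit
import torch

a, b, c = torch.randn(1, 1), torch.randn(1, 1), torch.zeros(1, 1)
t = timeit.timeit(lambda: torch.addmm(c, a, b), number=100_000)
print(f"{t / 100_000 * 1e6:.2f} us per 1x1 addmm")
```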

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41374

Reviewed By: ailzhang

Differential Revision: D22519924

Pulled By: ngimel

fbshipit-source-id: b29504bee7de79ce42e5e50f91523dde42b073b7
2020-07-14 13:47:16 -07:00
288ece89e1 Enable TF32 support for cuBLAS (#40800)
Summary:
Benchmark on a fully connected network and torchvision models (time in seconds) on GA100:

| model              | batch size | forward(TF32) | forward(FP32) | backward(TF32) | backward(FP32) |
|--------------------|------------|---------------|---------------|----------------|----------------|
| FC 512-128-32-8    | 512        | 0.000211      | 0.000321      | 0.000499       | 0.000532       |
| alexnet            | 512        | 0.0184        | 0.0255        | 0.0486         | 0.0709         |
| densenet161        | 128        | 0.0665        | 0.204         | 0.108          | 0.437          |
| googlenet          | 256        | 0.0925        | 0.110         | 0.269          | 0.326          |
| inception_v3       | 256        | 0.155         | 0.214         | 0.391          | 0.510          |
| mnasnet1_0         | 512        | 0.108         | 0.137         | 0.298          | 0.312          |
| mobilenet_v2       | 512        | 0.114         | 0.294         | 0.133          | 0.303          |
| resnet18           | 512        | 0.0722        | 0.100         | 0.182          | 0.228          |
| resnext50_32x4d    | 256        | 0.170         | 0.237         | 0.373          | 0.479          |
| shufflenet_v2_x1_0 | 512        | 0.0463        | 0.0473        | 0.125          | 0.123          |
| squeezenet1_0      | 512        | 0.0870        | 0.0948        | 0.205          | 0.214          |
| vgg16              | 256        | 0.167         | 0.234         | 0.401          | 0.502          |
| wide_resnet50_2    | 512        | 0.186         | 0.310         | 0.415          | 0.638          |
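
A hedged sketch of toggling the feature; the flag name below is my assumption of where this landed, not quoted from the PR:

```python
# Sketch: enable/disable TF32 math for cuBLAS matmuls (flag name assumed).
import torch

torch.backends.cuda.matmul.allow_tf32 = True   # the TF32 columns above
torch.backends.cuda.matmul.allow_tf32 = False  # the FP32 comparison runs
```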

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40800

Reviewed By: mruberry

Differential Revision: D22517785

Pulled By: ngimel

fbshipit-source-id: 87334c8935616f72a6af5abbd3ae69f76923dc3e
2020-07-14 13:21:10 -07:00
c528faac7d [ROCm] Skip problematic mgpu tests on ROCm3.5 (#41409)
Summary:
nccl tests and parallelize_bmuf_distributed test are failing on rocm3.5.1. Skipping these tests to upgrade the CI to rocm3.5.1

jeffdaily sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41409

Reviewed By: orionr

Differential Revision: D22528928

Pulled By: seemethere

fbshipit-source-id: 928196b7a62a441d391e69f54b278313ecc75d77
2020-07-14 11:55:43 -07:00
5f146a4125 fix include file path in unary ops
Summary: fix include file path in unary ops

Test Plan: compile

Reviewed By: amylittleyang

Differential Revision: D22527312

fbshipit-source-id: 589efd2231ff8bd3133cb7844738429927ecee68
2020-07-14 11:08:51 -07:00
4972cf06a2 [JIT] Add out-of-source-tree to_backend tests (#41145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41145

**Summary**
This commit adds out-of-source-tree tests for `to_backend`. These tests check
that a Module can be lowered to a backend, exported, loaded (in both
Python and C++) and executed.

**Fixes**
This commit fixes #40067.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D22510076

Pulled By: SplitInfinity

fbshipit-source-id: f65964ef3092a095740f06636ed5b1eb0884492d
2020-07-14 10:57:04 -07:00
0e7b9d4ff8 Fix logit doc (#41384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41384

Fix logit doc

Test Plan: unittest

Reviewed By: houseroad

Differential Revision: D22521730

fbshipit-source-id: 270462008c6ac73cd90aecd77c5de112fc93ea8d
2020-07-14 10:40:52 -07:00
87bf04fe12 AvgPool: Ensure all cells are valid in ceil mode (#41368)
Summary:
Closes https://github.com/pytorch/pytorch/issues/36977

This avoids the division by zero that was causing NaNs to appear in the output. `AvgPool2d` and `AvgPool3d` both had this issue on CPU and CUDA.
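
A hypothetical repro sketch (shapes chosen for illustration, not taken from the issue): with `ceil_mode=True` the last pooling window could fall entirely inside the padding, making the divisor zero:

```python
import torch
import torch.nn as nn

# Hypothetical configuration: with ceil_mode=True the final window may
# start past the valid input; before this fix the per-window divisor
# could become 0 (with count_include_pad=False), yielding NaNs.
pool = nn.AvgPool2d(kernel_size=2, stride=2, padding=1,
                    ceil_mode=True, count_include_pad=False)
out = pool(torch.ones(1, 1, 3, 3))
assert not torch.isnan(out).any()  # holds after the fix
```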

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41368

Reviewed By: ailzhang

Differential Revision: D22520013

Pulled By: ezyang

fbshipit-source-id: 3ece7829f858f5bc17c2c1d905266ac510f11194
2020-07-14 09:24:30 -07:00
535e8814a4 Add operators for LiteLMLSTM to Lite Interpreter (#41270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41270

The Smart Keyboard model for Oculus requires operators previously not in the lite interpreter: aten::exp (for floats), aten::ord, aten::lower, aten::__contains__.str_list, aten::slice.str, aten::strip, aten::split.str, and aten::__getitem__.str.

Test Plan:
Verify smart keyboard model can be used:
Check out next diff in stack and follow test instructions there

Reviewed By: iseeyuan

Differential Revision: D22289812

fbshipit-source-id: df574d5af4d4fafb40f0e209b66a93fe02d83020
2020-07-14 09:18:41 -07:00
befb22790f Fix a number of deprecation warnings (#40179)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40179

- Pass no-psabi to suppress GCC's "The ABI for passing
  parameters with 64-byte alignment has changed in GCC 4.6" warning
- Fix use of deprecated data() accessor (and minor optimization: hoist
  accessor out of loop)
- Undeprecate NetDef.num_workers; no one is seriously planning to fix these
- Suppress warnings about deprecated pthreadpool types

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22234138

Pulled By: ezyang

fbshipit-source-id: 6a1601b6d7551a7e6487a44ae65b19acdcb7b849
2020-07-14 09:11:34 -07:00
13dd53b3d2 [Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D22523334

fbshipit-source-id: e687e26f68a4f923164a51ce0b69ec1d131b9022
2020-07-14 08:42:23 -07:00
e888c3bca1 Update torch.set_default_dtype doc (#41263)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41263

Test Plan: Imported from OSS

Differential Revision: D22482989

Pulled By: anjali411

fbshipit-source-id: 2aadfbb84bbab66f3111970734a37ba74d817ffd
2020-07-14 07:29:49 -07:00
c20426f86d Fix torch.cuda.check_error type errors (#41330)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41330

`torch.cuda.check_error` is annotated as taking an `int` as argument but when running `torch.cuda.check_error(34)` one would get:
```
TypeError: cudaGetErrorString(): incompatible function arguments. The following argument types are supported:
    1. (arg0: torch._C._cudart.cudaError) -> str

Invoked with: 34
```
Even if one explicitly casted the argument, running `torch.cuda.check_error(torch._C._cudart.cudaError(34))` would give:
```
AttributeError: 'str' object has no attribute 'decode'
```

This PR fixes both issues (thus allowing `check_error` to be called with an un-casted int) and adds a test.
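
A minimal sketch of the fixed behavior (error code 34 reused from the example above):

```python
import torch

# After the fix a plain int is accepted directly; the decoded cudaError
# message is raised as an exception.
try:
    torch.cuda.check_error(34)
except RuntimeError as e:
    print(e)
```
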
ghstack-source-id: 107628709

Test Plan: Unit tests

Reviewed By: ezyang

Differential Revision: D22500549

fbshipit-source-id: 9170c1e466dd554d471e928b26eb472a712da9e1
2020-07-14 00:47:14 -07:00
80d5b3785b Add torch.logit function (#41062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41062

Add torch.logit function
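
Example usage:

```python
import torch

# logit(x) = log(x / (1 - x)); the optional eps clamps inputs to
# [eps, 1 - eps] before the transform.
x = torch.tensor([0.25, 0.5, 0.75])
print(torch.logit(x))  # tensor([-1.0986,  0.0000,  1.0986])
```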

Test Plan: buck test mode/dev-nosan //caffe2/test:torch -- "logit"

Reviewed By: hl475

Differential Revision: D22406912

fbshipit-source-id: b303374f4c68850eb7477eb0645546a24b844606
2020-07-13 19:33:20 -07:00
34e11b45c9 Remove thrust casting from static_cast_with_inter_type (#39905)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39905

Reviewed By: ZolotukhinM

Differential Revision: D22510307

Pulled By: ngimel

fbshipit-source-id: 34357753fca4f2a8d5e2b1bbf8de8d642ca9bb20
2020-07-13 19:16:00 -07:00
5f6c6ed157 Fix FC issue (#41198)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41198

https://github.com/pytorch/pytorch/pull/39611 unified signatures of some ops taking TensorOptions arguments by making them optional.
That has FC implications, but only for models written with a PyTorch version after that change (see the explanation in the description of that PR).

However, it also changed the default from `pin_memory=False` to `pin_memory=None`, which actually breaks FC for preexisting models too if they're re-exported with a newer PyTorch,
because we materialize default values when exporting. This is bad.

This PR reverts that particular part of https://github.com/pytorch/pytorch/pull/39611 to revert the FC breakage.
ghstack-source-id: 107475024

Test Plan: waitforsandcastle

Reviewed By: bhosmer

Differential Revision: D22461661

fbshipit-source-id: ba2776267c3bba97439df66ecb50be7c1971d20d
2020-07-13 18:48:56 -07:00
ca1b8ebbcb move misc implementation out of jit/__init__.py (#41154)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41154

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D22445213

Pulled By: suo

fbshipit-source-id: 200545715c5ef13beb1437f49e01efb21498ddb7
2020-07-13 16:59:55 -07:00
6392713584 add spaces in .md annotation for python indent (#41260)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41260

Reviewed By: ezyang

Differential Revision: D22504634

Pulled By: ailzhang

fbshipit-source-id: 9d2d605dc19b07896ee4b1811fcd34d4dcb9b0c7
2020-07-13 15:11:46 -07:00
b6e1944d35 .circleci: Explicitly remove nvidia apt repos (#41367)
Summary:
The nvidia apt repositories seem to be left over on the amd nodes, so
let's just go ahead and remove them explicitly if we're not testing for
CUDA.

Example: https://app.circleci.com/pipelines/github/pytorch/pytorch/190222/workflows/8f75b5cd-1afd-43dc-9fa7-f7b058f07b46/jobs/6223743/steps

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41367

Reviewed By: ezyang

Differential Revision: D22513844

Pulled By: seemethere

fbshipit-source-id: 6da4dd8423de5f7ec80c7904187cf80c1b91ab14
2020-07-13 15:05:57 -07:00
d601325de4 update operators in the mapping to fp16 emulation
Summary: add logit and swish to this list

Test Plan: f203925461

Reviewed By: amylittleyang

Differential Revision: D22506814

fbshipit-source-id: b449e4ea16354cb76915adb01cf317cffb494733
2020-07-13 14:08:24 -07:00
4196605776 helper function to print out all DDP-relevant env vars (#41297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41297

GH issue: https://github.com/pytorch/pytorch/issues/40105

Add a helper function to DDP to print out all relevant env vars for debugging
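
A minimal sketch of what such a helper can look like (hypothetical name and variable list, mirroring the example output below):

```python
import os

_RELEVANT_ENV_VARS = [
    "RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_PORT", "MASTER_ADDR",
    "CUDA_VISIBLE_DEVICES", "GLOO_SOCKET_IFNAME", "GLOO_DEVICE_TRANSPORT",
    "NCCL_SOCKET_IFNAME", "NCCL_BLOCKING_WAIT",
]

def _dump_ddp_relevant_env_vars():
    # print each DDP-relevant variable, or "N/A" when it is unset
    for var in _RELEVANT_ENV_VARS:
        print("env:%s=%s" % (var, os.environ.get(var, "N/A")))
```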

Test Plan:
test through unittest, example output:
 ---
env:RANK=3
env:LOCAL_RANK=N/A
env:WORLD_SIZE=N/A
env:MASTER_PORT=N/A
env:MASTER_ADDR=N/A
env:CUDA_VISIBLE_DEVICES=N/A
env:GLOO_SOCKET_IFNAME=N/A
env:GLOO_DEVICE_TRANSPORT=N/A
env:NCCL_SOCKET_IFNAME=N/A
env:NCCL_BLOCKING_WAIT=N/A
...
 ---

Reviewed By: mrshenli

Differential Revision: D22490486

fbshipit-source-id: 5dc7d2a18111e5a5a12a1b724d90eda5d35acd1c
2020-07-13 14:03:04 -07:00
6e6931e234 fix duplicate extern sdot and missing flags (#41195)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41195

`BLAS_F2C` is set in `THGeneral.h`.
`sdot` is redefined with a double return type when `BLAS_F2C` is set and `BLAS_USE_CBLAS_DOT` is not.

Test Plan: CircleCI green, ovrsource green

Reviewed By: malfet

Differential Revision: D22460253

fbshipit-source-id: 75f17b3e47da0ed33fcadc2843a57ad616f27fb5
2020-07-13 13:43:48 -07:00
0c77bd7c0b Quantization: preserving pre and post forward hooks (#37233)
Summary:
1. During convert(), preserve the module's **pre and post forward** hooks (a sketch follows below)
2. During fusion, preserve only the module's **pre forward** hooks (because after fusion the output is no longer the same)
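
A minimal sketch of the preserved behavior (the workflow is illustrative, using the eager-mode quantization API):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(4, 4))

def pre_hook(module, inputs):
    # fires before every forward call
    print("pre-forward on", type(module).__name__)

model[0].register_forward_pre_hook(pre_hook)
# After torch.quantization.convert(model, inplace=True), the pre-forward
# hook is carried over to the quantized Linear as well.
```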

Pull Request resolved: https://github.com/pytorch/pytorch/pull/37233

Differential Revision: D22425141

Pulled By: jerryzh168

fbshipit-source-id: e69b81821d507dcd110d2ff3594ba94b9593c8da
2020-07-13 12:41:24 -07:00
c451ddaeda Add shape inference functions for int8 quantization related ops (#41215)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41215

To unblock int8 model productization on accelerators, we need the shape and type info for all the blobs after int8 quantization. This diff adds shape inference functions for int8 quantization related ops.

Test Plan:
```
buck test caffe2/caffe2/quantization/server:int8_gen_quant_params_test
buck test caffe2/caffe2/quantization/server:fully_connected_dnnlowp_op_test
```

Reviewed By: hx89

Differential Revision: D22467487

fbshipit-source-id: 8298abb0df3457fcb15df81f423f557c1a11f530
2020-07-13 12:02:11 -07:00
7183fd20f8 Add interpolate-style overloads to aten::upsample* ops (#37176)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37176

The non-deprecated user-facing interface to these ops (F.interpolate)
has a good interface: output size and scale are both specified as
a scalar or list, and exactly one must be present.  These aten ops
have an older, clunkier interface where output size is required and
scales are specified as separate optional scalars per dimension.
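
For contrast, the user-facing interface these overloads mirror:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
F.interpolate(x, size=(16, 16))     # give an explicit output size...
F.interpolate(x, scale_factor=2.0)  # ...or a scale factor, never both
```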

This change adds new overloads to the aten ops that match the interface
of interpolate.  The plan is to eventually remove the old overloads,
resulting in roughly net-zero code added.  I also believe it is possible
to push this interface down further, eliminating multiple optional<double>
arguments, and simplifying the implementations.

The rollout plan is to land this, wait for a reasonable interval for
forwards-compatibility (maybe 1 week?), land the change that updates
interpolate to call these overloads, wait for a reasonable interval
for backwards-compatibility (maybe 6 months?), then remove the old
overloads.

This diff does not add the `.out` variants of the ops because they
are not currently accessible through any user-facing API.

ghstack-source-id: 106938113

Test Plan:
test_nn covers these ops fairly well, so that should prevent this diff
from breaking anything on its own.

test_nn on the next diff in the stack actually uses these new overloads,
so that should validate that they are actually correct.

Differential Revision: D21209989

fbshipit-source-id: 2b74d230401f071364eb05e138cdaa55279cfe91
2020-07-13 11:53:29 -07:00
fb9e44f8dd Add support for float[]? arguments in native_functions.yaml (#37175)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37175

ghstack-source-id: 106938114

Test Plan: Upcoming diffs use this for upsampling.

Differential Revision: D21209994

fbshipit-source-id: 1a71c07e45e28772a2bbe450b68280dcc0fe2def
2020-07-13 11:51:10 -07:00
d04a2e4dae Back out "Revert D22329069: Self binning histogram" (#41313)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41313

This diff backs out the backout diff. The failure was due to C++ `or`
not being supported in MSVC; it is now replaced with `||`.

Original commit changeset: fc7f3f8c968d

Test Plan: Existing unit tests, check github CI.

Reviewed By: malfet

Differential Revision: D22494777

fbshipit-source-id: 3271288919dc3a6bfb82508ab9d021edc910ae45
2020-07-13 11:46:34 -07:00
86d803a9da .cirlceci: Setup nvidia runtime for cu as well (#41268)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41268

We also want nvidia runtime packages to get installed when the
BUILD_ENVIRONMENT also includes "*cu*"

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22505885

Pulled By: seemethere

fbshipit-source-id: 4d8e70ed8aed9c6fd1828bc13cf7d5b0f8f50a0a
2020-07-13 10:29:25 -07:00
dea39b596e reduce logging for layernorm (#41305)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41305

Added a warning message when layernorm under/overflows (matching what
nnpi does), and reduced the logging frequency to once every 1000 occurrences.

Test Plan: compilation

Reviewed By: yinghai

Differential Revision: D22492726

fbshipit-source-id: 9343beeae6e65bf3846c6b3d2edd2a08dac85ed6
2020-07-13 10:23:46 -07:00
67a4f375cd Pass the number of indices but not embedding size in PyTorch operator (#41315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41315

We should pass the number of indices, rather than the embedding size, to the fused SparseAdagrad PyTorch operator.

Reviewed By: jianyuh

Differential Revision: D22495422

fbshipit-source-id: ec5d3a5c9547fcd8f95106d912b71888217a5af0
2020-07-12 20:55:40 -07:00
98df9781a7 Impl for ParameterList (#41259)
Summary:
This is a new PR for https://github.com/pytorch/pytorch/issues/40850, https://github.com/pytorch/pytorch/issues/40987 and https://github.com/pytorch/pytorch/issues/41206 (which I unintentionally closed), as I had some rebase issues with that one. Very sorry about that. I have also fixed the tests that failed in that PR.

This diff contains the implementation of the C++ API for ParameterList from https://github.com/pytorch/pytorch/issues/25883.
Refer to the Python API: bc9e8af218/torch/nn/modules/container.py (L376)
I am not sure about some naming differences between the C++ and Python APIs; for example, should `append` be called `push_back`? (See the illustrative Python usage below.)
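
Illustrative usage of the Python API being mirrored:

```python
import torch
import torch.nn as nn

# `append` is the operation whose C++ name (append vs. push_back)
# is in question.
params = nn.ParameterList([nn.Parameter(torch.randn(4)) for _ in range(3)])
params.append(nn.Parameter(torch.randn(4)))
print(len(params))  # 4
```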

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41259

Test Plan: Add unit tests in this diff

Differential Revision: D22495780

Pulled By: glaringlee

fbshipit-source-id: 79ea3592db640f35477d445ecdaeafbdad814bec
2020-07-12 20:50:31 -07:00
fa153184c8 Fake Quantization Per Channel Kernel Core Implementation (CPU) (#41037)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41037

This diff contains the core implementation for the fake quantizer per channel kernel that supports back propagation on the scale and zero point.

Test Plan:
On a devvm, use:
- `buck test //caffe2/test:quantization -- learnable_forward_per_channel`
- `buck test //caffe2/test:quantization -- learnable_backward_per_channel`

Reviewed By: z-a-f

Differential Revision: D22395665

fbshipit-source-id: 280c2405d04adfeda9fb9cfc94d89e8d868e0d41
2020-07-12 12:14:00 -07:00
5e72ebeda3 Fake Quantization Per Tensor Kernel Core Implementation (CPU) (#41029)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41029

This diff contains the core implementation for the fake quantizer per tensor kernel that supports back propagation on the scale and zero point.
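
A minimal sketch of the fake-quantization math (not the kernel itself; the learnable variant additionally provides gradients w.r.t. `scale` and `zero_point`):

```python
import torch

def fake_quantize_per_tensor(x, scale, zero_point, qmin=-128, qmax=127):
    # quantize-dequantize: emulate int8 rounding while staying in float
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale
```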

Test Plan:
On a devvm, use:
- `buck test //caffe2/test:quantization -- learnable_forward_per_tensor`
- `buck test //caffe2/test:quantization -- learnable_backward_per_tensor`

Reviewed By: z-a-f

Differential Revision: D22394145

fbshipit-source-id: f6748b635b86679aa9174a8065e6be5e20a95d81
2020-07-12 12:11:38 -07:00
402be850a8 [quant] Adding zero point type check for per channel quantization (#40811)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40811

Test Plan: Imported from OSS

Differential Revision: D22319417

Pulled By: z-a-f

fbshipit-source-id: 7be3a511ddd33b5fe749a83166bbc5874d1bd539
2020-07-12 11:40:19 -07:00
4b4184fc69 [quant][graphmode] use RemoveMutation to remove append (#41161)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41161

Test Plan: Imported from OSS

Reviewed By: z-a-f

Differential Revision: D22446714

fbshipit-source-id: 15da28ef773300a141603d67a1c4524f1ec32239
2020-07-11 16:49:56 -07:00
106b0b6a62 Op to create quant scheme blob (#40760)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40760

Add op to create a quant scheme.

Test Plan:
buck test mode/opt caffe2/caffe2/quantization/server:int8_quant_scheme_blob_fill_test

{F241838981}

Reviewed By: csummersea

Differential Revision: D22228154

fbshipit-source-id: 1b7a02c06937c68e2fcccf77eb10a965300ed732
2020-07-11 10:53:10 -07:00
edcf2cdf86 [quant] dequantize support list and tuple of tensors (#41079)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41079

Test Plan: Imported from OSS

Differential Revision: D22420700

fbshipit-source-id: bc4bf0fb47dcf8b94b11fbdc91e8d5a75142b7be
2020-07-11 10:44:19 -07:00
c864158475 Add fp16 support to SparseLengthSum PyTorch operator (#41058)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41058

The SparseLengthsSum PyTorch operator previously accepted only float and double types; this diff adds fp16 support to it.

Reviewed By: jianyuh

Differential Revision: D22387253

fbshipit-source-id: 2a7d03ceaadbb7b04077cff72ab77da6457ba989
2020-07-11 07:54:32 -07:00
28291d3cf8 [caffe2] Revert D22220798 (#41302)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41302

Test Plan:
```
buck test //caffe2/caffe2/fb/predictor:black_box_predictor_test
```

Differential Revision: D22492356

fbshipit-source-id: efcbc3c67abda5cb9da47e633804a4800d92f89b
2020-07-11 03:28:29 -07:00
e544bf2924 fix the range of the random weights used in the int8fc test (#41303)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41303

the error came from I0710 18:02:48.025024 1780875 NNPIOptions.cpp:49]
[NNPI_LOG][D] [KS] convert_base_kernel_ivp.cpp(524): Output Scale 108240.101562
is out of valid range +-(Min 0.000061 Max 65504.000000)!!!

It seems the weights we are using are too small, generating scaling
factors outside the range of fp16 (>65k). I am tentatively increasing this
factor to a higher value (10x bigger) to avoid this.

Also increased max_examples to 100

Test Plan: ran this test

Reviewed By: yinghai

Differential Revision: D22492481

fbshipit-source-id: c0f9e59b0e70895ab787868ef1d87e6e80106554
2020-07-11 00:19:29 -07:00
a1ed6e1eb3 Revert D22467871: add check for duplicated op registration in JIT
Test Plan: revert-hammer

Differential Revision:
D22467871 (a548c6b18f)

Original commit changeset: 9b7a40a217e6

fbshipit-source-id: b594d4d0a079f7e24ef0efb45476ded2838cbef1
2020-07-10 23:39:23 -07:00
095886fa42 [caffe2] Fix the issues when using CUB RadixSort (#41299)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41299

When using `cub::DeviceRadixSort::SortPairs` (https://nvlabs.github.io/cub/structcub_1_1_device_radix_sort.html), the `end_bit` argument, i.e. the most-significant bit index (exclusive) needed for key comparison, should be computed as `int(log2(float(num_rows)) + 1)` instead of `int(log2(float(num_indices)) + 1)`. This is because all the values in the indices array are guaranteed to be less than num_rows (hash_size), not num_indices. Thanks ngimel for pointing this out and thanks malfet for quickly fixing the log2() compilation issues.
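
A worked example of the bit-count arithmetic:

```python
from math import log2

# All sort keys are row indices, guaranteed < num_rows, so the number of
# significant bits must be derived from num_rows, not from how many
# indices happen to be sorted.
num_rows, num_indices = 1_000_000, 128
end_bit_wrong = int(log2(float(num_indices)) + 1)  # 8 bits -- too few
end_bit_right = int(log2(float(num_rows)) + 1)     # 20 bits
```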

Note:
An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.

Test Plan:
```
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_sum_gradient \(caffe2\.caffe2\.fb\.net_transforms\.tests\.fuse_sparse_ops_test\.TestFuseSparseOps\)' --print-passing-details
```

Reviewed By: malfet

Differential Revision: D22491662

fbshipit-source-id: 4fdabe86244c948af6244f9bd91712844bf1dec1
2020-07-10 22:39:43 -07:00
d1f06da9b7 Solve log2(x:int) ambiguity by using log2(float(x)) (#41295)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41295

Differential Revision: D22490995

Pulled By: malfet

fbshipit-source-id: 17037e551ce5986f3162389a61932099563c02a7
2020-07-10 20:12:36 -07:00
1c098ae339 Fix arg type annotations in jit.trace and onnx.export (#41093)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40350

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41093

Differential Revision: D22477950

Pulled By: malfet

fbshipit-source-id: f1141c129b6d9efb373d22291b441df86c529ddd
2020-07-10 20:07:05 -07:00
877a59967f Ampere has CUDA_MAX_THREADS_PER_SM == 2048 (#41138)
Summary:
See: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf
page 44, table 5
![image](https://user-images.githubusercontent.com/1032377/86958633-56051580-c111-11ea-94da-c726a61dc00a.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41138

Differential Revision: D22488904

Pulled By: malfet

fbshipit-source-id: 97bd585d91e1a368f51aa6bd52081bc57d42dbf8
2020-07-10 20:02:20 -07:00
6cbb92494d Better THGeneric.h generation rules in bazel (#41285)
Summary:
Bazel doesn't do a good job of checking BLAS library capabilities, so hardcode the undef of BLAS_F2C

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41285

Differential Revision: D22489781

Pulled By: malfet

fbshipit-source-id: 13a14f31e08d7f9ded49731e4fd23663bac75cd2
2020-07-10 17:40:04 -07:00
67f5d68fdf Revert D22465221: [pytorch][PR] Reducing size of docker Linux image
Test Plan: revert-hammer

Differential Revision:
D22465221 (7c143e5d3e)

Original commit changeset: 487542597294

fbshipit-source-id: f085763a13497bd5ceea0ed6aa7676320c8806bf
2020-07-10 17:12:26 -07:00
ac3542fa59 Define PSIMD_SOURCE_DIR when including FP16 (#41233)
Summary:
Avoids a superfluous redownload when *NNPACK is not used (e.g. on Power)

Example: https://powerci.osuosl.org/job/pytorch-master-nightly-py3-linux-ppc64le/1128/consoleFull
Search for "Downloading PSimd"

See also https://github.com/pytorch/pytorch/issues/41178

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41233

Differential Revision: D22488833

Pulled By: malfet

fbshipit-source-id: 637291419ddd3b2a8dc25e211a4ebbba955e5855
2020-07-10 16:55:10 -07:00
abea7cd561 msvc anonymous namespace bug (#41199)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41199

workaround for: https://developercommunity.visualstudio.com/content/problem/900452/variable-in-anonymous-namespace-has-external-linka.html

Test Plan: CI green, ovrsource green

Reviewed By: malfet

Differential Revision: D22462050

fbshipit-source-id: 11a2fd6a4db1f29ce350699cfc3121dc89ab7ef6
2020-07-10 16:45:14 -07:00
48d6e2adce Disable the mkldnn for conv2d in some special cases (#40610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40610

We have benchmarked several models, which shows the native implementation of conv2d is faster than the mkldnn path in some special cases. For group conv, the native implementation does not batch all the groups.

Test Plan:
```
import torch
import torch.nn.functional as F

import numpy as np

from timeit import Timer

num = 50

S = [
#         [1, 1, 100, 40, 16, 3, 3, 1, 1, 1, 1],
#         [1, 2048, 4, 2, 512, 1, 1, 1, 1, 0, 0],
#         [1, 512, 4, 2, 512, 3, 3, 1, 1, 1, 1],
#         [1, 512, 4, 2, 2048, 1, 1, 1, 1, 0, 0],
#         [1, 2048, 4, 2, 512, 1, 1, 1, 1, 0, 0],
#         [1, 512, 4, 2, 512, 3, 3, 1, 1, 1, 1],
#         [1, 512, 4, 2, 2048, 1, 1, 1, 1, 0, 0],
#         [1, 2048, 4, 2, 512, 1, 1, 1, 1, 0, 0],
#         [1, 512, 4, 2, 512, 3, 3, 1, 1, 1, 1],
#         [1, 512, 4, 2, 2048, 1, 1, 1, 1, 0, 0],
#         [1, 2048, 4, 2, 512, 1, 1, 1, 1, 0, 0],
#         [1, 512, 4, 2, 512, 3, 3, 1, 1, 1, 1],
#         [1, 512, 4, 2, 2048, 1, 1, 1, 1, 0, 0],
#         [1, 2048, 4, 2, 512, 1, 1, 1, 1, 0, 0],
#         [1, 512, 4, 2, 512, 3, 3, 1, 1, 1, 1],
#         [1, 512, 4, 2, 2048, 1, 1, 1, 1, 0, 0],
#         [1, 2048, 4, 2, 512, 1, 1, 1, 1, 0, 0],
#         [1, 512, 4, 2, 512, 3, 3, 1, 1, 1, 1],
#         [1, 512, 4, 2, 2048, 1, 1, 1, 1, 0, 0],
[1, 3, 224, 224, 64, 7, 7, 2, 2, 3, 3, 1],
[1, 64, 56, 56, 128, 1, 1, 1, 1, 0, 0, 1],
[1, 128, 56, 56, 128, 3, 3, 1, 1, 1, 1, 32],
[1, 128, 56, 56, 256, 1, 1, 1, 1, 0, 0, 1],
[1, 64, 56, 56, 256, 1, 1, 1, 1, 0, 0, 1],
[1, 256, 56, 56, 128, 1, 1, 1, 1, 0, 0, 1],
[1, 128, 56, 56, 128, 3, 3, 1, 1, 1, 1, 32],
[1, 128, 56, 56, 256, 1, 1, 1, 1, 0, 0, 1],
[1, 256, 56, 56, 128, 1, 1, 1, 1, 0, 0, 1],
[1, 128, 56, 56, 128, 3, 3, 1, 1, 1, 1, 32],
[1, 128, 56, 56, 256, 1, 1, 1, 1, 0, 0, 1],
[1, 256, 56, 56, 256, 1, 1, 1, 1, 0, 0, 1],
[1, 256, 56, 56, 256, 3, 3, 2, 2, 1, 1, 32],
[1, 256, 28, 28, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 256, 56, 56, 512, 1, 1, 2, 2, 0, 0, 1],
[1, 512, 28, 28, 256, 1, 1, 1, 1, 0, 0, 1],
[1, 256, 28, 28, 256, 3, 3, 1, 1, 1, 1, 32],
[1, 256, 28, 28, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 28, 28, 256, 1, 1, 1, 1, 0, 0, 1],
[1, 256, 28, 28, 256, 3, 3, 1, 1, 1, 1, 32],
[1, 256, 28, 28, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 28, 28, 256, 1, 1, 1, 1, 0, 0, 1],
[1, 256, 28, 28, 256, 3, 3, 1, 1, 1, 1, 32],
[1, 256, 28, 28, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 28, 28, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 28, 28, 512, 3, 3, 2, 2, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 28, 28, 1024, 1, 1, 2, 2, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1],
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32],
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 1024, 3, 3, 2, 2, 1, 1, 32],
[1, 1024, 7, 7, 2048, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 14, 14, 2048, 1, 1, 2, 2, 0, 0, 1],
[1, 2048, 7, 7, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 7, 7, 1024, 3, 3, 1, 1, 1, 1, 32],
[1, 1024, 7, 7, 2048, 1, 1, 1, 1, 0, 0, 1],
[1, 2048, 7, 7, 1024, 1, 1, 1, 1, 0, 0, 1],
[1, 1024, 7, 7, 1024, 3, 3, 1, 1, 1, 1, 32],
[1, 1024, 7, 7, 2048, 1, 1, 1, 1, 0, 0, 1],
    ]
for x in range(len(S)):  # iterate over every benchmark shape
    P = S[x]
    print(P)
    (N, C, H, W) = P[0:4]
    M = P[4]
    (kernel_h, kernel_w) = P[5:7]
    (stride_h, stride_w) = P[7:9]
    (padding_h, padding_w) = P[9:11]

    X_np = np.random.randn(N, C, H, W).astype(np.float32)
    W_np = np.random.randn(M, C, kernel_h, kernel_w).astype(np.float32)
    X = torch.from_numpy(X_np)
    g = P[11]
    conv2d_pt = torch.nn.Conv2d(
        C, M, (kernel_h, kernel_w), stride=(stride_h, stride_w),
        padding=(padding_h, padding_w), groups=g, bias=True)

    class ConvNet(torch.nn.Module):
        def __init__(self):
            super(ConvNet, self).__init__()
            self.conv2d = conv2d_pt

        def forward(self, x):
            return self.conv2d(x)

    model = ConvNet()

    def pt_forward():
        with torch.no_grad():
            model(X)

    torch._C._set_mkldnn_enabled(True)
    t = Timer("pt_forward()", "from __main__ import pt_forward, X")
    print("MKLDNN pt time = {}".format(t.timeit(num) / num * 1000.0))
    torch._C._set_mkldnn_enabled(False)
    t = Timer("pt_forward()", "from __main__ import pt_forward, X")
    print("TH pt time = {}".format(t.timeit(num) / num * 1000.0))

OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 python bm.py
```

output:
```
[1, 3, 224, 224, 64, 7, 7, 2, 2, 3, 3, 1]
MKLDNN pt time = 5.891108009964228
TH pt time = 7.0624795742332935
[1, 64, 56, 56, 128, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 1.4464975893497467
TH pt time = 0.721491202712059
[1, 128, 56, 56, 128, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 1.4036639966070652
TH pt time = 3.299683593213558
[1, 128, 56, 56, 256, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.3908068016171455
TH pt time = 2.227546200156212
[1, 64, 56, 56, 256, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.226586602628231
TH pt time = 1.3865559734404087
[1, 256, 56, 56, 128, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.31307839602232
TH pt time = 2.4284918047487736
[1, 128, 56, 56, 128, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 1.5028003975749016
TH pt time = 3.824346773326397
[1, 128, 56, 56, 256, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.4405963867902756
TH pt time = 2.6227117888629436
[1, 256, 56, 56, 128, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.405764400959015
TH pt time = 2.644723802804947
[1, 128, 56, 56, 128, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 1.5220053866505623
TH pt time = 3.9365867897868156
[1, 128, 56, 56, 256, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.606868200004101
TH pt time = 2.5387956015765667
[1, 256, 56, 56, 256, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 6.0041105933487415
TH pt time = 5.305919591337442
[1, 256, 56, 56, 256, 3, 3, 2, 2, 1, 1, 32]
MKLDNN pt time = 1.4830979891121387
TH pt time = 7.532084975391626
[1, 256, 28, 28, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.025687597692013
TH pt time = 2.2185291908681393
[1, 256, 56, 56, 512, 1, 1, 2, 2, 0, 0, 1]
MKLDNN pt time = 3.5893129743635654
TH pt time = 2.696530409157276
[1, 512, 28, 28, 256, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.8203356079757214
TH pt time = 2.0819314010441303
[1, 256, 28, 28, 256, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.8583215996623039
TH pt time = 2.7761065773665905
[1, 256, 28, 28, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.9077288135886192
TH pt time = 2.045416794717312
[1, 512, 28, 28, 256, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.805021796375513
TH pt time = 2.131381593644619
[1, 256, 28, 28, 256, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.9023251943290234
TH pt time = 2.9028950072824955
[1, 256, 28, 28, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.1174601800739765
TH pt time = 2.275596000254154
[1, 512, 28, 28, 256, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.100480604916811
TH pt time = 2.399571593850851
[1, 256, 28, 28, 256, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.9321337938308716
TH pt time = 2.886691205203533
[1, 256, 28, 28, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.065785188227892
TH pt time = 2.1640316024422646
[1, 512, 28, 28, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 5.891813579946756
TH pt time = 4.2956990003585815
[1, 512, 28, 28, 512, 3, 3, 2, 2, 1, 1, 32]
MKLDNN pt time = 0.9399276040494442
TH pt time = 4.7622935846447945
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.2426914013922215
TH pt time = 2.3699573799967766
[1, 512, 28, 28, 1024, 1, 1, 2, 2, 0, 0, 1]
MKLDNN pt time = 3.0341636016964912
TH pt time = 2.6606030017137527
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.991385366767645
TH pt time = 2.6313263922929764
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7330256141722202
TH pt time = 3.008321188390255
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.880081795156002
TH pt time = 2.289068605750799
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.9583285935223103
TH pt time = 2.6302105747163296
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7322711870074272
TH pt time = 2.8230775892734528
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.8620235808193684
TH pt time = 2.4078205972909927
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.828651014715433
TH pt time = 2.616014201194048
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7084695994853973
TH pt time = 2.8024527989327908
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.7884829975664616
TH pt time = 2.4237345717847347
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.89030060172081
TH pt time = 2.5852439925074577
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.724627785384655
TH pt time = 2.651805803179741
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.249914798885584
TH pt time = 2.0440668053925037
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.722136974334717
TH pt time = 2.531316000968218
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7164162024855614
TH pt time = 2.8521843999624252
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.8891782090067863
TH pt time = 2.436912599951029
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.0049769952893257
TH pt time = 2.649025786668062
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7299130037426949
TH pt time = 2.67714099958539
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.799382768571377
TH pt time = 2.4427592009305954
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.0201382003724575
TH pt time = 2.6285660080611706
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.6983320042490959
TH pt time = 2.9118607938289642
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.8802538104355335
TH pt time = 2.385452575981617
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.9600497893989086
TH pt time = 2.594646792858839
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.5688861943781376
TH pt time = 2.5941073894500732
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.7758505940437317
TH pt time = 2.336081601679325
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.6135251857340336
TH pt time = 2.3902921937406063
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.6303061917424202
TH pt time = 2.6228136010468006
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.8868251852691174
TH pt time = 2.5620524026453495
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.057632204145193
TH pt time = 2.691414188593626
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7316274009644985
TH pt time = 3.14683198928833
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.2674955762922764
TH pt time = 2.602821197360754
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.0993166007101536
TH pt time = 2.609328981488943
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7257938012480736
TH pt time = 2.9255208000540733
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.3086097799241543
TH pt time = 2.544360812753439
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.0537622049450874
TH pt time = 2.6343842037022114
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7194169983267784
TH pt time = 2.9009717889130116
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.6461398042738438
TH pt time = 2.3600555770099163
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.6328082010149956
TH pt time = 2.415131386369467
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.6832938082516193
TH pt time = 2.6299685798585415
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.9594415985047817
TH pt time = 2.509857602417469
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.956229578703642
TH pt time = 2.691046390682459
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7222409918904305
TH pt time = 2.938339803367853
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.9467295855283737
TH pt time = 2.4219116009771824
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.0215882137417793
TH pt time = 2.7782391756772995
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.719242412596941
TH pt time = 2.8529402054846287
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.8062099777162075
TH pt time = 2.9951974004507065
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.1621821969747543
TH pt time = 2.5330167822539806
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.690075010061264
TH pt time = 2.5531245954334736
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.832614816725254
TH pt time = 2.339891381561756
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.7835668064653873
TH pt time = 2.513139396905899
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7026367820799351
TH pt time = 2.796882800757885
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.6479675993323326
TH pt time = 2.4971639923751354
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.9846629686653614
TH pt time = 2.4657804146409035
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.5969022028148174
TH pt time = 2.697007991373539
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.7602720074355602
TH pt time = 2.4498093873262405
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.963611613959074
TH pt time = 2.6310251839458942
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7004458084702492
TH pt time = 2.9164502024650574
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.887732572853565
TH pt time = 2.4575488083064556
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.8350806050002575
TH pt time = 2.23197178915143
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.5626789852976799
TH pt time = 2.704860605299473
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.6168799959123135
TH pt time = 2.2481359727680683
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.5654693879187107
TH pt time = 2.2636358067393303
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.6836861930787563
TH pt time = 2.825192976742983
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.7971909940242767
TH pt time = 2.471243590116501
[1, 1024, 14, 14, 512, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.8480279818177223
TH pt time = 2.553586605936289
[1, 512, 14, 14, 512, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.7191735878586769
TH pt time = 2.6465672068297863
[1, 512, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 2.7811027877032757
TH pt time = 2.457349617034197
[1, 1024, 14, 14, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 5.434317365288734
TH pt time = 4.639615211635828
[1, 1024, 14, 14, 1024, 3, 3, 2, 2, 1, 1, 32]
MKLDNN pt time = 0.9400106035172939
TH pt time = 2.9971951991319656
[1, 1024, 7, 7, 2048, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 4.494664408266544
TH pt time = 3.478870000690222
[1, 1024, 14, 14, 2048, 1, 1, 2, 2, 0, 0, 1]
MKLDNN pt time = 4.8432330042123795
TH pt time = 3.6410867795348167
[1, 2048, 7, 7, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 4.779010973870754
TH pt time = 3.4093930013477802
[1, 1024, 7, 7, 1024, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.8385192044079304
TH pt time = 3.0921380035579205
[1, 1024, 7, 7, 2048, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 3.9088409766554832
TH pt time = 3.130124807357788
[1, 2048, 7, 7, 1024, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 4.0072557888925076
TH pt time = 2.977220807224512
[1, 1024, 7, 7, 1024, 3, 3, 1, 1, 1, 1, 32]
MKLDNN pt time = 0.8867520093917847
TH pt time = 3.1505179964005947
[1, 1024, 7, 7, 2048, 1, 1, 1, 1, 0, 0, 1]
MKLDNN pt time = 4.118196591734886
TH pt time = 3.46621660515666
```

Reviewed By: dzhulgakov

Differential Revision: D22250817

fbshipit-source-id: c9dc61b633e11a378a05810d711a696effd7f02b
2020-07-10 16:43:29 -07:00
ce3ba3b9bc [JIT] Add support for backend-lowered submodules (#41146)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41146

**Summary**
This commit adds support for using `Modules` that have been lowered as
submodules in `ScriptModules`.

**Test Plan**
This commit adds execution and save/load tests to test_backends.py for
backend-lowered submodules.

**Fixes**
This commit fixes #40069.

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D22459543

Pulled By: SplitInfinity

fbshipit-source-id: 02e0c0ccdce26c671ade30a34aca3e99bcdc5ba7
2020-07-10 16:35:24 -07:00
1f2e91fa4f Implicit casting resulting in internal build failure. (#41272)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41272

Implicit casting from int to float is resulting in vec256_test build failure
internally. This diff fixes that.

Test Plan: Build vec256_test for android and run it on android phone.

Reviewed By: ljk53, paulshaoyuqiao

Differential Revision: D22484635

fbshipit-source-id: ebb9fc2eccb8261ab01d8266150fc3b05166f1e7
2020-07-10 16:29:54 -07:00
7bae5780a2 Revert D22329069: Self binning histogram
Test Plan: revert-hammer

Differential Revision:
D22329069 (16c8146da9)

Original commit changeset: 28406b94e284

fbshipit-source-id: fc7f3f8c968d1ec7d2a1cf7a4d05900f51055d82
2020-07-10 16:22:29 -07:00
dd0c98d82a [ONNX]Add tests for ConvTranspose 1D and 3D (#40703)
Summary:
Add tests for ConvTranspose 1D and 3D

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40703

Reviewed By: hl475

Differential Revision: D22480087

Pulled By: houseroad

fbshipit-source-id: 92846ed7181f543af20669e5ea191bfb5522ea13
2020-07-10 16:10:09 -07:00
9daba76ba1 Change to.dtype_layout to c10-full (#41169)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41169

-
ghstack-source-id: 107537240

Test Plan: waitforsandcastle

Differential Revision: D22289257

fbshipit-source-id: ed3cc06327951fa886eb3b8f1c8bcc014ae2bc41
2020-07-10 16:04:34 -07:00
7c143e5d3e Reducing size of docker Linux image (#41207)
Summary:
# Description
The goal is to reduce the size of the docker image. I checked a few things:
* Docker layer overlaps
* Removing .git folder
* Removing intermediate build artifacts (*.o and *.a)

The only one that gave a satisfying result was the 3rd approach, removing *.o and *.a files. The final image went from 10 GB to 9.7 GB.

I used Dive (https://github.com/wagoodman/dive) to inspect the Docker image manually.

# Test:
* Check the image size was reduced
* No test failures in CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41207

Test Plan:
* Check the image size was reduced
* No test failures in CI

Differential Revision: D22465221

Pulled By: ssylvain

fbshipit-source-id: 48754259729401e3c08447b0fa0630ca7217cb98
2020-07-10 15:59:18 -07:00
0651887eb4 Improve repr for torch.iinfo & torch.finfo (#40488)
Summary:
- fix https://github.com/pytorch/pytorch/issues/39991
- Include directly `min`/`max`/`eps`/`tiny` values in repr of `torch.iinfo` & `torch.finfo` for inspection
- Use `torch.float16` / `torch.int16` instead of the non-corresponding names `Half` / `Short`
- The improved repr is shown just like:
```
>>> torch.iinfo(torch.int8)
iinfo(type=torch.int8, max=127, min=-128)
>>> torch.iinfo(torch.int16)
iinfo(type=torch.int16, max=32767, min=-32768)
>>> torch.iinfo(torch.int32)
iinfo(type=torch.int32, max=2.14748e+09, min=-2.14748e+09)
>>> torch.iinfo(torch.int64)
iinfo(type=torch.int64, max=9.22337e+18, min=-9.22337e+18)
>>> torch.finfo(torch.float16)
finfo(type=torch.float16, eps=0.000976563, max=65504, min=-65504, tiny=6.10352e-05)
>>> torch.finfo(torch.float32)
finfo(type=torch.float32, eps=1.19209e-07, max=3.40282e+38, min=-3.40282e+38, tiny=1.17549e-38)
>>> torch.finfo(torch.float64)
finfo(type=torch.float64, eps=2.22045e-16, max=1.79769e+308, min=-1.79769e+308, tiny=2.22507e-308)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40488

Differential Revision: D22445301

Pulled By: mruberry

fbshipit-source-id: 552af9904c423006084b45d6c4adfb4b5689db54
2020-07-10 15:22:55 -07:00
cb6c3526c6 Migrate addmm, addbmm and THBlas_gemm to ATen (#40927)
Summary:
Resubmit #40927
Closes https://github.com/pytorch/pytorch/issues/24679, closes https://github.com/pytorch/pytorch/issues/24678

`addbmm` depends on `addmm`, so it needed to be ported at the same time. I also removed `THTensor_(baddbmm)`, which I noticed had already been ported and so was just dead code.

After having already written this code, I had to fix merge conflicts with https://github.com/pytorch/pytorch/issues/40354, which revealed there was already an established place for cpu blas routines in ATen. However, the version there doesn't make use of ATen's AVX dispatching, so I thought I'd wait for comments before migrating this into that style.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40927

Reviewed By: ezyang

Differential Revision: D22468490

Pulled By: ngimel

fbshipit-source-id: f8a22be3216f67629420939455e31a88af20201d
2020-07-10 14:30:55 -07:00
16c8146da9 Self binning histogram (#40875)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40875

This op uses the given num_bins and a spacing strategy to automatically bin and compute the histogram of given matrices.
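
A minimal sketch of the idea, assuming an even-spacing strategy (the op's actual binning strategies may differ):

```python
import numpy as np

def self_binning_histogram(x, num_bins):
    # derive bin edges from the data's own range, then count per bin
    lo, hi = float(x.min()), float(x.max())
    edges = np.linspace(lo, hi, num_bins + 1)
    counts, _ = np.histogram(x, bins=edges)
    return counts, edges
```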

Test Plan: Unit tests.

Reviewed By: neha26shah

Differential Revision: D22329069

fbshipit-source-id: 28406b94e284d52d875f73662fc82f93dbc00064
2020-07-10 13:55:42 -07:00
9b0393fcf1 [ONNX]Fix export of flatten (#40418)
Summary:
Shape was passed to `_reshape_to_tensor` as a Constant, which cannot reflect the shape of the input when the model is exported with dynamic axes set. Instead of a Constant, pass the output of a Shape-Slice-Concat subgraph to compute the shape for the Reshape node in the `_reshape_to_tensor` function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40418

Reviewed By: hl475

Differential Revision: D22480127

Pulled By: houseroad

fbshipit-source-id: 11853adb6e6914936871db1476916699141de435
2020-07-10 13:06:25 -07:00
a548c6b18f add check for duplicated op registration in JIT (#41214)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41214

Same as D21032976, add check for duplicated op name in JIT

Test Plan:
run full JIT predictor
also
buck test pytorch-playground

Reviewed By: smessmer

Differential Revision: D22467871

fbshipit-source-id: 9b7a40a217e6c63cca44cad54f9f657b8b207a45
2020-07-10 12:19:04 -07:00
75b6dd3d49 Wrap Caffe2's SparseLengthsSum into a PyTorch op (#39596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39596

This diff wraps Caffe2's SparseLengthsSum on GPU as a PT op.

Reviewed By: jianyuh

Differential Revision: D21895309

fbshipit-source-id: 38bb156f9be8d28225d2b44f5b4c93d27779aff9
2020-07-10 11:19:13 -07:00
d927aee312 Small clarification of torch.cuda.amp multi-model example (#41203)
Summary:
Some people have been confused by `retain_graph` in the snippet; they thought it was an additional requirement imposed by amp.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41203

Differential Revision: D22463700

Pulled By: ngimel

fbshipit-source-id: e6fc8871be2bf0ecc1794b1c6f5ea99af922bf7e
2020-07-10 11:13:26 -07:00
4a09501fbe LogitOp LUT based fake FP16 Op. (#41258)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41258

LUT-based fake FP16 implementation of LogitOp.

(Note: this ignores all push blocking failures!)

Test Plan: test_op_nnpi_fp16.py covers the test_logit testing.

Reviewed By: hyuen

Differential Revision: D22351963

fbshipit-source-id: e2ed2bd9bfdc58c6f823d7d41557109c08628bd7
2020-07-10 10:53:42 -07:00
33f9fbf8ba Modularize parsing NCCL_BLOCKING_WAIT in ProcessGroupNCCL (#41076)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41076

Modularizes parsing of the NCCL_BLOCKING_WAIT environment variable in the ProcessGroupNCCL constructor.
ghstack-source-id: 107491850

Test Plan: Sandcastle/CI

Differential Revision: D22401225

fbshipit-source-id: 79866d3f4f1a617cdcbca70e3bea1ce9dcac3316
2020-07-10 10:47:38 -07:00
db38487ece Autograd Doc for Complex Numbers (#41012)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41012

Test Plan: Imported from OSS

Differential Revision: D22476911

Pulled By: anjali411

fbshipit-source-id: 7da20cb4312a0465272bebe053520d9911475828
2020-07-10 09:57:43 -07:00
e568b3fa2d test nan and inf in TestTorchMathOps (#41225)
Summary:
Per title. `lgamma` produces a different result for `-inf` compared to scipy, so that comparison is skipped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41225

Differential Revision: D22473346

Pulled By: ngimel

fbshipit-source-id: e4ebda1b10e2a061bd4cef38d1d7b5bf0f581790
2020-07-10 09:46:46 -07:00
62e16934cb [caffe2] Add the dedup implementation of fused RowWiseAdagrad op on GPUs (#40282)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40282

Test Plan:
```
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_sum_gradient \(caffe2\.caffe2\.fb\.net_transforms\.tests\.fuse_sparse_ops_test\.TestFuseSparseOps\)' --print-passing-details
```

https://our.intern.facebook.com/intern/testinfra/testrun/4785074632584150

Reviewed By: jspark1105

Differential Revision: D22102737

fbshipit-source-id: fa3fef7cecb1e2cf5c9b6019579dc0f86fd3a3b2
2020-07-10 09:05:24 -07:00
08227072e2 Benchmark RecordFunction overhead on some models (#40952)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40952

Adding a benchmark to measure RecordFunction overhead,
currently on resnet50 and lstm models

Test Plan:
python benchmarks/record_function_benchmark/record_function_bench.py
Benchmarking RecordFunction overhead for lstm_jit
Running warmup... finished
Running 100 iterations with RecordFunction... finished
N = 100, avg. time: 251.970 ms, stddev: 39.348 ms
Running 100 iterations without RecordFunction... finished
N = 100, avg. time: 232.828 ms, stddev: 24.556 ms

Reviewed By: dzhulgakov

Differential Revision: D22368357

Pulled By: ilia-cher

fbshipit-source-id: bff4f4e0e06fb80fdfcf85966c2468e48ed7bc98
2020-07-10 08:46:19 -07:00
8a79eec98a Add add_relu fusion pass to optimize_for_mobile. (#40252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40252

As title says.

Test Plan:
python test/test_mobile_optimizer.py

Imported from OSS

Differential Revision: D22126825

fbshipit-source-id: a1880587ba8db9dee0fa450bc463734e4a8693d9
2020-07-10 08:10:22 -07:00
75a4862f63 Added SiLU activation function (#41034)
Summary:
Implemented the SiLU activation function as discussed in https://github.com/pytorch/pytorch/issues/3169.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41034

Reviewed By: glaringlee

Differential Revision: D22465203

Pulled By: heitorschueroff

fbshipit-source-id: b27d064529fc99600c586ad49b594b52b718b0d2
2020-07-10 07:37:30 -07:00
f6eb92a354 Expose private APIs to enable/disable pickling ScriptModules without RPC (#39631)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39631

Background:
Currently, we cannot send ScriptModule over RPC as an argument.
Otherwise, it would hit the following error:

> _pickle.PickleError: ScriptModules cannot be deepcopied using
> copy.deepcopy or saved using torch.save. Mixed serialization of
> script and non-script modules is not supported. For purely
> script modules use my_script_module.save(<filename>) instead.

Failed attempt:
Tried to install `torch.jit.ScriptModule` into RPC's
dispatch table, but it does not work, as the dispatch table only
matches exact types and using the base type `torch.jit.ScriptModule`
does not work for derived types.

Current solution:
The current solution exposes `_enable_jit_rref_pickle` and
`_disable_jit_rref_pickle` APIs to toggle the `allowJitRRefPickle`
flag. See `test_pickle_script_module_with_rref` as an example.

Test Plan: Imported from OSS

Differential Revision: D21920870

Pulled By: mrshenli

fbshipit-source-id: 4d58afce5d0b4b81249b383c173488820b1a47d6
2020-07-10 07:27:51 -07:00
df252c059c [ROCm] Skip caffe2 unique op test for rocm3.5 (#41219)
Summary:
The unique op test failure in caffe2 blocks upgrading the CI to ROCm 3.5.1. Skipping the test to unblock; will re-enable after root-causing and fixing the issue.
jeffdaily sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41219

Differential Revision: D22471452

Pulled By: xw285cornell

fbshipit-source-id: 9e503c8b37c0a4b92632f77b2f8a90281a9889c3
2020-07-09 20:00:29 -07:00
a79b416847 make Int8 FC bias quantization use round flush to infinity
Summary:
The current quantization rounding function uses fbgemm, which
defaults to round-to-nearest, while the hardware implementation uses round
flush to infinity. This adds an option to switch the rounding mode.
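
An illustration of the two behaviors on a tie value (interpreting "round flush to infinity" as rounding ties toward +infinity):

```python
import math

x = 2.5
nearest = round(x)                       # 2 -- Python rounds ties to even
flush_to_infinity = math.floor(x + 0.5)  # 3 -- ties go toward +infinity
```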

Test Plan: ran against test_fc_int8

Reviewed By: venkatacrc

Differential Revision: D22452306

fbshipit-source-id: d2a1fbfc695612fe07caaf84f52669643507cc9c
2020-07-09 17:25:41 -07:00
7c2c752e6d Revert D22458928: [pytorch][PR] Use explicit templates in CUDALoops kernels
Test Plan: revert-hammer

Differential Revision:
D22458928 (e374280768)

Original commit changeset: cca623bb6e76

fbshipit-source-id: 6dd24f783ec3b781140f314716ffb02f0892c57a
2020-07-09 16:31:50 -07:00
c5dcf056ee JIT pass for add relu fusion. (#39343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39343

Building on top of the previous PR that adds the fused add_relu op, this PR adds
a JIT pass that transforms the input graph, finding all fusable instances of add
+ relu and fusing them. The pattern it targets is sketched below.
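
```python
import torch

a, b = torch.randn(16), torch.randn(16)
y = torch.relu(a + b)  # unfused: two kernels plus an intermediate tensor
# After the pass, scripted graphs compute the same result through the
# single fused add_relu op introduced in the previous PR.
```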

Test Plan:
python test/test_jit.py TestJit.test_add_relu_fusion

Imported from OSS

Differential Revision: D21822396

fbshipit-source-id: 12c7e8db54c6d70a2402b32cc06c7e305ffbb1be
2020-07-09 16:25:13 -07:00
82c9f79e0e Add fused add_relu op. (#39342)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39342

Many networks such as resnet have adds followed by relu. This op is the
first step in enabling this fused implementation.
Once we have the fused add_relu op, a JIT pass will be written to
replace add + relu patterns with add_relu.

Test Plan:
python test/test_nn.py TestAddRelu

Imported from OSS

Differential Revision: D21822397

fbshipit-source-id: 03df83a3e46ddb48a90c5a6f755227a7e361a0e8
2020-07-09 16:25:11 -07:00
d6feb6141f [Vec256][neon] Add neon backend for vec256 (#39341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39341

This PR introduces a neon backend for the vec256 class for the float datatype.
For now only aarch64 is enabled, due to a few issues with enabling it on
32-bit aarch32.

Test Plan:
vec256_test

Imported from OSS

Differential Revision: D21822399

fbshipit-source-id: 3851c4336d93d1c359c85b38cf19904f82bc7b8d
2020-07-09 16:25:09 -07:00
bddba1e336 Add benchmark for add op. (#40059)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40059

This benchmark is added specifically for mobile, to see whether the compiler is
autovectorizing, in which case a neon backend for vec256 would offer no
advantage for the add op.

Test Plan:
CI

Imported from OSS

Differential Revision: D22055146

fbshipit-source-id: 43ba6c4ae57c6f05d84887c2750ce21ae1b0f0b5
2020-07-09 16:22:55 -07:00
dde3d5f4a8 [RPC docs] Remove mention of TensorPipe's SHM and CMA backends as they're not built (#41200)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41200

In short, we messed up. The SHM and CMA backends of TensorPipe are Linux-specific and thus they are guarded by an #ifdef in the agent's code. Due to a mishap with CMake (due to the fact that TensorPipe has two CMake files, one for PyTorch and a "standalone" one) we were not correctly propagating some flags and these #ifdefs were always false. This means that these two backends have always been disabled and have thus never been covered by our OSS CI. It would be irresponsible to enable them now in v1.6, so instead we remove any mention of them from the docs.

Note that this is perhaps not as bad as it sounds. These two backends provided better performance (lower latency) when the two endpoints were on the same machine. However, I suspect that most RPC users will only do transfers across machines, for which SHM and CMA wouldn't have played any role.
ghstack-source-id: 107458630

Test Plan: Docs only

Differential Revision: D22462158

fbshipit-source-id: 0d72fea11bcaab6d662184bbe7270529772a5e9b
2020-07-09 15:33:07 -07:00
a88099ba3e restore old documentation references (#39086)
Summary:
Fixes gh-39007

We replaced actual content with links to generated content in many places to break the documentation into manageable chunks. This caused references like
```
https://pytorch.org/docs/stable/torch.html#torch.flip
```
to become
```
https://pytorch.org/docs/master/generated/torch.flip.html#torch.flip
```
The textual content that was located at the old reference was replaced with a link to the new reference. This PR adds a `<p id="xxx"></p>` reference next to the link, so that the older references from outside tutorials and forums still work: they will bring the user to the link that they can then follow through to see the actual content.

The way this is done is to monkeypatch the sphinx writer method that produces the link. It is ugly but practical, and in my mind not worse than adding javascript to do the same thing.
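A schematic of the monkeypatching idea (hypothetical sketch; the exact translator method and node attributes Sphinx uses differ in detail):

```python
# Hypothetical sketch: wrap the HTML translator's reference visitor so each
# generated link also emits an empty anchor carrying the old-style id.
from sphinx.writers.html import HTMLTranslator

_orig_visit_reference = HTMLTranslator.visit_reference

def visit_reference(self, node):
    refid = node.get('refid')
    if refid:
        # Old URLs like torch.html#torch.flip land on this anchor.
        self.body.append('<p id="%s"></p>' % refid)
    _orig_visit_reference(self, node)

HTMLTranslator.visit_reference = visit_reference
```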

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39086

Differential Revision: D22462421

Pulled By: jlin27

fbshipit-source-id: b8f913b38c56ebb857c5a07bded6509890900647
2020-07-09 15:20:10 -07:00
b952eaf668 Preserve CUDA gencode flags (#41173)
Summary:
Add `torch._C._cuda_getArchFlags()`, which returns the list of architectures `torch_cuda` was compiled with
Add `torch.cuda.get_arch_list()` and `torch.cuda.get_gencode_flags()` methods that return the architecture list and gencode flags PyTorch was compiled with
Print a warning if any of the GPUs is not compatible with any of the CUBINs
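Usage is straightforward; the printed values below are illustrative:

```python
import torch

if torch.cuda.is_available():
    # Architectures this binary was built for, e.g. ['sm_37', 'sm_70', ...]
    print(torch.cuda.get_arch_list())
    # The corresponding -gencode flags used at build time.
    print(torch.cuda.get_gencode_flags())
```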

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41173

Differential Revision: D22459998

Pulled By: malfet

fbshipit-source-id: 65d40ae29e54a0ba0f3f2da11b821fdb4d452d95
2020-07-09 14:59:35 -07:00
e374280768 Use explicit templates in CUDALoops kernels (#41059)
Summary:
Follow up after https://github.com/pytorch/pytorch/pull/40992
Use explicit templates instead of lambdas to reduce binary size by 100-200Kb per arch per CU without affecting perf, namely:
BinaryMulDivKernel.cu 3.8Mb -> 3.5Mb
CompareEQKernel.cu 1.8Mb -> 1.7Mb
BinaryAddSubKernel.cu 2.0Mb -> 1.8Mb
BinaryBitwiseOpsKernels.cu 2.6Mb -> 2.3Mb

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41059

Differential Revision: D22458928

Pulled By: malfet

fbshipit-source-id: cca623bb6e769cfe372977b08463d98b1a02dd14
2020-07-09 14:55:38 -07:00
1f1351488e Revert D21870844: Create lazy_dyndeps to avoid caffe2 import costs.
Test Plan: revert-hammer

Differential Revision:
D21870844 (07fd5f8ff9)

Original commit changeset: 3f65fedb65bb

fbshipit-source-id: 4f661072d72486a9c14711e368247b3d30e28af9
2020-07-09 14:18:38 -07:00
22f940b7bd add clang code coverage compile flags (#41103)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41103

add a CLANG_CODE_COVERAGE option to CMakeList. If the option is ON, add code coverage needed compile flags.

Test Plan:
Cloned the pytorch source code locally, applied these changes, and built with `CLANG_CODE_COVERAGE ON` and `BUILD_TESTS ON`. Ran a manual test and attached the code coverage report.

{F243609020}

Reviewed By: malfet

Differential Revision: D22422513

fbshipit-source-id: 27a31395c31b5b5f4b72523954722771d8f61080
2020-07-09 14:14:18 -07:00
2cf31fb577 Fix max_pool2d perf regression (#41174)
Summary:
The two pointer variables `ptr_top_diff` and `ptr_top_mask` were introduced in https://github.com/pytorch/pytorch/issues/38953. Some end-to-end testing showed a training performance regression due to this change. The performance is restored after removing the two pointer variables and adding the offsets directly in the indexing [ ] calculations below.

See PR change https://github.com/pytorch/pytorch/pull/38953/files#diff-8085d370f4e98295074a51b8a1f829e9R187-R188

e4a3c584d5/aten/src/ATen/native/cuda/DilatedMaxPool2d.cu (L186-L195)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41174

Differential Revision: D22451565

Pulled By: ngimel

fbshipit-source-id: 37ed6b9fd785e1be31a027ef5d60794656cc575a
2020-07-09 14:00:05 -07:00
1922f2212a Make IterableDataset dataloader.__len__ warning clearer (#41175)
Summary:
Based on discussion with jlucier (https://github.com/pytorch/pytorch/pull/38925#issuecomment-655859195). The `batch_size` change isn't made because the data loader only has the notion of a `batch_sampler`, not a batch size. If `batch_size`-dependent sharding is needed, users can still access it from their own code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41175

Differential Revision: D22456525

Pulled By: zou3519

fbshipit-source-id: 5281fcf14807f219de06e32107d5fe7d5b6a8623
2020-07-09 13:49:29 -07:00
e84ef45dd3 [JIT] Fix JIT triage workflow (#41170)
Summary:
**Summary**
This commit fixes the JIT triage workflow based on testing done in my
own fork.

**Test Plan**
This commit has been tested against my own fork. This commit is
currently at the tip of my master branch, and if you open an issue in my
fork and label it JIT, it will be added to the Triage Review project in
that fork under the Needs triage column.

*Old issue that is labelled JIT later*

<img width="700" alt="Captura de Pantalla 2020-07-08 a la(s) 6 59 42 p  m" src="https://user-images.githubusercontent.com/4392003/86988551-5b805100-c14d-11ea-9de3-072916211f24.png">

*New issue that is opened with the JIT label*
<img width="725" alt="Captura de Pantalla 2020-07-08 a la(s) 6 59 17 p  m" src="https://user-images.githubusercontent.com/4392003/86988560-60dd9b80-c14d-11ea-94f0-fac01a0d239b.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41170

Differential Revision: D22460584

Pulled By: SplitInfinity

fbshipit-source-id: 278483cebbaf3b35e5bdde2a541513835b644464
2020-07-09 12:40:01 -07:00
c1fa74b2d7 [quant][refactor] test_only_eval_fn (#41078)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41078

Test Plan: Imported from OSS

Differential Revision: D22420699

fbshipit-source-id: cf105cd41d83036df65c6bb3147cc14aaf755897
2020-07-09 12:34:05 -07:00
7c29a4e66f Don't add NCCL dependency to gloo if system NCCL is used (#41180)
Summary:
This avoids a cmake warning (currently only a warning):
```
The dependency target "nccl_external" of target "gloo_cuda" does not exist.
Call Stack (most recent call first):
  CMakeLists.txt:411 (include)
```

This will become a real problem once policy CMP0046 is set, which will turn this warning into an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41180

Differential Revision: D22460623

Pulled By: malfet

fbshipit-source-id: 0222b12b435e5e2fdf2bc85752f95abba1e3d4d5
2020-07-09 12:10:29 -07:00
2252188e85 [caffe2] Fix spatial_batch_norm_op division-by-zero crash (#40806)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40806

When the input is empty, the operator will crash on "runtime error: division by zero". This has been causing Inference platform server crashes.

Example crash logs:

{P134526683}

Test Plan:
Unit test

See reproducing steps in the Test Plan of D22300135

Reviewed By: houseroad

Differential Revision: D22302089

fbshipit-source-id: aaa5391fddc86483b0f3aba3efa7518e54913635
2020-07-09 12:04:11 -07:00
df1f8a48d8 add null check for c2 tensor conversion (#41096)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41096

The spark spot model had some issues in tensor conversion, see P134598596. It happens when we convert an undefined c10 tensor to a caffe2 tensor.
This diff adds a null check.

Test Plan: spark spot model runs without problem

Reviewed By: smessmer

Differential Revision: D22330705

fbshipit-source-id: dfe0f29a48019b6611cad3fd8f2ae49e8db5427e
2020-07-09 11:44:23 -07:00
a318234eb0 Print raising warnings in Python rather than C++ if other error occurs (#41116)
Summary:
When we return to Python from C++ in PyTorch and have both warnings and an error, we have the problem of what to do when the warnings throw, because we can only throw one error.
Previously, if we had an error, we punted all warnings to the C++ warning handler, which would write them to stderr (i.e. system fd 2) or pass them on to glog.

This has drawbacks if an error happened:
- Warnings are not handled through Python even if they don't raise,
- warnings are always printed with no way to suppress this,
- the printing bypasses sys.stderr, so Python modules wanting to
  modify this don't work (with the prominent example being Jupyter).

This patch does the following instead:
- Set the warning using standard Python extension mechanisms,
- if Python decides that this warning is an error and we have a
  PyTorch error, we print the warning through Python and clear
  the error state (from the warning).

This resolves the three drawbacks discussed above, in particular it fixes https://github.com/pytorch/pytorch/issues/37240 .
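A minimal sketch of the user-visible effect (the failing op here is a hypothetical placeholder; any C++ call that both warns and raises would do):

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    try:
        some_op_that_warns_then_raises()  # hypothetical placeholder
    except Exception:
        pass

# With this patch the warning arrives through Python's machinery (so filters
# and sys.stderr redirection apply) instead of being dumped to fd 2 by C++.
for w in caught:
    print(w.category.__name__, w.message)
```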

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41116

Differential Revision: D22456393

Pulled By: albanD

fbshipit-source-id: c3376735723b092efe67319321a8a993402985c7
2020-07-09 11:38:07 -07:00
07fd5f8ff9 Create lazy_dyndeps to avoid caffe2 import costs. (#39488)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39488

Currently caffe2.InitOpLibrary does the dll import unilaterally. Instead, if
we make a lazy version and use it, then many pieces of code which do not
need the caffe2 operators get a lot faster.

On a real test, the import time went from 140s to 68s.

This also cleans up the algorithm slightly (although it makes a very minimal
difference), by parsing the list of operators once, rather than every time a
new operator is added, since we defer the RefreshCall until after we've
imported all the operators.

The key way we maintain safety is that as soon as someone does an operation
which requires an operator (or could), we force importing of all available
operators.

Future work could include trying to identify which code is needed for which
operator and only import the needed ones. There may also be wins available by
playing with dlmopen (which opens within a namespace), or seeing if the dl
flags have an impact (I tried this and didn't see an impact, but dlmopen may
make it better).
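A generic sketch of the lazy-registration idea in Python (names are illustrative, not the caffe2 API):

```python
# Hypothetical sketch: record library paths cheaply up front, and defer the
# expensive import until the first time an operator is actually needed.
_pending_libs = []
_loaded = False

def _expensive_import(path):
    print(f"loading {path}")  # stands in for the real dlopen/registration

def lazy_init_op_library(path):
    _pending_libs.append(path)  # cheap: no import happens yet

def _ensure_loaded():
    global _loaded
    if not _loaded:
        for path in _pending_libs:  # process the whole list once
            _expensive_import(path)
        _loaded = True

def run_operator(name, *args):
    _ensure_loaded()  # any (potential) operator use forces the full import
    print(f"running {name}")

lazy_init_op_library("libfoo_ops.so")  # fast
run_operator("Foo")                    # first use pays the import cost
```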

Test Plan:
I added a new test, lazy_dyndep_test.py (copied from all_compare_test.py).
I'm a little concerned that I don't see any explicit tests for dyndep, but this
should provide decent coverage.

Differential Revision: D21870844

fbshipit-source-id: 3f65fedb65bb48663670349cee5e1d3e22d560ed
2020-07-09 11:34:57 -07:00
f69d6a7ea3 [ONNX] Update Default Value of recompute_scale_factor in Interpolate (#39453)
Summary:
This is a duplicate of https://github.com/pytorch/pytorch/pull/38362

"This PR completes Interpolate's deprecation process for recomputing the scales values, by updating the default value of the parameter recompute_scale_factor as planned for pytorch 1.6.0.
The warning message is also updated accordingly."
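In user code, passing the flag explicitly makes the behavior independent of the default; a sketch:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)
# Passing recompute_scale_factor=False explicitly matches the new 1.6.0
# default: the output size is computed directly from the scale_factor.
y = F.interpolate(x, scale_factor=2.0, mode="bilinear",
                  align_corners=False, recompute_scale_factor=False)
print(y.shape)  # torch.Size([1, 3, 64, 64])
```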

I'm recreating this PR as the previous one is not being updated.

cc gchanan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39453

Reviewed By: hl475

Differential Revision: D21955284

Pulled By: houseroad

fbshipit-source-id: 911585d39273a9f8de30d47e88f57562216968d8
2020-07-09 11:32:49 -07:00
9b3a212d30 quantizer.cpp: fix cuda memory pinning (#41139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41139

Fixes the test case in https://github.com/pytorch/pytorch/issues/41115
by using PyTorch's CUDA allocator instead of the old Caffe2 one.

Test Plan:
run the test case from the issue:
https://gist.github.com/vkuzo/6d013aa1645cb986d0d4464a931c779b

let's run CI and see what it uncovers

Imported from OSS

Reviewed By: malfet

Differential Revision: D22438787

fbshipit-source-id: 0853b0115d198a99c43e6176aef34ea951bf5c2e
2020-07-09 11:14:58 -07:00
62cee0001e Move async + serialization implementation out of 'jit/__init__.py' (#41018)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41018

See https://github.com/pytorch/pytorch/pull/40807 for context.

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D22393869

Pulled By: suo

fbshipit-source-id: a71cc571a423ccb81cd148444dc2a18d2ee43464
2020-07-09 10:10:01 -07:00
c8deca8ea8 Update pthreadpool to pthreadpool:029c88620802e1361ccf41d1970bd5b07fd6b7bb. (#40524)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40524

Reviewed By: ezyang

Differential Revision: D22215742

Pulled By: AshkanAliabadi

fbshipit-source-id: ef594e0901337a92b21ddd44e554da66c723eb7c
2020-07-09 10:00:36 -07:00
c038f8afcc Do not install nvidia docker for non-NVIDIA configs (#41144)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41144

Differential Revision: D22457124

Pulled By: malfet

fbshipit-source-id: e615199cb78b315aa700efcc7332ebf4299212bf
2020-07-09 09:24:26 -07:00
690946c49d Generalize constant_table from tensor only to ivalue (#40718)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40718

Currently every constant except tensors must be inlined during serialization;
tensors are stored in the constant table. This patch generalizes this capability
to any IValue. This is particularly useful for non-ASCII string literals, which
cannot be inlined.

Test Plan: Imported from OSS

Differential Revision: D22298169

Pulled By: bzinodev

fbshipit-source-id: 88cc59af9cc45e426ca8002175593b9e431f4bac
2020-07-09 09:09:40 -07:00
86f72953dd [Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D22452776

fbshipit-source-id: a103da6a5b1db7f1c91ca25490358da268fdfe96
2020-07-09 08:49:32 -07:00
3e26709a4e Remove copy_ warnings for angle and abs for complex tensors (#41152)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41152

fixes https://github.com/pytorch/pytorch/issues/40838

Test Plan: Imported from OSS

Differential Revision: D22444357

Pulled By: anjali411

fbshipit-source-id: 2879d0cffc0a011c624eb8e00c7b64bd33522cc3
2020-07-09 08:05:36 -07:00
7ff7c9738c Revert D22418756: [pytorch][PR] Migrate addmm, addbmm and THBlas_gemm to ATen
Test Plan: revert-hammer

Differential Revision:
D22418756 (6725c034b6)

Original commit changeset: 44e7bb596426

fbshipit-source-id: cbaaf3ad277648901700ef0e47715580e8f8e0dc
2020-07-09 07:47:19 -07:00
bf9cc5c776 Add callback with TLS state API in futures (#40326)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40326

Adds a helper function `addCallbackWithTLSState` to torch/csrc/utils/future.h
(which is used internally by the RPC framework) and to the JIT future. Uses this
helper function to avoid having to pass in TLS state where it is needed, for rpc and `record_function_ops.cpp`. For example, the following:

```
at::ThreadLocalState tls_state;
fut->addCallback([tls_state = std::move(tls_state)]() {
at::ThreadLocalStateGuard g(tls_state);
some_cb_that_requires_tls_state();
}
```

becomes

```
fut->addCallbackWithTLSState(some_cb_that_requires_tls_state);
```
ghstack-source-id: 107383961

Test Plan: RPC Tests and added a test in test_misc.cpp

Differential Revision: D22147634

fbshipit-source-id: 46c02337b90ee58ca5a0861e932413c40d06ed4c
2020-07-08 23:25:35 -07:00
155fb22e77 Run single-threaded gradgradcheck in testnn (#41147)
Summary:
Reland https://github.com/pytorch/pytorch/issues/40999

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41147

Reviewed By: mruberry

Differential Revision: D22450357

Pulled By: ngimel

fbshipit-source-id: 02b6e020af5e6ef52542266bd9752b9cfbec4159
2020-07-08 22:53:27 -07:00
8e2841781e [easy] Use torch.typename in JIT error messages (#41024)
Summary:
Noticed while trying to script one of the models which happened to have numpy values as constants. Lacking the numpy prefix in the error message was quite confusing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41024

Differential Revision: D22426399

Pulled By: dzhulgakov

fbshipit-source-id: 06158b75355fac6871e4861f82fc637c2420e370
2020-07-08 21:49:37 -07:00
33e26656fa list workaround for CREATE_OBJECT failure (#41129)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41129

Test Plan: Imported from OSS

Differential Revision: D22436064

Pulled By: ann-ss

fbshipit-source-id: 7cfc38eb953410edfe3d21346c6e377c3b3bfc1f
2020-07-08 18:36:04 -07:00
302cf6835e [ROCm][Caffe2] Enable MIOpen 3D Pooling (#38260)
Summary:
This PR contains the following updates:
1. MIOpen 3D pooling enabled in Caffe2.
2. Refactored the MIOpen pooling code in caffe2.
3. Enabled unit test cases for 3D pooling.

CC: ezyang jeffdaily ashishfarmer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/38260

Differential Revision: D21524754

Pulled By: xw285cornell

fbshipit-source-id: ddfe09dc585cd61e42eee22eff8348d326fd0c3b
2020-07-08 17:42:55 -07:00
f71cccc457 test: Add option to continue testing through error (#41136)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41136

Running this within CI seems impossible since this script exits out
after one failed test, so let's just add an option that CI can use to
power through these errors.

Should not affect current functionality.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22441694

Pulled By: seemethere

fbshipit-source-id: 7f152fea15af9d47a964062ad43830818de5a109
2020-07-08 17:26:13 -07:00
04004bf10c Fix a minor typo "forget add" -> "forget to add" (#41131)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41131

Differential Revision: D22441122

Pulled By: gmagogsfm

fbshipit-source-id: 383ef167b7742e2f211d1cae010b6ebb37c6e7a0
2020-07-08 17:00:42 -07:00
c7768e21b1 [JIT] Add GitHub workflow for importing issues to triage project (#41056)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41056

**Summary**
This commit adds a new GitHub workflow that automatically adds a card to
the "Need triage" section of the project board for tracking JIT triage
for each new issue that is opened and labelled "jit".

**Test Plan**
???

Test Plan: Imported from OSS

Differential Revision: D22444262

Pulled By: SplitInfinity

fbshipit-source-id: 4e7d384822bffb978468c303322f3e2c04062644
2020-07-08 17:00:40 -07:00
6725c034b6 Migrate addmm, addbmm and THBlas_gemm to ATen (#40927)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24679, closes https://github.com/pytorch/pytorch/issues/24678

`addbmm` depends on `addmm`, so it needed to be ported at the same time. I also removed `THTensor_(baddbmm)`, which I noticed had already been ported and so was just dead code.

After having already written this code, I had to fix merge conflicts with https://github.com/pytorch/pytorch/issues/40354, which revealed there was already an established place for cpu blas routines in ATen. However, the version there doesn't make use of ATen's AVX dispatching, so I thought I'd wait for comment before migrating this into that style.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40927

Differential Revision: D22418756

Pulled By: ezyang

fbshipit-source-id: 44e7bb5964263d73ae8cc6adc5f6d4e966476ae6
2020-07-08 17:00:37 -07:00
3f32332ee6 [JIT][Easy]move remove mutation to own file (#41137)
Summary:
This should be in its own file...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41137

Reviewed By: jamesr66a

Differential Revision: D22437922

Pulled By: eellison

fbshipit-source-id: 1b62dde1a4ebac673b5c60aea4f398f734d62501
2020-07-08 17:00:35 -07:00
b8d2ccf009 Unify TensorOptions signatures (#39611)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39611

A few ops have been taking non-optional ScalarType, Device and Layout. That isn't supported by the hacky wrapper that makes those
kernels work with the c10 operator library. This PR unifies the signatures and makes those ops c10-full.
ghstack-source-id: 107330186

Test Plan: waitforsandcastle

Differential Revision: D21915788

fbshipit-source-id: 39f0e114f2766a3b27b80f93f2c1a95fa23c78d4
2020-07-08 17:00:33 -07:00
10caf58a52 [typing] tensor._version is int (#41125)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41125

Differential Revision: D22440717

Pulled By: ezyang

fbshipit-source-id: f4849c6e13f01cf247b2f64f68a621b055c8bc17
2020-07-08 17:00:30 -07:00
97052c5fa8 Extend SparseAdagrad fusion with stochastic rounding FP16 (#41107)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41107

Extend row wise sparse Adagrad fusion op to FP16 (stochastic rounding) for PyTorch.

Differential Revision: D22195408

fbshipit-source-id: e9903ca7ca3b542fb56f36580e69bb2a39b554f6
2020-07-08 16:58:53 -07:00
af2680e9ce Update ShipIt sync
fbshipit-source-id: ceb761e28fe8c53bc53f3b82b304ea8ab0e98183
2020-07-08 16:52:13 -07:00
0edbe6b063 Add a link in RPC doc page to point to PT Distributed overview (#41108)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41108

Test Plan: Imported from OSS

Differential Revision: D22440751

Pulled By: mrshenli

fbshipit-source-id: 9e7b002091a3161ae385fdfcc26484ae8fc243bb
2020-07-08 14:00:05 -07:00
9d1138afec Remove unnecessary atomic ops in DispatchStub (#40930)
Summary:
I noticed this very unusual use of atomics in `at::native::DispatchStub`. The comment asserts that `choose_cpu_impl()` will always return the same value on every thread, yet for some reason it uses a CAS loop to exchange the value instead of a simple store? That makes no sense considering it doesn't even read the exchanged value.

This replaces the CAS loop with a simple store and also improves the non-initializing case to a single atomic load instead of two.

For reference, the `compare_exchange` was added in https://github.com/pytorch/pytorch/issues/32148 and the while loop added in https://github.com/pytorch/pytorch/issues/35794.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40930

Differential Revision: D22438224

Pulled By: ezyang

fbshipit-source-id: d56028ce18c8c5dbabdf366379a0b6aaa41aa391
2020-07-08 13:55:11 -07:00
ec58d739c6 .circleci: Remove pynightly jobs
These jobs no longer fulfilled their original purpose, since the Travis
Python versions were basically locked to 3.7.

Going to go ahead and remove these along with their docker jobs as well,
since we don't actively need them anymore.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

ghstack-source-id: cdfc4fc2ae15a0c86d322cc706d383d6bc189fbc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41134
2020-07-08 13:46:42 -07:00
dfd21ec00d Revert D22418716: [JIT] Add support for backend-lowered submodules
Test Plan: revert-hammer

Differential Revision:
D22418716 (6777ea19fe)

Original commit changeset: d2b2c6d5d2cf

fbshipit-source-id: 5ce177e13cab0be60020f8979f9b6c520cc8654e
2020-07-08 13:14:21 -07:00
2bc9ee97d1 Revert D22418731: [JIT] Add out-of-source-tree to_backend tests
Test Plan: revert-hammer

Differential Revision:
D22418731 (e2a291b396)

Original commit changeset: 621ba4efc1b1

fbshipit-source-id: 475ae24c5b612fe285035e5ebb92ffc66780a468
2020-07-08 13:11:45 -07:00
131a0ea277 Add version number to bytecode. (#36439)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36439

A proposal of versioning in bytecode, as suggested by dzhulgakov in the internal post: https://fb.workplace.com/groups/pytorch.mobile.work/permalink/590192431851054/

kProducedBytecodeVersion is added. If the model version is not the same as the number in the code, an error will be thrown.

The updated bytecode would look like below. It's a tuple of elements, where the first element is the version number.
```
(3,
 ('__torch__.m.forward',
  (('instructions',
    (('STOREN', 1, 2),
     ('DROPR', 1, 0),
     ('MOVE', 2, 0),
     ('OP', 0, 0),
     ('RET', 0, 0))),
   ('operators', (('aten::Int', 'Tensor'),)),
   ('constants', ()),
   ('types', ()),
   ('register_size', 2))))
```

Test Plan: Imported from OSS

Differential Revision: D22433532

Pulled By: iseeyuan

fbshipit-source-id: 6d62e4abe679cf91a8e18793268ad8c1d94ce746
2020-07-08 12:30:58 -07:00
58d7d91f88 Return atomic (#41028)
Summary:
Per title. This is not currently used in the pytorch codebase, but it is a legitimate use case, and we have extensions that want to do this and are forced to roll their own atomic implementations for non-standard types. Whether an atomic op returns the old value or not should not affect performance; the compiler is able to generate correct code depending on whether the return value is used. https://godbolt.org/z/DBU_UW.
Atomic operations for non-standard integer types (1-, 2-, and 8-byte widths) are left as is, with void return.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41028

Differential Revision: D22425008

Pulled By: ngimel

fbshipit-source-id: ca064edb768a6b290041a599e5b50620bdab7168
2020-07-08 11:54:24 -07:00
351407dd75 Disables unary op casting to output dtype (#41097)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41047.

Some CPU kernel implementations don't call `cast_outputs()`, so when CPU temporaries were created to hold their outputs they weren't copied back to the out parameters correctly. Instead of fixing that issue, for simplicity this PR disables the behavior. The corresponding test in test_type_promotion.py is expanded with more operations to verify that unary ops can no longer have out arguments with different dtypes than their inputs (except in special cases like torch.abs which maps complex inputs to float outputs and torch.deg2rad which is secretly torch.mul).
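Concretely, a mismatched out dtype for a unary op now raises instead of casting through a temporary; a sketch:

```python
import torch

x = torch.randn(3)                         # float32 input
out = torch.empty(3, dtype=torch.float64)  # mismatched out dtype

try:
    torch.neg(x, out=out)  # previously could silently cast; now errors
except RuntimeError as e:
    print("refused:", e)
```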

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41097

Differential Revision: D22422352

Pulled By: mruberry

fbshipit-source-id: 8e61d34ef1c9608790b35cf035302fd226fd9421
2020-07-08 11:48:40 -07:00
c93e96fbd9 [jit] move script-related implementation out of torch/jit/__init__.py (#40902)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40902

See the bottom of this stack for context.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D22360210

Pulled By: suo

fbshipit-source-id: 4275127173a36982ce9ad357aa344435b98e1faf
2020-07-08 11:38:34 -07:00
6c9b869930 [ROCm] Skip Conv2d, Conv3d transpose fp16 test for ROCm3.5 (#41088)
Summary:
There's a regression in MIOpen in ROCm3.5 that results in failure of autocast tests. Skipping the tests for now and will re-enable once the fixes are in MIOpen.

ezyang jeffdaily sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41088

Differential Revision: D22419823

Pulled By: xw285cornell

fbshipit-source-id: 347fb9a03368172fe0b263d14d27ee0c3efbf4f6
2020-07-08 11:13:49 -07:00
dde18041a6 [quant][graphmode] Refactor quantization patterns (#40894)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40894

Test Plan:
python test/test_quantization.py

Imported from OSS

Differential Revision: D22403901

fbshipit-source-id: e0bcf8a628c6a1acfe6fa10a52912360a619bc62
2020-07-08 10:36:25 -07:00
03eec07956 Move error messages in-line in _vmap_internals.py (#41077)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41077

This PR is a refactor that moves error messages into their callsites in
`_vmap_internals.py`. Furthermore, because a little bird told me we've
dropped python 3.5 support, this PR adopts f-string syntax to clean up
the string replace logic. Together these changes make the error messages
read better IMO.

Test Plan:
- `python test/test_vmap.py -v`. There exists tests that invoke each of the
error messages.

Differential Revision: D22420473

Pulled By: zou3519

fbshipit-source-id: cfd46b2141ac96f0a62864928a95f8eaa3052f4e
2020-07-08 08:42:56 -07:00
de4fc23381 clean up duplicated op names (#41092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41092

added overload name for some full JIT operators and removed some duplicated op registrations

Test Plan:
apply D21032976, then buck run fbsource//xplat/caffe2/fb/pytorch_predictor:predictor
make sure there's no runtime error in operator registration

Reviewed By: iseeyuan

Differential Revision: D22419922

fbshipit-source-id: f651898e75b5bdb8dc03fc00b136689536c51707
2020-07-08 06:39:39 -07:00
e4fbcaa2bc [Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D22429730

fbshipit-source-id: 585d8df36d7fa18a9c2d3fa54c1d333bf94464d0
2020-07-08 05:02:26 -07:00
3d3fd13e04 [quant][graphmode][fix] filter for list append change (#41020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41020

Only support quantization of list append for List[Tensor]

Test Plan: Imported from OSS

Differential Revision: D22420698

fbshipit-source-id: 179677892037e136d90d16230a301620c3111063
2020-07-08 03:44:44 -07:00
e0e8b98c43 Export logit op to pytorch
Summary: Export the logit op to PyTorch for better preproc perf

Test Plan:
unit test
Also tested with model re-generation

Reviewed By: houseroad

Differential Revision: D22324611

fbshipit-source-id: 86accb6b4528e5c818d2c3f8c67926f279d158d6
2020-07-08 02:27:09 -07:00
6ef94590fa match int8 quantization of nnpi (#41094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41094

mimic nnpi's quantization operations

removed redundant int8 test

Test Plan: ran FC with sizes up to 5, running bigger sizes

Reviewed By: venkatacrc

Differential Revision: D22420537

fbshipit-source-id: 91211c8a6e4d3d3bec2617b758913b44aa44b1b1
2020-07-08 00:07:42 -07:00
e2a291b396 [JIT] Add out-of-source-tree to_backend tests (#40842)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40842

**Summary**
This commit adds out-of-source-tree tests for `to_backend`. These tests check
that a Module can be lowered to a backend, exported, loaded (in both
Python and C++) and executed.

**Fixes**
This commit fixes #40067.

Test Plan: Imported from OSS

Differential Revision: D22418731

Pulled By: SplitInfinity

fbshipit-source-id: 621ba4efc1b121fa76c9c7ca377792ac7440d250
2020-07-07 21:00:43 -07:00
6777ea19fe [JIT] Add support for backend-lowered submodules (#40841)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40841

**Summary**
This commit adds support for using `Modules` that have been lowered as
submodules in `ScriptModules`.

**Test Plan**
This commit adds execution and save/load tests to test_backends.py for
backend-lowered submodules.

**Fixes**
This commit fixes #40069.

Test Plan: Imported from OSS

Differential Revision: D22418716

Pulled By: SplitInfinity

fbshipit-source-id: d2b2c6d5d2cf3042a620b3bde7d494f1abe28dc1
2020-07-07 21:00:40 -07:00
5a4c45f8d1 [JIT] Move TestBackend to test directory (#40840)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40840

**Summary**
This commit moves the TestBackend used for the JIT backend
extension to the tests directory. It was temporarily placed
in the source directory while figuring out some details of
the user experience for this feature.

**Test Plan**
`python test/test_jit.py TestBackends`

**Fixes**
This commit fixes #40067.

Test Plan: Imported from OSS

Differential Revision: D22418682

Pulled By: SplitInfinity

fbshipit-source-id: 9356af1341ec4d552a41c2a8929b327bc8b56057
2020-07-07 21:00:38 -07:00
3e01931e49 [JIT] Separate to_backend API into libtorch and libtorch_python (#40839)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40839

**Summary**
This commit splits the to_backend API properly into
`libtorch` and `libtorch_python`. The backend interface and all
of the code needed to run a graph on a backend is in
libtorch, and all of the code related to creating a Python binding
for the lowering process is in `libtorch_python`.

**Test Plan**
`python test/test_jit.py TestBackends`

**Fixes**
This commit fixes #40072.

Test Plan: Imported from OSS

Differential Revision: D22418664

Pulled By: SplitInfinity

fbshipit-source-id: b96e0c34ab84e45dff0df68b8409ded57a55ab25
2020-07-07 20:58:42 -07:00
0911c1e71a Added index_put to promotelist (#41035)
Summary:
[index_put](https://pytorch.org/docs/master/tensors.html#torch.Tensor.index_put) requires the src and dst tensors to be the same dtype, so in my opinion it belongs on the promote list when autocast is active (the output should be the widest dtype among the input dtypes).

I also put some other registrations in alphabetical order.
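A sketch of the promoted behavior (assumes a CUDA device, since autocast applies there):

```python
import torch

if torch.cuda.is_available():
    dst = torch.zeros(4, device="cuda")                   # float32
    src = torch.ones(2, device="cuda", dtype=torch.half)  # float16
    idx = (torch.tensor([0, 2], device="cuda"),)
    with torch.cuda.amp.autocast():
        # With index_put on the promote list, the float16 values are
        # widened to float32 to match dst instead of raising.
        dst.index_put_(idx, src)
    print(dst)
```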

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41035

Differential Revision: D22418305

Pulled By: ngimel

fbshipit-source-id: b467cb16ac6c2ba1f9e43531f69a144b17f00b87
2020-07-07 20:36:55 -07:00
c55d8a6f62 Remove std::complex from c10::Scalar (#39831)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39831

Differential Revision: D22018505

Pulled By: ezyang

fbshipit-source-id: 4719c0f1673077598c5866dafc7391d9e074f4eb
2020-07-07 20:31:42 -07:00
3615e344a3 Unit test case for the Int8FC to cover quantization scale errors. (#41100)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41100

Unit test case for the Int8FC to cover quantization scale errors.

Test Plan: test_int8_ops_nnpi.py test case test_int8_small_input.

Reviewed By: hyuen

Differential Revision: D22422353

fbshipit-source-id: b1c1baadc32751cd7e98e0beca8f0c314d9e5f10
2020-07-07 20:04:17 -07:00
bacca663ff Fix Broken Link in CONTRIBUTING.md (#41066)
Summary:
Spotted a broken link, and while I was at it, fixed a few little language and formatting nits.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41066

Reviewed By: mruberry

Differential Revision: D22415371

Pulled By: dongreenberg

fbshipit-source-id: 7d11c13235b28a01886063c11a4c5ccb333c0c02
2020-07-07 20:02:47 -07:00
445128d0f2 Add PyTorch Glossary (#40639)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40639

Differential Revision: D22421207

Pulled By: gmagogsfm

fbshipit-source-id: 7df8bfc85e28bcf1fb08892a3671e7a9cb0dee9c
2020-07-07 19:53:44 -07:00
bce75a2536 add first implementation of swish (#41085)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41085

add the first LUT implementation of swish

Test Plan:
Compared against swish lowered as x*sigmoid(x); had to
increase the error threshold, but it looks generally right.

Reviewed By: venkatacrc

Differential Revision: D22418117

fbshipit-source-id: c75fa496aa7a5356ddc87f1d61650f432e389457
2020-07-07 19:48:34 -07:00
a8bc7545d5 use PYTORCH_ROCM_ARCH to set GLOO_ROCM_ARCH (#40170)
Summary:
Previously it used the default arch set which may or may not coincide with the user's.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40170

Differential Revision: D22400866

Pulled By: xw285cornell

fbshipit-source-id: 222ba684782024fa68f37bf7d4fdab9a2389bdea
2020-07-07 19:41:02 -07:00
054e5d8943 .circleci: Fix job-specs-custom docker tag (#41111)
Summary:
Should resolve master breakages

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41111

Differential Revision: D22426863

Pulled By: seemethere

fbshipit-source-id: 561eaaa0d97a6fe13c75c1a73e4324b92d94afed
2020-07-07 19:32:23 -07:00
cc29c192a6 add "aten::add.str" op and remove two duplicated ops
Summary: add "aten::add.str" op and remove two duplicated ops

Test Plan:
```
buck run //xplat/caffe2/fb/pytorch_predictor:converter /mnt/vol/gfsfblearner-altoona/flow/data/2020-06-29/1ca8a85f-dbd5-4181-b5fc-63d24465c1fc/201084299/2068673333/model.pt1 ~/model_f201084299.bc

buck run xplat/assistant/model_benchmark_tool/mobile/binary/:lite_predictor -- --model ~/model_f201084299.bc --input_file /tmp/gc_model_input.txt --model_input_args src_tokens,dict_feat,contextual_token_embedding --warmup 1 --iter 2
```

Reviewed By: pengtxiafb

Differential Revision: D22395604

fbshipit-source-id: 0ce21e8b8ae989d125f2f3739523e3c486590b9f
2020-07-07 19:07:35 -07:00
a4fd4905c8 bump docker version to more recent tag (#41105)
Summary:
Tag was introduced originally as https://github.com/pytorch/pytorch/pull/40385

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41105

Reviewed By: malfet

Differential Revision: D22423910

Pulled By: seemethere

fbshipit-source-id: 336fc7ef5243a5863c59762efd182ed7ea6dfc2c
2020-07-07 18:28:24 -07:00
eea535742f Add bfloat16 support for nccl path (#38515)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38515

Differential Revision: D22420896

Pulled By: ezyang

fbshipit-source-id: 80d2d0c2052c91c9035e1e025ebb14e210cb0100
2020-07-07 18:07:06 -07:00
38b465db27 ROCm 3.5.1 image (#40385)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40385

Differential Revision: D22421426

Pulled By: ezyang

fbshipit-source-id: 1a131cdb1a0d5ad7ccd55dc1db17cae982cc286b
2020-07-07 15:37:23 -07:00
5e03a1e926 Add support for int[]? arguments in native_functions.yaml (#37174)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37174

ghstack-source-id: 106938112

Test Plan: Upcoming diffs use this for upsampling.

Differential Revision: D21210002

fbshipit-source-id: d6a55ab6420c05a92873a569221b613149aa0daa
2020-07-07 13:52:20 -07:00
4dad829ea3 In interpolate, inline the call to _interp_output_size (#37173)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37173

This function is only called in one place, so inline it.  This eliminates
boilerplate related to overloads and allows for further simplification
of shared logic in later diffs.

All shared local variables have the same names (from closed_over_args),
and no local variables accidentally collide.
ghstack-source-id: 106938108

Test Plan: Existing tests for interpolate.

Differential Revision: D21209995

fbshipit-source-id: acfadf31936296b2aac0833f704764669194b06f
2020-07-07 13:52:18 -07:00
3c1c74c366 In interpolate, move exceptional cases to the bottom (#37172)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37172

This improves readability by keeping cases with similar behavior close
together.  It should also have a very tiny positive impact on perf.
ghstack-source-id: 106938109

Test Plan: Existing tests for interpolate.

Differential Revision: D21209996

fbshipit-source-id: c813e56aa6ba7370b89a2784fcb62cc146005258
2020-07-07 13:52:16 -07:00
8f0e254790 In interpolate, use if instead of elif (#37171)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37171

Every one of these branches returns or raises, so there's no need for elif.
This makes it a little easier to reorder and move conditions.
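The pattern in miniature:

```python
# Since every branch returns (or raises), plain `if` statements behave
# exactly like an if/elif chain here, and are easier to reorder.
def dim_for_mode(mode):
    if mode == "nearest":
        return 1
    if mode == "bilinear":
        return 2
    raise NotImplementedError(mode)
```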
ghstack-source-id: 106938110

Test Plan: Existing test for interpolate.

Differential Revision: D21209992

fbshipit-source-id: 5c517e61ced91464b713f7ccf53349b05e27461c
2020-07-07 13:49:53 -07:00
93778f3b24 Expose certain methods in OpaqueTensorImpl. (#41060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41060

Exposes a const ref to the opaque_handle and makes copy_tensor_metadata a
protected method. This helps in reusing code in subclasses of OpaqueTensorImpl.

Test Plan: waitforbuildbot

Reviewed By: dzhulgakov

Differential Revision: D22406602

fbshipit-source-id: e3b8338099f257da7f6bbff679f1fdb71e5f335a
2020-07-07 13:36:32 -07:00
8d570bc708 Decouple DataParallel/DistributedDataParallel from CUDA (#38454)
Summary:
Decouple DataParallel/DistributedDataParallel from CUDA to support more device types.
- Move torch/cuda/comm.py to torch/nn/parallel/comm.py, with minor changes for common device support. torch.cuda.comm is kept as-is for backward compatibility.
- Provide common APIs to arbitrary device types without changing existing CUDA APIs in torch.cuda space.
- Replace the torch.cuda calls in DataParellel/DistributedDataParallel with the new APIs.

Related RFC: [https://github.com/pytorch/pytorch/issues/36160](https://github.com/pytorch/pytorch/issues/36160)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/38454

Differential Revision: D22051557

Pulled By: mrshenli

fbshipit-source-id: 7842dad0e5d3ca0f6fb760bda49182dcf6653af8
2020-07-07 12:48:16 -07:00
75155df8b4 Doc warnings (#41068)
Summary:
Solves most of gh-38011, as part of solving gh-32703.

These should only be formatting fixes; I did not try to fix grammar and syntax.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41068

Differential Revision: D22411919

Pulled By: zou3519

fbshipit-source-id: 25780316b6da2cfb4028ea8a6f649bb18b746440
2020-07-07 11:43:21 -07:00
ff3ba25b8e .circleci: Output binary sizes, store binaries (#41074)
Summary:
We need an easy way to quickly visually grep binary sizes from builds,
and then a way to test out those binaries quickly.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41074

Differential Revision: D22415667

Pulled By: seemethere

fbshipit-source-id: 86386e5390dce6aae26e952a47f9e2a2221d30b5
2020-07-07 11:36:49 -07:00
0e6b750288 Insert parentheses around kernel name argument to hipLaunchKernelGGL (#41022)
Summary:
This works around an issue in hipclang with templated kernel name arguments to hipLaunchKernelGGL.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41022

Differential Revision: D22404183

Pulled By: ngimel

fbshipit-source-id: 63135ccb9e087f4c8e8663ed383979f7e2c1ba06
2020-07-07 11:31:45 -07:00
630e7ed9cc Splitting embedding_bag to embedding_bag_forward_only and embedding_bag (#40557)
Summary:
Currently embedding_bag's CPU kernel queries whether weight.requires_grad() is true. This violates layering of AutoGrad and Op Kernels, causing issues in third-party backends like XLA. See this [issue](https://github.com/pytorch/xla/issues/2215) for more details.

This PR hoists the query of weight.requires_grad() to the Python layer, and splits embedding_bag into two separate ops, each corresponding to weight.requires_grad() == true and false.
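A schematic of the hoisted dispatch (hypothetical sketch; the real signatures carry many more arguments, and the op names here are simplified):

```python
import torch

def embedding_bag(weight, indices, offsets):
    # The requires_grad query now lives in Python, above the kernels.
    if weight.requires_grad:
        # Variant that saves what the backward pass needs.
        return torch.embedding_bag(weight, indices, offsets)
    # Forward-only variant: no autograd bookkeeping in the kernel.
    return torch._embedding_bag_forward_only(weight, indices, offsets)
```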

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40557

Reviewed By: ailzhang

Differential Revision: D22327476

Pulled By: gmagogsfm

fbshipit-source-id: c815b3690d676a43098e12164517c5debec90fdc
2020-07-07 11:24:29 -07:00
00ee54d2a4 Fix link to PyTorch organization (from Governance) (#40984)
Summary:
PR fixes https://github.com/pytorch/pytorch/issues/40666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40984

Differential Revision: D22404543

Pulled By: ngimel

fbshipit-source-id: 0d39e8f4d701517cce9c31fddaaad46be3d4844b
2020-07-07 11:22:57 -07:00
452d5e191b Grammatically updated the tech docs (#41031)
Summary:
Small grammatical update to the torch tech docs

![image](https://user-images.githubusercontent.com/26879385/86633690-e126c400-bfc8-11ea-8892-23cdc037daa9.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41031

Differential Revision: D22404342

Pulled By: ngimel

fbshipit-source-id: 1c723119cfb050c4ef53de7971fe6e0acf3e91a9
2020-07-07 11:17:17 -07:00
22c7d183f7 If ninja is being used, force build_ext to run. (#40837)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40837

As ninja has accurate dependency tracking, if there is nothing to do,
then we will very quickly noop.  But this is important for correctness:
if a change was made to a header that is not listed explicitly in
the distutils Extension, then distutils will come to the wrong
conclusion about whether or not recompilation is needed (but Ninja
will work it out.)

This caused https://github.com/pytorch/vision/issues/2367

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D22340930

Pulled By: ezyang

fbshipit-source-id: 481b74f6e2cc78159d2a74d413751cf7cf16f592
2020-07-07 09:49:31 -07:00
733b8c23c4 Fix several quantization documentation typos (#40567)
Summary:
This PR fixes several typos I noticed in the docs here: https://pytorch.org/docs/master/quantization.html. In one case there was a misspelled module [torch.nn.instrinsic.qat](https://pytorch.org/docs/master/quantization.html#torch-nn-instrinsic-qat) which I corrected and am including screenshots of below just in case.

<img width="1094" alt="before" src="https://user-images.githubusercontent.com/54918401/85766765-5cdd6280-b6e5-11ea-93e6-4944cf820b71.png">

<img width="1093" alt="after" src="https://user-images.githubusercontent.com/54918401/85766769-5d75f900-b6e5-11ea-8850-0d1f5ed67b16.png">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40567

Differential Revision: D22311291

Pulled By: ezyang

fbshipit-source-id: 65d1f3dd043357e38a584d9e30f31634a5b0995c
2020-07-07 09:45:23 -07:00
2d98f8170e Add option to warn if elements in a Compare table are suspect (#41011)
Summary:
This PR adds a `.highlight_warnings()` method to `Compare`, which will include a `(! XX%)` next to measurements with high variance to highlight that fact. For example:
```
[------------- Record function overhead ------------]
                      |    lstm_jit   |  resnet50_jit
1 threads: ------------------------------------------
      with_rec_fn     |   650         |  8600
      without_rec_fn  |   660         |  8000
2 threads: ------------------------------------------
      with_rec_fn     |   360         |  4200
      without_rec_fn  |   350         |  4000
4 threads: ------------------------------------------
      with_rec_fn     |   250         |  2100
      without_rec_fn  |   260         |  2000
8 threads: ------------------------------------------
      with_rec_fn     |   200 (! 6%)  |  1200
      without_rec_fn  |   210 (! 6%)  |  1100
16 threads: -----------------------------------------
      with_rec_fn     |   220 (! 8%)  |   900 (! 5%)
      without_rec_fn  |   200 (! 5%)  |  1000 (! 7%)
32 threads: -----------------------------------------
      with_rec_fn     |  1000 (! 7%)  |   920
      without_rec_fn  |  1000 (! 6%)  |   900 (! 6%)

Times are in milliseconds (ms).
(! XX%) Measurement has high variance, where XX is the median / IQR * 100.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41011

Differential Revision: D22412905

Pulled By: robieta

fbshipit-source-id: 2c90e719d9a5a1c0267ed113dd1b1b1738fa8269
2020-07-07 09:39:22 -07:00
a04af4dccb Revert D22396896: [pytorch][PR] run single-threaded gradgradcheck in test_nn
Test Plan: revert-hammer

Differential Revision:
D22396896 (dac63a13cb)

Original commit changeset: 3b247caceb65

fbshipit-source-id: 90bbd71ca5128a7f07fe2907c061ee0922d16edf
2020-07-07 07:43:39 -07:00
0e09511af9 type annotations for dataloader, dataset, sampler (#39392)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38913

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39392

Reviewed By: anjali411

Differential Revision: D22102489

Pulled By: zou3519

fbshipit-source-id: acb68d9521145f0b047214d62b5bdc5a0d1b9be4
2020-07-07 07:16:18 -07:00
a6b703cc89 Make torch_cpu compileable when USE_TENSORPIPE is not set. (#40846)
Summary:
Forward-declare `tensorpipe::Message` class in utils.h
Guard TensorPipe specific methods in utils.cpp with `#ifdef USE_TENSORPIPE`
Pass `USE_TENSORPIPE` as private flag to `torch_cpu` library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40846

Differential Revision: D22338864

Pulled By: malfet

fbshipit-source-id: 2ea2aea84527ae7480e353afb55951a068b3b980
2020-07-07 07:02:57 -07:00
12b5bdc601 Remove unused Logger in get_matching_activations (#41023)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41023

Remove Logger in get_matching_activations since it's not used.
ghstack-source-id: 107237046

Test Plan:
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_submodule_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_functional_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_functional_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_dynamic'

Differential Revision: D22394957

fbshipit-source-id: 7d59e0f35e9f4c304b8487460d48236ee6e5a872
2020-07-07 00:33:07 -07:00
4aa543ed2e Fix unordered-map-over-enum for GCC 5.4 (#41063)
Summary:
Forgot to add this to https://github.com/pytorch/pytorch/pull/41055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41063

Differential Revision: D22407451

Pulled By: malfet

fbshipit-source-id: 6f06653b165cc4817d134657f87caf643182832a
2020-07-06 23:26:31 -07:00
50df097599 Fix CUDA jit codegen compilation with gcc-5.4 (#41055)
Summary:
It's a known gcc-5.4 bug that enum class is not hashable by default, so `std::unordered_map` needs an explicit third template parameter to compute the hash for the key type.
Should fix regression caused by https://github.com/pytorch/pytorch/pull/40864

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41055

Differential Revision: D22405478

Pulled By: malfet

fbshipit-source-id: f4bd36bebdc1ad0251ebd1e6cefba866e6605fe6
2020-07-06 21:09:17 -07:00
56396ad024 ONNX: support view_as operator (#40496)
Summary:
This PR adds support for the torch `view_as` operator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40496

Reviewed By: hl475

Differential Revision: D22398318

Pulled By: houseroad

fbshipit-source-id: f92057f9067a201b707aa9b8fc4ad34643dd5fa3
2020-07-06 20:38:46 -07:00
b2cc8a2617 [ONNX]Fix export of full_like (#40063)
Summary:
Fix export of full_like when fill_value is of type torch._C.Value.

This PR fixes a bug when exporting GPT2DoubleHeadsModel https://github.com/huggingface/transformers/issues/4950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40063

Reviewed By: hl475

Differential Revision: D22398353

Pulled By: houseroad

fbshipit-source-id: 6980a61211fe571c2e4a57716970f474851d811e
2020-07-06 20:36:09 -07:00
6e4f501f1a Improve error message for Pad operator (#39651)
Summary:
In issue https://github.com/pytorch/pytorch/issues/36997 the user encountered a non-meaningful error message when trying to export the model to ONNX. The Pad operator in opset 9 requires the list of paddings to be constant. This PR tries to improve the error message given to the user when this is not the case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39651

Reviewed By: hl475

Differential Revision: D21992262

Pulled By: houseroad

fbshipit-source-id: b817111c2a40deba85e4c6cdb874c1713312dba1
2020-07-06 20:26:02 -07:00
6b50874cb7 Fix HTTP links in documentation to HTTPS (#40878)
Summary:
I ran `make linkcheck` using `sphinx.builders.linkcheck` on the documentation and noticed a few links weren't using HTTPS so I quickly updated them all.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40878

Differential Revision: D22404647

Pulled By: ngimel

fbshipit-source-id: 9c9756db59197304023fddc28f252314f6cf4af3
2020-07-06 20:05:21 -07:00
63ef706979 [ATen] Add native_cuda_h list to CMakeLists.txt (#41038)
Summary:
Closes https://github.com/pytorch/pytorch/issues/40784

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41038

Differential Revision: D22404273

Pulled By: malfet

fbshipit-source-id: 8df05f948f069ac95591d523222faa1327429e71
2020-07-06 19:58:36 -07:00
5d1d8a58b8 Enable in_dims for vmap frontend api (#40717)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40717

`in_dims` specifies which dimension of the input tensors should be
vmapped over. One can also specify `None` as an `in_dim` for a particular
input to indicate that we do not map over said input.

We implement `in_dims` by creating a BatchedTensor with BatchDim equal
to said `in_dim`. Most of this PR is error checking. `in_dims` must
satisfy the following:
- `in_dim` can be either an int or a Tuple[Optional[int]]. If it is an
int, we use it to mean the `in_dim` for every input.
- If `in_dims` is non-None at some index `idx`, then the input at index
`idx` MUST be a tensor (vmap can only map over tensors).

jax supports something more generalized: their `in_dims` can match the
structure of the `inputs` to the function (i.e., it is a nested python
data structure matching the data structure of `inputs` specifying where
in `inputs` the Tensors to be mapped are and what their map dims should
be). We don't have the infrastructure yet, so we only support `int` or a
flat tuple for `in_dims`.
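A sketch of these semantics using the prototype frontend available at the time of this commit (the entry point later moved, so treat the import as era-specific):

```python
import torch
from torch import vmap  # prototype API as of this commit

xs = torch.randn(5, 3)  # a batch of 5 vectors
w = torch.randn(3)      # one shared vector

# in_dims=(0, None): map over dim 0 of xs; do not map over w.
out = vmap(torch.mul, in_dims=(0, None))(xs, w)
print(out.shape)  # torch.Size([5, 3])
```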

Test Plan: - `pytest test/test_vmap.py -v`

Differential Revision: D22397914

Pulled By: zou3519

fbshipit-source-id: 56d2e14be8b6024e4cde2729eff384da305b4ea3
2020-07-06 19:14:43 -07:00
dac63a13cb run single-threaded gradgradcheck in test_nn (#40999)
Summary:
The most time-consuming tests in test_nn (taking about half the time) were gradgradchecks on Conv3d. Reduce their sizes and, most importantly, run gradgradcheck single-threaded, because that cuts the time of conv3d tests by an order of magnitude and barely affects other tests.
These changes bring test_nn time down from 1200 s to ~550 s on my machine.
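A sketch of running such a check single-threaded in user code (the test-suite plumbing differs; sizes here are arbitrary):

```python
import torch
from torch.autograd import gradgradcheck

torch.set_num_threads(1)  # disable intra-op parallelism for the check

conv = torch.nn.Conv3d(2, 2, kernel_size=2).double()
x = torch.randn(1, 2, 4, 4, 4, dtype=torch.double, requires_grad=True)

# gradgradcheck needs double precision; a scalar-valued wrapper keeps it simple.
print(gradgradcheck(lambda inp: conv(inp).sum(), (x,)))
```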

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40999

Differential Revision: D22396896

Pulled By: ngimel

fbshipit-source-id: 3b247caceb65d64be54499de1a55de377fdf9506
2020-07-06 17:21:25 -07:00
37a572f33e fix grad thrashing of shape analysis (#40939)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40939

Previously, when we did shape analysis by running the op with representative inputs, we would always set the grad property to false. This led to wrong static analysis when we created differentiable subgraphs, propagated shapes without also propagating requires_grad, and then uninlined them.

Test Plan: Imported from OSS

Differential Revision: D22394676

Pulled By: eellison

fbshipit-source-id: 254e6e9f964b40d160befe0e125abe1b7aa2bd5e
2020-07-06 17:12:13 -07:00
4af8424377 shape analysis fix for default dtype (#40938)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40938

already accepted in https://github.com/pytorch/pytorch/pull/40645

Test Plan: Imported from OSS

Reviewed By: jamesr66a, Krovatkin

Differential Revision: D22394675

Pulled By: eellison

fbshipit-source-id: 1e9dbb24a4cb564d9a68280d2166329ca9fb0425
2020-07-06 17:10:01 -07:00
078669f6c3 Back out "[2/n][Compute Meta] support analysis for null flag features"
Summary:
Original commit changeset: 46c59d849fa8

The original commit is breaking DPER3 release pipeline with the following failures:
https://www.internalfb.com/intern/chronos/jobinstance?jobinstanceid=9007207344413239&smc=chronos_gp_admin_client&offset=0
```
Child workflow f 202599639  failed with error: c10::Error: [enforce fail at operator.cc:76] blob != nullptr. op Save: Encountered a non-existing input blob: feature_preproc/feature_sparse_to_dense/default_float_value
```
https://www.internalfb.com/intern/chronos/jobinstance?jobinstanceid=9007207344855973&smc=chronos_gp_admin_client&offset=0
```
Child workflow f 202629391  failed with error: c10::Error: [enforce fail at operator.cc:76] blob != nullptr. op Save: Encountered a non-existing input blob: tum_preproc/inductive/feature_sparse_to_dense/default_float_value
```

Related UBN tasks: T69529846, T68986110

Test Plan: Build a DPER3 package on top of this commit, and check that DPER3 release test `model_deliverability_test` is passing.

Differential Revision: D22396317

fbshipit-source-id: 92d5b30cc146c005d6159a8d5bfe8973e2c546dd
2020-07-06 16:29:03 -07:00
a78024476b Port equal from THC to ATen (CUDA) (#36483)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24557

ASV benchmark:

```
import torch

sizes = [
    (10**6,),
    (1000, 1000),
    (10, 10),
    (1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
]

class EqualTrue:
    params = range(len(sizes))

    def setup(self, n):
        dims = sizes[n]
        self.a = torch.rand(dims, device='cuda')
        self.b = self.a.clone()

    def time_equal(self, n):
        torch.equal(self.a, self.b)

class EqualFalse:
    params = range(len(sizes))

    def setup(self, n):
        dims = sizes[n]
        self.a = torch.rand(dims, device='cuda')
        self.b = torch.rand(dims, device='cuda')

    def time_equal(self, n):
        torch.equal(self.a, self.b)
```

Old results:
```
[ 75.00%] ··· equal.EqualFalse.time_equal
[ 75.00%] ··· ======== ============
               param1
              -------- ------------
                 0       67.7±7μs
                 1       74.0±2μs
                 2      24.4±0.1μs
                 3      135±0.2μs
              ======== ============

[100.00%] ··· equal.EqualTrue.time_equal
[100.00%] ··· ======== ============
               param1
              -------- ------------
                 0      59.8±0.2μs
                 1      59.9±0.3μs
                 2      25.0±0.5μs
                 3      136±0.2μs
              ======== ============
```

New results:
```
[ 75.00%] ··· equal.EqualFalse.time_equal
[ 75.00%] ··· ======== ============
               param1
              -------- ------------
                 0      44.4±0.2μs
                 1      44.5±0.4μs
                 2      31.3±0.3μs
                 3      96.6±0.5μs
              ======== ============

[100.00%] ··· equal.EqualTrue.time_equal
[100.00%] ··· ======== ============
               param1
              -------- ------------
                 0      44.2±0.2μs
                 1      44.6±0.2μs
                 2      30.8±0.3μs
                 3      97.3±0.2μs
              ======== ============
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/36483

Differential Revision: D21451829

Pulled By: VitalyFedyunin

fbshipit-source-id: 033e8060192c54f139310aeafe8ba784bab94ded
2020-07-06 16:00:16 -07:00
c0f9bf9bea s/torch::jit::class_/torch::class_/ (#40795)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40795

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D22314215

Pulled By: jamesr66a

fbshipit-source-id: a2fb5c6804d4014f8e437c6858a7be8cd3efb380
2020-07-06 15:53:33 -07:00
cbe52d762c Mish Activation Function (#40856)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40856

Add a new activation function - Mish: A Self Regularized Non-Monotonic Neural Activation Function https://arxiv.org/abs/1908.08681
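A minimal sketch of the activation itself (not the Caffe2 operator added here), following the formula from the paper, Mish(x) = x * tanh(softplus(x)):

```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    # softplus(x) = log(1 + exp(x)), computed in a numerically stable way
    return x * torch.tanh(F.softplus(x))

print(mish(torch.linspace(-3.0, 3.0, 5)))
```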

Test Plan:
buck test //caffe2/caffe2/python/operator_test:elementwise_ops_test -- 'test_mish'

Differential Revision: D22158035

fbshipit-source-id: 459c1dd0ac5b515913fc09b5f4cd13dcf095af31
2020-07-06 15:51:23 -07:00
87f9b55aa5 Use explicit templates in gpu_kernel_with_scalars (#40992)
Summary:
This trick should have no effect on performance, but it reduces the size of kernels using the template by 10%.
For example, the size of BinaryMulDivKernel.cu.o compiled by the CUDA 10.1 toolchain for sm_75 was 4.2 MB before the change and 3.8 MB after.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40992

Differential Revision: D22398733

Pulled By: malfet

fbshipit-source-id: 6576f4da00dc5fc2575b2313577f52c6571d5e6f
2020-07-06 15:46:28 -07:00
945ae5bd7b Update the documentation of the scatter_ method with support for reduction methods. (#40962)
Summary:
Follow up to https://github.com/pytorch/pytorch/pull/36447 . Update for https://github.com/pytorch/pytorch/issues/33389.

Also removes unused `unordered_map` include from the CPP file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40962

Differential Revision: D22376253

Pulled By: ngimel

fbshipit-source-id: 4e7432190e9a847321aec6d6f6634056fa69bdb8
2020-07-06 15:27:16 -07:00
35bd2b3c8b DOC: Clarify that CrossEntropyLoss mean is weighted (#40991)
Summary:
Closes https://github.com/pytorch/pytorch/issues/40560

This adds the equation for the weighted mean to `CrossEntropyLoss`'s docs, and the `reduction` argument docs for `CrossEntropyLoss` and `NLLLoss` no longer describe a non-weighted mean of the outputs.
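A small sketch of the weighted mean the updated docs describe (the tensors here are made up for illustration):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)
target = torch.tensor([0, 2, 1, 0])
w = torch.tensor([1.0, 2.0, 0.5])  # per-class weights

# With reduction='none', each element l_n already includes the factor w[y_n];
# reduction='mean' then divides by the sum of the weights, not by N.
per_elem = F.cross_entropy(logits, target, weight=w, reduction="none")
manual = per_elem.sum() / w[target].sum()
auto = F.cross_entropy(logits, target, weight=w, reduction="mean")
assert torch.allclose(manual, auto)
```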

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40991

Differential Revision: D22395805

Pulled By: ezyang

fbshipit-source-id: a623b6dd2aab17220fe0bf706bd9b62d6ba531fd
2020-07-06 15:05:31 -07:00
b9b4f05abf [nvFuser] Working towards reductions, codegen improvements (#40864)
Summary:
Basic reduction fusion is working, and the code generator has been improved to approach the performance of eager-mode reductions. Coming soon are pointwise-reduction fusions, implemented in a way that should prevent the possibility of hitting regressions. We are also working on performant softmax kernels in the code generator, which may be our next fusion target.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40864

Reviewed By: ngimel

Differential Revision: D22392877

Pulled By: soumith

fbshipit-source-id: 457448a807d628b1035f6d90bc0abe8a87bf8447
2020-07-06 14:52:49 -07:00
e026d91506 [JIT] Remove dead store in unpickler.cpp (#40625)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40625

Test Plan: Continuous integration.

Reviewed By: suo

Differential Revision: D22259289

fbshipit-source-id: 76cb097dd06a636004fc780b17cb20f27d3821de
2020-07-06 14:48:03 -07:00
d753f1c2e1 Fixes formatting of vander, count_nonzero, DistributedSampler documentation (#41025)
Summary:
Bundle of small edits to fix formatting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41025

Differential Revision: D22398364

Pulled By: mruberry

fbshipit-source-id: 8d484cb52a1cf4a8eb1f64914574250c9fd5043d
2020-07-06 14:26:13 -07:00
0fbd42b20f [pytorch] deprecate PYTORCH_DISABLE_TRACING macro (#41004)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41004

Tracing has been moved into separate files. Now we can disable it by not compiling the source files for xplat mobile build.
ghstack-source-id: 107158627

Test Plan: CI + build size bot

Reviewed By: iseeyuan

Differential Revision: D22372615

fbshipit-source-id: bf2e2249e401295ff63020a292df119b188fb966
2020-07-06 14:22:59 -07:00
7f60642bae [pytorch] add manual registration for trace type (#40903)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40903

This PR continues the work of #38467 - decoupling Autograd and Trace for manually registered ops.
ghstack-source-id: 107158638

Test Plan: CI

Differential Revision: D22354804

fbshipit-source-id: f5ea45ade2850296c62707a2a4449d7d67a9f5b5
2020-07-06 14:20:37 -07:00
e173278348 Update quantization.rst (#40896)
Summary:
Add documentation for dynamic quantized modules

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40896

Differential Revision: D22395955

Pulled By: z-a-f

fbshipit-source-id: cdc956d1509a0901bc24b73b6ca68a1b65e00cc2
2020-07-06 13:47:39 -07:00
e75f12ac15 Check statistical diff rather than exact match for test_dropout_cuda. (#40883)
Summary:
There is a TODO tracked in https://github.com/pytorch/pytorch/issues/40882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40883

Reviewed By: pbelevich

Differential Revision: D22346087

Pulled By: ailzhang

fbshipit-source-id: b4789ca3a10f6a72c6e77276bde45633eb6cf545
2020-07-06 13:11:48 -07:00
c38a5cba0d Remove duplicate assignment in collate.py (#40655)
Summary:
Duplicated assignment
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40655

Reviewed By: ezyang

Differential Revision: D22308827

Pulled By: colesbury

fbshipit-source-id: 48361da8994b3ca00ef29e9afd3ec2672266f00a
2020-07-06 12:37:59 -07:00
c935712d58 Use unbind for tensor.__iter__ (#40884)
Summary:
Unbind, which has a special backward with cat, is arguably better than multiple selects, whose backward creates and adds a bunch of tensors as big as `self`.
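A minimal sketch of the difference (shapes made up for illustration):

```python
import torch

t = torch.randn(3, 4, requires_grad=True)

rows_new = torch.unbind(t, dim=0)              # what __iter__ now uses
rows_old = [t.select(0, i) for i in range(3)]  # the previous behavior

# The backward of unbind is a single cat of the row gradients; the backward
# of N selects creates and sums N gradient tensors as large as `t`.
torch.stack(rows_new).sum().backward()
```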

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40884

Reviewed By: pbelevich

Differential Revision: D22363376

Pulled By: zou3519

fbshipit-source-id: 0911cdbb36f9a35d1b95f315d0a2f412424e056d
2020-07-06 10:53:15 -07:00
f6f3c0094a Revert D22369579: add eq.str, ne.str, and add.str ops
Test Plan: revert-hammer

Differential Revision:
D22369579 (0deb2560b8)

Original commit changeset: 7ac9a184d437

fbshipit-source-id: 9c861b9f6bf32fe51fa0ea516cf09a3d09d78a7c
2020-07-06 09:52:59 -07:00
9c82b570bf Fix delegating to jit.load from torch.load (#40937)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40937

Test Plan: Imported from OSS

Differential Revision: D22363816

Pulled By: jamesr66a

fbshipit-source-id: 50fc318869407fe8b215368026eaceb129b68a46
2020-07-06 09:00:13 -07:00
73c5a78f43 Test test_int8_ops_nnpi.py case typo fix. (#41008)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41008

Test test_int8_ops_nnpi.py case typo fix.

Test Plan: test_int8_ops_nnpi.py case typo fix.

Reviewed By: hl475

Differential Revision: D22390331

fbshipit-source-id: 8d257c72114ce890720219eb519b9cb43b2ca49b
2020-07-06 08:44:08 -07:00
46f5cf1e31 Improve error reporting of AVX instruction in CI job (#40681)
Summary:
Close https://github.com/pytorch/pytorch/issues/40320

Leverage `qemu` and `gdbserver` to print the backtrace and faulting instruction, helping developers better understand the causes of failed tests.

Signed-off-by: Xiong Wei <xiongw.fnst@cn.fujitsu.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40681

Differential Revision: D22391512

Pulled By: malfet

fbshipit-source-id: 19f125cf6c0e5a51814aff2b1d4d3c81298e3cb6
2020-07-06 08:31:01 -07:00
e1afa9daff fix cmake bug (#39930)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39930

Differential Revision: D22391207

Pulled By: ezyang

fbshipit-source-id: bde19a112846e124d4e5316ba947f48d4dccf361
2020-07-06 08:02:30 -07:00
0b9717b86a When linking libtorch_cpu.so, put AVX sources last in the input list (#40449)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39600
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40449

Reviewed By: VitalyFedyunin

Differential Revision: D22312501

Pulled By: colesbury

fbshipit-source-id: 4c09adb0173749046f20b84241d6c940b339ad77
2020-07-06 07:56:12 -07:00
063d5b0d3f Remove get_fail_msg in test_dataloader.test_proper_exit (#40745)
Summary:
Close https://github.com/pytorch/pytorch/issues/40744
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40745

Reviewed By: ezyang

Differential Revision: D22308972

Pulled By: colesbury

fbshipit-source-id: 4b4847e6b926b2614c8b14f17a9db3b0376baabe
2020-07-06 07:48:32 -07:00
450ba49653 Add the missing resource_class key in the update_s3_htmls job (#41000)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40998.

Actually I don't know why it is needed. But without it, the build won't start. See my rerun of the update_s3_html3 job: https://app.circleci.com/pipelines/github/pytorch/pytorch/187926/workflows/432dbe98-ca2f-484d-acc7-0482cb3fd01f/jobs/6121551/steps.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41000

Differential Revision: D22390654

Pulled By: malfet

fbshipit-source-id: 0f296c8a82fa92d5382f883bca951e6576f75b15
2020-07-06 07:02:11 -07:00
54d7a1e3f4 Fix module dict key ordering (#40905)
Summary:
fix https://github.com/pytorch/pytorch/issues/40227
Removed the sorting operation in the ModuleDict class and updated the docstring.
Also removed a sort operation in the corresponding unit test, which would otherwise make the test fail.

BC Note: from Python 3.6 onward, plain dicts preserve the insertion order of keys.
example:
For a Python 3.6+ user initializing a ModuleDict from a plain python dict:
{
"b": torch.nn.MaxPool2d(3),
"a": torch.nn.MaxPool2d(3)
}
the resulting ModuleDict preserves the order:
ModuleDict(
(b): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
(a): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
)

For a Python 3.5 user with the same input, the resulting ModuleDict could instead be:
ModuleDict(
(a): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
(b): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40905

Differential Revision: D22357480

Pulled By: albanD

fbshipit-source-id: 0e2502769647bb64f404978243ca1ebe5346d573
2020-07-06 06:40:48 -07:00
0deb2560b8 add eq.str, ne.str, and add.str ops (#40958)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40958

add 3 str operators to lite interpreter
eq.str
ne.str
add.str

Test Plan:
```
buck run //xplat/caffe2/fb/pytorch_predictor:converter /mnt/vol/gfsfblearner-altoona/flow/data/2020-06-29/1ca8a85f-dbd5-4181-b5fc-63d24465c1fc/201084299/2068673333/model.pt1 ~/model_f201084299.bc

buck run xplat/assistant/model_benchmark_tool/mobile/binary/:lite_predictor -- --model ~/model_f201084299.bc --input_file /tmp/gc_model_input.txt --model_input_args src_tokens,dict_feat,contextual_token_embedding --warmup 1 --iter 2

```

Reviewed By: pengtxiafb

Differential Revision: D22369579

fbshipit-source-id: 7ac9a184d437c875edfb584221edd706bffb16e1
2020-07-06 01:01:15 -07:00
300a3aaaad [jit] move private implementation out of jit/__init__.py (#40807)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40807

We pack a lot of logic into `jit/__init__.py`, making it unclear to
developers and users which parts of our API are public vs. internal. This
is one in a series of PRs intended to pull implementation out into
separate files, and leave `__init__.py` as a place to register the
public API.

This PR moves all the tracing-related stuff out, and fixes other spots up
as necessary. Followups will move other core APIs out.

The desired end-state is that we conform to the relevant rules in [PEP 8](https://www.python.org/dev/peps/pep-0008/#public-and-internal-interfaces). In particular:
- Internal implementation goes in modules prefixed by `_`.
- `__init__.py` exposes a public API from these private modules, and nothing more.
- We set `__all__` appropriately to declare our public API.
- All uses of JIT-internal functionality outside the JIT are removed (in particular, ONNX relies on a number of internal APIs). Since they will need to be imported explicitly, it will be easier to catch new uses of internal APIs in review.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D22320645

Pulled By: suo

fbshipit-source-id: 0720ea9976240e09837d76695207e89afcc58270
2020-07-05 22:01:11 -07:00
1e64bf4c40 [CircleCI] Delete docker image after testing (#40917)
Summary:
Needed maintenance step to avoid running out of disk space on RocM testers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40917

Differential Revision: D22385844

Pulled By: malfet

fbshipit-source-id: b6dc9ba888a2e34c311e9bf3c8b7b98fa1ec5435
2020-07-05 13:21:00 -07:00
8ecd4f36aa fix __len__, __contains__, getitem inherited from interface class derived from nn container (closes #40603) (#40789)
Summary:
Define static script implementations of __len__ and __contains__ on any subclass derived from a type such as ModuleList, Sequential, or ModuleDict. Implement __getitem__ for classes derived from ModuleDict.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40789

Reviewed By: eellison

Differential Revision: D22325159

Pulled By: wconstab

fbshipit-source-id: fc1562c29640fe800e13b5a1dd48e595c2c7239b
2020-07-04 15:45:18 -07:00
8223858cc1 shape inference of undefined for prim::grad (#40866)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40866

Reviewed By: pbelevich

Differential Revision: D22358988

Pulled By: Krovatkin

fbshipit-source-id: 7118d7f8d4eaf056cfb71dc0d588d38b1dfb0fc7
2020-07-04 14:10:22 -07:00
88c0d886e3 update requires_grad on loop inputs correctly (master) (#40926)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40926

Reviewed By: eellison

Differential Revision: D22359471

Pulled By: Krovatkin

fbshipit-source-id: 823e87674e2d2917f075255ec926e0485972f4e2
2020-07-04 13:58:29 -07:00
0790d11a18 typing for tensor.T/grad_fn torch.Size (#40879)
Summary:
fixes https://github.com/pytorch/pytorch/issues/40658

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40879

Reviewed By: pbelevich

Differential Revision: D22339146

Pulled By: ezyang

fbshipit-source-id: 6b4695e102591e7a2c391eb337c154414bacf67c
2020-07-04 11:58:29 -07:00
0fc0a9308a fix autodoc for torch.distributed.launch (#40963)
Summary:
The doc for `torch.distributed.launch` has been missing since v1.2.0 (see issue https://github.com/pytorch/pytorch/issues/36386) because PR https://github.com/pytorch/pytorch/issues/22501 added some imports at the first line.
542ac74987/torch/distributed/launch.py (L1-L5)
I moved the imports below the docstring so that Sphinx autodoc works normally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40963

Differential Revision: D22380816

Pulled By: mrshenli

fbshipit-source-id: ee8406785b9a198bbf3fc65e589854379179496f
2020-07-04 08:59:41 -07:00
480851ad2c Docstring changes for dynamic quantized classes (#40931)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40931

Fix docstrings for dynamic quantized Linear/LSTM and associated classes
ghstack-source-id: 107064446

Test Plan: Docs show up correctly

Differential Revision: D22360787

fbshipit-source-id: 8e357e081dc59ee42fd7f12ea5079ce5d0cc9df2
2020-07-03 21:04:12 -07:00
3b7df2388e [RFC] Profile rpc_async call from JIT (#40652)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40652

Resolves https://github.com/pytorch/pytorch/issues/40304, but looking for
feedback on whether there is a better approach for this.

In order to profile `rpc_async` calls made within a torchscript function, we
add the profiling logic to `rpcTorchscript` which is the point where the RPC is
dispatched and is called by the jit `rpc_async` operator. We take a somewhat
similar approach to how this is done in the python API. If profiling is
enabled, we call `record_function_enter` which creates a `RecordFunction`
object and runs its starting callbacks. Then, we schedule end callbacks for
this `RecordFunction` to be run when the jit future completes.

One caveat is that `rpcTorchscript` can also be called by rpc_async from a
non-JIT function, in which case the profiling logic lives in Python. We add a
check to ensure that we don't double profile in this case.
ghstack-source-id: 107109485

Test Plan: Added relevant unittests.

Differential Revision: D22270608

fbshipit-source-id: 9f62d1a2a27f9e05772d0bfba47842229f0c24e1
2020-07-03 15:17:16 -07:00
f3f113f103 [quant][graphmode][fix] Print the node in error message (#40889)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40889

Test Plan: Imported from OSS

Differential Revision: D22348266

fbshipit-source-id: eed2ece5c94fcfaf187d6770bed4a7109f0c0b4a
2020-07-03 10:01:55 -07:00
f083cea227 [RPC tests] Fix file descriptor leak (#40913)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40913

Summary of the entire stack:
--

This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.

This refactoring aims to address these problems by:
- Avoiding global state, defaults/override, traps, if/elses, ... and have a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).

It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...

Summary of this commit
--
Once we start merging multiple test suites in a single file (which will happen in the next diffs in the stack) the OSX tests on CircleCI start failing due to "too many open files". This indicates a file descriptor leak. I then managed to repro it on Linux too by lowering the limit on open file descriptors (`ulimit -n 500`). Each test method that unittest runs is run on a new instance of the TestCase class. With our multiprocessing wrappers, this instance contains a list of child processes. Even after these processes are terminated, it appears they still hold some open file descriptors (for example a pipe to communicate with the subprocess). It also appears unittest keeps these TestCase instances alive until the entire suite completes, which I suspect is what leads to this "leak" of file descriptors. Based on that guess, in this diff I am resetting the list of subprocesses during shutdown, and this seems to fix the problem.
ghstack-source-id: 107045908

Test Plan: Sandcastle and CircleCI

Differential Revision: D22356784

fbshipit-source-id: c93bb9db60fde72cae0b0c735a50c17e427580a6
2020-07-03 06:22:40 -07:00
f9a71d3de4 [RPC tests] Align ddp_under_dist_autograd test with others (#40815)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40815

Summary of the entire stack:
--

Same as the stack summary in the previous commit of this stack (see above).

Summary of this commit
--
This prepares the stack by aligning the `ddp_under_dist_autograd` test to the other ones, so that later changes will be more consistent and thus easier to follow. It does so by moving the `skipIf` decorators and the `setUp` methods from the base test suite to the entry point scripts.
ghstack-source-id: 107045911

Test Plan: Sandcastle and CircleCI

Differential Revision: D22287535

fbshipit-source-id: ab0c9eb774b21d81e0ebd3078df958dbb4bfa0c7
2020-07-03 06:20:29 -07:00
d0f2079b5e [RPC tests] Remove world_size and init_method from TensorPipe fixture (#40814)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40814

Summary of the entire stack:
--

Same as the stack summary in the earlier commits of this stack (see above).

Summary of this commit
--
This prepares the stack by simplifying the TensorPipe fixture. A comment says that the TensorPipe fixture cannot subclass the generic fixture class as that would lead to a diamond class hierarchy which Python doesn't support (whereas in fact it does), and therefore it copies over two properties that are defined on the generic fixture. However, each class that uses the TensorPipe fixture also inherits from the generic fixture, so there's no need to redefine those properties. And, in fact, by not redefining it we save ourselves some trouble when the TensorPipe fixture would end up overriding another override.
ghstack-source-id: 107045914

Test Plan: Sandcastle and CircleCI

Differential Revision: D22287533

fbshipit-source-id: 254c38b36ba51c9d852562b166027abacbbd60ef
2020-07-03 02:52:14 -07:00
3890550940 [RPC tests] Fix @_skip_if_tensorpipe always skipping for all agents (#40860)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40860

It turns out that the `@_skip_if_tensorpipe_agent` decorator was written in such a way that it accidentally caused the test to become a no-op (and thus always succeed) for all agents. What this means is that all tests wrapped by that decorator were never ever being run, for any agent.

My understanding of the root cause is that the following code:
```
@_skip_if_tensorpipe_agent
def test_foo(self):
    self.assertEqual(2 + 2, 4)
```
ended up behaving somewhat like this:
```
def test_foo(self):
    def original_test_func(self):
        self.assertEqual(2 + 2, 4)
    return unittest.skipIf(self.agent == "TENSORPIPE")(original_test_func)
```
which means that the test body of the decorated method was not actually calling the original test method.

This issue probably came from the `@_skip_if_tensorpipe_agent` being copy-pasted from `requires_process_group_agent` (which, however, is not a decorator but rather a decorator *factory*). An unfortunate naming (calling `decorator` what was in fact the wrapped method) then hindered readability and hid the issue.
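A minimal sketch of the pattern that avoids the bug: return a wrapper that decides at call time whether to run or skip the original test body (the `rpc_backend_name` attribute is hypothetical):

```python
import functools
import unittest

def skip_if_tensorpipe_agent(old_test_method):
    @functools.wraps(old_test_method)
    def wrapper(self, *args, **kwargs):
        if self.rpc_backend_name == "TENSORPIPE":  # hypothetical attribute
            raise unittest.SkipTest("not run on the TensorPipe agent")
        return old_test_method(self, *args, **kwargs)
    return wrapper
```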

Note that a couple of tests had become legitimately broken in the meantime and no one had noticed. The breakages have been introduced in #39909 (a.k.a., D22011868 (145df306ae)).
ghstack-source-id: 107045916

Test Plan: Discovered this as part of my refactoring, in D22332611. After fixing the decorator two tests started breaking (for real reasons). After fixing them all is passing.

Differential Revision: D22332611

fbshipit-source-id: f88ca5574675fdb3cd09a9f6da12bf1e25203a14
2020-07-03 02:50:11 -07:00
cab7d94d47 [PyTorch Numeric Suite] Remove unnecessary Logger in input arguments (#40890)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40890

Remove unnecessary Logger in input arguments and simplify the API.
ghstack-source-id: 107110487

Test Plan:
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_lstm_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_weights_linear_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_submodule_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_functional_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_stub_linear_dynamic'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_conv_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_functional_static'
buck test mode/dev caffe2/test:quantization -- 'test_compare_model_outputs_linear_dynamic'

Differential Revision: D22345477

fbshipit-source-id: d8b4eb3d6cb3049aa3296dead8ba29bf5467bd1c
2020-07-03 02:45:46 -07:00
542ac74987 [quant][graphmode][fix] Fold conv bn (#40865)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40865

1. applied filter for the module types
2. removed the assumption that the conv bn are immediate child of parent module

Test Plan:
python test/test_quantization.py TestQuantizeJitPasses

Imported from OSS

Differential Revision: D22338074

fbshipit-source-id: 64739a5e56c0a74249a1dbc2c8454b88ec32aa9e
2020-07-03 00:01:04 -07:00
824ab19941 [quant][graphmode] Support quantization for aten::append (#40743)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40743

`aten::append` modifies its input in place and the output is ignored; such ops are not
supported right now, so we first need to make `aten::append` non-inplace
by changing
```
ignored = aten::append(list, x)
```
to
```
x_list = aten::ListConstruct(x)
result = aten::add(list, x_list)
```
and then quantize the aten::add instead.

Test Plan:
TestQuantizeJitOps.test_general_shape_ops

Imported from OSS

Differential Revision: D22302151

fbshipit-source-id: 931000388e7501e9dd17bec2fad8a96b71a5efc5
2020-07-02 22:26:52 -07:00
ff17b83fd8 [pytorch][ci] add custom selective build flow for android build (#40199)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40199

Mobile custom selective build has already been covered by `test/mobile/custom_build/build.sh`.
It builds a CLI binary with host-toolchain and runs on host machine to
check correctness of the result.

But that custom build test doesn't cover the android/gradle build part.
And we cannot use it to measure and track the in-APK size of the custom-build library.
build library.

So this PR adds the selective build test coverage for android NDK build.
Also integrate with the CI to upload the custom build size to scuba.

TODO:
Ideally it should build android/test_app and measure the in-APK size.
But the test_app hasn't been covered by any CI yet and is currently
broken, so build & measure AAR instead (which can be inaccurate as we
plan to pack C++ header files into AAR soon).

Sample result: https://fburl.com/scuba/pytorch_binary_size/skxwb1gh
```

+---------------------+-------------+-------------------+-----------+----------+
|     build_mode      |    arch     |        lib        | Build Num |   Size   |
+---------------------+-------------+-------------------+-----------+----------+
| custom-build-single | armeabi-v7a | libpytorch_jni.so |   5901579 | 3.68 MiB |
| prebuild            | armeabi-v7a | libpytorch_jni.so |   5901014 | 6.23 MiB |
| prebuild            | x86_64      | libpytorch_jni.so |   5901014 | 7.67 MiB |
+---------------------+-------------+-------------------+-----------+----------+
```

Test Plan: Imported from OSS

Differential Revision: D22111115

Pulled By: ljk53

fbshipit-source-id: 11d24efbc49a85f851ecd0e481d14123f405b3a9
2020-07-02 21:11:01 -07:00
28e1d241cd [pytorch] factor out binary size upload command (#40188)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40188

Create a custom command for this task to avoid copy/paste for new build jobs.

Test Plan: Imported from OSS

Differential Revision: D22111114

Pulled By: ljk53

fbshipit-source-id: a7d4d6bbd61ba6b6cbaa137ec7f884736957dc39
2020-07-02 21:08:17 -07:00
3c22c7aadc infer tensor properties based on an input tensor rather than defaults for xxx_like ctors (#40895)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40895

Reviewed By: eellison

Differential Revision: D22358878

Pulled By: Krovatkin

fbshipit-source-id: 2db2429aa89c180d8e52a6bb1265308483da46a2
2020-07-02 20:56:35 -07:00
6095808d22 fix pca_lowrank memory consumption (#40853)
Summary:
Per title, fixes https://github.com/pytorch/pytorch/issues/40768
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40853

Reviewed By: pbelevich

Differential Revision: D22363906

Pulled By: ngimel

fbshipit-source-id: 966a4b230d351f7632c5cfae4a3b7c9a787bc9a5
2020-07-02 17:52:41 -07:00
3ca5849f0a Add serializer and deserializer for Int8QuantSchemeBlob and Int8QuantParamsBlob (#40661)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40661

Add ser-de to support int8 quantization during online training

Test Plan:
```
buck test caffe2/caffe2/fb/fbgemm:int8_serializer_test
```

Reviewed By: hx89

Differential Revision: D22273292

fbshipit-source-id: 3b1e9c820243acf41044270afce72a262ef92bd4
2020-07-02 17:17:05 -07:00
f8d4878b3c check for unsupported instructions when exporting mobile models (#40791)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40791

Test Plan: Imported from OSS

Differential Revision: D22311469

Pulled By: ann-ss

fbshipit-source-id: 7a6abb3f2477e8553f8c71f4aa0442df4f712fb5
2020-07-02 16:24:11 -07:00
3c6b8a6496 Revert D22360735: .circleci: Build docker images as part of CI workflow
Test Plan: revert-hammer

Differential Revision:
D22360735 (af5bcba217)

Original commit changeset: 4ffbde563fdc

fbshipit-source-id: 4ae2288f466703754c9e329d34d344269c70db83
2020-07-02 16:16:31 -07:00
a1c234e372 Revert D22330340: [C2] Fixed a bug in normalization operator
Test Plan: revert-hammer

Differential Revision:
D22330340 (ce63f70981)

Original commit changeset: 0bccf925bb76

fbshipit-source-id: e27d70dee0fbe9e708b0cf3be81dbd33c4015026
2020-07-02 16:05:23 -07:00
9cc73966b3 [TVM] Fix build and sync with caffe2/caffe2/python/dlpack.h (#40888)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40888

Reviewed By: yinghai

Differential Revision: D22326379

fbshipit-source-id: 96ffcff5738973312c49368f53f35bf410e4c0c9
2020-07-02 15:37:45 -07:00
b7517a76ba rshift use default >> operator (#40545)
Summary:
Fix https://github.com/pytorch/pytorch/issues/40032
Also see https://github.com/pytorch/pytorch/pull/35339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40545

Reviewed By: pbelevich

Differential Revision: D22362816

Pulled By: ngimel

fbshipit-source-id: 4bbf9212b21a4158badbfee8146b3b67e94d5a33
2020-07-02 15:13:12 -07:00
dec3f918a0 Migrate 'torch.dot' from TH to Aten (CUDA) (#40646)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40646

Support double, float, at::Half.
Avoid creating the output result on CPU.

Both tensors must be on the GPU.
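A minimal usage sketch (requires a CUDA device):

```python
import torch

a = torch.randn(4, device="cuda", dtype=torch.half)
b = torch.randn(4, device="cuda", dtype=torch.half)
print(torch.dot(a, b))  # both tensors must live on the GPU
```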

Reviewed By: ngimel

Differential Revision: D22258840

fbshipit-source-id: 95f4747477f09b40b1d682cd1f76e4c2ba28c452
2020-07-02 14:48:59 -07:00
81aebf380e pytorch | Fix linking of qnnpack params on windows. (#40920)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40920

PyTorch depends on this from both C and C++ source files, so unify the linkage so it's fixed in both.

Test Plan: Build it on Windows

Reviewed By: dreiss, supriyar

Differential Revision: D22348247

fbshipit-source-id: 2933b4804f4725ab1742914656fa367527f8f7e1
2020-07-02 13:46:20 -07:00
a7e09b8727 pytorch | Namespace init_win symbol in qnnpack.
Summary: Namespacing the symbol, since it clashes with "the real thing" otherwise.

Test Plan: Sandcastle + build it on windows

Reviewed By: dreiss

Differential Revision: D22348240

fbshipit-source-id: f9c9a7abc97626ba327605cb4749fc5c38a24d35
2020-07-02 13:37:40 -07:00
e1428cf41b [JIT] fix unfold shape analysis (#40749)
Summary:
unfold on a 0-dimensional tensor returns a 1-dimensional tensor (see the sketch below)
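A minimal illustration of the behavior the fixed shape analysis now encodes:

```python
import torch

t = torch.tensor(5.0)       # 0-dimensional tensor
u = t.unfold(0, 1, 1)
print(u, u.dim())           # tensor([5.]) 1  -> a 1-dimensional result
```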
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40749

Differential Revision: D22361481

Pulled By: eellison

fbshipit-source-id: 621597e5f97f6e39953eb86f8b85bb4142527a9f
2020-07-02 13:32:37 -07:00
ce63f70981 [C2] Fixed a bug in normalization operator (#40925)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40925

normalization operator does not handle empty tensors correctly. This is a fix.

Test Plan: unit tests

Differential Revision: D22330340

fbshipit-source-id: 0bccf925bb768ebb997ed0c88130c5556308087f
2020-07-02 13:24:56 -07:00
af5bcba217 .circleci: Build docker images as part of CI workflow (#40827)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40827

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22360735

Pulled By: seemethere

fbshipit-source-id: 4ffbde563fdc3c49fdd14794ed3c2e881030361d
2020-07-02 13:00:39 -07:00
9f14e48834 Override shape hints with real weight shape extracted from workspace (#40872)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40872

Shape hints, as the name suggests, are only hints. We should use the real shapes from the workspace for the weights.

Reviewed By: ChunliF

Differential Revision: D22337680

fbshipit-source-id: e7a6101fb613ccb332c3e34b1c2cb8c6c47ce79b
2020-07-02 12:55:29 -07:00
db39542509 [2/n][Compute Meta] support analysis for null flag features
Summary:
## TLDR
Support using NaN default value for missing dense features in RawInputProcessor for DPER2. In preparation for subsequent support for null flag features in compute meta. For train_eval this is already supported in DPER3 and we do not plan to support this in DPER2 train eval.
## Overview
Intern project plan to support adding dense flags for missing feature values instead of replacing with zero.
## Project plan :
https://docs.google.com/document/d/1OsPUTjpJycwxWLCue3Tnb1mx0uDC_2KKWvC1Rwpo2NI/edit?usp=sharing
## Code paths:
See https://fb.quip.com/eFXUA0tbDmNw for the call stack for all affected code paths.

Test Plan:
## fblearner flow test

1. `flow-cli clone f197867430 --run-as-secure-group ads_personalization_systems --force-build` to build a ephemeral package and start a fblearner flow run (may fail)
2. Clone the new run and change the secure_group to `XXXX` and entitlement to `default` in the UI
3. Adds explicit_null_min_coverage flag
4. Optionally reduce `max_examples` since we only test pass/fail instead of quality.
5. Submit the run to test the change

Example:
f198538878

## compare output coverages to daiquery runs

1. Randomly select null flag features from compute meta workflow output
2. Look up the feature id in feature metadata using feature name
3. Check against a daiquery sample of coverage to see if the coverage falls within guidelines.
https://www.internalfb.com/intern/daiquery/workspace/275342740223489/192619942076136/

## Sampled features:
GFF_C66_ADS_USER_SUM_84_PAGE_TYPE_RATIO_EVENT_LIKE_IMPRESSION: 15694257
- original feature compute meta coverage: 0.999992
- daiquery feature coverage (10k rows): 0.69588
- null flag compute meta coverage: 0.293409
GFF_R1303_ADS_USER_SUM_7_PAGE_TYPE_COUNTER_CONVERSION: 16051183
-  original feature compute meta coverage: 0.949868
- daiquery feature coverage: 0.82241
- null flag compute meta coverage: 0.151687

## Unit tests:

`buck test  fblearner/flow/projects/dper/tests/workflows:ads_test`

https://www.internalfb.com/intern/testinfra/testconsole/testrun/6192449504303863/

Differential Revision: D22026450

fbshipit-source-id: 46c59d849fa89253f14dc2b035c4c677cd6e3a4c
2020-07-02 12:44:41 -07:00
b678666a04 Add module.training to docs (#40923)
Summary:
A lot of people ask https://discuss.pytorch.org/t/check-if-model-is-eval-or-train/9395/3
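A minimal sketch of the attribute being documented:

```python
import torch

m = torch.nn.Dropout(p=0.5)
m.train()
print(m.training)  # True
m.eval()
print(m.training)  # False
```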
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40923

Reviewed By: pbelevich

Differential Revision: D22358799

Pulled By: zou3519

fbshipit-source-id: b5465ffedb691fb4811e097c4dbd7bbc405be09c
2020-07-02 12:36:59 -07:00
6ae3cd0d9d Configure RPC metrics handlers and pass them into Thrift RPC Agent (#40602)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40602

Reviewed By: pritamdamania87

Differential Revision: D22250592

fbshipit-source-id: d38131f30939fc26af241b40e057a9dc1109e950
2020-07-02 11:41:21 -07:00
6aabd12390 fix issue #31759 (allow valid ASCII python identifiers as dimnames) (#40871)
Summary:
Fixes issue https://github.com/pytorch/pytorch/issues/31759:
- Changes the is_valid_identifier check on named tensor dimensions to allow digits when they are not at the beginning of the name (this allows exactly the ASCII subset of [valid python identifiers](https://docs.python.org/3/reference/lexical_analysis.html#identifiers)); see the sketch after this list.
- Updates error message for illegal dimension names.
- Updates and adds relevant tests.
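A minimal sketch of the relaxed rule (the dimension names are made up for illustration):

```python
import torch

# Digits are now legal in a dimension name as long as they do not lead it.
t = torch.zeros(2, 3, names=("batch", "dim1"))   # ok after this fix
# torch.zeros(2, names=("1dim",)) would still raise: leading digit
print(t.names)
```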
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40871

Reviewed By: pbelevich

Differential Revision: D22357314

Pulled By: zou3519

fbshipit-source-id: 9550a1136dd0673dd30a5cd5ade28069ba4c9086
2020-07-02 11:35:54 -07:00
5db5a0f2bb Re-enable Caffe2 test RoiAlignTest.CheckCPUGPUEqual (#40901)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35547.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40901

Differential Revision: D22357760

Pulled By: malfet

fbshipit-source-id: 43f7dc13a905416288a9a317ae31a4dc78276ce4
2020-07-02 11:22:23 -07:00
1a74bb84f2 Remove Int8FC diff restriction.
Summary: Remove Int8FC diff restriction.

Test Plan: test_int8_ops_nnpi.py

Reviewed By: hyuen

Differential Revision: D22353200

fbshipit-source-id: c6c80c9dda3245c02da8343ecd5689994baf0143
2020-07-02 08:15:31 -07:00
591fffc524 Type-annotate serialization.py (#40862)
Summary:
Move Storage class from __init__.pyi.in to types.py and make it a protocol, since this is not a real class
Expose `PyTorchFileReader` and `PyTorchFileWriter` native classes

Ignore function attributes, as there is not yet a good way to type-annotate those; see https://github.com/python/mypy/issues/2087
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40862

Differential Revision: D22344743

Pulled By: malfet

fbshipit-source-id: 95cdb6f980ee79383960f306223e170c63df3232
2020-07-02 07:10:55 -07:00
9fa1f27968 [jit] Fix value association with dictionaries in the tracer (#40885)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40885

`TracingState::setValue` associates a concrete IValue in the traced
program with a `Value*` symbolic. Previously, the logic for how
GenericDicts worked was special cased to only work for very simple cases
and silently eat other cases.

This PR generalizes the logic to reflect the same behavior as using
dictionaries on input: whenever we encounter a dictionary in the system,
we completely "burn in" all the keys into the graph, and then
recursively call `setValue` on the associated value.

This has the effect of requiring that any dictionary structure you are
creating in a traced program be of fixed structure, similar to how any
dictionary used as input must be static as well.

Test Plan: Imported from OSS

Differential Revision: D22342490

Pulled By: suo

fbshipit-source-id: 93e610a4895d61d9b8b19c8d2aa4e6d57777eaf6
2020-07-02 04:09:35 -07:00
59294fbbb9 [caffe2] Reimplement RemoveOpsByType with SSA (#40649)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40649

The original implementation of RemoveOpsByType is pretty buggy and does not remove all instances of the ops that should be removed. It's also quite complicated and hard to modify. I reimplemented it by first converting the graph to its SSA form. The algorithm is quite simple once the graph is in SSA form. It's very similar to constant propagation with a few modifications. The hardest part is to deal with the case of removing an op with the output being an output of the predict net, because that output has to be preserved.

(Note: this ignores all push blocking failures!)

Reviewed By: yinghai, dzhulgakov

Differential Revision: D22220798

fbshipit-source-id: faf6ed5242f1e2f310125d964738c608c6c55c94
2020-07-02 02:45:36 -07:00
ea03f954ad [ONNX] Add warning in ONNX export when constant folding is on in training-amenable mode (#40546)
Summary:
This PR introduces a warning when a user tries to export a model to ONNX in training-amenable mode while constant folding is turned on. We want to warn against unintentional use, because constant folding may fold parameters that are intended to be trainable in the exported model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40546

Reviewed By: hl475

Differential Revision: D22310917

Pulled By: houseroad

fbshipit-source-id: ba83b8e63af7c458b5ecca8ff2ee1c77e2064f90
2020-07-01 21:40:38 -07:00
73f11dc3d1 torch._six.PY37 should be true for Python-3.8 as well (#40868)
Summary:
Right now it is used to check whether `math.remainder` exists, which is the case for both Python-3.7 and 3.8
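A sketch of the intended check, assuming the fix makes `PY37` mean "Python 3.7 or newer":

```python
import sys

# True on 3.7, 3.8, and later, i.e. the versions where math.remainder exists.
PY37 = sys.version_info[:2] >= (3, 7)
```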
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40868

Differential Revision: D22343454

Pulled By: malfet

fbshipit-source-id: 6b6d4869705b64c4b952309120f92c04ac7e39fd
2020-07-01 19:49:37 -07:00
8f6e50d013 Make some more ops c10-full (#40747)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40747

-
ghstack-source-id: 106833603

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D22299161

fbshipit-source-id: 6e34999b5f8244d9582e4978754039d340720ca8
2020-07-01 19:39:32 -07:00
d7c9f96e43 Optimize perf for calling ops with custom classes (#38257)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38257

It seems we're doing a runtime type check for custom classes on each operator call if the operator has custom class arguments.
This has no effect on operators without custom class arguments, but it is a problem for operators that take them,
for example operators taking an at::native::xnnpack::Conv2dOpContext argument.

The long term solution would be to move those checks to op registration time instead of doing them at call time,
but as an intermediate fix, we can at least make the check fast by

- Using ska::flat_hash_map instead of std::unordered_map
- Using std::type_index instead of std::string (i.e. avoid calling std::hash on a std::string)
ghstack-source-id: 106805209

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D21507226

fbshipit-source-id: bd120d5574734be843c197673ea4222599fee7cb
2020-07-01 19:28:29 -07:00
2f47e953f7 Fixes #40158 (#40617)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40158

Description
- docs update: removed incorrect statements
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40617

Reviewed By: ezyang

Differential Revision: D22308802

Pulled By: yns88

fbshipit-source-id: e33084af320f249c0c9ba04bdbe2191d1b954d17
2020-07-01 18:05:44 -07:00
04b6e4273e clang format reducer.cpp (#40876)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40876

clang format reducer.cpp
ghstack-source-id: 106980050

Test Plan: unit test

Differential Revision: D22321422

fbshipit-source-id: 54afdff206504c7bbdf2e408928cc32068e15cdc
2020-07-01 17:24:37 -07:00
ad30d465d5 Move install_torchvision to common.sh so that it can be sourced. (#40828)
Summary:
Moving this to a file that can be sourced by downstream pytorch/xla.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40828

Reviewed By: malfet

Differential Revision: D22339513

Pulled By: ailzhang

fbshipit-source-id: c43b18fa2b7e1e8bb6810a6a43bb7dccd4756238
2020-07-01 16:40:43 -07:00
49e12d888a [NCCL - reland] Explicitly abort NCCL Communicators on Process Group Destruction (#40585)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40585

This PR aborts incomplete NCCL Communicators in the ProcessGroupNCCL
destructor. This should prevent pending NCCL communicators from blocking other CUDA ops.
ghstack-source-id: 106988073

Test Plan: Sandcastle/ OSS CI

Differential Revision: D22244873

fbshipit-source-id: 4b4fe65e1bd875a50151870f8120498193d7535e
2020-07-01 16:21:16 -07:00
af34f2f63b Added missing generator argument in type annotation (pytorch#40803) (#40873)
Summary:
Added missing generator argument in type annotation (pytorch#40803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40873

Differential Revision: D22344217

Pulled By: malfet

fbshipit-source-id: 9871401b97c96fa20c70e3f66334259ead1f8429
2020-07-01 16:05:18 -07:00
c73255801f Fix the autograd codegen for repeat function (#40766)
Summary:
Fix https://github.com/pytorch/pytorch/issues/40701

A new special case is added to let `dim()` save an int instead of self.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40766

Differential Revision: D22308354

Pulled By: albanD

fbshipit-source-id: 69008230d7398b9e06b8e074a549ae921c2bf603
2020-07-01 15:43:28 -07:00
26543e6caf [quant][graphmode] FP16 quant support - Operator Fusion (#40710)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40710

Test Plan:
python test/test_quantization.py

Imported from OSS

Differential Revision: D22335975

fbshipit-source-id: 5c176bb6b9c300e1beb83df972149dd5a400b854
2020-07-01 14:15:53 -07:00
55b5ab14d3 [quant][graphmode] FP16 quant support - Insert cast operators (#40709)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40709

Cast to kHalf and back to kFloat before the linear operator to mimic FP16 quant support
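
A minimal sketch of the numerics this mimics (standalone illustration, not the actual graph pass):

```python
import torch

linear = torch.nn.Linear(4, 4)
x = torch.randn(2, 4)

# Round activations and weights through half precision, then compute in
# float, mimicking FP16 quantization numerics.
x_h = x.to(torch.half).to(torch.float)
w_h = linear.weight.to(torch.half).to(torch.float)
out = torch.nn.functional.linear(x_h, w_h, linear.bias)
```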

Test Plan:
python test/test_quantization.py test_convert_dynamic_fp16

Imported from OSS

Differential Revision: D22335977

fbshipit-source-id: f964128ec733469672a1ed4cb0d757d0a6c22c3a
2020-07-01 14:15:51 -07:00
6aebd2c412 [quant][graphmode] Add FP16 quant support - Insert Noop Observers (#40708)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40708

Insert NoopObservers for activations and weight tensors for FP16

Test Plan:
python test/test_quantization.py test_prepare_dynamic

Imported from OSS

Differential Revision: D22335976

fbshipit-source-id: b19e8035c7db3b0b065ec09c9ad6d913eb434f3e
2020-07-01 14:13:31 -07:00
d1352192e2 Move OperatorBase::AddRelatedBlobInfo implementation to .cc file (#40844)
Summary:
If a virtual function is implemented in a header file, its implementation will be included as a weak symbol in every shared library that includes this header, along with all of its dependencies.

This was one of the reasons why the size of libcaffe2_module_test_dynamic.so was 500Kb (the AddRelatedBlobInfo implementation pulled a quarter of libprotobuf.a with it).

The combination of this and https://github.com/pytorch/pytorch/issues/40845 reduces the size of `libcaffe2_module_test_dynamic.so` from 500Kb to 50Kb.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40844

Differential Revision: D22334725

Pulled By: malfet

fbshipit-source-id: 836a4cbb9f344355ddd2512667e77472546616c0
2020-07-01 11:48:15 -07:00
cbdf399fc6 Move OperatorSchema default inference function implementations to .cc… (#40845)
Summary:
…file

This prevents the implementations of those functions (defined as lambdas) from being embedded as weak symbols in every shared library that includes this header.

The combination of this and https://github.com/pytorch/pytorch/pull/40844 reduces the size of `libcaffe2_module_test_dynamic.so` from 500Kb to 50Kb.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40845

Differential Revision: D22334779

Pulled By: malfet

fbshipit-source-id: 64706918fc2947350a58c0877f294b1b8b085455
2020-07-01 11:42:52 -07:00
c71ec1c717 Fix zip serialization for file > 2GiB for Windows (#40783)
Summary:
`long long == int64_t != long` in MSVC
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40783

Differential Revision: D22328757

Pulled By: ezyang

fbshipit-source-id: bc7301d6b0e7e00ee6d7ca8637e3fce7810b15e2
2020-07-01 08:15:27 -07:00
a0569ad8f8 [android][readme] Aar native linking add fbjni (#40578)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40578

Test Plan: Imported from OSS

Differential Revision: D22239286

Pulled By: IvanKobzarev

fbshipit-source-id: 7a4160b621af8cfcc3b3d9e6da1a75c8afefba27
2020-07-01 08:09:17 -07:00
fcadca1bda serialization: validate sparse tensors after loading (#34059)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33439

This introduces torch._sparse_coo_tensor_unsafe(...) and
torch._validate_sparse_coo_tensor_args(...)
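
Roughly, the split looks like this (illustrative; these are private APIs and exact signatures may differ):

```python
import torch

indices = torch.tensor([[0, 1], [1, 0]])
values = torch.tensor([3.0, 4.0])

# Check that indices/values are consistent with the shape before
# trusting data deserialized from a file.
torch._validate_sparse_coo_tensor_args(indices, values, (2, 2))
# Construct the tensor without re-running the invariant checks.
t = torch._sparse_coo_tensor_unsafe(indices, values, (2, 2))
```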
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34059

Differential Revision: D22161254

Pulled By: ezyang

fbshipit-source-id: 994efc9b0e30abbc23ddd7b2ec987e6ba08a8ef0
2020-06-30 22:31:21 -07:00
5f9e7240f5 Fix bug where explicitly providing a namespace never worked. (#40830)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40830

Fixes #40725

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22323886

Pulled By: ezyang

fbshipit-source-id: b8a61496923d9f086d4c201024748505ba783238
2020-06-30 22:20:05 -07:00
2cf9fe2d92 Remove more error-exposing tests in exp that cannot be reliably reproduced (#40825)
Summary:
Continuing https://github.com/pytorch/pytorch/issues/40824

All CIs have been enabled (on a branch that starts with `ci-all/`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40825

Differential Revision: D22328732

Pulled By: ezyang

fbshipit-source-id: 3e517d01a9183d95df0687b328fb268947ea5fb0
2020-06-30 22:14:32 -07:00
f13653db29 [Update transforms.py]use build-in atanh in TanhTransform (#40160)
Summary:
Since `torch.atanh` was recently implemented in https://github.com/pytorch/pytorch/issues/38388, we should simply use it for `TanhTransform`.
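
For example, the inverse now goes through the built-in `torch.atanh` (illustrative):

```python
import torch
from torch.distributions.transforms import TanhTransform

t = TanhTransform()
y = t(torch.tensor([0.5]))  # tanh(0.5)
x = t.inv(y)                # recovered via torch.atanh
```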
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40160

Differential Revision: D22208039

Pulled By: ezyang

fbshipit-source-id: 34dfbc91eb9383461e16d3452e3ebe295f39df26
2020-06-30 21:38:22 -07:00
fbcf419173 Respect user set thread count. (#40707)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40707

Test Plan: Imported from OSS

Differential Revision: D22318197

Pulled By: AshkanAliabadi

fbshipit-source-id: f11b7302a6e91d11d750df100d2a3d8d96b5d1db
2020-06-30 20:14:49 -07:00
0203d70c63 [nit] fix some typo within documentation (#40692)
Summary:
Apologies if this seems trivial, but I'd like to fix these typos as I read through the source code. Thanks!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40692

Differential Revision: D22284651

Pulled By: mrshenli

fbshipit-source-id: 4259d1808aa4d15a02cfd486cfb44dd75fdc58f8
2020-06-30 19:24:44 -07:00
8e0714a60d [rfc] Reduce number of coin flips in RecordFunction (#40758)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40758

Currently we flip a coin for each sampled callback each time
we run RecordFunction. This PR is an attempt to skip most of the coin
flips (for the low-probability observers) while keeping the distribution
close to the original one
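
A standard way to do that, sketched below (an illustration of the idea, not the exact implementation): instead of a Bernoulli(p) flip per call, sample the gap until the next accepted call from a geometric distribution.

```python
import math
import random

def calls_until_next_hit(p: float) -> int:
    # Sample from Geometric(p): the number of Bernoulli(p) trials up to
    # and including the first success, without one coin flip per call.
    # Assumes 0 < p < 1.
    u = 1.0 - random.random()  # u in (0, 1]
    return int(math.log(u) / math.log(1.0 - p)) + 1
```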

Test Plan:
CI and record_function_benchmark
```
(python_venv) iliacher@devgpu151:~/local/pytorch  (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 30108 us.
Time per iteration (1x1): 1496.78 us.
Time per iteration (16x16): 2142.46 us.
Pure RecordFunction runtime of 10000000 iterations 687929 us, number of callback invocations: 978
(python_venv) iliacher@devgpu151:~/local/pytorch  (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 19051 us.
Time per iteration (1x1): 1581.89 us.
Time per iteration (16x16): 2195.67 us.
Pure RecordFunction runtime of 10000000 iterations 682402 us, number of callback invocations: 1023
(python_venv) iliacher@devgpu151:~/local/pytorch  (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 18715 us.
Time per iteration (1x1): 1566.11 us.
Time per iteration (16x16): 2131.17 us.
Pure RecordFunction runtime of 10000000 iterations 693571 us, number of callback invocations: 963
(python_venv) iliacher@devgpu151:~/local/pytorch  (reduce_coin_flops)$

(python_venv) iliacher@devgpu151:~/local/pytorch  (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 18814 us.
Time per iteration (1x1): 1536.2 us.
Time per iteration (16x16): 1985.82 us.
Pure RecordFunction runtime of 10000000 iterations 944959 us, number of callback invocations: 1015
(python_venv) iliacher@devgpu151:~/local/pytorch  (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 18278 us.
Time per iteration (1x1): 1526.32 us.
Time per iteration (16x16): 2093.77 us.
Pure RecordFunction runtime of 10000000 iterations 985307 us, number of callback invocations: 1013
(python_venv) iliacher@devgpu151:~/local/pytorch  (reduce_coin_flops)$ ./build/bin/record_function_benchmark
Warmup time: 18545 us.
Time per iteration (1x1): 1524.65 us.
Time per iteration (16x16): 2080 us.
Pure RecordFunction runtime of 10000000 iterations 952835 us, number of callback invocations: 1048
```

Reviewed By: dzhulgakov

Differential Revision: D22320879

Pulled By: ilia-cher

fbshipit-source-id: 2193f07d2f7625814fe7bc3cc85ba4092fe036bc
2020-06-30 17:23:00 -07:00
179dbd4f25 [jit] preserve keys on dictionary input tracing (#40792)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40792

Fixes https://github.com/pytorch/pytorch/issues/40529.

One followup should be to produce a better error message when a new
dictionary has different keys than the traced input. Right now it
presents as a fairly opaque `KeyError`.
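
A sketch of the new behavior (illustrative example, not from the PR):

```python
import torch

def f(d):
    return d["a"] + d["b"]

example = {"a": torch.ones(2), "b": torch.ones(2)}
traced = torch.jit.trace(f, (example,))
# Calling with the same keys works; a dict with different keys
# currently fails with an opaque KeyError.
traced({"a": torch.rand(2), "b": torch.rand(2)})
```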

Test Plan: Imported from OSS

Differential Revision: D22311731

Pulled By: suo

fbshipit-source-id: c9fbe0b54cf69daed2f11a191d988568521a3932
2020-06-30 16:50:36 -07:00
0ddaaf6a92 [codemod][caffe2] Run clang-format - 5/7
Summary:
This directory is opted-in to clang-format but is not format-clean. This blocks continuous formatting from being enabled on fbcode, and causes hassle for other codemods that leave inconsistent formatting. This diff runs clang-format, which is widely used and considered safe.

If you are unhappy with the formatting of a particular block, please *accept this diff* and then in a stacked commit undo the change and wrap that code in `// clang-format off` and `// clang-format on`, or `/* clang-format off */` and `/* clang-format on */`.

drop-conflicts

Test Plan: sandcastleit

Reviewed By: jerryzh168

Differential Revision: D22311706

fbshipit-source-id: 1ca59a82e96156a4a5dfad70ba3e64d44c5e762a
2020-06-30 15:45:11 -07:00
29aef8f460 Skip some error-producing exp tests that cannot be reliably reproduced (#40824)
Summary:
This is to take care of additional master CI tests for https://github.com/pytorch/pytorch/issues/39087
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40824

Differential Revision: D22321429

Pulled By: ezyang

fbshipit-source-id: 607e284688b3e4ce24d803a030e31991e4e32fd7
2020-06-30 15:39:09 -07:00
0a75234934 Allow np.memmap objects (numpy arrays based on files) to be processed… (#39847)
Summary:
Allow np.memmap objects to be processed by default_collate

np.memmap objects have the same behavior as numpy arrays; the only difference is that they are stored in a binary file on disk. However, the default_collate function used by the PyTorch DataLoader only accepts np.array and rejects np.memmap via type checking. This commit allows np.memmap objects to be processed by default_collate, so users can use large on-disk arrays with the PyTorch DataLoader.
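
A minimal sketch (the import path shown is the internal one used at this time and may differ across releases):

```python
import numpy as np
from torch.utils.data._utils.collate import default_collate

arr = np.memmap("data.bin", dtype=np.float32, mode="w+", shape=(4, 3))
batch = [arr[i] for i in range(4)]  # rows are np.memmap views
tensors = default_collate(batch)    # now collated like plain ndarrays
```
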
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39847

Reviewed By: ezyang

Differential Revision: D22284650

Pulled By: zou3519

fbshipit-source-id: 003e3208a2afd1afc2e4640df14b3446201e00b4
2020-06-30 15:00:20 -07:00
9d8dc0318b [pruning] add rowwise counter to sparse adagrad
Summary: Use the newly added counter op in sparse adagrad

Reviewed By: chocjy, ellie-wen

Differential Revision: D19221100

fbshipit-source-id: d939d83e3b5b3179f57194be2e8864d0fbbee2c1
2020-06-30 14:40:02 -07:00
40e79bb1d3 Update the version of ninja and scipy (#40677)
Summary:
Update scipy to 1.5 and ninja to 1.10.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40677

Differential Revision: D22311602

Pulled By: ezyang

fbshipit-source-id: ddc852b3b8c3091409d1b3bd579dd144b58e5d47
2020-06-30 14:29:40 -07:00
e762ce8ecf Avoid initializing new_group in test_backward_no_ddp. (#40727)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40727

This unit test doesn't need a process group, so we avoid initializing one.

#Closes: https://github.com/pytorch/pytorch/issues/40292
ghstack-source-id: 106817362

Test Plan: waitforbuildbot

Differential Revision: D22295131

fbshipit-source-id: 5a60e91e4beeb61cc204d24c564106d0215090a6
2020-06-30 14:01:05 -07:00
5a4911834d Add CUDA11 build and test (#40452)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40452

Differential Revision: D22316007

Pulled By: malfet

fbshipit-source-id: 94f4b4ba2a46ff3d3042ba842a615f8392cdc350
2020-06-30 13:50:44 -07:00
1571dd8692 Refactor duplicated string literals (#40788)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40788

Avoid repeating the same `:gencode[foo/bar]` over and over again

Test Plan: CI

Reviewed By: EscapeZero

Differential Revision: D22271151

fbshipit-source-id: f8db57db4ee0948bcca0c8945fdf30380ba81cae
2020-06-30 13:45:02 -07:00
6e4f99b063 Fix wrong MSVC version constraint for CUDA 9.2 (#40794)
Summary:
Tested with https://github.com/pytorch/pytorch/pull/40782.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40794

Differential Revision: D22318045

Pulled By: malfet

fbshipit-source-id: a737ffd7cb8a6a9efb62b84378318f4c3800ad8f
2020-06-30 13:02:45 -07:00
9ac0febb1f Pin torchvision version for doc_push (#40802)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40802

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22317343

Pulled By: ezyang

fbshipit-source-id: 8a982dd93a28d102dfd63163cd44704e899922e0
2020-06-30 12:52:13 -07:00
f3949794a3 Prototype benchmarking util (#38338)
Summary:
This is the prototype for the modular utils that we've been discussing. It is admittedly a large PR, but a good fraction of that is documentation and examples. I've trimmed a bit on the edges since we last discussed this design (for instance Timer is no longer Fuzzer aware), but it's mostly the same.

In addition to the library and hermetic examples, I've included `examples.end_to_end` which tests https://github.com/pytorch/pytorch/pull/38061 over a variety of shapes, dtypes, degrees of broadcasting, and layouts. (CC crcrpar)  I only did CPU as I'm not set up on a GPU machine yet. [Results from my devserver](https://gist.github.com/robieta/d1a8e1980556dc3f4f021c9f7c3738e2)

Key takeaways:
  1) For contiguous Tensors, larger dtypes (fp32 and fp64) and lots of reuse of the mask due to broadcasting, improvements are significant. (Presumably due to better vectorization?)
  2) There is an extra ~1.5 us overhead, which dominates small kernels.
  3) Cases with lower write intensity (int8, lower mask fraction, etc) or non-contiguous seem to suffer.

Hopefully this demonstrates the proof-of-concept for how this tooling can be used to tune kernels and assess PRs. Looking forward to thoughts and feedback.
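
As a taste of the API (a sketch; the prototype's module path may differ from where the utility later landed):

```python
from torch.utils.benchmark import Timer

t = Timer(
    stmt="torch.where(mask, x, y)",
    setup="import torch; x = torch.rand(1024); y = torch.rand(1024); mask = x > 0.5",
)
print(t.timeit(1000))
```
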
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38338

Differential Revision: D21551048

Pulled By: robieta

fbshipit-source-id: 6c50e5439a04eac98b8a2355ef731852ba0500db
2020-06-30 11:31:27 -07:00
c648cd372f Fix complex printing for sci_mode=True (#40513)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40513

This PR makes the following changes:
1. Complex printing now uses print formatting for its real and imaginary values, and they are joined at the end.
2. Change 1. naturally fixes the printing of complex tensors in sci_mode=True.

```
>>> torch.tensor(float('inf')+float('inf')*1j)
tensor(nan+infj)
>>> torch.randn(2000, dtype=torch.cfloat)
tensor([ 0.3015-0.2502j, -1.1102+1.2218j, -0.6324+0.0640j,  ...,
        -1.0200-0.2302j,  0.6511-0.1889j, -0.1069+0.1702j])
>>> torch.tensor([1e-3, 3+4j, 1e-5j, 1e-2+3j, 5+1e-6j])
tensor([1.0000e-03+0.0000e+00j, 3.0000e+00+4.0000e+00j, 0.0000e+00+1.0000e-05j,
        1.0000e-02+3.0000e+00j, 5.0000e+00+1.0000e-06j])
>>> torch.randn(3, dtype=torch.cfloat)
tensor([ 1.0992-0.4459j,  1.1073+0.1202j, -0.2177-0.6342j])
>>> x = torch.tensor([1e2, 1e-2])
>>> torch.set_printoptions(sci_mode=False)
>>> x
tensor([  100.0000,     0.0100])
>>> x = torch.tensor([1e2, 1e-2j])
>>> x
tensor([100.+0.0000j,   0.+0.0100j])
```

Test Plan: Imported from OSS

Differential Revision: D22309294

Pulled By: anjali411

fbshipit-source-id: 20edf9e28063725aeff39f3a246a2d7f348ff1e8
2020-06-30 11:13:42 -07:00
871bfaaba1 [JIT] Fix shape analysis for aten::masked_select. (#40753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40753

The reference says that this op always returns a 1-D tensor, even if
the input and the mask are 0-D.
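
For example:

```python
import torch

x = torch.tensor(3.0)         # 0-D input
mask = torch.tensor(True)     # 0-D mask
torch.masked_select(x, mask)  # tensor([3.]) -- the result is always 1-D
```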

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D22300354

Pulled By: ZolotukhinM

fbshipit-source-id: f6952989c8facf87d73d00505bf6d41573eff2d6
2020-06-30 11:04:50 -07:00
50d55b9f2b [JIT] Update type of the unsqueeze's output in shape analysis. (#40733)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40733

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D22298537

Pulled By: ZolotukhinM

fbshipit-source-id: a5d4597ed10bcf14d1b28e914bf898d0cae5b4c0
2020-06-30 11:01:45 -07:00
c3237c7a87 Print hostname of RoCM tester (#40755)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40755

Differential Revision: D22311699

Pulled By: malfet

fbshipit-source-id: 057702800fec84fae787b7837f39348273c80cec
2020-06-30 10:56:31 -07:00
a303fd2ea6 Let exp support complex types on CUDA and enable device/dtype in complex tests (#39087)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39087

Differential Revision: D22169697

Pulled By: anjali411

fbshipit-source-id: 4866b7be6742508cc40540ed1ac811f005531d8b
2020-06-30 10:50:40 -07:00
ef5a314597 [typing] fix register_buffer/parameter (#40669)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40669

Differential Revision: D22286130

Pulled By: ezyang

fbshipit-source-id: c0cc173279678978726895a0830343d5234e474e
2020-06-30 10:39:32 -07:00
5923a802fa Back out "[pytorch][PR] [ONNX] Add eliminate_unused_items pass"
Summary:
Original commit changeset: 30e1a6e8823a

It caused an issue with fusing BN.

Test Plan: revert

Reviewed By: houseroad

Differential Revision: D22296958

fbshipit-source-id: 62664cc77baa8811ad6ecce9d0520a2ab7f89868
2020-06-30 10:26:35 -07:00
3ecae99dd9 Support Pathlike for zipfile serialization (#40723)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40723
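
For example, `pathlib.Path` objects can now be passed directly (illustrative):

```python
from pathlib import Path
import torch

p = Path("model.pt")
torch.save(torch.ones(3), p)
t = torch.load(p)
```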

Test Plan: Imported from OSS

Differential Revision: D22294575

Pulled By: jamesr66a

fbshipit-source-id: b157fa0ab02c4eb22cb99ac870942aeab352b0c5
2020-06-30 10:07:23 -07:00
c56255499a Reverts running clang-tidy on ATen (#40764)
Summary:
Reverts https://github.com/pytorch/pytorch/pull/39713.

We are seeing CUDA-related clang-tidy failures on multiple PRs after the above change. The cause of these failures is unclear. Example error message:

```
2020-06-26T18:45:10.9763273Z + python tools/clang_tidy.py --verbose --paths torch/csrc/ aten/src/ATen/ --diff 5036c94a6e868963e0354fc04c92e204d8d77677 -g-torch/csrc/jit/serialization/export.cpp -g-torch/csrc/jit/serialization/import.cpp -g-torch/csrc/jit/serialization/import_legacy.cpp -g-torch/csrc/onnx/init.cpp '-g-torch/csrc/cuda/nccl.*' -g-torch/csrc/cuda/python_nccl.cpp
2020-06-26T18:45:11.1990578Z Error while processing /home/runner/work/pytorch/pytorch/aten/src/ATen/native/cuda/UnaryOpsKernel.cu.
2020-06-26T18:45:11.1992832Z Found compiler error(s).
2020-06-26T18:45:11.2286995Z Traceback (most recent call last):
2020-06-26T18:45:11.2288334Z   File "tools/clang_tidy.py", line 55, in run_shell_command
2020-06-26T18:45:11.2288607Z     output = subprocess.check_output(arguments).decode().strip()
2020-06-26T18:45:11.2289053Z   File "/opt/hostedtoolcache/Python/3.8.3/x64/lib/python3.8/subprocess.py", line 411, in check_output
2020-06-26T18:45:11.2289337Z     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
2020-06-26T18:45:11.2289786Z   File "/opt/hostedtoolcache/Python/3.8.3/x64/lib/python3.8/subprocess.py", line 512, in run
2020-06-26T18:45:11.2290038Z     raise CalledProcessError(retcode, process.args,
2020-06-26T18:45:11.2292206Z subprocess.CalledProcessError: Command '['clang-tidy', '-p', 'build', '-config', '{"Checks": "-*, bugprone-*, -bugprone-forward-declaration-namespace, -bugprone-macro-parentheses, -bugprone-lambda-function-name, cppcoreguidelines-*, -cppcoreguidelines-interfaces-global-init, -cppcoreguidelines-owning-memory, -cppcoreguidelines-pro-bounds-array-to-pointer-decay, -cppcoreguidelines-pro-bounds-constant-array-index, -cppcoreguidelines-pro-bounds-pointer-arithmetic, -cppcoreguidelines-pro-type-cstyle-cast, -cppcoreguidelines-pro-type-reinterpret-cast, -cppcoreguidelines-pro-type-static-cast-downcast, -cppcoreguidelines-pro-type-union-access, -cppcoreguidelines-pro-type-vararg, -cppcoreguidelines-special-member-functions, hicpp-exception-baseclass, hicpp-avoid-goto, modernize-*, -modernize-return-braced-init-list, -modernize-use-auto, -modernize-use-default-member-init, -modernize-use-using, -modernize-use-trailing-return-type, performance-*, -performance-noexcept-move-constructor, ", "HeaderFilterRegex": "torch/csrc/.*", "AnalyzeTemporaryDtors": false, "CheckOptions": null}', '-line-filter', '[{"name": "aten/src/ATen/native/cuda/UnaryOpsKernel.cu", "lines": [[10, 11], [29, 30]]}]', 'aten/src/ATen/native/cuda/UnaryOpsKernel.cu']' returned non-zero exit status 1.
2020-06-26T18:45:11.2292551Z
2020-06-26T18:45:11.2292684Z During handling of the above exception, another exception occurred:
2020-06-26T18:45:11.2292775Z
2020-06-26T18:45:11.2292894Z Traceback (most recent call last):
2020-06-26T18:45:11.2293208Z   File "tools/clang_tidy.py", line 306, in <module>
2020-06-26T18:45:11.2293364Z     main()
2020-06-26T18:45:11.2293817Z   File "tools/clang_tidy.py", line 298, in main
2020-06-26T18:45:11.2293980Z     clang_tidy_output = run_clang_tidy(options, line_filters, files)
2020-06-26T18:45:11.2294282Z   File "tools/clang_tidy.py", line 191, in run_clang_tidy
2020-06-26T18:45:11.2294439Z     output = run_shell_command(command)
2020-06-26T18:45:11.2294703Z   File "tools/clang_tidy.py", line 59, in run_shell_command
2020-06-26T18:45:11.2294931Z     raise RuntimeError("Error executing {}: {}".format(" ".join(arguments), error_output))
2020-06-26T18:45:11.2296875Z RuntimeError: Error executing clang-tidy -p build -config {"Checks": "-*, bugprone-*, -bugprone-forward-declaration-namespace, -bugprone-macro-parentheses, -bugprone-lambda-function-name, cppcoreguidelines-*, -cppcoreguidelines-interfaces-global-init, -cppcoreguidelines-owning-memory, -cppcoreguidelines-pro-bounds-array-to-pointer-decay, -cppcoreguidelines-pro-bounds-constant-array-index, -cppcoreguidelines-pro-bounds-pointer-arithmetic, -cppcoreguidelines-pro-type-cstyle-cast, -cppcoreguidelines-pro-type-reinterpret-cast, -cppcoreguidelines-pro-type-static-cast-downcast, -cppcoreguidelines-pro-type-union-access, -cppcoreguidelines-pro-type-vararg, -cppcoreguidelines-special-member-functions, hicpp-exception-baseclass, hicpp-avoid-goto, modernize-*, -modernize-return-braced-init-list, -modernize-use-auto, -modernize-use-default-member-init, -modernize-use-using, -modernize-use-trailing-return-type, performance-*, -performance-noexcept-move-constructor, ", "HeaderFilterRegex": "torch/csrc/.*", "AnalyzeTemporaryDtors": false, "CheckOptions": null} -line-filter [{"name": "aten/src/ATen/native/cuda/UnaryOpsKernel.cu", "lines": [[10, 11], [29, 30]]}] aten/src/ATen/native/cuda/UnaryOpsKernel.cu: error: cannot find libdevice for sm_20. Provide path to different CUDA installation via --cuda-path, or pass -nocudalib to build without linking with libdevice. [clang-diagnostic-error]
2020-06-26T18:45:11.2313329Z error: unable to handle compilation, expected exactly one compiler job in ' "/usr/bin/c++" "-cc1" "-triple" "x86_64-pc-linux-gnu" "-aux-triple" "nvptx64-nvidia-cuda" "-fsyntax-only" "-disable-free" "-disable-llvm-verifier" "-discard-value-names" "-main-file-name" "UnaryOpsKernel.cu" "-mrelocation-model" "pic" "-pic-level" "2" "-mthread-model" "posix" "-fno-trapping-math" "-masm-verbose" "-mconstructor-aliases" "-munwind-tables" "-fuse-init-array" "-target-cpu" "x86-64" "-dwarf-column-info" "-debugger-tuning=gdb" "-momit-leaf-frame-pointer" "-resource-dir" "/usr/lib/llvm-8/bin/../lib/clang/8.0.1" "-internal-isystem" "/usr/lib/llvm-8/bin/../lib/clang/8.0.1/include/cuda_wrappers" "-internal-isystem" "/usr/local/cuda/include" "-include" "__clang_cuda_runtime_wrapper.h" "-isystem" "/home/runner/work/pytorch/pytorch/build/third_party/gloo" "-isystem" "/home/runner/work/pytorch/pytorch/cmake/../third_party/gloo" "-isystem"
```

My guess is that our clang-tidy build is improperly configured to handle CUDA code. Until that issue is resolved this stops running clang-tidy on ATen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40764

Differential Revision: D22310032

Pulled By: mruberry

fbshipit-source-id: 035067e1017f0097026cee9866bba424dd4668b4
2020-06-30 09:35:55 -07:00
3cc18d7139 .circleci: Remove executor from windows uploads (#40742)
Summary:
This wasn't needed and broke nightly builds

Fixes some issues introduced in https://github.com/pytorch/pytorch/pull/40592/files

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40742

Differential Revision: D22310055

Pulled By: seemethere

fbshipit-source-id: 095be3be06a730138d860ca6b73eaf22c24cf08f
2020-06-30 09:29:29 -07:00
a6a31bcd47 Enable out_dims for vmap frontend API (#40576)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40576

`out_dims` specifies where in the output tensors the vmapped dimension
should appear. We implement this by simply creating a view with the
batch dimension moved to the desired position.

`out_dims` must either:
- be an int (the same value is used for all outputs), or
- be a Tuple[int] (so the user specifies one out_dim per output).
(See the vmap docstring for what we advertise out_dims to do; a small sketch follows.)
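
```python
import torch

xs = torch.randn(3, 5)
# Sketch (torch.vmap was a prototype API at this point): map over dim 0
# of xs, but place the mapped dimension at position 1 of the output,
# so the result has shape (5, 3).
ys = torch.vmap(lambda x: x * 2, in_dims=0, out_dims=1)(xs)
assert ys.shape == (5, 3)
```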

I also renamed `TestVmap` to `TestVmapAPI` to make it clearer that we
are testing the API here and not specific operators (which will go into
their own test class).

Test Plan: - `pytest test/test_vmap.py -v`

Differential Revision: D22288086

Pulled By: zou3519

fbshipit-source-id: c8666cb1a0e22c54473d8045477e14c2089167cf
2020-06-30 08:20:39 -07:00
2f94b7f95c Initial vmap docstring (#40575)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40575

This provides some more context for the next ~2 PRs that will implement
the `out_dims` and `in_dims` functionality. I will probably add more to
it later (things I think we should add: examples (maybe in a dedicated
docs page), specific examples of things vmap cannot handle).

Test Plan:
- Code reading for now. When we are ready to add vmap to master documentation,
I'll build the docs and fix any formatting problems.

Differential Revision: D22288085

Pulled By: zou3519

fbshipit-source-id: 6e28d7bd524242395160c20270159b4b121d6789
2020-06-30 08:18:20 -07:00
4a235b87be pop warning message for cuda module when asan is built in (#35088)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35088

Test Plan: Imported from OSS

Differential Revision: D20552708

Pulled By: glaringlee

fbshipit-source-id: 0b809712378596ccf83211bf8ae39cd71c27dbba
2020-06-30 08:00:37 -07:00
4104ab8b18 Add torch.count_nonzero (#39992)
Summary:
Reference https://github.com/pytorch/pytorch/issues/38349

TODO:

* [x] Add tests
* [x] Add docs (pending add to docs.rst)
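
Usage sketch:

```python
import torch

x = torch.tensor([[0, 1, 2],
                  [0, 0, 3]])
torch.count_nonzero(x)         # tensor(3)
torch.count_nonzero(x, dim=0)  # tensor([0, 1, 2])
```
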
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39992

Reviewed By: ezyang

Differential Revision: D22236738

Pulled By: mruberry

fbshipit-source-id: 8520068b086b5ffc4de9e4939e746ff889293987
2020-06-30 06:39:13 -07:00
31de10a392 Int8FC dequantize fix (#40608)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40608

Changes to fix uint8_t to fp16 dequantization error.
Enabled test_int8_quantize

(Note: this ignores all push blocking failures!)

Test Plan: Verified with test_int8_ops_nnpi.py

Reviewed By: hyuen

Differential Revision: D22252860

fbshipit-source-id: bb44673327f0c8f44974cef2ab773aa0d89f4dc7
2020-06-30 06:20:09 -07:00
b9cca4b186 fix range of results for pairwise operations (#40728)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40728

There are two reasons the test is failing:
1) division by zero
2) the result exceeds the fp16 maximum

For 1), make the divisor some safe number like 1e-3.
For 2), clip whenever a combination of random numbers produces a value bigger than 65e3 (close to the fp16 max of 65504).

Multiplication is fine because the random numbers are in [0, 100], so the result stays in [0, 10000].

Test Plan: ran the test_div test

Reviewed By: hl475

Differential Revision: D22295934

fbshipit-source-id: 173f3f2187137d6c1c4d4a505411a27f1c059f1a
2020-06-29 23:49:08 -07:00
a371652bc8 Allow to get string references to strings inside torch::List (#39763)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39763

This is an ask from fluent. For performance reasons, they need a way to get read access to the std::string inside of a torch::List<std::string> without having to copy that string.

Instead of special casing std::string, we decided to give access to the underlying value. The API now looks like:

```cpp
torch::List<std::string> list = ...;
const std::string& str = list[2].toIValueRef().toStringRef();
```
ghstack-source-id: 106806840

Test Plan: unit tests

Reviewed By: ezyang

Differential Revision: D21966183

fbshipit-source-id: 8b80b0244d10215c36b524d1d80844832cf8b69a
2020-06-29 20:52:32 -07:00
fabd60ec1a Add comment with UNBOXEDONLY explanation to codegen (#40117)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40117

ghstack-source-id: 106804731

Test Plan: just comments

Reviewed By: ezyang

Differential Revision: D22075103

fbshipit-source-id: 76677dc337196b71c50075f2845a1899451a705f
2020-06-29 20:50:45 -07:00
01e2099bb8 [TB] Add support for hparam domain_discrete (#40720)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40720

Add support for populating domain_discrete field in TensorBoard add_hparams API
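
A sketch of the resulting call (keyword name `hparam_domain_discrete` assumed from the description):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
writer.add_hparams(
    {"lr": 0.1, "optimizer": "sgd"},
    {"accuracy": 0.9},
    hparam_domain_discrete={"optimizer": ["sgd", "adam"]},
)
writer.flush()
```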

Test Plan: Unit test test_hparams_domain_discrete

Reviewed By: edward-io

Differential Revision: D22291347

fbshipit-source-id: 78db9f62661c9fe36cd08d563db0e7021c01428d
2020-06-29 19:33:57 -07:00
53af9df557 Unify boxed function signature between jit and c10 (#37034)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37034

c10 takes a Stack* in boxed functions while JIT took Stack&.
c10 doesn't return anything while JIT returns an int which is always zero.

This changes JIT to follow the c10 behavior.
ghstack-source-id: 106834069

Test Plan: unit tests

Differential Revision: D20567950

fbshipit-source-id: 1a7aea291023afc52ae706957e9a5ca576fbb53b
2020-06-29 19:24:26 -07:00
320164f878 Fix zip serialization for file > 2GiB (#40722)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40722

Test Plan: Imported from OSS

Differential Revision: D22294016

Pulled By: jamesr66a

fbshipit-source-id: 0288882873d4b59bdef37d018c030519c4be7f03
2020-06-29 19:17:06 -07:00
9393ac011a [CUDA] addmm for complex (#40431)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40431

Test Plan: Imported from OSS

Differential Revision: D22285916

Pulled By: anjali411

fbshipit-source-id: 5863c713bdaa8e5b4f3d2b41fa59108502145a23
2020-06-29 17:41:46 -07:00
d7cd16858f Add documentation about storage sharing is preserved and serialized f… (#40412)
Summary:
…ile size.
fixes https://github.com/pytorch/pytorch/issues/40157
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40412

Reviewed By: ezyang

Differential Revision: D22265639

Pulled By: ailzhang

fbshipit-source-id: 16b0301f16038bd784e7e92f63253fedc7820adc
2020-06-29 17:23:29 -07:00
8f5b28674c [JIT] Remove dead store in quantization_patterns.h (#40724)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40724

Test Plan: Continuous integration.

Differential Revision: D22294600

Pulled By: SplitInfinity

fbshipit-source-id: 04546579273d8864d91c3c74a654aa75ba34ee45
2020-06-29 16:55:15 -07:00
0235676f8a [pytorch][ci] run mobile code analysis on PR (#40247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40247

This CI job was bypassed on PRs because most of it has already been
covered by the mobile-custom-build-dynamic job that runs on every PR.

However, it can still fail independently because it builds and analyzes
a small test project, e.g. if people forget to update the registration API
used in the test project.

So this PR changed it to only build and analyze the test project and run
the job on every PR.

Test Plan: Imported from OSS

Differential Revision: D22126044

Pulled By: ljk53

fbshipit-source-id: 6699a200208a65b249bd3a4e43ad72bc07388ce3
2020-06-29 16:44:45 -07:00
6e1cf000b3 [jit][oacr] Add some operators for Assistant NLU joint lite model (#40126)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40126

These are needed for benchmarking / running our model, following Step 7 in the [Lite interpreter wiki](https://www.internalfb.com/intern/wiki/PyTorch/PyTorchDev/Mobile/Lite_Interpreter/#make-your-model-work-wit) and [this thread](https://www.internalfb.com/intern/qa/56293/atenemptymemory_format-missing-on-fb4a).

Test Plan: Sandcastle

Reviewed By: iseeyuan

Differential Revision: D22073611

fbshipit-source-id: daa46a39c386806be8d5d589740663e85451757e
2020-06-29 16:41:04 -07:00
21de450fcb Fix batch size zero for QNNPACK linear_dynamic (#40588)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40588

Two bugs were preventing this from working.  One was a divide by zero
when multithreading was enabled, fixed similarly to the fix for static
quantized linear in the previous commit.  The other was the computation of
min and max to determine qparams: FBGEMM uses [0,0] for [min,max] of
empty input, so we do the same.

Test Plan: Added a unit test.

Differential Revision: D22264415

Pulled By: dreiss

fbshipit-source-id: 6ca9cf48107dd998ef4834e5540279a8826bc754
2020-06-29 16:31:11 -07:00
14145f9775 Fix and reenable threaded QNNPACK linear (#40587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40587

Previously, this was causing divide-by-zero only in the multithreaded
empty-batch case, while calculating tiling parameters for the threads.
In my opinion, the bug here is using a value that is allowed to be zero
(batch size) for an argument that should not be zero (tile size), so I
fixed the bug by bailing out right before the call to
pthreadpool_compute_4d_tiled.

Test Plan: TestQuantizedOps.test_empty_batch

Differential Revision: D22264414

Pulled By: dreiss

fbshipit-source-id: 9446d5231ff65ef19003686f3989e62f04cf18c9
2020-06-29 16:29:29 -07:00
9ca4a46bf8 Implement parallel scatter reductions for CPU (#36447)
Summary:
This PR implements gh-33389.

As a result of this PR, users can now specify various reduction modes for scatter operations. Currently, `add`, `subtract`, `multiply` and `divide` have been implemented, and adding new ones is not hard.
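
For example:

```python
import torch

res = torch.zeros(3)
index = torch.tensor([0, 1, 0])
src = torch.tensor([1.0, 2.0, 3.0])
# Accumulate src into res at the given indices along dim 0.
res.scatter_(0, index, src, reduce="add")
# res is now tensor([4., 2., 0.])
```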

While we now allow dynamic runtime selection of reduction modes, the performance is the same as as was the case for the `scatter_add_` method in the master branch. Proof can be seen in the graph below, which compares `scatter_add_` in the master branch (blue) and `scatter_(reduce="add")` from this PR (orange).
![scatter-regression py csv](https://user-images.githubusercontent.com/2629909/82671491-e5e22380-9c79-11ea-95d6-6344760c8578.png)

The script used for benchmarking is as follows:
``` python
import os
import sys
import torch
import time
import numpy
from IPython import get_ipython

Ms=256
Ns=512
dim = 0
top_power = 2
ipython = get_ipython()

plot_name = os.path.basename(__file__)
branch = sys.argv[1]
fname = open(plot_name + ".csv", "a+")

for pM in range(top_power):
    M = Ms * (2 ** pM)
    for pN in range(top_power):
        N = Ns * (2 ** pN)
        input_one = torch.rand(M, N)
        index = torch.tensor(numpy.random.randint(0, M, (M, N)))
        res = torch.randn(M, N)

        test_case = f"{M}x{N}"
        print(test_case)
        tobj = ipython.magic("timeit -o res.scatter_(dim, index, input_one, reduce=\"add\")")

        fname.write(f"{test_case},{branch},{tobj.average},{tobj.stdev}\n")

fname.close()
```

Additionally, one can see that various reduction modes take almost the same time to execute:
```
op: add
70.6 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26.1 µs ± 26.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
op: subtract
71 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26.4 µs ± 34.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
op: multiply
70.9 µs ± 31.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
27.4 µs ± 29.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
op: divide
164 µs ± 48.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
52.3 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
Script:
``` python
import torch
import time
import numpy
from IPython import get_ipython

ipython = get_ipython()

nrows = 3000
ncols = 10000
dims = [nrows, ncols]

res = torch.randint(5, 10, dims)
idx1 = torch.randint(dims[0], (1, dims[1])).long()
src1 = torch.randint(5, 10, (1, dims[1]))
idx2 = torch.randint(dims[1], (dims[0], 1)).long()
src2 = torch.randint(5, 10, (dims[0], 1))

for op in ["add", "subtract", "multiply", "divide"]:
    print(f"op: {op}")
    ipython.magic("timeit res.scatter_(0, idx1, src1, reduce=op)")
    ipython.magic("timeit res.scatter_(1, idx2, src2, reduce=op)")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36447

Differential Revision: D22272631

Pulled By: ngimel

fbshipit-source-id: 3cdb46510f9bb0e135a5c03d6d4aa5de9402ee90
2020-06-29 15:52:11 -07:00
11a74a58c8 Setter for real and imag tensor attributes (#39860)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39860
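
Illustration of the new setters:

```python
import torch

z = torch.zeros(2, dtype=torch.cfloat)
z.real = torch.tensor([1.0, 2.0])
z.imag = torch.tensor([3.0, 4.0])
# z is now tensor([1.+3.j, 2.+4.j])
```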

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D22163234

Pulled By: anjali411

fbshipit-source-id: 35b4aa16499341edff1a4be4076539ac7c74f5be
2020-06-29 15:44:55 -07:00
fd90e4b309 [CircleCI] Add RocM build/test jobs (#39760)
Summary:
Set PYTORCH_ROCM_ARCH to `gfx900;gfx906` if `CIRCLECI` environment variable is defined
Add RocM build test jobs and schedule them on `xlarge` and `amd-gpu` resource classes respectively.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39760

Differential Revision: D22290335

Pulled By: malfet

fbshipit-source-id: 7462f97b262abcacac3e515086ac6236a45626d2
2020-06-29 14:15:44 -07:00
63e5a53b8c DNNL: fix build error when DNNL using TBB threading pool (#40699)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40699

Differential Revision: D22286334

Pulled By: albanD

fbshipit-source-id: 0635a0a5e4bf80d44d90c86945d92e98e26ef480
2020-06-29 13:53:18 -07:00
ed83b9a4be Change function parameter self to input in torch.__init__.pyi (#40235)
Summary:
Fix https://github.com/pytorch/pytorch/issues/40223: Incorrect "self" keyword arguments in `torch.__init__.pyi` type hints
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40235

Differential Revision: D22285816

Pulled By: ezyang

fbshipit-source-id: ebc35290c0c625916289f1a46abc6ff2197f4bcf
2020-06-29 13:49:13 -07:00
d2e16dd888 Remove constexpr for NVCC on Windows (#40675)
Summary:
They are not well supported. Fixes https://github.com/pytorch/pytorch/issues/40393 and https://github.com/pytorch/pytorch/issues/39394.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40675

Differential Revision: D22286031

Pulled By: ezyang

fbshipit-source-id: 7e309916ae21cd3909ee6466952ba89847c74d71
2020-06-29 10:58:42 -07:00
4a174c83ca Add option to preserve certain methods during optimize_for_mobile. (#40629)
Summary:
By default the freeze_module pass, invoked from optimize_for_mobile,
preserves only the forward method. There is an option to specify a list of
methods that should be preserved during freeze_module. This PR exposes that
option to the optimize_for_mobile pass.
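
A sketch of the resulting usage (the keyword name `preserved_methods` is assumed from the description):

```python
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

    @torch.jit.export
    def get_answer(self) -> int:
        return 42

m = torch.jit.script(M())
# Keep get_answer alive through freezing, in addition to forward.
opt = optimize_for_mobile(m, preserved_methods=["get_answer"])
```
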
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40629

Test Plan: python test/test_mobile_optimizer.py

Reviewed By: dreiss

Differential Revision: D22260972

Pulled By: kimishpatel

fbshipit-source-id: 452c653269da8bb865acfb58da2d28c23c66e326
2020-06-29 09:32:53 -07:00
4121d34036 Python/C++ API Parity: Add impl and tests for ParameterDict (#40654)
Summary:
This diff contains the implementation of the C++ API for ParameterDict from https://github.com/pytorch/pytorch/issues/25883; refer to https://github.com/pytorch/pytorch/issues/36904 and https://github.com/pytorch/pytorch/issues/28652
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40654

Test Plan: Add unit test in this diff

Differential Revision: D22273265

Pulled By: glaringlee

fbshipit-source-id: 9134a92c95eacdd53d5b24470d5f7edbeb40a488
2020-06-29 08:50:44 -07:00
b35cdc5200 [Fix] "torch_common target shared by lite-interpreter and full-jit" and turn on query-based selective build (#40673)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40673

As title. We planned to have lite-interpreter and full-jit co-exist short-term. To avoid duplicated symbols and operator registrations during dynamic lib loading, we put the common files in a separate component.

The original source file list names are reserved.
ghstack-source-id: 106757184

Test Plan: CI

Reviewed By: kwanmacher

Differential Revision: D22276185

fbshipit-source-id: 328a8ba9c3d88437da0d30c6e6791087d0df5e2e
2020-06-28 16:38:52 -07:00
b4db529352 Fix wrong link in docs/source/notes/ddp.rst (#40484)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40484

Differential Revision: D22259834

Pulled By: mrshenli

fbshipit-source-id: 4ec912c600c81010bdb2778c35cbb0321480199f
2020-06-28 13:55:56 -07:00
502ec8f7f7 Revert D22227939: [TB] Add support for hparam domain_discrete
Test Plan: revert-hammer

Differential Revision:
D22227939 (4c25428c8c)

Original commit changeset: d2f0cd8e5632

fbshipit-source-id: c4329fcead69cb0f3d368a254d8756fb04be742d
2020-06-27 22:20:31 -07:00
5377827b3e Revert D22275201: [Fix] torch_common target shared by lite-interpreter and full-jit
Test Plan: revert-hammer

Differential Revision:
D22275201 (1399655a98)

Original commit changeset: dafd3ad36bb3

fbshipit-source-id: a89c8b1fbb55eb7c116dd6ca9dad04bb90727c0a
2020-06-27 22:00:19 -07:00
521722751f Add examples and tests for combining static/class method with async execution (#40619)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40619

Test Plan: Imported from OSS

Differential Revision: D22258407

Pulled By: mrshenli

fbshipit-source-id: 036d85a2affc4505efd2df197fc513dba010e359
2020-06-27 20:42:23 -07:00
1399655a98 [Fix] torch_common target shared by lite-interpreter and full-jit
Summary:
Pull the shared source files to "torch_common" to avoid duplicated symbols and operator registrations.

(Note: this ignores all push blocking failures!)

Test Plan:
CI
buck install -c fbandroid.force_native_library_merge_map=true -c pt.build_from_deps_query=1 -c pt.selective_build=0 -c pt.static_dispatch=0 -r fb4a

Reviewed By: kwanmacher

Differential Revision: D22275201

fbshipit-source-id: dafd3ad36bb33e3ec33f4accfdc5af1d5f8ab775
2020-06-27 17:48:32 -07:00
21991b63f5 Migrate dot from the TH to Aten (CPU) (#40354)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24692
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40354

Reviewed By: ezyang

Differential Revision: D22214203

Pulled By: ngimel

fbshipit-source-id: 500e60d1c02b3b39db19b518f2af43cd69f2e984
2020-06-27 17:11:10 -07:00
4c25428c8c [TB] Add support for hparam domain_discrete
Summary: Add support for populating domain_discrete field in TensorBoard add_hparams API

Test Plan: Unit test test_hparams_domain_discrete

Reviewed By: edward-io

Differential Revision: D22227939

fbshipit-source-id: d2f0cd8e5632cbcc578466ff3cd587ee74f847af
2020-06-27 14:07:24 -07:00
2456e078d3 [TB] Support custom run_name in add_hparams (#40660)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40660

Support custom run_name since using timestamp as run_name can be confusing to people

Test Plan:
hp = {"lr": 0.1, "bool_var": True, "string_var": "hi"}
  mt = {"accuracy": 0.1}
  writer.add_hparams(hp, mt, run_name="run1")
  writer.flush()

Reviewed By: edward-io

Differential Revision: D22157749

fbshipit-source-id: 3d4974381e3be3298f3e4c40e3d4bf20e49dfb07
2020-06-27 14:05:20 -07:00
15be823455 caffe2 | Revert range loop analysis fix
Summary: This reverts a change that was made to fix range loop analysis warning.

Test Plan: CI

Reviewed By: nlutsenko

Differential Revision: D22274461

fbshipit-source-id: dedc3fcaa6e32259460380163758d6c9c9b73211
2020-06-27 13:02:23 -07:00
68042c7466 Skip mypy on pynightly if numpy-1.20.0-dev0... is used (#40656)
Summary:
Also modernize the test script itself by using `mypy.api.run` rather than `subprocess.call`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40656

Differential Revision: D22274421

Pulled By: malfet

fbshipit-source-id: 59232d4d37ee01cda56375b84ac1476d16686bfe
2020-06-27 09:08:50 -07:00
ac8c8b028d [ROCm] restore jit tests (#40447)
Summary:
Remove `skipIfRocm` from most jit tests and enable `RUN_CUDA_HALF` tests for ROCm.

These changes passed more than three rounds of CI testing against the ROCm CI.

CC ezyang xw285cornell sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40447

Differential Revision: D22190711

Pulled By: xw285cornell

fbshipit-source-id: bac44825a2675d247b3abe2ec2f80420a95348a3
2020-06-27 01:03:59 -07:00
411bc2b8d5 [quant][graphmode][fix] remove unsupported ops in the list (#40653)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40653

(Note: this ignores all push blocking failures!)

Test Plan: Imported from OSS

Differential Revision: D22271413

fbshipit-source-id: a01611b5d90849ac673fa5a310f910c858e907a3
2020-06-27 00:07:57 -07:00
61a8de77cf [quant] aten::repeat work for quantized tensor (#40644)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40644

Test Plan: Imported from OSS

Differential Revision: D22268558

fbshipit-source-id: 3bc9a129bece1b547c519772ecc6b980780fb904
2020-06-26 22:54:19 -07:00
0309f6a4bb [quant][graphmode][fix] cloning schema in insert_observers (#40624)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40624

Previously we didn't clone the schema, so the default schema was used; this
was causing issues for some models

Test Plan: Imported from OSS

Differential Revision: D22259519

fbshipit-source-id: e2a393a54cb18f55da0c7152a74ddc22079ac350
2020-06-26 20:19:09 -07:00
0a19534dd2 [JIT] Remove dead store in quantization_patterns.h (#40623)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40623

Test Plan: Continuous integration.

Reviewed By: jerryzh168

Differential Revision: D22259209

fbshipit-source-id: 90c9e79e039100f2961195504bb81230bba5c5fe
2020-06-26 19:43:43 -07:00
e368b11226 [JIT] Remove dead stores in loopnest.cpp (#40626)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40626

Test Plan: Continuous integration.

Reviewed By: ZolotukhinM

Differential Revision: D22259586

fbshipit-source-id: 447accb5b94392f0b5e4c27956a34403bb0d1ea8
2020-06-26 19:28:03 -07:00
15864d1703 Skip allreducing local_used_maps_dev_ when find_unused_param=False
Summary:
1. In reducer.cpp, we have a new boolean `find_unused_param_` and its value is set in `Reducer::prepare_for_backward`.
If `!find_unused_param_`, then it avoids `allreduce(local_used_maps_dev_)`.
2. Solves issue [38942](https://github.com/pytorch/pytorch/issues/38942).
3. Fixes incorrect inference of `find_unused_parameters_`, which was previously derived from checks like `outputs.empty()` or `unused_parameters_.empty()`.

ghstack-source-id: 106693089

Test Plan:
1. Run `test/distributed/test_c10d.py` and make sure all tests pass.
2. A new test case `test_find_unused_parameters_when_unused_parameters_empty` is included. Old `reducer.cpp` was failing in that unit test because it was checking `find_unused_parameters_` by `unused_parameters_.empty()`. Current `reducer.cpp` passes this unit test.
3. Two test cases were failing `test_forward_backward_unused_parameters` and `test_forward_backward_optimizer` , because `find_unused_parameter_` of their `reducer` object was not set properly. I fixed that as well.

Imported from OSS

**Output of version 14:**
```
................s.....s...............................................test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  tensor = torch.full([100, 100], self.rank)
test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  tensor = torch.full([100, 100], self.rank)
test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  tensor = torch.full([100, 100], self.rank)
test/distributed/test_c10d.py:1531: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  tensor = torch.full([100, 100], self.rank)
.test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  self.assertEqual(torch.full([10, 10], self.world_size), tensor)
test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  self.assertEqual(torch.full([10, 10], self.world_size), tensor)
test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  self.assertEqual(torch.full([10, 10], self.world_size), tensor)
test/distributed/test_c10d.py:1554: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning. (Triggered internally at  ../aten/src/ATen/native/TensorFactories.cpp:364.)
  self.assertEqual(torch.full([10, 10], self.world_size), tensor)
.....s...............................
----------------------------------------------------------------------
Ran 108 tests in 214.210s

OK (skipped=3)
```

Differential Revision: D22176231

fbshipit-source-id: b5d15f034e13a0915a474737779cc5aa8e068836
2020-06-26 19:20:59 -07:00
4102fbdf08 [1/n] Allow dense NaN value in dper raw input processor output
Summary:
## TLDR
Support using a NaN default value for missing dense features in RawInputProcessor for *DPER2*, in preparation for subsequent support for null-flag features in *compute meta*. For train_eval this is already supported in DPER3, and we do not plan to support it in DPER2 train_eval.
## Overview
Intern project plan to support adding dense flags for missing feature values instead of replacing with zero.

Project plan:
https://docs.google.com/document/d/1OsPUTjpJycwxWLCue3Tnb1mx0uDC_2KKWvC1Rwpo2NI/edit?usp=sharing

## Code paths:
See https://fb.quip.com/eFXUA0tbDmNw for the call stack for all affected code paths.

Test Plan:
# A. DPER3 blob value inspection
## 1. Build local bento kernel in fbcode folder
`buck build mode/dev-nosan //bento/kernels:bento_kernel_ads_ranking`

## 2. Use kernel `ads_ranking (local)` to print dense feature blob values
n280239

## 2.1 Try `default_dense_value = "0.0"` (default)
```
preproc_6/feature_preproc_6/dper_feature_processor_7/raw_input_proc_7/float_feature_sparse_to_dense_7/float_features [[0.       ]
 [0.       ]
 [0.       ]
 [0.       ]
 [0.       ]
 [0.       ]
 [0.       ]
 [1.       ]
 [1.7857143]
 [1.7777778]
 [1.       ]
 [0.       ]
 [0.5625   ]
 [0.       ]
 [0.       ]
 [0.8      ]
 [0.       ]
 [1.       ]
 [0.56     ]
 [0.       ]]
```
## 2.2 Try `default_dense_value = "123"`
```
preproc_2/feature_preproc_2/dper_feature_processor_3/raw_input_proc_3/float_feature_sparse_to_dense_3/float_features [[123.       ]
 [123.       ]
 [123.       ]
 [123.       ]
 [123.       ]
 [123.       ]
 [123.       ]
 [  1.       ]
 [  1.7857143]
 [  1.7777778]
 [  1.       ]
 [123.       ]
 [  0.5625   ]
 [123.       ]
 [123.       ]
 [  0.8      ]
 [123.       ]
 [  1.       ]
 [  0.56     ]
 [123.       ]]
```
## 2.3 Try `default_dense_value = float("nan")`
```
RuntimeError: [enforce fail at enforce_finite_op.h:40] std::isfinite(input_data[i]). Index 0 is not finite (e.g., NaN, Inf): -nan (Error from operator:
input: "unary_4/logistic_regression_loss_4/average_loss_4/average_loss" name: "" type: "EnforceFinite" device_option { random_seed: 54 })
```
which is expected due to nan input.

# B. Unit test
`buck test  fblearner/flow/projects/dper/tests/preprocs:raw_feature_extractor_test`

https://www.internalfb.com/intern/testinfra/testconsole/testrun/5348024586274923/

{F241336814}

Differential Revision: D21961595

fbshipit-source-id: 3dcb153b3c7f42f391584f5e7f52f3d9c76de31f
2020-06-26 16:54:14 -07:00
897e610c82 FP16 rounding-to-nearest for row-wise SparseAdagrad fusion (#40466)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40466

Extend row wise sparse Adagrad fusion op to FP16 (rounding-to-nearest) for PyTorch.

Reviewed By: jianyuh

Differential Revision: D22003571

fbshipit-source-id: e97e01745679a9f6e7b0f81ce5a6ebf4d4a1df41
2020-06-26 16:14:59 -07:00
47c72be3d7 Port /test/cpp_extensions/rng_extension.cpp to new operator registration API (#39459)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39459

Update to this PR: this code isn't going to fully solve https://github.com/pytorch/pytorch/issues/37010. The changes required for 37010 are more than this PR initially planned. Instead, this PR switches op registration of the rng-related tests to use the new API (similar to what was done in #36925)

Test Plan:
1) unit tests

Imported from OSS

Reviewed By: ezyang

Differential Revision: D22264889

fbshipit-source-id: 82488ac6e3b762a756818434e22c2a0f9cb9dd47
2020-06-26 16:12:54 -07:00
24a8614cac [Reland][doc] Add overflow notice for cuFFT on half precision (#40551)
Summary:
Reland of https://github.com/pytorch/pytorch/issues/35594
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40551

Reviewed By: ezyang

Differential Revision: D22249831

Pulled By: ngimel

fbshipit-source-id: b221b3c0a490ccaaabba50aa698a2490536e0917
2020-06-26 15:40:19 -07:00
6debc28964 Ignore error code from apt-get purge (#40631)
Summary:
This replicates the pattern of other best-effort ("do for luck") commands.
Prep change to add ROCm to CircleCI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40631

Differential Revision: D22261707

Pulled By: malfet

fbshipit-source-id: 3dadfa434deab866a8800715f3197e84169cf43e
2020-06-26 13:34:07 -07:00
375cd852fa Add a utility function for bundling large input tensors (#37055)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37055

Sometimes it's okay to bundle a large example input tensor with a model.
Add a utility function to make it easy for users to do that *on purpose*.

Test Plan: Unit test.

Differential Revision: D22264239

Pulled By: dreiss

fbshipit-source-id: 05c6422be1aa926cca850f994ff1ae83c0399119
2020-06-26 13:34:02 -07:00
41ea7f2d86 Add channels-last support to bundled_inputs (#36764)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36764

This allows bundling inputs that are large uniform buffers in
channels-last memory format.

Test Plan: Unit test.

Differential Revision: D21142660

Pulled By: dreiss

fbshipit-source-id: 31bbea6586d07c1fd0bcad4cb36ed2b8bb88a7e4
2020-06-26 13:31:17 -07:00
edac323378 Add special rules to launch docker image with ROCm (#40632)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40632

Differential Revision: D22262316

Pulled By: malfet

fbshipit-source-id: 3d525767bfbfc8e2497541849d85cabf0379a43b
2020-06-26 13:28:36 -07:00
0494e0ad70 Back out "Revert D21581908: Move TensorOptions ops to c10" (#40595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40595

ghstack-source-id: 106691774

Test Plan: waitforsandcastle

Differential Revision: D22247729

fbshipit-source-id: 14745588cae267c1e0cc51cd9541a9b8abb830e5
2020-06-26 12:57:09 -07:00
b8f4f6868d [JIT] Remove dead store in exit_transforms.cpp (#40611)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40611

This commit removes a dead store in `transformWith` of exit_transforms.cpp.

Test Plan: Continuous integration.

Reviewed By: suo

Differential Revision: D22254136

fbshipit-source-id: f68c4625f7be8ae29b3500303211b2299ce5d6f6
2020-06-26 12:35:58 -07:00
a62f8805e7 Update TensorPipe submodule (#40614)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40614

This update pulls in a oneliner fix, which sets the TCP_NODELAY option on the TCP sockets of the UV transport. This leads to exceptional performance gains in terms of latency, with about a 25x improvement in one simple benchmark. This thus resolves a regression that TensorPipe had compared to the ProcessGroup agent and, in fact, ends up beating it by 2x.

The benchmark I ran is this, with the two endpoints pinned to different cores of the same machine:
```
@torch.jit.script
def remote_fn(t: int):
    return t

@torch.jit.script
def local_fn():
    for _ in range(1_000_000):
        fut = rpc.rpc_async("rhs", remote_fn, (42,))
        fut.wait()
```

And the average round-trip time (one iteration) is:
- TensorPipe with SHM: 97.2 us
- TensorPipe with UV _after the fix_: 205us
- Gloo: 440us
- TensorPipe with UV _before the fix_: 5ms
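
For context, the oneliner amounts to disabling Nagle's algorithm on the transport's TCP sockets. In Python terms the equivalent call is (a sketch, not the actual libuv/TensorPipe code):

```
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle's algorithm: send small RPC messages immediately instead
# of buffering them while waiting for outstanding ACKs.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
```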

Test Plan: Ran PyTorch RPC test suite

Differential Revision: D22255393

fbshipit-source-id: 3f6825d03317d10313704c05a9280b3043920507
2020-06-26 11:45:51 -07:00
5036c94a6e properly skip legacy tests regardless of the default executor (#40381)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40381

Differential Revision: D22173938

Pulled By: Krovatkin

fbshipit-source-id: 305fc4484977e828cc4cee6e053a1e1ab9f0d6c7
2020-06-26 11:13:50 -07:00
7676682584 Fix illegal opcode bug in caffe2 (#40584)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40584

Also patch [this github issue](https://github.com/pytorch/pytorch/issues/33124)
involving an illegal assembly instruction in 8x8-dq-aarch64-neon.S.

Test Plan:
Build binaries, copy to shaker, run executables. Also run all
existing caffe tests.

Reviewed By: kimishpatel

Differential Revision: D22240670

fbshipit-source-id: 51960266ce58699fe6830bcf75632b92a122f638
2020-06-26 11:11:54 -07:00
fb5d784fb4 Further reduce windows build/test matrix (#40592)
Summary:
Switch windows CPU testers from `windows.xlarge` to `windows.medium` class.
Remove the VS 14.16 CUDA build.
Only do smoke force-on-cpu tests using the VS2019+CUDA10.1 config.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40592

Differential Revision: D22259351

Pulled By: malfet

fbshipit-source-id: f934ff774dfc7d47f12c3da836ca314c12d92208
2020-06-26 10:18:46 -07:00
10822116c5 build docker image for CUDA11 (#40534)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40534

Differential Revision: D22258874

Pulled By: seemethere

fbshipit-source-id: 1954a22ed52e1a65caf89725ab1db9f40ff917b8
2020-06-26 10:07:53 -07:00
fc8bca094c skip_if_rocm test_rnn in test_c10d_spawn.py (#40577)
Summary:
Test was added a few months back in https://github.com/pytorch/pytorch/issues/36503 but recently became flaky for ROCm.

CC ezyang xw285cornell sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40577

Differential Revision: D22258196

Pulled By: ezyang

fbshipit-source-id: 8a22b0c17b536b3d42d0382f7737df0f8823ba08
2020-06-26 09:45:45 -07:00
67c79bb045 update schema to reflect aliasing behavior (#39794)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/38555

I did an audit of `native_functions.yaml` and found several functions in addition to `reshape` which were not reporting that they could alias:

```
@torch.jit.script
def foo(t: torch.Tensor):
    new_value = torch.tensor(1, dtype=t.dtype, device=t.device)

    t.flatten()[0] = new_value
    t.reshape(-1)[1] = new_value
    t.view_as(t)[2] = new_value
    t.expand_as(t)[3] = new_value
    t.reshape_as(t)[4] = new_value
    t.contiguous()[5] = new_value
    t.detach()[6] = new_value

    return t
```

Currently none of the values are assigned after dead code elimination; after this PR, all are. (And the JIT output matches that of eager.)

I don't think this needs to be unit tested; presumably the generic machinery already is and this just brings these ops under the same umbrella.

**BC-breaking note**: This updates the native operator schema and the aliasing rules for autograd. JIT passes will no longer incorrectly optimize mutations on graphs containing these ops, and inplace ops on the result of `flatten` will now properly be tracked in Autograd and the proper backward graph will be created.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39794

Differential Revision: D22008358

Pulled By: robieta

fbshipit-source-id: 9d3ff536e58543211e08254a75c6110f2a3b4992
2020-06-26 09:25:27 -07:00
a0ba7fb43e Precompute entries in dispatch tables (#40512)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40512

Fixes https://github.com/pytorch/pytorch/issues/32454

The heart of this diff is changing this:

```
inline const KernelFunction& Dispatcher::dispatch_(const DispatchTable& dispatchTable, DispatchKey dispatchKey) c
nst {
  const KernelFunction* backendKernel = dispatchTable.lookup(dispatchKey);

  if (nullptr != backendKernel) {
    return *backendKernel;
  }

  const auto& backendFallbackKernel = backendFallbackKernels_[dispatchKey];
  if (backendFallbackKernel.isValid()) {
    return backendFallbackKernel;
  }

  const KernelFunction* catchallKernel = dispatchTable.lookupCatchallKernel();
  if (C10_LIKELY(nullptr != catchallKernel)) {
    return *catchallKernel;
  }

  reportError(dispatchTable, dispatchKey);
}
```

to this:

```
const KernelFunction& OperatorEntry::lookup(DispatchKey k) const {
  const auto& kernel = dispatchTable_[static_cast<uint8_t>(k)];
  if (C10_UNLIKELY(!kernel.isValid())) {
    reportError(k);
  }
  return kernel;
}
```

The difference is that instead of checking a bunch of places to find the
right kernel to use for an operator, all of the entries are
precomputed into dispatchTable_ itself (so you don't have to consult
anything else at runtime).  OperatorEntry::computeDispatchTableEntry
contains that computation (which is exactly the same as it was before.)
By doing this, we are able to substantially simplify many runtime
components of dispatch.
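
A Python sketch of the precomputation idea (illustrative pseudocode mirroring the description above, not the actual C++):

```
NUM_DISPATCH_KEYS = 64  # stand-in for the real number of DispatchKeys

def compute_dispatch_table_entry(key, kernels, backend_fallbacks, catchall):
    # Same precedence as the old per-call lookup, now evaluated once per key.
    if key in kernels:
        return kernels[key]
    if key in backend_fallbacks:
        return backend_fallbacks[key]
    return catchall  # may be None, meaning "report an error when called"

def update_dispatch_table_full(kernels, backend_fallbacks, catchall):
    return [compute_dispatch_table_entry(k, kernels, backend_fallbacks, catchall)
            for k in range(NUM_DISPATCH_KEYS)]

def lookup(dispatch_table, key):
    # The hot path collapses to a single indexed load plus a validity check.
    kernel = dispatch_table[key]
    if kernel is None:
        raise RuntimeError("no kernel for dispatch key {}".format(key))
    return kernel
```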

The diff is fairly large, as there are also some refactors interspersed
with the substantive change:

- I deleted the DispatchTable abstraction, folding it directly into
  OperatorEntry.  It might make sense to have some sort of DispatchTable
  abstraction (if only to let you do operator[] on DispatchKey without
  having to cast it to integers first), but I killed DispatchTable to
  avoid having to design a new abstraction; the old abstraction wasn't
  appropriate for the new algorithm.

- I renamed OperatorEntry::KernelEntry to AnnotatedKernel, and use it
  to store backend fallbacks as well as regular kernel registrations
  (this improves error messages when you incorrectly register a backend
  fallback twice).

- I moved schema_ and debug_ into an AnnotatedSchema type, to make the
  invariant clearer that these are set together, or not at all.

- I moved catch-all kernels out of kernels_ into their own property
  (undoing a refactor I did before).  The main reason I did this was
  because our intended future state is to not have a single catch-all,
  but rather possibly multiple catch-alls which fill-in different
  portions of the dispatch table.  This may change some more in
  the future: if we allow registrations for multiple types of
  catch alls, we will need a NEW data type (representing bundles
  of dispatch keys) which can represent this case, or perhaps
  overload DispatchKey to also record these types.

The key changes for precomputation:

- OperatorEntry::updateDispatchTable_ is now updated to fill in the
  entry at a DispatchKey, considering both kernels (what it did
  before) as well as catch-all and backend fallback.  There is also
  OperatorEntry::updateDispatchTableFull_ which will update the
  entire dispatch table (which is necessary when someone sets a
  catch-all kernel).  OperatorEntry::computeDispatchTableEntry
  holds the canonical algorithm specifying how we decide what
  function will handle a dispatch key for the operator.

- Because dispatch table entry computation requires knowledge of
  what backend fallbacks are (which is recorded in Dispatcher,
  not OperatorEntry), several functions on OperatorEntry now
  take Dispatcher as an argument so they can query this information.

- I modified the manual boxing wrapper invariant: previously, kernels
  stored in kernels_ did NOT have manual boxing wrappers and this
  was maintained by DispatchTable.  Now, we just ALWAYS maintain
  manual boxing wrappers for all KernelFunctions we store.

- DispatchKeyExtractor is greatly simplified: we only need to maintain
  a single per-operator bitmask of what entries are fallthrough
  (we don't need the global bitmask anymore).

- Introduced a new debugging 'dumpComputedTable' method, which prints
  out the computed dispatch table, and how we computed it to be some way.
  This was helpful for debugging cases when the dispatch table and
  the canonical metadata were not in sync.

Things that I didn't do but would be worth doing at some point:

- I really wanted to get rid of the C10_UNLIKELY branch for
  whether or not the KernelFunction is valid, but it looks like
  I cannot easily do this while maintaining good error messages.
  In principle, I could always populate a KernelFunction which
  errors, but the KernelFunction needs to know what the dispatch
  key that is missing is (this is not passed in from the
  calling convention).  Actually, it might be possible to do
  something with functors, but I didn't do it here.

- If we are going to get serious about catchalls for subsets of
  operators, we will need to design a new API for them.  This diff
  is agnostic to this question; we don't change public API at all.

- Precomputation opens up the possibility of subsuming DispatchStub
  by querying CPU capability when filling in the dispatch table.
  This is not implemented yet. (There is also a mild blocker here,
  which is that DispatchStub is also used to share TensorIterator
  configuration, and this cannot be directly supported by the
  regular Dispatcher.)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22236352

Pulled By: ezyang

fbshipit-source-id: d6d90f267078451816b1899afc3f79737b4e128c
2020-06-26 09:03:39 -07:00
a4cabd1a3c Generalize Python dispatcher testing API; disallow overwriting fallback (#40469)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40469

- The old testing interface C._dispatch_import was based off the old
  c10::import variation, which meant the API lined up in a strange
  way with the actual torch/library.h.  This diff reduces the
  differences by letting you program the Library constructor directly.

- Using this newfound flexibility, we add a test for backend fallbacks
  from Python; specifically testing that we disallow registering a
  backend fallback twice.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22236351

Pulled By: ezyang

fbshipit-source-id: f8365e3033e9410c7e6eaf9f78aa32e1f7d55833
2020-06-26 09:01:28 -07:00
44bf822084 Add C++ standard version check to top level headers (#40510)
Summary:
Remove the `-std=c++14` flag from `utils.cmake`, since the PyTorch C++ API can be invoked by any compiler compliant with the C++14 standard or later
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40510

Differential Revision: D22253313

Pulled By: malfet

fbshipit-source-id: ff731525868b251c27928fc98b0724080ead9be2
2020-06-26 08:44:04 -07:00
dfc7e71d13 [Selective Build] Apply query-based on instrumentation_tests
Summary:
1. Modularize some bzl files to break circular buck load
2. Use query-based on instrumentation_tests

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: kwanmacher

Differential Revision: D22188728

fbshipit-source-id: affbabd333c51c8b1549af6602c6bb79fabb7236
2020-06-26 08:05:53 -07:00
f1406c43fc [papaya][aten] Fix compiler error: loop variable 'tensor' is always a copy because the range of type 'c10::List<at::Tensor>' does not return a reference. (#40599)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40599

.

Test Plan: CI

Reviewed By: smessmer

Differential Revision: D22246106

fbshipit-source-id: a5d0535e627b9f493fca7234dcfc15c521b0ed7f
2020-06-26 02:43:25 -07:00
eebd492dcf [doc] fix autograd doc subsubsection display issue (#40582)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40582

There's a misuse of "~~~~" under `requires_grad`; "~~~~" is not an official section marker. Change it to "^^^^" to denote subsubsections, and also fix the other places where we should use the subsection marker "-----" instead of the subsubsection marker "^^^^".

see https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#sections

Before:
<img width="712" alt="rst_before" src="https://user-images.githubusercontent.com/9443650/85789835-2226fa80-b6e4-11ea-97b6-2b19fdf324a4.png">
After:
<img width="922" alt="rst_after" src="https://user-images.githubusercontent.com/9443650/85789856-281cdb80-b6e4-11ea-925f-cb3f4ebaa2bf.png">

Test Plan: Imported from OSS

Differential Revision: D22245747

Pulled By: wanchaol

fbshipit-source-id: 11548ed42f627706863bb74d4269827d1b3450d4
2020-06-25 23:28:33 -07:00
3ab60ff696 Remove cpu vec256 for std::complex (#39830)
Summary:
std::complex is gone. We are now using c10::complex
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39830

Differential Revision: D22252066

Pulled By: malfet

fbshipit-source-id: cdd5bb03ec66825d82177d609cbcf0738922dba0
2020-06-25 23:25:58 -07:00
fab412a8f3 Bump nightlies to 1.7.0 (#40519)
Summary:
edit: apparently we hardcode a lot more versions than I would've anticipated.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40519

Differential Revision: D22221280

Pulled By: seemethere

fbshipit-source-id: ba15a910a6755ec08c10f7783ed72b1e06e6b570
2020-06-25 22:36:33 -07:00
e3a97688cc [quant][graphmode][fix] dequantize propagation for {add/mul}_scalar (#40596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40596

Previously the fusion patterns for {add/mul}_scalar were inconsistent, since the op pattern
produces a non-quantized tensor while the op replacement graph produces a quantized tensor.

Test Plan: Imported from OSS

Differential Revision: D22251072

fbshipit-source-id: e16eb92cf6611578cca1ed8ebde961f8d0610137
2020-06-25 22:17:08 -07:00
547ea787ff [ONNX] Add eliminate_unused_items pass (#38812)
Summary:
This PR:

- Adds eliminate_unused_items pass that removes unused inputs and initializers.
- Fixes run_embed_params function so it doesn't export unnecessary parameters.
- Removes test_modifying_params in test_verify since it's no longer needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38812

Reviewed By: ezyang

Differential Revision: D22236416

Pulled By: houseroad

fbshipit-source-id: 30e1a6e8823a7e36b51ae1823cc90476a53cd5bb
2020-06-25 22:00:26 -07:00
5466231187 Fixes lint (#40606)
Summary:
'= ' => '='
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40606

Differential Revision: D22252511

Pulled By: mruberry

fbshipit-source-id: 5f90233891be58a742371e4416166a267aee4669
2020-06-25 21:53:00 -07:00
ac79c874ce [PyTorch Operator] [2/n] Adding python test
Summary: Adding a Python test file with image files, with the input image being p.jpg. Tests the quality difference between the raw image and the decoded image.

Test Plan:
Parsing buck files: finished in 1.5 sec
Building: finished in 6.4 sec (100%) 10241/10241 jobs, 2 updated
  Total time: 8.0 sec
More details at https://www.internalfb.com/intern/buck/build/387cb1c1-2902-4f90-ae9f-83fb6d473487
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 93e6ef88-ec68-41cb-9de7-7868a14e6d65
Trace available for this run at /tmp/tpx-20200623-055836.283269/trace.log
Started reporting to test run: https://our.intern.facebook.com/intern/testinfra/testrun/4222124679431330
    ✓ ListingSuccess: caffe2/test:test_bundled_images - main (18.865)
    ✓ Pass: caffe2/test:test_bundled_images - test_single_tensors (test_bundled_images.TestBundledInputs) (18.060)
    ✓ Pass: caffe2/test:test_bundled_images - main (18.060)
Summary
  Pass: 2
  ListingSuccess: 1
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/4222124679431330

Reviewed By: dreiss

Differential Revision: D22046611

fbshipit-source-id: fabc604269a5a4d8a37135ce776200da2794a252
2020-06-25 18:36:44 -07:00
c790476384 Back out "Revert D22072830: [wip] Upgrade msvc to 14.13" (#40594)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40594

Original commit changeset: 901de185e607
ghstack-source-id: 106642590

Test Plan: oss ci

Differential Revision: D22247269

fbshipit-source-id: be0c64d1a579f8aa3999cb84a9d20488095a81bd
2020-06-25 17:19:33 -07:00
b05c34259b relax size check in flatten_for_scatter_gather (#40573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40573

Per title, to work around the apex sbn bug.

Test Plan: Covered by existing tests

Reviewed By: blefaudeux

Differential Revision: D22236942

fbshipit-source-id: ddb164ee347a7d472a206087e4dbd16aa9d72387
2020-06-25 15:16:37 -07:00
e180ca652f Add __all__ to torch/_C/_VariableFunctions.pyi (#40499)
Summary:
Related to https://github.com/pytorch/pytorch/issues/40397

Inspired by ezyang's comment at https://github.com/pytorch/pytorch/issues/40397#issuecomment-648233001, this PR attempts to leverage `__all__` to explicitly export private functions from `_VariableFunctions.pyi` in order to make `mypy` aware of them after:

```
if False:
    from torch._C._VariableFunctions import *
```

The generation of the `__all__` template variable excludes some items from `unsorted_function_hints`, as it seems that those without hints end up not being explicitly included in the `.pyi` file: I leaned on the side of caution and opted for having `__all__` consistent with the definitions inside the file. Additionally, added some pretty-printing to avoid having an extremely long line.
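
As a minimal illustration of the pattern (the function names below are hypothetical), the generated stub lists private helpers explicitly so that the guarded star-import above makes them visible to mypy:

```
# hypothetical excerpt of the generated _VariableFunctions.pyi
__all__ = [
    "_foo_internal",  # private names must be listed explicitly to survive `import *`
    "add",
    "mul",
]

def _foo_internal(input): ...
def add(input, other): ...
def mul(input, other): ...
```
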
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40499

Differential Revision: D22240716

Pulled By: ezyang

fbshipit-source-id: 77718752577a82b1e8715e666a8a2118a9d3a1cf
2020-06-25 14:10:07 -07:00
c6e0c67449 [PyTorch Error Logging][2/N] Adding Error Logging for Loading Model (#40537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40537

Adding error logging when loading a model, adding the event "MOBILE_MODULE_LOAD"
ghstack-source-id: 106615128

Test Plan: {F241028136}

Reviewed By: iseeyuan

Differential Revision: D22098818

fbshipit-source-id: 4de7df4432c7c6c297a9dc173e5cafa13fe2833c
2020-06-25 14:05:43 -07:00
e231405ef6 [jit] Fix type annotations in select assignments (#40528)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40528

Previously, an assignment like `self.foo : List[int] = []` would ignore
the type hint.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D22222927

Pulled By: suo

fbshipit-source-id: b0af19b87c6fbe0670d06b55f2002a783d00549d
2020-06-25 13:08:03 -07:00
dfbf0164c9 Revert D22103662: [NCCL] Explicitly Abort NCCL Communicators on Process Group Destruction
Test Plan: revert-hammer

Differential Revision:
D22103662 (527ab13436)

Original commit changeset: 1f6f88b56bd7

fbshipit-source-id: d0944462c021ec73c7f883f98609fc4a3408efd9
2020-06-25 12:27:24 -07:00
4d40ec1480 [PyTorch Error Logging][1/N] Adding Error Logging for Run_Method (#40535)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40535

Adding error logging for run_method.
Adding CANCEL (the method cannot be found) and FAIL (an error occurred when running the method) statuses.
ghstack-source-id: 106604786

Test Plan: {F240891059}

Reviewed By: xcheng16

Differential Revision: D22097857

fbshipit-source-id: 4bdc8e3993e40cb1ba51e4706be6637e3afd40b4
2020-06-25 12:25:34 -07:00
f41173b975 [PyPer][quant] Add quantized embedding operators to OSS. (#40076)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40076

Pull Request resolved: https://github.com/pytorch/glow/pull/4606

[PyPer][quant] Add quantized embedding operators to OSS.

This is the first step in supporting Graph Mode Quantization for EmbeddingBag.

At a high level, the next steps would be
a) Implementation of Embedding prepack/unpack operators,
b) Implementation of torch.nn.quantized.dynamic.EmbeddingBag Module,
c) Implementation of torch.nn.quantized.EmbeddingBag Module,
d) Implementation (modification) of IR passes to support graph quantization of EmbeddingBag module.

More in-depth details regarding each step will be in the follow-up diffs. Consider this an initial diff that moves operators to the respective places required for us to proceed.

Test Plan: ```buck test mode/no-gpu caffe2/test:quantization -- --stress-runs 100  test_embedding_bag```

Reviewed By: supriyar

Differential Revision: D21949828

fbshipit-source-id: cad5ed0a855db7583bddb1d93e2da398c128024a
2020-06-25 12:01:49 -07:00
461014d54b Unify libtorch_python_cuda_core_sources filelists between CMakeList, fbcode and bazel (#40554)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40554

Extract a sublist of `libtorch_python_cuda_sources` named `libtorch_python_cuda_core_sources`, and use it to replace the list with the same content in `CMakeLists.txt`.
This change keeps the filelists consistent between CMakeLists and Bazel.

Test Plan: CI

Reviewed By: malfet

Differential Revision: D22223207

fbshipit-source-id: 2bde3c42a0b2d60d689581561075df4ef52ab694
2020-06-25 11:02:33 -07:00
7369dc8d1f Use CPU Allocator for reading from zip container
Summary:
This code path is used to read tensor bodies, so we need it to respect
alignment and padding requirements.

Test Plan: Ran an internal test that was failing.

Reviewed By: zdevito

Differential Revision: D22225622

fbshipit-source-id: f2126727f96616366850642045ab9704f3885824
2020-06-25 10:51:49 -07:00
c362138f43 Disallow passing functions that don't return Tensors to vmap (#40518)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40518

I overlooked this in the initial vmap frontend api PR. Right now we
want to restrict vmap to taking in functions that only return Tensors.
A function that only returns tensors can look like one of the following:
```
def fn1(x):
    ...
    return y

def fn2(x):
    ...
    return y, z
```
fn1 returns a Tensor, while fn2 returns a tuple of Tensors. So we add a
check that the output of the function passed to vmap returns either a
single tensor or a tuple of tensors.

NB: These checks allow passing a function that returns a tuple with a
single-element tensor from vmap. That seems OK to me.

Test Plan: - `python test/test_vmap.py -v`

Differential Revision: D22216166

Pulled By: zou3519

fbshipit-source-id: a92215e9c26f6138db6b10ba81ab0c2c2c030929
2020-06-25 08:54:05 -07:00
43757ea913 Add batching rule for Tensor.permute (#40517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40517

This is necessary for implementing the vmap frontend API's out_dims
functionality.

Test Plan:
- `./build/bin/vmap_test`. The vmap python API can't accept inputs that
aren't Tensors right now. There are workarounds for that (use a
lambda) but that doesn't look too nice. In the future we'll test all
batching rules in Python.

Differential Revision: D22216168

Pulled By: zou3519

fbshipit-source-id: b6ef552f116fddc433e242c1594059b9d2fe1ce4
2020-06-25 08:54:01 -07:00
7038579c03 Add batching rule for unsqueeze, squeeze, and transpose (#40455)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40455

These don't need to be implemented right now but are useful later down
the line. I thought I would use these in implementing vmap's `out_dims`
functionality, but it turns out they weren't necessary. Since the code
exists and is useful anyways, I am leaving this PR here.

Test Plan:
- `./build/bin/vmap_test`. We could test this using the vmap frontend API,
but there is the catch that vmap cannot directly take integers right
now (all inputs passed to vmap must be Tensors at the moment). It's
possible to hack around that by declaring lambdas that take in a single
tensor argument, but those don't look nice.

Differential Revision: D22216167

Pulled By: zou3519

fbshipit-source-id: 1a010f5d7784845cca19339d37d6467f5b987c32
2020-06-25 08:51:27 -07:00
88ea51c061 doc string fix for torch.cuda.set_rng_state_all (#40544)
Summary:
Fix https://github.com/pytorch/pytorch/issues/40239
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40544

Differential Revision: D22233989

Pulled By: ezyang

fbshipit-source-id: b5098357a3e0c50037f95ba0d701523d5dce2628
2020-06-25 08:37:14 -07:00
e440c370c5 [quant] Fix fuse linear pass (#40549)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40549

Previously we didn't check whether %weight_t is produced by `aten::t`; this could fuse some `matmul`/`addmm` ops that are
not 2D into `aten::linear`, which is incorrect.

Test Plan: Imported from OSS

Differential Revision: D22225921

fbshipit-source-id: 9723e82fdbac6d8e1a7ade22f3a9791321ab12b6
2020-06-25 07:10:09 -07:00
eae1ed99a3 caffe2 | Fix building with -Wrange-loop-analysis on
Summary: `-Wrange-loop-analysis` is turned on by default for clang 10 (see https://reviews.llvm.org/D73834). This fixes a warning found with that flag.

Test Plan: Build with clang 10 and check there are no `range-loop-analysis` warnings.

Reviewed By: yinghai

Differential Revision: D22207072

fbshipit-source-id: 858ba8a36c653071eab961cb891ce945faf0fa87
2020-06-24 23:42:33 -07:00
cf8a9b50ca Allow ReflectionPad to accept 0-dim batch sizes. (#39231)
Summary:
Allows ReflectionPad 1D and 2D to accept 0-dim batch sizes.

Related to issues:

* https://github.com/pytorch/pytorch/issues/38115
* https://github.com/pytorch/pytorch/issues/12013
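
A quick sketch of what this enables:

```
import torch
import torch.nn as nn

pad = nn.ReflectionPad1d(2)
x = torch.randn(0, 4, 8)       # batch size 0: (N=0, C=4, W=8)
y = pad(x)                     # previously errored on the empty batch
assert y.shape == (0, 4, 12)   # W grows by 2 on each side
```
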
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39231

Reviewed By: ezyang

Differential Revision: D22205717

Pulled By: mruberry

fbshipit-source-id: 6744661002fcbeb4aaafd8693fb550ed53f3e00f
2020-06-24 22:24:05 -07:00
82e9318a16 Adjust CUDA memory leak test (#40504)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40504

Make the CUDA memory leak test not flaky.

Test Plan: python test/test_profiler.py

Differential Revision: D22215527

Pulled By: ilia-cher

fbshipit-source-id: 5f1051896342ac50cd3a21ea86ce7487b5f82a19
2020-06-24 18:22:46 -07:00
85b87df5ba Revert D22208758: [pytorch][PR] Report error when ATEN_THREADING is OMP and USE_OPENMP is turned off.
Test Plan: revert-hammer

Differential Revision:
D22208758 (3ed96e465c)

Original commit changeset: 0866c9bb9b3b

fbshipit-source-id: 9e2b469469e274292b2559c02aa0256425fd355e
2020-06-24 18:20:28 -07:00
06debf6373 move __range_length and __derive_index to lite interpreter (#40533)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40533

These ops are required by the demucs denoiser model

Test Plan: build

Reviewed By: kaustubh-kp, linbinyu

Differential Revision: D22216217

fbshipit-source-id: f300ac246fe3a7a6566a70bb89858770af68a90c
2020-06-24 18:14:51 -07:00
adcd755e69 Fix backup solution (#40515)
Summary:
These were changes that had to be made in the `release/1.6` branch in order to get backups to work.

They should be brought to the master branch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40515

Differential Revision: D22221308

Pulled By: seemethere

fbshipit-source-id: 24e2a0196a8e775fe324a383c8f0c681118b741b
2020-06-24 17:21:38 -07:00
e12f73ee12 Add missing file to BUILD.bazel (#40536)
Summary:
Add `int8_gen_quant_params.cc`, introduced in
https://github.com/pytorch/pytorch/pull/40494, to the Bazel build rules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40536

Reviewed By: mruberry

Differential Revision: D22219595

Pulled By: malfet

fbshipit-source-id: 2875a0b9c55bad2b052a898661b96eab490f6451
2020-06-24 17:16:26 -07:00
3dcc329746 Use tree-based sum for floats to avoid numerical instability (#39516)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38716, fixes https://github.com/pytorch/pytorch/issues/37234

This algorithm does the summation along a single axis with multiple "levels" of accumulators, each of which is designed to hold the sum of an order of magnitude more values than the previous one.

e.g. if there are 2^16 elements, the first level will hold the sum of 2^4 elements, and so on in increasing powers of 2: 2^4, 2^8, 2^12 and finally 2^16.

This limits the differences in magnitude of the partial results being added together, and so we don't lose accuracy as the axis length increases.

WIP to write a vectorized version.
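
A pure-Python sketch of the underlying pairwise/tree idea (the real kernel uses a fixed set of accumulator levels and vectorized loads rather than recursion):

```
import math
import random

def tree_sum(xs, leaf=8):
    # Recursively halve the input so that partial sums of comparable
    # magnitude are combined, bounding the accumulated rounding error.
    n = len(xs)
    if n <= leaf:
        total = 0.0
        for x in xs:
            total += x
        return total
    mid = n // 2
    return tree_sum(xs[:mid], leaf) + tree_sum(xs[mid:], leaf)

xs = [random.random() for _ in range(10**6)]
exact = math.fsum(xs)
# The tree sum's error is typically far smaller than the naive running sum's.
print(abs(sum(xs) - exact), abs(tree_sum(xs) - exact))
```
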
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39516

Reviewed By: ezyang

Differential Revision: D22106251

Pulled By: ngimel

fbshipit-source-id: b56de4773292439dbda62b91f44ff37715850ae9
2020-06-24 17:06:38 -07:00
ea06db9466 Release GIL during DDP construction. (#40495)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40495

As part of debugging flaky ddp_under_dist_autograd tests, I realized
we were running into the following deadlock.

1) Rank 0 would go into DDP construction, hold GIL and wait for broadcast in
DDP construction.
2) Rank 3 is a little slower and performs an RRef fetch call before the DDP
construction.
3) The RRef fetch call is done on Rank 0 and tries to acquire GIL.
4) We now have a deadlock since Rank 0 is waiting for Rank 3 to enter the
collective and Rank 3 is waiting for Rank 0 to release GIL.
ghstack-source-id: 106534442

Test Plan:
1) Ran ddp_under_dist_autograd 500 times.
2) waitforbuildbot

Differential Revision: D22205180

fbshipit-source-id: 6afd55342e801b9edb9591ff25158a244a8ea66a
2020-06-24 16:58:42 -07:00
71edd7f175 Update FP16 to FP16:4dfe081cf6bcd15db339cf2680b9281b8451eeb3. (#40526)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40526

Differential Revision: D22215600

Pulled By: AshkanAliabadi

fbshipit-source-id: 6ff0c17d17f118b64ae34c0007b705c7127f07ef
2020-06-24 16:58:40 -07:00
16f276cef9 Add C++-only int dim overloads to std-related operations (#40451)
Summary:
Fixes gh-40287

The `int -> bool` conversion takes higher precedence than `int -> IntArrayRef`. So, calling `std(0)` in C++ would select the `std(unbiased=False)` overload instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40451

Differential Revision: D22217926

Pulled By: ezyang

fbshipit-source-id: 7520792fab5ab6665bddd03b6f57444c6c729af4
2020-06-24 16:56:55 -07:00
a208a272cb Update cpuinfo to cpuinfo:63b254577ed77a8004a9be6ac707f3dccc4e1fd9. (#40516)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40516

Differential Revision: D22215554

Pulled By: AshkanAliabadi

fbshipit-source-id: f779cf6e08cf344b87071c2ffc9b3f7cf4659085
2020-06-24 16:47:24 -07:00
c120fdc05b Unify torch/csrc/cuda/shared/cudnn.cpp include path (#40525)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40525

Move `USE_CUDNN` define under `USE_CUDA` guard, add `cuda/shared/cudnn.cpp` to filelist if either USE_ROCM or USE_CUDNN is set.
This is a prep change for PyTorch CUDA src filelist unification change.

Test Plan: CI

Differential Revision: D22214899

fbshipit-source-id: b71b32fc603783b41cdef0e7fab2cc9cbe750a4e
2020-06-24 16:40:11 -07:00
cef35e339f Update FXdiv to FXdiv:b408327ac2a15ec3e43352421954f5b1967701d1. (#40520)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40520

Differential Revision: D22215614

Pulled By: AshkanAliabadi

fbshipit-source-id: 5e41a3a69522cbfe1cc4ac76a0d1f3e90a58528d
2020-06-24 16:31:25 -07:00
4a0ba62ded Update psimd to psimd:072586a71b55b7f8c584153d223e95687148a900. (#40522)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40522

Differential Revision: D22215685

Pulled By: AshkanAliabadi

fbshipit-source-id: 78c103c4f7ad21e78069dc86a8ee47aebc9aa73e
2020-06-24 16:21:25 -07:00
3e09268c0a [jit] allow dict to be mixed between tracing and scripting (#39601)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39601

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D22202689

Pulled By: wanchaol

fbshipit-source-id: 5271eb3d8fdcda3d730a085aa555b43c35d14876
2020-06-24 16:14:13 -07:00
787e1c4c7d [jit] fix dictConstruct order issue (#40424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40424

dictConstruct should preserve the inputs order

Test Plan: Imported from OSS

Differential Revision: D22202690

Pulled By: wanchaol

fbshipit-source-id: c313b531b7fa49e6f3486396d61bfc5d6400cd01
2020-06-24 16:12:32 -07:00
2e6e8d557c Update docs feature classifications (#39966)
Summary:
Update the following feature classifications in docs to align with the changes:
1. [High Level Autograd APIs](https://pytorch.org/docs/stable/autograd.html#functional-higher-level-api): Beta (was experimental)
2. [Eager Mode Quantization](https://pytorch.org/docs/stable/quantization.html): Beta (was experimental)
3. [Named Tensors](https://pytorch.org/docs/stable/named_tensor.html): Prototype (was experimental)
4. [TorchScript/RPC](https://pytorch.org/docs/stable/rpc.html#rpc): Prototype (was experimental)
5. [Channels Last Memory Layout](https://pytorch.org/docs/stable/tensor_attributes.html#torch-memory-format): Beta (was experimental)
6. [Custom C++ Classes](https://pytorch.org/docs/stable/cpp_index.html): Beta (was experimental)
7. [Torch.Sparse](https://pytorch.org/docs/stable/sparse.html): Beta (was experimental)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39966

Differential Revision: D22213217

Pulled By: jlin27

fbshipit-source-id: dc49337cbc7026ed8dcac506fc60029dc3add854
2020-06-24 15:35:59 -07:00
72f2c479e3 Migrate equal from the TH to Aten (CPU) (#33286)
Summary:
https://github.com/pytorch/pytorch/issues/24697
VitalyFedyunin
glaringlee

Test script:
```Python
import timeit

setup_ones = """
import torch
a = torch.ones(({n}, {n}), dtype={dtype})
b = torch.ones(({n}, {n}), dtype={dtype})
"""

for n, t in [(1000, 10000), (2000, 10000)]:
  for dtype in ('torch.bool', 'torch.int', 'torch.long', 'torch.bfloat16', 'torch.float', 'torch.double'):
  #for dtype in ('torch.bool', 'torch.int', 'torch.long', 'torch.float', 'torch.double'):
    print('torch.ones(({n}, {n})) equal for {t} times {dtype}'.format(n=n, t=t, dtype=dtype))
    print(timeit.timeit(stmt='torch.equal(a, b)', setup=setup_ones.format(n=n, dtype=dtype), number=t))

setup_rand = """
import torch
a = torch.rand(({n}, {n}), dtype={dtype})
b = a.clone()
"""
for n, t in [(1000, 10000), (2000, 10000)]:
  for dtype in ('torch.float', 'torch.double'):
    print('torch.rand(({n}, {n})) for {t} times {dtype}'.format(n=n, t=t, dtype=dtype))
    print(timeit.timeit(stmt='torch.equal(a, b)', setup=setup_rand.format(n=n, dtype=dtype), number=t))

setup_non_contiguous = """
import torch
a = torch.rand(({n}, {n}), dtype={dtype})
a2 = a[:, 500:]
a3 = a2.clone()
torch.equal(a2, a3)
"""
for n, t in [(1000, 10000), (2000, 10000)]:
  for dtype in ('torch.float', 'torch.double'):
    print('non_contiguous torch.rand(({n}, {n})) for {t} times {dtype}'.format(n=n, t=t, dtype=dtype))
    print(timeit.timeit(stmt='torch.equal(a2, a3)', setup=setup_non_contiguous.format(n=n, dtype=dtype), number=t))

setup_not_equal = """
import torch
a = torch.rand(({n}, {n}), dtype={dtype})
b = torch.rand(({n}, {n}), dtype={dtype})
torch.equal(a, b)
"""
for n, t in [(1000, 10000), (2000, 10000)]:
  for dtype in ('torch.float', 'torch.double'):
    print('not equal torch.rand(({n}, {n})) for {t} times {dtype}'.format(n=n, t=t, dtype=dtype))
    print(timeit.timeit(stmt='torch.equal(a, b)', setup=setup_not_equal.format(n=n, dtype=dtype), number=t))
```

TH
```
torch.ones((1000, 1000)) equal for 10000 times torch.bool
1.8391206220258027
torch.ones((1000, 1000)) equal for 10000 times torch.int
1.8877864250680432
torch.ones((1000, 1000)) equal for 10000 times torch.long
1.938108820002526
torch.ones((1000, 1000)) equal for 10000 times torch.bfloat16
3.184849138953723
torch.ones((1000, 1000)) equal for 10000 times torch.float
1.8825413499725983
torch.ones((1000, 1000)) equal for 10000 times torch.double
2.7266416549682617
torch.ones((2000, 2000)) equal for 10000 times torch.bool
7.227149627986364
torch.ones((2000, 2000)) equal for 10000 times torch.int
7.76215292501729
torch.ones((2000, 2000)) equal for 10000 times torch.long
9.631909006042406
torch.ones((2000, 2000)) equal for 10000 times torch.bfloat16
8.097328286035918
torch.ones((2000, 2000)) equal for 10000 times torch.float
5.5739822529722005
torch.ones((2000, 2000)) equal for 10000 times torch.double
8.444009944912978
torch.rand((1000, 1000)) for 10000 times torch.float
1.168096570065245
torch.rand((1000, 1000)) for 10000 times torch.double
1.6577326939441264
torch.rand((2000, 2000)) for 10000 times torch.float
5.49395391496364
torch.rand((2000, 2000)) for 10000 times torch.double
8.507486199960113
non_contiguous torch.rand((1000, 1000)) for 10000 times torch.float
6.074504268006422
non_contiguous torch.rand((1000, 1000)) for 10000 times torch.double
6.1426916810451075
non_contiguous torch.rand((2000, 2000)) for 10000 times torch.float
37.501055537955835
non_contiguous torch.rand((2000, 2000)) for 10000 times torch.double
44.6880351039581
not equal torch.rand((1000, 1000)) for 10000 times torch.float
0.029356416082009673
not equal torch.rand((1000, 1000)) for 10000 times torch.double
0.025421109050512314
not equal torch.rand((2000, 2000)) for 10000 times torch.float
0.026333761983551085
not equal torch.rand((2000, 2000)) for 10000 times torch.double
0.02748022007290274
```

ATen
```
torch.ones((1000, 1000)) equal for 10000 times torch.bool
0.7961567062884569
torch.ones((1000, 1000)) equal for 10000 times torch.int
0.49172434909269214
torch.ones((1000, 1000)) equal for 10000 times torch.long
0.9459248608909547
torch.ones((1000, 1000)) equal for 10000 times torch.bfloat16
2.0877483217045665
torch.ones((1000, 1000)) equal for 10000 times torch.float
0.606857153121382
torch.ones((1000, 1000)) equal for 10000 times torch.double
1.1388208279386163
torch.ones((2000, 2000)) equal for 10000 times torch.bool
2.0329296849668026
torch.ones((2000, 2000)) equal for 10000 times torch.int
3.534358019940555
torch.ones((2000, 2000)) equal for 10000 times torch.long
8.19841272290796
torch.ones((2000, 2000)) equal for 10000 times torch.bfloat16
6.595649406313896
torch.ones((2000, 2000)) equal for 10000 times torch.float
4.193911510054022
torch.ones((2000, 2000)) equal for 10000 times torch.double
7.931309659034014
torch.rand((1000, 1000)) for 10000 times torch.float
0.8877940969541669
torch.rand((1000, 1000)) for 10000 times torch.double
1.4142901846207678
torch.rand((2000, 2000)) for 10000 times torch.float
4.010025603231043
torch.rand((2000, 2000)) for 10000 times torch.double
8.126411964651197
non_contiguous torch.rand((1000, 1000)) for 10000 times torch.float
0.602473056409508
non_contiguous torch.rand((1000, 1000)) for 10000 times torch.double
0.6784545010887086
non_contiguous torch.rand((2000, 2000)) for 10000 times torch.float
3.0991827426478267
non_contiguous torch.rand((2000, 2000)) for 10000 times torch.double
5.719010795000941
not equal torch.rand((1000, 1000)) for 10000 times torch.float
0.046060710679739714
not equal torch.rand((1000, 1000)) for 10000 times torch.double
0.036034489050507545
not equal torch.rand((2000, 2000)) for 10000 times torch.float
0.03686975734308362
not equal torch.rand((2000, 2000)) for 10000 times torch.double
0.04189508780837059
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33286

Differential Revision: D22211962

Pulled By: glaringlee

fbshipit-source-id: a5c48f328432c1996f28e19bc75cb495fb689f6b
2020-06-24 15:08:06 -07:00
4d549077a2 Skip test_mem_leak on Windows (#40486)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/40485.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40486

Differential Revision: D22217493

Pulled By: malfet

fbshipit-source-id: 6654c3b53e8af063b508f91728e58262ffbab053
2020-06-24 14:49:14 -07:00
0c923eea0a Add finishAndThrow function to ProcessGroup::Work, and use with Gloo (#40405)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40405

This adds a finishAndThrow function that completes the work object,
sets an exception if one is provided by the user, and throws an exception (if
it is already set or passed by the caller). This is now done by grabbing the
lock just once and simplifies the wait functions in ProcessGroupGloo.
ghstack-source-id: 106516114

Test Plan: CI

Differential Revision: D22174890

fbshipit-source-id: ea74702216c4328187c8d193bf39e1fea43847f6
2020-06-24 14:46:25 -07:00
3e2d2fc856 [NCCL Docs] Adding Comments for Work-level Finish in ProcessGroup (#40404)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40404

Adds docs to the finish function in ProcessGroup::Work. It's better to have some documentation around these functions since we have some PR's with API-changes/optimizations for these work-level functions here and in the subclasses.
ghstack-source-id: 106381736

Test Plan: CI (Docs change only)

Differential Revision: D22174891

fbshipit-source-id: 7901ea3b35caf6f69f37178ca574104d3412de28
2020-06-24 14:44:18 -07:00
527ab13436 [NCCL] Explicitly Abort NCCL Communicators on Process Group Destruction (#40241)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40241

We abort incomplete NCCL Communicators in the ProcessGroupNCCL
destructor, otherwise pending NCCL communciators may block other CUDA ops.

Closes: https://github.com/pytorch/pytorch/issues/32231
ghstack-source-id: 106469423

Test Plan: CI/Sandcastle

Reviewed By: jiayisuse

Differential Revision: D22103662

fbshipit-source-id: 1f6f88b56bd7a5e9ca5a41698995a76e60e8ad9f
2020-06-24 14:34:00 -07:00
fe18dcd692 Use GLOG logging prefixes (#40491)
Summary:
PyTorch should stop polluting global namespace with symbols such as `ERROR` `WARNING` and `INFO`.
Since `logging_is_not_google_glog.h` is a C++ header, define severity levels in a namespace and add a `GLOG_` prefix to match the unshortened glog severity levels.
Change `LOG` and `LOG_IF` macros to use prefix + namespaced severity levels.

Closes https://github.com/pytorch/pytorch/issues/40083
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40491

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D22210925

Pulled By: malfet

fbshipit-source-id: 0ec1181a53baa8bca2f526f245e398582304aeab
2020-06-24 14:07:00 -07:00
fc4824aa4a enable mkldnn dilation conv (#40483)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40483

Reviewed By: ezyang

Differential Revision: D22213696

Pulled By: ngimel

fbshipit-source-id: 0321eee8fcaf144b20a5182aa76f98d505c65400
2020-06-24 13:28:05 -07:00
de7ac60cf4 Add out= variants for cuda.comm.broadcast/gather/scatter (#39681)
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/38911
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39681

Differential Revision: D22161342

Pulled By: mrshenli

fbshipit-source-id: 60295077159b02087823e93bb6ebac9d70adea0a
2020-06-24 12:58:19 -07:00
e66445878d Adds dynamic versioning pattern (#40279)
Summary:
BC NOTE:

This change makes it so modules saved with torch.jit.save in PyTorch 1.6 can be loaded by previous versions of PyTorch unless they use torch.div or (soon) torch.full. It also lets tensors saved using torch.save be loaded by previous versions. So this is the opposite of BC-breaking, but I'm using that label to highlight this issue since we don't have a "BC-improving" label.

PR NOTE:
When an operator's semantics change in PyTorch we want to do two things:

1) Preserve the semantics of older serialized Torchscript programs that use the operator
2) Ensure the new semantics are respected

Historically, this meant writing a Versioned Symbol that would remap older versions of the operator into current PyTorch code (1), and bumping the produced file format version (2). Unfortunately, bumping the produced file format version is a nuclear option for ensuring semantics are respected, since it also prevents older versions of PyTorch from loading anything (even tensors!) from newer versions.

Dynamic versioning addresses the nuclear consequences of bumping the produced file format version by only bumping it when necessary. That is, when an operator with changed semantics is detected in the serialized Torchscript. This will prevent Torchscript programs that use the changed operator from loading on earlier versions of PyTorch, as desired, but will have no impact on programs that don't use the changed operator.
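
A sketch of the version-selection logic this implies (the operator names and version numbers below are illustrative, not the real tables):

```
MIN_PRODUCED_FILE_FORMAT_VERSION = 3
OPS_REQUIRING_VERSION_BUMP = {"aten::div": 4, "aten::full": 5}  # illustrative

def produced_file_format_version(ops_used_by_module):
    # Bump the produced version only if a changed operator is actually present.
    version = MIN_PRODUCED_FILE_FORMAT_VERSION
    for op in ops_used_by_module:
        version = max(version, OPS_REQUIRING_VERSION_BUMP.get(op, version))
    return version

assert produced_file_format_version({"aten::add", "aten::mul"}) == 3
assert produced_file_format_version({"aten::add", "aten::div"}) == 4
```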

Note that this change is only applicable when using torch.jit.save and torch.jit.load. torch.save pickles the given object using pickle (by default), which saves a function's Python directly.

No new tests for this behavior are added since the existing tests for versioned division in test_save_load already validate that models with div are loaded correctly at version 4.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40279

Reviewed By: dzhulgakov

Differential Revision: D22168291

Pulled By: mruberry

fbshipit-source-id: e71d6380e727e25123c7eedf6d80e5d7f1fe9f95
2020-06-24 12:52:50 -07:00
a2e1a948a4 Increase number of iterations in DDP SPMD tests (#40506)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40506

Test Plan: Imported from OSS

Differential Revision: D22208965

Pulled By: mrshenli

fbshipit-source-id: 7d27b60e2c09e641b4eeb1c89d9f9917c4e72e52
2020-06-24 12:48:04 -07:00
9a3e16c773 Add guard for non-default stream in DDP's autograd engine callback (#40115)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40115

Closes https://github.com/pytorch/pytorch/issues/37790
Closes https://github.com/pytorch/pytorch/issues/37944

A user may wish to run DDP's forward + backwards step under a non-default CUDA stream such as those created by `with torch.cuda.Stream(stream)`. In this case, the user should be responsible for synchronizing events on this stream with other streams used in the program (per the documentation at https://pytorch.org/docs/stable/notes/cuda.html#cuda-semantics), but currently DDP has a bug which causes DDP under non-default streams to fail.

If a user does the following:
```
model = DDP(...)
loss = model(input).sum()
loss.backward()
grad = model.module.weight.grad
average = dist.all_reduce(grad)
```

There is a chance that `average` and `grad` will not be equal. This is because the CUDA kernels corresponding to the  `all_reduce` call may run before `loss.backward()`'s kernels are finished. Specifically, in DDP we copy the allreduced gradients back to the model parameter gradients in an autograd engine callback, but this callback runs on the default stream. Note that this can also be fixed by the application synchronizing on the current stream, although this should not be expected, since the application is not using the current stream at all.

This PR fixes the issue by passing the current stream into DDP's callback.
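
For reference, before this fix an application could work around the race by synchronizing after the backward pass, along these lines (a sketch; `model` is assumed to be a DDP-wrapped module and `inp` an input batch on the current device):

```
import torch
import torch.distributed as dist

s = torch.cuda.Stream()
with torch.cuda.stream(s):
    loss = model(inp).sum()
    loss.backward()
# DDP's gradient-copy callback ran on the default stream, so conservatively
# wait for all outstanding CUDA work before reading the gradients.
torch.cuda.synchronize()
grad = model.module.weight.grad
dist.all_reduce(grad)  # in-place sum across ranks
```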

Tested by adding a UT `test_DistributedDataParallel_non_default_stream` that fails without this PR
ghstack-source-id: 106481208

Differential Revision: D22073353

fbshipit-source-id: 70da9b44e5f546ff8b6d8c42022ecc846dff033e
2020-06-24 11:26:51 -07:00
597cb04b2f Use Int8QuantParamsBlob to pass the scale and zeropoint params (#40494)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40494

Resubmit the diff because D22124313 (1ec4337b7d) was reverted due to CI test failures.
Added `int8_gen_quant_params.cc` to CMakeLists.txt to fix the CI failures.

Test Plan: buck test caffe2/caffe2/quantization/server:

Reviewed By: hx89

Differential Revision: D22204244

fbshipit-source-id: a2c8b668f199cc5b0c5894086f554f7c459b1ad7
2020-06-24 10:20:16 -07:00
3ed96e465c Report error when ATEN_THREADING is OMP and USE_OPENMP is turned off. (#40146)
Summary:
Currently, even if USE_OPENMP is turned off, ATEN_THREADING can still use OpenMP. This commit fixes that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40146

Reviewed By: ezyang

Differential Revision: D22208758

Pulled By: pbelevich

fbshipit-source-id: 0866c9bb9b3b5b99d586aed176eb0fbe177efa4a
2020-06-24 09:55:10 -07:00
b4ccdef090 Allow torch.cuda.amp.GradScaler to support sparse gradients (#36786)
Summary:
Should close https://github.com/pytorch/pytorch/issues/35810.

I decided to keep sparse handling on the Python side for clarity, although it could be moved to the C++ side (into `_amp_non_finite_check_and_unscale_`) without much trouble.

For non-fp16 sparse grads the logic is simple (call `_amp_non_finite_check_and_unscale_` on `grad._values()` instead of on `grad` itself).  At least I hope it's that easy.

For fp16 sparse grads, it's trickier.  Sparse tensors can be uncoalesced.  From the [Note](https://pytorch.org/docs/master/sparse.html#torch.sparse.FloatTensor):
> Our sparse tensor format permits uncoalesced sparse tensors, where there may be duplicate coordinates in the indices; in this case, the interpretation is that the value at that index is the sum of all duplicate value entries.

An uncoalesced scaled fp16 grad may have values at duplicate coordinates that are all finite but large, such that adding them to make the coalesced version WOULD cause overflows.**  If I checked `_values()` on the uncoalesced version, it might not report overflows, but I think it should.

So, if the grad is sparse, fp16, and uncoalesced, I still call `_amp_non_finite_check_and_unscale_` to unscale `grad._values()` in-place, but I also double-check the coalesced version by calling a second `_amp_non_finite_check_and_unscale_` on `grad.coalesce()._values()`.  `coalesce()` is out-of-place, so this call doesn't redundantly affect `grad._values()`, but it does have the power to populate the same `found_inf` tensor.  The `is_coalesced()` check and `coalesce()` probably aren't great for performance, but if someone needs a giant embedding table in FP16, they're better than nothing and memorywise, they'll only create a copy of nnz gradient values+indices, which is still way better than changing the whole table to FP32.
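
Putting that together, the Python-side handling is roughly the following (a sketch; the exact signature of the internal `_amp_non_finite_check_and_unscale_` op is an assumption here):

```
import torch

def _unscale_sparse_grad(grad, inv_scale, found_inf):
    if grad.dtype is torch.float16 and not grad.is_coalesced():
        # Duplicate coordinates may only overflow once summed, so also run
        # the check on a coalesced copy; coalesce() is out-of-place, but the
        # call can still populate the shared found_inf tensor.
        torch._amp_non_finite_check_and_unscale_(
            grad.coalesce()._values(), found_inf, inv_scale)
    # Unscale the (possibly uncoalesced) values of grad in place.
    torch._amp_non_finite_check_and_unscale_(
        grad._values(), found_inf, inv_scale)
```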

An `unscale` variant with liberty to create unscaled grads out-of-place, and replace `param.grad` instead of writing through it, could get away with just one `_amp_non_finite_check_and_unscale_`.  It could say `coalesced = grad.coalesce()`, do only the stronger `_amp_non_finite_check_and_unscale_` on `coalesced._values()`, and set `param.grad = coalesced`.  I could even avoid replacing `param.grad` itself by going one level deeper and setting `param.grad`'s indices and values to `coalesced`'s, but that seems brittle and still isn't truly "in place".

** you could whiteboard an uncoalesced fp32 grad with the same property, but fp32's range is big enough that I don't think it's realistic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36786

Reviewed By: ezyang

Differential Revision: D22202832

Pulled By: ngimel

fbshipit-source-id: b70961a4b6fc3a4c1882f65e7f34874066435735
2020-06-24 09:10:49 -07:00
d855528186 wconstab/38034-sliced-sequential (#40445)
Summary:
Partial support for slicing of Sequential containers.

- works around missing Sequential slice functionality
   by converting to tuple
- only supports iteration over the resulting tuple values,
   not a direct call() on the sliced sequential (see the sketch below)
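
A sketch of the supported usage under these constraints (module shapes here are arbitrary):

```
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.seq = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))

    def forward(self, x):
        # Iterating over a slice works; calling self.seq[1:](x) directly does not.
        x = self.seq[0](x)
        for layer in self.seq[1:]:
            x = layer(x)
        return x

scripted = torch.jit.script(Net())
print(scripted(torch.randn(3, 4)).shape)  # torch.Size([3, 2])
```
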
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40445

Differential Revision: D22192469

Pulled By: wconstab

fbshipit-source-id: 61c85deda2d58f6e3bea2f1fa1d5d5dde568b9b5
2020-06-24 09:05:51 -07:00
727463a727 Initial vmap frontend API (#40172)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40172

This PR introduces the initial vmap frontend API. It has the following
limitations that we can resolve in the future:
- the inputs must be a flat list of tensors
- the outputs must be a flat list of tensors
- in_dims = 0 (so we always vmap over dim 0 of input tensors)
- out_dims = 0 (so the returned tensors have their vmap dim appear at
dim 0)
- Coverage limited to operations that have batching rules implemented
(torch.mul, torch.sum, torch.expand).

There are some other semantic limitations (like not being able to handle
mutation, aside from pytorch operations that perform mutation) that will
be documented in the future.

I wanted to introduce the API before adding a slow fallback for the
coverage so that we can test future batching rules (and coverage) via
the python API to avoid verbosity in C++-land.

The way vmap works is that `vmap(func)(inputs)` wraps all Tensor inputs
to be batched in BatchedTensors, sends those into func, and then unwraps
the output BatchedTensors. Operations on BatchedTensors perform the batched
operations that the user is asking for. When performing nested vmaps,
each nested vmap adds a batch dimension upon entry and removes a batch
dimension on exit.
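
Given the limitations above (in_dims=0, out_dims=0, flat tensor inputs and outputs, coverage limited to ops with batching rules), usage looks roughly like this sketch:

```
import torch

def f(x, y):
    return (x * y).sum()      # mul and sum have batching rules

x = torch.randn(5, 3)
y = torch.randn(5, 3)
out = torch.vmap(f)(x, y)     # vmaps over dim 0 of every input; out.shape == (5,)

# Semantically equivalent to the explicit loop:
expected = torch.stack([f(x[i], y[i]) for i in range(5)])
assert torch.allclose(out, expected)
```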

Coming up in the near future:
- Support for non-zero in_dims and out_dims
- docstring for vmap
- slow fallback for operators that do not have a batching rule
implemented.

Test Plan: - `pytest test/test_vmap.py -v`

Differential Revision: D22102076

Pulled By: zou3519

fbshipit-source-id: b119f0a8a3a3b1717c92dbbd180dfb1618295563
2020-06-24 08:14:24 -07:00
43ab9c677b Add invariants check to BatchedTensorImpl (#40171)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40171

It checks that all of the bdims in BatchedTensorImpl are sorted in
order of ascending `level`.

Test Plan: - Check that nothing breaks in `./build/bin/vmap_test`

Differential Revision: D22102077

Pulled By: zou3519

fbshipit-source-id: 094b7abc6c65208437f0f51a0d0083091912decc
2020-06-24 08:12:16 -07:00
e490352dc4 Simplify complex case for tanh backward (#39997)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39997

Differential Revision: D22195797

Pulled By: anjali411

fbshipit-source-id: 21eb91bcbd3bfc67acd322a1579fe737b0c02e6e
2020-06-24 07:51:34 -07:00
4975be80f8 fix typo "normal" -> "Cauchy" (#40334)
Summary:
just looks like a real simple typo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40334

Reviewed By: ezyang

Differential Revision: D22195107

Pulled By: zou3519

fbshipit-source-id: 6c43842d22cbc15db2307976381f6dc1536b5047
2020-06-24 07:45:35 -07:00
ecd9a64712 fix torch.jit.trace_module documentation (#40248)
Summary:
This should fix https://github.com/pytorch/pytorch/issues/39328

Before:

![image](https://user-images.githubusercontent.com/24580222/85076992-4720e800-b18f-11ea-9c6e-19bcf3f1cb7d.png)

After:

![image](https://user-images.githubusercontent.com/24580222/85077064-6ddf1e80-b18f-11ea-9274-e8cee6909baa.png)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40248

Reviewed By: ezyang

Differential Revision: D22195038

Pulled By: zou3519

fbshipit-source-id: c4bff6579a422a56ed28b644f5558b20d901c94e
2020-06-24 07:31:31 -07:00
a4dec0674c [doc] fix typo in formula of MarginRankingLoss (#40285)
Summary:
This is just a minor doc fix:

the `MarginRankingLoss` takes 2 input samples `x1` and `x2`, not just `x`
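For reference, the documented loss with both inputs is:

$$\text{loss}(x_1, x_2, y) = \max(0,\; -y \cdot (x_1 - x_2) + \text{margin})$$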
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40285

Reviewed By: ezyang

Differential Revision: D22195069

Pulled By: zou3519

fbshipit-source-id: 909f491c94dca329a37216524f4088e9096e0bc6
2020-06-24 07:24:51 -07:00
e439cf738a Fix examples Adaptive avg pooling typo (#40217)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40217

Reviewed By: ezyang

Differential Revision: D22193711

Pulled By: zou3519

fbshipit-source-id: f96f71e025aa1c81b232e78b1d5b3a3bbd8f331f
2020-06-24 07:22:46 -07:00
72e8690b78 Fix typo. in error message (#39958)
Summary:
Changed "sould" to "should"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39958

Reviewed By: ezyang

Differential Revision: D22193674

Pulled By: zou3519

fbshipit-source-id: ad7bc0aa3ee1f31f5e7965ae36c1903b28509095
2020-06-24 07:17:10 -07:00
b4eb82cd29 Temporary commit at 6/17/2020, 6:49:44 PM
Summary: [WIP] Logit Fake16 Op

Test Plan: [WIP] Tests will be enabled in test_op_nnpi_fp16.py file.

Reviewed By: hyuen

Differential Revision: D22109329

fbshipit-source-id: fd73850c3ec61375ff5bbf0ef5460868a874fbf3
2020-06-24 06:51:48 -07:00
0ecea2d64d [JIT x RPC] Consolidate Future type class and Future impl class (#40406)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40406

Same motivation as https://github.com/pytorch/pytorch/issues/35110.

`Future` and `RRef` are two important types for the `rpc` module and should be easy for users to use.

Reference, https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-autoclass

Follow https://github.com/pytorch/pytorch/pull/35694.
ghstack-source-id: 106484664

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_rref_local_value
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/tensorpipe:rpc_fork_tensorpipe
```

pyre -l caffe2/torch/fb/training_toolkit
pyre -l caffe2/torch/fb/distributed
pyre -l aiplatform

Differential Revision: D7722176

fbshipit-source-id: f3b9ccd7bccb233b2b33ad59dd65e178ba34d67f
2020-06-24 01:44:49 -07:00
f035f73d53 Fix the issue that run clang-tidy on the aten folder (#39713)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39713

Differential Revision: D22203850

Pulled By: mruberry

fbshipit-source-id: 43f690e748b7a3c123ad20f6d640d6dae25c641c
2020-06-24 01:27:54 -07:00
46b9e519aa Remove print (#40475)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40475

As title
ghstack-source-id: 106474870

Test Plan: CI

Differential Revision: D22200640

fbshipit-source-id: 1f4c7bbf54be8c4187c9338fefdf14b501597d98
2020-06-24 00:42:25 -07:00
7b0f867c48 Perf improvement of Conv2d and Conv3d (#40324)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40324

1) avoid the use of `item`, and 2) bypass im2col for 1x1 convolutions.
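On the second point: a 1x1 convolution is just a per-position matrix multiply, so the im2col expansion is pure overhead. A quick check (shapes chosen to match the benchmark below):

```
import torch

x = torch.randn(1, 512, 4, 4)
w = torch.randn(512, 512, 1, 1)

conv = torch.nn.functional.conv2d(x, w)
# Same computation without im2col: flatten H*W and matmul.
mm = (w.view(512, 512) @ x.view(1, 512, -1)).view(1, 512, 4, 4)
assert torch.allclose(conv, mm, atol=1e-3)
```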

Test Plan:
unit test and perf benchmark to show improvement
```
# imports needed to run this snippet standalone
import numpy as np
import torch
from timeit import Timer

num = 50

N = 1
C = 512
H = 4
W = 4

M = 512
kernel_h = 1
kernel_w = 1
stride_h = 1
stride_w = 1
padding_h = 0
padding_w = 0

X_np = np.random.randn(N, C, H, W).astype(np.float32)
W_np = np.random.randn(M, C, kernel_h, kernel_w).astype(np.float32)
X = torch.from_numpy(X_np)

conv2d_pt = torch.nn.Conv2d(
    C, M, (kernel_h, kernel_w), stride=(stride_h, stride_w),
    padding=(padding_h, padding_w), groups=1, bias=True)

class ConvNet(torch.nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        self.conv2d = conv2d_pt

    def forward(self, x):
        return self.conv2d(x)

model = ConvNet()

def pt_forward():
    # with torch.autograd.profiler.profile(record_shapes=True) as prof:
    model(X)
    # print(prof.key_averages().table(sort_by="self_cpu_time_total"))

torch._C._set_mkldnn_enabled(False)

t = Timer("pt_forward()", "from __main__ import pt_forward, X")
print("pt time =", t.timeit(num))  # total seconds for num iterations
```
Before the optimization:
pt time = 5.841153813526034
After the optimization:
pt time = 4.513134760782123

Differential Revision: D22149067

fbshipit-source-id: 538d9eea5b729e6c3da79444bde1784bde828876
2020-06-23 23:39:05 -07:00
cb26661fe4 Throws runtime error when torch.full would infer a float dtype from a bool or integral fill value (#40364)
Summary:
BC-breaking NOTE:

In PyTorch 1.6, bool and integral fill values given to torch.full must set the dtype or out keyword arguments. In prior versions of PyTorch these fill values would return float tensors by default, but in PyTorch 1.7 they will return a bool or long tensor, respectively. The documentation for torch.full has been updated to reflect this.
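Concretely, a sketch of the 1.6 behavior described above:

```
import torch

torch.full((2, 3), 7, dtype=torch.long)     # OK: dtype is explicit
torch.full((2, 3), True, dtype=torch.bool)  # OK: dtype is explicit
# torch.full((2, 3), 7)  # RuntimeError in 1.6; returns a long tensor in 1.7
```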

PR NOTE:

This PR causes torch.full to throw a runtime error when it would have inferred a float dtype by being given a boolean or integer value. A versioned symbol for torch.full is added to preserve the behavior of already serialized Torchscript programs. Existing tests for this behavior being deprecated have been updated to reflect it now being unsupported, and a couple new tests have been added to validate the versioned symbol behavior. The documentation of torch.full has also been updated to reflect this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40364

Differential Revision: D22176640

Pulled By: mruberry

fbshipit-source-id: b20158ebbcb4f6bf269d05a688bcf4f6c853a965
2020-06-23 23:27:22 -07:00
a2d4d9eca6 Improve Dynamic Library for Windows (#40365)
Summary:
1. Use LoadLibraryEx if available
2. Print more info on error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40365

Differential Revision: D22194974

Pulled By: malfet

fbshipit-source-id: e8309f39d78fd4681de5aa032288882910dff928
2020-06-23 20:29:48 -07:00
e2201e2ed8 Fixes caffe2 loading issues on Windows (#39513)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/27840#issuecomment-638715422.
Contains a bunch of fixes (https://github.com/pytorch/pytorch/pull/39376 + https://github.com/pytorch/pytorch/pull/39334 + https://github.com/pytorch/pytorch/pull/38302 + https://github.com/pytorch/pytorch/pull/35362)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39513

Differential Revision: D22190761

Pulled By: malfet

fbshipit-source-id: b2d52f6cb16c233d16071e9c0670dfff7da2710e
2020-06-23 20:11:24 -07:00
7c07c39845 [torch.distributed.rpc] Install method docstrings from PyRRef to RRef (#40461)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40461

It turned out `:inherited-members:` (see [doc](https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-autoclass)) is not really usable.

This is because pybind11 generates docstrings that annotate `self` with the parent-class type, `rpc.PyRRef`.

As a workaround, I am pulling the docstrings from the parent class, `PyRRef`, into the subclass, `RRef`, and doing surgery on the docstrings generated by pybind11.

{F241283111}

ghstack-source-id: 106472496

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par \
-r test_rref_str

buck build mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par \
-r test_return_local_rrefs

buck test mode/dev-nosan //caffe2/torch/fb/distributed/model_parallel/tests:test_elastic_averaging -- 'test_elastic_averaging_center \(caffe2\.torch\.fb\.distributed\.model_parallel\.tests\.test_elastic_averaging\.TestElasticAveragingCenter\)'

P134031188

Differential Revision: D7933834

fbshipit-source-id: c03a8a4c9d98888b64492a8caba1591595bfe247
2020-06-23 19:58:36 -07:00
7c737eab59 Remove table of contents at the top of rpc.rst (#40205)
Summary:
mattip - Can we remove the table of contents created by the `.. contents:: :local: :depth: 2` since this page isn't one of the large documentation pages (https://github.com/pytorch/pytorch/issues/38010) and is simply a landing page for the Distributed RPC Framework?

Changes made in this original PR: f10fbcc820 (diff-250b9b23fd6f1a5c15aecdb72afb9d7d)

cc mrshenli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40205

Differential Revision: D22194943

Pulled By: jlin27

fbshipit-source-id: 4e42845daf2784a17ad81645fe3b838385656bba
2020-06-23 19:45:11 -07:00
b7e044f0e5 Re-apply PyTorch pthreadpool changes
Summary:
This re-applies D21232894 (b9d3869df3) and D22162524, plus updates jni_deps in a few places
to avoid breaking host JNI tests.

Test Plan: `buck test @//fbandroid/mode/server //fbandroid/instrumentation_tests/com/facebook/caffe2:host-test`

Reviewed By: xcheng16

Differential Revision: D22199952

fbshipit-source-id: df13eef39c01738637ae8cf7f581d6ccc88d37d5
2020-06-23 19:26:21 -07:00
bdc00196d1 Enable XNNPACK ops on iOS and macOS.
Test Plan: buck run aibench:run_bench -- -b aibench/specifications/models/pytorch/pytext/pytext_mobile_inference.json --platform ios --framework pytorch --remote --devices D221AP-12.0.1

Reviewed By: xta0

Differential Revision: D21886736

fbshipit-source-id: ac482619dc1b41a110a3c4c79cc0339e5555edeb
2020-06-23 18:50:36 -07:00
c314e0deb5 [quant] Quantized adaptive_avg_pool3d (#40271)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40271

Closes #40244

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D22134318

Pulled By: z-a-f

fbshipit-source-id: 0489b6c083a3cbc21a1d81d8bfcc499372308088
2020-06-23 18:13:48 -07:00
6468bc4637 [JIT] script if tracing fix (#40468)
Summary:
Currently, torchvision annotates `batched_nms` with `torch.jit.script` so that the function gets compiled when it is traced and ONNX export works. Unfortunately, this means we are eagerly compiling `batched_nms`, which fails if torchvision isn't built with `torchvision.ops.nms`. As a result, torchvision doesn't work on torch hub right now.

`_script_if_tracing` could solve our problem here, but right now it does not correctly interact with recursive compilation. This PR fixes that bug.
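A minimal sketch of the lazy-compilation pattern involved, assuming the private `torch.jit._script_if_tracing` decorator named above (the function and its annotations here are illustrative):

```
import torch

@torch.jit._script_if_tracing
def clamp_boxes(boxes: torch.Tensor, limit: float) -> torch.Tensor:
    # Runs as plain Python in eager mode; only compiled when invoked
    # under tracing, so importing the module triggers no compilation.
    return boxes.clamp(min=0.0, max=limit)

def f(b: torch.Tensor) -> torch.Tensor:
    return clamp_boxes(b, 100.0)

traced = torch.jit.trace(f, torch.rand(4, 4))
```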
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40468

Reviewed By: jamesr66a

Differential Revision: D22195771

Pulled By: eellison

fbshipit-source-id: 83022ca0bab6d389a48a478aec03052c9282d2b7
2020-06-23 17:14:28 -07:00
92d3182c11 Revert D21232894: Unify PyTorch mobile's threadpool usage.
Test Plan: revert-hammer

Differential Revision:
D21232894 (b9d3869df3)

Original commit changeset: 8b3de86247fb

fbshipit-source-id: e6517cfec08f7dd0f4f8877dab62acf1d65afacd
2020-06-23 17:09:14 -07:00
ddb8565b25 Revert D22162469: [pytorch][PR] Migrate var & std to ATen
Test Plan: revert-hammer

Differential Revision:
D22162469 (7a3c223bbb)

Original commit changeset: 8d901c779767

fbshipit-source-id: 9e0fa439732478349c0ac6c7baafba063edfac5d
2020-06-23 17:04:15 -07:00
7e32e6048d Fix linspace step computation for large integral types (#40132)
Summary:
Convert start and end to `step_t` before computing the difference.
This should fix `torch.linspace(-2147483647, 2147483647, 10, dtype=torch.int32)`.
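The overflow, for the record: the raw difference end - start here is 4294967294, which does not fit in 32 bits, so the step came out wrong before the fix.

```
import torch

# After the fix this produces evenly spaced int32 values instead of
# values derived from an overflowed step.
print(torch.linspace(-2147483647, 2147483647, 10, dtype=torch.int32))
```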

Closes https://github.com/pytorch/pytorch/issues/40118
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40132

Differential Revision: D22190095

Pulled By: malfet

fbshipit-source-id: 01cb158a30c505191df663d021804d411b697871
2020-06-23 16:59:59 -07:00
883e4c44b2 Raise exception when trying to build PyTorch on 32-bit Windows system (#40321)
Summary:
Makes errors in cases described in https://github.com/pytorch/pytorch/issues/27815 more obvious
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40321

Differential Revision: D22198352

Pulled By: malfet

fbshipit-source-id: 327d81103c066048dcf5f900fd9083b09942af0e
2020-06-23 16:54:20 -07:00
a6a2dd14ea Fix typo in warning message (#39854)
Summary:
Fix typo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39854

Reviewed By: ezyang

Differential Revision: D22193544

Pulled By: zou3519

fbshipit-source-id: 04b9f59da7b6ba0649fc6d315adcf20685e10930
2020-06-23 16:47:35 -07:00
0e26a03ef9 [quant][graphmode] Enable inplace option for top level API (#40414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40414

After `_reconstruct` was added to RecursiveScriptModule (https://github.com/pytorch/pytorch/pull/39979),
we can support an inplace option in the quantization API.

Test Plan: Imported from OSS

Differential Revision: D22178326

fbshipit-source-id: c78bc2bcf2c42b06280c12262bb31aebcadc6c32
2020-06-23 16:42:48 -07:00
2e6da36298 [android][ci] Fix CI packaging headers to aar (#40442)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40442

Problem:
Nightly builds do not include the libtorch headers, unlike local builds.
The reason is that the path on docker images differs from the local path when building with `scripts/build_pytorch_android.sh`.

Solution:
Introduce a gradle property to specify the path, and add it to the gradle build job and the snapshots publishing job, which run on the same docker image.

Test:
ci-all jobs check: https://github.com/pytorch/pytorch/pull/40443
Checking that the gradle build results in headers inside the aar.

Test Plan: Imported from OSS

Differential Revision: D22190955

Pulled By: IvanKobzarev

fbshipit-source-id: 9379458d8ab024ee991ca205a573c21d649e5f8a
2020-06-23 16:41:12 -07:00
b9d3869df3 Unify PyTorch mobile's threadpool usage. (#37243)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37243

*** Why ***

As it stands, we have two thread pool solutions concurrently in use in PyTorch mobile: (1) the open source pthreadpool library under third_party, and (2) Caffe2's implementation of pthreadpool under caffe2/utils/threadpool.  Since the primary use-case of the latter has been to act as a drop-in replacement for the third party version so as to enable integration and usage from within NNPACK and QNNPACK, Caffe2's implementation is intentionally written to the exact same interface as the third party version.

The original argument in favor of C2's implementation has been improved performance as a result of using spin locks, as opposed to relinquishing the thread's time slot and putting it to sleep - a less expensive operation up to a point.  That seems to have given C2's implementation the upper hand in performance, hence justifying the added maintenance complexity, until the third party version improved in parallel surpassing the efficiency of C2's implementation as I have verified in benchmarks.  With that advantage gone, there is no reason to continue using C2's implementation in PyTorch mobile either from the perspective of performance or code hygiene.  As a matter of fact, there is considerable performance benefit to be had as a result of using the third party version as it currently stands.

This is a tricky change though, mainly because in order to avoid potential performance regressions, of which I have witnessed none but just in abundance of caution, we have decided to continue using the internal C2's implementation whenever building for Caffe2.  Again, this is mainly to avoid potential performance regressions in production C2 use cases even if doing so results in reduced performance as far as I can tell.

So to summarize, today, and as it currently stands, we are using C2's implementation for (1) NNPACK, (2) PyTorch QNNPACK, and (3) ATen parallel_for on mobile builds, while using the third party version of pthreadpool for XNNPACK as XNNPACK does not provide any build options to link against an external implementation unlike NNPACK and QNNPACK do.

The goal of this PR then, is to unify all usage on mobile to the third party implementation both for improved performance and better code hygiene.  This applies to PyTorch's use of NNPACK, QNNPACK, XNNPACK, and mobile's implementation of ATen parallel_for, all getting routed to the
exact same third party implementation in this PR.

Considering that NNPACK, QNNPACK, and XNNPACK are not mobile specific, these benefits carry over to non-mobile builds of PyTorch (but not Caffe2) as well.  The implementation of ATen parallel_for on non-mobile builds remains unchanged.

*** How ***

This is where things get tricky.

A good deal of the build system complexity in this PR arises from our desire to maintain C2's implementation intact for C2's use.

pthreadpool is a C library with no concept of namespaces, which means two copies of the library cannot exist in the same binary or symbol collision will occur violating ODR.  This means that somehow, and based on some condition, we must decide on the choice of a pthreadpool implementation.  In practice, this has become more complicated as a result of all the possible combinations that USE_NNPACK, USE_QNNPACK, USE_PYTORCH_QNNPACK, USE_XNNPACK, USE_SYSTEM_XNNPACK, USE_SYSTEM_PTHREADPOOL and other variables can result in.  Having said that, I have done my best in this PR to surgically cut through this complexity in a way that minimizes the side effects, considering the significance of the performance we are leaving on the table, yet, as a result of this combinatorial explosion explained above I cannot guarantee that every single combination will work as expected on the first try.  I am heavily relying on CI to find any issues as local testing can only go that far.

Having said that, this PR provides a simple non mobile-specific C++ thread pool implementation on top of pthreadpool, namely caffe2::PThreadPool that automatically routes to C2's implementation or the third party version depending on the build configuration.  This simplifies the logic at the cost of pushing the complexity to the build scripts.  From there on, this thread pool is used in aten parallel_for, and NNPACK and family, again, routing all usage of threading to C2 or third party pthreadpool depending on the build configuration.

When it is all said and done, the layering will look like this:

a) aten::parallel_for, uses
b) caffe2::PThreadPool, which uses
c) pthreadpool C API, which delegates to
    c-1) third_party implementation of pthreadpool if that's what the build has requested, and the rabbit hole ends here.
    c-2) C2's implementation of pthreadpool if that's what the build has requested, which itself delegates to
    c-2-1) caffe2::ThreadPool, and the rabbit hole ends here.

NNPACK, and (PyTorch) QNNPACK directly hook into (c). They never go through (b).

Differential Revision: D21232894

Test Plan: Imported from OSS

Reviewed By: dreiss

Pulled By: AshkanAliabadi

fbshipit-source-id: 8b3de86247fbc3a327e811983e082f9d40081354
2020-06-23 16:34:51 -07:00
c7d79f35e3 Header rename complex_type.h -> complex.h (#39885)
Summary:
This file should have been renamed as `complex.h`, but unfortunately, it was named as `complex_type.h` due to a name clash with FBCode. Is this still the case and is it easy to resolve the name clash? Maybe related to the comment at https://github.com/pytorch/pytorch/pull/39834#issuecomment-642950012
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39885

Differential Revision: D22018575

Pulled By: ezyang

fbshipit-source-id: e237ccedbe2b30c31aca028a5b4c8c063087a30f
2020-06-23 16:27:09 -07:00
111b399c91 Delete requires_tensor (#40184)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40184

Whenever requires_tensor is True, it is also the case that abstract
is true.  Thus, it is not necessary to specify requires_tensor.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22187353

Pulled By: ezyang

fbshipit-source-id: d665bb69cffe491bd989495020e1ae32340aa9da
2020-06-23 16:18:28 -07:00
cc9075c5d4 Add some syntax sugar for when backends use the same function. (#40182)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40182

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D22187354

Pulled By: ezyang

fbshipit-source-id: 875a6a7837981b60830bd7b1c35d2a3802ed7dd7
2020-06-23 16:16:42 -07:00
d8ec19bc03 Revert D22072830: [wip] Upgrade msvc to 14.13
Test Plan: revert-hammer

Differential Revision:
D22072830

Original commit changeset: 6fa03725f3fe

fbshipit-source-id: 901de185e607810cb3871c2e4d23816848c97f4b
2020-06-23 16:13:03 -07:00
581ad48806 Revert D21581908: Move TensorOptions ops to c10
Test Plan: revert-hammer

Differential Revision:
D21581908

Original commit changeset: 6d4a9f526fd7

fbshipit-source-id: fe1e6368a09120ea40dea405e8409983541e3cb5
2020-06-23 16:10:07 -07:00
cbd53bfee8 [jit] Remove unnecessary clone APIs for script::Module and RecursiveScriptModule (#40297)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40297

Test Plan: Imported from OSS

Differential Revision: D22191660

fbshipit-source-id: 4b338ca82caaca04784bffe01fdae3d180c192f4
2020-06-23 16:03:22 -07:00
8c20fb6481 [JIT] freeze doc (#40409)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40409

Reviewed By: ezyang

Differential Revision: D22192709

Pulled By: eellison

fbshipit-source-id: 68cdb2e5040d31957fbd64690fdc03c058d13f9a
2020-06-23 15:44:03 -07:00
09285070a7 Doc fix for complex views (#40450)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40450

Test Plan: Imported from OSS

Differential Revision: D22190911

Pulled By: anjali411

fbshipit-source-id: eb13559c7a2f62d63344601c750b5715686e95c3
2020-06-23 15:03:22 -07:00
5fce7137a9 [WIP][JIT] Add ScriptModule._reconstruct (#39979)
Summary:
**Summary**
This commit adds an instance method `_reconstruct` that permits users
to reconstruct a `ScriptModule` from a given C++ `Module` instance.

**Testing**
This commit adds a unit test for `_reconstruct`.

**Fixes**
This pull request fixes https://github.com/pytorch/pytorch/issues/33912.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39979

Differential Revision: D22172323

Pulled By: SplitInfinity

fbshipit-source-id: 9aa6551c422a5a324b822a09cd8d7c660f99ca5c
2020-06-23 14:42:27 -07:00
5ad885b823 [Caffe2][Pruning] Make the caffe2 Sum operator support long types (#40379)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40379

The current Sum operator doesn't support Long, hence this change modifies the code to add it.

Test Plan: Write a test case

Reviewed By: jspark1105, yinghai

Differential Revision: D21917365

fbshipit-source-id: b37d2c100c70d17d2f89c309e40360ddfab584ee
2020-06-23 14:18:29 -07:00
b623bdeabb Move TensorOptions ops to c10 (#39492)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39492

This PR adds use_c10_dispatcher: full to ops taking TensorOptions. To allow this, since the c10 operator library doesn't know about TensorOptions, we need to register the operator kernels as optional<ScalarType>, optional<Device>, optional<Layout>, optional<bool> instead, and also call them this way.

Changes:

- Add `use_c10_dispatcher: full` to those ops.
- Write `hacky_wrapper_for_legacy_signatures`, which takes an old-style kernel (i.e. one written to take TensorOptions) and creates a wrapper kernel for it that takes the scattered optional<ScalarType>, optional<Device>, optional<Layout>, optional<bool> instead.
- Change codegen so that all op registrations are wrapped into `hacky_wrapper_for_legacy_signatures`. This is added to all ops but is a no-op if the op doesn't take TensorOptions. This allows us in the future to just change a kernel signature from TensorOptions to the scattered version and have it work without having to touch codegen.
- Change codegen so that the frontend calls those operators with expanded arguments instead of with a TensorOptions object. This is required because the kernels are now written in this way.

This PR does not remove TensorOptions special cases from codegen; instead it separates kernels from the codegen/frontend issues. After this, kernels can be worked on separately without having to touch codegen, and codegen can be worked on without having to touch kernels.

Codegen diff: P133121032

ghstack-source-id: 106426630

Test Plan: waitforsandcastle

Differential Revision: D21581908

fbshipit-source-id: 6d4a9f526fd70fae40581bf26f3ccf794ce6a89e
2020-06-23 14:13:34 -07:00
f6b9848c25 Use chain.from_iterable in optimizer.py (#40156)
Summary:
This is a faster and more idiomatic way of using `itertools.chain`. Instead of eagerly unpacking the whole outer iterable into arguments (as `chain(*iterables)` does) and holding it in memory, `chain.from_iterable` consumes it lazily, one inner iterable at a time. This can save on both runtime and memory.
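A small illustration of the difference (hypothetical data, not from optimizer.py):

```
from itertools import chain

param_groups = [[1, 2], [3, 4], [5, 6]]

# chain(*param_groups) unpacks the outer iterable eagerly into arguments;
# chain.from_iterable(param_groups) walks it lazily, one group at a time.
flat = list(chain.from_iterable(param_groups))
assert flat == [1, 2, 3, 4, 5, 6]
```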
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40156

Reviewed By: ezyang

Differential Revision: D22189038

Pulled By: vincentqb

fbshipit-source-id: 160b2c27f442686821a6ea541e1f48f4a846c186
2020-06-23 14:07:05 -07:00
0e074074f3 Disable inlining an opaque tensor into a constant (#40367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40367

If the tensor has no storage, then do not inline it as a constant. This
situation arises when MKLDNN tensors are used.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D22158240

Pulled By: bzinodev

fbshipit-source-id: 8d2879044f2429004983a1242d837367b75a9f2a
2020-06-23 13:28:31 -07:00
f000b44d89 Fork/Join Inline Docs (relanding) (#40438)
Summary:
Added fork/wait to docs/source/jit.rst; hopefully that will fix the test error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40438

Differential Revision: D22188152

Pulled By: eellison

fbshipit-source-id: c19277284455fb6e7c0138b0c1423d90b147d18e
2020-06-23 13:25:51 -07:00
d21ee2de66 [wip] Upgrade msvc to 14.13 (#40109)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40109

ghstack-source-id: 106426627

Test Plan: oss CI

Differential Revision: D22072830

fbshipit-source-id: 6fa03725f3fe272795553c9c4acf46130b8c6039
2020-06-23 13:05:36 -07:00
3252 changed files with 251544 additions and 74223 deletions

View File

@@ -178,8 +178,7 @@ CircleCI creates a final yaml file by inlining every <<* segment, so if we were
So, CircleCI has several executor types: macos, machine, and docker are the ones we use. The 'machine' executor gives you two cores on some linux vm. The 'docker' executor gives you considerably more cores (nproc was 32 instead of 2 back when I tried in February). Since the dockers are faster, we try to run everything that we can in dockers. Thus
* linux build jobs use the docker executor. Running them on the docker executor was at least 2x faster than running them on the machine executor
* linux test jobs use the machine executor and spin up their own docker. Why this nonsense? It's cause we run nvidia-docker for our GPU tests; any code that calls into the CUDA runtime needs to be run on nvidia-docker. To run a nvidia-docker you need to install some nvidia packages on the host machine and then call docker with the '—runtime nvidia' argument. CircleCI doesn't support this, so we have to do it ourself.
* This is not just a mere inconvenience. **This blocks all of our linux tests from using more than 2 cores.** But there is nothing that we can do about it, but wait for a fix on circleci's side. Right now, we only run some smoke tests (some simple imports) on the binaries, but this also affects non-binary test jobs.
* linux test jobs use the machine executor in order for them to properly interface with GPUs since docker executors cannot execute with attached GPUs
* linux upload jobs use the machine executor. The upload jobs are so short that it doesn't really matter what they use
* linux smoke test jobs use the machine executor for the same reason as the linux test jobs
@@ -419,8 +418,6 @@ You can build Linux binaries locally easily using docker.
# in the docker container then you will see path/to/foo/baz on your local
# machine. You could also clone the pytorch and builder repos in the docker.
#
# If you're building a CUDA binary then use `nvidia-docker run` instead, see below.
#
# If you know how, add ccache as a volume too and speed up everything
docker run \
-v your/pytorch/repo:/pytorch \
@@ -444,9 +441,7 @@ export DESIRED_CUDA=cpu
**Building CUDA binaries on docker**
To build a CUDA binary you need to use `nvidia-docker run` instead of just `docker run` (or you can manually pass `--runtime=nvidia`). This adds some needed libraries and things to build CUDA stuff.
You can build CUDA binaries on CPU only machines, but you can only run CUDA binaries on CUDA machines. This means that you can build a CUDA binary on a docker on your laptop if you so choose (though its gonna take a loong time).
You can build CUDA binaries on CPU only machines, but you can only run CUDA binaries on CUDA machines. This means that you can build a CUDA binary on a docker on your laptop if you so choose (though its gonna take a long time).
For Facebook employees, ask about beefy machines that have docker support and use those instead of your laptop; it will be 5x as fast.

View File

@@ -25,8 +25,10 @@ DEPS_INCLUSION_DIMENSIONS = [
]
def get_processor_arch_name(cuda_version):
return "cpu" if not cuda_version else "cu" + cuda_version
def get_processor_arch_name(gpu_version):
return "cpu" if not gpu_version else (
"cu" + gpu_version.strip("cuda") if gpu_version.startswith("cuda") else gpu_version
)
LINUX_PACKAGE_VARIANTS = OrderedDict(
@@ -42,7 +44,7 @@ LINUX_PACKAGE_VARIANTS = OrderedDict(
)
CONFIG_TREE_DATA = OrderedDict(
linux=(dimensions.CUDA_VERSIONS, LINUX_PACKAGE_VARIANTS),
linux=(dimensions.GPU_VERSIONS, LINUX_PACKAGE_VARIANTS),
macos=([None], OrderedDict(
wheel=dimensions.STANDARD_PYTHON_VERSIONS,
conda=dimensions.STANDARD_PYTHON_VERSIONS,
@@ -50,13 +52,17 @@ CONFIG_TREE_DATA = OrderedDict(
"3.7",
],
)),
windows=(dimensions.CUDA_VERSIONS, OrderedDict(
wheel=dimensions.STANDARD_PYTHON_VERSIONS,
conda=dimensions.STANDARD_PYTHON_VERSIONS,
libtorch=[
"3.7",
],
)),
# Skip CUDA-9.2 builds on Windows
windows=(
[v for v in dimensions.GPU_VERSIONS if v not in ['cuda92'] + dimensions.ROCM_VERSION_LABELS],
OrderedDict(
wheel=dimensions.STANDARD_PYTHON_VERSIONS,
conda=dimensions.STANDARD_PYTHON_VERSIONS,
libtorch=[
"3.7",
],
)
),
)
# GCC config variants:
@@ -93,12 +99,12 @@ class TopLevelNode(ConfigNode):
class OSConfigNode(ConfigNode):
def __init__(self, parent, os_name, cuda_versions, py_tree):
def __init__(self, parent, os_name, gpu_versions, py_tree):
super(OSConfigNode, self).__init__(parent, os_name)
self.py_tree = py_tree
self.props["os_name"] = os_name
self.props["cuda_versions"] = cuda_versions
self.props["gpu_versions"] = gpu_versions
def get_children(self):
return [PackageFormatConfigNode(self, k, v) for k, v in self.py_tree.items()]
@@ -117,7 +123,7 @@ class PackageFormatConfigNode(ConfigNode):
elif self.find_prop("os_name") == "windows" and self.find_prop("package_format") == "libtorch":
return [WindowsLibtorchConfigNode(self, v) for v in WINDOWS_LIBTORCH_CONFIG_VARIANTS]
else:
return [ArchConfigNode(self, v) for v in self.find_prop("cuda_versions")]
return [ArchConfigNode(self, v) for v in self.find_prop("gpu_versions")]
class LinuxGccConfigNode(ConfigNode):
@@ -127,14 +133,22 @@ class LinuxGccConfigNode(ConfigNode):
self.props["gcc_config_variant"] = gcc_config_variant
def get_children(self):
cuda_versions = self.find_prop("cuda_versions")
gpu_versions = self.find_prop("gpu_versions")
# XXX devtoolset7 on CUDA 9.0 is temporarily disabled
# see https://github.com/pytorch/pytorch/issues/20066
if self.find_prop("gcc_config_variant") == 'devtoolset7':
cuda_versions = filter(lambda x: x != "90", cuda_versions)
gpu_versions = filter(lambda x: x != "cuda_90", gpu_versions)
return [ArchConfigNode(self, v) for v in cuda_versions]
# XXX disabling conda rocm build since docker images are not there
if self.find_prop("package_format") == 'conda':
gpu_versions = filter(lambda x: x not in dimensions.ROCM_VERSION_LABELS, gpu_versions)
# XXX libtorch rocm build is temporarily disabled
if self.find_prop("package_format") == 'libtorch':
gpu_versions = filter(lambda x: x not in dimensions.ROCM_VERSION_LABELS, gpu_versions)
return [ArchConfigNode(self, v) for v in gpu_versions]
class WindowsLibtorchConfigNode(ConfigNode):
@@ -144,14 +158,14 @@ class WindowsLibtorchConfigNode(ConfigNode):
self.props["libtorch_config_variant"] = libtorch_config_variant
def get_children(self):
return [ArchConfigNode(self, v) for v in self.find_prop("cuda_versions")]
return [ArchConfigNode(self, v) for v in self.find_prop("gpu_versions")]
class ArchConfigNode(ConfigNode):
def __init__(self, parent, cu):
super(ArchConfigNode, self).__init__(parent, get_processor_arch_name(cu))
def __init__(self, parent, gpu):
super(ArchConfigNode, self).__init__(parent, get_processor_arch_name(gpu))
self.props["cu"] = cu
self.props["gpu"] = gpu
def get_children(self):
return [PyVersionConfigNode(self, v) for v in self.find_prop("python_versions")]

View File

@@ -6,10 +6,10 @@ import cimodel.lib.conf_tree as conf_tree
import cimodel.lib.miniutils as miniutils
class Conf(object):
def __init__(self, os, cuda_version, pydistro, parms, smoke, libtorch_variant, gcc_config_variant, libtorch_config_variant):
def __init__(self, os, gpu_version, pydistro, parms, smoke, libtorch_variant, gcc_config_variant, libtorch_config_variant):
self.os = os
self.cuda_version = cuda_version
self.gpu_version = gpu_version
self.pydistro = pydistro
self.parms = parms
self.smoke = smoke
@@ -18,7 +18,7 @@ class Conf(object):
self.libtorch_config_variant = libtorch_config_variant
def gen_build_env_parms(self):
elems = [self.pydistro] + self.parms + [binary_build_data.get_processor_arch_name(self.cuda_version)]
elems = [self.pydistro] + self.parms + [binary_build_data.get_processor_arch_name(self.gpu_version)]
if self.gcc_config_variant is not None:
elems.append(str(self.gcc_config_variant))
if self.libtorch_config_variant is not None:
@@ -37,9 +37,12 @@ class Conf(object):
docker_distro_prefix = miniutils.override(self.pydistro, docker_word_substitution)
# The cpu nightlies are built on the pytorch/manylinux-cuda102 docker image
alt_docker_suffix = self.cuda_version or "102"
docker_distro_suffix = "" if self.pydistro == "conda" else alt_docker_suffix
return miniutils.quote("pytorch/" + docker_distro_prefix + "-cuda" + docker_distro_suffix)
# TODO cuda images should consolidate into tag-base images similar to rocm
alt_docker_suffix = "cuda102" if not self.gpu_version else (
"rocm:" + self.gpu_version.strip("rocm") if self.gpu_version.startswith("rocm") else self.gpu_version)
docker_distro_suffix = alt_docker_suffix if self.pydistro != "conda" else (
"cuda" if alt_docker_suffix.startswith("cuda") else "rocm")
return miniutils.quote("pytorch/" + docker_distro_prefix + "-" + docker_distro_suffix)
def get_name_prefix(self):
return "smoke" if self.smoke else "binary"
@@ -69,14 +72,10 @@
"update_s3_htmls",
]
job_def["filters"] = branch_filters.gen_filter_dict(
branches_list=["nightly"],
tags_list=[branch_filters.RC_PATTERN],
branches_list=["postnightly"],
)
else:
if phase in ["upload"]:
filter_branch = "nightly"
else:
filter_branch = r"/.*/"
filter_branch = r"/.*/"
job_def["filters"] = branch_filters.gen_filter_dict(
branches_list=[filter_branch],
tags_list=[branch_filters.RC_PATTERN],
@@ -89,28 +88,61 @@
if not (self.smoke and self.os == "macos") and self.os != "windows":
job_def["docker_image"] = self.gen_docker_image()
if self.os != "windows" and self.cuda_version:
# fix this. only works on cuda not rocm
if self.os != "windows" and self.gpu_version:
job_def["use_cuda_docker_runtime"] = miniutils.quote("1")
else:
if self.os == "linux" and phase != "upload":
job_def["docker_image"] = self.gen_docker_image()
if phase == "test":
if self.cuda_version:
if self.gpu_version:
if self.os == "windows":
job_def["executor"] = "windows-with-nvidia-gpu"
else:
job_def["resource_class"] = "gpu.medium"
if phase == "upload":
job_def["context"] = "org-member"
job_def["requires"] = [
self.gen_build_name(upload_phase_dependency, nightly)
]
os_name = miniutils.override(self.os, {"macos": "mac"})
job_name = "_".join([self.get_name_prefix(), os_name, phase])
return {job_name : job_def}
def gen_upload_job(self, phase, requires_dependency):
"""Generate binary_upload job for configuration
Output looks similar to:
- binary_upload:
name: binary_linux_manywheel_3_7m_cu92_devtoolset7_nightly_upload
context: org-member
requires: binary_linux_manywheel_3_7m_cu92_devtoolset7_nightly_test
filters:
branches:
only:
- nightly
tags:
only: /v[0-9]+(\\.[0-9]+)*-rc[0-9]+/
package_type: manywheel
upload_subfolder: cu92
"""
return {
"binary_upload": OrderedDict({
"name": self.gen_build_name(phase, nightly=True),
"context": "org-member",
"requires": [self.gen_build_name(
requires_dependency,
nightly=True
)],
"filters": branch_filters.gen_filter_dict(
branches_list=["nightly"],
tags_list=[branch_filters.RC_PATTERN],
),
"package_type": self.pydistro,
"upload_subfolder": binary_build_data.get_processor_arch_name(
self.gpu_version,
),
})
}
def get_root(smoke, name):
return binary_build_data.TopLevelNode(
@@ -129,7 +161,7 @@ def gen_build_env_list(smoke):
for c in config_list:
conf = Conf(
c.find_prop("os_name"),
c.find_prop("cu"),
c.find_prop("gpu"),
c.find_prop("package_format"),
[c.find_prop("pyver")],
c.find_prop("smoke"),
@@ -149,32 +181,19 @@ def get_nightly_uploads():
mylist = []
for conf in configs:
phase_dependency = "test" if predicate_exclude_macos(conf) else "build"
mylist.append(conf.gen_workflow_job("upload", phase_dependency, nightly=True))
mylist.append(conf.gen_upload_job("upload", phase_dependency))
return mylist
def get_post_upload_jobs():
"""Generate jobs to update HTML indices and report binary sizes"""
configs = gen_build_env_list(False)
common_job_def = {
"context": "org-member",
"filters": branch_filters.gen_filter_dict(
branches_list=["nightly"],
tags_list=[branch_filters.RC_PATTERN],
),
"requires": [],
}
for conf in configs:
upload_job_name = conf.gen_build_name(
build_or_test="upload",
nightly=True
)
common_job_def["requires"].append(upload_job_name)
return [
{
"update_s3_htmls": {
"name": "update_s3_htmls",
**common_job_def,
"context": "org-member",
"filters": branch_filters.gen_filter_dict(
branches_list=["postnightly"],
),
},
},
]

View File

@@ -1,91 +0,0 @@
from cimodel.lib.conf_tree import ConfigNode, XImportant
from cimodel.lib.conf_tree import Ver
CONFIG_TREE_DATA = [
(Ver("ubuntu", "16.04"), [
([Ver("clang", "7")], [XImportant("onnx_main_py3.6"),
XImportant("onnx_ort1_py3.6"),
XImportant("onnx_ort2_py3.6")]),
]),
]
class TreeConfigNode(ConfigNode):
def __init__(self, parent, node_name, subtree):
super(TreeConfigNode, self).__init__(parent, self.modify_label(node_name))
self.subtree = subtree
self.init2(node_name)
# noinspection PyMethodMayBeStatic
def modify_label(self, label):
return str(label)
def init2(self, node_name):
pass
def get_children(self):
return [self.child_constructor()(self, k, v) for (k, v) in self.subtree]
def is_build_only(self):
if str(self.find_prop("language_version")) == "onnx_main_py3.6" or \
str(self.find_prop("language_version")) == "onnx_ort1_py3.6" or \
str(self.find_prop("language_version")) == "onnx_ort2_py3.6":
return False
return set(str(c) for c in self.find_prop("compiler_version")).intersection({
"clang3.8",
"clang3.9",
"clang7",
"android",
}) or self.find_prop("distro_version").name == "macos"
def is_test_only(self):
if str(self.find_prop("language_version")) == "onnx_ort1_py3.6" or \
str(self.find_prop("language_version")) == "onnx_ort2_py3.6":
return True
return False
class TopLevelNode(TreeConfigNode):
def __init__(self, node_name, subtree):
super(TopLevelNode, self).__init__(None, node_name, subtree)
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return DistroConfigNode
class DistroConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["distro_version"] = node_name
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return CompilerConfigNode
class CompilerConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["compiler_version"] = node_name
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return LanguageConfigNode
class LanguageConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["language_version"] = node_name
self.props["build_only"] = self.is_build_only()
self.props["test_only"] = self.is_test_only()
def child_constructor(self):
return ImportantConfigNode
class ImportantConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["important"] = True
def get_children(self):
return []

View File

@@ -1,174 +0,0 @@
from collections import OrderedDict
import cimodel.data.dimensions as dimensions
import cimodel.lib.conf_tree as conf_tree
from cimodel.lib.conf_tree import Ver
import cimodel.lib.miniutils as miniutils
from cimodel.data.caffe2_build_data import CONFIG_TREE_DATA, TopLevelNode
from cimodel.data.simple.util.branch_filters import gen_filter_dict
from dataclasses import dataclass
DOCKER_IMAGE_PATH_BASE = "308535385114.dkr.ecr.us-east-1.amazonaws.com/caffe2/"
DOCKER_IMAGE_VERSION = "376"
@dataclass
class Conf:
language: str
distro: Ver
# There could be multiple compiler versions configured (e.g. nvcc
# for gpu files and host compiler (gcc/clang) for cpu files)
compilers: [Ver]
build_only: bool
test_only: bool
is_important: bool
@property
def compiler_names(self):
return [c.name for c in self.compilers]
# TODO: Eventually we can probably just remove the cudnn7 everywhere.
def get_cudnn_insertion(self):
omit = self.language == "onnx_main_py3.6" \
or self.language == "onnx_ort1_py3.6" \
or self.language == "onnx_ort2_py3.6" \
or set(self.compiler_names).intersection({"android", "mkl", "clang"}) \
or str(self.distro) in ["ubuntu14.04", "macos10.13"]
return [] if omit else ["cudnn7"]
def get_build_name_root_parts(self):
return [
"caffe2",
self.language,
] + self.get_build_name_middle_parts()
def get_build_name_middle_parts(self):
return [str(c) for c in self.compilers] + self.get_cudnn_insertion() + [str(self.distro)]
def construct_phase_name(self, phase):
root_parts = self.get_build_name_root_parts()
build_name_substitutions = {
"onnx_ort1_py3.6": "onnx_main_py3.6",
"onnx_ort2_py3.6": "onnx_main_py3.6",
}
if phase == "build":
root_parts = [miniutils.override(r, build_name_substitutions) for r in root_parts]
return "_".join(root_parts + [phase]).replace(".", "_")
def get_platform(self):
platform = self.distro.name
if self.distro.name != "macos":
platform = "linux"
return platform
def gen_docker_image(self):
lang_substitutions = {
"onnx_main_py3.6": "py3.6",
"onnx_ort1_py3.6": "py3.6",
"onnx_ort2_py3.6": "py3.6",
"cmake": "py3",
}
lang = miniutils.override(self.language, lang_substitutions)
parts = [lang] + self.get_build_name_middle_parts()
return miniutils.quote(DOCKER_IMAGE_PATH_BASE + "-".join(parts) + ":" + str(DOCKER_IMAGE_VERSION))
def gen_workflow_params(self, phase):
parameters = OrderedDict()
lang_substitutions = {
"onnx_py3": "onnx-py3",
"onnx_main_py3.6": "onnx-main-py3.6",
"onnx_ort1_py3.6": "onnx-ort1-py3.6",
"onnx_ort2_py3.6": "onnx-ort2-py3.6",
}
lang = miniutils.override(self.language, lang_substitutions)
parts = [
"caffe2",
lang,
] + self.get_build_name_middle_parts() + [phase]
build_env_name = "-".join(parts)
parameters["build_environment"] = miniutils.quote(build_env_name)
if "ios" in self.compiler_names:
parameters["build_ios"] = miniutils.quote("1")
if phase == "test":
# TODO cuda should not be considered a compiler
if "cuda" in self.compiler_names:
parameters["use_cuda_docker_runtime"] = miniutils.quote("1")
if self.distro.name != "macos":
parameters["docker_image"] = self.gen_docker_image()
if self.build_only:
parameters["build_only"] = miniutils.quote("1")
if phase == "test":
resource_class = "large" if "cuda" not in self.compiler_names else "gpu.medium"
parameters["resource_class"] = resource_class
return parameters
def gen_workflow_job(self, phase):
job_def = OrderedDict()
job_def["name"] = self.construct_phase_name(phase)
if phase == "test":
job_def["requires"] = [self.construct_phase_name("build")]
job_name = "caffe2_" + self.get_platform() + "_test"
else:
job_name = "caffe2_" + self.get_platform() + "_build"
if not self.is_important:
job_def["filters"] = gen_filter_dict()
job_def.update(self.gen_workflow_params(phase))
return {job_name : job_def}
def get_root():
return TopLevelNode("Caffe2 Builds", CONFIG_TREE_DATA)
def instantiate_configs():
config_list = []
root = get_root()
found_configs = conf_tree.dfs(root)
for fc in found_configs:
c = Conf(
language=fc.find_prop("language_version"),
distro=fc.find_prop("distro_version"),
compilers=fc.find_prop("compiler_version"),
build_only=fc.find_prop("build_only"),
test_only=fc.find_prop("test_only"),
is_important=fc.find_prop("important"),
)
config_list.append(c)
return config_list
def get_workflow_jobs():
configs = instantiate_configs()
x = []
for conf_options in configs:
phases = ["build"]
if not conf_options.build_only:
phases = dimensions.PHASES
if conf_options.test_only:
phases = ["test"]
for phase in phases:
x.append(conf_options.gen_workflow_job(phase))
return x

View File

@@ -1,14 +1,23 @@
PHASES = ["build", "test"]
CUDA_VERSIONS = [
None, # cpu build
"92",
"101",
"102",
"110",
]
ROCM_VERSIONS = [
"3.7",
"3.8",
]
ROCM_VERSION_LABELS = ["rocm" + v for v in ROCM_VERSIONS]
GPU_VERSIONS = [None] + ["cuda" + v for v in CUDA_VERSIONS] + ROCM_VERSION_LABELS
STANDARD_PYTHON_VERSIONS = [
"3.6",
"3.7",
"3.8"
"3.8",
]

View File

@@ -3,15 +3,13 @@ from cimodel.lib.conf_tree import ConfigNode, X, XImportant
CONFIG_TREE_DATA = [
("xenial", [
(None, [
X("nightly"),
]),
("gcc", [
("5.4", [ # All this subtree rebases to master and then build
XImportant("3.6"),
("3.6", [
("important", [X(True)]),
("parallel_tbb", [X(True)]),
("parallel_native", [X(True)]),
("pure_torch", [X(True)]),
]),
]),
# TODO: bring back libtorch test
@@ -19,20 +17,41 @@
]),
("clang", [
("5", [
XImportant("3.6"), # This is actually the ASAN build
("3.6", [
("asan", [XImportant(True)]),
]),
]),
("7", [
("3.6", [
("onnx", [XImportant(True)]),
]),
]),
]),
("cuda", [
("9.2", [
X("3.6"),
("3.6", [
("cuda_gcc_override", [X("gcc5.4")])
X(True),
("cuda_gcc_override", [
("gcc5.4", [
('build_only', [XImportant(True)]),
]),
]),
])
]),
("10.1", [X("3.6")]),
("10.2", [
XImportant("3.6"),
("10.1", [
("3.6", [
('build_only', [X(True)]),
]),
]),
("10.2", [
("3.6", [
("important", [X(True)]),
("libtorch", [X(True)]),
]),
]),
("11.0", [
("3.8", [
X(True),
("libtorch", [XImportant(True)])
]),
]),
@@ -46,11 +65,23 @@ CONFIG_TREE_DATA = [
("9", [
("3.6", [
("xla", [XImportant(True)]),
("vulkan", [XImportant(True)]),
]),
]),
]),
("gcc", [
("9", [XImportant("3.8")]),
("9", [
("3.8", [
("coverage", [XImportant(True)]),
]),
]),
]),
("rocm", [
("3.7", [
("3.6", [
('build_only', [XImportant(True)]),
]),
]),
]),
]),
]
@@ -118,17 +149,33 @@ class ExperimentalFeatureConfigNode(TreeConfigNode):
experimental_feature = self.find_prop("experimental_feature")
next_nodes = {
"asan": AsanConfigNode,
"xla": XlaConfigNode,
"vulkan": VulkanConfigNode,
"parallel_tbb": ParallelTBBConfigNode,
"parallel_native": ParallelNativeConfigNode,
"onnx": ONNXConfigNode,
"libtorch": LibTorchConfigNode,
"important": ImportantConfigNode,
"build_only": BuildOnlyConfigNode,
"cuda_gcc_override": CudaGccOverrideConfigNode
"cuda_gcc_override": CudaGccOverrideConfigNode,
"coverage": CoverageConfigNode,
"pure_torch": PureTorchConfigNode,
}
return next_nodes[experimental_feature]
class PureTorchConfigNode(TreeConfigNode):
def modify_label(self, label):
return "PURE_TORCH=" + str(label)
def init2(self, node_name):
self.props["is_pure_torch"] = node_name
def child_constructor(self):
return ImportantConfigNode
class XlaConfigNode(TreeConfigNode):
def modify_label(self, label):
return "XLA=" + str(label)
@@ -140,6 +187,39 @@ class XlaConfigNode(TreeConfigNode):
return ImportantConfigNode
class AsanConfigNode(TreeConfigNode):
def modify_label(self, label):
return "Asan=" + str(label)
def init2(self, node_name):
self.props["is_asan"] = node_name
def child_constructor(self):
return ImportantConfigNode
class ONNXConfigNode(TreeConfigNode):
def modify_label(self, label):
return "Onnx=" + str(label)
def init2(self, node_name):
self.props["is_onnx"] = node_name
def child_constructor(self):
return ImportantConfigNode
class VulkanConfigNode(TreeConfigNode):
def modify_label(self, label):
return "Vulkan=" + str(label)
def init2(self, node_name):
self.props["is_vulkan"] = node_name
def child_constructor(self):
return ImportantConfigNode
class ParallelTBBConfigNode(TreeConfigNode):
def modify_label(self, label):
return "PARALLELTBB=" + str(label)
@@ -178,7 +258,7 @@ class CudaGccOverrideConfigNode(TreeConfigNode):
self.props["cuda_gcc_override"] = node_name
def child_constructor(self):
return ImportantConfigNode
return ExperimentalFeatureConfigNode
class BuildOnlyConfigNode(TreeConfigNode):
@@ -186,7 +266,16 @@ class BuildOnlyConfigNode(TreeConfigNode):
self.props["build_only"] = node_name
def child_constructor(self):
return ImportantConfigNode
return ExperimentalFeatureConfigNode
class CoverageConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["is_coverage"] = node_name
def child_constructor(self):
return ExperimentalFeatureConfigNode
class ImportantConfigNode(TreeConfigNode):

View File

@@ -1,14 +1,13 @@ from collections import OrderedDict
from collections import OrderedDict
from dataclasses import dataclass, field
from typing import List, Optional
from cimodel.data.pytorch_build_data import TopLevelNode, CONFIG_TREE_DATA
import cimodel.data.dimensions as dimensions
import cimodel.lib.conf_tree as conf_tree
import cimodel.lib.miniutils as miniutils
from cimodel.data.simple.util.branch_filters import gen_filter_dict
from cimodel.data.simple.util.docker_constants import gen_docker_image_path
from dataclasses import dataclass, field
from typing import List, Optional
from cimodel.data.pytorch_build_data import CONFIG_TREE_DATA, TopLevelNode
from cimodel.data.simple.util.branch_filters import gen_filter_dict, RC_PATTERN
from cimodel.data.simple.util.docker_constants import gen_docker_image
@dataclass
@@ -18,19 +17,25 @@ class Conf:
parms_list_ignored_for_docker_image: Optional[List[str]] = None
pyver: Optional[str] = None
cuda_version: Optional[str] = None
rocm_version: Optional[str] = None
# TODO expand this to cover all the USE_* that we want to test for
# tesnrorrt, leveldb, lmdb, redis, opencv, mkldnn, ideep, etc.
# (from https://github.com/pytorch/pytorch/pull/17323#discussion_r259453608)
is_xla: bool = False
vulkan: bool = False
is_vulkan: bool = False
is_pure_torch: bool = False
restrict_phases: Optional[List[str]] = None
gpu_resource: Optional[str] = None
dependent_tests: List = field(default_factory=list)
parent_build: Optional['Conf'] = None
parent_build: Optional["Conf"] = None
is_libtorch: bool = False
is_important: bool = False
parallel_backend: Optional[str] = None
@staticmethod
def is_test_phase(phase):
return "test" in phase
# TODO: Eliminate the special casing for docker paths
# In the short term, we *will* need to support special casing as docker images are merged for caffe2 and pytorch
def get_parms(self, for_docker):
@@ -42,31 +47,47 @@ class Conf:
leading.append("pytorch")
if self.is_xla and not for_docker:
leading.append("xla")
if self.is_vulkan and not for_docker:
leading.append("vulkan")
if self.is_libtorch and not for_docker:
leading.append("libtorch")
if self.is_pure_torch and not for_docker:
leading.append("pure_torch")
if self.parallel_backend is not None and not for_docker:
leading.append(self.parallel_backend)
cuda_parms = []
if self.cuda_version:
cuda_parms.extend(["cuda" + self.cuda_version, "cudnn7"])
cudnn = "cudnn8" if self.cuda_version.startswith("11.") else "cudnn7"
cuda_parms.extend(["cuda" + self.cuda_version, cudnn])
if self.rocm_version:
cuda_parms.extend([f"rocm{self.rocm_version}"])
result = leading + ["linux", self.distro] + cuda_parms + self.parms
if not for_docker and self.parms_list_ignored_for_docker_image is not None:
result = result + self.parms_list_ignored_for_docker_image
return result
def gen_docker_image_path(self):
parms_source = self.parent_build or self
base_build_env_name = "-".join(parms_source.get_parms(True))
image_name, _ = gen_docker_image(base_build_env_name)
return miniutils.quote(image_name)
return miniutils.quote(gen_docker_image_path(base_build_env_name))
def gen_docker_image_requires(self):
parms_source = self.parent_build or self
base_build_env_name = "-".join(parms_source.get_parms(True))
_, requires = gen_docker_image(base_build_env_name)
return miniutils.quote(requires)
def get_build_job_name_pieces(self, build_or_test):
return self.get_parms(False) + [build_or_test]
def gen_build_name(self, build_or_test):
return ("_".join(map(str, self.get_build_job_name_pieces(build_or_test)))).replace(".", "_").replace("-", "_")
return (
("_".join(map(str, self.get_build_job_name_pieces(build_or_test))))
.replace(".", "_")
.replace("-", "_")
)
def get_dependents(self):
return self.dependent_tests or []
@@ -78,20 +99,26 @@ class Conf:
build_env_name = "-".join(map(str, build_job_name_pieces))
parameters["build_environment"] = miniutils.quote(build_env_name)
parameters["docker_image"] = self.gen_docker_image_path()
if phase == "test" and self.gpu_resource:
if Conf.is_test_phase(phase) and self.gpu_resource:
parameters["use_cuda_docker_runtime"] = miniutils.quote("1")
if phase == "test":
if Conf.is_test_phase(phase):
resource_class = "large"
if self.gpu_resource:
resource_class = "gpu." + self.gpu_resource
if self.rocm_version is not None:
resource_class = "pytorch/amd-gpu"
parameters["resource_class"] = resource_class
if phase == "build" and self.rocm_version is not None:
parameters["resource_class"] = "xlarge"
if hasattr(self, 'filters'):
parameters['filters'] = self.filters
return parameters
def gen_workflow_job(self, phase):
job_def = OrderedDict()
job_def["name"] = self.gen_build_name(phase)
if phase == "test":
if Conf.is_test_phase(phase):
# TODO When merging the caffe2 and pytorch jobs, it might be convenient for a while to make a
# caffe2 test job dependent on a pytorch build job. This way we could quickly dedup the repeated
@@ -103,36 +130,59 @@
job_name = "pytorch_linux_test"
else:
job_name = "pytorch_linux_build"
job_def["requires"] = [self.gen_docker_image_requires()]
if not self.is_important:
job_def["filters"] = gen_filter_dict()
job_def.update(self.gen_workflow_params(phase))
return {job_name : job_def}
return {job_name: job_def}
# TODO This is a hack to special case some configs just for the workflow list
class HiddenConf(object):
def __init__(self, name, parent_build=None):
def __init__(self, name, parent_build=None, filters=None):
self.name = name
self.parent_build = parent_build
self.filters = filters
def gen_workflow_job(self, phase):
return {self.gen_build_name(phase): {"requires": [self.parent_build.gen_build_name("build")]}}
return {
self.gen_build_name(phase): {
"requires": [self.parent_build.gen_build_name("build")],
"filters": self.filters,
}
}
def gen_build_name(self, _):
return self.name
class DocPushConf(object):
def __init__(self, name, parent_build=None, branch="master"):
self.name = name
self.parent_build = parent_build
self.branch = branch
def gen_workflow_job(self, phase):
return {
"pytorch_doc_push": {
"name": self.name,
"branch": self.branch,
"requires": [self.parent_build],
"context": "org-member",
"filters": gen_filter_dict(branches_list=["nightly"],
tags_list=RC_PATTERN)
}
}
# TODO Convert these to graph nodes
def gen_dependent_configs(xenial_parent_config):
extra_parms = [
(["multigpu"], "large"),
(["NO_AVX2"], "medium"),
(["NO_AVX", "NO_AVX2"], "medium"),
(["nogpu", "NO_AVX2"], None),
(["nogpu", "NO_AVX"], None),
(["slow"], "medium"),
(["nogpu"], None),
]
configs = []
@@ -141,12 +191,12 @@ def gen_dependent_configs(xenial_parent_config):
c = Conf(
xenial_parent_config.distro,
["py3"] + parms,
pyver="3.6",
pyver=xenial_parent_config.pyver,
cuda_version=xenial_parent_config.cuda_version,
restrict_phases=["test"],
gpu_resource=gpu,
parent_build=xenial_parent_config,
is_important=xenial_parent_config.is_important,
is_important=False,
)
configs.append(c)
@ -157,9 +207,44 @@ def gen_dependent_configs(xenial_parent_config):
def gen_docs_configs(xenial_parent_config):
configs = []
for x in ["pytorch_python_doc_push", "pytorch_cpp_doc_push", "pytorch_doc_test"]:
configs.append(HiddenConf(x, parent_build=xenial_parent_config))
configs.append(
HiddenConf(
"pytorch_python_doc_build",
parent_build=xenial_parent_config,
filters=gen_filter_dict(branches_list=r"/.*/",
tags_list=RC_PATTERN),
)
)
configs.append(
DocPushConf(
"pytorch_python_doc_push",
parent_build="pytorch_python_doc_build",
branch="site",
)
)
configs.append(
HiddenConf(
"pytorch_cpp_doc_build",
parent_build=xenial_parent_config,
filters=gen_filter_dict(branches_list=r"/.*/",
tags_list=RC_PATTERN),
)
)
configs.append(
DocPushConf(
"pytorch_cpp_doc_push",
parent_build="pytorch_cpp_doc_build",
branch="master",
)
)
configs.append(
HiddenConf(
"pytorch_doc_test",
parent_build=xenial_parent_config
)
)
return configs
@ -186,12 +271,12 @@ def instantiate_configs():
compiler_name = fc.find_prop("compiler_name")
compiler_version = fc.find_prop("compiler_version")
is_xla = fc.find_prop("is_xla") or False
is_asan = fc.find_prop("is_asan") or False
is_onnx = fc.find_prop("is_onnx") or False
is_pure_torch = fc.find_prop("is_pure_torch") or False
is_vulkan = fc.find_prop("is_vulkan") or False
parms_list_ignored_for_docker_image = []
vulkan = fc.find_prop("vulkan") or False
if vulkan:
parms_list_ignored_for_docker_image.append("vulkan")
python_version = None
if compiler_name == "cuda" or compiler_name == "android":
python_version = fc.find_prop("pyver")
@ -200,9 +285,14 @@ def instantiate_configs():
parms_list = ["py" + fc.find_prop("pyver")]
cuda_version = None
rocm_version = None
if compiler_name == "cuda":
cuda_version = fc.find_prop("compiler_version")
elif compiler_name == "rocm":
rocm_version = fc.find_prop("compiler_version")
restrict_phases = ["build", "test1", "test2", "caffe2_test"]
elif compiler_name == "android":
android_ndk_version = fc.find_prop("compiler_version")
# TODO: do we need clang to compile host binaries like protoc?
@ -216,14 +306,19 @@ def instantiate_configs():
gcc_version = compiler_name + (fc.find_prop("compiler_version") or "")
parms_list.append(gcc_version)
# TODO: This is a nasty special case
if gcc_version == 'clang5' and not is_xla:
parms_list.append("asan")
python_version = fc.find_prop("pyver")
parms_list[0] = fc.find_prop("abbreviated_pyver")
if is_asan:
parms_list.append("asan")
python_version = fc.find_prop("pyver")
parms_list[0] = fc.find_prop("abbreviated_pyver")
restrict_phases = ["build", "test1", "test2"]
if cuda_version in ["9.2", "10", "10.1", "10.2"]:
# TODO The gcc version is orthogonal to CUDA version?
if is_onnx:
parms_list.append("onnx")
python_version = fc.find_prop("pyver")
parms_list[0] = fc.find_prop("abbreviated_pyver")
restrict_phases = ["build", "ort_test1", "ort_test2"]
if cuda_version:
cuda_gcc_version = fc.find_prop("cuda_gcc_override") or "gcc7"
parms_list.append(cuda_gcc_version)
@ -231,8 +326,13 @@ def instantiate_configs():
is_important = fc.find_prop("is_important") or False
parallel_backend = fc.find_prop("parallel_backend") or None
build_only = fc.find_prop("build_only") or False
if build_only and restrict_phases is None:
is_coverage = fc.find_prop("is_coverage") or False
# TODO: fix pure_torch python test packaging issue.
if build_only or is_pure_torch:
restrict_phases = ["build"]
if is_coverage and restrict_phases is None:
restrict_phases = ["build", "coverage_test"]
gpu_resource = None
if cuda_version and cuda_version != "10":
@ -244,8 +344,10 @@ def instantiate_configs():
parms_list_ignored_for_docker_image,
python_version,
cuda_version,
rocm_version,
is_xla,
vulkan,
is_vulkan,
is_pure_torch,
restrict_phases,
gpu_resource,
is_libtorch=is_libtorch,
@ -255,20 +357,33 @@ def instantiate_configs():
# run docs builds on "pytorch-linux-xenial-py3.6-gcc5.4". Docs builds
# should run on a CPU-only build that runs on all PRs.
if distro_name == 'xenial' and fc.find_prop("pyver") == '3.6' \
and cuda_version is None \
and parallel_backend is None \
and compiler_name == 'gcc' \
and fc.find_prop('compiler_version') == '5.4':
# XXX should this be updated to a more modern build? Projects are
# beginning to drop python3.6
if (
distro_name == "xenial"
and fc.find_prop("pyver") == "3.6"
and cuda_version is None
and parallel_backend is None
and not is_vulkan
and not is_pure_torch
and compiler_name == "gcc"
and fc.find_prop("compiler_version") == "5.4"
):
c.filters = gen_filter_dict(branches_list=r"/.*/",
tags_list=RC_PATTERN)
c.dependent_tests = gen_docs_configs(c)
if cuda_version == "10.1" and python_version == "3.6" and not is_libtorch:
if cuda_version == "10.2" and python_version == "3.6" and not is_libtorch:
c.dependent_tests = gen_dependent_configs(c)
if (compiler_name == "gcc"
and compiler_version == "5.4"
and not is_libtorch
and parallel_backend is None):
if (
compiler_name == "gcc"
and compiler_version == "5.4"
and not is_libtorch
and not is_vulkan
and not is_pure_torch
and parallel_backend is None
):
bc_breaking_check = Conf(
"backward-compatibility-check",
[],
@ -297,7 +412,7 @@ def get_workflow_jobs():
for phase in phases:
# TODO why does this not have a test?
if phase == "test" and conf_options.cuda_version == "10":
if Conf.is_test_phase(phase) and conf_options.cuda_version == "10":
continue
x.append(conf_options.gen_workflow_job(phase))


@ -0,0 +1,28 @@
from collections import OrderedDict
from cimodel.data.simple.util.branch_filters import gen_filter_dict
from cimodel.lib.miniutils import quote
CHANNELS_TO_PRUNE = ["pytorch-nightly", "pytorch-test"]
PACKAGES_TO_PRUNE = "pytorch torchvision torchaudio torchtext ignite torchcsprng"
def gen_workflow_job(channel: str):
return OrderedDict(
{
"anaconda_prune": OrderedDict(
{
"name": f"anaconda-prune-{channel}",
"context": quote("org-member"),
"packages": quote(PACKAGES_TO_PRUNE),
"channel": channel,
"filters": gen_filter_dict(branches_list=["postnightly"]),
}
)
}
)
def get_workflow_jobs():
return [gen_workflow_job(channel) for channel in CHANNELS_TO_PRUNE]
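For one channel this renders roughly as follows (a sketch, assuming quote() wraps its argument in literal double quotes):

    {"anaconda_prune": {
        "name": "anaconda-prune-pytorch-nightly",
        "context": '"org-member"',
        "packages": '"pytorch torchvision torchaudio torchtext ignite torchcsprng"',
        "channel": "pytorch-nightly",
        "filters": {"branches": {"only": ["postnightly"]}},
    }}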


@ -1,5 +1,7 @@
import cimodel.data.simple.util.branch_filters
from cimodel.data.simple.util.docker_constants import DOCKER_IMAGE_NDK
import cimodel.data.simple.util.branch_filters as branch_filters
from cimodel.data.simple.util.docker_constants import (
DOCKER_IMAGE_NDK, DOCKER_REQUIREMENT_NDK
)
class AndroidJob:
@ -34,10 +36,11 @@ class AndroidJob:
"name": full_job_name,
"build_environment": "\"{}\"".format(build_env_name),
"docker_image": "\"{}\"".format(DOCKER_IMAGE_NDK),
"requires": [DOCKER_REQUIREMENT_NDK]
}
if self.is_master_only:
props_dict["filters"] = cimodel.data.simple.util.branch_filters.gen_filter_dict()
props_dict["filters"] = branch_filters.gen_filter_dict(branch_filters.NON_PR_BRANCH_LIST)
return [{self.template_name: props_dict}]
@ -47,12 +50,14 @@ class AndroidGradleJob:
job_name,
template_name,
dependencies,
is_master_only=True):
is_master_only=True,
is_pr_only=False):
self.job_name = job_name
self.template_name = template_name
self.dependencies = dependencies
self.is_master_only = is_master_only
self.is_pr_only = is_pr_only
def gen_tree(self):
@ -62,7 +67,9 @@ class AndroidGradleJob:
}
if self.is_master_only:
props_dict["filters"] = cimodel.data.simple.util.branch_filters.gen_filter_dict()
props_dict["filters"] = branch_filters.gen_filter_dict(branch_filters.NON_PR_BRANCH_LIST)
elif self.is_pr_only:
props_dict["filters"] = branch_filters.gen_filter_dict(branch_filters.PR_BRANCH_LIST)
return [{self.template_name: props_dict}]
@ -77,7 +84,14 @@ WORKFLOW_DATA = [
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-x86_32",
"pytorch_android_gradle_build-x86_32",
["pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build"],
is_master_only=False),
is_master_only=False,
is_pr_only=True),
AndroidGradleJob(
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single",
"pytorch_android_gradle_custom_build_single",
[DOCKER_REQUIREMENT_NDK],
is_master_only=False,
is_pr_only=True),
AndroidGradleJob(
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build",
"pytorch_android_gradle_build",


@ -1,4 +1,7 @@
from cimodel.data.simple.util.docker_constants import DOCKER_IMAGE_GCC7
from cimodel.data.simple.util.docker_constants import (
DOCKER_IMAGE_GCC7,
DOCKER_REQUIREMENT_GCC7
)
def gen_job_name(phase):
@ -38,7 +41,10 @@ class BazelJob:
full_job_name = gen_job_name(self.phase)
build_env_name = "-".join(build_env_parts)
extra_requires = [gen_job_name("build")] if self.phase == "test" else []
extra_requires = (
[gen_job_name("build")] if self.phase == "test" else
[DOCKER_REQUIREMENT_GCC7]
)
props_dict = {
"build_environment": build_env_name,


@ -5,7 +5,7 @@ TODO: Refactor circleci/cimodel/data/binary_build_data.py to generate this file
NB: If you modify this file, you need to also modify
the binary_and_smoke_tests_on_pr variable in
pytorch-ci-hud to adjust the list of whitelisted builds
pytorch-ci-hud to adjust the allowed build list
at https://github.com/ezyang/pytorch-ci-hud/blob/master/src/BuildHistoryDisplay.js
Note:


@ -1,6 +1,7 @@
from collections import OrderedDict
from cimodel.lib.miniutils import quote
from cimodel.data.simple.util.branch_filters import gen_filter_dict, RC_PATTERN
# TODO: make this generated from a matrix rather than just a static list
@ -11,6 +12,7 @@ IMAGE_NAMES = [
"pytorch-linux-bionic-py3.6-clang9",
"pytorch-linux-bionic-cuda10.2-cudnn7-py3.6-clang9",
"pytorch-linux-bionic-py3.8-gcc9",
"pytorch-linux-bionic-rocm3.5.1-py3.6",
"pytorch-linux-xenial-cuda10-cudnn7-py3-gcc7",
"pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7",
"pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7",
@ -19,26 +21,34 @@ IMAGE_NAMES = [
"pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc7",
"pytorch-linux-xenial-py3-clang5-android-ndk-r19c",
"pytorch-linux-xenial-py3-clang5-asan",
"pytorch-linux-xenial-py3-clang7-onnx",
"pytorch-linux-xenial-py3.8",
"pytorch-linux-xenial-py3.6-clang7",
"pytorch-linux-xenial-py3.6-gcc4.8",
"pytorch-linux-xenial-py3.6-gcc5.4",
"pytorch-linux-xenial-py3.6-gcc5.4", # this one is used in doc builds
"pytorch-linux-xenial-py3.6-gcc7.2",
"pytorch-linux-xenial-py3.6-gcc7",
"pytorch-linux-xenial-pynightly",
"pytorch-linux-xenial-rocm3.3-py3.6",
"pytorch-linux-bionic-rocm3.7-py3.6",
"pytorch-linux-bionic-rocm3.8-py3.6",
]
def get_workflow_jobs():
"""Generates a list of docker image build definitions"""
return [
OrderedDict(
ret = []
for image_name in IMAGE_NAMES:
parameters = OrderedDict({
"name": quote(f"docker-{image_name}"),
"image_name": quote(image_name),
})
if image_name == "pytorch-linux-xenial-py3.6-gcc5.4":
# pushing documentation on tags requires CircleCI to also
# build all the dependencies on tags, including this docker image
parameters['filters'] = gen_filter_dict(branches_list=r"/.*/",
tags_list=RC_PATTERN)
ret.append(OrderedDict(
{
"docker_build_job": OrderedDict(
{"name": quote(image_name), "image_name": quote(image_name)}
)
"docker_build_job": parameters
}
)
for image_name in IMAGE_NAMES
]
))
return ret
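Each entry now carries a workflow-level name prefixed with "docker-", which is exactly the string other jobs list in their requires (compare the DOCKER_REQUIREMENT_* constants in docker_constants). A sketch of one rendered entry, assuming quote() adds literal double quotes:

    {"docker_build_job": {
        "name": '"docker-pytorch-linux-bionic-py3.8-gcc9"',
        "image_name": '"pytorch-linux-bionic-py3.8-gcc9"',
    }}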


@ -62,8 +62,8 @@ class IOSJob:
WORKFLOW_DATA = [
IOSJob(IOS_VERSION, ArchVariant("x86_64"), is_org_member_context=False),
IOSJob(IOS_VERSION, ArchVariant("arm64")),
IOSJob(IOS_VERSION, ArchVariant("arm64", True), extra_props={"op_list": "mobilenetv2.yaml"}),
# IOSJob(IOS_VERSION, ArchVariant("arm64")),
# IOSJob(IOS_VERSION, ArchVariant("arm64", True), extra_props={"op_list": "mobilenetv2.yaml"}),
]


@ -4,12 +4,23 @@ PyTorch Mobile PR builds (use linux host toolchain + mobile build options)
import cimodel.lib.miniutils as miniutils
import cimodel.data.simple.util.branch_filters
from cimodel.data.simple.util.docker_constants import DOCKER_IMAGE_ASAN, DOCKER_IMAGE_NDK
from cimodel.data.simple.util.docker_constants import (
DOCKER_IMAGE_ASAN,
DOCKER_REQUIREMENT_ASAN,
DOCKER_IMAGE_NDK,
DOCKER_REQUIREMENT_NDK
)
class MobileJob:
def __init__(self, docker_image, variant_parts, is_master_only=False):
def __init__(
self,
docker_image,
docker_requires,
variant_parts,
is_master_only=False):
self.docker_image = docker_image
self.docker_requires = docker_requires
self.variant_parts = variant_parts
self.is_master_only = is_master_only
@ -30,6 +41,7 @@ class MobileJob:
"build_environment": build_env_name,
"build_only": miniutils.quote(str(int(True))),
"docker_image": self.docker_image,
"requires": self.docker_requires,
"name": full_job_name,
}
@ -40,15 +52,27 @@ class MobileJob:
WORKFLOW_DATA = [
MobileJob(DOCKER_IMAGE_ASAN, ["build"]),
MobileJob(DOCKER_IMAGE_ASAN, ["custom", "build", "static"]),
MobileJob(
DOCKER_IMAGE_ASAN,
[DOCKER_REQUIREMENT_ASAN],
["build"]
),
# Use LLVM-DEV toolchain in android-ndk-r19c docker image
MobileJob(DOCKER_IMAGE_NDK, ["custom", "build", "dynamic"]),
MobileJob(
DOCKER_IMAGE_NDK,
[DOCKER_REQUIREMENT_NDK],
["custom", "build", "dynamic"]
),
# Use LLVM-DEV toolchain in android-ndk-r19c docker image
# Most of this CI is already covered by "mobile-custom-build-dynamic" job
MobileJob(DOCKER_IMAGE_NDK, ["code", "analysis"], True),
MobileJob(
DOCKER_IMAGE_NDK,
[DOCKER_REQUIREMENT_NDK],
["code", "analysis"],
True
),
]


@ -1,4 +1,7 @@
from cimodel.data.simple.util.docker_constants import DOCKER_IMAGE_NDK
from cimodel.data.simple.util.docker_constants import (
DOCKER_IMAGE_NDK,
DOCKER_REQUIREMENT_NDK
)
class AndroidNightlyJob:
@ -48,12 +51,13 @@ class AndroidNightlyJob:
return [{self.template_name: props_dict}]
BASE_REQUIRES = [DOCKER_REQUIREMENT_NDK]
WORKFLOW_DATA = [
AndroidNightlyJob(["x86_32"], "pytorch_linux_build"),
AndroidNightlyJob(["x86_64"], "pytorch_linux_build"),
AndroidNightlyJob(["arm", "v7a"], "pytorch_linux_build"),
AndroidNightlyJob(["arm", "v8a"], "pytorch_linux_build"),
AndroidNightlyJob(["x86_32"], "pytorch_linux_build", requires=BASE_REQUIRES),
AndroidNightlyJob(["x86_64"], "pytorch_linux_build", requires=BASE_REQUIRES),
AndroidNightlyJob(["arm", "v7a"], "pytorch_linux_build", requires=BASE_REQUIRES),
AndroidNightlyJob(["arm", "v8a"], "pytorch_linux_build", requires=BASE_REQUIRES),
AndroidNightlyJob(["android_gradle"], "pytorch_android_gradle_build",
with_docker=False,
requires=[


@ -60,7 +60,7 @@ BUILD_CONFIGS = [
WORKFLOW_DATA = BUILD_CONFIGS + [
IOSNightlyJob("binary", is_upload=True),
# IOSNightlyJob("binary", is_upload=True),
]


@ -4,6 +4,11 @@ NON_PR_BRANCH_LIST = [
r"/release\/.*/",
]
PR_BRANCH_LIST = [
r"/gh\/.*\/head/",
r"/pull\/.*/",
]
RC_PATTERN = r"/v[0-9]+(\.[0-9]+)*-rc[0-9]+/"
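As a quick sanity check of the release-candidate pattern (a standalone sketch; the surrounding slashes are CircleCI regex delimiters and are omitted here):

    import re
    pat = re.compile(r"v[0-9]+(\.[0-9]+)*-rc[0-9]+")
    assert pat.fullmatch("v1.7.0-rc1") is not None
    assert pat.fullmatch("v1.7.0") is None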
def gen_filter_dict(


@ -1,30 +1,33 @@
AWS_DOCKER_HOST = "308535385114.dkr.ecr.us-east-1.amazonaws.com"
# ARE YOU EDITING THIS NUMBER? MAKE SURE YOU READ THE GUIDANCE AT THE
# TOP OF .circleci/config.yml
DOCKER_IMAGE_TAG = "209062ef-ab58-422a-b295-36c4eed6e906"
def gen_docker_image(container_type):
return (
"/".join([AWS_DOCKER_HOST, "pytorch", container_type]),
f"docker-{container_type}",
)
def gen_docker_image_requires(image_name):
return [f"docker-{image_name}"]
def gen_docker_image_path(container_type):
return "/".join([
AWS_DOCKER_HOST,
"pytorch",
container_type + ":" + DOCKER_IMAGE_TAG,
])
DOCKER_IMAGE_BASIC, DOCKER_REQUIREMENT_BASE = gen_docker_image(
"pytorch-linux-xenial-py3.6-gcc5.4"
)
DOCKER_IMAGE_CUDA_10_2, DOCKER_REQUIREMENT_CUDA_10_2 = gen_docker_image(
"pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
)
DOCKER_IMAGE_GCC7, DOCKER_REQUIREMENT_GCC7 = gen_docker_image(
"pytorch-linux-xenial-py3.6-gcc7"
)
DOCKER_IMAGE_BASIC = gen_docker_image_path("pytorch-linux-xenial-py3.6-gcc5.4")
DOCKER_IMAGE_CUDA_10_2 = gen_docker_image_path("pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7")
DOCKER_IMAGE_GCC7 = gen_docker_image_path("pytorch-linux-xenial-py3.6-gcc7")
def gen_mobile_docker_name(specifier):
def gen_mobile_docker(specifier):
container_type = "pytorch-linux-xenial-py3-clang5-" + specifier
return gen_docker_image_path(container_type)
return gen_docker_image(container_type)
DOCKER_IMAGE_ASAN = gen_mobile_docker_name("asan")
DOCKER_IMAGE_ASAN, DOCKER_REQUIREMENT_ASAN = gen_mobile_docker("asan")
DOCKER_IMAGE_NDK = gen_mobile_docker_name("android-ndk-r19c")
DOCKER_IMAGE_NDK, DOCKER_REQUIREMENT_NDK = gen_mobile_docker("android-ndk-r19c")
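With the tag dropped from the path, each constant is now a (image path, requirement name) pair; the requirement half matches the "docker-<image>" workflow job generated in docker_definitions, so consumers can depend directly on the job that builds the image. For example:

    image, requirement = gen_docker_image("pytorch-linux-xenial-py3.6-gcc7")
    # image       == "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc7"
    # requirement == "docker-pytorch-linux-xenial-py3.6-gcc7"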


@ -43,8 +43,11 @@ class WindowsJob:
if base_phase == "test":
prerequisite_jobs.append("_".join(base_name_parts + ["build"]))
if self.cuda_version:
self.cudnn_version = 8 if self.cuda_version.major == 11 else 7
arch_env_elements = (
["cuda" + str(self.cuda_version.major), "cudnn7"]
["cuda" + str(self.cuda_version.major), "cudnn" + str(self.cudnn_version)]
if self.cuda_version
else ["cpu"]
)
@ -93,11 +96,14 @@ class WindowsJob:
class VcSpec:
def __init__(self, year, version_elements=None):
def __init__(self, year, version_elements=None, hide_version=False):
self.year = year
self.version_elements = version_elements or []
self.hide_version = hide_version
def get_elements(self):
if self.hide_version:
return [self.prefixed_year()]
return [self.prefixed_year()] + self.version_elements
def get_product(self):
@ -110,7 +116,7 @@ class VcSpec:
return "vs" + str(self.year)
def render(self):
return "_".join(filter(None, [self.prefixed_year(), self.dotted_version()]))
return "_".join(self.get_elements())
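Concretely, given the definitions above:

    VcSpec(2017, ["14", "11"]).render()               # -> "vs2017_14_11"
    VcSpec(2019).render()                             # -> "vs2019"
    VcSpec(2019, ["16"], hide_version=True).render()  # -> "vs2019" (version elements hidden)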
def FalsePred(_):
return False
@ -118,23 +124,22 @@ def FalsePred(_):
def TruePred(_):
return True
_VC2019 = VcSpec(2019)
WORKFLOW_DATA = [
# VS2017 CUDA-10.1
WindowsJob(None, VcSpec(2017, ["14", "11"]), CudaVersion(10, 1), master_only_pred=FalsePred),
WindowsJob(1, VcSpec(2017, ["14", "11"]), CudaVersion(10, 1)),
# VS2017 no-CUDA (builds only)
WindowsJob(None, VcSpec(2017, ["14", "16"]), CudaVersion(10, 1)),
WindowsJob(None, VcSpec(2017, ["14", "16"]), None),
# VS2019 CUDA-10.1
WindowsJob(None, VcSpec(2019), CudaVersion(10, 1)),
WindowsJob(1, VcSpec(2019), CudaVersion(10, 1)),
WindowsJob(2, VcSpec(2019), CudaVersion(10, 1)),
WindowsJob(None, _VC2019, CudaVersion(10, 1)),
WindowsJob(1, _VC2019, CudaVersion(10, 1)),
WindowsJob(2, _VC2019, CudaVersion(10, 1)),
# VS2019 CUDA-11.0
WindowsJob(None, _VC2019, CudaVersion(11, 0)),
WindowsJob(1, _VC2019, CudaVersion(11, 0), master_only_pred=TruePred),
WindowsJob(2, _VC2019, CudaVersion(11, 0), master_only_pred=TruePred),
# VS2019 CPU-only
WindowsJob(None, VcSpec(2019), None),
WindowsJob(1, VcSpec(2019), None),
WindowsJob(2, VcSpec(2019), None, master_only_pred=TruePred),
WindowsJob(1, VcSpec(2019), CudaVersion(10, 1), force_on_cpu=True),
WindowsJob(2, VcSpec(2019), CudaVersion(10, 1), force_on_cpu=True, master_only_pred=TruePred),
WindowsJob(None, _VC2019, None),
WindowsJob(1, _VC2019, None, master_only_pred=TruePred),
WindowsJob(2, _VC2019, None, master_only_pred=TruePred),
WindowsJob(1, _VC2019, CudaVersion(10, 1), force_on_cpu=True, master_only_pred=TruePred),
]

File diff suppressed because it is too large


@ -10,14 +10,35 @@ if [ -z "${image}" ]; then
exit 1
fi
# TODO: Generalize
OS="ubuntu"
DOCKERFILE="${OS}/Dockerfile"
if [[ "$image" == *-cuda* ]]; then
DOCKERFILE="${OS}-cuda/Dockerfile"
elif [[ "$image" == *-rocm* ]]; then
DOCKERFILE="${OS}-rocm/Dockerfile"
fi
function extract_version_from_image_name() {
eval export $2=$(echo "${image}" | perl -n -e"/$1(\d+(\.\d+)?(\.\d+)?)/ && print \$1")
if [ "x${!2}" = x ]; then
echo "variable '$2' not correctly parsed from image='$image'"
exit 1
fi
}
function extract_all_from_image_name() {
# parse $image into an array, splitting on '-'
keep_IFS="$IFS"
IFS="-"
declare -a parts=($image)
IFS="$keep_IFS"
unset keep_IFS
for part in "${parts[@]}"; do
name=$(echo "${part}" | perl -n -e"/([a-zA-Z]+)\d+(\.\d+)?(\.\d+)?/ && print \$1")
vername="${name^^}_VERSION"
# "py" is the odd one out, needs this special case
if [ "x${name}" = xpy ]; then
vername=ANACONDA_PYTHON_VERSION
fi
# skip non-conforming fields that carry no version string, such as "pytorch", "linux" or "xenial"
if [ -n "${name}" ]; then
extract_version_from_image_name "${name}" "${vername}"
fi
done
}
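# Illustration: image=pytorch-linux-bionic-rocm3.7-py3.6 splits into
#   pytorch / linux / bionic / rocm3.7 / py3.6; only the versioned parts
#   match, yielding ROCM_VERSION=3.7 and ANACONDA_PYTHON_VERSION=3.6
#   (via the "py" special case above).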
if [[ "$image" == *-trusty* ]]; then
UBUNTU_VERSION=14.04
@ -29,6 +50,26 @@ elif [[ "$image" == *-bionic* ]]; then
UBUNTU_VERSION=18.04
elif [[ "$image" == *-focal* ]]; then
UBUNTU_VERSION=20.04
elif [[ "$image" == *ubuntu* ]]; then
extract_version_from_image_name ubuntu UBUNTU_VERSION
elif [[ "$image" == *centos* ]]; then
extract_version_from_image_name centos CENTOS_VERSION
fi
if [ -n "${UBUNTU_VERSION}" ]; then
OS="ubuntu"
elif [ -n "${CENTOS_VERSION}" ]; then
OS="centos"
else
echo "Unable to derive operating system base..."
exit 1
fi
DOCKERFILE="${OS}/Dockerfile"
if [[ "$image" == *cuda* ]]; then
DOCKERFILE="${OS}-cuda/Dockerfile"
elif [[ "$image" == *rocm* ]]; then
DOCKERFILE="${OS}-rocm/Dockerfile"
fi
TRAVIS_DL_URL_PREFIX="https://s3.amazonaws.com/travis-python-archives/binaries/ubuntu/14.04/x86_64"
@ -71,13 +112,6 @@ case "$image" in
DB=yes
VISION=yes
;;
pytorch-linux-xenial-pynightly)
TRAVIS_PYTHON_VERSION=nightly
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
;;
pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc5.4)
CUDA_VERSION=9.2
CUDNN_VERSION=7
@ -126,7 +160,6 @@ case "$image" in
KATEX=yes
;;
pytorch-linux-xenial-cuda11.0-cudnn8-py3-gcc7)
UBUNTU_VERSION=16.04-rc
CUDA_VERSION=11.0
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.6
@ -143,6 +176,13 @@ case "$image" in
DB=yes
VISION=yes
;;
pytorch-linux-xenial-py3-clang7-onnx)
ANACONDA_PYTHON_VERSION=3.6
CLANG_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
;;
pytorch-linux-xenial-py3-clang5-android-ndk-r19c)
ANACONDA_PYTHON_VERSION=3.6
CLANG_VERSION=5.0
@ -167,6 +207,8 @@ case "$image" in
PROTOBUF=yes
DB=yes
VISION=yes
VULKAN_SDK_VERSION=1.2.148.0
SWIFTSHADER=yes
;;
pytorch-linux-bionic-py3.8-gcc9)
ANACONDA_PYTHON_VERSION=3.8
@ -194,7 +236,6 @@ case "$image" in
VISION=yes
;;
pytorch-linux-bionic-cuda11.0-cudnn8-py3.6-gcc9)
UBUNTU_VERSION=18.04-rc
CUDA_VERSION=11.0
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.6
@ -205,7 +246,6 @@ case "$image" in
KATEX=yes
;;
pytorch-linux-bionic-cuda11.0-cudnn8-py3.8-gcc9)
UBUNTU_VERSION=18.04-rc
CUDA_VERSION=11.0
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.8
@ -215,22 +255,52 @@ case "$image" in
VISION=yes
KATEX=yes
;;
pytorch-linux-xenial-rocm3.3-py3.6)
pytorch-linux-bionic-rocm3.7-py3.6)
ANACONDA_PYTHON_VERSION=3.6
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=3.3
# newer cmake version required
CMAKE_VERSION=3.6.3
ROCM_VERSION=3.7
;;
pytorch-linux-bionic-rocm3.3-py3.6)
pytorch-linux-bionic-rocm3.8-py3.6)
ANACONDA_PYTHON_VERSION=3.6
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=3.3
ROCM_VERSION=3.8
;;
*)
# Catch-all for builds that are not hardcoded.
PROTOBUF=yes
DB=yes
VISION=yes
echo "image '$image' did not match an existing build configuration"
if [[ "$image" == *py* ]]; then
extract_version_from_image_name py ANACONDA_PYTHON_VERSION
fi
if [[ "$image" == *cuda* ]]; then
extract_version_from_image_name cuda CUDA_VERSION
extract_version_from_image_name cudnn CUDNN_VERSION
fi
if [[ "$image" == *rocm* ]]; then
extract_version_from_image_name rocm ROCM_VERSION
fi
if [[ "$image" == *gcc* ]]; then
extract_version_from_image_name gcc GCC_VERSION
fi
if [[ "$image" == *clang* ]]; then
extract_version_from_image_name clang CLANG_VERSION
fi
if [[ "$image" == *devtoolset* ]]; then
extract_version_from_image_name devtoolset DEVTOOLSET_VERSION
fi
if [[ "$image" == *glibc* ]]; then
extract_version_from_image_name glibc GLIBC_VERSION
fi
if [[ "$image" == *cmake* ]]; then
extract_version_from_image_name cmake CMAKE_VERSION
fi
;;
esac
# Set Jenkins UID and GID if running Jenkins
@ -259,6 +329,9 @@ docker build \
--build-arg "JENKINS_UID=${JENKINS_UID:-}" \
--build-arg "JENKINS_GID=${JENKINS_GID:-}" \
--build-arg "UBUNTU_VERSION=${UBUNTU_VERSION}" \
--build-arg "CENTOS_VERSION=${CENTOS_VERSION}" \
--build-arg "DEVTOOLSET_VERSION=${DEVTOOLSET_VERSION}" \
--build-arg "GLIBC_VERSION=${GLIBC_VERSION}" \
--build-arg "CLANG_VERSION=${CLANG_VERSION}" \
--build-arg "ANACONDA_PYTHON_VERSION=${ANACONDA_PYTHON_VERSION}" \
--build-arg "TRAVIS_PYTHON_VERSION=${TRAVIS_PYTHON_VERSION}" \
@ -268,6 +341,8 @@ docker build \
--build-arg "ANDROID=${ANDROID}" \
--build-arg "ANDROID_NDK=${ANDROID_NDK_VERSION}" \
--build-arg "GRADLE_VERSION=${GRADLE_VERSION}" \
--build-arg "VULKAN_SDK_VERSION=${VULKAN_SDK_VERSION}" \
--build-arg "SWIFTSHADER=${SWIFTSHADER}" \
--build-arg "CMAKE_VERSION=${CMAKE_VERSION:-}" \
--build-arg "NINJA_VERSION=${NINJA_VERSION:-}" \
--build-arg "KATEX=${KATEX:-}" \
@ -277,6 +352,14 @@ docker build \
"$@" \
.
# NVIDIA dockers for RC releases use tag names like `11.0-cudnn8-devel-ubuntu18.04-rc`,
# for this case we will set UBUNTU_VERSION to `18.04-rc` so that the Dockerfile could
# find the correct image. As a result, here we have to replace the
# "$UBUNTU_VERSION" == "18.04-rc"
# with
# "$UBUNTU_VERSION" == "18.04"
UBUNTU_VERSION=$(echo ${UBUNTU_VERSION} | sed 's/-rc$//')
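# e.g. "18.04-rc" becomes "18.04"; a plain "18.04" passes through unchanged.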
function drun() {
docker run --rm "$tmp_tag" $*
}


@ -13,7 +13,7 @@ retry () {
#until we find a way to reliably reuse previous build, this last_tag is not in use
# last_tag="$(( CIRCLE_BUILD_NUM - 1 ))"
tag="${CIRCLE_WORKFLOW_ID}"
tag="${DOCKER_TAG}"
registry="308535385114.dkr.ecr.us-east-1.amazonaws.com"
@ -45,9 +45,5 @@ trap "docker logout ${registry}" EXIT
docker push "${image}:${tag}"
# TODO: Get rid of duplicate tagging once ${DOCKER_TAG} becomes the default
docker tag "${image}:${tag}" "${image}:${DOCKER_TAG}"
docker push "${image}:${DOCKER_TAG}"
docker save -o "${IMAGE_NAME}:${tag}.tar" "${image}:${tag}"
aws s3 cp "${IMAGE_NAME}:${tag}.tar" "s3://ossci-linux-build/pytorch/base/${IMAGE_NAME}:${tag}.tar" --acl public-read


@ -0,0 +1,93 @@
ARG CENTOS_VERSION
FROM centos:${CENTOS_VERSION}
ARG CENTOS_VERSION
# Install required packages to build Caffe2
# Install common dependencies (so that this step can be cached separately)
ARG EC2
ADD ./common/install_base.sh install_base.sh
RUN bash ./install_base.sh && rm install_base.sh
# Install devtoolset
ARG DEVTOOLSET_VERSION
ADD ./common/install_devtoolset.sh install_devtoolset.sh
RUN bash ./install_devtoolset.sh && rm install_devtoolset.sh
ENV BASH_ENV "/etc/profile"
# (optional) Install non-default glibc version
ARG GLIBC_VERSION
ADD ./common/install_glibc.sh install_glibc.sh
RUN if [ -n "${GLIBC_VERSION}" ]; then bash ./install_glibc.sh; fi
RUN rm install_glibc.sh
# Install user
ADD ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
# Install conda
ENV PATH /opt/conda/bin:$PATH
ARG ANACONDA_PYTHON_VERSION
ADD ./common/install_conda.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
# (optional) Install protobuf for ONNX
ARG PROTOBUF
ADD ./common/install_protobuf.sh install_protobuf.sh
RUN if [ -n "${PROTOBUF}" ]; then bash ./install_protobuf.sh; fi
RUN rm install_protobuf.sh
ENV INSTALLED_PROTOBUF ${PROTOBUF}
# (optional) Install database packages like LMDB and LevelDB
ARG DB
ADD ./common/install_db.sh install_db.sh
RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
RUN rm install_db.sh
ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV and ffmpeg
ARG VISION
ADD ./common/install_vision.sh install_vision.sh
RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
RUN rm install_vision.sh
ENV INSTALLED_VISION ${VISION}
# Install rocm
ARG ROCM_VERSION
ADD ./common/install_rocm.sh install_rocm.sh
RUN bash ./install_rocm.sh
RUN rm install_rocm.sh
ENV PATH /opt/rocm/bin:$PATH
ENV PATH /opt/rocm/hcc/bin:$PATH
ENV PATH /opt/rocm/hip/bin:$PATH
ENV PATH /opt/rocm/opencl/bin:$PATH
ENV PATH /opt/rocm/llvm/bin:$PATH
ENV HIP_PLATFORM hcc
ENV LANG en_US.utf8
ENV LC_ALL en_US.utf8
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
ADD ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
# (optional) Install non-default Ninja version
ARG NINJA_VERSION
ADD ./common/install_ninja.sh install_ninja.sh
RUN if [ -n "${NINJA_VERSION}" ]; then bash ./install_ninja.sh; fi
RUN rm install_ninja.sh
# Install ccache/sccache (do this last, so we get priority in PATH)
ADD ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH
RUN bash ./install_cache.sh && rm install_cache.sh
# Include BUILD_ENVIRONMENT environment variable in image
ARG BUILD_ENVIRONMENT
ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
USER jenkins
CMD ["bash"]


@ -4,13 +4,15 @@ set -ex
[ -n "${ANDROID_NDK}" ]
_https_amazon_aws=https://ossci-android.s3.amazonaws.com
apt-get update
apt-get install -y --no-install-recommends autotools-dev autoconf unzip
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
pushd /tmp
curl -Os --retry 3 https://dl.google.com/android/repository/android-ndk-${ANDROID_NDK}-linux-x86_64.zip
curl -Os --retry 3 $_https_amazon_aws/android-ndk-${ANDROID_NDK}-linux-x86_64.zip
popd
_ndk_dir=/opt/ndk
mkdir -p "$_ndk_dir"
@ -45,43 +47,22 @@ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
# Installing android sdk
# https://github.com/circleci/circleci-images/blob/staging/android/Dockerfile.m4
_sdk_version=sdk-tools-linux-3859397.zip
_tmp_sdk_zip=/tmp/android-sdk-linux.zip
_android_home=/opt/android/sdk
rm -rf $_android_home
sudo mkdir -p $_android_home
curl --silent --show-error --location --fail --retry 3 --output /tmp/$_sdk_version https://dl.google.com/android/repository/$_sdk_version
sudo unzip -q /tmp/$_sdk_version -d $_android_home
rm /tmp/$_sdk_version
curl --silent --show-error --location --fail --retry 3 --output /tmp/android-sdk-linux.zip $_https_amazon_aws/android-sdk-linux-tools3859397-build-tools2803-2902-platforms28-29.zip
sudo unzip -q $_tmp_sdk_zip -d $_android_home
rm $_tmp_sdk_zip
sudo chmod -R 777 $_android_home
export ANDROID_HOME=$_android_home
export ADB_INSTALL_TIMEOUT=120
export PATH="${ANDROID_HOME}/emulator:${ANDROID_HOME}/tools:${ANDROID_HOME}/tools/bin:${ANDROID_HOME}/platform-tools:${PATH}"
export PATH="${ANDROID_HOME}/tools:${ANDROID_HOME}/tools/bin:${ANDROID_HOME}/platform-tools:${PATH}"
echo "PATH:${PATH}"
alias sdkmanager="$ANDROID_HOME/tools/bin/sdkmanager"
sudo mkdir ~/.android && sudo echo '### User Sources for Android SDK Manager' > ~/.android/repositories.cfg
sudo chmod -R 777 ~/.android
yes | sdkmanager --licenses
yes | sdkmanager --update
sdkmanager \
"tools" \
"platform-tools" \
"emulator"
sdkmanager \
"build-tools;28.0.3" \
"build-tools;29.0.2"
sdkmanager \
"platforms;android-28" \
"platforms;android-29"
sdkmanager --list
# Installing Gradle
echo "GRADLE_VERSION:${GRADLE_VERSION}"
@ -89,8 +70,7 @@ _gradle_home=/opt/gradle
sudo rm -rf $_gradle_home
sudo mkdir -p $_gradle_home
wget --no-verbose --output-document=/tmp/gradle.zip \
"https://services.gradle.org/distributions/gradle-${GRADLE_VERSION}-bin.zip"
curl --silent --output /tmp/gradle.zip --retry 3 $_https_amazon_aws/gradle-${GRADLE_VERSION}-bin.zip
sudo unzip -q /tmp/gradle.zip -d $_gradle_home
rm /tmp/gradle.zip


@ -2,55 +2,123 @@
set -ex
# NVIDIA dockers for RC releases use tag names like `11.0-cudnn8-devel-ubuntu18.04-rc`,
# for this case we will set UBUNTU_VERSION to `18.04-rc` so that the Dockerfile could
# find the correct image. As a result, here we have to check for
# "$UBUNTU_VERSION" == "18.04"*
# instead of
# "$UBUNTU_VERSION" == "18.04"
if [[ "$UBUNTU_VERSION" == "18.04"* ]]; then
cmake3="cmake=3.10*"
else
cmake3="cmake=3.5*"
fi
install_ubuntu() {
# NVIDIA dockers for RC releases use tag names like `11.0-cudnn8-devel-ubuntu18.04-rc`,
# for this case we will set UBUNTU_VERSION to `18.04-rc` so that the Dockerfile could
# find the correct image. As a result, here we have to check for
# "$UBUNTU_VERSION" == "18.04"*
# instead of
# "$UBUNTU_VERSION" == "18.04"
if [[ "$UBUNTU_VERSION" == "18.04"* ]]; then
cmake3="cmake=3.10*"
else
cmake3="cmake=3.5*"
fi
# Install common dependencies
apt-get update
# TODO: Some of these may not be necessary
# TODO: libiomp also gets installed by conda, aka there's a conflict
ccache_deps="asciidoc docbook-xml docbook-xsl xsltproc"
numpy_deps="gfortran"
apt-get install -y --no-install-recommends \
$ccache_deps \
$numpy_deps \
${cmake3} \
apt-transport-https \
autoconf \
automake \
build-essential \
ca-certificates \
curl \
git \
libatlas-base-dev \
libc6-dbg \
libiomp-dev \
libyaml-dev \
libz-dev \
libjpeg-dev \
libasound2-dev \
libsndfile-dev \
python \
python-dev \
python-setuptools \
python-wheel \
software-properties-common \
sudo \
wget \
vim
# Install common dependencies
apt-get update
# TODO: Some of these may not be necessary
# TODO: libiomp also gets installed by conda, aka there's a conflict
ccache_deps="asciidoc docbook-xml docbook-xsl xsltproc"
numpy_deps="gfortran"
apt-get install -y --no-install-recommends \
$ccache_deps \
$numpy_deps \
${cmake3} \
apt-transport-https \
autoconf \
automake \
build-essential \
ca-certificates \
curl \
git \
libatlas-base-dev \
libc6-dbg \
libiomp-dev \
libyaml-dev \
libz-dev \
libjpeg-dev \
libasound2-dev \
libsndfile-dev \
python \
python-dev \
python-setuptools \
python-wheel \
software-properties-common \
sudo \
wget \
vim
# TODO: THIS IS A HACK!!!
# distributed nccl(2) tests are a bit busted, see https://github.com/pytorch/pytorch/issues/5877
if dpkg -s libnccl-dev; then
apt-get remove -y libnccl-dev libnccl2 --allow-change-held-packages
fi
# Cleanup package manager
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
}
install_centos() {
# Need EPEL for many packages we depend on.
# See http://fedoraproject.org/wiki/EPEL
yum --enablerepo=extras install -y epel-release
ccache_deps="asciidoc docbook-dtds docbook-style-xsl libxslt"
numpy_deps="gcc-gfortran"
# Note: protobuf-c-{compiler,devel} on CentOS are too old to be used
# for Caffe2. That said, we still install them to make sure the build
# system opts to build/use protoc and libprotobuf from third-party.
yum install -y \
$ccache_deps \
$numpy_deps \
autoconf \
automake \
bzip2 \
cmake \
cmake3 \
curl \
gcc \
gcc-c++ \
gflags-devel \
git \
glibc-devel \
glibc-headers \
glog-devel \
hiredis-devel \
libstdc++-devel \
make \
opencv-devel \
sudo \
wget \
vim
# Cleanup
yum clean all
rm -rf /var/cache/yum
rm -rf /var/lib/yum/yumdb
rm -rf /var/lib/yum/history
}
# Install base packages depending on the base OS
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
install_ubuntu
;;
centos)
install_centos
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac
# Install Valgrind separately since the apt-get version is too old.
mkdir valgrind_build && cd valgrind_build
VALGRIND_VERSION=3.15.0
VALGRIND_VERSION=3.16.1
if ! wget http://valgrind.org/downloads/valgrind-${VALGRIND_VERSION}.tar.bz2
then
wget https://sourceware.org/ftp/valgrind/valgrind-${VALGRIND_VERSION}.tar.bz2
@ -64,12 +132,3 @@ cd ../../
rm -rf valgrind_build
alias valgrind="/usr/local/bin/valgrind"
# TODO: THIS IS A HACK!!!
# distributed nccl(2) tests are a bit busted, see https://github.com/pytorch/pytorch/issues/5877
if dpkg -s libnccl-dev; then
apt-get remove -y libnccl-dev libnccl2 --allow-change-held-packages
fi
# Cleanup package manager
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*


@ -8,7 +8,11 @@ sed -e 's|PATH="\(.*\)"|PATH="/opt/cache/bin:\1"|g' -i /etc/environment
export PATH="/opt/cache/bin:$PATH"
# Setup compiler cache
curl --retry 3 https://s3.amazonaws.com/ossci-linux/sccache -o /opt/cache/bin/sccache
if [ -n "$ROCM_VERSION" ]; then
curl --retry 3 http://repo.radeon.com/misc/.sccache_amd/sccache -o /opt/cache/bin/sccache
else
curl --retry 3 https://s3.amazonaws.com/ossci-linux/sccache -o /opt/cache/bin/sccache
fi
chmod a+x /opt/cache/bin/sccache
function write_sccache_stub() {
@ -20,8 +24,12 @@ write_sccache_stub cc
write_sccache_stub c++
write_sccache_stub gcc
write_sccache_stub g++
write_sccache_stub clang
write_sccache_stub clang++
# NOTE: See specific ROCM_VERSION case below.
if [ "x$ROCM_VERSION" = x ]; then
write_sccache_stub clang
write_sccache_stub clang++
fi
if [ -n "$CUDA_VERSION" ]; then
# TODO: This is a workaround for the fact that PyTorch's FindCUDA
@ -33,3 +41,47 @@ if [ -n "$CUDA_VERSION" ]; then
printf "#!/bin/sh\nexec sccache $(which nvcc) \"\$@\"" > /opt/cache/lib/nvcc
chmod a+x /opt/cache/lib/nvcc
fi
if [ -n "$ROCM_VERSION" ]; then
# ROCm compiler is hcc or clang. However, it is commonly invoked via hipcc wrapper.
# hipcc will call either hcc or clang using an absolute path starting with /opt/rocm,
# causing the /opt/cache/bin to be skipped. We must create the sccache wrappers
# directly under /opt/rocm while also preserving the original compiler names.
# Note symlinks will chain as follows: [hcc or clang++] -> clang -> clang-??
# Final link in symlink chain must point back to original directory.
# Original compiler is moved one directory deeper. Wrapper replaces it.
function write_sccache_stub_rocm() {
OLDCOMP=$1
COMPNAME=$(basename $OLDCOMP)
TOPDIR=$(dirname $OLDCOMP)
WRAPPED="$TOPDIR/original/$COMPNAME"
mv "$OLDCOMP" "$WRAPPED"
printf "#!/bin/sh\nexec sccache $WRAPPED \$*" > "$OLDCOMP"
chmod a+x "$1"
}
if [[ -e "/opt/rocm/hcc/bin/hcc" ]]; then
# ROCm 3.3 or earlier.
mkdir /opt/rocm/hcc/bin/original
write_sccache_stub_rocm /opt/rocm/hcc/bin/hcc
write_sccache_stub_rocm /opt/rocm/hcc/bin/clang
write_sccache_stub_rocm /opt/rocm/hcc/bin/clang++
# Fix last link in symlink chain, clang points to versioned clang in prior dir
pushd /opt/rocm/hcc/bin/original
ln -s ../$(readlink clang)
popd
elif [[ -e "/opt/rocm/llvm/bin/clang" ]]; then
# ROCm 3.5 and beyond.
mkdir /opt/rocm/llvm/bin/original
write_sccache_stub_rocm /opt/rocm/llvm/bin/clang
write_sccache_stub_rocm /opt/rocm/llvm/bin/clang++
# Fix last link in symlink chain, clang points to versioned clang in prior dir
pushd /opt/rocm/llvm/bin/original
ln -s ../$(readlink clang)
popd
else
echo "Cannot find ROCm compiler."
exit 1
fi
fi


@ -24,13 +24,20 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
mkdir /opt/conda
chown jenkins:jenkins /opt/conda
# Work around bug where devtoolset replaces sudo and breaks it.
if [ -n "$DEVTOOLSET_VERSION" ]; then
SUDO=/bin/sudo
else
SUDO=sudo
fi
as_jenkins() {
# NB: unsetting the environment variables works around a conda bug
# https://github.com/conda/conda/issues/6576
# NB: Pass on PATH and LD_LIBRARY_PATH to sudo invocation
# NB: This must be run from a directory that jenkins has access to,
# works around https://github.com/conda/conda-package-handling/pull/34
sudo -H -u jenkins env -u SUDO_UID -u SUDO_GID -u SUDO_COMMAND -u SUDO_USER env "PATH=$PATH" "LD_LIBRARY_PATH=$LD_LIBRARY_PATH" $*
$SUDO -H -u jenkins env -u SUDO_UID -u SUDO_GID -u SUDO_COMMAND -u SUDO_USER env "PATH=$PATH" "LD_LIBRARY_PATH=$LD_LIBRARY_PATH" $*
}
pushd /tmp
@ -49,10 +56,10 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
pushd /opt/conda
# Track latest conda update
as_jenkins conda update -n base conda
as_jenkins conda update -y -n base conda
# Install correct Python version
as_jenkins conda install python="$ANACONDA_PYTHON_VERSION"
as_jenkins conda install -y python="$ANACONDA_PYTHON_VERSION"
conda_install() {
# Ensure that the install command doesn't upgrade/downgrade Python
@ -67,9 +74,9 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
if [ "$ANACONDA_PYTHON_VERSION" = "3.8" ]; then
# DO NOT install typing if installing python-3.8, since it's part of python-3.8 core packages
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
conda_install numpy pyyaml mkl mkl-include setuptools cffi future six llvmdev=8.0.0
conda_install numpy=1.18.5 pyyaml mkl mkl-include setuptools cffi future six llvmdev=8.0.0 dataclasses
else
conda_install numpy pyyaml mkl mkl-include setuptools cffi typing future six
conda_install numpy=1.18.5 pyyaml mkl mkl-include setuptools cffi typing future six dataclasses
fi
if [[ "$CUDA_VERSION" == 9.2* ]]; then
conda_install magma-cuda92 -c pytorch
@ -79,6 +86,8 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
conda_install magma-cuda101 -c pytorch
elif [[ "$CUDA_VERSION" == 10.2* ]]; then
conda_install magma-cuda102 -c pytorch
elif [[ "$CUDA_VERSION" == 11.0* ]]; then
conda_install magma-cuda110 -c pytorch
fi
# TODO: This isn't working atm


@ -51,11 +51,16 @@ install_centos() {
}
# Install base packages depending on the base OS
if [ -f /etc/lsb-release ]; then
install_ubuntu
elif [ -f /etc/os-release ]; then
install_centos
else
echo "Unable to determine OS..."
exit 1
fi
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
install_ubuntu
;;
centos)
install_centos
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac


@ -0,0 +1,10 @@
#!/bin/bash
set -ex
[ -n "$DEVTOOLSET_VERSION" ]
yum install -y centos-release-scl
yum install -y devtoolset-$DEVTOOLSET_VERSION
echo "source scl_source enable devtoolset-$DEVTOOLSET_VERSION" > "/etc/profile.d/devtoolset-$DEVTOOLSET_VERSION.sh"


@ -0,0 +1,34 @@
#!/bin/bash
set -ex
[ -n "$GLIBC_VERSION" ]
if [[ -n "$CENTOS_VERSION" ]]; then
[ -n "$DEVTOOLSET_VERSION" ]
fi
yum install -y wget sed
mkdir -p /packages && cd /packages
wget -q http://ftp.gnu.org/gnu/glibc/glibc-$GLIBC_VERSION.tar.gz
tar xzf glibc-$GLIBC_VERSION.tar.gz
if [[ "$GLIBC_VERSION" == "2.26" ]]; then
cd glibc-$GLIBC_VERSION
sed -i 's/$name ne "nss_test1"/$name ne "nss_test1" \&\& $name ne "nss_test2"/' scripts/test-installation.pl
cd ..
fi
mkdir -p glibc-$GLIBC_VERSION-build && cd glibc-$GLIBC_VERSION-build
if [[ -n "$CENTOS_VERSION" ]]; then
export PATH=/opt/rh/devtoolset-$DEVTOOLSET_VERSION/root/usr/bin:$PATH
fi
../glibc-$GLIBC_VERSION/configure --prefix=/usr CFLAGS='-Wno-stringop-truncation -Wno-format-overflow -Wno-restrict -Wno-format-truncation -g -O2'
make -j$(nproc)
make install
# Cleanup
rm -rf /packages
rm -rf /var/cache/yum/*
rm -rf /var/lib/rpm/__db.*
yum clean all


@ -1,30 +0,0 @@
#!/bin/bash
set -ex
llvm_url="https://github.com/llvm/llvm-project/releases/download/llvmorg-9.0.1/llvm-9.0.1.src.tar.xz"
mkdir /opt/llvm
pushd /tmp
wget --no-verbose --output-document=llvm.tar.xz "$llvm_url"
mkdir llvm
tar -xf llvm.tar.xz -C llvm --strip-components 1
rm -f llvm.tar.xz
cd llvm
mkdir build
cd build
cmake -G "Unix Makefiles" \
-DCMAKE_BUILD_TYPE=MinSizeRel \
-DLLVM_ENABLE_ASSERTIONS=ON \
-DCMAKE_INSTALL_PREFIX=/opt/llvm \
-DLLVM_TARGETS_TO_BUILD="host" \
-DLLVM_BUILD_TOOLS=OFF \
-DLLVM_BUILD_UTILS=OFF \
-DLLVM_TEMPORARILY_ALLOW_OLD_TOOLCHAIN=ON \
../
make -j4
sudo make install
popd


@ -46,11 +46,16 @@ install_centos() {
}
# Install base packages depending on the base OS
if [ -f /etc/lsb-release ]; then
install_ubuntu
elif [ -f /etc/os-release ]; then
install_centos
else
echo "Unable to determine OS..."
exit 1
fi
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
install_ubuntu
;;
centos)
install_centos
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac


@ -8,6 +8,7 @@ install_ubuntu() {
# gpg-agent is not available by default on 18.04
apt-get install -y --no-install-recommends gpg-agent
fi
apt-get install -y kmod
apt-get install -y wget
apt-get install -y libopenblas-dev
@ -35,6 +36,15 @@ install_ubuntu() {
rocprofiler-dev \
roctracer-dev
# precompiled miopen kernels added in ROCm 3.5; search for all unversioned packages
# if the search fails it would abort this script (set -e); the trailing "|| true" tolerates that case
MIOPENKERNELS=$(apt-cache search --names-only miopenkernels | awk '{print $1}' | grep -F -v . || true)
if [[ "x${MIOPENKERNELS}" = x ]]; then
echo "miopenkernels package not available"
else
DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated ${MIOPENKERNELS}
fi
# Cleanup
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
@ -43,6 +53,7 @@ install_ubuntu() {
install_centos() {
yum update -y
yum install -y kmod
yum install -y wget
yum install -y openblas-devel
@ -51,7 +62,7 @@ install_centos() {
echo "[ROCm]" > /etc/yum.repos.d/rocm.repo
echo "name=ROCm" >> /etc/yum.repos.d/rocm.repo
echo "baseurl=http://repo.radeon.com/rocm/yum/rpm/" >> /etc/yum.repos.d/rocm.repo
echo "baseurl=http://repo.radeon.com/rocm/yum/${ROCM_VERSION}" >> /etc/yum.repos.d/rocm.repo
echo "enabled=1" >> /etc/yum.repos.d/rocm.repo
echo "gpgcheck=0" >> /etc/yum.repos.d/rocm.repo
@ -79,11 +90,16 @@ install_centos() {
}
# Install Python packages depending on the base OS
if [ -f /etc/lsb-release ]; then
install_ubuntu
elif [ -f /etc/os-release ]; then
install_centos
else
echo "Unable to determine OS..."
exit 1
fi
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
install_ubuntu
;;
centos)
install_centos
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac


@ -0,0 +1,24 @@
#!/bin/bash
set -ex
[ -n "${SWIFTSHADER}" ]
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
_https_amazon_aws=https://ossci-android.s3.amazonaws.com
# SwiftShader
_swiftshader_dir=/var/lib/jenkins/swiftshader
_swiftshader_file_targz=swiftshader-abe07b943-prebuilt.tar.gz
mkdir -p $_swiftshader_dir
_tmp_swiftshader_targz="/tmp/${_swiftshader_file_targz}"
curl --silent --show-error --location --fail --retry 3 \
--output "${_tmp_swiftshader_targz}" "$_https_amazon_aws/${_swiftshader_file_targz}"
tar -C "${_swiftshader_dir}" -xzf "${_tmp_swiftshader_targz}"
export VK_ICD_FILENAMES="${_swiftshader_dir}/build/Linux/vk_swiftshader_icd.json"


@ -49,26 +49,7 @@ if [ -n "$TRAVIS_PYTHON_VERSION" ]; then
pip --version
if [[ "$TRAVIS_PYTHON_VERSION" == nightly ]]; then
# These two packages have broken Cythonizations uploaded
# to PyPi, see:
#
# - https://github.com/numpy/numpy/issues/10500
# - https://github.com/yaml/pyyaml/issues/117
#
# Furthermore, the released version of Cython does not
# have these issues fixed.
#
# While we are waiting on fixes for these, we build
# from Git for now. Feel free to delete this conditional
# branch if things start working again (you may need
# to do this if these packages regress on Git HEAD.)
as_jenkins pip install git+https://github.com/cython/cython.git
as_jenkins pip install git+https://github.com/numpy/numpy.git
as_jenkins pip install git+https://github.com/yaml/pyyaml.git
else
as_jenkins pip install numpy pyyaml
fi
as_jenkins pip install numpy pyyaml
as_jenkins pip install \
future \
@ -76,7 +57,8 @@ if [ -n "$TRAVIS_PYTHON_VERSION" ]; then
protobuf \
pytest \
pillow \
typing
typing \
dataclasses
as_jenkins pip install mkl mkl-devel


@ -47,11 +47,16 @@ install_centos() {
}
# Install base packages depending on the base OS
if [ -f /etc/lsb-release ]; then
install_ubuntu
elif [ -f /etc/os-release ]; then
install_centos
else
echo "Unable to determine OS..."
exit 1
fi
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
install_ubuntu
;;
centos)
install_centos
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac


@ -0,0 +1,23 @@
#!/bin/bash
set -ex
[ -n "${VULKAN_SDK_VERSION}" ]
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
_https_amazon_aws=https://ossci-android.s3.amazonaws.com
_vulkansdk_dir=/var/lib/jenkins/vulkansdk
mkdir -p $_vulkansdk_dir
_tmp_vulkansdk_targz=/tmp/vulkansdk.tar.gz
curl --silent --show-error --location --fail --retry 3 \
--output "$_tmp_vulkansdk_targz" "$_https_amazon_aws/vulkansdk-linux-x86_64-${VULKAN_SDK_VERSION}.tar.gz"
tar -C "$_vulkansdk_dir" -xzf "$_tmp_vulkansdk_targz" --strip-components 1
export VULKAN_SDK="$_vulkansdk_dir/"
rm "$_tmp_vulkansdk_targz"


@ -86,9 +86,8 @@ ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
ENV TORCH_CUDA_ARCH_LIST Maxwell
ENV TORCH_NVCC_FLAGS "-Xfatbin -compress-all"
# Install LLVM dev version
ADD ./common/install_llvm.sh install_llvm.sh
RUN bash ./install_llvm.sh
# Install LLVM dev version (Defined in the pytorch/builder github repository)
COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
USER jenkins
CMD ["bash"]


@ -57,6 +57,7 @@ ENV PATH /opt/rocm/bin:$PATH
ENV PATH /opt/rocm/hcc/bin:$PATH
ENV PATH /opt/rocm/hip/bin:$PATH
ENV PATH /opt/rocm/opencl/bin:$PATH
ENV PATH /opt/rocm/llvm/bin:$PATH
ENV HIP_PLATFORM hcc
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8


@ -85,6 +85,18 @@ RUN rm AndroidManifest.xml
RUN rm build.gradle
ENV INSTALLED_ANDROID ${ANDROID}
# (optional) Install Vulkan SDK
ARG VULKAN_SDK_VERSION
ADD ./common/install_vulkan_sdk.sh install_vulkan_sdk.sh
RUN if [ -n "${VULKAN_SDK_VERSION}" ]; then bash ./install_vulkan_sdk.sh; fi
RUN rm install_vulkan_sdk.sh
# (optional) Install swiftshader
ARG SWIFTSHADER
ADD ./common/install_swiftshader.sh install_swiftshader.sh
RUN if [ -n "${SWIFTSHADER}" ]; then bash ./install_swiftshader.sh; fi
RUN rm install_swiftshader.sh
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
ADD ./common/install_cmake.sh install_cmake.sh
@ -111,9 +123,8 @@ RUN bash ./install_jni.sh && rm install_jni.sh
ARG BUILD_ENVIRONMENT
ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
# Install LLVM dev version
ADD ./common/install_llvm.sh install_llvm.sh
RUN bash ./install_llvm.sh
# Install LLVM dev version (Defined in the pytorch/builder github repository)
COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
USER jenkins
CMD ["bash"]


@ -88,6 +88,9 @@ parser = argparse.ArgumentParser(description="Delete old Docker tags from regist
parser.add_argument(
"--dry-run", action="store_true", help="Dry run; print tags that would be deleted"
)
parser.add_argument(
"--debug", action="store_true", help="Debug, print ignored / saved tags"
)
parser.add_argument(
"--keep-stable-days",
type=int,
@ -164,51 +167,48 @@ for repo in repos(client):
# Keep list of image digests to delete for this repository
digest_to_delete = []
print(repositoryName)
for image in images(client, repo):
tags = image.get("imageTags")
if not isinstance(tags, (list,)) or len(tags) == 0:
continue
tag = tags[0]
created = image["imagePushedAt"]
age = now - created
if any([
looks_like_git_sha(tag),
tag.isdigit(),
tag.count("-") == 4, # TODO: Remove, this no longer applies as tags are now built using a SHA1
tag in ignore_tags]):
window = stable_window
if tag in ignore_tags:
stable_window_tags.append((repositoryName, tag, "", age, created))
elif age < window:
stable_window_tags.append((repositoryName, tag, window, age, created))
else:
window = unstable_window
for tag in tags:
if any([
looks_like_git_sha(tag),
tag.isdigit(),
tag.count("-") == 4, # TODO: Remove, this no longer applies as tags are now built using a SHA1
tag in ignore_tags]):
window = stable_window
if tag in ignore_tags:
stable_window_tags.append((repositoryName, tag, "", age, created))
elif age < window:
stable_window_tags.append((repositoryName, tag, window, age, created))
else:
window = unstable_window
if tag in ignore_tags:
print("Ignoring tag {}:{} (age: {})".format(repositoryName, tag, age))
continue
if age < window:
print("Not deleting manifest for tag {}:{} (age: {})".format(repositoryName, tag, age))
continue
if args.dry_run:
print("(dry run) Deleting manifest for tag {}:{} (age: {})".format(repositoryName, tag, age))
if tag in ignore_tags or age < window:
if args.debug:
print("Ignoring {}:{} (age: {})".format(repositoryName, tag, age))
break
else:
print("Deleting manifest for tag{}:{} (age: {})".format(repositoryName, tag, age))
for tag in tags:
print("{}Deleting {}:{} (age: {})".format("(dry run) " if args.dry_run else "", repositoryName, tag, age))
digest_to_delete.append(image["imageDigest"])
if args.dry_run:
if args.debug:
print("Skipping actual deletion, moving on...")
else:
# Issue batch delete for all images to delete for this repository
# Note that as of 2018-07-25, the maximum number of images you can
# delete in a single batch is 100, so chunk our list into batches of
# 100
for c in chunks(digest_to_delete, 100):
client.batch_delete_image(
registryId="308535385114",
repositoryName=repositoryName,
imageIds=[{"imageDigest": digest} for digest in c],
)
# Issue batch delete for all images to delete for this repository
# Note that as of 2018-07-25, the maximum number of images you can
# delete in a single batch is 100, so chunk our list into batches of
# 100
for c in chunks(digest_to_delete, 100):
client.batch_delete_image(
registryId="308535385114",
repositoryName=repositoryName,
imageIds=[{"imageDigest": digest} for digest in c],
)
save_to_s3(args.filter_prefix, stable_window_tags)
save_to_s3(args.filter_prefix, stable_window_tags)
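chunks() is referenced above but not defined in this hunk; a minimal sketch of such a slicing helper (an assumption about its shape, not necessarily the repository's own definition):

    def chunks(seq, n):
        # yield successive n-sized slices of seq
        for i in range(0, len(seq), n):
            yield seq[i:i + n]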


@ -8,10 +8,9 @@ Please see README.md in this directory for details.
import os
import shutil
import sys
from collections import OrderedDict, namedtuple
from collections import namedtuple
import cimodel.data.binary_build_definitions as binary_build_definitions
import cimodel.data.caffe2_build_definitions as caffe2_build_definitions
import cimodel.data.pytorch_build_definitions as pytorch_build_definitions
import cimodel.data.simple.android_definitions
import cimodel.data.simple.bazel_definitions
@ -23,6 +22,7 @@ import cimodel.data.simple.macos_definitions
import cimodel.data.simple.mobile_definitions
import cimodel.data.simple.nightly_android
import cimodel.data.simple.nightly_ios
import cimodel.data.simple.anaconda_prune_defintions
import cimodel.data.windows_build_definitions as windows_build_definitions
import cimodel.lib.miniutils as miniutils
import cimodel.lib.miniyaml as miniyaml
@ -83,6 +83,7 @@ class Header(object):
def gen_build_workflows_tree():
build_workflows_functions = [
cimodel.data.simple.docker_definitions.get_workflow_jobs,
pytorch_build_definitions.get_workflow_jobs,
cimodel.data.simple.macos_definitions.get_workflow_jobs,
cimodel.data.simple.android_definitions.get_workflow_jobs,
@ -90,23 +91,19 @@ def gen_build_workflows_tree():
cimodel.data.simple.mobile_definitions.get_workflow_jobs,
cimodel.data.simple.ge_config_tests.get_workflow_jobs,
cimodel.data.simple.bazel_definitions.get_workflow_jobs,
caffe2_build_definitions.get_workflow_jobs,
cimodel.data.simple.binary_smoketest.get_workflow_jobs,
cimodel.data.simple.nightly_ios.get_workflow_jobs,
cimodel.data.simple.nightly_android.get_workflow_jobs,
cimodel.data.simple.anaconda_prune_defintions.get_workflow_jobs,
windows_build_definitions.get_windows_workflows,
binary_build_definitions.get_post_upload_jobs,
binary_build_definitions.get_binary_smoke_test_jobs,
]
binary_build_functions = [
binary_build_definitions.get_binary_build_jobs,
binary_build_definitions.get_nightly_tests,
binary_build_definitions.get_nightly_uploads,
binary_build_definitions.get_post_upload_jobs,
binary_build_definitions.get_binary_smoke_test_jobs,
]
docker_builder_functions = [
cimodel.data.simple.docker_definitions.get_workflow_jobs
]
return {
@ -115,19 +112,6 @@ def gen_build_workflows_tree():
"when": r"<< pipeline.parameters.run_binary_tests >>",
"jobs": [f() for f in binary_build_functions],
},
"docker_build": OrderedDict(
{
"triggers": [
{
"schedule": {
"cron": miniutils.quote("0 15 * * 0"),
"filters": {"branches": {"only": ["master"]}},
}
}
],
"jobs": [f() for f in docker_builder_functions],
}
),
"build": {"jobs": [f() for f in build_workflows_functions]},
}
}
@ -140,12 +124,10 @@ YAML_SOURCES = [
File("nightly-binary-build-defaults.yml"),
Header("Build parameters"),
File("build-parameters/pytorch-build-params.yml"),
File("build-parameters/caffe2-build-params.yml"),
File("build-parameters/binary-build-params.yml"),
File("build-parameters/promote-build-params.yml"),
Header("Job specs"),
File("job-specs/pytorch-job-specs.yml"),
File("job-specs/caffe2-job-specs.yml"),
File("job-specs/binary-job-specs.yml"),
File("job-specs/job-specs-custom.yml"),
File("job-specs/job-specs-promote.yml"),


@ -14,7 +14,7 @@ mkdir -p ${ZIP_DIR}/src
cp -R ${ARTIFACTS_DIR}/arm64/include ${ZIP_DIR}/install/
# build a FAT binary
cd ${ZIP_DIR}/install/lib
target_libs=(libc10.a libclog.a libcpuinfo.a libeigen_blas.a libpytorch_qnnpack.a libtorch_cpu.a libtorch.a libXNNPACK.a)
target_libs=(libc10.a libclog.a libcpuinfo.a libeigen_blas.a libpthreadpool.a libpytorch_qnnpack.a libtorch_cpu.a libtorch.a libXNNPACK.a)
for lib in ${target_libs[*]}
do
if [ -f "${ARTIFACTS_DIR}/x86_64/lib/${lib}" ] && [ -f "${ARTIFACTS_DIR}/arm64/lib/${lib}" ]; then


@ -5,26 +5,18 @@ set -eux -o pipefail
source /env
# Defaults here so they can be changed in one place
export MAX_JOBS=12
export MAX_JOBS=${MAX_JOBS:-$(( $(nproc) - 2 ))}
# Parse the parameters
if [[ "$PACKAGE_TYPE" == 'conda' ]]; then
build_script='conda/build_pytorch.sh'
elif [[ "$DESIRED_CUDA" == cpu ]]; then
build_script='manywheel/build_cpu.sh'
elif [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
build_script='manywheel/build_rocm.sh'
else
build_script='manywheel/build.sh'
fi
# We want to call unbuffer, which calls tclsh which finds the expect
# package. The expect was installed by yum into /usr/bin so we want to
# find /usr/bin/tclsh, but this is shadowed by /opt/conda/bin/tclsh in
# the conda docker images, so we prepend it to the path here.
if [[ "$PACKAGE_TYPE" == 'conda' ]]; then
mkdir /just_tclsh_bin
ln -s /usr/bin/tclsh /just_tclsh_bin/tclsh
export PATH=/just_tclsh_bin:$PATH
fi
# Build the package
SKIP_ALL_TESTS=1 unbuffer "/builder/$build_script" | ts
SKIP_ALL_TESTS=1 "/builder/$build_script"
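The MAX_JOBS change above swaps a hardcoded job count for Bash's `${VAR:-default}` expansion, so the environment can still override it while the script supplies a sane fallback. A minimal sketch of the pattern, assuming a Linux host with `nproc`:
#!/usr/bin/env bash
# Respect a caller-provided MAX_JOBS; otherwise leave two cores free.
export MAX_JOBS=${MAX_JOBS:-$(( $(nproc) - 2 ))}
echo "building with ${MAX_JOBS} parallel jobs"
# MAX_JOBS=4 ./build.sh  -> "building with 4 parallel jobs"
# ./build.sh             -> derived from nproc, e.g. 10 on a 12-core host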


@@ -40,7 +40,7 @@ if [[ "$PACKAGE_TYPE" == conda ]]; then
else
cu_ver="${DESIRED_CUDA:2:2}.${DESIRED_CUDA:4}"
fi
retry conda install -yq -c pytorch "cudatoolkit=\${cu_ver}"
retry conda install -yq -c nvidia -c pytorch "cudatoolkit=\${cu_ver}"
fi
elif [[ "$PACKAGE_TYPE" != libtorch ]]; then
pip install "\$pkg"


@@ -1,49 +0,0 @@
#!/bin/bash
# Do NOT set -x
source /home/circleci/project/env
set -eu -o pipefail
set +x
declare -x "AWS_ACCESS_KEY_ID=${PYTORCH_BINARY_AWS_ACCESS_KEY_ID}"
declare -x "AWS_SECRET_ACCESS_KEY=${PYTORCH_BINARY_AWS_SECRET_ACCESS_KEY}"
#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!
# DO NOT TURN -x ON BEFORE THIS LINE
#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!
set -eux -o pipefail
export PATH="$MINICONDA_ROOT/bin:$PATH"
# This gets set in binary_populate_env.sh, but lets have a sane default just in case
PIP_UPLOAD_FOLDER=${PIP_UPLOAD_FOLDER:-nightly}
# TODO: Combine CONDA_UPLOAD_CHANNEL and PIP_UPLOAD_FOLDER into one variable
# The only difference is the trailing slash
# Strip trailing slashes if there
CONDA_UPLOAD_CHANNEL=$(echo "${PIP_UPLOAD_FOLDER}" | sed 's:/*$::')
BACKUP_BUCKET="s3://pytorch-backup"
# Upload the package to the final location
pushd /home/circleci/project/final_pkgs
if [[ "$PACKAGE_TYPE" == conda ]]; then
retry conda install -yq anaconda-client
retry anaconda -t "${CONDA_PYTORCHBOT_TOKEN}" upload "$(ls)" -u "pytorch-${CONDA_UPLOAD_CHANNEL}" --label main --no-progress --force
# Fetch platform (eg. win-64, linux-64, etc.) from index file
# Because there's no actual conda command to read this
subdir=$(tar -xOf ./*.bz2 info/index.json | grep subdir | cut -d ':' -f2 | sed -e 's/[[:space:]]//' -e 's/"//g' -e 's/,//')
BACKUP_DIR="conda/${subdir}"
elif [[ "$PACKAGE_TYPE" == libtorch ]]; then
retry pip install -q awscli
s3_dir="s3://pytorch/libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
for pkg in $(ls); do
retry aws s3 cp "$pkg" "$s3_dir" --acl public-read
done
BACKUP_DIR="libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
else
retry pip install -q awscli
s3_dir="s3://pytorch/whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
retry aws s3 cp "$(ls)" "$s3_dir" --acl public-read
BACKUP_DIR="whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
fi
if [[ -n "${CIRCLE_TAG:-}" ]]; then
s3_dir="${BACKUP_BUCKET}/${CIRCLE_TAG}/${BACKUP_DIR}"
retry aws s3 cp . "$s3_dir"
fi
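The subdir extraction in these upload scripts works because a conda package is an ordinary tar.bz2 archive whose info/index.json records the target platform; `tar -xO` streams that single member to stdout without unpacking anything. A sketch of the same idea (the package filename is hypothetical, and the jq variant is an assumed convenience, not what the script uses):
#!/usr/bin/env bash
set -euo pipefail
pkg=./pytorch-1.7.0-py3.8_cpu_0.tar.bz2   # hypothetical package file
# -x extract, -O write to stdout, -f archive; GNU tar autodetects bzip2.
tar -xOf "$pkg" info/index.json
# With jq installed (an assumption; the CI script uses grep/cut/sed instead):
subdir=$(tar -xOf "$pkg" info/index.json | jq -r .subdir)
echo "platform subdir: ${subdir}"   # e.g. linux-64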


@@ -1,49 +0,0 @@
#!/bin/bash
# Do NOT set -x
set -eu -o pipefail
set +x
export AWS_ACCESS_KEY_ID="${PYTORCH_BINARY_AWS_ACCESS_KEY_ID}"
export AWS_SECRET_ACCESS_KEY="${PYTORCH_BINARY_AWS_SECRET_ACCESS_KEY}"
#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!
# DO NOT TURN -x ON BEFORE THIS LINE
#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!
set -eux -o pipefail
source "/Users/distiller/project/env"
export "PATH=$workdir/miniconda/bin:$PATH"
# This gets set in binary_populate_env.sh, but lets have a sane default just in case
PIP_UPLOAD_FOLDER=${PIP_UPLOAD_FOLDER:-nightly}
# TODO: Combine CONDA_UPLOAD_CHANNEL and PIP_UPLOAD_FOLDER into one variable
# The only difference is the trailing slash
# Strip trailing slashes if there
CONDA_UPLOAD_CHANNEL=$(echo "${PIP_UPLOAD_FOLDER}" | sed 's:/*$::')
BACKUP_BUCKET="s3://pytorch-backup"
pushd "$workdir/final_pkgs"
if [[ "$PACKAGE_TYPE" == conda ]]; then
retry conda install -yq anaconda-client
retry anaconda -t "${CONDA_PYTORCHBOT_TOKEN}" upload "$(ls)" -u "pytorch-${CONDA_UPLOAD_CHANNEL}" --label main --no-progress --force
# Fetch platform (eg. win-64, linux-64, etc.) from index file
# Because there's no actual conda command to read this
subdir=$(tar -xOf ./*.bz2 info/index.json | grep subdir | cut -d ':' -f2 | sed -e 's/[[:space:]]//' -e 's/"//g' -e 's/,//')
BACKUP_DIR="conda/${subdir}"
elif [[ "$PACKAGE_TYPE" == libtorch ]]; then
retry pip install -q awscli
s3_dir="s3://pytorch/libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
for pkg in $(ls); do
retry aws s3 cp "$pkg" "$s3_dir" --acl public-read
done
BACKUP_DIR="libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
else
retry pip install -q awscli
s3_dir="s3://pytorch/whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
retry aws s3 cp "$(ls)" "$s3_dir" --acl public-read
BACKUP_DIR="whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
fi
if [[ -n "${CIRCLE_TAG:-}" ]]; then
s3_dir="${BACKUP_BUCKET}/${CIRCLE_TAG}/${BACKUP_DIR}"
retry aws s3 cp . "$s3_dir"
fi


@@ -73,7 +73,7 @@ PIP_UPLOAD_FOLDER='nightly/'
# We put this here so that OVERRIDE_PACKAGE_VERSION below can read from it
export DATE="$(date -u +%Y%m%d)"
#TODO: We should be pulling semver version from the base version.txt
BASE_BUILD_VERSION="1.6.0.dev$DATE"
BASE_BUILD_VERSION="1.7.0.dev$DATE"
# Change BASE_BUILD_VERSION to git tag when on a git tag
# Use 'git -C' to make doubly sure we're in the correct directory for checking
# the git tag
@@ -130,7 +130,7 @@ if [[ "${BUILD_FOR_SYSTEM:-}" == "windows" ]]; then
fi
export DATE="$DATE"
export NIGHTLIES_DATE_PREAMBLE=1.6.0.dev
export NIGHTLIES_DATE_PREAMBLE=1.7.0.dev
export PYTORCH_BUILD_VERSION="$PYTORCH_BUILD_VERSION"
export PYTORCH_BUILD_NUMBER="$PYTORCH_BUILD_NUMBER"
export OVERRIDE_PACKAGE_VERSION="$PYTORCH_BUILD_VERSION"


@@ -19,7 +19,7 @@ chmod +x /home/circleci/project/ci_test_script.sh
VOLUME_MOUNTS="-v /home/circleci/project/:/circleci_stuff -v /home/circleci/project/final_pkgs:/final_pkgs -v ${PYTORCH_ROOT}:/pytorch -v ${BUILDER_ROOT}:/builder"
# Run the docker
if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --runtime=nvidia ${VOLUME_MOUNTS} -t -d "${DOCKER_IMAGE}")
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --gpus all ${VOLUME_MOUNTS} -t -d "${DOCKER_IMAGE}")
else
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined ${VOLUME_MOUNTS} -t -d "${DOCKER_IMAGE}")
fi


@@ -0,0 +1,98 @@
#!/usr/bin/env bash
set -euo pipefail
PACKAGE_TYPE=${PACKAGE_TYPE:-conda}
PKG_DIR=${PKG_DIR:-/tmp/workspace/final_pkgs}
# Designates whether to submit as a release candidate or a nightly build
# Value should be `test` when uploading release candidates
# currently set within `designate_upload_channel`
UPLOAD_CHANNEL=${UPLOAD_CHANNEL:-nightly}
# Designates what subfolder to put packages into
UPLOAD_SUBFOLDER=${UPLOAD_SUBFOLDER:-cpu}
UPLOAD_BUCKET="s3://pytorch"
BACKUP_BUCKET="s3://pytorch-backup"
DRY_RUN=${DRY_RUN:-enabled}
# Don't actually do work unless explicit
ANACONDA="true anaconda"
AWS_S3_CP="aws s3 cp --dryrun"
if [[ "${DRY_RUN}" = "disabled" ]]; then
ANACONDA="anaconda"
AWS_S3_CP="aws s3 cp"
fi
do_backup() {
local backup_dir
backup_dir=$1
(
pushd /tmp/workspace
set -x
${AWS_S3_CP} --recursive . "${BACKUP_BUCKET}/${CIRCLE_TAG}/${backup_dir}/"
)
}
conda_upload() {
(
set -x
${ANACONDA} \
upload \
${PKG_DIR}/*.tar.bz2 \
-u "pytorch-${UPLOAD_CHANNEL}" \
--label main \
--no-progress \
--force
)
}
s3_upload() {
local extension
local pkg_type
extension="$1"
pkg_type="$2"
s3_dir="${UPLOAD_BUCKET}/${pkg_type}/${UPLOAD_CHANNEL}/${UPLOAD_SUBFOLDER}/"
(
for pkg in ${PKG_DIR}/*.${extension}; do
(
set -x
${AWS_S3_CP} --no-progress --acl public-read "${pkg}" "${s3_dir}"
)
done
)
}
case "${PACKAGE_TYPE}" in
conda)
conda_upload
# Fetch platform (eg. win-64, linux-64, etc.) from index file
# Because there's no actual conda command to read this
subdir=$(\
tar -xOf ${PKG_DIR}/*.bz2 info/index.json \
| grep subdir \
| cut -d ':' -f2 \
| sed -e 's/[[:space:]]//' -e 's/"//g' -e 's/,//' \
)
BACKUP_DIR="conda/${subdir}"
;;
libtorch)
s3_upload "zip" "libtorch"
BACKUP_DIR="libtorch/${UPLOAD_CHANNEL}/${UPLOAD_SUBFOLDER}"
;;
# wheel can refer to either wheel or manywheel
*wheel)
s3_upload "whl" "whl"
BACKUP_DIR="whl/${UPLOAD_CHANNEL}/${UPLOAD_SUBFOLDER}"
;;
*)
echo "ERROR: unknown package type: ${PACKAGE_TYPE}"
exit 1
;;
esac
# CIRCLE_TAG is defined by upstream circleci,
# this can be changed to recognize tagged versions
if [[ -n "${CIRCLE_TAG:-}" ]]; then
do_backup "${BACKUP_DIR}"
fi
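This new binary_upload.sh defaults to a dry run by routing every command through a variable: `true anaconda` expands to the no-op `true` with the remaining words as ignored arguments, while `aws s3 cp --dryrun` relies on the CLI's own dry-run flag. A self-contained sketch of the shim (the `upload-tool` name is illustrative):
#!/usr/bin/env bash
set -euo pipefail
DRY_RUN=${DRY_RUN:-enabled}
UPLOAD="true upload-tool"     # dry run: 'true' succeeds and ignores its arguments
if [[ "${DRY_RUN}" = "disabled" ]]; then
  UPLOAD="echo upload-tool"   # stand-in for the real uploader
fi
# Unquoted expansion splits the variable into a command plus leading arguments.
${UPLOAD} --dest s3://example-bucket/pkg.tar.bz2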


@@ -1,48 +0,0 @@
#!/bin/bash
set -eu -o pipefail
set +x
declare -x "AWS_ACCESS_KEY_ID=${PYTORCH_BINARY_AWS_ACCESS_KEY_ID}"
declare -x "AWS_SECRET_ACCESS_KEY=${PYTORCH_BINARY_AWS_SECRET_ACCESS_KEY}"
#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!
# DO NOT TURN -x ON BEFORE THIS LINE
#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!#!
set -eux -o pipefail
source "/env"
# This gets set in binary_populate_env.sh, but lets have a sane default just in case
PIP_UPLOAD_FOLDER=${PIP_UPLOAD_FOLDER:-nightly/}
# TODO: Combine CONDA_UPLOAD_CHANNEL and PIP_UPLOAD_FOLDER into one variable
# The only difference is the trailing slash
# Strip trailing slashes if there
CONDA_UPLOAD_CHANNEL=$(echo "${PIP_UPLOAD_FOLDER}" | sed 's:/*$::')
BACKUP_BUCKET="s3://pytorch-backup"
pushd /root/workspace/final_pkgs
# Upload the package to the final location
if [[ "$PACKAGE_TYPE" == conda ]]; then
retry conda install -yq anaconda-client
retry anaconda -t "${CONDA_PYTORCHBOT_TOKEN}" upload "$(ls)" -u "pytorch-${CONDA_UPLOAD_CHANNEL}" --label main --no-progress --force
# Fetch platform (eg. win-64, linux-64, etc.) from index file
# Because there's no actual conda command to read this
subdir=$(tar -xOf ./*.bz2 info/index.json | grep subdir | cut -d ':' -f2 | sed -e 's/[[:space:]]//' -e 's/"//g' -e 's/,//')
BACKUP_DIR="conda/${subdir}"
elif [[ "$PACKAGE_TYPE" == libtorch ]]; then
retry conda install -c conda-forge -yq awscli
s3_dir="s3://pytorch/libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
for pkg in $(ls); do
retry aws s3 cp "$pkg" "$s3_dir" --acl public-read
done
BACKUP_DIR="libtorch/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
else
retry conda install -c conda-forge -yq awscli
s3_dir="s3://pytorch/whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
retry aws s3 cp "$(ls)" "$s3_dir" --acl public-read
BACKUP_DIR="whl/${PIP_UPLOAD_FOLDER}${DESIRED_CUDA}/"
fi
if [[ -n "${CIRCLE_TAG:-}" ]]; then
s3_dir="${BACKUP_BUCKET}/${CIRCLE_TAG}/${BACKUP_DIR}"
retry aws s3 cp . "$s3_dir"
fi


@@ -1,7 +1,11 @@
#!/usr/bin/env bash
set -eux -o pipefail
env
echo "BUILD_ENVIRONMENT:$BUILD_ENVIRONMENT"
export ANDROID_NDK_HOME=/opt/ndk
export ANDROID_NDK=/opt/ndk
export ANDROID_HOME=/opt/android/sdk
# Must be in sync with GRADLE_VERSION in docker image for android
@@ -10,6 +14,31 @@ export GRADLE_VERSION=4.10.3
export GRADLE_HOME=/opt/gradle/gradle-$GRADLE_VERSION
export GRADLE_PATH=$GRADLE_HOME/bin/gradle
# touch gradle cache files to prevent expiration
while IFS= read -r -d '' file
do
touch "$file" || true
done < <(find /var/lib/jenkins/.gradle -type f -print0)
export GRADLE_LOCAL_PROPERTIES=~/workspace/android/local.properties
rm -f $GRADLE_LOCAL_PROPERTIES
echo "sdk.dir=/opt/android/sdk" >> $GRADLE_LOCAL_PROPERTIES
echo "ndk.dir=/opt/ndk" >> $GRADLE_LOCAL_PROPERTIES
echo "cmake.dir=/usr/local" >> $GRADLE_LOCAL_PROPERTIES
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
# Run custom build script
if [[ "${BUILD_ENVIRONMENT}" == *-gradle-custom-build* ]]; then
# Install torch & torchvision - used to download & dump used ops from test model.
retry pip install torch torchvision --progress-bar off
exec "$(dirname "${BASH_SOURCE[0]}")/../../android/build_test_app_custom.sh" armeabi-v7a
fi
# Run default build
BUILD_ANDROID_INCLUDE_DIR_x86=~/workspace/build_android/install/include
BUILD_ANDROID_LIB_DIR_x86=~/workspace/build_android/install/lib
@@ -44,9 +73,6 @@ ln -s ${BUILD_ANDROID_INCLUDE_DIR_arm_v8a} ${JNI_INCLUDE_DIR}/arm64-v8a
ln -s ${BUILD_ANDROID_LIB_DIR_arm_v8a} ${JNI_LIBS_DIR}/arm64-v8a
fi
env
echo "BUILD_ENVIRONMENT:$BUILD_ENVIRONMENT"
GRADLE_PARAMS="-p android assembleRelease --debug --stacktrace"
if [[ "${BUILD_ENVIRONMENT}" == *-gradle-build-only-x86_32* ]]; then
GRADLE_PARAMS+=" -PABI_FILTERS=x86"
@@ -56,20 +82,6 @@ if [ -n "{GRADLE_OFFLINE:-}" ]; then
GRADLE_PARAMS+=" --offline"
fi
# touch gradle cache files to prevent expiration
while IFS= read -r -d '' file
do
touch "$file" || true
done < <(find /var/lib/jenkins/.gradle -type f -print0)
env
export GRADLE_LOCAL_PROPERTIES=~/workspace/android/local.properties
rm -f $GRADLE_LOCAL_PROPERTIES
echo "sdk.dir=/opt/android/sdk" >> $GRADLE_LOCAL_PROPERTIES
echo "ndk.dir=/opt/ndk" >> $GRADLE_LOCAL_PROPERTIES
echo "cmake.dir=/usr/local" >> $GRADLE_LOCAL_PROPERTIES
$GRADLE_PATH $GRADLE_PARAMS
find . -type f -name "*.a" -exec ls -lh {} \;
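The `retry` helper defined near the top of this script chains `||` fallbacks with doubling sleeps, giving up to five attempts with rough exponential backoff (1, 2, 4, 8 seconds). An equivalent loop form as a sketch; note it uses "$@", which preserves argument quoting where the one-liner's $* cannot:
#!/usr/bin/env bash
# Retry a command up to 5 times, sleeping 1, 2, 4, 8 seconds between attempts.
retry () {
  local delay=1 attempt
  for attempt in 1 2 3 4 5; do
    "$@" && return 0
    if [ "$attempt" -lt 5 ]; then
      sleep "$delay"
      delay=$(( delay * 2 ))
    fi
  done
  return 1
}
retry curl --fail -sS https://example.com/flaky-endpoint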


@@ -30,13 +30,7 @@ if [ "$version" == "master" ]; then
is_master_doc=true
fi
# Argument 3: (optional) If present, we will NOT do any pushing. Used for testing.
dry_run=false
if [ "$3" != "" ]; then
dry_run=true
fi
echo "install_path: $install_path version: $version dry_run: $dry_run"
echo "install_path: $install_path version: $version"
# ======================== Building PyTorch C++ API Docs ========================
@@ -53,16 +47,11 @@ sudo apt-get -y install doxygen
# Generate ATen files
pushd "${pt_checkout}"
pip install -r requirements.txt
time python aten/src/ATen/gen.py \
time python -m tools.codegen.gen \
-s aten/src/ATen \
-d build/aten/src/ATen \
aten/src/ATen/Declarations.cwrap \
aten/src/THCUNN/generic/THCUNN.h \
aten/src/ATen/nn.yaml \
aten/src/ATen/native/native_functions.yaml
-d build/aten/src/ATen
# Copy some required files
cp aten/src/ATen/common_with_cwrap.py tools/shared/cwrap_common.py
cp torch/_utils_internal.py tools/shared
# Generate PyTorch files
@@ -72,12 +61,7 @@ time python tools/setup_helpers/generate_code.py \
# Build the docs
pushd docs/cpp
pip install breathe==4.13.0 bs4 lxml six
pip install --no-cache-dir -e "git+https://github.com/pytorch/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme"
pip install exhale>=0.2.1
pip install sphinx==2.4.4
# Uncomment once it is fixed
# pip install -r requirements.txt
pip install -r requirements.txt
time make VERBOSE=1 html -j
popd
@@ -106,21 +90,5 @@ git config user.name "pytorchbot"
git commit -m "Automatic sync on $(date)" || true
git status
if [ "$dry_run" = false ]; then
echo "Pushing to https://github.com/pytorch/cppdocs"
set +x
/usr/bin/expect <<DONE
spawn git push -u origin master
expect "Username*"
send "pytorchbot\n"
expect "Password*"
send "$::env(GITHUB_PYTORCHBOT_TOKEN)\n"
expect eof
DONE
set -x
else
echo "Skipping push due to dry_run"
fi
popd
# =================== The above code **should** be executed inside Docker container ===================


@@ -0,0 +1,8 @@
set "DRIVER_DOWNLOAD_LINK=https://s3.amazonaws.com/ossci-windows/451.82-tesla-desktop-winserver-2019-2016-international.exe"
curl --retry 3 -kL %DRIVER_DOWNLOAD_LINK% --output 451.82-tesla-desktop-winserver-2019-2016-international.exe
if errorlevel 1 exit /b 1
start /wait 451.82-tesla-desktop-winserver-2019-2016-international.exe -s -noreboot
if errorlevel 1 exit /b 1
del 451.82-tesla-desktop-winserver-2019-2016-international.exe || ver > NUL


@@ -7,6 +7,8 @@ sudo apt-get -y install expect-dev
# This is where the local pytorch install in the docker image is located
pt_checkout="/var/lib/jenkins/workspace"
source "$pt_checkout/.jenkins/pytorch/common_utils.sh"
echo "python_doc_push_script.sh: Invoked with $*"
set -ex
@@ -38,13 +40,7 @@ echo "error: python_doc_push_script.sh: branch (arg3) not specified"
exit 1
fi
# Argument 4: (optional) If present, we will NOT do any pushing. Used for testing.
dry_run=false
if [ "$4" != "" ]; then
dry_run=true
fi
echo "install_path: $install_path version: $version dry_run: $dry_run"
echo "install_path: $install_path version: $version"
git clone https://github.com/pytorch/pytorch.github.io -b $branch
pushd pytorch.github.io
@@ -54,25 +50,13 @@ export PATH=/opt/conda/bin:$PATH
rm -rf pytorch || true
# Install TensorBoard in python 3 so torch.utils.tensorboard classes render
pip install -q https://s3.amazonaws.com/ossci-linux/wheels/tensorboard-1.14.0a0-py3-none-any.whl
# Get all the documentation sources, put them in one place
pushd "$pt_checkout"
git clone https://github.com/pytorch/vision
pushd vision
conda install -q pillow
time python setup.py install
popd
pushd docs
rm -rf source/torchvision
cp -a ../vision/docs/source source/torchvision
# Build the docs
pip -q install -r requirements.txt || true
pip -q install -r requirements.txt
if [ "$is_master_doc" = true ]; then
# TODO: fix gh-38011 then enable this which changes warnings into errors
# export SPHINXOPTS="-WT --keep-going"
make html
make coverage
# Now we have the coverage report, we need to make sure it is empty.
@@ -126,21 +110,5 @@ git config user.name "pytorchbot"
git commit -m "auto-generating sphinx docs" || true
git status
if [ "$dry_run" = false ]; then
echo "Pushing to pytorch.github.io:$branch"
set +x
/usr/bin/expect <<DONE
spawn git push origin $branch
expect "Username*"
send "pytorchbot\n"
expect "Password*"
send "$::env(GITHUB_PYTORCHBOT_TOKEN)\n"
expect eof
DONE
set -x
else
echo "Skipping push due to dry_run"
fi
popd
# =================== The above code **should** be executed inside Docker container ===================


@@ -1,12 +1,6 @@
#!/usr/bin/env bash
set -ex -o pipefail
# Set up NVIDIA docker repo
curl -s -L --retry 3 https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
echo "deb https://nvidia.github.io/libnvidia-container/ubuntu16.04/amd64 /" | sudo tee -a /etc/apt/sources.list.d/nvidia-docker.list
echo "deb https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/amd64 /" | sudo tee -a /etc/apt/sources.list.d/nvidia-docker.list
echo "deb https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64 /" | sudo tee -a /etc/apt/sources.list.d/nvidia-docker.list
# Remove unnecessary sources
sudo rm -f /etc/apt/sources.list.d/google-chrome.list
sudo rm -f /etc/apt/heroku.list
@@ -14,7 +8,7 @@ sudo rm -f /etc/apt/openjdk-r-ubuntu-ppa-xenial.list
sudo rm -f /etc/apt/partner.list
retry () {
$* || $* || $* || $* || $*
$* || $* || $* || $* || $*
}
# Method adapted from here: https://askubuntu.com/questions/875213/apt-get-to-retry-downloading
@@ -22,70 +16,75 @@ retry () {
# This is better than retrying the whole apt-get command
echo "APT::Acquire::Retries \"3\";" | sudo tee /etc/apt/apt.conf.d/80-retries
sudo apt-get -y update
sudo apt-get -y remove linux-image-generic linux-headers-generic linux-generic docker-ce
# WARNING: Docker version is hardcoded here; you must update the
# version number below for docker-ce and nvidia-docker2 to get newer
# versions of Docker. We hardcode these numbers because we kept
# getting broken CI when Docker would update their docker version,
# and nvidia-docker2 would be out of date for a day until they
# released a newer version of their package.
#
# How do you figure out the correct versions of these packages?
# My preferred method is to start a Docker instance of the correct
# Ubuntu version (e.g., docker run -it ubuntu:16.04) and then ask
# apt for the candidate versions of the packages you need. Note that
# the CircleCI image comes with Docker.
#
# Using 'retry' here as belt-and-suspenders even though we are
# presumably retrying at the single-package level via the
# apt.conf.d/80-retries technique.
retry sudo apt-get update -qq
retry sudo apt-get -y install \
linux-headers-$(uname -r) \
linux-image-generic \
moreutils \
docker-ce=5:18.09.4~3-0~ubuntu-xenial \
nvidia-container-runtime=2.0.0+docker18.09.4-1 \
nvidia-docker2=2.0.3+docker18.09.4-1 \
expect-dev
sudo pkill -SIGHUP dockerd
echo "== DOCKER VERSION =="
docker version
retry sudo pip -q install awscli==1.16.35
if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then
DRIVER_FN="NVIDIA-Linux-x86_64-440.59.run"
DRIVER_FN="NVIDIA-Linux-x86_64-450.51.06.run"
wget "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"
sudo /bin/bash "$DRIVER_FN" -s --no-drm || (sudo cat /var/log/nvidia-installer.log && false)
nvidia-smi
# Taken directly from https://github.com/NVIDIA/nvidia-docker
# Add the package repositories
distribution=$(. /etc/os-release;echo "$ID$VERSION_ID")
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L "https://nvidia.github.io/nvidia-docker/${distribution}/nvidia-docker.list" | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update -qq
# Necessary to get the `--gpus` flag to function within docker
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
else
# Explicitly remove nvidia docker apt repositories if not building for cuda
sudo rm -rf /etc/apt/sources.list.d/nvidia-docker.list
fi
add_to_env_file() {
local content
content=$1
# BASH_ENV should be set by CircleCI
echo "${content}" >> "${BASH_ENV:-/tmp/env}"
}
add_to_env_file "IN_CIRCLECI=1"
add_to_env_file "COMMIT_SOURCE=${CIRCLE_BRANCH:-}"
add_to_env_file "BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}"
add_to_env_file "CIRCLE_PULL_REQUEST=${CIRCLE_PULL_REQUEST}"
if [[ "${BUILD_ENVIRONMENT}" == *-build ]]; then
echo "declare -x IN_CIRCLECI=1" > /home/circleci/project/env
echo "declare -x COMMIT_SOURCE=${CIRCLE_BRANCH:-}" >> /home/circleci/project/env
echo "declare -x SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> /home/circleci/project/env
add_to_env_file "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2"
SCCACHE_MAX_JOBS=$(( $(nproc) - 1 ))
MEMORY_LIMIT_MAX_JOBS=8 # the "large" resource class on CircleCI has 32 CPU cores, if we use all of them we'll OOM
MAX_JOBS=$(( ${SCCACHE_MAX_JOBS} > ${MEMORY_LIMIT_MAX_JOBS} ? ${MEMORY_LIMIT_MAX_JOBS} : ${SCCACHE_MAX_JOBS} ))
add_to_env_file "MAX_JOBS=${MAX_JOBS}"
if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then
echo "declare -x TORCH_CUDA_ARCH_LIST=5.2" >> /home/circleci/project/env
add_to_env_file "TORCH_CUDA_ARCH_LIST=5.2"
fi
export SCCACHE_MAX_JOBS=`expr $(nproc) - 1`
export MEMORY_LIMIT_MAX_JOBS=8 # the "large" resource class on CircleCI has 32 CPU cores, if we use all of them we'll OOM
export MAX_JOBS=$(( ${SCCACHE_MAX_JOBS} > ${MEMORY_LIMIT_MAX_JOBS} ? ${MEMORY_LIMIT_MAX_JOBS} : ${SCCACHE_MAX_JOBS} ))
echo "declare -x MAX_JOBS=${MAX_JOBS}" >> /home/circleci/project/env
if [[ "${BUILD_ENVIRONMENT}" == *xla* ]]; then
# This IAM user allows write access to S3 bucket for sccache & bazels3cache
set +x
echo "declare -x XLA_CLANG_CACHE_S3_BUCKET_NAME=${XLA_CLANG_CACHE_S3_BUCKET_NAME:-}" >> /home/circleci/project/env
echo "declare -x AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_AND_XLA_BAZEL_S3_BUCKET_V2:-}" >> /home/circleci/project/env
echo "declare -x AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_AND_XLA_BAZEL_S3_BUCKET_V2:-}" >> /home/circleci/project/env
add_to_env_file "XLA_CLANG_CACHE_S3_BUCKET_NAME=${XLA_CLANG_CACHE_S3_BUCKET_NAME:-}"
add_to_env_file "AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_AND_XLA_BAZEL_S3_BUCKET_V2:-}"
add_to_env_file "AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_AND_XLA_BAZEL_S3_BUCKET_V2:-}"
set -x
else
# This IAM user allows write access to S3 bucket for sccache
set +x
echo "declare -x XLA_CLANG_CACHE_S3_BUCKET_NAME=${XLA_CLANG_CACHE_S3_BUCKET_NAME:-}" >> /home/circleci/project/env
echo "declare -x AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}" >> /home/circleci/project/env
echo "declare -x AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}" >> /home/circleci/project/env
add_to_env_file "XLA_CLANG_CACHE_S3_BUCKET_NAME=${XLA_CLANG_CACHE_S3_BUCKET_NAME:-}"
add_to_env_file "AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}"
add_to_env_file "AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}"
set -x
fi
fi
@@ -94,5 +93,5 @@ fi
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_ECR_READ_WRITE_V4:-}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_ECR_READ_WRITE_V4:-}
eval $(aws ecr get-login --region us-east-1 --no-include-email)
eval "$(aws ecr get-login --region us-east-1 --no-include-email)"
set -x
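The MAX_JOBS computation in this script uses the C-style ternary that Bash arithmetic supports, effectively min(SCCACHE_MAX_JOBS, MEMORY_LIMIT_MAX_JOBS), so sccache concurrency never exceeds the memory-safe cap. A standalone sketch:
#!/usr/bin/env bash
SCCACHE_MAX_JOBS=$(( $(nproc) - 1 ))
MEMORY_LIMIT_MAX_JOBS=8   # cap so a 32-core runner doesn't OOM
# $(( cond ? a : b )) is a C-style ternary, so this computes min(x, y).
MAX_JOBS=$(( SCCACHE_MAX_JOBS > MEMORY_LIMIT_MAX_JOBS ? MEMORY_LIMIT_MAX_JOBS : SCCACHE_MAX_JOBS ))
echo "MAX_JOBS=${MAX_JOBS}"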


@@ -33,7 +33,7 @@ systemctl list-units --all | cat
sudo pkill apt-get || true
# For even better luck, purge unattended-upgrades
sudo apt-get purge -y unattended-upgrades
sudo apt-get purge -y unattended-upgrades || true
cat /etc/apt/sources.list


@@ -46,6 +46,7 @@ def build_message(size):
"time": int(time.time()),
"size": size,
"commit_time": int(os.environ.get("COMMIT_TIME", "0")),
"run_duration": int(time.time() - os.path.getmtime(os.path.realpath(__file__))),
},
}
@@ -118,6 +119,7 @@ def report_android_sizes(file_dir):
"int": {
"time": int(time.time()),
"commit_time": int(os.environ.get("COMMIT_TIME", "0")),
"run_duration": int(time.time() - os.path.getmtime(os.path.realpath(__file__))),
"size": comp_size,
"raw_size": uncomp_size,
},
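The new run_duration field above is derived from the script file's mtime: the checkout step writes the file at roughly job start, so "now minus mtime" approximates elapsed job time. The same trick in shell form, as a sketch (the CI code does this in Python via os.path.getmtime):
#!/usr/bin/env bash
# The checkout step (re)writes this script, so its mtime ~ job start time.
start_ts=$(stat -c %Y "$0" 2>/dev/null || stat -f %m "$0")   # GNU stat, then BSD stat
run_duration=$(( $(date +%s) - start_ts ))
echo "job has been running for ${run_duration}s"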


@@ -1,7 +1,7 @@
$VS_DOWNLOAD_LINK = "https://aka.ms/vs/15/release/vs_buildtools.exe"
$COLLECT_DOWNLOAD_LINK = "https://aka.ms/vscollect.exe"
$VS_INSTALL_ARGS = @("--nocache","--quiet","--wait", "--add Microsoft.VisualStudio.Workload.VCTools",
"--add Microsoft.VisualStudio.Component.VC.Tools.14.11",
"--add Microsoft.VisualStudio.Component.VC.Tools.14.13",
"--add Microsoft.Component.MSBuild",
"--add Microsoft.VisualStudio.Component.Roslyn.Compiler",
"--add Microsoft.VisualStudio.Component.TextTemplating",


@@ -1,30 +1,50 @@
#!/bin/bash
set -eux -o pipefail
curl --retry 3 -kLO https://ossci-windows.s3.amazonaws.com/cuda_10.1.243_426.00_win10.exe
7z x cuda_10.1.243_426.00_win10.exe -ocuda_10.1.243_426.00_win10
cd cuda_10.1.243_426.00_win10
if [[ "$CUDA_VERSION" == "10" ]]; then
cuda_complete_version="10.1"
cuda_installer_name="cuda_10.1.243_426.00_win10"
msbuild_project_dir="CUDAVisualStudioIntegration/extras/visual_studio_integration/MSBuildExtensions"
cuda_install_packages="nvcc_10.1 cuobjdump_10.1 nvprune_10.1 cupti_10.1 cublas_10.1 cublas_dev_10.1 cudart_10.1 cufft_10.1 cufft_dev_10.1 curand_10.1 curand_dev_10.1 cusolver_10.1 cusolver_dev_10.1 cusparse_10.1 cusparse_dev_10.1 nvgraph_10.1 nvgraph_dev_10.1 npp_10.1 npp_dev_10.1 nvrtc_10.1 nvrtc_dev_10.1 nvml_dev_10.1"
elif [[ "$CUDA_VERSION" == "11" ]]; then
cuda_complete_version="11.0"
cuda_installer_name="cuda_11.0.2_451.48_win10"
msbuild_project_dir="visual_studio_integration/CUDAVisualStudioIntegration/extras/visual_studio_integration/MSBuildExtensions"
cuda_install_packages="nvcc_11.0 cuobjdump_11.0 nvprune_11.0 nvprof_11.0 cupti_11.0 cublas_11.0 cublas_dev_11.0 cudart_11.0 cufft_11.0 cufft_dev_11.0 curand_11.0 curand_dev_11.0 cusolver_11.0 cusolver_dev_11.0 cusparse_11.0 cusparse_dev_11.0 npp_11.0 npp_dev_11.0 nvrtc_11.0 nvrtc_dev_11.0 nvml_dev_11.0"
else
echo "CUDA_VERSION $CUDA_VERSION is not supported yet"
exit 1
fi
cuda_installer_link="https://ossci-windows.s3.amazonaws.com/${cuda_installer_name}.exe"
curl --retry 3 -kLO $cuda_installer_link
7z x ${cuda_installer_name}.exe -o${cuda_installer_name}
cd ${cuda_installer_name}
mkdir cuda_install_logs
set +e
./setup.exe -s nvcc_10.1 cuobjdump_10.1 nvprune_10.1 cupti_10.1 cublas_10.1 cublas_dev_10.1 cudart_10.1 cufft_10.1 cufft_dev_10.1 curand_10.1 curand_dev_10.1 cusolver_10.1 cusolver_dev_10.1 cusparse_10.1 cusparse_dev_10.1 nvgraph_10.1 nvgraph_dev_10.1 npp_10.1 npp_dev_10.1 nvrtc_10.1 nvrtc_dev_10.1 nvml_dev_10.1 -loglevel:6 -log:"$(pwd -W)/cuda_install_logs"
./setup.exe -s ${cuda_install_packages} -loglevel:6 -log:"$(pwd -W)/cuda_install_logs"
set -e
if [[ "${VC_YEAR}" == "2017" ]]; then
cp -r CUDAVisualStudioIntegration/extras/visual_studio_integration/MSBuildExtensions/* "C:/Program Files (x86)/Microsoft Visual Studio/2017/${VC_PRODUCT}/Common7/IDE/VC/VCTargets/BuildCustomizations/"
cp -r ${msbuild_project_dir}/* "C:/Program Files (x86)/Microsoft Visual Studio/2017/${VC_PRODUCT}/Common7/IDE/VC/VCTargets/BuildCustomizations/"
else
cp -r CUDAVisualStudioIntegration/extras/visual_studio_integration/MSBuildExtensions/* "C:/Program Files (x86)/Microsoft Visual Studio/2019/${VC_PRODUCT}/MSBuild/Microsoft/VC/v160/BuildCustomizations/"
cp -r ${msbuild_project_dir}/* "C:/Program Files (x86)/Microsoft Visual Studio/2019/${VC_PRODUCT}/MSBuild/Microsoft/VC/v160/BuildCustomizations/"
fi
curl --retry 3 -kLO https://ossci-windows.s3.amazonaws.com/NvToolsExt.7z
7z x NvToolsExt.7z -oNvToolsExt
mkdir -p "C:/Program Files/NVIDIA Corporation/NvToolsExt"
cp -r NvToolsExt/* "C:/Program Files/NVIDIA Corporation/NvToolsExt/"
export NVTOOLSEXT_PATH="C:\\Program Files\\NVIDIA Corporation\\NvToolsExt\\"
if ! ls "/c/Program Files/NVIDIA Corporation/NvToolsExt/bin/x64/nvToolsExt64_1.dll"
then
curl --retry 3 -kLO https://ossci-windows.s3.amazonaws.com/NvToolsExt.7z
7z x NvToolsExt.7z -oNvToolsExt
mkdir -p "C:/Program Files/NVIDIA Corporation/NvToolsExt"
cp -r NvToolsExt/* "C:/Program Files/NVIDIA Corporation/NvToolsExt/"
export NVTOOLSEXT_PATH="C:\\Program Files\\NVIDIA Corporation\\NvToolsExt\\"
fi
if ! ls "/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.1/bin/nvcc.exe"
if ! ls "/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v${cuda_complete_version}/bin/nvcc.exe"
then
echo "CUDA installation failed"
mkdir -p /c/w/build-results
@@ -33,5 +53,5 @@ then
fi
cd ..
rm -rf ./cuda_10.1.243_426.00_win10
rm -f ./cuda_10.1.243_426.00_win10.exe
rm -rf ./${cuda_installer_name}
rm -f ./${cuda_installer_name}.exe


@@ -0,0 +1,21 @@
#!/bin/bash
set -eux -o pipefail
if [[ "$CUDA_VERSION" == "10" ]]; then
cuda_complete_version="10.1"
cudnn_installer_name="cudnn-10.1-windows10-x64-v7.6.4.38"
elif [[ "$CUDA_VERSION" == "11" ]]; then
cuda_complete_version="11.0"
cudnn_installer_name="cudnn-11.0-windows-x64-v8.0.2.39"
else
echo "CUDNN for CUDA_VERSION $CUDA_VERSION is not supported yet"
exit 1
fi
cudnn_installer_link="https://ossci-windows.s3.amazonaws.com/${cudnn_installer_name}.zip"
curl --retry 3 -O $cudnn_installer_link
7z x ${cudnn_installer_name}.zip -ocudnn
cp -r cudnn/cuda/* "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v${cuda_complete_version}/"
rm -rf cudnn
rm -f ${cudnn_installer_name}.zip


@@ -1,45 +0,0 @@
#!/usr/bin/env python3
import cimodel.data.caffe2_build_definitions as caffe2_build_definitions
import cimodel.data.simple.util.docker_constants as pytorch_docker_constants
from yaml import load
try:
from yaml import CLoader as Loader
except ImportError:
from yaml import Loader
def load_config(filename=".circleci/config.yml"):
with open(filename, "r") as fh:
return load("".join(fh.readlines()), Loader)
def load_tags_for_projects(workflow_config):
return {
v["ecr_gc_job"]["project"]: v["ecr_gc_job"]["tags_to_keep"]
for v in workflow_config["workflows"]["ecr_gc"]["jobs"]
if isinstance(v, dict) and "ecr_gc_job" in v
}
def check_version(job, tags, expected_version):
valid_versions = tags[job].split(",")
if expected_version not in valid_versions:
raise RuntimeError(
"We configured {} to use Docker version {}; but this "
"version is not configured in job ecr_gc_job_for_{}. Non-deployed versions will be "
"garbage collected two weeks after they are created. DO NOT LAND "
"THIS TO MASTER without also updating ossci-job-dsl with this version."
"\n\nDeployed versions: {}".format(job, expected_version, job, tags[job])
)
def validate_docker_version():
tags = load_tags_for_projects(load_config())
check_version("pytorch", tags, pytorch_docker_constants.DOCKER_IMAGE_TAG)
check_version("caffe2", tags, caffe2_build_definitions.DOCKER_IMAGE_VERSION)
if __name__ == "__main__":
validate_docker_version()


@@ -59,7 +59,7 @@ binary_windows_params: &binary_windows_params
default: ""
executor:
type: string
default: "windows-cpu-with-nvidia-cuda"
default: "windows-xlarge-cpu-with-nvidia-cuda"
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
BUILD_FOR_SYSTEM: windows


@@ -1,27 +0,0 @@
caffe2_params: &caffe2_params
parameters:
build_environment:
type: string
default: ""
build_ios:
type: string
default: ""
docker_image:
type: string
default: ""
use_cuda_docker_runtime:
type: string
default: ""
build_only:
type: string
default: ""
resource_class:
type: string
default: "large"
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
BUILD_IOS: << parameters.build_ios >>
USE_CUDA_DOCKER_RUNTIME: << parameters.use_cuda_docker_runtime >>
DOCKER_IMAGE: << parameters.docker_image >>
BUILD_ONLY: << parameters.build_only >>
resource_class: << parameters.resource_class >>


@@ -46,7 +46,7 @@ pytorch_windows_params: &pytorch_windows_params
parameters:
executor:
type: string
default: "windows-cpu-with-nvidia-cuda"
default: "windows-xlarge-cpu-with-nvidia-cuda"
build_environment:
type: string
default: ""
@@ -61,10 +61,10 @@ pytorch_windows_params: &pytorch_windows_params
default: "3.6"
vc_version:
type: string
default: "14.11"
default: "14.16"
vc_year:
type: string
default: "2017"
default: "2019"
vc_product:
type: string
default: "BuildTools"


@@ -1,23 +1,26 @@
commands:
# Must be run after attaching workspace from previous steps
load_shared_env:
description: "Loads .circleci/shared/env_file into ${BASH_ENV}"
parameters:
# For some weird reason we decide to reattach our workspace to ~/workspace so
# in the vein of making it simple let's assume our share env_file is here
root:
type: string
default: "~/workspace"
calculate_docker_image_tag:
description: "Calculates the docker image tag"
steps:
- run:
name: "Load .circleci/shared/env_file into ${BASH_ENV}"
name: "Calculate docker image hash"
command: |
if [[ -f "<< parameters.root >>/.circleci/shared/env_file" ]]; then
cat << parameters.root >>/.circleci/shared/env_file >> ${BASH_ENV}
else
echo "We didn't have a shared env file, that's weird"
DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker)
echo "DOCKER_TAG=${DOCKER_TAG}" >> "${BASH_ENV}"
designate_upload_channel:
description: "inserts the correct upload channel into ${BASH_ENV}"
steps:
- run:
name: adding UPLOAD_CHANNEL to BASH_ENV
command: |
our_upload_channel=nightly
# On tags upload to test instead
if [[ -n "${CIRCLE_TAG}" ]]; then
our_upload_channel=test
fi
echo "export UPLOAD_CHANNEL=${our_upload_channel}" >> ${BASH_ENV}
# This system setup script is meant to run before the CI-related scripts, e.g.,
# installing Git client, checking out code, setting up CI env, and
@@ -130,4 +133,42 @@ commands:
echo "This is not a pull request, skipping..."
fi
upload_binary_size_for_android_build:
description: "Upload binary size data for Android build"
parameters:
build_type:
type: string
default: ""
artifacts:
type: string
default: ""
steps:
- run:
name: "Binary Size - Install Dependencies"
no_output_timeout: "5m"
command: |
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
retry pip3 install requests
- run:
name: "Binary Size - Untar Artifacts"
no_output_timeout: "5m"
command: |
# The artifact file is created inside the Docker container and contains the result
# binaries. Unpack it into the project folder; the subsequent script scans the
# project folder to locate the result binaries and report their sizes.
# If no artifact file is provided, it is assumed that the project folder was
# mounted into the container during the build and already contains the result
# binaries, so this step can be skipped.
export ARTIFACTS="<< parameters.artifacts >>"
if [ -n "${ARTIFACTS}" ]; then
tar xf "${ARTIFACTS}" -C ~/project
fi
- run:
name: "Binary Size - Upload << parameters.build_type >>"
no_output_timeout: "5m"
command: |
cd ~/project
export ANDROID_BUILD_TYPE="<< parameters.build_type >>"
export COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0)
python3 .circleci/scripts/upload_binary_size_to_scuba.py android
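The calculate_docker_image_tag command above derives the image tag from `git rev-parse HEAD:.circleci/docker`, the git tree-object hash of that directory: it changes exactly when a file under the directory changes, which makes the Docker tags content-addressed for free. A quick illustration:
#!/usr/bin/env bash
set -euo pipefail
# Tree hash of a directory as of HEAD; stable until a file under it changes.
DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker)
echo "DOCKER_TAG=${DOCKER_TAG}"
# A committed edit under .circleci/docker/ produces a new hash (-> rebuild);
# unrelated commits leave the hash, and thus the cached image, untouched.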


@@ -26,9 +26,14 @@ executors:
image: windows-server-2019-nvidia:stable
shell: bash.exe
windows-cpu-with-nvidia-cuda:
windows-xlarge-cpu-with-nvidia-cuda:
machine:
# we will change to CPU host when it's ready
resource_class: windows.xlarge
image: windows-server-2019-vs2019:stable
shell: bash.exe
windows-medium-cpu-with-nvidia-cuda:
machine:
resource_class: windows.medium
image: windows-server-2019-vs2019:stable
shell: bash.exe


@@ -1,60 +1,42 @@
binary_linux_build:
<<: *binary_linux_build_params
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- calculate_docker_image_tag
- run:
<<: *binary_checkout
- run:
<<: *binary_populate_env
- run:
name: Install unbuffer and ts
command: |
set -eux -o pipefail
source /env
OS_NAME=`awk -F= '/^NAME/{print $2}' /etc/os-release`
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
retry yum -q -y install epel-release
retry yum -q -y install expect moreutils
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
retry apt-get update
retry apt-get -y install expect moreutils
retry conda install -y -c eumetsat expect
retry conda install -y cmake
fi
- run:
name: Update compiler to devtoolset7
command: |
set -eux -o pipefail
source /env
if [[ "$DESIRED_DEVTOOLSET" == 'devtoolset7' ]]; then
source "/builder/update_compiler.sh"
# Env variables are not persisted into the next step
echo "export PATH=$PATH" >> /env
echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH" >> /env
else
echo "Not updating compiler"
fi
- run:
name: Build
no_output_timeout: "1h"
command: |
source "/pytorch/.circleci/scripts/binary_linux_build.sh"
# Preserve build log
if [ -f /pytorch/build/.ninja_log ]; then
cp /pytorch/build/.ninja_log /final_pkgs
fi
- run:
name: Output binary sizes
no_output_timeout: "1m"
command: |
ls -lah /final_pkgs
- run:
name: save binary size
no_output_timeout: "5m"
command: |
source /env
cd /pytorch && export COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0)
pip3 install requests && \
python3 -mpip install requests && \
SCRIBE_GRAPHQL_ACCESS_TOKEN=${SCRIBE_GRAPHQL_ACCESS_TOKEN} \
python3 /pytorch/.circleci/scripts/upload_binary_size_to_scuba.py || exit 0
- persist_to_workspace:
root: /
paths: final_pkgs
- store_artifacts:
path: /final_pkgs
# This should really just be another step of the binary_linux_build job above.
# This isn't possible right now b/c the build job uses the docker executor
# (otherwise they'd be really really slow) but this one uses the machine
@@ -63,11 +45,10 @@
binary_linux_test:
<<: *binary_linux_test_upload_params
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
# TODO: We shouldn't attach the workspace multiple times
- attach_workspace:
at: /home/circleci/project
- setup_linux_system_environment
@@ -83,25 +64,41 @@
- run:
<<: *binary_run_in_docker
binary_linux_upload:
<<: *binary_linux_test_upload_params
machine:
image: ubuntu-1604:201903-01
binary_upload:
parameters:
package_type:
type: string
description: "What type of package we are uploading (eg. wheel, libtorch, conda)"
default: "wheel"
upload_subfolder:
type: string
description: "What subfolder to put our package into (eg. cpu, cudaX.Y, etc.)"
default: "cpu"
docker:
- image: continuumio/miniconda3
environment:
- DRY_RUN: disabled
- PACKAGE_TYPE: "<< parameters.package_type >>"
- UPLOAD_SUBFOLDER: "<< parameters.upload_subfolder >>"
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- setup_linux_system_environment
- setup_ci_environment
- attach_workspace:
at: /home/circleci/project
- run:
<<: *binary_populate_env
- run:
<<: *binary_install_miniconda
- run:
name: Upload
no_output_timeout: "1h"
command: .circleci/scripts/binary_linux_upload.sh
- attach_workspace:
at: /tmp/workspace
- checkout
- designate_upload_channel
- run:
name: Install dependencies
no_output_timeout: "1h"
command: |
conda install -yq anaconda-client
pip install -q awscli
- run:
name: Do upload
no_output_timeout: "1h"
command: |
AWS_ACCESS_KEY_ID="${PYTORCH_BINARY_AWS_ACCESS_KEY_ID}" \
AWS_SECRET_ACCESS_KEY="${PYTORCH_BINARY_AWS_SECRET_ACCESS_KEY}" \
ANACONDA_API_TOKEN="${CONDA_PYTORCHBOT_TOKEN}" \
.circleci/scripts/binary_upload.sh
# Nightly build smoke tests defaults
# These are the second-round smoke tests. These make sure that the binaries are
@@ -111,9 +108,10 @@
smoke_linux_test:
<<: *binary_linux_test_upload_params
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
@@ -137,7 +135,7 @@
smoke_mac_test:
<<: *binary_linux_test_upload_params
macos:
xcode: "9.4.1"
xcode: "11.2.1"
steps:
- checkout
- run:
@@ -162,7 +160,7 @@
binary_mac_build:
<<: *binary_mac_params
macos:
xcode: "9.4.1"
xcode: "11.2.1"
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
@@ -200,30 +198,6 @@
root: /Users/distiller/project
paths: final_pkgs
binary_mac_upload: &binary_mac_upload
<<: *binary_mac_params
macos:
xcode: "9.4.1"
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- run:
<<: *binary_checkout
- run:
<<: *binary_populate_env
- brew_update
- run:
<<: *binary_install_miniconda
- attach_workspace: # TODO - we can `cp` from ~/workspace
at: /Users/distiller/project
- run:
name: Upload
no_output_timeout: "10m"
command: |
script="/Users/distiller/project/pytorch/.circleci/scripts/binary_macos_upload.sh"
cat "$script"
source "$script"
binary_ios_build:
<<: *pytorch_ios_params
macos:
@@ -276,7 +250,7 @@
default: ""
executor:
type: string
default: "windows-cpu-with-nvidia-cuda"
default: "windows-xlarge-cpu-with-nvidia-cuda"
executor: <<parameters.executor>>
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
@@ -305,7 +279,7 @@
default: ""
executor:
type: string
default: "windows-cpu-with-nvidia-cuda"
default: "windows-medium-cpu-with-nvidia-cuda"
executor: <<parameters.executor>>
steps:
- checkout
@@ -324,28 +298,6 @@
cat "$script"
source "$script"
binary_windows_upload:
<<: *binary_windows_params
docker:
- image: continuumio/miniconda
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- attach_workspace:
at: /root/workspace
- run:
<<: *binary_checkout
- run:
<<: *binary_populate_env
- run:
name: Upload
no_output_timeout: "10m"
command: |
set -eux -o pipefail
script="/pytorch/.circleci/scripts/binary_windows_upload.sh"
cat "$script"
source "$script"
smoke_windows_test:
<<: *binary_windows_params
parameters:
@@ -354,7 +306,7 @@
default: ""
executor:
type: string
default: "windows-cpu-with-nvidia-cuda"
default: "windows-medium-cpu-with-nvidia-cuda"
executor: <<parameters.executor>>
steps:
- checkout
@@ -372,3 +324,32 @@
cat "$script"
source "$script"
anaconda_prune:
parameters:
packages:
type: string
description: "What packages are we pruning? (quoted, space-separated string. eg. 'pytorch', 'torchvision torchaudio', etc.)"
default: "pytorch"
channel:
type: string
description: "What channel are we pruning? (eq. pytorch-nightly)"
default: "pytorch-nightly"
docker:
- image: continuumio/miniconda3
environment:
- PACKAGES: "<< parameters.packages >>"
- CHANNEL: "<< parameters.channel >>"
steps:
- checkout
- run:
name: Install dependencies
no_output_timeout: "1h"
command: |
conda install -yq anaconda-client
- run:
name: Prune packages
no_output_timeout: "1h"
command: |
ANACONDA_API_TOKEN="${CONDA_PYTORCHBOT_TOKEN}" \
scripts/release/anaconda-prune/run.sh


@@ -8,7 +8,8 @@
# then install the one with the most recent version.
update_s3_htmls: &update_s3_htmls
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
resource_class: medium
steps:
- checkout
- setup_linux_system_environment


@@ -1,198 +0,0 @@
caffe2_linux_build:
<<: *caffe2_params
machine:
image: ubuntu-1604:201903-01
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- setup_linux_system_environment
- setup_ci_environment
- run:
name: Build
no_output_timeout: "1h"
command: |
set -e
cat >/home/circleci/project/ci_build_script.sh \<<EOL
# =================== The following code will be executed inside Docker container ===================
set -ex
export BUILD_ENVIRONMENT="$BUILD_ENVIRONMENT"
# Reinitialize submodules
git submodule sync && git submodule update -q --init --recursive
# conda must be added to the path for Anaconda builds (this location must be
# the same as that in install_anaconda.sh used to build the docker image)
if [[ "${BUILD_ENVIRONMENT}" == conda* ]]; then
export PATH=/opt/conda/bin:$PATH
sudo chown -R jenkins:jenkins '/opt/conda'
fi
# Build
./.jenkins/caffe2/build.sh
# Show sccache stats if it is running
if pgrep sccache > /dev/null; then
sccache --show-stats
fi
# =================== The above code will be executed inside Docker container ===================
EOL
chmod +x /home/circleci/project/ci_build_script.sh
echo "DOCKER_IMAGE: "${DOCKER_IMAGE}
time docker pull ${DOCKER_IMAGE} >/dev/null
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE})
docker cp /home/circleci/project/. $id:/var/lib/jenkins/workspace
export COMMAND='((echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && ./ci_build_script.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
# Push intermediate Docker image for next phase to use
if [ -z "${BUILD_ONLY}" ]; then
if [[ "$BUILD_ENVIRONMENT" == *cmake* ]]; then
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}-cmake-${CIRCLE_SHA1}
else
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}-${CIRCLE_SHA1}
fi
docker commit "$id" ${COMMIT_DOCKER_IMAGE}
time docker push ${COMMIT_DOCKER_IMAGE}
fi
caffe2_linux_test:
<<: *caffe2_params
machine:
image: ubuntu-1604:201903-01
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- setup_linux_system_environment
- setup_ci_environment
- run:
name: Test
no_output_timeout: "1h"
command: |
set -e
# TODO: merge this into Caffe2 test.sh
cat >/home/circleci/project/ci_test_script.sh \<<EOL
# =================== The following code will be executed inside Docker container ===================
set -ex
export BUILD_ENVIRONMENT="$BUILD_ENVIRONMENT"
# libdc1394 (dependency of OpenCV) expects /dev/raw1394 to exist...
sudo ln /dev/null /dev/raw1394
# conda must be added to the path for Anaconda builds (this location must be
# the same as that in install_anaconda.sh used to build the docker image)
if [[ "${BUILD_ENVIRONMENT}" == conda* ]]; then
export PATH=/opt/conda/bin:$PATH
fi
# Upgrade SSL module to avoid old SSL warnings
pip -q install --user --upgrade pyOpenSSL ndg-httpsclient pyasn1
pip -q install --user -b /tmp/pip_install_onnx "file:///var/lib/jenkins/workspace/third_party/onnx#egg=onnx"
# Build
./.jenkins/caffe2/test.sh
# Remove benign core dumps.
# These are tests for signal handling (including SIGABRT).
rm -f ./crash/core.fatal_signal_as.*
rm -f ./crash/core.logging_test.*
# =================== The above code will be executed inside Docker container ===================
EOL
chmod +x /home/circleci/project/ci_test_script.sh
if [[ "$BUILD_ENVIRONMENT" == *cmake* ]]; then
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}-cmake-${CIRCLE_SHA1}
else
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}-${CIRCLE_SHA1}
fi
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null
if [ -n "${USE_CUDA_DOCKER_RUNTIME}" ]; then
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --runtime=nvidia -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
else
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
fi
docker cp /home/circleci/project/. "$id:/var/lib/jenkins/workspace"
export COMMAND='((echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && ./ci_test_script.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
caffe2_macos_build:
<<: *caffe2_params
macos:
xcode: "9.4.1"
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- run_brew_for_macos_build
- run:
name: Build
no_output_timeout: "1h"
command: |
set -e
export IN_CIRCLECI=1
brew install cmake
# Reinitialize submodules
git submodule sync && git submodule update -q --init --recursive
# Reinitialize path (see man page for path_helper(8))
eval `/usr/libexec/path_helper -s`
export PATH=/usr/local/opt/python/libexec/bin:/usr/local/bin:$PATH
# Install Anaconda if we need to
if [ -n "${CAFFE2_USE_ANACONDA}" ]; then
rm -rf ${TMPDIR}/anaconda
curl --retry 3 -o ${TMPDIR}/conda.sh https://repo.anaconda.com/miniconda/Miniconda${ANACONDA_VERSION}-latest-MacOSX-x86_64.sh
chmod +x ${TMPDIR}/conda.sh
/bin/bash ${TMPDIR}/conda.sh -b -p ${TMPDIR}/anaconda
rm -f ${TMPDIR}/conda.sh
export PATH="${TMPDIR}/anaconda/bin:${PATH}"
source ${TMPDIR}/anaconda/bin/activate
fi
pip -q install numpy
# Install sccache
sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache --output /usr/local/bin/sccache
sudo chmod +x /usr/local/bin/sccache
export SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2
# This IAM user allows write access to S3 bucket for sccache
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4}
set -x
export SCCACHE_BIN=${PWD}/sccache_bin
mkdir -p ${SCCACHE_BIN}
if which sccache > /dev/null; then
printf "#!/bin/sh\nexec sccache $(which clang++) \$*" > "${SCCACHE_BIN}/clang++"
chmod a+x "${SCCACHE_BIN}/clang++"
printf "#!/bin/sh\nexec sccache $(which clang) \$*" > "${SCCACHE_BIN}/clang"
chmod a+x "${SCCACHE_BIN}/clang"
export PATH="${SCCACHE_BIN}:$PATH"
fi
# Build
if [ "${BUILD_IOS:-0}" -eq 1 ]; then
unbuffer scripts/build_ios.sh 2>&1 | ts
elif [ -n "${CAFFE2_USE_ANACONDA}" ]; then
# All conda build logic should be in scripts/build_anaconda.sh
unbuffer scripts/build_anaconda.sh 2>&1 | ts
else
unbuffer scripts/build_local.sh 2>&1 | ts
fi
# Show sccache stats if it is running
if which sccache > /dev/null; then
sccache --show-stats
fi


@@ -4,7 +4,7 @@
type: string
default: ""
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
resource_class: large
environment:
IMAGE_NAME: << parameters.image_name >>
@@ -13,20 +13,7 @@
DOCKER_BUILDKIT: 1
steps:
- checkout
- run:
name: Calculate docker tag
command: |
set -x
mkdir .circleci/shared
# git keeps a hash of all sub trees
echo "export DOCKER_TAG=$(git rev-parse HEAD:.circleci/docker)" >> .circleci/shared/env_file
# Saves our calculated docker tag to our workpace for later use
- persist_to_workspace:
root: .
paths:
- .circleci/shared/
- load_shared_env:
root: .
- calculate_docker_image_tag
- run:
name: Check if image should be built
command: |
@@ -35,7 +22,6 @@
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
eval $(aws ecr get-login --no-include-email --region us-east-1)
set -x
PREVIOUS_DOCKER_TAG=$(git rev-parse "$(git merge-base HEAD << pipeline.git.base_revision >>):.circleci/docker")
# Check if image already exists, if it does then skip building it
if docker manifest inspect "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/${IMAGE_NAME}:${DOCKER_TAG}"; then
circleci-agent step halt
@@ -43,8 +29,15 @@
# explicitly exit the step here ourselves before it causes too much trouble
exit 0
fi
# Covers the case where a previous tag doesn't exist for the tree;
# this is only really applicable to trees that don't have `.circleci/docker` at their merge base, i.e. nightly
if ! git rev-parse "$(git merge-base HEAD << pipeline.git.base_revision >>):.circleci/docker"; then
echo "Directory '.circleci/docker' not found in tree << pipeline.git.base_revision >>, you should probably rebase onto a more recent commit"
exit 1
fi
PREVIOUS_DOCKER_TAG=$(git rev-parse "$(git merge-base HEAD << pipeline.git.base_revision >>):.circleci/docker")
# If no image exists but the hash is the same as the previous hash then we should error out here
if [[ ${PREVIOUS_DOCKER_TAG} = ${DOCKER_TAG} ]]; then
if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then
echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch"
echo " contact the PyTorch team to restore the original images"
exit 1
@@ -60,7 +53,7 @@
cd .circleci/docker && ./build_docker.sh
docker_for_ecr_gc_build_job:
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
- checkout
- run:
@@ -113,23 +106,3 @@
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
set -x
/usr/bin/gc.py --filter-prefix ${PROJECT} --ignore-tags "${IMAGE_TAG},${GENERATED_IMAGE_TAG}"
docker_hub_index_job:
docker:
- image: 308535385114.dkr.ecr.us-east-1.amazonaws.com/gc/ecr
aws_auth:
aws_access_key_id: ${CIRCLECI_AWS_ACCESS_KEY_FOR_DOCKER_BUILDER_V1}
aws_secret_access_key: ${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
steps:
- run:
name: garbage collecting for ecr images
no_output_timeout: "1h"
command: |
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_DOCKER_BUILDER_V1}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
export DOCKER_HUB_USERNAME=${CIRCLECI_DOCKER_HUB_USERNAME}
export DOCKER_HUB_PASSWORD=${CIRCLECI_DOCKER_HUB_PASSWORD}
set -x
/usr/bin/docker_hub.py
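The build-skip check earlier in this file uses `docker manifest inspect`, which asks the registry for the tag's manifest without pulling any layers; if the call succeeds the image already exists and the job halts instead of rebuilding. A minimal sketch of that check (the image name is illustrative; older Docker versions may need the experimental CLI enabled):
#!/usr/bin/env bash
IMAGE="registry.example.com/pytorch/example:${DOCKER_TAG:-abc123}"   # illustrative
# Registry API call only; no image layers are downloaded.
if docker manifest inspect "${IMAGE}" >/dev/null 2>&1; then
  echo "image already exists, skipping build"
  exit 0
fi
echo "image not found, building ${IMAGE}"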


@@ -1,13 +1,39 @@
pytorch_python_doc_push:
pytorch_doc_push:
resource_class: medium
machine:
image: ubuntu-1604:202007-01
parameters:
branch:
type: string
default: "master"
steps:
- attach_workspace:
at: /tmp/workspace
- run:
name: Generate netrc
command: |
# set credentials for https pushing
cat > ~/.netrc \<<DONE
machine github.com
login pytorchbot
password ${GITHUB_PYTORCHBOT_TOKEN}
DONE
- run:
name: Docs push
command: |
pushd /tmp/workspace
git push -u origin "<< parameters.branch >>"
pytorch_python_doc_build:
environment:
BUILD_ENVIRONMENT: pytorch-python-doc-push
# TODO: stop hardcoding this
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:209062ef-ab58-422a-b295-36c4eed6e906"
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4"
resource_class: large
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
@@ -15,49 +41,44 @@
no_output_timeout: "1h"
command: |
set -ex
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}-${CIRCLE_SHA1}
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}:${DOCKER_TAG}-${CIRCLE_SHA1}
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
tag=${CIRCLE_TAG:1:5}
target=${tag:-master}
echo "building for ${target}"
time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
# master branch docs push
if [[ "${CIRCLE_BRANCH}" == "master" ]]; then
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GITHUB_PYTORCHBOT_TOKEN=${GITHUB_PYTORCHBOT_TOKEN}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && . ./.circleci/scripts/python_doc_push_script.sh docs/master master site") | docker exec -u jenkins -i "$id" bash) 2>&1'
# stable release docs push. Due to some circleci limitations, we keep
# an eternal PR open for merging v1.2.0 -> master for this job.
# XXX: The following code is only run on the v1.2.0 branch, which might
# not be exactly the same as what you see here.
elif [[ "${CIRCLE_BRANCH}" == "v1.2.0" ]]; then
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GITHUB_PYTORCHBOT_TOKEN=${GITHUB_PYTORCHBOT_TOKEN}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && . ./.circleci/scripts/python_doc_push_script.sh docs/stable 1.2.0 site dry_run") | docker exec -u jenkins -i "$id" bash) 2>&1'
# For open PRs: Do a dry_run of the docs build, don't push build
else
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GITHUB_PYTORCHBOT_TOKEN=${GITHUB_PYTORCHBOT_TOKEN}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && . ./.circleci/scripts/python_doc_push_script.sh docs/master master site dry_run") | docker exec -u jenkins -i "$id" bash) 2>&1'
fi
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && . ./.circleci/scripts/python_doc_push_script.sh docs/'$target' master site") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_artifacts
docker cp $id:/var/lib/jenkins/workspace/pytorch.github.io/docs/master ~/workspace/build_artifacts
docker cp $id:/var/lib/jenkins/workspace/pytorch.github.io /tmp/workspace
# Save the docs build so we can debug any problems
export DEBUG_COMMIT_DOCKER_IMAGE=${COMMIT_DOCKER_IMAGE}-debug
docker commit "$id" ${DEBUG_COMMIT_DOCKER_IMAGE}
time docker push ${DEBUG_COMMIT_DOCKER_IMAGE}
- persist_to_workspace:
root: /tmp/workspace
paths:
- .
- store_artifacts:
path: ~/workspace/build_artifacts/master
destination: docs
pytorch_cpp_doc_push:
pytorch_cpp_doc_build:
environment:
BUILD_ENVIRONMENT: pytorch-cpp-doc-push
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:209062ef-ab58-422a-b295-36c4eed6e906"
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4"
resource_class: large
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
@ -65,39 +86,36 @@
no_output_timeout: "1h"
command: |
set -ex
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}-${CIRCLE_SHA1}
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}:${DOCKER_TAG}-${CIRCLE_SHA1}
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
tag=${CIRCLE_TAG:1:5}
target=${tag:-master}
echo "building for ${target}"
time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
# master branch docs push
if [[ "${CIRCLE_BRANCH}" == "master" ]]; then
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GITHUB_PYTORCHBOT_TOKEN=${GITHUB_PYTORCHBOT_TOKEN}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && . ./.circleci/scripts/cpp_doc_push_script.sh docs/master master") | docker exec -u jenkins -i "$id" bash) 2>&1'
# stable release docs push. Due to some circleci limitations, we keep
# an eternal PR open (#16502) for merging v1.0.1 -> master for this job.
# XXX: The following code is only run on the v1.0.1 branch, which might
# not be exactly the same as what you see here.
elif [[ "${CIRCLE_BRANCH}" == "v1.0.1" ]]; then
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GITHUB_PYTORCHBOT_TOKEN=${GITHUB_PYTORCHBOT_TOKEN}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && . ./.circleci/scripts/cpp_doc_push_script.sh docs/stable 1.0.1") | docker exec -u jenkins -i "$id" bash) 2>&1'
# For open PRs: Do a dry_run of the docs build, don't push build
else
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GITHUB_PYTORCHBOT_TOKEN=${GITHUB_PYTORCHBOT_TOKEN}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && . ./.circleci/scripts/cpp_doc_push_script.sh docs/master master dry_run") | docker exec -u jenkins -i "$id" bash) 2>&1'
fi
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && . ./.circleci/scripts/cpp_doc_push_script.sh docs/"$target" master") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_artifacts
docker cp $id:/var/lib/jenkins/workspace/cppdocs/ /tmp/workspace
# Save the docs build so we can debug any problems
export DEBUG_COMMIT_DOCKER_IMAGE=${COMMIT_DOCKER_IMAGE}-debug
docker commit "$id" ${DEBUG_COMMIT_DOCKER_IMAGE}
time docker push ${DEBUG_COMMIT_DOCKER_IMAGE}
- persist_to_workspace:
root: /tmp/workspace
paths:
- .
pytorch_macos_10_13_py3_build:
environment:
BUILD_ENVIRONMENT: pytorch-macos-10.13-py3-build
macos:
xcode: "9.4.1"
xcode: "11.2.1"
steps:
- checkout
- run_brew_for_macos_build
@ -131,7 +149,7 @@
environment:
BUILD_ENVIRONMENT: pytorch-macos-10.13-py3-test
macos:
xcode: "9.4.1"
xcode: "11.2.1"
steps:
- checkout
- attach_workspace:
@ -152,13 +170,14 @@
pytorch_android_gradle_build:
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:209062ef-ab58-422a-b295-36c4eed6e906"
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
PYTHON_VERSION: "3.6"
resource_class: large
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
@ -166,7 +185,7 @@
no_output_timeout: "1h"
command: |
set -eux
docker_image_commit=${DOCKER_IMAGE}-${CIRCLE_SHA1}
docker_image_commit=${DOCKER_IMAGE}:${DOCKER_TAG}-${CIRCLE_SHA1}
docker_image_libtorch_android_x86_32=${docker_image_commit}-android-x86_32
docker_image_libtorch_android_x86_64=${docker_image_commit}-android-x86_64
@ -181,16 +200,16 @@
# x86_32
time docker pull ${docker_image_libtorch_android_x86_32} >/dev/null
export id_x86_32=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_32})
export id_x86_32=$(docker run --env-file "${BASH_ENV}" -e GRADLE_OFFLINE=1 --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_32})
export COMMAND='((echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_x86_32" bash) 2>&1'
export COMMAND='((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_x86_32" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
# arm-v7a
time docker pull ${docker_image_libtorch_android_arm_v7a} >/dev/null
export id_arm_v7a=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_arm_v7a})
export id_arm_v7a=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_arm_v7a})
export COMMAND='((echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_arm_v7a" bash) 2>&1'
export COMMAND='((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_arm_v7a" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_install_arm_v7a
@ -198,9 +217,9 @@
# x86_64
time docker pull ${docker_image_libtorch_android_x86_64} >/dev/null
export id_x86_64=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_64})
export id_x86_64=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_64})
export COMMAND='((echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_x86_64" bash) 2>&1'
export COMMAND='((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_x86_64" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_install_x86_64
@ -208,9 +227,9 @@
# arm-v8a
time docker pull ${docker_image_libtorch_android_arm_v8a} >/dev/null
export id_arm_v8a=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_arm_v8a})
export id_arm_v8a=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_arm_v8a})
export COMMAND='((echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_arm_v8a" bash) 2>&1'
export COMMAND='((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_arm_v8a" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_install_arm_v8a
@ -221,7 +240,7 @@
docker cp ~/workspace/build_android_install_arm_v8a $id_x86_32:/var/lib/jenkins/workspace/build_android_install_arm_v8a
# run gradle buildRelease
export COMMAND='((echo "source ./workspace/env" && echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GRADLE_OFFLINE=1" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec -u jenkins -i "$id_x86_32" bash) 2>&1'
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec -u jenkins -i "$id_x86_32" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_artifacts
@ -230,26 +249,9 @@
output_image=$docker_image_libtorch_android_x86_32-gradle
docker commit "$id_x86_32" ${output_image}
time docker push ${output_image}
- run:
name: save binary size
no_output_timeout: "5m"
command: |
docker_image=${DOCKER_IMAGE}-${CIRCLE_SHA1}-android-x86_32-gradle
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image})
echo "docker-id: $id"
cat \<< EOL | docker exec -u jenkins -i "$id" bash
# ============================== Begin Docker ==============================
cd workspace
source ./env
export ANDROID_BUILD_TYPE="prebuild"
export COMMIT_TIME=\$(git log --max-count=1 --format=%ct || echo 0)
export CIRCLE_BUILD_NUM="${CIRCLE_BUILD_NUM}"
export CIRCLE_SHA1="${CIRCLE_SHA1}"
export CIRCLE_BRANCH="${CIRCLE_BRANCH}"
export SCRIBE_GRAPHQL_ACCESS_TOKEN="${SCRIBE_GRAPHQL_ACCESS_TOKEN}"
python .circleci/scripts/upload_binary_size_to_scuba.py android
# ============================== End Docker ==============================
EOL
- upload_binary_size_for_android_build:
build_type: prebuilt
artifacts: /home/circleci/workspace/build_android_artifacts/artifacts.tgz
- store_artifacts:
path: ~/workspace/build_android_artifacts/artifacts.tgz
destination: artifacts.tgz
@ -257,11 +259,11 @@
pytorch_android_publish_snapshot:
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-publish-snapshot
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:209062ef-ab58-422a-b295-36c4eed6e906"
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:ab1632df-fa59-40e6-8c23-98e004f61148"
PYTHON_VERSION: "3.6"
resource_class: large
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
- checkout
- setup_linux_system_environment
@ -281,9 +283,9 @@
# x86_32
time docker pull ${docker_image_libtorch_android_x86_32_gradle} >/dev/null
export id_x86_32=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_32_gradle})
export id_x86_32=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_32_gradle})
export COMMAND='((echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace" && echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export SONATYPE_NEXUS_USERNAME=${SONATYPE_NEXUS_USERNAME}" && echo "export SONATYPE_NEXUS_PASSWORD=${SONATYPE_NEXUS_PASSWORD}" && echo "export ANDROID_SIGN_KEY=${ANDROID_SIGN_KEY}" && echo "export ANDROID_SIGN_PASS=${ANDROID_SIGN_PASS}" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/publish_android_snapshot.sh") | docker exec -u jenkins -i "$id_x86_32" bash) 2>&1'
export COMMAND='((echo "sudo chown -R jenkins workspace" && echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export SONATYPE_NEXUS_USERNAME=${SONATYPE_NEXUS_USERNAME}" && echo "export SONATYPE_NEXUS_PASSWORD=${SONATYPE_NEXUS_PASSWORD}" && echo "export ANDROID_SIGN_KEY=${ANDROID_SIGN_KEY}" && echo "export ANDROID_SIGN_PASS=${ANDROID_SIGN_PASS}" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/publish_android_snapshot.sh") | docker exec -u jenkins -i "$id_x86_32" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
output_image=${docker_image_libtorch_android_x86_32_gradle}-publish-snapshot
@ -293,21 +295,14 @@
pytorch_android_gradle_build-x86_32:
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-only-x86_32
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:209062ef-ab58-422a-b295-36c4eed6e906"
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
PYTHON_VERSION: "3.6"
resource_class: large
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
- checkout
- run:
name: filter out not PR runs
no_output_timeout: "5m"
command: |
echo "CIRCLE_PULL_REQUEST: ${CIRCLE_PULL_REQUEST:-}"
if [ -z "${CIRCLE_PULL_REQUEST:-}" ]; then
circleci step halt
fi
- calculate_docker_image_tag
- setup_linux_system_environment
- checkout
- setup_ci_environment
@ -316,14 +311,14 @@
no_output_timeout: "1h"
command: |
set -e
docker_image_libtorch_android_x86_32=${DOCKER_IMAGE}-${CIRCLE_SHA1}-android-x86_32
docker_image_libtorch_android_x86_32=${DOCKER_IMAGE}:${DOCKER_TAG}-${CIRCLE_SHA1}-android-x86_32
echo "docker_image_libtorch_android_x86_32: "${docker_image_libtorch_android_x86_32}
# x86
time docker pull ${docker_image_libtorch_android_x86_32} >/dev/null
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_32})
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_32})
export COMMAND='((echo "source ./workspace/env" && echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GRADLE_OFFLINE=1" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GRADLE_OFFLINE=1" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_x86_32_artifacts
@ -332,30 +327,53 @@
output_image=${docker_image_libtorch_android_x86_32}-gradle
docker commit "$id" ${output_image}
time docker push ${output_image}
- run:
name: save binary size
no_output_timeout: "5m"
command: |
docker_image=${DOCKER_IMAGE}-${CIRCLE_SHA1}-android-x86_32-gradle
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image})
echo "docker-id: $id"
cat \<< EOL | docker exec -u jenkins -i "$id" bash
# ============================== Begin Docker ==============================
cd workspace
source ./env
export ANDROID_BUILD_TYPE="prebuild-single"
export COMMIT_TIME=\$(git log --max-count=1 --format=%ct || echo 0)
export CIRCLE_BUILD_NUM="${CIRCLE_BUILD_NUM}"
export CIRCLE_SHA1="${CIRCLE_SHA1}"
export CIRCLE_BRANCH="${CIRCLE_BRANCH}"
export SCRIBE_GRAPHQL_ACCESS_TOKEN="${SCRIBE_GRAPHQL_ACCESS_TOKEN}"
python .circleci/scripts/upload_binary_size_to_scuba.py android
# ============================== End Docker ==============================
EOL
- upload_binary_size_for_android_build:
build_type: prebuilt-single
artifacts: /home/circleci/workspace/build_android_x86_32_artifacts/artifacts.tgz
- store_artifacts:
path: ~/workspace/build_android_x86_32_artifacts/artifacts.tgz
destination: artifacts.tgz
pytorch_android_gradle_custom_build_single:
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
PYTHON_VERSION: "3.6"
resource_class: large
machine:
image: ubuntu-1604:202007-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- checkout
- calculate_docker_image_tag
- setup_ci_environment
- run:
name: pytorch android gradle custom build single architecture (for PR)
no_output_timeout: "1h"
command: |
set -e
# Unlike other gradle jobs, it's not worth building libtorch in a separate CI job and sharing it via docker, because:
# 1) Not shareable: it's custom selective build, which is different from default libtorch mobile build;
# 2) Not parallelizable by architecture: it only builds libtorch for one architecture;
echo "DOCKER_IMAGE: ${DOCKER_IMAGE}:${DOCKER_TAG}"
time docker pull ${DOCKER_IMAGE}:${DOCKER_TAG} >/dev/null
git submodule sync && git submodule update -q --init --recursive
VOLUME_MOUNTS="-v /home/circleci/project/:/var/lib/jenkins/workspace"
export id=$(docker run --env-file "${BASH_ENV}" ${VOLUME_MOUNTS} --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE}:${DOCKER_TAG})
export COMMAND='((echo "export GRADLE_OFFLINE=1" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
# Skip docker push as this job is purely for size analysis purposes.
# Result binaries are already in `/home/circleci/project/` as it's mounted instead of copied.
- upload_binary_size_for_android_build:
build_type: custom-build-single
pytorch_ios_build:
<<: *pytorch_ios_params
macos:
@ -475,9 +493,10 @@
pytorch_linux_bazel_build:
<<: *pytorch_params
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
@ -486,9 +505,9 @@
command: |
set -e
# Pull Docker image and run build
echo "DOCKER_IMAGE: "${DOCKER_IMAGE}
time docker pull ${DOCKER_IMAGE} >/dev/null
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE})
echo "DOCKER_IMAGE: "${DOCKER_IMAGE}:${DOCKER_TAG}
time docker pull ${DOCKER_IMAGE}:${DOCKER_TAG} >/dev/null
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE}:${DOCKER_TAG})
echo "Do NOT merge master branch into $CIRCLE_BRANCH in environment $BUILD_ENVIRONMENT"
@ -496,14 +515,14 @@
docker cp /home/circleci/project/. $id:/var/lib/jenkins/workspace
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/build.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/build.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
# Push intermediate Docker image for next phase to use
if [ -z "${BUILD_ONLY}" ]; then
# Augment our output image name with bazel to avoid collisions
output_image=${DOCKER_IMAGE}-bazel-${CIRCLE_SHA1}
output_image=${DOCKER_IMAGE}:${DOCKER_TAG}-bazel-${CIRCLE_SHA1}
export COMMIT_DOCKER_IMAGE=$output_image
docker commit "$id" ${COMMIT_DOCKER_IMAGE}
time docker push ${COMMIT_DOCKER_IMAGE}
@ -512,9 +531,10 @@
pytorch_linux_bazel_test:
<<: *pytorch_params
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
@ -522,16 +542,16 @@
no_output_timeout: "90m"
command: |
set -e
output_image=${DOCKER_IMAGE}-bazel-${CIRCLE_SHA1}
output_image=${DOCKER_IMAGE}:${DOCKER_TAG}-bazel-${CIRCLE_SHA1}
export COMMIT_DOCKER_IMAGE=$output_image
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null
if [ -n "${USE_CUDA_DOCKER_RUNTIME}" ]; then
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --runtime=nvidia -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --gpus all -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
else
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
fi
retrieve_test_reports() {
@ -541,9 +561,9 @@
trap "retrieve_test_reports" ERR
if [[ ${BUILD_ENVIRONMENT} == *"multigpu"* ]]; then
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/multigpu-test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/multigpu-test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
else
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export CIRCLE_PULL_REQUEST=${CIRCLE_PULL_REQUEST}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
fi
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
@ -555,13 +575,13 @@
pytorch_doc_test:
environment:
BUILD_ENVIRONMENT: pytorch-doc-test
# TODO: stop hardcoding this
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:209062ef-ab58-422a-b295-36c4eed6e906"
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4"
resource_class: medium
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
@ -569,9 +589,9 @@
no_output_timeout: "30m"
command: |
set -ex
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}-${CIRCLE_SHA1}
export COMMIT_DOCKER_IMAGE=${DOCKER_IMAGE}:${DOCKER_TAG}-${CIRCLE_SHA1}
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GITHUB_PYTORCHBOT_TOKEN=${GITHUB_PYTORCHBOT_TOKEN}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && . ./.jenkins/pytorch/docs-test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && . ./.jenkins/pytorch/docs-test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts


@ -2,12 +2,12 @@ jobs:
pytorch_linux_build:
<<: *pytorch_params
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- checkout
- optional_merge_target_branch
- setup_ci_environment
- run:
@ -15,33 +15,46 @@ jobs:
no_output_timeout: "1h"
command: |
set -e
# TODO: Remove this after we figure out why rocm tests are failing
if [[ "${DOCKER_IMAGE}" == *rocm3.5* ]]; then
export DOCKER_TAG="ab1632df-fa59-40e6-8c23-98e004f61148"
fi
if [[ "${DOCKER_IMAGE}" == *rocm3.7* ]]; then
export DOCKER_TAG="1045c7b891104cb4fd23399eab413b6213e48aeb"
fi
if [[ ${BUILD_ENVIRONMENT} == *"pure_torch"* ]]; then
echo 'BUILD_CAFFE2=OFF' >> "${BASH_ENV}"
fi
if [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then
echo 'ATEN_THREADING=TBB' >> "${BASH_ENV}"
echo 'USE_TBB=1' >> "${BASH_ENV}"
elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
echo 'ATEN_THREADING=NATIVE' >> "${BASH_ENV}"
fi
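# Note: these are bare KEY=VALUE lines rather than "export" statements;
# the same "${BASH_ENV}" file is handed to the container below via
# docker run --env-file, which expects exactly that KEY=VALUE format.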
echo "Parallel backend flags: "${PARALLEL_FLAGS}
# Pull Docker image and run build
echo "DOCKER_IMAGE: "${DOCKER_IMAGE}
time docker pull ${DOCKER_IMAGE} >/dev/null
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE})
echo "DOCKER_IMAGE: "${DOCKER_IMAGE}:${DOCKER_TAG}
time docker pull ${DOCKER_IMAGE}:${DOCKER_TAG} >/dev/null
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE}:${DOCKER_TAG})
git submodule sync && git submodule update -q --init --recursive
docker cp /home/circleci/project/. $id:/var/lib/jenkins/workspace
if [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then
export PARALLEL_FLAGS="export ATEN_THREADING=TBB USE_TBB=1 "
elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
export PARALLEL_FLAGS="export ATEN_THREADING=NATIVE "
fi
echo "Parallel backend flags: "${PARALLEL_FLAGS}
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo '"$PARALLEL_FLAGS"' && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/build.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/build.sh && find ${BUILD_ROOT} -type f -name "*.a" -or -name "*.o" -delete") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
# Copy dist folder back
docker cp $id:/var/lib/jenkins/workspace/dist /home/circleci/project/. || echo "Dist folder not found"
# Push intermediate Docker image for next phase to use
if [ -z "${BUILD_ONLY}" ]; then
# Note [Special build images]
# The xla build uses the same docker image as
# pytorch-linux-trusty-py3.6-gcc5.4-build. In the push step, we have to
# distinguish between them so the test can pick up the correct image.
output_image=${DOCKER_IMAGE}-${CIRCLE_SHA1}
output_image=${DOCKER_IMAGE}:${DOCKER_TAG}-${CIRCLE_SHA1}
if [[ ${BUILD_ENVIRONMENT} == *"xla"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-xla
elif [[ ${BUILD_ENVIRONMENT} == *"libtorch"* ]]; then
@ -60,20 +73,25 @@ jobs:
export COMMIT_DOCKER_IMAGE=$output_image-android-x86_32
elif [[ ${BUILD_ENVIRONMENT} == *"android-ndk-r19c-vulkan-x86_32"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-android-vulkan-x86_32
elif [[ ${BUILD_ENVIRONMENT} == *"vulkan-linux"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-vulkan
else
export COMMIT_DOCKER_IMAGE=$output_image
fi
docker commit "$id" ${COMMIT_DOCKER_IMAGE}
time docker push ${COMMIT_DOCKER_IMAGE}
fi
- store_artifacts:
path: /home/circleci/project/dist
pytorch_linux_test:
<<: *pytorch_params
machine:
image: ubuntu-1604:201903-01
image: ubuntu-1604:202007-01
steps:
# See Note [Workspace for CircleCI scripts] in job-specs-setup.yml
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
@ -81,8 +99,16 @@ jobs:
no_output_timeout: "90m"
command: |
set -e
export PYTHONUNBUFFERED=1
# TODO: Remove this after we figure out why rocm tests are failing
if [[ "${DOCKER_IMAGE}" == *rocm3.5* ]]; then
export DOCKER_TAG="ab1632df-fa59-40e6-8c23-98e004f61148"
fi
if [[ "${DOCKER_IMAGE}" == *rocm3.7* ]]; then
export DOCKER_TAG="1045c7b891104cb4fd23399eab413b6213e48aeb"
fi
# See Note [Special build images]
output_image=${DOCKER_IMAGE}-${CIRCLE_SHA1}
output_image=${DOCKER_IMAGE}:${DOCKER_TAG}-${CIRCLE_SHA1}
if [[ ${BUILD_ENVIRONMENT} == *"xla"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-xla
elif [[ ${BUILD_ENVIRONMENT} == *"libtorch"* ]]; then
@ -91,30 +117,34 @@ jobs:
export COMMIT_DOCKER_IMAGE=$output_image-paralleltbb
elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-parallelnative
elif [[ ${BUILD_ENVIRONMENT} == *"vulkan-linux"* ]]; then
export COMMIT_DOCKER_IMAGE=$output_image-vulkan
else
export COMMIT_DOCKER_IMAGE=$output_image
fi
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
if [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then
export PARALLEL_FLAGS="export ATEN_THREADING=TBB USE_TBB=1 "
echo 'ATEN_THREADING=TBB' >> "${BASH_ENV}"
echo 'USE_TBB=1' >> "${BASH_ENV}"
elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
export PARALLEL_FLAGS="export ATEN_THREADING=NATIVE "
echo 'ATEN_THREADING=NATIVE' >> "${BASH_ENV}"
fi
echo "Parallel backend flags: "${PARALLEL_FLAGS}
time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null
# TODO: Make this less painful
if [ -n "${USE_CUDA_DOCKER_RUNTIME}" ]; then
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --runtime=nvidia -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --gpus all --shm-size=2g -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
elif [[ ${BUILD_ENVIRONMENT} == *"rocm"* ]]; then
hostname
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size=8g --ipc=host --device /dev/kfd --device /dev/dri --group-add video -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
else
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
fi
echo "id=${id}" >> "${BASH_ENV}"
# Pass environment variables to the next step
# See https://circleci.com/docs/2.0/env-vars/#using-parameters-and-bash-environment
echo "export PARALLEL_FLAGS=\"${PARALLEL_FLAGS}\"" >> $BASH_ENV
echo "export id=$id" >> $BASH_ENV
- run:
name: Check for no AVX instruction by default
no_output_timeout: "20m"
@ -131,8 +161,8 @@ jobs:
}
if is_vanilla_build; then
echo "apt-get update && apt-get install -y qemu-user" | docker exec -u root -i "$id" bash
echo "cd workspace/build; qemu-x86_64 -cpu Broadwell -E ATEN_CPU_CAPABILITY=default ./bin/basic --gtest_filter=BasicTest.BasicTestCPU" | docker exec -u jenkins -i "$id" bash
echo "apt-get update && apt-get install -y qemu-user gdb" | docker exec -u root -i "$id" bash
echo "cd workspace/build; qemu-x86_64 -g 2345 -cpu Broadwell -E ATEN_CPU_CAPABILITY=default ./bin/basic --gtest_filter=BasicTest.BasicTestCPU & gdb ./bin/basic -ex 'set pagination off' -ex 'target remote :2345' -ex 'continue' -ex 'bt' -ex='set confirm off' -ex 'quit \$_isvoid(\$_exitcode)'" | docker exec -u jenkins -i "$id" bash
else
echo "Skipping for ${BUILD_ENVIRONMENT}"
fi
@ -142,21 +172,56 @@ jobs:
command: |
set -e
cat >docker_commands.sh \<<EOL
# =================== The following code will be executed inside Docker container ===================
set -ex
export SCRIBE_GRAPHQL_ACCESS_TOKEN="${SCRIBE_GRAPHQL_ACCESS_TOKEN}"
${PARALLEL_FLAGS}
cd workspace
EOL
if [[ ${BUILD_ENVIRONMENT} == *"multigpu"* ]]; then
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "${PARALLEL_FLAGS}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/multigpu-test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ".jenkins/pytorch/multigpu-test.sh" >> docker_commands.sh
elif [[ ${BUILD_ENVIRONMENT} == *onnx* ]]; then
echo "pip install click mock tabulate networkx==2.0" >> docker_commands.sh
echo "pip -q install --user -b /tmp/pip_install_onnx \"file:///var/lib/jenkins/workspace/third_party/onnx#egg=onnx\"" >> docker_commands.sh
echo ".jenkins/caffe2/test.sh" >> docker_commands.sh
else
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export CIRCLE_PULL_REQUEST=${CIRCLE_PULL_REQUEST}" && echo "${PARALLEL_FLAGS}" && echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && .jenkins/pytorch/test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ".jenkins/pytorch/test.sh" >> docker_commands.sh
fi
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
echo "(cat docker_commands.sh | docker exec -u jenkins -i "$id" bash) 2>&1" > command.sh
unbuffer bash command.sh | ts
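# "unbuffer" (from the expect package) runs the script on a pseudo-tty
# so its output stays line-buffered, and "ts" (from moreutils) prefixes
# each line with a timestamp, yielding real-time, timestamped CI logs.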
- run:
name: Report results
no_output_timeout: "5m"
command: |
set -e
docker stats --all --no-stream
echo "cd workspace; python test/print_test_stats.py test" | docker exec -u jenkins -i "$id" bash
cat >docker_commands.sh \<<EOL
# =================== The following code will be executed inside Docker container ===================
set -ex
export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}
export SCRIBE_GRAPHQL_ACCESS_TOKEN="${SCRIBE_GRAPHQL_ACCESS_TOKEN}"
export CIRCLE_TAG="${CIRCLE_TAG:-}"
export CIRCLE_SHA1="$CIRCLE_SHA1"
export CIRCLE_PR_NUMBER="${CIRCLE_PR_NUMBER:-}"
export CIRCLE_BRANCH="$CIRCLE_BRANCH"
export CIRCLE_JOB="$CIRCLE_JOB"
cd workspace
python test/print_test_stats.py test
EOL
echo "(cat docker_commands.sh | docker exec -u jenkins -i "$id" bash) 2>&1" > command.sh
unbuffer bash command.sh | ts
echo "Retrieving test reports"
docker cp $id:/var/lib/jenkins/workspace/test/test-reports ./ || echo 'No test reports found!'
if [[ ${BUILD_ENVIRONMENT} == *"coverage"* ]]; then
echo "Retrieving coverage report"
docker cp $id:/var/lib/jenkins/workspace/test/.coverage ./test
docker cp $id:/var/lib/jenkins/workspace/test/coverage.xml ./test
python3 -mpip install codecov
python3 -mcodecov
fi
when: always
- store_test_results:
path: test-reports
@ -166,7 +231,7 @@ jobs:
parameters:
executor:
type: string
default: "windows-cpu-with-nvidia-cuda"
default: "windows-xlarge-cpu-with-nvidia-cuda"
build_environment:
type: string
default: ""
@ -181,10 +246,10 @@ jobs:
default: "3.6"
vc_version:
type: string
default: "14.11"
default: "14.16"
vc_year:
type: string
default: "2017"
default: "2019"
vc_product:
type: string
default: "BuildTools"
@ -194,12 +259,6 @@ jobs:
executor: <<parameters.executor>>
steps:
- checkout
- run:
name: Install VS2017
command: |
if [[ "${VC_YEAR}" == "2017" ]]; then
powershell .circleci/scripts/vs_install.ps1
fi
- run:
name: Install Cuda
no_output_timeout: 30m
@ -211,10 +270,7 @@ jobs:
name: Install Cudnn
command : |
if [[ "${USE_CUDA}" == "1" ]]; then
cd c:/
curl --retry 3 -O https://ossci-windows.s3.amazonaws.com/cudnn-10.1-windows10-x64-v7.6.4.38.zip
7z x cudnn-10.1-windows10-x64-v7.6.4.38.zip -ocudnn
cp -r cudnn/cuda/* "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.1/"
.circleci/scripts/windows_cudnn_install.sh
fi
- run:
name: Build
@ -237,7 +293,7 @@ jobs:
parameters:
executor:
type: string
default: "windows-cpu-with-nvidia-cuda"
default: "windows-medium-cpu-with-nvidia-cuda"
build_environment:
type: string
default: ""
@ -252,10 +308,10 @@ jobs:
default: "3.6"
vc_version:
type: string
default: "14.11"
default: "14.16"
vc_year:
type: string
default: "2017"
default: "2019"
vc_product:
type: string
default: "BuildTools"
@ -267,27 +323,23 @@ jobs:
- checkout
- attach_workspace:
at: c:/users/circleci/workspace
- run:
name: Install VS2017
command: |
if [[ "${VC_YEAR}" == "2017" ]]; then
powershell .circleci/scripts/vs_install.ps1
fi
- run:
name: Install Cuda
no_output_timeout: 30m
command: |
if [[ "${CUDA_VERSION}" != "cpu" && "${JOB_EXECUTOR}" != "windows-with-nvidia-gpu" ]]; then
.circleci/scripts/windows_cuda_install.sh
if [[ "${CUDA_VERSION}" != "cpu" ]]; then
if [[ "${CUDA_VERSION}" != "10" || "${JOB_EXECUTOR}" != "windows-with-nvidia-gpu" ]]; then
.circleci/scripts/windows_cuda_install.sh
fi
if [[ "${CUDA_VERSION}" != "10" && "${JOB_EXECUTOR}" == "windows-with-nvidia-gpu" ]]; then
.circleci/scripts/driver_update.bat
fi
fi
- run:
name: Install Cudnn
command : |
if [[ "${CUDA_VERSION}" != "cpu" && "${JOB_EXECUTOR}" != "windows-with-nvidia-gpu" ]]; then
cd c:/
curl --retry 3 -O https://ossci-windows.s3.amazonaws.com/cudnn-10.1-windows10-x64-v7.6.4.38.zip
7z x cudnn-10.1-windows10-x64-v7.6.4.38.zip -ocudnn
cp -r cudnn/cuda/* "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.1/"
if [[ "${CUDA_VERSION}" != "cpu" ]]; then
.circleci/scripts/windows_cudnn_install.sh
fi
- run:
name: Test


@ -11,7 +11,7 @@
- ecr_gc_job:
name: ecr_gc_job_for_pytorch
project: pytorch
tags_to_keep: "271,262,256,278,282,291,300,323,327,347,389,401,402,403,405,a8006f9a-272d-4478-b137-d121c6f05c83,6e7b11da-a919-49e5-b2ba-da66e3d4bb0a,f990c76a-a798-42bb-852f-5be5006f8026,e43973a9-9d5a-4138-9181-a08a0fc55e2f,8fcf46ef-4a34-480b-a8ee-b0a30a4d3e59,9a3986fa-7ce7-4a36-a001-3c9bef9892e2,1bc00f11-e0f3-4e5c-859f-15937dd938cd,209062ef-ab58-422a-b295-36c4eed6e906,be76e8fd-44e2-484d-b090-07e0cc3a56f0"
tags_to_keep: "271,262,256,278,282,291,300,323,327,347,389,401,402,403,405,a8006f9a-272d-4478-b137-d121c6f05c83,6e7b11da-a919-49e5-b2ba-da66e3d4bb0a,f990c76a-a798-42bb-852f-5be5006f8026,e43973a9-9d5a-4138-9181-a08a0fc55e2f,8fcf46ef-4a34-480b-a8ee-b0a30a4d3e59,9a3986fa-7ce7-4a36-a001-3c9bef9892e2,1bc00f11-e0f3-4e5c-859f-15937dd938cd,209062ef-ab58-422a-b295-36c4eed6e906,be76e8fd-44e2-484d-b090-07e0cc3a56f0,fff7795428560442086f7b2bb6004b65245dc11a,ab1632df-fa59-40e6-8c23-98e004f61148"
requires:
- docker_for_ecr_gc_build_job
- ecr_gc_job:
@ -32,4 +32,3 @@
tags_to_keep: "34"
requires:
- docker_for_ecr_gc_build_job
- docker_hub_index_job


@ -0,0 +1 @@
Fixes #{issue number}


@ -35,7 +35,7 @@ jobs:
HEAD_SHA=${{ github.event.pull_request.head.sha }}
MERGE_BASE=$(git merge-base $BASE_SHA $HEAD_SHA)
# only run clang-format on whitelisted files
# only run clang-format on allowlisted files
echo "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
echo "| clang-format failures found! Run: "
echo "| tools/clang_format_ci.sh ${MERGE_BASE} "

.github/workflows/jit_triage.yml (new file, 78 lines)

@ -0,0 +1,78 @@
name: jit-triage
on:
issues:
types: [labeled]
jobs:
welcome:
runs-on: ubuntu-latest
steps:
- uses: actions/github-script@v2
with:
github-token: ${{secrets.GITHUB_TOKEN}}
script: |
// Arguments available:
// - github: A pre-authenticated octokit/rest.js client
// - context: An object containing the context of the workflow run
// - core: A reference to the @actions/core package
// - io: A reference to the @actions/io package
// Check if issue has a JIT label.
const kJitLabel = "jit";
issue = await github.issues.get({
owner: context.issue.owner,
repo: context.issue.repo,
issue_number: context.issue.number,
})
const hasJitLabel = issue.data.labels.filter(label => label.name == kJitLabel).length > 0;
if (!hasJitLabel) {
core.debug("Issue " + issue.data.title + " does not have JIT label");
return;
}
// Get project column ID.
const kProjectName = "JIT Triage";
const kColumnName = "Need triage";
// Query all projects in the repository.
// TODO: Support pagination once there are > 30 projects.
const projects = await github.projects.listForRepo({
owner: context.issue.owner,
repo: context.issue.repo,
});
// Filter out unwanted projects and get the ID for the JIT Triage project.
const filteredProjects = projects.data.filter(project => project.name == kProjectName);
if (filteredProjects.length != 1) {
core.setFailed("Unable to find a project named " + kProjectName);
return;
}
const projectId = filteredProjects[0].id;
// First, query all columns in the project.
// TODO: Support pagination once there are > 30 columns.
const columns = await github.projects.listColumns({
project_id: projectId,
});
// Filter out unwanted columns and get the ID for the Need triage column.
const filteredColumns = columns.data.filter(column => column.name == kColumnName);
if (filteredColumns.length != 1) {
core.setFailed("Unable to find a column named " + kColumnName);
return;
}
const columnId = filteredColumns[0].id;
// Create a project card for this new issue.
await github.projects.createCard({
column_id: columnId,
content_id: issue.data.id,
content_type: "Issue",
})


@ -21,10 +21,6 @@ jobs:
run: |
pip install -r requirements.txt
cd .circleci && ./ensure-consistency.py
- name: Ensure Docker version is correctly deployed
run: |
pip install pyyaml
.circleci/validate-docker-version.py
- name: Shellcheck Jenkins scripts
run: |
sudo apt-get install -y shellcheck
@ -135,13 +131,9 @@ jobs:
time python setup.py --cmake-only build
# Generate ATen files.
time python aten/src/ATen/gen.py \
time python -m tools.codegen.gen \
-s aten/src/ATen \
-d build/aten/src/ATen \
aten/src/ATen/Declarations.cwrap \
aten/src/THCUNN/generic/THCUNN.h \
aten/src/ATen/nn.yaml \
aten/src/ATen/native/native_functions.yaml
-d build/aten/src/ATen
# Generate PyTorch files.
time python tools/setup_helpers/generate_code.py \
@ -152,16 +144,22 @@ jobs:
# Run Clang-Tidy
# The negative filters below are to exclude files that include onnx_pb.h or
# caffe2_pb.h, otherwise we'd have to build protos as part of this CI job.
# FunctionsManual.cpp is excluded to keep this diff clean. It will be fixed
# in a follow-up PR.
python tools/clang_tidy.py \
--verbose \
--paths torch/csrc/ \
--diff "$MERGE_BASE" \
-g"-torch/csrc/jit/passes/onnx/helper.cpp" \
-g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp"\
-g"-torch/csrc/jit/serialization/onnx.cpp" \
-g"-torch/csrc/jit/serialization/export.cpp" \
-g"-torch/csrc/jit/serialization/import.cpp" \
-g"-torch/csrc/jit/serialization/import_legacy.cpp" \
-g"-torch/csrc/onnx/init.cpp" \
-g"-torch/csrc/cuda/nccl.*" \
-g"-torch/csrc/cuda/python_nccl.cpp" \
-g"-torch/csrc/autograd/FunctionsManual.cpp" \
"$@" > ${GITHUB_WORKSPACE}/clang-tidy-output.txt
cat ${GITHUB_WORKSPACE}/clang-tidy-output.txt

.gitignore (15 lines changed)

@ -9,6 +9,7 @@
## PyTorch
.coverage
coverage.xml
.gradle
.hypothesis
.mypy_cache
@ -25,8 +26,10 @@
aten/build/
aten/src/ATen/Config.h
aten/src/ATen/cuda/CUDAConfig.h
benchmarks/.data
caffe2/cpp_test/
dist/
docs/cpp/src
docs/src/**/*
docs/cpp/build
docs/cpp/source/api
@ -38,7 +41,7 @@ test/cpp/api/mnist
test/custom_operator/model.pt
test/data/legacy_modules.t7
test/data/*.pt
test/backward_compatibility/new_schemas.txt
test/backward_compatibility/nightly_schemas.txt
dropout_model.pt
test/generated_type_hints_smoketest.py
test/htmlcov
@ -47,13 +50,15 @@ test/test-reports/
third_party/build/
tools/shared/_utils_internal.py
torch.egg-info/
torch/__init__.pyi
torch/_C/__init__.pyi
torch/_C/_nn.pyi
torch/_C/_VariableFunctions.pyi
torch/_VF.pyi
torch/nn/functional.pyi
torch/nn/modules/*.pyi
torch/csrc/autograd/generated/*
# Listed manually because some files in this directory are not generated
torch/testing/_internal/generated/annotated_fn_args.py
torch/testing/_internal/data/*.pt
torch/csrc/cudnn/cuDNN.cpp
torch/csrc/generated
torch/csrc/generic/TensorMethods.cpp
@ -105,9 +110,6 @@ env
# macOS dir files
.DS_Store
# Symbolic files
tools/shared/cwrap_common.py
# Ninja files
.ninja_deps
.ninja_log
@ -259,6 +261,7 @@ TAGS
# clangd background index
.clangd/
.cache/
# bazel symlinks
bazel-*


@ -248,6 +248,8 @@ else
export MAX_JOBS=`expr $(nproc) - 1`
fi
pip install --user dataclasses
$PYTHON setup.py install --user
report_compile_cache_stats
@ -260,9 +262,4 @@ fi
# Install ONNX into a local directory
pip install --user -b /tmp/pip_install_onnx "file://${ROOT_DIR}/third_party/onnx#egg=onnx"
if [[ $BUILD_ENVIRONMENT == *rocm* ]]; then
# runtime compilation of MIOpen kernels manages to crash sccache - hence undo the wrapping
bash tools/amd_build/unwrap_clang.sh
fi
report_compile_cache_stats


@ -12,6 +12,18 @@ if [[ "${BUILD_ENVIRONMENT}" =~ py((2|3)\.?[0-9]?\.?[0-9]?) ]]; then
PYTHON=$(which "python${BASH_REMATCH[1]}")
fi
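# After a successful "[[ str =~ regex ]]", bash exposes the capture
# groups in BASH_REMATCH, so e.g. BUILD_ENVIRONMENT="...-py3.6-..."
# gives BASH_REMATCH[1]="3.6" and hence PYTHON=$(which python3.6).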
if [[ "${BUILD_ENVIRONMENT}" == *rocm* ]]; then
if which sccache > /dev/null; then
# Save sccache logs to file
sccache --stop-server || true
rm ~/sccache_error.log || true
SCCACHE_ERROR_LOG=~/sccache_error.log SCCACHE_IDLE_TIMEOUT=0 sccache --start-server
# Report sccache stats for easier debugging
sccache --zero-stats
fi
fi
# /usr/local/caffe2 is where the cpp bits are installed to in cmake-only
# builds. In +python builds the cpp tests are copied to /usr/local/caffe2 so
# that the test code in .jenkins/test.sh is the same


@ -12,30 +12,13 @@ if [[ "${BUILD_ENVIRONMENT}" == *-rocm* ]]; then
export HSAKMT_DEBUG_LEVEL=4
fi
# These additional packages are needed for circleci ROCm builds.
if [[ $BUILD_ENVIRONMENT == pytorch-linux-xenial-rocm* ]]; then
if [[ $BUILD_ENVIRONMENT == *rocm* ]]; then
# Need networkx 2.0 because bellman_ford was moved in 2.1. Scikit-image by
# default installs the most recent networkx version, so we install this lower
# version explicitly before scikit-image pulls it in as a dependency
pip install networkx==2.0
# click - onnx
pip install --progress-bar off click protobuf tabulate virtualenv mock typing-extensions
# TODO: Remove this once ROCm CI images are >= ROCm 3.5
# ROCm 3.5 required a backwards-incompatible change; the kernel and thunk must match.
# Detect kernel version and upgrade thunk if this is a ROCm 3.3 container running on a 3.5 kernel.
ROCM_ASD_FW_VERSION=$(/opt/rocm/bin/rocm-smi --showfwinfo -d 1 | grep ASD | awk '{print $6}')
if [[ $ROCM_ASD_FW_VERSION = 553648174 && "$BUILD_ENVIRONMENT" == *rocm3.3* ]]; then
# upgrade thunk to 3.5
mkdir rocm3.5-thunk
pushd rocm3.5-thunk
wget http://repo.radeon.com/rocm/apt/3.5/pool/main/h/hsakmt-roct3.5.0/hsakmt-roct3.5.0_1.0.9-347-gd4b224f_amd64.deb
wget http://repo.radeon.com/rocm/apt/3.5/pool/main/h/hsakmt-roct-dev3.5.0/hsakmt-roct-dev3.5.0_1.0.9-347-gd4b224f_amd64.deb
dpkg-deb -vx hsakmt-roct3.5.0_1.0.9-347-gd4b224f_amd64.deb .
dpkg-deb -vx hsakmt-roct-dev3.5.0_1.0.9-347-gd4b224f_amd64.deb .
sudo cp -r opt/rocm-3.5.0/* /opt/rocm-3.3.0/
popd
rm -rf rocm3.5-thunk
fi
fi
# Find where cpp tests and Caffe2 itself are installed
@ -138,6 +121,10 @@ if [[ $BUILD_ENVIRONMENT == *-rocm* ]]; then
# This test has been flaky in ROCm CI (but note the tests are
# cpu-only so should be unrelated to ROCm)
rocm_ignore_test+=("--ignore $caffe2_pypath/python/operator_test/blobs_queue_db_test.py")
# This test is skipped on Jenkins (compiled without MKL) and is otherwise known to be flaky
rocm_ignore_test+=("--ignore $caffe2_pypath/python/ideep/convfusion_op_test.py")
# This test is skipped on Jenkins (compiled without MKL) and causes segfaults on Circle
rocm_ignore_test+=("--ignore $caffe2_pypath/python/ideep/pool_op_test.py")
fi
# NB: Warnings are disabled because they make it harder to see what
@ -183,8 +170,8 @@ if [[ "$BUILD_ENVIRONMENT" == *onnx* ]]; then
if [[ "$BUILD_ENVIRONMENT" == *py3* ]]; then
# default pip version is too old (9.0.2), unable to support tag `manylinux2010`.
# Fix the pip error: Couldn't find a version that satisfies the requirement
sudo pip install --upgrade pip
pip install -q --user -i https://test.pypi.org/simple/ ort-nightly==1.3.0.dev202005123
pip install --upgrade pip
pip install -q --user -i https://test.pypi.org/simple/ ort-nightly==1.5.0.dev202009182
fi
"$ROOT_DIR/scripts/onnx/test.sh"
fi


@ -14,7 +14,7 @@ clang --version
# detect_leaks=0: Python is very leaky, so we need suppress it
# symbolize=1: Gives us much better errors when things go wrong
export ASAN_OPTIONS=detect_leaks=0:symbolize=1
export ASAN_OPTIONS=detect_leaks=0:symbolize=1:detect_odr_violation=0
# FIXME: Remove the hardcoded "-pthread" option.
# With asan build, the cmake thread CMAKE_HAVE_LIBC_CREATE[1] checking will


@ -15,11 +15,4 @@ clang --version
export LLVM_DIR="$(llvm-config-5.0 --prefix)"
echo "LLVM_DIR: ${LLVM_DIR}"
# Run the following 2 steps together because they share the same (reusable) time
# consuming process to build LibTorch into LLVM assembly.
# 1. Run code analysis test first to fail fast
time ANALYZE_TEST=1 CHECK_RESULT=1 tools/code_analyzer/build.sh
# 2. Run code analysis on mobile LibTorch
time ANALYZE_TORCH=1 tools/code_analyzer/build.sh


@ -11,13 +11,18 @@ COMPACT_JOB_NAME="${BUILD_ENVIRONMENT}"
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
# Install torch & torchvision - used to download & trace test model.
retry pip install torch torchvision --progress-bar off
# Ideally we should use the libtorch built on the PR so that backward
# incompatible changes won't break this script - but it will significantly slow
# down mobile CI jobs.
# Here we install nightly instead of stable so that we have an option to
# temporarily skip mobile CI jobs on BC-breaking PRs until they are in nightly.
retry pip install --pre torch torchvision \
-f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html \
--progress-bar off
# Run end-to-end process of building mobile library, linking into the predictor
# binary, and running forward pass with a real model.
if [[ "$BUILD_ENVIRONMENT" == *-mobile-custom-build-static* ]]; then
TEST_CUSTOM_BUILD_STATIC=1 test/mobile/custom_build/build.sh
elif [[ "$BUILD_ENVIRONMENT" == *-mobile-custom-build-dynamic* ]]; then
if [[ "$BUILD_ENVIRONMENT" == *-mobile-custom-build-dynamic* ]]; then
export LLVM_DIR="$(llvm-config-5.0 --prefix)"
echo "LLVM_DIR: ${LLVM_DIR}"
TEST_CUSTOM_BUILD_DYNAMIC=1 test/mobile/custom_build/build.sh


@ -7,6 +7,13 @@
# shellcheck disable=SC2034
COMPACT_JOB_NAME="${BUILD_ENVIRONMENT}"
# Temp: use new sccache
if [[ -n "$IN_CIRCLECI" && "$BUILD_ENVIRONMENT" == *rocm* ]]; then
# Download customized sccache
sudo curl --retry 3 http://repo.radeon.com/misc/.sccache_amd/sccache -o /opt/cache/bin/sccache
sudo chmod 755 /opt/cache/bin/sccache
fi
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
# For distributed, four environmental configs:
@ -28,7 +35,6 @@ if [[ "$BUILD_ENVIRONMENT" == *-xenial-cuda9*gcc7* ]] || [[ "$BUILD_ENVIRONMENT"
else
sudo apt-get -qq install --allow-downgrades --allow-change-held-packages openmpi-bin libopenmpi-dev
fi
sudo apt-get -qq install --no-install-recommends openssh-client openssh-server
sudo mkdir -p /var/run/sshd
fi
@ -115,6 +121,11 @@ if [[ "${BUILD_ENVIRONMENT}" == *-android* ]]; then
exec ./scripts/build_android.sh "${build_args[@]}" "$@"
fi
if [[ "$BUILD_ENVIRONMENT" != *android* && "$BUILD_ENVIRONMENT" == *vulkan-linux* ]]; then
export USE_VULKAN=1
export VULKAN_SDK=/var/lib/jenkins/vulkansdk/
fi
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
# hcc used to run out of memory, silently exiting without stopping
# the build process, leaving undefined symbols in the shared lib,
@ -126,7 +137,7 @@ if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
# ROCm CI is using Caffe2 docker images, which needs these wrapper
# scripts to correctly use sccache.
if [ -n "${SCCACHE_BUCKET}" ]; then
if [[ -n "${SCCACHE_BUCKET}" && -z "$IN_CIRCLECI" ]]; then
mkdir -p ./sccache
SCCACHE="$(which sccache)"
@ -150,12 +161,15 @@ if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
export PATH="$CACHE_WRAPPER_DIR:$PATH"
fi
if [[ -n "$IN_CIRCLECI" ]]; then
# Set ROCM_ARCH to gfx900 and gfx906 in CircleCI
echo "Limiting PYTORCH_ROCM_ARCH to gfx90[06] for CircleCI builds"
export PYTORCH_ROCM_ARCH="gfx900;gfx906"
fi
python tools/amd_build/build_amd.py
python setup.py install --user
# runtime compilation of MIOpen kernels manages to crash sccache - hence undo the wrapping
bash tools/amd_build/unwrap_clang.sh
exit 0
fi
@ -204,9 +218,11 @@ else
# set only when building other architectures
# only use for "python setup.py install" line
if [[ "$BUILD_ENVIRONMENT" != *ppc64le* && "$BUILD_ENVIRONMENT" != *clang* ]]; then
WERROR=1 python setup.py install
WERROR=1 python setup.py bdist_wheel
python -mpip install dist/*.whl
else
python setup.py install
python setup.py bdist_wheel
python -mpip install dist/*.whl
fi
# TODO: I'm not sure why, but somehow we lose verbose commands
@ -218,6 +234,11 @@ else
fi
assert_git_not_dirty
# Copy ninja build logs to dist folder
mkdir -p dist
if [ -f build/.ninja_log ]; then
cp build/.ninja_log dist
fi
# Build custom operator tests.
CUSTOM_OP_BUILD="$PWD/../custom-op-build"
@ -230,6 +251,17 @@ else
make VERBOSE=1
popd
assert_git_not_dirty
# Build custom backend tests.
CUSTOM_BACKEND_BUILD="$PWD/../custom-backend-build"
CUSTOM_BACKEND_TEST="$PWD/test/custom_backend"
python --version
mkdir "$CUSTOM_BACKEND_BUILD"
pushd "$CUSTOM_BACKEND_BUILD"
cmake "$CUSTOM_BACKEND_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPYTHON_EXECUTABLE="$(which python)"
make VERBOSE=1
popd
assert_git_not_dirty
else
# Test standalone c10 build
if [[ "$BUILD_ENVIRONMENT" == *xenial-cuda10.1-cudnn7-py3* ]]; then


@ -1,20 +1,7 @@
#!/bin/bash
# Common setup for all Jenkins scripts
# NB: define this function before set -x, so that we don't
# pollute the log with a premature EXITED_USER_LAND ;)
function cleanup {
# Note that if you've exited user land, then CI will conclude that
# any failure is the CI's fault. So we MUST only output this
# string
retcode=$?
set +x
if [ $retcode -eq 0 ]; then
echo "EXITED_USER_LAND"
fi
}
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
set -ex
# Save the SCRIPT_DIR absolute path in case later we chdir (as occurs in the gpu perf test)
@ -77,28 +64,18 @@ declare -f -t trap_add
trap_add cleanup EXIT
function assert_git_not_dirty() {
# TODO: we should add an option to `build_amd.py` that reverts the repo to
# an unmodified state.
if ([[ "$BUILD_ENVIRONMENT" != *rocm* ]] && [[ "$BUILD_ENVIRONMENT" != *xla* ]]) ; then
git_status=$(git status --porcelain)
if [[ $git_status ]]; then
echo "Build left local git repository checkout dirty"
echo "git status --porcelain:"
echo "${git_status}"
exit 1
fi
fi
}
if [[ "$BUILD_ENVIRONMENT" != *pytorch-win-* ]]; then
if which sccache > /dev/null; then
# Save sccache logs to file
sccache --stop-server || true
rm ~/sccache_error.log || true
# increasing SCCACHE_IDLE_TIMEOUT so that extension_backend_test.cpp can build after this PR:
# https://github.com/pytorch/pytorch/pull/16645
SCCACHE_ERROR_LOG=~/sccache_error.log SCCACHE_IDLE_TIMEOUT=1200 RUST_LOG=sccache::server=error sccache --start-server
if [[ "${BUILD_ENVIRONMENT}" == *rocm* ]]; then
SCCACHE_ERROR_LOG=~/sccache_error.log SCCACHE_IDLE_TIMEOUT=0 sccache --start-server
else
# increasing SCCACHE_IDLE_TIMEOUT so that extension_backend_test.cpp can build after this PR:
# https://github.com/pytorch/pytorch/pull/16645
SCCACHE_ERROR_LOG=~/sccache_error.log SCCACHE_IDLE_TIMEOUT=1200 RUST_LOG=sccache::server=error sccache --start-server
fi
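# SCCACHE_IDLE_TIMEOUT=0 disables sccache's idle shutdown entirely
# (any positive value is a timeout in seconds), presumably so the
# server survives the long gaps between compiles in ROCm builds.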
# Report sccache stats for easier debugging
sccache --zero-stats
@ -158,43 +135,7 @@ if [[ "$BUILD_ENVIRONMENT" == *pytorch-xla-linux-bionic* ]] || \
fi
fi
function pip_install() {
# retry 3 times
# old versions of pip don't have the "--progress-bar" flag
pip install --progress-bar off "$@" || pip install --progress-bar off "$@" || pip install --progress-bar off "$@" ||\
pip install "$@" || pip install "$@" || pip install "$@"
}
function pip_uninstall() {
# uninstall 2 times
pip uninstall -y "$@" || pip uninstall -y "$@"
}
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*)
}
function get_exit_code() {
set +e
"$@"
retcode=$?
set -e
return $retcode
}
function file_diff_from_base() {
# The fetch may fail on Docker hosts, but it's not always necessary.
set +e
git fetch origin master --quiet
set -e
git diff --name-only "$(git merge-base origin/master HEAD)" > "$1"
}
function get_bazel() {
# download bazel version
wget https://github.com/bazelbuild/bazel/releases/download/3.1.0/bazel-3.1.0-linux-x86_64 -O tools/bazel
# verify content
echo '753434f4fa730266cf5ce21d1fdd425e1e167dd9347ad3e8adc19e8c0d54edca tools/bazel' | sha256sum --quiet -c
chmod +x tools/bazel
}


@ -0,0 +1,83 @@
#!/bin/bash
# Common util **functions** that can be sourced in other scripts.
# NB: define this function before set -x, so that we don't
# pollute the log with a premature EXITED_USER_LAND ;)
function cleanup {
# Note that if you've exited user land, then CI will conclude that
# any failure is the CI's fault. So we MUST only output this
# string
retcode=$?
set +x
if [ $retcode -eq 0 ]; then
echo "EXITED_USER_LAND"
fi
}
function assert_git_not_dirty() {
# TODO: we should add an option to `build_amd.py` that reverts the repo to
# an unmodified state.
if ([[ "$BUILD_ENVIRONMENT" != *rocm* ]] && [[ "$BUILD_ENVIRONMENT" != *xla* ]]) ; then
git_status=$(git status --porcelain)
if [[ $git_status ]]; then
echo "Build left local git repository checkout dirty"
echo "git status --porcelain:"
echo "${git_status}"
exit 1
fi
fi
}
function pip_install() {
# retry 3 times
# old versions of pip don't have the "--progress-bar" flag
pip install --progress-bar off "$@" || pip install --progress-bar off "$@" || pip install --progress-bar off "$@" ||\
pip install "$@" || pip install "$@" || pip install "$@"
}
function pip_uninstall() {
# uninstall 2 times
pip uninstall -y "$@" || pip uninstall -y "$@"
}
function get_exit_code() {
set +e
"$@"
retcode=$?
set -e
return $retcode
}
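# Usage sketch (editor's illustration): wrap a possibly-failing command so
# its exit status can be inspected without tripping the caller's `set -e`:
if get_exit_code python -c 'import torch'; then
echo "torch imports cleanly"
fi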
function file_diff_from_base() {
# The fetch may fail on Docker hosts, but it's not always necessary.
set +e
git fetch origin release/1.7 --quiet
set -e
git diff --name-only "$(git merge-base origin/release/1.7 HEAD)" > "$1"
}
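# Usage sketch (editor's illustration, mirroring how the test script uses
# this helper): write the changed-file list to a temp file and pass it to
# the test runner.
DETERMINE_FROM=$(mktemp)
file_diff_from_base "$DETERMINE_FROM"
python test/run_test.py --verbose --determine-from="$DETERMINE_FROM"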
function get_bazel() {
# download the pinned bazel version
wget https://github.com/bazelbuild/bazel/releases/download/3.1.0/bazel-3.1.0-linux-x86_64 -O tools/bazel
# verify content
echo '753434f4fa730266cf5ce21d1fdd425e1e167dd9347ad3e8adc19e8c0d54edca tools/bazel' | sha256sum --quiet -c
chmod +x tools/bazel
}
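# Usage sketch (editor's illustration): fetch the pinned binary and confirm
# it runs; the `sha256sum -c` check above makes the download fail loudly if
# the file does not match the expected hash.
get_bazel
tools/bazel version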
TORCHVISION_COMMIT=c2e8a00885e68ae1200eb6440f540e181d9125de
function install_torchvision() {
# Check out torch/vision at Jun 11 2020 commit
# This hash must match one in .jenkins/caffe2/test.sh
pip_install --user "git+https://github.com/pytorch/vision.git@$TORCHVISION_COMMIT"
}
function checkout_install_torchvision() {
git clone https://github.com/pytorch/vision
pushd vision
git checkout "$TORCHVISION_COMMIT"
time python setup.py install
popd
}
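# Editor's note (illustrative, not part of this diff): install_torchvision
# pip-installs the pinned commit straight from GitHub, while
# checkout_install_torchvision builds the same commit from a local clone,
# which also leaves the vision source tree on disk. A minimal call:
install_torchvision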


@@ -5,6 +5,5 @@ pr="$2"
git diff --name-only "$upstream" "$pr"
# Now that the PyTorch build depends on Caffe2, trigger unconditionally
# for any change.
# TODO: Replace this with a NEGATIVE regex that allows us to blacklist
# files (letting us skip builds when they are unnecessary)
# TODO: Replace this with a NEGATIVE regex that allows us to skip builds when they are unnecessary
#git diff --name-only "$upstream" "$pr" | grep -Eq '^(aten/|caffe2/|.jenkins/pytorch|docs/(make.bat|Makefile|requirements.txt|source)|mypy|requirements.txt|setup.py|test/|third_party/|tools/|\.gitmodules|torch/)'


@@ -15,12 +15,12 @@ mkdir -p ${WORKSPACE_DIR}
# If a local installation of conda doesn't exist, we download and install conda
if [ ! -d "${WORKSPACE_DIR}/miniconda3" ]; then
mkdir -p ${WORKSPACE_DIR}
curl --retry 3 https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -o ${WORKSPACE_DIR}/miniconda3.sh
curl --retry 3 https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.3-MacOSX-x86_64.sh -o ${WORKSPACE_DIR}/miniconda3.sh
retry bash ${WORKSPACE_DIR}/miniconda3.sh -b -p ${WORKSPACE_DIR}/miniconda3
fi
export PATH="${WORKSPACE_DIR}/miniconda3/bin:$PATH"
source ${WORKSPACE_DIR}/miniconda3/bin/activate
retry conda install -y mkl mkl-include numpy pyyaml=5.3 setuptools=46.0.0 cmake cffi ninja
retry conda install -y mkl mkl-include numpy=1.18.5 pyyaml=5.3 setuptools=46.0.0 cmake cffi ninja typing_extensions dataclasses
# The torch.hub tests make requests to GitHub.
#


@@ -63,7 +63,7 @@ test_python_all() {
# Increase default limit on open file handles from 256 to 1024
ulimit -n 1024
python test/run_test.py --verbose --exclude test_jit_profiling test_jit_legacy test_jit_fuser_legacy test_jit_fuser_te test_tensorexpr --determine-from="$DETERMINE_FROM"
python test/run_test.py --verbose --exclude test_jit_cuda_fuser_profiling test_jit_cuda_fuser_legacy test_jit_legacy test_jit_fuser_legacy --determine-from="$DETERMINE_FROM"
assert_git_not_dirty
}
@@ -99,6 +99,26 @@ test_libtorch() {
fi
}
test_custom_backend() {
echo "Testing custom backends"
pushd test/custom_backend
rm -rf build && mkdir build
pushd build
SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"
CMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" cmake ..
make VERBOSE=1
popd
# Run Python tests and export a lowered module.
python test_custom_backend.py -v
python backend.py --export-module-to=model.pt
# Run C++ tests using the exported module.
build/test_custom_backend ./model.pt
rm -f ./model.pt
popd
assert_git_not_dirty
}
test_custom_script_ops() {
echo "Testing custom script operators"
pushd test/custom_operator
@@ -124,11 +144,13 @@ if [ -z "${BUILD_ENVIRONMENT}" ] || [[ "${BUILD_ENVIRONMENT}" == *-test ]]; then
test_python_all
test_libtorch
test_custom_script_ops
test_custom_backend
else
if [[ "${BUILD_ENVIRONMENT}" == *-test1 ]]; then
test_python_all
elif [[ "${BUILD_ENVIRONMENT}" == *-test2 ]]; then
test_libtorch
test_custom_script_ops
test_custom_backend
fi
fi


@@ -24,14 +24,12 @@ if [ -n "${IN_CIRCLECI}" ]; then
# TODO: move this to Docker
sudo apt-get update
sudo apt-get install -y --allow-downgrades --allow-change-held-packages openmpi-bin libopenmpi-dev
sudo apt-get install -y --no-install-recommends openssh-client openssh-server
sudo mkdir -p /var/run/sshd
fi
fi
python tools/download_mnist.py --quiet -d test/cpp/api/mnist
OMP_NUM_THREADS=2 TORCH_CPP_TEST_MNIST_PATH="test/cpp/api/mnist" build/bin/test_api
time python test/run_test.py --verbose -i distributed/test_distributed
time python test/run_test.py --verbose -i distributed/test_distributed_fork
time python test/run_test.py --verbose -i distributed/test_c10d
time python test/run_test.py --verbose -i distributed/test_c10d_spawn
assert_git_not_dirty


@@ -6,6 +6,7 @@ with open(log_file_path) as f:
lines = f.readlines()
for line in lines:
# Ignore errors from CPU instruction set testing
if 'src.c' not in line:
# Ignore errors from CPU instruction set or symbol-existence testing
keywords = ['src.c', 'CheckSymbolExists.c']
if all(keyword not in line for keyword in keywords):
print(line)
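# An equivalent shell filter (editor's sketch; assumes the sccache log path
# set up in the build scripts):
grep -vE 'src\.c|CheckSymbolExists\.c' ~/sccache_error.log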


@@ -13,7 +13,7 @@ echo "Testing pytorch"
if [ -n "${IN_CIRCLECI}" ]; then
# TODO move this to docker
pip_install unittest-xml-reporting
pip_install unittest-xml-reporting coverage
if [[ "$BUILD_ENVIRONMENT" == *-xenial-cuda10.1-* ]]; then
# TODO: move this to Docker
@@ -25,47 +25,29 @@ if [ -n "${IN_CIRCLECI}" ]; then
# TODO: move this to Docker
sudo apt-get -qq update
sudo apt-get -qq install --allow-downgrades --allow-change-held-packages openmpi-bin libopenmpi-dev
sudo apt-get -qq install --no-install-recommends openssh-client openssh-server
sudo mkdir -p /var/run/sshd
fi
if [[ "$BUILD_ENVIRONMENT" == *-slow-* ]]; then
export PYTORCH_TEST_WITH_SLOW=1
export PYTORCH_TEST_SKIP_FAST=1
fi
if [[ "$BUILD_ENVIRONMENT" == *coverage* ]]; then
export PYTORCH_COLLECT_COVERAGE=1
fi
fi
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
# Print GPU info
rocminfo | egrep 'Name:.*\sgfx|Marketing'
# TODO: Move this to Docker
sudo apt-get -qq update
sudo apt-get -qq install --no-install-recommends libsndfile1
# TODO: Remove this once ROCm CI images are >= ROCm 3.5
# ROCm 3.5 required a backwards-incompatible change; the kernel and thunk must match.
# Detect kernel version and upgrade thunk if this is a ROCm 3.3 container running on a 3.5 kernel.
ROCM_ASD_FW_VERSION=$(/opt/rocm/bin/rocm-smi --showfwinfo -d 1 | grep ASD | awk '{print $6}')
if [[ $ROCM_ASD_FW_VERSION = 553648174 && "$BUILD_ENVIRONMENT" == *rocm3.3* ]]; then
# upgrade thunk to 3.5
mkdir rocm3.5-thunk
pushd rocm3.5-thunk
wget http://repo.radeon.com/rocm/apt/3.5/pool/main/h/hsakmt-roct3.5.0/hsakmt-roct3.5.0_1.0.9-347-gd4b224f_amd64.deb
wget http://repo.radeon.com/rocm/apt/3.5/pool/main/h/hsakmt-roct-dev3.5.0/hsakmt-roct-dev3.5.0_1.0.9-347-gd4b224f_amd64.deb
dpkg-deb -vx hsakmt-roct3.5.0_1.0.9-347-gd4b224f_amd64.deb .
dpkg-deb -vx hsakmt-roct-dev3.5.0_1.0.9-347-gd4b224f_amd64.deb .
sudo cp -r opt/rocm-3.5.0/* /opt/rocm-3.3.0/
popd
rm -rf rocm3.5-thunk
fi
fi
# --user breaks ppc64le builds and these packages are already in ppc64le docker
if [[ "$BUILD_ENVIRONMENT" != *ppc64le* ]] && [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]] ; then
# JIT C++ extensions require ninja.
pip_install --user ninja
# ninja is installed in /var/lib/jenkins/.local/bin
export PATH="/var/lib/jenkins/.local/bin:$PATH"
# ninja is installed in $HOME/.local/bin, e.g., /var/lib/jenkins/.local/bin for CI user jenkins
# but this script should be runnable by any user, including root
export PATH="$HOME/.local/bin:$PATH"
# TODO: Please move this to Docker
# The version is fixed to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136
@@ -85,7 +67,7 @@ fi
# ASAN test is not working
if [[ "$BUILD_ENVIRONMENT" == *asan* ]]; then
# Suppress vptr violations arising from multiple copies of pybind11
export ASAN_OPTIONS=detect_leaks=0:symbolize=1:strict_init_order=true
export ASAN_OPTIONS=detect_leaks=0:symbolize=1:strict_init_order=true:detect_odr_violation=0
export UBSAN_OPTIONS=print_stacktrace=1:suppressions=$PWD/ubsan.supp
export PYTORCH_TEST_WITH_ASAN=1
export PYTORCH_TEST_WITH_UBSAN=1
@@ -139,7 +121,7 @@ elif [[ "${BUILD_ENVIRONMENT}" == *-NO_AVX2-* ]]; then
export ATEN_CPU_CAPABILITY=avx
fi
if [ -n "$CIRCLE_PULL_REQUEST" ]; then
if ([ -n "$CIRCLE_PULL_REQUEST" ] && [[ "$BUILD_ENVIRONMENT" != *coverage* ]]); then
DETERMINE_FROM=$(mktemp)
file_diff_from_base "$DETERMINE_FROM"
fi
@@ -150,17 +132,17 @@ test_python_nn() {
}
test_python_ge_config_profiling() {
time python test/run_test.py --include test_jit_profiling test_jit_fuser_te --verbose --determine-from="$DETERMINE_FROM"
time python test/run_test.py --include test_jit_cuda_fuser_profiling test_jit_profiling test_jit_fuser_te test_tensorexpr --verbose --determine-from="$DETERMINE_FROM"
assert_git_not_dirty
}
test_python_ge_config_legacy() {
time python test/run_test.py --include test_jit_legacy test_jit_fuser_legacy --verbose --determine-from="$DETERMINE_FROM"
time python test/run_test.py --include test_jit_cuda_fuser_legacy test_jit_legacy test_jit_fuser_legacy --verbose --determine-from="$DETERMINE_FROM"
assert_git_not_dirty
}
test_python_all_except_nn_and_cpp_extensions() {
time python test/run_test.py --exclude test_nn test_jit_profiling test_jit_legacy test_jit_fuser_legacy test_jit_fuser_te test_tensorexpr --verbose --determine-from="$DETERMINE_FROM"
time python test/run_test.py --exclude test_jit_cuda_fuser_profiling test_jit_cuda_fuser_legacy test_nn test_jit_profiling test_jit_legacy test_jit_fuser_legacy test_jit_fuser_te test_tensorexpr --verbose --determine-from="$DETERMINE_FROM"
assert_git_not_dirty
}
@@ -204,12 +186,6 @@ if [[ "${BUILD_ENVIRONMENT}" == *tbb* ]]; then
sudo cp -r $PWD/third_party/tbb/include/tbb/* /usr/include/tbb
fi
test_torchvision() {
# Check out torch/vision at Jun 11 2020 commit
# This hash must match one in .jenkins/caffe2/test.sh
pip_install --user git+https://github.com/pytorch/vision.git@c2e8a00885e68ae1200eb6440f540e181d9125de
}
test_libtorch() {
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
echo "Testing libtorch"
@@ -234,6 +210,53 @@ test_libtorch() {
fi
}
test_vulkan() {
if [[ "$BUILD_ENVIRONMENT" == *vulkan-linux* ]]; then
export VK_ICD_FILENAMES=/var/lib/jenkins/swiftshader/build/Linux/vk_swiftshader_icd.json
mkdir -p test/test-reports/cpp-vulkan
build/bin/vulkan_test --gtest_output=xml:test/test-reports/cpp-vulkan/vulkan_test.xml
fi
}
test_distributed() {
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
echo "Testing distributed C++ tests"
mkdir -p test/test-reports/cpp-distributed
build/bin/FileStoreTest --gtest_output=xml:test/test-reports/cpp-distributed/FileStoreTest.xml
build/bin/HashStoreTest --gtest_output=xml:test/test-reports/cpp-distributed/HashStoreTest.xml
build/bin/TCPStoreTest --gtest_output=xml:test/test-reports/cpp-distributed/TCPStoreTest.xml
build/bin/ProcessGroupGlooTest --gtest_output=xml:test/test-reports/cpp-distributed/ProcessGroupGlooTest.xml
build/bin/ProcessGroupNCCLTest --gtest_output=xml:test/test-reports/cpp-distributed/ProcessGroupNCCLTest.xml
build/bin/ProcessGroupNCCLErrorsTest --gtest_output=xml:test/test-reports/cpp-distributed/ProcessGroupNCCLErrorsTest.xml
fi
}
test_rpc() {
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
echo "Testing RPC C++ tests"
mkdir -p test/test-reports/cpp-rpc
build/bin/test_cpp_rpc --gtest_output=xml:test/test-reports/cpp-rpc/test_cpp_rpc.xml
fi
}
test_custom_backend() {
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]] && [[ "$BUILD_ENVIRONMENT" != *asan* ]] ; then
echo "Testing custom backends"
CUSTOM_BACKEND_BUILD="$PWD/../custom-backend-build"
pushd test/custom_backend
cp -a "$CUSTOM_BACKEND_BUILD" build
# Run the Python-side tests and export a lowered module.
python test_custom_backend.py -v
python backend.py --export-module-to=model.pt
# Run the C++-side tests and load the exported lowered module.
build/test_custom_backend ./model.pt
rm -f ./model.pt
popd
assert_git_not_dirty
fi
}
test_custom_script_ops() {
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]] && [[ "$BUILD_ENVIRONMENT" != *asan* ]] ; then
echo "Testing custom script operators"
@@ -287,10 +310,15 @@ test_xla() {
test_backward_compatibility() {
set -x
pushd test/backward_compatibility
python dump_all_function_schemas.py --filename new_schemas.txt
pip_uninstall torch
python -m venv venv
. venv/bin/activate
pip_install --pre torch -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
python check_backward_compatibility.py --new-schemas new_schemas.txt
pip show torch
python dump_all_function_schemas.py --filename nightly_schemas.txt
deactivate
rm -r venv
pip show torch
python check_backward_compatibility.py --existing-schemas nightly_schemas.txt
popd
set +x
assert_git_not_dirty
@@ -304,12 +332,44 @@ test_bazel() {
tools/bazel test --test_timeout=480 --test_output=all --test_tag_filters=-gpu-required --test_filter=-*CUDA :all_tests
}
test_benchmarks() {
if [[ "$BUILD_ENVIRONMENT" == *cuda* && "$BUILD_ENVIRONMENT" != *nogpu* ]]; then
pip_install --user "pytest-benchmark==3.2.3"
pip_install --user "requests"
BENCHMARK_DATA="benchmarks/.data"
mkdir -p ${BENCHMARK_DATA}
pytest benchmarks/fastrnns/test_bench.py --benchmark-sort=Name --benchmark-json=${BENCHMARK_DATA}/fastrnns_default.json --fuser=default --executor=default
python benchmarks/upload_scribe.py --pytest_bench_json ${BENCHMARK_DATA}/fastrnns_default.json
pytest benchmarks/fastrnns/test_bench.py --benchmark-sort=Name --benchmark-json=${BENCHMARK_DATA}/fastrnns_legacy_old.json --fuser=old --executor=legacy
python benchmarks/upload_scribe.py --pytest_bench_json ${BENCHMARK_DATA}/fastrnns_legacy_old.json
pytest benchmarks/fastrnns/test_bench.py --benchmark-sort=Name --benchmark-json=${BENCHMARK_DATA}/fastrnns_profiling_te.json --fuser=te --executor=profiling
python benchmarks/upload_scribe.py --pytest_bench_json ${BENCHMARK_DATA}/fastrnns_profiling_te.json
assert_git_not_dirty
fi
}
test_cpp_extensions() {
# This tests whether the cpp extension build is compatible with the current env; there is no need to exercise both the ninja and no-ninja builds.
time python test/run_test.py --include test_cpp_extensions_aot_ninja --verbose --determine-from="$DETERMINE_FROM"
assert_git_not_dirty
}
test_vec256() {
# This tests the vec256 instruction paths DEFAULT/AVX/AVX2 (platform dependent; some platforms might not support AVX/AVX2)
if [[ "$BUILD_ENVIRONMENT" != *asan* ]] && [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
echo "Testing vec256 instructions"
mkdir -p test/test-reports/vec256
pushd build/bin
vec256_tests=$(find . -maxdepth 1 -executable -name 'vec256_test*')
for vec256_exec in $vec256_tests
do
$vec256_exec --gtest_output=xml:test/test-reports/vec256/$vec256_exec.xml
done
popd
assert_git_not_dirty
fi
}
if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then
(cd test && python -c "import torch; print(torch.__config__.show())")
(cd test && python -c "import torch; print(torch.__config__.parallel_info())")
@@ -319,7 +379,7 @@ if [[ "${BUILD_ENVIRONMENT}" == *backward* ]]; then
test_backward_compatibility
# Do NOT add tests after bc check tests, see its comment.
elif [[ "${BUILD_ENVIRONMENT}" == *xla* || "${JOB_BASE_NAME}" == *xla* ]]; then
test_torchvision
install_torchvision
test_xla
elif [[ "${BUILD_ENVIRONMENT}" == *ge_config_legacy* || "${JOB_BASE_NAME}" == *ge_config_legacy* ]]; then
test_python_ge_config_legacy
@@ -332,25 +392,39 @@ elif [[ "${BUILD_ENVIRONMENT}" == *-test1 || "${JOB_BASE_NAME}" == *-test1 ]]; t
test_python_nn
test_cpp_extensions
elif [[ "${BUILD_ENVIRONMENT}" == *-test2 || "${JOB_BASE_NAME}" == *-test2 ]]; then
test_torchvision
install_torchvision
test_python_all_except_nn_and_cpp_extensions
test_aten
test_libtorch
test_custom_script_ops
test_custom_backend
test_torch_function_benchmark
elif [[ "${BUILD_ENVIRONMENT}" == *vulkan-linux* ]]; then
test_vulkan
elif [[ "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then
test_bazel
elif [[ "${BUILD_ENVIRONMENT}" == pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc5.4* ]]; then
# test cpp extension for xenial + cuda 9.2 + gcc 5.4 to make sure
# cpp extension can be built correctly under this old env
test_cpp_extensions
else
test_torchvision
install_torchvision
test_python_nn
test_python_all_except_nn_and_cpp_extensions
test_cpp_extensions
test_aten
test_vec256
test_libtorch
test_custom_script_ops
test_custom_backend
test_torch_function_benchmark
test_distributed
test_benchmarks
test_rpc
if [[ "$BUILD_ENVIRONMENT" == *coverage* ]]; then
pushd test
echo "Generating XML coverage report"
time python -mcoverage xml
popd
fi
fi


@@ -21,8 +21,8 @@ call %INSTALLER_DIR%\install_sccache.bat
call %INSTALLER_DIR%\install_miniconda3.bat
:: Install ninja
if "%REBUILD%"=="" ( pip install -q "ninja==1.9.0" )
:: Install ninja and other deps
if "%REBUILD%"=="" ( pip install -q "ninja==1.9.0" dataclasses )
git submodule sync --recursive
git submodule update --init --recursive
@@ -39,6 +39,7 @@ popd
if "%CUDA_VERSION%" == "9" goto cuda_build_9
if "%CUDA_VERSION%" == "10" goto cuda_build_10
if "%CUDA_VERSION%" == "11" goto cuda_build_11
goto cuda_build_end
:cuda_build_9
@@ -55,6 +56,13 @@ set CUDA_PATH_V10_1=%CUDA_PATH%
goto cuda_build_common
:cuda_build_11
set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0
set CUDA_PATH_V11_0=%CUDA_PATH%
goto cuda_build_common
:cuda_build_common
set CUDNN_LIB_DIR=%CUDA_PATH%\lib\x64


@@ -1,17 +1,18 @@
if "%CUDA_VERSION%" == "9" set CUDA_SUFFIX=cuda92
if "%CUDA_VERSION%" == "10" set CUDA_SUFFIX=cuda101
if "%CUDA_VERSION%" == "11" set CUDA_SUFFIX=cuda110
if "%CUDA_SUFFIX%" == "" (
echo unknown CUDA version, please set `CUDA_VERSION` to 9 or 10.
echo unknown CUDA version, please set `CUDA_VERSION` to 9, 10 or 11.
exit /b 1
)
if "%REBUILD%"=="" (
if "%BUILD_ENVIRONMENT%"=="" (
curl --retry 3 -k https://s3.amazonaws.com/ossci-windows/magma_2.5.2_%CUDA_SUFFIX%_%BUILD_TYPE%.7z --output %TMP_DIR_WIN%\magma_2.5.2_%CUDA_SUFFIX%_%BUILD_TYPE%.7z
curl --retry 3 -k https://s3.amazonaws.com/ossci-windows/magma_2.5.3_%CUDA_SUFFIX%_%BUILD_TYPE%.7z --output %TMP_DIR_WIN%\magma_2.5.3_%CUDA_SUFFIX%_%BUILD_TYPE%.7z
) else (
aws s3 cp s3://ossci-windows/magma_2.5.2_%CUDA_SUFFIX%_%BUILD_TYPE%.7z %TMP_DIR_WIN%\magma_2.5.2_%CUDA_SUFFIX%_%BUILD_TYPE%.7z --quiet
aws s3 cp s3://ossci-windows/magma_2.5.3_%CUDA_SUFFIX%_%BUILD_TYPE%.7z %TMP_DIR_WIN%\magma_2.5.3_%CUDA_SUFFIX%_%BUILD_TYPE%.7z --quiet
)
7z x -aoa %TMP_DIR_WIN%\magma_2.5.2_%CUDA_SUFFIX%_%BUILD_TYPE%.7z -o%TMP_DIR_WIN%\magma
7z x -aoa %TMP_DIR_WIN%\magma_2.5.3_%CUDA_SUFFIX%_%BUILD_TYPE%.7z -o%TMP_DIR_WIN%\magma
)
set MAGMA_HOME=%TMP_DIR_WIN%\magma


@@ -12,4 +12,11 @@ call %CONDA_PARENT_DIR%\Miniconda3\Scripts\activate.bat %CONDA_PARENT_DIR%\Minic
if "%REBUILD%"=="" (
call conda install -y -q python=%PYTHON_VERSION% numpy cffi pyyaml boto3
call conda install -y -q -c conda-forge cmake
call conda install -y -q -c rdonnelly libuv
)
:: Get installed libuv path
@echo off
set libuv_ROOT=%CONDA_PARENT_DIR%\Miniconda3\Library
@echo on
echo libuv_ROOT=%libuv_ROOT%


@@ -1,7 +1,5 @@
#!/usr/bin/env python
from __future__ import print_function
import subprocess
import os


@@ -22,7 +22,7 @@ call %CONDA_PARENT_DIR%\Miniconda3\Scripts\activate.bat %CONDA_PARENT_DIR%\Minic
if NOT "%BUILD_ENVIRONMENT%"=="" (
:: We have to pin the Python version to 3.6.7 until mkl supports Python 3.7
:: Numba is pinned to 0.44.0 to avoid https://github.com/numba/numba/issues/4352
call conda install -y -q python=3.6.7 numpy mkl cffi pyyaml boto3 protobuf numba==0.44.0
call conda install -y -q python=3.6.7 numpy mkl cffi pyyaml boto3 protobuf numba==0.44.0 scipy==1.5.0 typing_extensions dataclasses
if %errorlevel% neq 0 ( exit /b %errorlevel% )
call conda install -y -q -c conda-forge cmake
if %errorlevel% neq 0 ( exit /b %errorlevel% )
@@ -39,7 +39,7 @@ if %errorlevel% neq 0 ( exit /b %errorlevel% )
popd
:: The version is fixed to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136
pip install "ninja==1.9.0" future "hypothesis==4.53.2" "librosa>=0.6.2" psutil pillow unittest-xml-reporting "scipy==1.4.1"
pip install "ninja==1.10.0.post1" future "hypothesis==4.53.2" "librosa>=0.6.2" psutil pillow unittest-xml-reporting
if %errorlevel% neq 0 ( exit /b %errorlevel% )
:: No need to install faulthandler since we only test Python >= 3.6 on Windows
:: faulthandler is builtin since Python 3.3
@@ -48,6 +48,7 @@ set DISTUTILS_USE_SDK=1
if "%CUDA_VERSION%" == "9" goto cuda_build_9
if "%CUDA_VERSION%" == "10" goto cuda_build_10
if "%CUDA_VERSION%" == "11" goto cuda_build_11
goto cuda_build_end
:cuda_build_9
@@ -64,6 +65,13 @@ set CUDA_PATH_V10_1=%CUDA_PATH%
goto cuda_build_common
:cuda_build_11
set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0
set CUDA_PATH_V11_0=%CUDA_PATH%
goto cuda_build_common
:cuda_build_common
set CUDNN_LIB_DIR=%CUDA_PATH%\lib\x64
